Code

A RocketRide preprocessor node that splits source code into syntax-aware chunks for embedding, search, or LLM processing.

What it does

Accepts source code text and emits each syntactic construct (function, class, statement, block) as a separate document, so downstream nodes receive chunks that respect code boundaries rather than cutting mid-construct.

Uses tree-sitter with per-language grammar packages (tree-sitter-python, tree-sitter-javascript, tree-sitter-typescript, tree-sitter-c, tree-sitter-cpp). If the optional tree_sitter_languages package is installed, it is used as a fast path for grammar loading; otherwise, the individual per-language modules are loaded directly. Parsers are cached per language for the lifetime of the pipeline run.
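
The loading order and per-language caching described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the node's actual code: `resolve_grammar_module` and `ParserCache` are hypothetical names, and the `importer` and `build` parameters are injected so the sketch runs even when tree-sitter is not installed. The module names are the import names of the grammar packages listed under Dependencies.

```python
import importlib

# Import names of the per-language grammar packages.
GRAMMAR_MODULES = {
    "python": "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "c": "tree_sitter_c",
    "cpp": "tree_sitter_cpp",
}


def resolve_grammar_module(language, importer=importlib.import_module):
    """Return the module that would supply the grammar for `language`.

    Tries the optional bundle first, then the per-language package.
    """
    if language not in GRAMMAR_MODULES:
        raise ValueError(f"unsupported language: {language}")
    for name in ("tree_sitter_languages", GRAMMAR_MODULES[language]):
        try:
            return importer(name)
        except ImportError:
            continue  # fall through to the next candidate
    raise ImportError(f"no grammar package available for {language}")


class ParserCache:
    """Builds one parser per language and reuses it afterwards (hypothetical helper)."""

    def __init__(self, build):
        self._build = build      # callable: language -> parser object
        self._parsers = {}

    def get(self, language):
        if language not in self._parsers:
            self._parsers[language] = self._build(language)
        return self._parsers[language]
```

Creating one `ParserCache` per pipeline run would give the "cached for the lifetime of the run" behavior; how the real node scopes its cache is not specified here.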

By default (language: auto) the language is detected from the content of each file using weighted regex heuristics, not the filename extension. If detection cannot identify a supported language with sufficient confidence, the file is skipped and a warning is emitted; no documents are produced for that file.


Configuration

Lanes

| Lane in | Lane out | Description |
| --- | --- | --- |
| text | documents | Split source code into syntax-aware document chunks |

Both writeText and writeTable inputs are processed. Table input marks the resulting documents with isTable: true and an incrementing tableId. Each emitted document carries an incrementing chunkId starting at 0 per object.
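
A minimal sketch of this numbering, using hypothetical names (`ChunkDoc`, `ObjectNumbering`). It assumes one numbering instance is created per input object, that chunkId keeps counting across text and table inputs of the same object, and that all chunks from one table input share a tableId starting at 0. The source does not spell out those last two details, so treat them as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkDoc:
    text: str
    chunkId: int
    isTable: bool = False
    tableId: Optional[int] = None


class ObjectNumbering:
    """Hands out ids for the chunks of a single input object (hypothetical helper)."""

    def __init__(self):
        self._next_chunk = 0   # chunkId restarts at 0 for each new object
        self._next_table = 0   # assumption: tableId also counts from 0

    def from_text(self, chunks):
        return [self._doc(c) for c in chunks]

    def from_table(self, chunks):
        # Assumption: every chunk from one table input shares the same tableId.
        table_id = self._next_table
        self._next_table += 1
        return [self._doc(c, is_table=True, table_id=table_id) for c in chunks]

    def _doc(self, text, is_table=False, table_id=None):
        doc = ChunkDoc(text, self._next_chunk, is_table, table_id)
        self._next_chunk += 1
        return doc
```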

Fields

Configuration is profile-based. Select a profile to fix the parsing language, or leave the default auto profile to detect the language per file.

| Field | Type | Description |
| --- | --- | --- |
| strlen | number | Default 512. |
| language | string | Default "auto". |
| profile | string | Default "auto". |

Note: strlen is stored in the configuration and profile, but the current splitter does not apply it. Chunk boundaries come purely from syntax nodes, so a single large function or class becomes a single chunk regardless of strlen.

Profiles

| Profile | Language setting | Notes |
| --- | --- | --- |
| auto (default) | auto | Detects language from file content |
| c | c | C source and headers |
| cpp | cpp | C++ source |
| python | python | Python source |
| javascript | javascript | JavaScript source |
| typescript | typescript | TypeScript source |

The UI profile picker exposes auto, c (labelled "C/C++ source"), python, javascript, and typescript. The cpp profile is also present in preconfig for direct configuration use.


Language auto-detection

With language: auto, the node scores the text against weighted regex patterns for Python, TypeScript, JavaScript, C++, and C. Only the first 5 MB of the text is sampled.

Example signals used per language:

  • Python: def ...():, class ...:, decorator patterns, from X import, async def
  • TypeScript: interface, type X =, import type, enum, type annotations, export ... type
  • JavaScript: export default/const/function/class, require(, module.exports, arrow functions
  • C++: std::, template <, using namespace std, :: resolution
  • C: include guards (#ifndef/#define/#endif), typedef struct, prototypes ending with ;
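
To make the idea of weighted regex scoring concrete, here is a small sketch. The patterns and weights are illustrative guesses modeled on the signals listed above, not the node's real table, and `score_languages` is a hypothetical name. The sample limit is applied to characters rather than bytes for simplicity.

```python
import re

# Only the start of the text is scored. The node's limit is stated as 5 MB;
# this sketch slices characters, which is an approximation.
SAMPLE_LIMIT = 5 * 1024 * 1024

# (pattern, weight) pairs per language. All values are hypothetical.
SIGNALS = {
    "python": [
        (r"^[ \t]*def\s+\w+\s*\(.*\)\s*:", 3),
        (r"^[ \t]*async\s+def\s", 2),
        (r"^[ \t]*class\s+\w+.*:", 2),
        (r"^[ \t]*from\s+\S+\s+import\s", 2),
        (r"^[ \t]*@\w+", 1),
    ],
    "typescript": [
        (r"^[ \t]*interface\s+\w+", 3),
        (r"^[ \t]*type\s+\w+\s*=", 3),
        (r"^[ \t]*import\s+type\s", 3),
    ],
    "javascript": [
        (r"^[ \t]*export\s+(?:default|const|function|class)\b", 2),
        (r"\brequire\(", 2),
        (r"\bmodule\.exports\b", 3),
        (r"=>", 1),
    ],
    "cpp": [
        (r"\bstd::", 3),
        (r"\btemplate\s*<", 3),
        (r"\busing\s+namespace\s+std\b", 3),
    ],
    "c": [
        (r"^[ \t]*#ifndef\s+\w+", 2),
        (r"^[ \t]*#endif\b", 1),
        (r"\btypedef\s+struct\b", 3),
    ],
}


def score_languages(text):
    """Return {language: score}, where each match adds its pattern's weight."""
    sample = text[:SAMPLE_LIMIT]
    scores = {}
    for language, patterns in SIGNALS.items():
        total = 0
        for pattern, weight in patterns:
            total += weight * len(re.findall(pattern, sample, flags=re.MULTILINE))
        scores[language] = total
    return scores
```

For example, a file containing `from os import path` and a `def main():` line scores only on the Python patterns in this sketch, while a file with `using namespace std;` and a `template <...>` declaration scores only on the C++ ones.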

The winner must score at least 3 and lead the runner-up by at least 2; otherwise, detection fails and the file is skipped with a warning. Tie-break rules for C vs C++:

  • extern "C" present with no C++-only markers (std::, templates, namespaces, ::) resolves to C.
  • A near-tie where the top two candidates are C and C++ resolves to C (conservative, matches typical header-like code).
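
The thresholds and tie-break rules can be sketched as a single decision function. `pick_language` is a hypothetical name, and several details are assumptions: the order in which the rules are checked, the reading of "near-tie" as a lead smaller than 2, and the requirement that a near-tie winner still reach the minimum score.

```python
import re

MIN_SCORE = 3  # the winner must score at least this much
MIN_LEAD = 2   # and lead the runner-up by at least this much

# Markers treated as C++-only in this sketch (assumed patterns).
CPP_ONLY = re.compile(r"::|\btemplate\s*<|\bnamespace\b")
EXTERN_C = re.compile(r'\bextern\s+"C"')


def pick_language(scores, text):
    """Return the detected language, or None when detection fails."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    lead = best_score - second_score

    # Rule 1: extern "C" with no C++-only markers resolves to C.
    if EXTERN_C.search(text) and not CPP_ONLY.search(text):
        return "c"

    # Rule 2: a near-tie between C and C++ resolves to C.
    if {best, second} == {"c", "cpp"} and lead < MIN_LEAD and best_score >= MIN_SCORE:
        return "c"

    # General rule: a sufficient score and a sufficient lead.
    if best_score >= MIN_SCORE and lead >= MIN_LEAD:
        return best

    return None  # the caller would skip the file and emit a warning
```

A `None` result corresponds to the case where the file is skipped with a warning.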

If detection regularly fails on valid source (very short snippets, unusual dialects), pin the language with an explicit profile instead of auto.


How chunks are extracted

The tree-sitter syntax tree is walked recursively. The following node types become chunks:

  • Python: function_definition, class_definition, decorated_definition, plus module-level import_statement, import_from_statement, assignment, and expression_statement.
  • JavaScript / TypeScript: function_declaration, function_expression, arrow_function, class_declaration, method_definition, and function values in minified patterns (const f = () => {...}, object pair values that are functions).
  • C / C++: function_definition, class_specifier, struct_specifier, entire extern "C" { ... } linkage blocks, and top-level declarations in headers (prototypes, typedefs, field declarations). Preprocessor directives (#include, #define, etc.) are skipped.

Because the walk recurses into matched nodes, nested constructs produce overlapping chunks: a class is emitted as one chunk and each of its methods is also emitted as its own chunk. This gives both whole-construct and per-member granularity for downstream retrieval.
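
The overlap can be demonstrated with a short sketch. To stay runnable with only the standard library, it swaps in Python's built-in `ast` module for tree-sitter, so the node types are `ast.FunctionDef` and `ast.ClassDef` rather than tree-sitter's `function_definition` and `class_definition`, and module-level imports and assignments are left out. `extract_chunks` is a hypothetical name.

```python
import ast

# Stand-ins for the tree-sitter node types that become chunks.
CHUNK_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def extract_chunks(source):
    """Walk the syntax tree and emit the source text of each matching node."""
    tree = ast.parse(source)
    chunks = []

    def walk(node):
        if isinstance(node, CHUNK_TYPES):
            chunks.append(ast.get_source_segment(source, node))
        # Keep recursing into matched nodes, so nested constructs
        # are emitted in addition to their parent.
        for child in ast.iter_child_nodes(node):
            walk(child)

    walk(tree)
    return chunks


SAMPLE = '''class Greeter:
    def hello(self):
        return "hi"

    def bye(self):
        return "bye"
'''

chunks = extract_chunks(SAMPLE)
```

Here `chunks` holds three entries: the whole `Greeter` class, then `hello`, then `bye`. The text of each method also appears inside the class chunk, which is the overlap described above.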


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| code.splitter.auto.language | string | const: "auto" | |
| code.splitter.c.language | string | const: "c" | |
| code.splitter.cpp.language | string | const: "cpp" | |
| code.splitter.javascript.language | string | const: "javascript" | |
| code.splitter.profile | string | Code splitter profile | "auto" |
| code.splitter.python.language | string | const: "python" | |
| code.splitter.strlen | number | Maximum string length | 512 |
| code.splitter.typescript.language | string | const: "typescript" | |

Dependencies

  • tree-sitter
  • tree-sitter-c
  • tree-sitter-cpp
  • tree-sitter-javascript
  • tree-sitter-python
  • tree-sitter-typescript