LLM
A RocketRide preprocessor node that uses an LLM to split document text into semantically coherent chunks for downstream vector embedding storage.
What it does
Splits incoming text into semantically coherent chunks by asking a connected LLM to detect context boundaries. Unlike rule-based splitters, the LLM preserves meaning across chunk boundaries: each chunk contains complete thoughts, respects document structure (paragraphs, sections, lists), and keeps the exact text of the input document. The LLM also writes a short summary of the whole document, which is emitted as the last chunk.
The node accumulates incoming text per object, then chunks the whole document at close time. Documents larger than the LLM's context window are first pre-split locally with a hierarchical splitter (paragraphs, then lines, then sentences, then words, capped at 64 KB of text per piece) so every piece fits within the connected LLM's context and output limits; each piece is then chunked by the LLM.
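The hierarchical pre-split described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual code: the function name `pre_split`, the regular expressions, and the greedy merging strategy are all assumptions. Only the separator order (paragraphs, lines, sentences, words) and the 64 KB cap come from the description.

```python
import re

MAX_PIECE_BYTES = 64 * 1024  # 64 KB cap per piece, as described above

# Assumed separator patterns, tried in order: paragraphs, lines, sentences, words.
SEPARATOR_PATTERNS = [r"\n\s*\n", r"\n", r"(?<=[.!?])\s+", r"\s+"]


def _size(text):
    """Size of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def pre_split(text, max_bytes=MAX_PIECE_BYTES, level=0):
    """Split text into pieces of at most max_bytes, preferring coarse boundaries.

    Separators are kept, so joining the pieces restores the input exactly.
    Assumes max_bytes >= 4 so the last-resort character cut stays in bounds.
    """
    if _size(text) <= max_bytes:
        return [text] if text else []
    if level >= len(SEPARATOR_PATTERNS):
        # No separator left: hard-cut by characters (a UTF-8 char is <= 4 bytes).
        step = max(1, max_bytes // 4)
        return [text[i:i + step] for i in range(0, len(text), step)]

    # The capturing group makes re.split return the separators as well.
    parts = re.split("(" + SEPARATOR_PATTERNS[level] + ")", text)
    pieces, current = [], ""
    for part in parts:
        if not part:
            continue
        if _size(current) + _size(part) <= max_bytes:
            current += part  # greedily pack parts into the current piece
            continue
        if current:
            pieces.append(current)
            current = ""
        if _size(part) <= max_bytes:
            current = part
        else:
            # Part is still too large: retry with the next, finer separator.
            pieces.extend(pre_split(part, max_bytes, level + 1))
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece would then be sent to the LLM for semantic chunking. Keeping the separators inside the pieces is a design choice made here so that the exact input text is preserved.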
Table content arriving on the table lane bypasses the LLM entirely: tables are split locally at row boundaries to stay under the chunk token limit, and every resulting chunk is tagged with table metadata.
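As a rough sketch of row-boundary splitting, the example below groups table rows greedily until the next row would exceed the token limit. The names `split_table` and `count_tokens` are hypothetical, the whitespace-based counter is only a stand-in for the connected LLM's token counter, and treating each text line as one row is an assumption about the table format.

```python
def count_tokens(text):
    """Hypothetical stand-in for the LLM's token counter: counts words."""
    return len(text.split())


def split_table(table_text, table_id, max_tokens=384, counter=count_tokens):
    """Split a table at row boundaries so each chunk stays under max_tokens.

    Assumes one table row per line. A single row larger than max_tokens
    is emitted as its own chunk rather than being cut mid-row.
    """
    chunks, current, current_tokens = [], [], 0
    for row in table_text.splitlines():
        row_tokens = counter(row)
        if current and current_tokens + row_tokens > max_tokens:
            # Adding this row would exceed the limit: close the current chunk.
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(row)
        current_tokens += row_tokens
    if current:
        chunks.append("\n".join(current))

    # Every chunk from the same table shares one tableId.
    return [
        {"page_content": chunk, "metadata": {"isTable": True, "tableId": table_id}}
        for chunk in chunks
        if chunk.strip()
    ]
```

For example, `split_table("a b c\nd e f\ng h i\nj k l", table_id=0, max_tokens=6)` yields two chunks of two rows each, both tagged with `tableId` 0.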
Output chunks are emitted as documents using the LangChain document format (page_content plus metadata) via the langchain and langchain-core libraries. Requires an LLM connection (min 1).
Configuration
Lanes
| Lane in | Lane out | Description |
|---|---|---|
| text | documents | Split text into semantically chunked documents via LLM |
| table | documents | Split table content into documents with table metadata preserved |
Fields
| Field | Type | Description |
|---|---|---|
| numberOfTokens | number | Number of tokens per document chunk. Default: 384. |
| profile | string | Name of the preconfig profile to use. Default: "default". |
The node ships a single preconfig profile, default, which sets numberOfTokens: 384.
Connections
| Channel | Required | Description |
|---|---|---|
| llm | yes (min 1) | LLM used to analyze and chunk the document |
The node queries the connected LLM at runtime for its context length, output length, and token counter, and sizes its requests accordingly (reserving room for the chunking prompt itself plus a 500-token margin for JSON formatting). This initialization happens on the first document processed.
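One way to picture the sizing step is the budget calculation below. Only the 500-token margin and the three queried values come from the description. The formula itself, the function name `input_budget`, and the assumption that the echoed chunks must also fit within the output limit are illustrative guesses, not the node's confirmed logic.

```python
JSON_MARGIN_TOKENS = 500  # margin for JSON formatting, as described above


def input_budget(context_length, output_length, prompt_tokens,
                 margin=JSON_MARGIN_TOKENS):
    """Estimate how many tokens of document text fit in one LLM request.

    Assumption: the document text shares the context window with the
    chunking prompt, and the chunks returned by the LLM (which repeat the
    input text) must fit within the output limit. Both get the same margin.
    """
    by_context = context_length - prompt_tokens - margin
    by_output = output_length - margin
    return max(0, min(by_context, by_output))
```

With a hypothetical model offering an 8192-token context, a 4096-token output limit, and a 700-token chunking prompt, this sketch gives a budget of 3596 tokens of text per request, because the output limit is the tighter constraint.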
Chunk metadata
Each emitted document carries the following metadata fields:
| Field | Description |
|---|---|
| chunkId | Sequential index of the chunk within the document, starting at 0 |
| isSummary | true for the LLM-generated document summary chunk |
| isTable | true for chunks produced from table-lane content |
| tableId | Index of the source table; chunks split from one table share an id |
| isDeleted | Always false on emit |
Chunks whose content is empty after trimming are dropped and never emitted.
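The metadata rules above might be applied along these lines. The sketch uses plain dictionaries that mirror the page_content plus metadata shape. The function name `to_documents` is hypothetical, and two details are assumptions: that chunkId numbering skips dropped chunks, and that text-lane chunks carry isTable set to false with no tableId.

```python
def to_documents(chunks, summary):
    """Build document dicts for text chunks followed by the summary chunk.

    Chunks that are empty after trimming are dropped. Assumption: chunkId
    counts only the chunks that are actually emitted.
    """
    candidates = [(text, False) for text in chunks] + [(summary, True)]
    docs = []
    for text, is_summary in candidates:
        if not text.strip():
            continue  # empty after trimming: never emitted
        docs.append({
            "page_content": text,
            "metadata": {
                "chunkId": len(docs),  # sequential, starting at 0
                "isSummary": is_summary,
                "isTable": False,      # text-lane content (assumed value)
                "isDeleted": False,    # always false on emit
            },
        })
    return docs
```

With langchain-core installed, each dictionary could be turned into a LangChain document with `Document(page_content=..., metadata=...)` from `langchain_core.documents`.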
Schema
| Field | Type | Description | Default |
|---|---|---|---|
| preprocessor_llm.numberOfTokens | number | Number of tokens per document chunk. Needs to match your embedding model. | 384 |
| preprocessor_llm.profile | string | | "default" |
Dependencies
- langchain
- langchain-core