LLM

A RocketRide preprocessor node that uses an LLM to split document text into semantically coherent chunks for downstream embedding and vector storage.

What it does

Splits incoming text into semantically coherent chunks by asking a connected LLM to detect context boundaries. Unlike rule-based splitters, the LLM keeps meaning intact at chunk boundaries: each chunk contains complete thoughts, respects document structure (paragraphs, sections, lists), and reproduces the input text exactly. The LLM also writes a short summary of the whole document, which is emitted as the last chunk.

The node accumulates incoming text per object, then chunks the whole document at close time. Documents larger than the LLM's context window are first pre-split locally with a hierarchical splitter (paragraphs, then lines, then sentences, then words, capped at 64 KB of text per piece) so every piece fits within the connected LLM's context and output limits; each piece is then chunked by the LLM.
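
The hierarchical pre-split described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: the function names, the regular expressions used to detect boundaries, and the hard-cut fallback for text with no whitespace are all assumptions made for the example. Only the boundary order (paragraphs, lines, sentences, words) and the 64 KB cap come from the description.

```python
import re

MAX_PIECE_BYTES = 64 * 1024  # 64 KB cap per piece, as described above

# Boundaries tried in order: paragraphs, lines, sentences, words.
# Zero-width lookbehinds keep separators attached, so no text is lost.
# These patterns are illustrative assumptions, not the node's own rules.
_BOUNDARIES = [
    re.compile(r"(?<=\n\n)"),
    re.compile(r"(?<=\n)"),
    re.compile(r"(?<=[.!?]\s)"),
    re.compile(r"(?<=\s)"),
]


def _size(text: str) -> int:
    """Size of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def pre_split(text: str, max_bytes: int = MAX_PIECE_BYTES, level: int = 0) -> list[str]:
    """Split text into pieces of at most max_bytes, preferring coarse boundaries."""
    if _size(text) <= max_bytes:
        return [text] if text else []
    if level >= len(_BOUNDARIES):
        # Assumed fallback: no boundary left, so cut by character count.
        # A UTF-8 character is at most 4 bytes, so this stays under the cap
        # whenever max_bytes >= 4.
        step = max(1, max_bytes // 4)
        return [text[i:i + step] for i in range(0, len(text), step)]

    parts = [p for p in _BOUNDARIES[level].split(text) if p]
    pieces: list[str] = []
    current = ""
    for part in parts:
        if _size(part) > max_bytes:
            # Too large on its own: flush, then retry at a finer boundary.
            if current:
                pieces.append(current)
                current = ""
            pieces.extend(pre_split(part, max_bytes, level + 1))
        elif _size(current) + _size(part) <= max_bytes:
            current += part
        else:
            pieces.append(current)
            current = part
    if current:
        pieces.append(current)
    return pieces
```

Because separators stay attached to the preceding part, concatenating the returned pieces reproduces the input exactly, which matches the requirement that chunks keep the exact text of the document.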

Table content arriving on the table lane bypasses the LLM entirely: tables are split locally at row boundaries to stay under the chunk token limit, and every resulting chunk is tagged with table metadata.
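
A rough sketch of row-boundary splitting is shown below. It assumes the table arrives as text with one row per line and uses a whitespace word count as a stand-in token counter; both are assumptions for illustration, as is the `split_table` name. Whether the node repeats header rows in each chunk is not stated here, so the sketch does not.

```python
from typing import Callable


def split_table(
    table_text: str,
    table_id: int,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[dict]:
    """Group whole rows into chunks that stay under max_tokens.

    count_tokens is a hypothetical stand-in; the real node uses the
    token counter reported by the connected LLM.
    """
    chunks: list[str] = []
    rows: list[str] = []
    used = 0
    for row in table_text.splitlines():
        cost = count_tokens(row)
        # Start a new chunk rather than cutting a row in half.
        if rows and used + cost > max_tokens:
            chunks.append("\n".join(rows))
            rows, used = [], 0
        rows.append(row)
        used += cost
    if rows:
        chunks.append("\n".join(rows))

    # Every chunk from the same table shares one tableId.
    return [
        {"page_content": chunk, "metadata": {"isTable": True, "tableId": table_id}}
        for chunk in chunks
    ]
```

In this sketch a single row that exceeds the limit on its own is kept whole as its own chunk; how the node handles that case is not specified above.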

Output chunks are emitted as documents in the LangChain document format (`page_content` plus `metadata`) via the langchain and langchain-core libraries. The node requires at least one LLM connection.

Configuration

Lanes

Lane in	Lane out	Description
`text`	`documents`	Split text into semantically chunked documents via LLM
`table`	`documents`	Split table content into documents with table metadata preserved

Fields

Field	Type	Description
`numberOfTokens`	number	Number of tokens per document chunk. Default 384.
`profile`	string	Preconfig profile to use. Default "default".

The node ships a single preconfig profile, `default`, which sets `numberOfTokens: 384`.

Connections

Channel	Required	Description
`llm`	yes (min 1)	LLM used to analyze and chunk the document

The node queries the connected LLM at runtime for its context length, output length, and token counter, and sizes its requests accordingly (reserving room for the chunking prompt itself plus a 500-token margin for JSON formatting). This initialization happens on the first document processed.
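
As a worked illustration of the sizing arithmetic, the sketch below computes how many tokens of document text could go into one request. The exact formula the node uses is not given above, so this is an assumption: the input must fit in the context window alongside the prompt and the margin, and because chunks reproduce the input text, it must also fit in the output limit minus the margin. The function and parameter names are hypothetical.

```python
JSON_MARGIN_TOKENS = 500  # margin for JSON formatting, per the text above


def input_budget(
    context_length: int,
    output_length: int,
    prompt_tokens: int,
    margin: int = JSON_MARGIN_TOKENS,
) -> int:
    """Tokens of document text that fit in one request (assumed formula)."""
    # Room left in the context window after the chunking prompt and margin.
    fits_context = context_length - prompt_tokens - margin
    # The response echoes the text as chunks, so it must fit the output limit.
    fits_output = output_length - margin
    return max(0, min(fits_context, fits_output))
```

For example, with an 8192-token context, a 4096-token output limit, and a 700-token prompt, `input_budget(8192, 4096, 700)` returns 3596, so under these assumptions the output limit is the binding constraint.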

Chunk metadata

Each emitted document carries the following metadata fields:

Field	Description
`chunkId`	Sequential index of the chunk within the document, starting at `0`
`isSummary`	`true` for the LLM-generated document summary chunk
`isTable`	`true` for chunks produced from `table`-lane content
`tableId`	Index of the source table; chunks split from one table share an id
`isDeleted`	Always `false` on emit

Chunks whose content is empty after trimming are dropped and never emitted.
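
To make the metadata concrete, the sketch below assembles documents for text-lane chunks using plain dictionaries shaped like the LangChain document format (`page_content` plus `metadata`). The `build_documents` helper is hypothetical. Two details are assumptions, since the description does not settle them: `chunkId` values are assigned after empty chunks are dropped, and `tableId` is set to `0` for non-table chunks.

```python
from typing import Optional


def build_documents(chunks: list[str], summary: Optional[str] = None) -> list[dict]:
    """Attach chunk metadata to text chunks, with the summary chunk last."""
    items = [(text, False) for text in chunks]
    if summary is not None:
        items.append((summary, True))  # summary is emitted as the last chunk

    docs: list[dict] = []
    for text, is_summary in items:
        # Chunks that are empty after trimming are dropped, never emitted.
        if not text.strip():
            continue
        docs.append(
            {
                "page_content": text,
                "metadata": {
                    "chunkId": len(docs),  # sequential, starting at 0 (assumed post-drop)
                    "isSummary": is_summary,
                    "isTable": False,
                    "tableId": 0,  # placeholder value for non-table chunks (assumption)
                    "isDeleted": False,  # always false on emit
                },
            }
        )
    return docs
```

Calling `build_documents(["Intro.", "   ", "Body."], summary="A summary.")` yields three documents with `chunkId` values 0, 1, and 2, and only the last one has `isSummary` set to `True`.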

Schema

Field	Type	Description	Default
`preprocessor_llm.numberOfTokens`	`number`	Number of tokens per document chunk. Should match the input size your embedding model expects.	`384`
`preprocessor_llm.profile`	`string`	Preconfig profile to use.	`"default"`

Dependencies

langchain
langchain-core
