General Text

A RocketRide preprocessor node ("General Text") that splits incoming text into chunks for downstream embedding or LLM processing.

What it does

Splits text into chunks using LangChain text splitters (langchain_text_splitters). Choose a profile tuned for the content type: general prose, markdown, LaTeX, sentence-based NLP, or any custom splitter class the library exports. No LLM is required.

The splitter class is loaded dynamically by name from langchain_text_splitters, so the custom profile can target any class that library exports. Constructor kwargs are filtered against the target class signature, preventing "unexpected keyword argument" errors across splitters. Chunk overlap is fixed at 0.
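
The lookup and kwarg filtering described above can be sketched as follows. This is a minimal illustration, not the node's actual code: the helper names `load_splitter_class` and `filter_kwargs` are hypothetical, `DemoSplitter` is a stand-in so the example runs without LangChain installed, and the handling of constructors that declare `**kwargs` is an assumption.

```python
import importlib
import inspect


def load_splitter_class(class_name, module_name="langchain_text_splitters"):
    """Look up a class by name on a module (hypothetical helper)."""
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name, None)
    if not inspect.isclass(cls):
        # Error text mirrors the message quoted in the Custom section.
        raise ValueError(f"Splitter '{class_name}' not found in LangChain")
    return cls


def filter_kwargs(cls, kwargs):
    """Keep only the keyword arguments that the constructor declares."""
    params = inspect.signature(cls).parameters
    # Assumption: a constructor that declares **kwargs accepts everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in params}


class DemoSplitter:
    """Hypothetical stand-in for a splitter class with a narrow constructor."""

    def __init__(self, chunk_size=512, separator="\n"):
        self.chunk_size = chunk_size
        self.separator = separator


# A shared option set; only chunk_size is meaningful to DemoSplitter.
options = {"chunk_size": 256, "separators": ["\n\n", "\n"], "model": "en_core_web_sm"}
accepted = filter_kwargs(DemoSplitter, options)
splitter = DemoSplitter(**accepted)  # no "unexpected keyword argument" error
```

With LangChain installed, `load_splitter_class("CharacterTextSplitter")` would resolve the class from `langchain_text_splitters` in the same way.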

Incoming text is accumulated per file and split once when the file closes. Each incoming table is split immediately as its own unit. Every chunk is emitted as a document with a sequential chunkId (reset per file); tables additionally carry a tableId.
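
A rough sketch of this per-file flow is shown below. Everything in it is illustrative: `FileChunker`, its method names, and the `fixed_width_split` stand-in are hypothetical, and the sketch assumes that text and table chunks share one chunkId counter and that tableId is a per-file sequence, neither of which is stated above.

```python
from itertools import count


def fixed_width_split(text, size=512):
    """Hypothetical stand-in for a LangChain splitter's split_text()."""
    return [text[i:i + size] for i in range(0, len(text), size)]


class FileChunker:
    """Buffers text per file; splits tables immediately (illustrative only)."""

    def __init__(self, split=fixed_width_split):
        self.split = split
        self.open_file()

    def open_file(self):
        # Counters restart for every file, so chunkId begins at 0 again.
        self._buffer = []
        self._chunk_ids = count()
        self._table_ids = count()

    def write_text(self, text):
        # Text is only accumulated here; nothing is emitted yet.
        self._buffer.append(text)

    def write_table(self, table_text):
        # Each table is split right away as its own unit.
        table_id = next(self._table_ids)
        return [
            {"chunkId": next(self._chunk_ids), "tableId": table_id, "text": piece}
            for piece in self.split(table_text)
        ]

    def close_file(self):
        # The accumulated text is split once, when the file closes.
        docs = [
            {"chunkId": next(self._chunk_ids), "text": piece}
            for piece in self.split("".join(self._buffer))
        ]
        self.open_file()
        return docs


chunker = FileChunker(lambda text: fixed_width_split(text, 4))
chunker.write_text("abcd")
chunker.write_text("efgh12")
table_docs = chunker.write_table("r1c1r1c2")
text_docs = chunker.close_file()
```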

By default chunks are measured by string length with a maximum of 512 characters. Token mode uses a conservative byte-length estimator instead of a real tokenizer (no transformers model is loaded). See Token mode below.


Configuration

Lanes

| Lane in | Lane out | Description |
| --- | --- | --- |
| text | documents | Split plain text into document chunks |
| table | documents | Split table content into document chunks |

Fields

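| Field | Type | Description |
| --- | --- | --- |
| strlen | number | Maximum chunk size in characters (strlen mode). Default 512. |
| tokens | number | Maximum chunk size in estimated tokens (tokens mode). Default 512. |
| mode | string | Unit used to measure chunk size: strlen or tokens. Default "strlen". |
| splitter | string | Splitter class name (custom profile). Default "RecursiveCharacterTextSplitter". |
| separators | string | Split separators (recursive profile). Default '\n\n', '\n', ' ', ''. |
| separator | string | Split separator (character profile). Default "\n". |
| model | string | spaCy pipeline model (spacy profile). Default "en_core_web_sm". |
| profile | string | Text splitter profile. Default "default". |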

Separator syntax

separators and separator are parsed as comma-separated Python string literals (e.g. '\n\n', '\n', ' ', ''). Escape sequences such as \n are interpreted. Every element must be a string. The character profile accepts exactly one element. An invalid format raises an error at startup.
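
One way to parse this syntax is with `ast.literal_eval`, as sketched below. The function name `parse_separators`, the `exactly_one` flag, and the error messages are hypothetical; the node's actual parser may differ.

```python
import ast


def parse_separators(raw, exactly_one=False):
    """Parse comma-separated Python string literals into a list of strings."""
    try:
        # Wrapping in brackets lets literal_eval read the items as a list
        # and interpret escape sequences such as \n inside each literal.
        parsed = ast.literal_eval(f"[{raw}]")
    except (SyntaxError, ValueError) as exc:
        raise ValueError(f"Invalid separator format: {raw!r}") from exc
    if not parsed or not all(isinstance(item, str) for item in parsed):
        raise ValueError(f"Every separator must be a string: {raw!r}")
    if exactly_one and len(parsed) != 1:
        raise ValueError(f"Expected exactly one separator: {raw!r}")
    return list(parsed)


# Raw strings stand in for the text typed into the config field.
recursive = parse_separators(r"'\n\n', '\n', ' ', ''")
character = parse_separators(r'"\n"', exactly_one=True)
```

Here `recursive` holds four separators (blank line, newline, space, empty string) and `character` holds a single newline.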

Advanced token-mode options

These keys are read from the node config but are not exposed in the UI shape:

| Field | Type / Default | Description |
| --- | --- | --- |
| bytes_per_token | float, 3.0 | Bytes-per-token ratio used by the estimator. Lower values estimate more tokens (safer). |
| max_model_tokens | int, unset | Hard cap for the model's max token context. When set, caps the chunk size and enables the post-split safety net. |
| token_safety_margin | int, 32 | Subtracted from max_model_tokens to leave headroom for special tokens. |

Profiles

| Profile | Splitter | Best for |
| --- | --- | --- |
| default (default) | RecursiveCharacterTextSplitter | General-purpose prose |
| recursive | RecursiveCharacterTextSplitter | General-purpose prose with custom separators |
| character | CharacterTextSplitter | Simple splitting on a fixed separator |
| markdown | MarkdownTextSplitter | Structured Markdown documents (separators kept in chunks) |
| latex | LatexTextSplitter | Scientific and academic documents (separators kept in chunks) |
| nltk | NLTKTextSplitter | Sentence-based splitting |
| spacy | SpacyTextSplitter | NLP-based sentence splitting (English, German, French, Spanish models) |
| custom | RecursiveCharacterTextSplitter | User-defined splitter class from langchain_text_splitters |

NLTK

Dependencies (nltk) are installed lazily the first time this profile is used. The punkt tokenizer data (and punkt_tab, required by NLTK 3.9+) is downloaded automatically if missing. Pass a language key in the node config (e.g. "english", "spanish") to forward it to the splitter.

spaCy

Dependencies (spacy) are installed lazily the first time this profile is used. The configured pipeline model (default en_core_web_sm) is downloaded automatically if not already installed. Small, medium, and large models are available for English, German, French, and Spanish; an English transformer model (en_core_web_trf) is also supported (best accuracy, slower).

Custom

Set splitter to the class name of any splitter exported by langchain_text_splitters. An unknown class name raises Splitter '<name>' not found in LangChain at startup. Only kwargs accepted by the chosen class's constructor are forwarded; unrecognized kwargs are silently dropped.


Token mode

With mode: tokens, chunk size is measured by an estimated token count. No tokenizer or transformers model is loaded. The estimate is the UTF-8 byte length of the text divided by bytes_per_token (default 3.0), rounded up. This is conservative by design, so real token counts should come in at or under the estimate.
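
The estimate described above amounts to a single expression. The sketch below uses a hypothetical function name, `estimate_tokens`, and is meant only to make the arithmetic concrete.

```python
import math


def estimate_tokens(text, bytes_per_token=3.0):
    """Estimated tokens = UTF-8 byte length / bytes_per_token, rounded up."""
    return math.ceil(len(text.encode("utf-8")) / bytes_per_token)


ascii_estimate = estimate_tokens("hello world")  # 11 bytes / 3.0 -> 4
accent_estimate = estimate_tokens("héllo")       # 6 bytes / 3.0 -> 2
stricter = estimate_tokens("hello world", bytes_per_token=2.0)  # 11 / 2.0 -> 6
```

Because the estimate counts bytes rather than characters, non-ASCII text is weighted by its encoded size, and lowering bytes_per_token raises the estimate.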

When max_model_tokens is set:

  • The effective token budget is max_model_tokens - token_safety_margin.
  • The requested chunk size (tokens) is capped to that budget.
  • After splitting, any chunk that still exceeds the budget is force-subdivided by proportional character cuts until every piece fits.

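This guarantees that no emitted chunk exceeds the budget as measured by the estimator, without loading an exact tokenizer. Real token counts stay within the model's context only as long as the estimate is conservative for the text being split.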
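
The three steps above can be sketched as follows. The function names are hypothetical and the cut rule (keep the fraction budget / estimate of the characters, then re-check both pieces) is one plausible reading of "proportional character cuts", not necessarily the node's exact procedure.

```python
import math


def estimate_tokens(text, bytes_per_token=3.0):
    """UTF-8 byte length divided by bytes_per_token, rounded up."""
    return math.ceil(len(text.encode("utf-8")) / bytes_per_token)


def effective_budget(max_model_tokens, token_safety_margin=32):
    """Token budget left after reserving headroom for special tokens."""
    return max_model_tokens - token_safety_margin


def enforce_budget(chunks, budget, bytes_per_token=3.0):
    """Subdivide any chunk whose estimate exceeds budget, preserving order."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    result = []
    pending = list(reversed(chunks))  # stack; the next chunk is at the end
    while pending:
        chunk = pending.pop()
        estimate = estimate_tokens(chunk, bytes_per_token)
        # A single character cannot be cut further, so it is emitted as-is.
        if estimate <= budget or len(chunk) <= 1:
            result.append(chunk)
            continue
        # Proportional cut, clamped so both pieces are strictly shorter.
        cut = int(len(chunk) * budget / estimate)
        cut = max(1, min(len(chunk) - 1, cut))
        pending.append(chunk[cut:])  # re-checked after the head piece
        pending.append(chunk[:cut])
    return result


budget = effective_budget(max_model_tokens=512)  # 512 - 32 = 480
chunk_tokens = min(512, budget)                  # requested size capped to 480
oversized = "a" * 3000                           # estimated at 1000 tokens
pieces = enforce_budget([oversized], budget)
```

Each piece is re-estimated after a cut, so multi-byte text that is still too large after one proportional cut is simply cut again.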


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| langchain.splitter.character.separator | string | Split separator | "\"\\n\"" |
| langchain.splitter.character.splitter | string | | const: "CharacterTextSplitter" |
| langchain.splitter.custom.splitter | string | Splitter class name | const: "RecursiveCharacterTextSplitter" |
| langchain.splitter.default.splitter | string | | const: "RecursiveCharacterTextSplitter" |
| langchain.splitter.latex.splitter | string | | const: "LatexTextSplitter" |
| langchain.splitter.markdown.splitter | string | | const: "MarkdownTextSplitter" |
| langchain.splitter.mode | string | Split by | "strlen" |
| langchain.splitter.nltk.splitter | string | | const: "NLTKTextSplitter" |
| langchain.splitter.profile | string | Text splitter | "default" |
| langchain.splitter.recursive.separators | string | Split separators | "'\\n\\n', '\\n', ' ', ''" |
| langchain.splitter.recursive.splitter | string | | const: "RecursiveCharacterTextSplitter" |
| langchain.splitter.spacy.model | string | Model | "en_core_web_sm" |
| langchain.splitter.spacy.splitter | string | | const: "SpacyTextSplitter" |
| langchain.splitter.strlen | number | String length | 512 |
| langchain.splitter.tokens | number | Number of tokens | 512 |

Dependencies

  • langchain
  • langchain-text-splitters
  • langchain-core
  • accelerate
  • transformers
  • tokenizers
  • huggingface-hub
  • pyyaml
  • filelock
  • regex
  • tqdm
  • safetensors