Data Extractor
A RocketRide filter node that uses an LLM to pull a user-defined set of structured fields out of unstructured text or tables.
What it does
Reads incoming text or table chunks and, for each chunk, builds a prompt listing every configured column name, type, and optional default value, then calls the connected LLM (with expectJson: true) to extract values. The model is instructed to infer values even when column names don't appear verbatim in the source, and to always return a JSON array (empty if nothing is found). After the first chunk, the running table is passed back as context so subsequent chunks merge into it: duplicates are combined and empty fields are filled in from newer chunks.
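The prompt assembly described above can be sketched roughly as follows. This is an illustration only: the function name build_prompt, the prompt wording, and the sample fields are assumptions, not the node's actual code or prompt text.

```python
import json


def build_prompt(fields, chunk, running_table=None):
    """Assemble an extraction prompt for one chunk (hypothetical wording)."""
    lines = ["Extract the following columns from the text below."]
    for field in fields:
        spec = f"- {field['column']} ({field['type']})"
        if field.get("defval"):
            # The default value is optional and only listed when set.
            spec += f", default: {field['defval']}"
        lines.append(spec)
    lines.append("Infer values even if the column names do not appear verbatim.")
    lines.append("Always return a JSON array; return [] if nothing is found.")
    if running_table:
        # After the first chunk, the table so far is passed back as context.
        lines.append("Merge new rows into this existing table:")
        lines.append(json.dumps(running_table))
    lines.append("Text:")
    lines.append(chunk)
    return "\n".join(lines)


# Hypothetical field list and chunk, for illustration.
fields = [
    {"column": "invoice_number", "type": "text", "defval": ""},
    {"column": "total", "type": "decimal", "defval": "0"},
]
prompt = build_prompt(
    fields,
    "Invoice INV-7 for 120.50 EUR",
    running_table=[{"invoice_number": "INV-7", "total": None}],
)
print(prompt)
```

Passing the running table back on every call is what lets later chunks fill in rows started by earlier ones, at the cost of a prompt that grows with the table.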
Intermediate LLM answers are intercepted by the node (via preventDefault) and never passed downstream. Only the final consolidated result is emitted when the object closes. Fields with an empty column name or type are skipped at pipeline start with a warning and excluded from all extraction.
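As a minimal sketch of the field check described above, the snippet below drops entries with an empty column name or type and logs a warning for each. The helper name usable_fields, the logger name, and the choice to treat whitespace-only values as empty are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("data_extractor_sketch")  # hypothetical logger name


def usable_fields(fields):
    """Return only the fields that have both a column name and a type."""
    kept = []
    for index, field in enumerate(fields):
        # Assumption: whitespace-only values count as empty.
        column = (field.get("column") or "").strip()
        ftype = (field.get("type") or "").strip()
        if not column or not ftype:
            logger.warning("Skipping field %d: empty column name or type", index)
            continue
        kept.append({**field, "column": column, "type": ftype})
    return kept


configured = [
    {"column": "email", "type": "email", "defval": ""},
    {"column": "", "type": "text", "defval": ""},      # skipped: no column name
    {"column": "notes", "type": "", "defval": "n/a"},  # skipped: no type
]
active = usable_fields(configured)
```

Only the entries in active would then take part in prompt building and extraction.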
No third-party Python dependencies (requirements.txt is empty).
Connections
| Connection | Required | Description |
|---|---|---|
| llm | yes (min 1) | LLM used to extract field values |
Configuration
Lanes
| Lane in | Lane out | Description |
|---|---|---|
| text | answers | Extract fields from text, emit as JSON |
| text | documents | Extract fields from text, emit one document per row |
| table | answers | Extract/transform fields from a table, emit as JSON |
| table | documents | Extract/transform fields from a table, emit one document per row |
On close, the answers lane (if connected) receives one JSON answer containing the full table. The documents lane (if connected) receives one document per extracted row, with the row serialized as JSON in the document content.
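The difference between the two output lanes can be illustrated with a small sketch. The function emit_on_close, the boolean flags, and the document shape (a dict with a content key) are hypothetical stand-ins for whatever the pipeline actually passes downstream.

```python
import json


def emit_on_close(table, answers_connected, documents_connected):
    """Collect what each connected lane would receive for one object."""
    emitted = {"answers": [], "documents": []}
    if answers_connected:
        # One JSON answer containing the full table.
        emitted["answers"].append(json.dumps(table))
    if documents_connected:
        # One document per extracted row, with the row serialized as JSON.
        for row in table:
            emitted["documents"].append({"content": json.dumps(row)})
    return emitted


table = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]
result = emit_on_close(table, answers_connected=True, documents_connected=True)
```

With this sample table, the answers lane gets a single item and the documents lane gets two, one for each row.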
Fields
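The node takes a list of fields to extract (fields, 1 to 32 entries). Each entry has a column, type, and defval. The table also lists fields (the list itself) and profile (the profile selector):
| Field | Type | Description |
|---|---|---|
| column | string | Name of the column. Default "column". |
| type | string | Data type of the column (one of the supported types below). Default "text". |
| defval | string | Default value for the column. Default empty. |
| fields | array | The list of field entries. |
| profile | string | Configuration profile. Default "default". |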
Supported types: text (Text), decimal (Number), int (Integer), date (Date), time (Time), datetime (DateTime), timestamp (Timestamp), binary (Binary), json (JSON), html (HTML), url (URL), email (Email), phone (Phone), ipv4 (IPv4), ipv6 (IPv6), uuid (UUID), guid (GUID)
A single configuration profile exists (default); it carries the field list above. The profile selector field (extract.profile) is hidden in the UI.
Behaviour
The LLM infers field values even when the source text does not use the exact column names, reasoning from document context about what each column likely contains. Chunks are merged progressively, with gaps left by earlier chunks filled in from later ones, before the final result is emitted. The accumulated table is reset for every new object, so extraction state never leaks between objects in the pipeline.
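In the node, merging is delegated to the LLM, which receives the running table as context. The deterministic sketch below only illustrates the intended outcome: duplicate rows are combined, empty fields are filled from later chunks, and state is reset per object. The class TableAccumulator and the idea of identifying duplicates by a single key column are assumptions for illustration.

```python
class TableAccumulator:
    """Illustrative per-object accumulator (not the node's implementation)."""

    def __init__(self, key):
        self.key = key   # hypothetical column used to detect duplicate rows
        self.rows = {}

    def reset(self):
        # Called for every new object so state never leaks between objects.
        self.rows = {}

    def merge(self, new_rows):
        for row in new_rows:
            existing = self.rows.setdefault(row[self.key], {})
            for column, value in row.items():
                # Only fill fields that are still missing or empty.
                if existing.get(column) in (None, ""):
                    existing[column] = value

    def table(self):
        return list(self.rows.values())


acc = TableAccumulator(key="name")
acc.merge([{"name": "Ada", "email": ""}])                  # chunk 1
acc.merge([{"name": "Ada", "email": "ada@example.com"},    # chunk 2 fills the gap
           {"name": "Bob", "email": None}])
merged = acc.table()
acc.reset()                                                # next object starts clean
```

An LLM-driven merge can also reconcile rows that differ in spelling or format, which a strict key match like this one cannot.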
Schema
| Field | Type | Description | Default |
|---|---|---|---|
| extract.column | string | Column: name of the column | "column" |
| extract.defval | string | Default value | "" |
| extract.fields | array | | |
| extract.profile | string | | "default" |
| extract.type | string | Type | "text" |