Data Extractor

A RocketRide filter node that uses an LLM to pull a user-defined set of structured fields out of unstructured text or tables.

What it does

Reads incoming text or table chunks and, for each chunk, builds a prompt listing every configured column name, type, and optional default value, then calls the connected LLM (with expectJson: true) to extract values. The model is instructed to infer values even when column names don't appear verbatim in the source, and to always return a JSON array (empty if nothing is found). After the first chunk, the running table is passed back as context so subsequent chunks merge into it: duplicates are combined and empty fields are filled in from newer chunks.
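The prompt construction described above can be pictured with a minimal Python sketch. The function name `build_prompt`, the prompt wording, and the dictionary keys are illustrative assumptions, not the node's actual code; only the ingredients (column name, type, optional default, the running table as context, and the "always return a JSON array" rule) come from the description.

```python
import json

def build_prompt(fields, chunk, running_table=None):
    """Hypothetical helper: assemble an extraction prompt for one chunk."""
    lines = ["Extract the following columns from the input."]
    for f in fields:
        desc = f"- {f['column']} ({f['type']})"
        if f.get("defval"):
            # Only mention a default when one is configured.
            desc += f", default: {f['defval']}"
        lines.append(desc)
    lines.append("Infer values even if the column names do not appear verbatim.")
    lines.append("Always return a JSON array; return [] if nothing is found.")
    if running_table:
        # After the first chunk, the rows found so far are passed back as context.
        lines.append("Rows extracted so far (merge duplicates, fill empty fields):")
        lines.append(json.dumps(running_table))
    lines.append("Input:")
    lines.append(chunk)
    return "\n".join(lines)

fields = [
    {"column": "invoice_no", "type": "text", "defval": ""},
    {"column": "total", "type": "decimal", "defval": "0"},
]
first = build_prompt(fields, "Invoice A-17, amount due 120.50")
second = build_prompt(
    fields,
    "Paid in full on receipt.",
    running_table=[{"invoice_no": "A-17", "total": 120.5}],
)
```

In this sketch the first prompt carries no context, while the second includes the rows already extracted so the model can merge into them.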

Intermediate LLM answers are intercepted by the node (via preventDefault) and never passed downstream. Only the final consolidated result is emitted when the object closes. Fields with an empty column name or type are skipped at pipeline start with a warning and excluded from all extraction.
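As a rough illustration of the field check at pipeline start, the sketch below drops entries whose column name or type is empty and logs a warning for each. The function name `usable_fields`, the logger name, and the choice to treat whitespace-only values as empty are assumptions made for this example.

```python
import logging

logger = logging.getLogger("extract")  # hypothetical logger name

def usable_fields(fields):
    """Return only the fields that have both a column name and a type."""
    kept = []
    for index, field in enumerate(fields):
        # Assumption: whitespace-only values count as empty.
        column = (field.get("column") or "").strip()
        ftype = (field.get("type") or "").strip()
        if not column or not ftype:
            logger.warning("Skipping field %d: empty column name or type", index)
            continue
        kept.append({**field, "column": column, "type": ftype})
    return kept

configured = [
    {"column": "name", "type": "text"},
    {"column": "", "type": "int"},   # skipped: no column name
    {"column": "age", "type": ""},   # skipped: no type
]
active = usable_fields(configured)
```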

No third-party Python dependencies (requirements.txt is empty).

Connections

Connection	Required	Description
`llm`	yes (min 1)	LLM used to extract field values

Configuration

Lanes

Lane in	Lane out	Description
`text`	`answers`	Extract fields from text, emit as JSON
`text`	`documents`	Extract fields from text, emit one document per row
`table`	`answers`	Extract/transform fields from a table, emit as JSON
`table`	`documents`	Extract/transform fields from a table, emit one document per row

On close, the answers lane (if connected) receives one JSON answer containing the full table. The documents lane (if connected) receives one document per extracted row, with the row serialized as JSON in the document content.
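The two output shapes can be sketched as follows. This is an assumption-laden illustration: `emit_on_close` and its boolean parameters are hypothetical, and the real node writes to pipeline lanes rather than returning lists.

```python
import json

def emit_on_close(table, answers_connected, documents_connected):
    """Hypothetical helper: shape the final table for each connected lane."""
    answers, documents = [], []
    if answers_connected:
        # One JSON answer containing the full table.
        answers.append(json.dumps(table))
    if documents_connected:
        # One document per extracted row, serialized as JSON.
        documents = [json.dumps(row) for row in table]
    return answers, documents

table = [
    {"invoice_no": "A-17", "total": 120.5},
    {"invoice_no": "B-02", "total": 80.0},
]
answers, documents = emit_on_close(table, True, True)
```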

Fields

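The node takes a list of fields to extract (fields, 1-32 entries). Each entry has a column, a type, and a defval; fields and profile are set on the node itself:

Field	Type	Description
`column`	string	Name of the column to extract. Default "column".
`type`	string	Data type of the column (see supported types below). Default "text".
`defval`	string	Optional default value for the column. Default empty.
`fields`	array	The list of field entries (1-32).
`profile`	string	Configuration profile. Default "default".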

Supported types: text (Text), decimal (Number), int (Integer), date (Date), time (Time), datetime (DateTime), timestamp (Timestamp), binary (Binary), json (JSON), html (HTML), url (URL), email (Email), phone (Phone), ipv4 (IPv4), ipv6 (IPv6), uuid (UUID), guid (GUID)

A single configuration profile exists (default); it carries the field list above. The profile selector field (extract.profile) is hidden in the UI.
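For orientation, a field list might look like the Python dictionary below. The nesting of keys and the `check_config` helper are assumptions for illustration only; the helper is a caller-side sanity check against the documented limits and type names, not something the node is known to run.

```python
SUPPORTED_TYPES = {
    "text", "decimal", "int", "date", "time", "datetime", "timestamp",
    "binary", "json", "html", "url", "email", "phone", "ipv4", "ipv6",
    "uuid", "guid",
}

# Hypothetical configuration shape; the real serialized layout may differ.
config = {
    "profile": "default",
    "fields": [
        {"column": "customer", "type": "text", "defval": ""},
        {"column": "order_date", "type": "date", "defval": ""},
        {"column": "contact", "type": "email", "defval": ""},
    ],
}

def check_config(cfg):
    """Return a list of problems found in a field list (empty if none)."""
    problems = []
    fields = cfg.get("fields", [])
    if not 1 <= len(fields) <= 32:
        problems.append(f"expected 1-32 fields, got {len(fields)}")
    for entry in fields:
        ftype = entry.get("type", "text")  # documented default type
        if ftype not in SUPPORTED_TYPES:
            name = entry.get("column", "column")
            problems.append(f"unsupported type {ftype!r} for column {name!r}")
    return problems
```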

Behaviour

The LLM infers field values even when the source text does not use the exact column names: it reasons about what each column likely contains based on document context. Multiple chunks are merged progressively, with later chunks filling gaps left by earlier ones, before the final result is emitted. The accumulated table is reset for every new object, so extraction state never leaks between objects in the pipeline.
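The per-object lifecycle can be sketched as a small state holder. The class and method names are hypothetical, the LLM is replaced by a stub, and the sketch assumes each LLM response is the full merged table that replaces the running one.

```python
class ExtractionState:
    """Hypothetical holder for the table accumulated for one object."""

    def __init__(self):
        self.table = []

    def open_object(self):
        # A new object always starts from an empty table.
        self.table = []

    def on_chunk(self, chunk, llm):
        # Pass the running table as context only after the first chunk.
        context = self.table if self.table else None
        self.table = llm(chunk, context)

    def close_object(self):
        # Emit the consolidated result and clear the state.
        result, self.table = self.table, []
        return result

def fake_llm(chunk, context):
    """Stub standing in for the connected LLM: appends one row per chunk."""
    rows = list(context or [])
    rows.append({"source": chunk})
    return rows

state = ExtractionState()
state.open_object()
state.on_chunk("a", fake_llm)
state.on_chunk("b", fake_llm)
first_object = state.close_object()

state.open_object()
state.on_chunk("c", fake_llm)
second_object = state.close_object()
```

Here `second_object` contains only the row from chunk "c", showing that nothing carries over from the first object.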

Schema

Field	Type	Description	Default
`extract.column`	`string`	Name of column	`"column"`
`extract.defval`	`string`	Default value	`""`
`extract.fields`	`array`	Fields to extract (1-32 entries)
`extract.profile`	`string`		`"default"`
`extract.type`	`string`	Type	`"text"`
