Data Extractor

A RocketRide filter node that uses an LLM to pull a user-defined set of structured fields out of unstructured text or tables.

What it does

Reads incoming text or table chunks and, for each chunk, builds a prompt listing every configured column name, type, and optional default value, then calls the connected LLM (with expectJson: true) to extract values. The model is instructed to infer values even when column names don't appear verbatim in the source, and to always return a JSON array (empty if nothing is found). After the first chunk, the running table is passed back as context so subsequent chunks merge into it: duplicates are combined and empty fields are filled in from newer chunks.
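
The per-chunk flow described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual code: the prompt wording, the function names `build_prompt` and `extract`, and the `ask_llm` callable (standing in for the connected LLM called with expectJson: true) are all assumptions made for the example.

```python
import json
from typing import Callable, Optional


def build_prompt(fields: list, chunk: str, running_table: Optional[list] = None) -> str:
    """Build an extraction prompt listing every configured column (hypothetical wording)."""
    lines = [
        "Extract the following columns from the text below.",
        "Infer values even when the column names do not appear verbatim.",
        "Always return a JSON array of row objects; return [] if nothing is found.",
        "",
        "Columns:",
    ]
    for field in fields:
        line = f"- {field['column']} ({field['type']})"
        if field.get("defval"):
            # The default value is optional, so it is only listed when set.
            line += f", default: {field['defval']}"
        lines.append(line)
    if running_table:
        # After the first chunk, the table so far is passed back as context.
        lines += [
            "",
            "Rows extracted so far (merge into these, combine duplicates, fill empty fields):",
            json.dumps(running_table),
        ]
    lines += ["", "Text:", chunk]
    return "\n".join(lines)


def extract(fields: list, chunks: list, ask_llm: Callable[[str], str]) -> list:
    """Run every chunk through the LLM, carrying the running table forward."""
    table: list = []
    for chunk in chunks:
        answer = ask_llm(build_prompt(fields, chunk, table or None))
        rows = json.loads(answer)
        if isinstance(rows, list):
            # The model returns the merged table, so it replaces the old one.
            table = rows
    return table
```

In this sketch the merge itself is delegated to the model, which matches the description above; only the bookkeeping around it is shown.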

Intermediate LLM answers are intercepted by the node (via preventDefault) and never passed downstream. Only the final consolidated result is emitted when the object closes. Fields with an empty column name or type are skipped at pipeline start with a warning and excluded from all extraction.
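
As an illustration of the field check at pipeline start, the sketch below drops entries with an empty column name or type and logs a warning for each. The function name `usable_fields`, the logger name, and the choice to treat whitespace-only values as empty are assumptions for the example, not the node's documented behaviour.

```python
import logging

logger = logging.getLogger("data_extractor_sketch")  # hypothetical logger name


def usable_fields(fields: list) -> list:
    """Return only the fields that have both a column name and a type."""
    kept = []
    for index, field in enumerate(fields):
        column = (field.get("column") or "").strip()
        ftype = (field.get("type") or "").strip()
        if not column or not ftype:
            # Skipped fields take no part in any later extraction.
            logger.warning("Skipping field %d: empty column name or type", index)
            continue
        kept.append({"column": column, "type": ftype, "defval": field.get("defval", "")})
    return kept
```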

No third-party Python dependencies (requirements.txt is empty).


Connections

| Connection | Required | Description |
| --- | --- | --- |
| llm | yes (min 1) | LLM used to extract field values |

Configuration

Lanes

| Lane in | Lane out | Description |
| --- | --- | --- |
| text | answers | Extract fields from text, emit as JSON |
| text | documents | Extract fields from text, emit one document per row |
| table | answers | Extract/transform fields from a table, emit as JSON |
| table | documents | Extract/transform fields from a table, emit one document per row |

On close, the answers lane (if connected) receives one JSON answer containing the full table. The documents lane (if connected) receives one document per extracted row, with the row serialized as JSON in the document content.
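
The close-time behaviour of the two output lanes can be pictured with the sketch below. It only models the shape of the output; the function name `emit_on_close` and the boolean flags for connected lanes are hypothetical, and the real node writes to pipeline lanes rather than returning a dictionary.

```python
import json


def emit_on_close(table: list, answers_connected: bool, documents_connected: bool) -> dict:
    """Model what each connected output lane receives when the object closes."""
    out = {"answers": [], "documents": []}
    if answers_connected:
        # One JSON answer carrying the full table.
        out["answers"].append(json.dumps(table))
    if documents_connected:
        # One document per extracted row, with the row serialized as JSON.
        out["documents"] = [json.dumps(row) for row in table]
    return out
```

With two extracted rows and both lanes connected, this yields one answer and two documents.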

Fields

The node takes a list of fields to extract (fields, 1-32 entries). Each entry has:

| Field | Type | Description |
| --- | --- | --- |
| column | string | Default "column". Name of column |
| type | string | Default "text". |
| defval | string | Default empty. |
| fields | array | |
| profile | string | Default "default". |

Supported types: text (Text), decimal (Number), int (Integer), date (Date), time (Time), datetime (DateTime), timestamp (Timestamp), binary (Binary), json (JSON), html (HTML), url (URL), email (Email), phone (Phone), ipv4 (IPv4), ipv6 (IPv6), uuid (UUID), guid (GUID)
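
A field list can be sanity-checked against the limits above before it is used. The sketch below is an optional pre-check written for illustration: the function name `check_fields` is hypothetical, and whether the node itself rejects unknown types or out-of-range list sizes is not stated here.

```python
SUPPORTED_TYPES = {
    "text", "decimal", "int", "date", "time", "datetime", "timestamp",
    "binary", "json", "html", "url", "email", "phone", "ipv4", "ipv6",
    "uuid", "guid",
}


def check_fields(fields: list) -> list:
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    problems = []
    if not 1 <= len(fields) <= 32:
        problems.append(f"expected 1-32 fields, got {len(fields)}")
    for index, field in enumerate(fields):
        ftype = field.get("type", "text")  # "text" is the documented default
        if ftype not in SUPPORTED_TYPES:
            problems.append(f"field {index}: unsupported type {ftype!r}")
    return problems


# Example field list using the documented keys: column, type, defval.
example_fields = [
    {"column": "invoice_number", "type": "text", "defval": ""},
    {"column": "total", "type": "decimal", "defval": "0"},
    {"column": "issued", "type": "date", "defval": ""},
]
```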

A single configuration profile exists (default); it carries the field list above. The profile selector field (extract.profile) is hidden in the UI.


Behaviour

The LLM infers field values even when the source text does not use the exact column names: it reasons about what each column likely contains based on document context. Chunks are merged progressively, with newer chunks filling gaps left by earlier ones, before the final result is emitted. The accumulated table is reset for every new object, so extraction state never leaks between objects in the pipeline.
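
The merge semantics can be illustrated with a small deterministic model. In the node the merge is carried out by the LLM from the running table passed as context, so the class below is only an analogy: `RunningTable`, its method names, and the idea of a single key column that identifies duplicates are all assumptions for the example.

```python
class RunningTable:
    """Toy model of the accumulated table: merge rows per object, reset between objects."""

    def __init__(self, key: str):
        self.key = key   # hypothetical column used to recognise duplicate rows
        self.rows = {}

    def open_object(self) -> None:
        # State is reset for every new object so nothing leaks between objects.
        self.rows = {}

    def merge(self, new_rows: list) -> None:
        for row in new_rows:
            current = self.rows.setdefault(row.get(self.key), {})
            for column, value in row.items():
                # Only empty fields are filled; existing values are kept.
                if current.get(column) in (None, ""):
                    current[column] = value

    def close_object(self) -> list:
        return list(self.rows.values())
```

For example, a first chunk yielding a row with an empty email and a second chunk yielding the same person with an email end up as one row with the email filled in.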


Schema

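| Field | Type | Description | Default |
| --- | --- | --- | --- |
| extract.column | string | Column: Name of column | "column" |
| extract.defval | string | Default Value | "" |
| extract.fields | array | | |
| extract.profile | string | | "default" |
| extract.type | string | Type | "text" |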