Named Entity Recognition

A RocketRide text-processing node that identifies and extracts named entities from text and documents using HuggingFace transformer models.

What it does

Runs a HuggingFace token-classification (NER) pipeline over everything flowing through the node and attaches the recognized entities to document metadata for downstream filtering, search, and analysis.

The node calls the transformers pipeline through the RocketRide model server (ai.common.models.transformers). The pipeline uses the model server when it is available and falls back to local execution otherwise, so the node has no local Python dependencies of its own. The node is GPU-capable and registers as a filter with class type text.

The model is loaded once per pipeline run (in the global context) and shared across all instances. Entities below the configured confidence threshold (default 0.9) are discarded. If entity extraction fails on a piece of text, the error is logged and an empty entity list is returned; the pipeline keeps running and the original content still passes through.

Each extracted entity carries: entity_group (type such as PER, ORG, LOC), word (the entity text), score (confidence), and start / end (character offsets).
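
The thresholding and failure behavior described above can be sketched in a few lines. This is an illustration only, not the node's actual code: the function name `extract_entities`, the `ner` callable, and the two stand-in recognizers are hypothetical, and treating a score exactly equal to the threshold as "kept" is an assumption.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("ner_sketch")

Entity = Dict[str, object]


def extract_entities(
    text: str,
    ner: Callable[[str], List[Entity]],
    min_confidence: float = 0.9,
) -> List[Entity]:
    """Run `ner` on `text` and keep entities scoring at least `min_confidence`.

    If the recognizer raises, the error is logged and an empty list is
    returned so the caller can keep passing the original text through.
    """
    try:
        raw = ner(text)
    except Exception:
        logger.exception("entity extraction failed")
        return []
    # Assumption: an entity exactly at the threshold is kept.
    return [e for e in raw if float(e["score"]) >= min_confidence]


def fake_ner(text: str) -> List[Entity]:
    """Hypothetical stand-in for a real recognizer, with made-up scores."""
    return [
        {"entity_group": "PER", "word": "Ada Lovelace", "score": 0.998, "start": 0, "end": 12},
        {"entity_group": "LOC", "word": "London", "score": 0.62, "start": 22, "end": 28},
    ]


def broken_ner(text: str) -> List[Entity]:
    """Hypothetical recognizer that always fails."""
    raise RuntimeError("model unavailable")


sentence = "Ada Lovelace lived in London."
kept = extract_entities(sentence, fake_ner)      # only the PER entity survives
failed = extract_entities(sentence, broken_ner)  # error is logged, list is empty
```

Outside RocketRide, a HuggingFace `pipeline("token-classification", model=..., aggregation_strategy="simple")` object could be passed as `ner`, since it returns dictionaries with the same `entity_group`, `word`, `score`, `start`, and `end` keys.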


Configuration

Lanes

| Lane in | Lane out | Description |
| --- | --- | --- |
| text | text | Extract entities, pass the original text through unchanged |
| documents | documents | Extract entities from each document's content and enrich document metadata |

On the documents lane, when Store in metadata is on (the default), each document copy gains:

  • entities_<type>: one key per entity type, lowercased (e.g. entities_per, entities_org, entities_loc), holding a deduplicated, sorted list of entity texts
  • entities_count: total number of entities found in the document

The original documents are never mutated; enriched copies are written downstream.
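
As a rough sketch of the enrichment step, the snippet below builds the `entities_<type>` and `entities_count` keys on a copy of a document. The dictionary-shaped document with a `metadata` key and the helper name `enrich_metadata` are assumptions made for illustration; the node's real document type may differ. It also assumes `entities_count` counts every entity that passed the confidence threshold, before deduplication.

```python
import copy
from typing import Dict, List, Set


def enrich_metadata(document: dict, entities: List[dict]) -> dict:
    """Return an enriched copy of `document`; the input is left untouched."""
    enriched = copy.deepcopy(document)
    metadata = enriched.setdefault("metadata", {})

    # Group entity texts by lowercased type, deduplicating with a set.
    grouped: Dict[str, Set[str]] = {}
    for entity in entities:
        key = "entities_" + str(entity["entity_group"]).lower()
        grouped.setdefault(key, set()).add(str(entity["word"]))

    for key, words in grouped.items():
        metadata[key] = sorted(words)

    # Assumption: the count covers all entities, including repeated mentions.
    metadata["entities_count"] = len(entities)
    return enriched


original = {"content": "Ada met Bob at Acme. Ada left.", "metadata": {"source": "a.txt"}}
found = [
    {"entity_group": "PER", "word": "Bob", "score": 0.99},
    {"entity_group": "PER", "word": "Ada", "score": 0.99},
    {"entity_group": "ORG", "word": "Acme", "score": 0.97},
    {"entity_group": "PER", "word": "Ada", "score": 0.98},
]
enriched_doc = enrich_metadata(original, found)
```

Working on a deep copy keeps the "never mutated" guarantee even when metadata values are nested lists or dictionaries.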

Fields

The node is configured by picking a model profile (see below). The custom profile additionally exposes the model name field.

| Field | Type | Description |
| --- | --- | --- |
| model | string | HuggingFace model to use for NER |
| aggregation_strategy | string | Default "simple". How to combine word pieces into entities |
| min_confidence | number | Default 0.9. Minimum confidence score (0.0-1.0) for entity detection |
| store_in_metadata | boolean | Default true. Add extracted entities to document metadata fields |
| profile | string | Default "bertLarge". NER model configuration |

If no model is configured, the recognizer falls back to dbmdz/bert-large-cased-finetuned-conll03-english.


Profiles

The default profile is bertLarge.

| Profile key | Title | Model | Notes |
| --- | --- | --- | --- |
| bertLarge | BERT Large (English) - high accuracy for English text | dbmdz/bert-large-cased-finetuned-conll03-english | Default |
| bertBase | BERT Base (English) - balanced performance | dslim/bert-base-NER | |
| distilbert | DistilBERT (English) - fast and lightweight | Davlan/distilbert-base-multilingual-cased-ner-hrl | Multilingual model despite the title |
| xlmRoberta | XLM-RoBERTa (Multilingual) - supports 100+ languages | Davlan/xlm-roberta-base-ner-hrl | |
| deberta | DeBERTa v3 (English) - state-of-the-art accuracy | dslim/distilbert-NER | Currently maps to DistilBERT NER |
| biomedical | BioBERT (Biomedical) - medical/scientific entities | dmis-lab/biobert-base-cased-v1.1 | min_confidence defaults to 0.85 |
| custom | Custom model | (user-specified) | Any compatible HuggingFace NER model |

All profiles use aggregation_strategy: simple and min_confidence: 0.9 unless noted above; both can be overridden in the node config.
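
One way to picture how a profile, the shared defaults, and node-level overrides combine is the sketch below. The model names and default values come from the tables above. The function `resolve_config`, the precedence order (node config over profile over shared defaults), and the example model name `my-org/my-ner` are assumptions made for illustration, not confirmed RocketRide behavior.

```python
from typing import Any, Dict

FALLBACK_MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Shared defaults taken from the Fields table.
DEFAULTS: Dict[str, Any] = {
    "aggregation_strategy": "simple",
    "min_confidence": 0.9,
    "store_in_metadata": True,
}

# Per-profile settings taken from the Profiles table.
PROFILES: Dict[str, Dict[str, Any]] = {
    "bertLarge": {"model": FALLBACK_MODEL},
    "bertBase": {"model": "dslim/bert-base-NER"},
    "distilbert": {"model": "Davlan/distilbert-base-multilingual-cased-ner-hrl"},
    "xlmRoberta": {"model": "Davlan/xlm-roberta-base-ner-hrl"},
    "deberta": {"model": "dslim/distilbert-NER"},
    "biomedical": {"model": "dmis-lab/biobert-base-cased-v1.1", "min_confidence": 0.85},
    "custom": {},  # the model is supplied by the user
}


def resolve_config(node_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge shared defaults, profile settings, and explicit overrides."""
    profile = node_config.get("profile", "bertLarge")
    resolved = {**DEFAULTS, **PROFILES[profile]}
    # Assumed precedence: values set on the node override the profile.
    overrides = {k: v for k, v in node_config.items() if k != "profile" and v is not None}
    resolved.update(overrides)
    # With no model configured, fall back to the documented default model.
    if not resolved.get("model"):
        resolved["model"] = FALLBACK_MODEL
    resolved["profile"] = profile
    return resolved


default_cfg = resolve_config({})
biomedical_cfg = resolve_config({"profile": "biomedical"})
strict_biomedical_cfg = resolve_config({"profile": "biomedical", "min_confidence": 0.95})
custom_cfg = resolve_config({"profile": "custom", "model": "my-org/my-ner"})  # hypothetical model name
bare_custom_cfg = resolve_config({"profile": "custom"})
```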


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| ner.aggregation_strategy | string | Entity aggregation strategy: How to combine word pieces into entities | "simple" |
| ner.min_confidence | number | Minimum confidence threshold: Minimum confidence score (0.0-1.0) for entity detection | 0.9 |
| ner.model | string | Model name: HuggingFace model to use for NER | |
| ner.profile | string | Model: NER model configuration | "bertLarge" |
| ner.store_in_metadata | boolean | Store entities in document metadata: Add extracted entities to document metadata fields | true |