Astra DB

A RocketRide store node that persists embedded documents in DataStax Astra DB and retrieves them by semantic or keyword search.

What it does

Connects to an Astra DB collection via the astrapy DataAPIClient and exposes it as a standard RocketRide document store. Documents arriving on the documents lane are written to the collection; questions arriving on the questions lane are answered by searching that collection and emitting matching documents downstream.

Collections are created on demand: the vector dimension is taken from the first incoming embedding, the similarity metric comes from configuration, and BM25 lexical search is enabled automatically with the standard analyzer. No manual collection setup is required.
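
As a rough illustration of the on-demand setup described above, the sketch below derives collection options from the first embedding. The function name `build_collection_options` is hypothetical, and the dictionary layout follows the Data API `createCollection` options as commonly documented; it is an assumption about the shape of the request, not the node's actual code.

```python
from typing import Any, Sequence


def build_collection_options(first_embedding: Sequence[float], metric: str) -> dict[str, Any]:
    """Derive collection options from the first embedding seen at ingest.

    `metric` is whatever the node configuration supplies; no default is assumed here.
    """
    if not first_embedding:
        raise ValueError("cannot infer a vector dimension from an empty embedding")
    return {
        # The dimension is taken from the embedding itself, not from configuration.
        "vector": {"dimension": len(first_embedding), "metric": metric},
        # BM25 lexical search with the standard analyzer, as described above.
        "lexical": {"enabled": True, "analyzer": "standard"},
    }


options = build_collection_options([0.12, -0.03, 0.88], metric="cosine")
print(options["vector"]["dimension"])
```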

Key defaults every user should know:

  • Documents must be run through an embedding node before reaching this node: a chunk without an embedding raises an error at ingest time.
  • Semantic search results with a similarity score below 0.20 are silently dropped.
  • Inserts are batched 500 documents at a time; chunks with a near-zero vector magnitude are skipped.
  • Writing chunk 0 of an object deletes all existing chunks with the same objectId before inserting (upsert semantics for re-ingested documents).
  • Deletion is soft by default: markDeleted sets meta.isDeleted: true and those chunks are excluded from every query unless the filter explicitly requests them.
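
The ingest-side defaults above (embedding required, near-zero vectors skipped, batches of 500) can be sketched in plain Python. This is a minimal illustration, not the node's implementation: the chunk layout, the helper names, and the `MIN_MAGNITUDE` cutoff are all assumptions, since the source does not state the exact threshold.

```python
import math
from typing import Any, Iterable, Iterator

BATCH_SIZE = 500       # batch size stated in the defaults above
MIN_MAGNITUDE = 1e-9   # assumed cutoff for "near-zero"; the real value is not documented


def ingestible(chunks: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield chunks that can be written; raise if a chunk has no embedding."""
    for chunk in chunks:
        embedding = chunk.get("embedding")
        if not embedding:
            raise ValueError("chunk has no embedding; add an embedding node upstream")
        magnitude = math.sqrt(sum(x * x for x in embedding))
        if magnitude < MIN_MAGNITUDE:
            continue  # near-zero vectors are skipped rather than stored
        yield chunk


def batched(chunks: Iterable[dict[str, Any]], size: int = BATCH_SIZE) -> Iterator[list[dict[str, Any]]]:
    """Group chunks into lists of at most `size` items."""
    batch: list[dict[str, Any]] = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each list yielded by `batched(ingestible(chunks))` would then correspond to one bulk insert.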

Configuration

Lanes

| Lane in | Lane out | Description |
| --- | --- | --- |
| documents | (none) | Ingest pre-embedded documents into the collection |
| questions | documents | Return matching documents |
| questions | answers | Return matching documents as an answer |
| questions | questions | Enrich the question with matching documents for downstream nodes |

Fields

| Field | Type | Description |
| --- | --- | --- |
| api_endpoint | string | Enter the server API endpoint, e.g. ..apps.astra.datastax.com |
| application_token | string | Enter the server API application token |
| provider | string | Default "astra_db". |

Profiles

Two profiles are built in; cloud is the default.

| Profile | Description |
| --- | --- |
| cloud | Astra DB cloud server; requires api_endpoint and application_token |
| local | Local test server at http://localhost:8080 with token test-token and collection ROCKETRIDE |

Search modes

Both modes are available without extra configuration. The pipeline's question type determines which runs:

  • Semantic: vector similarity search using the $vector sort on the question's embedding. Results carry similarity scores; anything scoring below 0.20 is discarded before results are returned.
  • Keyword: native BM25 lexical search using the $lexical sort on the question's text.

Both modes honour the standard document filter: node id, parent path, permissions, object ids, table ids, chunk-id ranges, and the soft-delete flag.
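
To make the two modes concrete, the sketch below builds the sort clause for each and applies the 0.20 cutoff to semantic results. The question layout (`type`, `embedding`, `text`) and the helper names are hypothetical; the `$vector` and `$lexical` sort keys come from the description above, and `$similarity` is assumed to be the field under which each result carries its score.

```python
from typing import Any

SIMILARITY_THRESHOLD = 0.20  # cutoff stated above for semantic results


def build_sort(question: dict[str, Any]) -> dict[str, Any]:
    """Pick the sort clause from a hypothetical question `type` field."""
    if question["type"] == "semantic":
        return {"$vector": question["embedding"]}
    if question["type"] == "keyword":
        return {"$lexical": question["text"]}
    raise ValueError(f"unsupported question type: {question['type']!r}")


def keep_relevant(results: list[dict[str, Any]], threshold: float = SIMILARITY_THRESHOLD) -> list[dict[str, Any]]:
    """Drop semantic results scoring below the threshold (assumes a `$similarity` field)."""
    return [r for r in results if r.get("$similarity", 0.0) >= threshold]


hits = [
    {"content": "close match", "$similarity": 0.91},
    {"content": "weak match", "$similarity": 0.12},
]
print([h["content"] for h in keep_relevant(hits)])  # prints ['close match']
```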


Document lifecycle

  • Ingest: each chunk is inserted with a generated UUID _id, the embedding stored as $vector, the text stored as content, and all metadata stored under meta. Re-ingesting an object (chunk 0 received again) first deletes that object's previous chunks, then inserts the new batch.
  • Soft delete / restore: markDeleted / markActive flip meta.isDeleted on all chunks for the given object ids. Soft-deleted chunks are hidden from default queries.
  • Hard delete: remove permanently deletes all chunks matching the given object ids.
  • Render: rebuilds a full document by fetching all non-deleted chunks for an object id, sorting them by chunkId in application code (Astra DB does not guarantee order), and concatenating the content in order.
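
The soft-delete and render steps can be illustrated against an in-memory list standing in for the collection. This is a sketch under assumptions: `objectId`, `chunkId`, and `isDeleted` are taken to live under `meta`, and chunks are joined with no separator; the node's real queries and join behaviour may differ.

```python
from typing import Any


def set_deleted(chunks: list[dict[str, Any]], object_ids: set[str], deleted: bool) -> None:
    """Flip meta.isDeleted for every chunk of the given objects (markDeleted / markActive)."""
    for chunk in chunks:
        if chunk["meta"]["objectId"] in object_ids:
            chunk["meta"]["isDeleted"] = deleted


def render(chunks: list[dict[str, Any]], object_id: str) -> str:
    """Rebuild a document from its non-deleted chunks, ordered by chunkId."""
    live = [
        c for c in chunks
        if c["meta"]["objectId"] == object_id and not c["meta"].get("isDeleted", False)
    ]
    # The store does not guarantee order, so sorting happens in application code.
    live.sort(key=lambda c: c["meta"]["chunkId"])
    return "".join(c["content"] for c in live)


store = [
    {"content": "world", "meta": {"objectId": "doc-1", "chunkId": 1}},
    {"content": "Hello, ", "meta": {"objectId": "doc-1", "chunkId": 0}},
]
print(render(store, "doc-1"))  # prints Hello, world
```

Calling `set_deleted(store, {"doc-1"}, True)` before `render` would yield an empty string, and `set_deleted(store, {"doc-1"}, False)` restores the document.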

All read operations return empty results when the collection does not yet exist; the collection is not created until the first ingest.


Authentication

Set api_endpoint to the database's Data API URL and application_token to the token generated in the Astra DB console. The token is passed directly to DataAPIClient at pipeline startup. No other authentication modes are supported.
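
A connection following this description might look like the sketch below. `DataAPIClient` and `get_database` are astrapy calls; the `config` mapping and the helper names `require_credentials` and `connect` are hypothetical, and how the node actually wires these values is not shown in the source.

```python
from typing import Any, Mapping

REQUIRED_FIELDS = ("api_endpoint", "application_token")


def require_credentials(config: Mapping[str, Any]) -> tuple[str, str]:
    """Return (api_endpoint, application_token), or raise if either is missing."""
    missing = [name for name in REQUIRED_FIELDS if not config.get(name)]
    if missing:
        raise ValueError("missing Astra DB settings: " + ", ".join(missing))
    return config["api_endpoint"], config["application_token"]


def connect(config: Mapping[str, Any]):
    """Open a database handle with token authentication."""
    # Imported lazily so this sketch can be loaded without astrapy installed.
    from astrapy import DataAPIClient

    api_endpoint, application_token = require_credentials(config)
    client = DataAPIClient(application_token)
    return client.get_database(api_endpoint)
```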


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| astra_db.api_endpoint | string | API Endpoint. Enter the server API endpoint, e.g. ..apps.astra.datastax.com | |
| astra_db.application_token | string | Application Token. Enter the server API application token | |
| astra_db.provider | string | const: "astra_db" | |

Dependencies

  • astrapy