# astra_db

A RocketRide store node that persists embedded documents in DataStax Astra DB and retrieves them by semantic or keyword search.

## What it does

Connects to an Astra DB collection via the **astrapy** `DataAPIClient` and exposes it as a standard RocketRide document store. Documents arriving on the `documents` lane are written to the collection; questions arriving on the `questions` lane are answered by searching that collection and emitting matching documents downstream.

Collections are created on demand: the vector dimension is taken from the first incoming embedding, the similarity metric comes from configuration, and **BM25 lexical search is enabled automatically** with the `standard` analyzer. No manual collection setup is required.

Key defaults every user should know:

- Documents must be run through an embedding node before reaching this node: a chunk without an embedding raises an error at ingest time.
- Semantic search results with a similarity score **below 0.20 are silently dropped**.
- Inserts are batched **500 documents at a time**; chunks with a near-zero vector magnitude are skipped.
- Writing chunk `0` of an object **deletes all existing chunks with the same `objectId`** before inserting (upsert semantics for re-ingested documents).
- Deletion is soft by default: `markDeleted` sets `meta.isDeleted: true` and those chunks are excluded from every query unless the filter explicitly requests them.

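The ingest and scoring defaults listed above can be illustrated with a minimal, library-free sketch. It is not the node's implementation: the dictionary keys `embedding`, `chunkId`, and `score`, the helper names, and the `ZERO_EPS` cutoff for "near-zero" are all assumptions made for illustration; only the batch size of 500 and the 0.20 threshold come from this document.

```python
import math

BATCH_SIZE = 500   # insert batch size stated in this README
MIN_SCORE = 0.20   # semantic results scoring below this are dropped
ZERO_EPS = 1e-9    # ASSUMED cutoff for a "near-zero" vector magnitude


def ingestable(chunks):
    """Yield chunks that would be inserted; raise if an embedding is missing."""
    for chunk in chunks:
        vector = chunk.get("embedding")  # hypothetical key name
        if vector is None or len(vector) == 0:
            raise ValueError(f"chunk {chunk.get('chunkId')} has no embedding")
        magnitude = math.sqrt(sum(x * x for x in vector))
        if magnitude < ZERO_EPS:
            continue  # near-zero vector: skipped rather than treated as an error
        yield chunk


def batches(chunks, size=BATCH_SIZE):
    """Group ingestable chunks into lists of at most `size` items."""
    batch = []
    for chunk in ingestable(chunks):
        batch.append(chunk)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def keep_relevant(results, min_score=MIN_SCORE):
    """Drop semantic search results whose score is below the threshold."""
    return [r for r in results if r["score"] >= min_score]
```

With 1,001 valid chunks plus one all-zero vector, `batches` would yield groups of 500, 500, and 1, and the zero vector would be left out without raising. A result scoring exactly 0.20 is kept here because the rule only drops scores below the threshold.
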

---

## Configuration

### Lanes

| Lane in     | Lane out    | Description                                                      |
| ----------- | ----------- | ---------------------------------------------------------------- |
| `documents` | (none)      | Ingest pre-embedded documents into the collection                |
| `questions` | `documents` | Return matching documents                                        |
| `questions` | `answers`   | Return matching documents as an answer                           |
| `questions` | `questions` | Enrich the question with matching documents for downstream nodes |

### Fields

| Field | Type | Description |
|---|---|---|
| `api_endpoint` | string | Enter the server API endpoint e.g. ..apps.astra.datastax.com |
| `application_token` | string | Enter the server API application token |
| `provider` | string | Default "astra_db". |

---

## Profiles

Two profiles are built in; `cloud` is the default.

| Profile | Description                                                                                      |
| ------- | ------------------------------------------------------------------------------------------------ |
| `cloud` | Astra DB cloud server, requires `api_endpoint` and `application_token`                           |
| `local` | Local test server at `http://localhost:8080` with token `test-token` and collection `ROCKETRIDE` |

---

## Search modes

Both modes are available without extra configuration. The pipeline's question type determines which runs:

- **Semantic**: vector similarity search using the `$vector` sort on the question's embedding. Results carry similarity scores; anything scoring below 0.20 is discarded before results are returned.
- **Keyword**: native BM25 lexical search using the `$lexical` sort on the question's text.

Both modes honour the standard document filter: node id, parent path, permissions, object ids, table ids, chunk-id ranges, and the soft-delete flag.

---

## Document lifecycle

- **Ingest**: each chunk is inserted with a generated UUID `_id`, the embedding stored as `$vector`, the text stored as `content`, and all metadata stored under `meta`.
  Re-ingesting an object (chunk `0` received again) first deletes that object's previous chunks, then inserts the new batch.
- **Soft delete / restore**: `markDeleted` / `markActive` flip `meta.isDeleted` on all chunks for the given object ids. Soft-deleted chunks are hidden from default queries.
- **Hard delete**: `remove` permanently deletes all chunks matching the given object ids.
- **Render**: rebuilds a full document by fetching all non-deleted chunks for an object id, sorting them by `chunkId` in application code (Astra DB does not guarantee order), and concatenating the content in order.

All read operations return empty results when the collection does not yet exist; the collection is not created until the first ingest.

---

## Authentication

Set `api_endpoint` to the database's Data API URL and `application_token` to the token generated in the Astra DB console. The token is passed directly to `DataAPIClient` at pipeline startup. No other authentication modes are supported.

---

## Schema

| Field | Type | Description | Default |
|---|---|---|---|
| `astra_db.api_endpoint` | `string` | **API Endpoint**<br>Enter the server API endpoint e.g. ..apps.astra.datastax.com | |
| `astra_db.application_token` | `string` | **Application Token**<br>Enter the server API application token | |
| `astra_db.provider` | `string` | | const: `"astra_db"` |

## Dependencies

- `astrapy`

## Source

[View source](https://github.com/rocketride-org/rocketride-server/tree/develop/nodes/src/nodes/astra_db)