# astra_db

A RocketRide store node that persists embedded documents in DataStax Astra DB and retrieves them by semantic or keyword search.

## What it does

Connects to an Astra DB collection via the **astrapy** `DataAPIClient` and exposes it as a standard RocketRide document store. Documents arriving on the `documents` lane are written to the collection; questions arriving on the `questions` lane are answered by searching that collection and emitting matching documents downstream.

Collections are created on demand: the vector dimension is taken from the first incoming embedding, the similarity metric comes from configuration, and **BM25 lexical search is enabled automatically** with the `standard` analyzer. No manual collection setup is required.
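
As a rough illustration of that on-demand setup, the sketch below derives collection options from the first embedding it sees. The helper name `collection_options` and the `cosine` default are assumptions made for this example, and the option layout follows the Data API `createCollection` options as commonly documented; the node's own code may build this differently.

```python
from typing import Any, Sequence


def collection_options(first_embedding: Sequence[float], metric: str = "cosine") -> dict[str, Any]:
    """Build collection options from the first embedding seen (hypothetical helper)."""
    dimension = len(first_embedding)
    if dimension == 0:
        raise ValueError("cannot infer a vector dimension from an empty embedding")
    return {
        # Dimension comes from the data; the metric comes from configuration.
        "vector": {"dimension": dimension, "metric": metric},
        # BM25 lexical search with the `standard` analyzer, as described above.
        "lexical": {"enabled": True, "analyzer": "standard"},
    }


options = collection_options([0.1, 0.2, 0.3], metric="cosine")
print(options)
```

Because the dimension is fixed when the collection is created, switching to an embedding model with a different output size would presumably require a new collection.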

Key defaults every user should know:

- Documents must be run through an embedding node before reaching this node: a chunk without an embedding raises an error at ingest time.
- Semantic search results with a similarity score **below 0.20 are silently dropped**.
- Inserts are batched **500 documents at a time**; chunks with a near-zero vector magnitude are skipped.
- Writing chunk `0` of an object **deletes all existing chunks with the same `objectId`** before inserting (upsert semantics for re-ingested documents).
- Deletion is soft by default: `markDeleted` sets `meta.isDeleted: true` and those chunks are excluded from every query unless the filter explicitly requests them.
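
The ingest-side defaults above can be pictured with a small, self-contained sketch. It is not the node's implementation: the `embedding` key, the helper names, and the `MIN_MAGNITUDE` cutoff are assumptions chosen for illustration; only the batch size of 500 comes from this page.

```python
import math
from typing import Any, Iterator

BATCH_SIZE = 500        # batch size stated in the defaults above
MIN_MAGNITUDE = 1e-9    # assumed cutoff; the node's real threshold is not documented here


def prepare_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reject chunks without an embedding and skip near-zero vectors."""
    kept = []
    for chunk in chunks:
        embedding = chunk.get("embedding")
        if not embedding:
            raise ValueError("chunk has no embedding; add an embedding node upstream")
        magnitude = math.sqrt(sum(x * x for x in embedding))
        if magnitude < MIN_MAGNITUDE:
            continue  # near-zero vectors are skipped rather than stored
        kept.append(chunk)
    return kept


def batches(items: list[Any], size: int = BATCH_SIZE) -> Iterator[list[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With these helpers, 1,201 prepared chunks would be written in three batches of 500, 500, and 201.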

---

## Configuration

### Lanes

| Lane in     | Lane out    | Description                                                      |
| ----------- | ----------- | ---------------------------------------------------------------- |
| `documents` | (none)      | Ingest pre-embedded documents into the collection                |
| `questions` | `documents` | Return matching documents                                        |
| `questions` | `answers`   | Return matching documents as an answer                           |
| `questions` | `questions` | Enrich the question with matching documents for downstream nodes |

### Fields

| Field | Type | Description |
|---|---|---|
| `api_endpoint` | string | Enter the server API endpoint, e.g. `<instance-name>.<region>.apps.astra.datastax.com` |
| `application_token` | string | Enter the server API application token |
| `provider` | string | Constant; always `"astra_db"`. |

---

## Profiles

Two profiles are built in; `cloud` is the default.

| Profile | Description                                                                                      |
| ------- | ------------------------------------------------------------------------------------------------ |
| `cloud` | Astra DB cloud server, requires `api_endpoint` and `application_token`                           |
| `local` | Local test server at `http://localhost:8080` with token `test-token` and collection `ROCKETRIDE` |

---

## Search modes

Both modes are available without extra configuration. The pipeline's question type determines which one runs:

- **Semantic**: vector similarity search using the `$vector` sort on the question's embedding. Results carry similarity scores; anything scoring below 0.20 is discarded before results are returned.
- **Keyword**: native BM25 lexical search using the `$lexical` sort on the question's text.

Both modes honour the standard document filter: node id, parent path, permissions, object ids, table ids, chunk-id ranges, and the soft-delete flag.
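
To make the difference between the two modes concrete, the sketch below builds the sort clause for each and applies the 0.20 cutoff to semantic results. The function names are hypothetical, and the `$similarity` result key is an assumption based on how the Data API reports scores when similarity is requested; the node may structure this differently.

```python
from typing import Any, Optional, Sequence

MIN_SIMILARITY = 0.20  # semantic results scoring below this are discarded


def build_sort(text: str, embedding: Optional[Sequence[float]], semantic: bool) -> dict[str, Any]:
    """Return the sort clause for a semantic or keyword question."""
    if semantic:
        if embedding is None:
            raise ValueError("semantic search needs an embedded question")
        return {"$vector": list(embedding)}
    return {"$lexical": text}


def drop_low_scores(results: list[dict[str, Any]], threshold: float = MIN_SIMILARITY) -> list[dict[str, Any]]:
    """Keep only results whose similarity is at or above the threshold."""
    return [doc for doc in results if doc.get("$similarity", 0.0) >= threshold]


semantic_sort = build_sort("what is astra?", [0.1, 0.2], semantic=True)
keyword_sort = build_sort("what is astra?", None, semantic=False)
```

Only the semantic path applies the score cutoff; keyword results are returned as ranked by BM25.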

---

## Document lifecycle

- **Ingest**: each chunk is inserted with a generated UUID `_id`, the embedding stored as `$vector`, the text stored as `content`, and all metadata stored under `meta`. Re-ingesting an object (chunk `0` received again) first deletes that object's previous chunks, then inserts the new batch.
- **Soft delete / restore**: `markDeleted` / `markActive` flip `meta.isDeleted` on all chunks for the given object ids. Soft-deleted chunks are hidden from default queries.
- **Hard delete**: `remove` permanently deletes all chunks matching the given object ids.
- **Render**: rebuilds a full document by fetching all non-deleted chunks for an object id, sorting them by `chunkId` in application code (Astra DB does not guarantee order), and concatenating the content in order.
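
The lifecycle steps above can be sketched in plain Python. The stored document shape (`_id`, `$vector`, `content`, `meta`) follows the description on this page; the helper names, the `meta.objectId` and `meta.chunkId` field paths, and the empty-string join are assumptions made for illustration.

```python
import uuid
from typing import Any, Sequence


def to_astra_document(text: str, embedding: Sequence[float], meta: dict[str, Any]) -> dict[str, Any]:
    """Shape one chunk the way the Ingest step describes."""
    return {
        "_id": str(uuid.uuid4()),      # generated UUID per chunk
        "$vector": list(embedding),    # the embedding
        "content": text,               # the chunk text
        "meta": dict(meta),            # all metadata nested under `meta`
    }


def soft_delete_update(object_ids: Sequence[str], deleted: bool) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return a (filter, update) pair that flips `meta.isDeleted` for the given objects."""
    # `meta.objectId` is an assumed field path.
    return (
        {"meta.objectId": {"$in": list(object_ids)}},
        {"$set": {"meta.isDeleted": deleted}},
    )


def render(chunks: list[dict[str, Any]]) -> str:
    """Rebuild a document: drop deleted chunks, sort by chunk id, concatenate."""
    live = [c for c in chunks if not c["meta"].get("isDeleted", False)]
    live.sort(key=lambda c: c["meta"]["chunkId"])  # the database does not guarantee order
    return "".join(c["content"] for c in live)     # separator is an assumption
```

Sorting in application code keeps the rendered text stable regardless of the order in which the database returns chunks.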

All read operations return empty results when the collection does not yet exist; the collection is not created until the first ingest.

---

## Authentication

Set `api_endpoint` to the database's Data API URL and `application_token` to the token generated in the Astra DB console. The token is passed directly to `DataAPIClient` at pipeline startup. No other authentication modes are supported.
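
For checking credentials outside a pipeline, a minimal sketch along these lines may help. The environment variable names and both helper functions are hypothetical; the `DataAPIClient(token)` and `get_database(endpoint)` calls assume a recent astrapy release, and `connect` imports astrapy lazily so the validation helper can be used without it installed.

```python
import os
from typing import Any, Mapping


def load_credentials(env: Mapping[str, str]) -> tuple[str, str]:
    """Read the endpoint and token from a mapping such as os.environ."""
    # Variable names are assumptions for this example, not node settings.
    endpoint = env.get("ASTRA_DB_API_ENDPOINT", "").strip()
    token = env.get("ASTRA_DB_APPLICATION_TOKEN", "").strip()
    if not endpoint or not token:
        raise ValueError("both the API endpoint and the application token are required")
    return endpoint, token


def connect(endpoint: str, token: str) -> Any:
    """Open a database handle using token authentication."""
    from astrapy import DataAPIClient  # imported lazily; requires `pip install astrapy`

    client = DataAPIClient(token)
    return client.get_database(endpoint)


if __name__ == "__main__":
    database = connect(*load_credentials(os.environ))
    print(database.list_collection_names())
```

If the listing call succeeds, the same two values should work in the node's `api_endpoint` and `application_token` fields.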

---

<!-- ROCKETRIDE:GENERATED:PARAMS START -->
<!-- Generated by nodes:docs-generate. Do not edit by hand. -->

## Schema

| Field | Type | Description | Default |
|---|---|---|---|
| `astra_db.api_endpoint` | `string` | **API Endpoint**<br/>Enter the server API endpoint e.g. <instance-name>.<region>.apps.astra.datastax.com |  |
| `astra_db.application_token` | `string` | **Application Token**<br/>Enter the server API application token |  |
| `astra_db.provider` | `string` |  | const: `"astra_db"` |

## Dependencies

- `astrapy`

## Source

[<svg viewBox="0 0 16 16" width="15" height="15" fill="currentColor" aria-hidden="true" style="vertical-align:-0.15em;margin-right:0.35em"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"/></svg> View source](https://github.com/rocketride-org/rocketride-server/tree/develop/nodes/src/nodes/astra_db)
<!-- ROCKETRIDE:GENERATED:PARAMS END -->
