Weaviate

A RocketRide store node that persists embedded document chunks in a Weaviate instance and retrieves them by semantic or keyword search.

What it does

Stores pre-embedded documents in a Weaviate collection and answers searches against them. Supports both self-hosted Weaviate and Weaviate Cloud, selected via a profile.

Uses the official weaviate-client Python SDK (v4 API): connect_to_local for self-hosted instances and connect_to_weaviate_cloud for cloud clusters, with connection timeouts of 30 s (init), 60 s (query), and 120 s (insert).
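A minimal sketch of how the two connection paths could be wired up with the v4 client, assuming the timeouts listed above. The `connect` helper, its parameters, and the `TIMEOUTS` name are illustrative and not taken from the node's source.

```python
# Timeouts in seconds, as listed above.
TIMEOUTS = {"init": 30, "query": 60, "insert": 120}


def connect(mode, host, port=8080, grpc_port=50051, credentials=None):
    """Open a Weaviate client for the given mode ("local" or "cloud").

    Hypothetical helper: the node's real connection code may differ.
    """
    if mode not in ("local", "cloud"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Imported lazily so this module loads even without the SDK installed.
    import weaviate
    from weaviate.classes.init import AdditionalConfig, Timeout

    config = AdditionalConfig(timeout=Timeout(**TIMEOUTS))
    if mode == "cloud":
        return weaviate.connect_to_weaviate_cloud(
            cluster_url=host,
            auth_credentials=credentials,
            additional_config=config,
        )
    return weaviate.connect_to_local(
        host=host,
        port=port,
        grpc_port=grpc_port,
        auth_credentials=credentials,
        additional_config=config,
    )
```

The returned client should be closed with `client.close()` when the caller is done with it.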

Key behavior to know:

  • Documents must arrive pre-embedded. Run them through an embedding node first: the collection is created with Vectorizer.none(), so Weaviate never embeds anything itself; the pipeline supplies all vectors. A document without an embedding raises an error on ingest.
  • The collection is created automatically on first write if it does not exist, with an HNSW vector index using the configured distance metric.
  • Re-ingesting is idempotent per document. Before inserting, all existing chunks with the same objectId are deleted, then the new chunks are written via Weaviate's dynamic batch API. If any batch objects fail, the node raises an error.
  • Deletes are soft by default. Documents can be marked deleted (isDeleted: true) and later re-activated; soft-deleted chunks are excluded from every search and get operation unless the filter explicitly asks for deleted documents. Hard removal by objectId is also supported.
  • The configured host is normalized automatically: leading http:// / https:// and trailing slashes are stripped, and the API key is trimmed of whitespace.
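The host and API key normalization described in the last bullet can be approximated as follows. This is a sketch under the stated rules only; the function names are hypothetical, and stripping surrounding whitespace from the host is an added assumption.

```python
import re


def normalize_host(host: str) -> str:
    """Strip a leading http:// or https:// and any trailing slashes."""
    host = host.strip()  # assumption: surrounding whitespace is not meaningful
    host = re.sub(r"^https?://", "", host, flags=re.IGNORECASE)
    return host.rstrip("/")


def normalize_api_key(apikey) -> str:
    """Trim whitespace from the API key; a missing key becomes an empty string."""
    return (apikey or "").strip()
```

For example, `normalize_host("https://demo.weaviate.cloud/")` yields `demo.weaviate.cloud`, where `demo` is a placeholder instance name.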

Configuration

Lanes

| Lane in | Lane out | Description |
|---|---|---|
| documents | - | Ingest pre-embedded documents into the collection |
| questions | documents | Return matching documents |
| questions | answers | Return matching documents as an answer |
| questions | questions | Enrich the question with matching documents for downstream nodes |

The node can also render a stored object back to text: given an object id, it rehydrates all chunks in chunkId order (fetched in windows of renderChunkSize) and streams the joined text to the text lane.
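The windowed rendering described above might look roughly like this. It is a sketch only: `fetch_window` stands in for whatever query the node issues per window, and joining the pieces with an empty string is an assumption.

```python
from typing import Callable, Dict, Iterator, List


def render_chunks(
    chunk_count: int,
    fetch_window: Callable[[int, int], List[Dict]],
    window_size: int,
) -> Iterator[str]:
    """Yield chunk text in chunkId order, fetching one window at a time.

    fetch_window(start, stop) is a hypothetical callback that returns the
    chunks whose chunkId lies in [start, stop), in any order.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, chunk_count, window_size):
        stop = min(start + window_size, chunk_count)
        window = fetch_window(start, stop)
        # Sort inside the window so the output follows chunkId order.
        for chunk in sorted(window, key=lambda c: c["chunkId"]):
            yield chunk["content"]
```

A caller would stream the pieces as they arrive, or collect them with `"".join(render_chunks(...))`.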

Fields

| Field | Type / Default | Description |
|---|---|---|
| host | string | Weaviate server address. Cloud: <your-instance-name>.weaviate.cloud. Local default: localhost. Scheme and trailing slashes are stripped automatically. |
| port | int: 8080 local, 443 cloud | REST port |
| grpc_port | int: 50051 | gRPC port (local profile only) |
| apikey | string | API key. Required for cloud; optional for local (used only when non-empty) |
| score | number: 0.5 | Minimum retrieval similarity threshold |
| collection | string: ROCKETRIDE | Collection name: must start with an uppercase letter and contain only letters, numbers, and underscores |
| similarity | string: cosine | Distance metric: cosine · dot · l2-squared · hamming · manhattan. Any other value raises an error at startup |
| renderChunkSize | int: 33554432 | Number of chunk ids fetched per window when rendering a full document |
| mode | string (set by profile) | local or cloud: selects the connection method |
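As an illustration of how the similarity field could be validated and used when the collection is created on first write, here is a sketch against the v4 client. The helper names are hypothetical, and parameter names such as `vectorizer_config` may differ between client releases.

```python
# Maps the documented similarity values to VectorDistances member names.
SUPPORTED_SIMILARITIES = {
    "cosine": "COSINE",
    "dot": "DOT",
    "l2-squared": "L2_SQUARED",
    "hamming": "HAMMING",
    "manhattan": "MANHATTAN",
}


def distance_member_name(similarity: str) -> str:
    """Return the enum member name, or raise for an unsupported metric."""
    try:
        return SUPPORTED_SIMILARITIES[similarity]
    except KeyError:
        raise ValueError(f"unsupported similarity: {similarity!r}") from None


def ensure_collection(client, name="ROCKETRIDE", similarity="cosine"):
    """Return the collection, creating it with an HNSW index if missing."""
    member = distance_member_name(similarity)

    # Imported lazily so the validation above works without the SDK.
    from weaviate.classes.config import Configure, VectorDistances

    if client.collections.exists(name):
        return client.collections.get(name)
    return client.collections.create(
        name=name,
        # No server-side vectorizer: the pipeline supplies every vector.
        vectorizer_config=Configure.Vectorizer.none(),
        vector_index_config=Configure.VectorIndex.hnsw(
            distance_metric=VectorDistances[member],
        ),
    )
```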

Each ingested chunk is stored with these properties alongside its vector: content, objectId, nodeId, parent, permissionId, isDeleted, chunkId, isTable, tableId, vectorSize, modelName.
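The delete-then-insert ingest can be sketched with the v4 batch API as below. The shape of the incoming chunk dict (including its `embedding` key), the helper names, and deriving vectorSize from the embedding length are all assumptions made for illustration.

```python
PROPERTY_NAMES = (
    "content", "objectId", "nodeId", "parent", "permissionId", "isDeleted",
    "chunkId", "isTable", "tableId", "vectorSize", "modelName",
)


def build_chunk_properties(chunk: dict) -> dict:
    """Build the stored properties for one chunk; reject missing embeddings."""
    embedding = chunk.get("embedding")
    if embedding is None or len(embedding) == 0:
        raise ValueError("document chunk has no embedding")
    props = {name: chunk.get(name) for name in PROPERTY_NAMES}
    props["isDeleted"] = bool(chunk.get("isDeleted", False))
    props["vectorSize"] = len(embedding)  # assumption: the vector's length
    return props


def replace_document(collection, object_id: str, chunks: list) -> None:
    """Delete every chunk with this objectId, then batch-insert the new ones."""
    from weaviate.classes.query import Filter  # lazy: needs weaviate-client

    # Sketch choice: validate all chunks before deleting anything.
    rows = [(build_chunk_properties(c), c["embedding"]) for c in chunks]

    collection.data.delete_many(
        where=Filter.by_property("objectId").equal(object_id)
    )
    with collection.batch.dynamic() as batch:
        for props, vector in rows:
            batch.add_object(properties=props, vector=vector)

    failed = collection.batch.failed_objects
    if failed:
        raise RuntimeError(f"{len(failed)} batch objects failed to insert")
```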


Profiles

| Profile | Mode | Default host | Port |
|---|---|---|---|
| Weaviate cloud server | cloud | (your Weaviate Cloud endpoint) | 443 |
| Your own Weaviate server | local | localhost | 8080 |

The preconfig default profile is cloud. The cloud profile exposes host, API key, score, and collection; the local profile exposes host, port, gRPC port, score, and collection.


Search behavior

  • Semantic search runs a near_vector query with the question's embedding. The question must carry an embedding (bind an embedding node), and a non-zero result offset is not supported. When the requested limit is 10 or less, the node queries with a limit of 25.
  • Keyword search matches the question text against chunk content using a Like filter with the wildcard pattern *query*.
  • Both searches apply the document filter (node id, parent, permissions, object ids, chunk id ranges, table flags) and exclude soft-deleted chunks unless deleted documents are requested.
  • Scoring: with the cosine metric the returned distance is mapped to (distance + 1) / 2; for all other metrics a sigmoid 1 / (1 + exp(distance / -100)) is used. Results scoring below 0.20 are discarded outright, before the configured score threshold is applied.
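The limit and scoring rules above can be written out as follows. The formulas are transcribed literally from the text; the function names are hypothetical, and three points are assumptions: limits above 10 are passed through unchanged, the score threshold is inclusive, and only the soft-delete part of the document filter is shown.

```python
import math

HARD_FLOOR = 0.20  # results scoring below this are always discarded


def effective_limit(requested: int) -> int:
    """Limits of 10 or less are widened to 25 (larger ones: assumed as-is)."""
    return 25 if requested <= 10 else requested


def score_from_distance(distance: float, similarity: str) -> float:
    """Map a returned distance to a score, per the documented formulas."""
    if similarity == "cosine":
        return (distance + 1) / 2
    return 1 / (1 + math.exp(distance / -100))


def passes(score: float, threshold: float = 0.5) -> bool:
    """Apply the hard floor first, then the configured score threshold."""
    return score >= HARD_FLOOR and score >= threshold


def semantic_search(collection, embedding, similarity="cosine",
                    limit=10, threshold=0.5):
    """near_vector query that skips soft-deleted chunks, then scores hits."""
    from weaviate.classes.query import Filter, MetadataQuery  # lazy import

    response = collection.query.near_vector(
        near_vector=embedding,
        limit=effective_limit(limit),
        filters=Filter.by_property("isDeleted").equal(False),
        return_metadata=MetadataQuery(distance=True),
    )
    hits = []
    for obj in response.objects:
        score = score_from_distance(obj.metadata.distance, similarity)
        if passes(score, threshold):
            hits.append((score, obj.properties))
    return hits


def keyword_filter(query: str):
    """Like filter matching chunk content against the *query* pattern."""
    from weaviate.classes.query import Filter  # lazy import

    return Filter.by_property("content").like(f"*{query}*")
```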

Configuration validation

When the node config is saved, a fast probe validates it and surfaces problems as warnings:

  • The collection name is checked against the official Weaviate rule (^[A-Z][_0-9A-Za-z]*$): start with an uppercase letter; only letters, numbers, and underscores; no spaces or special characters.
  • Hosts of localhost / 127.* are treated as local, anything else as cloud.
  • Cloud: an HTTP GET to /v1/meta with the API key as a Bearer token (3 s timeout).
  • Local: the SDK lists collections over REST, then verifies the gRPC port is reachable (channel-ready check, falling back to a plain TCP connect if grpc is unavailable).

HTTP error responses are surfaced with their status code and the server's message/error body so misconfigurations are easy to diagnose.
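The pure checks in this probe can be sketched without touching the network. The function names are hypothetical, and the https scheme for the cloud probe URL is an assumption; the request itself is only described, not sent.

```python
import re

COLLECTION_NAME = re.compile(r"[A-Z][_0-9A-Za-z]*")


def is_valid_collection_name(name: str) -> bool:
    """True when the whole name matches ^[A-Z][_0-9A-Za-z]*$."""
    return COLLECTION_NAME.fullmatch(name) is not None


def classify_host(host: str) -> str:
    """localhost and 127.* are local; anything else is treated as cloud."""
    if host == "localhost" or host.startswith("127."):
        return "local"
    return "cloud"


def build_cloud_probe(host: str, apikey: str) -> dict:
    """Describe the GET /v1/meta probe (assumes https; nothing is sent)."""
    return {
        "method": "GET",
        "url": f"https://{host}/v1/meta",
        "headers": {"Authorization": f"Bearer {apikey}"},
        "timeout": 3.0,  # seconds
    }
```

The dict returned by `build_cloud_probe` could be handed to an HTTP client such as httpx or requests, both of which appear in the dependency list.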


Authentication

  • Cloud profile: set apikey to your Weaviate Cloud API key; it is passed as Auth.api_key credentials.
  • Local profile: anonymous by default. If apikey is set to a non-empty value, it is sent as API-key credentials to the local instance.
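One way to express these two rules in code is shown below. It is a sketch: the helper names are hypothetical, and raising an error for a missing cloud key is an assumption based on the key being required for cloud.

```python
def select_api_key(mode: str, apikey):
    """Return the trimmed key to use, or None for anonymous local access."""
    key = (apikey or "").strip()
    if mode == "cloud":
        if not key:
            raise ValueError("the cloud profile requires an API key")
        return key
    # Local: the key is used only when it is non-empty.
    return key or None


def to_credentials(key):
    """Wrap a key as Weaviate API-key credentials; None stays anonymous."""
    if key is None:
        return None
    from weaviate.classes.init import Auth  # lazy: needs weaviate-client

    return Auth.api_key(key)
```

The result of `to_credentials` is what would be passed as `auth_credentials` to the SDK's connect helpers.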


Schema

| Field | Type | Description | Default |
|---|---|---|---|
| vector.cloud.host | | Enter the server IP address e.g. .weaviate.cloud | |
| vector.cloud.port | | | 443 |
| vector.local.grpc_port | | | 50051 |
| vector.local.host | | | "localhost" |
| vector.local.port | | | 8080 |
| weaviate.profile | string | Type of Weaviate host. Connect to... | "local" |
| weaviate.provider | string | const: "weaviate" | |

Dependencies

  • authlib
  • grpcio
  • grpcio-health-checking
  • grpcio-tools
  • httpx
  • pydantic
  • requests
  • validators
  • weaviate-client
  • numpy