Skip to main content
View source

OpenAI Vision

View as Markdown

A RocketRide filter node that sends images to OpenAI's vision-capable models and returns the text analysis.

What it does

Accepts single image frames or streams of image documents and calls OpenAI's Chat Completions API with each image encoded as a base64 data URL (with detail: "auto"), returning the model's text response. Supports the GPT-4.1 and GPT-4o model families for use cases including image analysis, OCR, visual understanding, and scene description.

Uses the official openai Python SDK (>=2.38.0). If no analysis prompt is configured, the node defaults to "Describe this image in detail.".

Each API call runs in a daemon thread with a 30-second hard timeout and is retried once on retryable errors (rate limits, connection errors, timeouts, 5xx responses). Rate-limit retries honor the retry-after response header (default 60 s); other retries use exponential backoff starting at 1 s. A fresh HTTP client is created per attempt to avoid exhausting the connection pool from a prior timed-out attempt. API errors are translated to user-friendly messages covering authentication failure, rate limits, quota or billing issues, invalid input, model not found, timeout, and server unavailability.

When both lanes carry the same frame, the node makes only one API call per frame: the first lane to process the frame caches the answer, and the second lane reuses it. The cache is cleared at the start of each new frame.


Configuration

Lanes

Lane inLane outDescription
imagetextAnalyze a single image frame and emit the model's text response
documentsdocumentsAnalyze image documents and emit text analysis with original metadata preserved

On the documents lane, each incoming Image document is replaced by a Text document containing the model's answer; the original metadata (frame number, timestamp, chunk id) is carried over. The original Image documents do not flow downstream. Documents with a type other than Image or with empty content are skipped with a warning. Image document content is expected to be base64-encoded PNG: all Image document producers (frame_grabber, thumbnail, embedding_image) normalize to PNG.

If inference fails for a document after retries, the node logs a warning and continues with the next document. On the image lane, a failure logs a warning and emits nothing for that frame. Empty image frames on the image lane are also skipped with a warning.

Fields

FieldTypeDescription
apikeystringOpenAI API key. Get one at https://platform.openai.com/api-keys
modelstringOpenAI Vision model
modelTotalTokensnumberMaximum context length in tokens
systemPromptstringDefine the model's role and behavior for image analysis
promptstringDescribe what you want to analyze or extract from the image
profilestringDefault "openai-4-1". Select the OpenAI vision model to use

The selected profile supplies the model identifier and modelTotalTokens context limit. The API key, system prompt, and analysis prompt are configured per profile.


Profiles

ProfileModelContext (tokens)
openai-4-1 (default)gpt-4.11,047,576
openai-4-1-minigpt-4.1-mini1,047,576
openai-4-1-nanogpt-4.1-nano1,047,576
openai-4ogpt-4o128,000
openai-4o-minigpt-4o-mini128,000

Authentication

Provide an OpenAI API key in image_vision_openai.apikey. The key is validated at pipeline start: it must be present and must begin with sk-, otherwise the node raises a configuration error before any image is processed.

Upstream references:


Schema

FieldTypeDescriptionDefault
image_vision_openai.apikeystringAPI Key
OpenAI API key. Get one at https://platform.openai.com/api-keys
image_vision_openai.profilestringVision Model
Select the OpenAI vision model to use
"openai-4-1"
modelstringModel
OpenAI Vision model
modelTotalTokensnumberTokens
Maximum context length in tokens
vision.promptstringAnalysis Prompt
Describe what you want to analyze or extract from the image
vision.systemPromptstringSystem Instructions
Define the model's role and behavior for image analysis

Dependencies

  • openai >=2.38.0