Video
A pipeline filter node that extracts frames from a video stream and encodes each frame as a vector embedding using a Hugging Face vision model.
What it does
Receives a video stream on its video input lane, buffers the full video in memory, then decodes it with OpenCV (cv2.VideoCapture) via a temporary file. Frames are sampled at a configurable interval and each is passed through a vision embedding model (CLIP by default, loaded via PyTorch and transformers) to produce a fixed-length numeric vector. One document is emitted per frame; all documents for a video are flushed as a single batch when processing completes.
The embedding engine is shared with the embedding_image node and a thread lock serializes GPU access, so concurrent video streams do not race for the model. The node is flagged experimental and declares a gpu capability.
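
The serialization described above can be pictured with a plain `threading.Lock`. This is only a minimal sketch, not the node's actual code: `SharedEngine`, its `embed` method, and the placeholder vector are hypothetical stand-ins for the shared embedding engine and the real model call.

```python
import threading


class SharedEngine:
    """Hypothetical stand-in for an embedding engine shared across nodes."""

    def __init__(self):
        self._lock = threading.Lock()  # one lock guards all model access
        self.active = 0                # callers currently inside the model
        self.max_active = 0            # highest concurrency ever observed

    def embed(self, frame_id):
        # Only one thread at a time may run the (simulated) model call.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            vector = [float(frame_id)] * 4  # placeholder for a real embedding
            self.active -= 1
            return vector


engine = SharedEngine()
results = {}


def worker(stream_id):
    # Each thread plays the role of one concurrent video stream.
    results[stream_id] = [engine.embed(i) for i in range(100)]


threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call passes through the same lock, `engine.max_active` never rises above 1, however many streams run at once.
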
Three safety limits apply out of the box:
- Videos larger than `maxVideoSizeMB` (default 500 MB) are rejected with a warning and produce no output.
- Frame extraction is capped at `max_frames` (default 50) per video; set to `0` for unlimited.
- If the container does not report a frame rate, a fallback of 30 fps is used when computing timestamps.
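
The three limits can be expressed as small checks. The sketch below is illustrative only: the function names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes and that a missing frame rate shows up as `None`, zero, or a negative value, which may not match the node's own logic.

```python
FALLBACK_FPS = 30.0  # used when the container reports no frame rate


def exceeds_size_limit(size_bytes, max_video_size_mb=500):
    # Assumption: 1 MB = 1024 * 1024 bytes; the node may count differently.
    return size_bytes > max_video_size_mb * 1024 * 1024


def effective_fps(reported_fps):
    # Treat None, zero, or negative values as "frame rate not reported".
    if reported_fps and reported_fps > 0:
        return reported_fps
    return FALLBACK_FPS


def frame_timestamp(frame_number, reported_fps):
    # Seconds from the start of the video for a given frame index.
    return frame_number / effective_fps(reported_fps)


def frame_cap_reached(frames_emitted, max_frames=50):
    # max_frames == 0 disables the cap.
    return max_frames > 0 and frames_emitted >= max_frames
```
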
Supported containers: MP4 (video/mp4), AVI (video/x-msvideo), QuickTime (video/quicktime), WebM (video/webm). An unrecognized MIME type is still processed, treated as MP4.
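
Since decoding goes through a temporary file, one plausible use of the MIME type is choosing that file's suffix. The mapping below is a sketch under that assumption; the suffix strings and the `temp_file_suffix` helper are hypothetical, and only the MIME types and the MP4 fallback come from the list above.

```python
# MIME types are from the supported-container list; suffixes are assumed.
MIME_TO_SUFFIX = {
    "video/mp4": ".mp4",
    "video/x-msvideo": ".avi",
    "video/quicktime": ".mov",
    "video/webm": ".webm",
}


def temp_file_suffix(mime_type):
    # Drop any parameters (e.g. "; codecs=...") and normalize case.
    key = (mime_type or "").split(";")[0].strip().lower()
    # Unrecognized or missing MIME types are treated as MP4.
    return MIME_TO_SUFFIX.get(key, ".mp4")
```
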
Configuration
Lanes
| Lane in | Lane out | Description |
|---|---|---|
| video | documents | One document per extracted frame, with an embedding |
Each output document contains:
- `type`: `Image`, with `page_content` holding the frame as a base64-encoded PNG
- `embedding`: the frame's embedding vector (list of floats), plus `embedding_model` (the model identifier string)
- metadata: `time_stamp` (seconds from the start of the video), `frame_number` (frame index in the source), and a per-video `chunkId` counter
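
To make the shape concrete, here is a sketch of one such document as a plain Python dictionary. Only the field names come from the list above; the nesting under `metadata`, the `build_frame_document` helper, and the sample values are assumptions for illustration.

```python
import base64


def build_frame_document(png_bytes, embedding, model_id,
                         time_stamp, frame_number, chunk_id):
    # Layout is illustrative; the node's real document class may differ.
    return {
        "type": "Image",
        "page_content": base64.b64encode(png_bytes).decode("ascii"),
        "embedding": [float(x) for x in embedding],
        "embedding_model": model_id,
        "metadata": {
            "time_stamp": time_stamp,      # seconds from the start of the video
            "frame_number": frame_number,  # frame index in the source
            "chunkId": chunk_id,           # per-video counter
        },
    }


doc = build_frame_document(
    png_bytes=b"\x89PNG\r\n\x1a\n",  # PNG signature only, not a full image
    embedding=[0.1, 0.2, 0.3],       # real vectors are much longer
    model_id="openai/clip-vit-base-patch16",
    time_stamp=5.0,
    frame_number=150,
    chunk_id=0,
)
```
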
Fields
| Field | Type | Description |
|---|---|---|
| `model` | string | Hugging Face model to use for frame embedding |
| `profile` | string | Default "openai-patch16". Embedding model for video frames |
| `interval` | number | Default 5. Time in seconds between extracted frames |
| `max_frames` | number | Default 50. Limit the total number of frames extracted from the video. Set to 0 for unlimited. |
| `start_time` | number | Default 0. Start time in seconds for frame extraction (0 = beginning of the video) |
| `duration` | number | Default 0. Duration in seconds of the extraction window (0 = until the end of the video) |
| `maxVideoSizeMB` | number | Default 500. Maximum allowed video file size in megabytes. Videos exceeding this limit will be rejected. |
The extraction window runs from `start_time` to `start_time + duration`, clamped to the actual video length. A `duration` of 0 extends the window to the end of the video.
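
The interaction between the window, the sampling interval, and the frame cap can be sketched as a pure function that returns the frame indices to extract. This is one possible reading, not the node's implementation: `frame_indices` is a hypothetical helper, and it assumes time-based sampling, an exclusive window end, and rounding to the nearest frame.

```python
def frame_indices(total_frames, fps, interval=5, max_frames=50,
                  start_time=0, duration=0):
    if interval <= 0:
        raise ValueError("interval must be positive")
    if fps <= 0:
        fps = 30.0  # same fallback the node uses for timestamps

    video_length = total_frames / fps
    # Clamp the window to the actual video length.
    start = min(max(start_time, 0), video_length)
    if duration <= 0:
        end = video_length  # 0 means "until the end of the video"
    else:
        end = min(start + duration, video_length)

    indices = []
    k = 0
    while True:
        t = start + k * interval  # multiply rather than accumulate float error
        index = int(round(t * fps))
        if t >= end or index >= total_frames:
            break
        if max_frames > 0 and len(indices) >= max_frames:
            break  # frame cap reached (0 disables the cap)
        indices.append(index)
        k += 1
    return indices
```

With the defaults, a 30-second clip at 30 fps (900 frames) yields one frame every 5 seconds: indices 0, 150, 300, 450, 600, and 750.
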
Profiles
| Profile ID | Model | Notes |
|---|---|---|
| `openai-patch16` (default) | `openai/clip-vit-base-patch16` | Good performance, lower memory |
| `openai-patch32` | `openai/clip-vit-base-patch32` | Lower performance, better recognition |
| `google16x224` | `google/vit-base-patch16-224` | Fast, accurate, general-purpose |
| `custom` | user-specified via `model` | Any Hugging Face vision model |
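
Resolving a profile to a model identifier amounts to a table lookup with one special case. The sketch below assumes that behavior; `resolve_model`, the error handling, and the `my-org/my-vision-model` placeholder are hypothetical and not taken from the node's source.

```python
# Profile IDs and model identifiers are copied from the table above.
PROFILES = {
    "openai-patch16": "openai/clip-vit-base-patch16",
    "openai-patch32": "openai/clip-vit-base-patch32",
    "google16x224": "google/vit-base-patch16-224",
}


def resolve_model(profile="openai-patch16", model=None):
    if profile == "custom":
        # The custom profile defers to the user-supplied model field.
        if not model:
            raise ValueError("profile 'custom' requires a model identifier")
        return model
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return PROFILES[profile]


default_model = resolve_model()
custom_model = resolve_model("custom", "my-org/my-vision-model")  # placeholder
```
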
Schema
| Field | Type | Description | Default |
|---|---|---|---|
| `embedding.duration` | number | Duration (in seconds) for frame extraction (0=end of video) | 0 |
| `embedding.interval` | number | Interval (in seconds) between extracted frames | 5 |
| `embedding.maxVideoSizeMB` | number | Maximum video file size (MB). Videos exceeding this limit will be rejected. | 500 |
| `embedding.max_frames` | number | Maximum number of frames to extract from the video (0=unlimited) | 50 |
| `embedding.model` | string | Model name. Hugging Face model to use for frame embedding | |
| `embedding.profile` | string | Model. Embedding model for video frames | "openai-patch16" |
| `embedding.start_time` | number | Start time (in seconds) for frame extraction (0=beginning) | 0 |
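
As a rough illustration of how these defaults combine with user settings, the sketch below merges overrides onto the schema defaults. It assumes the `embedding.` prefix corresponds to a nested `embedding` object and represents the blank `model` default as `None`; `resolve_embedding_config` is a hypothetical helper, not part of the node.

```python
# Defaults copied from the schema table; model has no default there.
DEFAULTS = {
    "profile": "openai-patch16",
    "model": None,
    "interval": 5,
    "max_frames": 50,
    "start_time": 0,
    "duration": 0,
    "maxVideoSizeMB": 500,
}


def resolve_embedding_config(user_config):
    overrides = user_config.get("embedding", {})
    # Reject misspelled keys instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown embedding fields: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Sample one frame every 2 seconds and lift the frame cap.
config = resolve_embedding_config({"embedding": {"interval": 2, "max_frames": 0}})
```
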
Dependencies
- transformers
- accelerate