Video

A pipeline filter node that extracts frames from a video stream and encodes each frame as a vector embedding using a Hugging Face vision model.

What it does

Receives a video stream on its video input lane, buffers the full video in memory, then decodes it with OpenCV (cv2.VideoCapture) via a temporary file. Frames are sampled at a configurable interval and each is passed through a vision embedding model (CLIP by default, loaded via PyTorch and transformers) to produce a fixed-length numeric vector. One document is emitted per frame; all documents for a video are flushed as a single batch when processing completes.
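The buffer-then-decode step described above can be sketched with the standard library alone. This is a minimal illustration, not the node's actual code: `buffered_video_file` is a hypothetical helper name, and the decoder call is only mentioned in a comment so the sketch runs without OpenCV installed.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterable, Iterator


@contextmanager
def buffered_video_file(chunks: Iterable[bytes], suffix: str = ".mp4") -> Iterator[str]:
    """Collect streamed chunks in memory, then expose them as a temporary file path."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)  # the full video is held in memory first

    # delete=False so the file can be reopened by path on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(buffer)
        tmp.close()
        # A path-based decoder such as cv2.VideoCapture(tmp.name) would open it here.
        yield tmp.name
    finally:
        tmp.close()  # safe to call twice
        os.unlink(tmp.name)  # remove the temporary file once decoding is done
```

The temporary file exists because a path-based decoder needs something it can open by name; it is removed as soon as the `with` block exits.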

The embedding engine is shared with the embedding_image node and a thread lock serializes GPU access, so concurrent video streams do not race for the model. The node is flagged experimental and declares a gpu capability.
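The locking pattern can be illustrated as follows. This is a sketch under assumptions: `SharedEmbedder` and `embed_fn` are hypothetical names, and the real engine's interface is not shown in this page.

```python
import threading
from typing import Callable, List


class SharedEmbedder:
    """Wraps an embedding callable so that only one thread uses it at a time."""

    def __init__(self, embed_fn: Callable[[bytes], List[float]]) -> None:
        self._embed_fn = embed_fn        # stand-in for the shared model call
        self._lock = threading.Lock()    # one lock per shared engine

    def embed(self, frame: bytes) -> List[float]:
        # Callers from different video streams queue up here instead of
        # entering the model concurrently.
        with self._lock:
            return self._embed_fn(frame)
```

With this shape, every stream holds a reference to the same wrapper, so the serialization applies no matter how many videos are processed at once.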

Three safety limits apply out of the box:

  • Videos larger than maxVideoSizeMB (default 500 MB) are rejected with a warning and produce no output.
  • Frame extraction is capped at max_frames (default 50) per video; set to 0 for unlimited.
  • If the container does not report a frame rate, a fallback of 30 fps is used when computing timestamps.
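The three limits above can be expressed as small checks. The helper names are hypothetical, and the sketch assumes that one megabyte means 2**20 bytes and that a missing frame rate shows up as `None`, zero, or a negative value; the node's own conventions may differ.

```python
from typing import Iterable, List, Optional

BYTES_PER_MB = 1024 * 1024   # assumption: MB is interpreted as 2**20 bytes
FALLBACK_FPS = 30.0          # used when the container reports no frame rate


def within_size_limit(size_bytes: int, max_video_size_mb: float = 500) -> bool:
    """True if the video may be processed; larger videos are rejected."""
    return size_bytes <= max_video_size_mb * BYTES_PER_MB


def effective_fps(reported_fps: Optional[float]) -> float:
    """Return the reported frame rate, or the fallback if it is unusable."""
    if reported_fps is None or not (reported_fps > 0):
        return FALLBACK_FPS
    return float(reported_fps)


def frame_timestamp(frame_number: int, reported_fps: Optional[float]) -> float:
    """Seconds from the start of the video for a given frame index."""
    return frame_number / effective_fps(reported_fps)


def cap_frames(frame_numbers: Iterable[int], max_frames: int = 50) -> List[int]:
    """Keep at most max_frames entries; 0 means unlimited."""
    frames = list(frame_numbers)
    return frames if max_frames == 0 else frames[:max_frames]
```

For example, with no reported frame rate, frame 150 is stamped at 150 / 30 = 5 seconds.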

Supported containers: MP4 (video/mp4), AVI (video/x-msvideo), QuickTime (video/quicktime), WebM (video/webm). An unrecognized MIME type is still processed, treated as MP4.
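The MIME handling can be pictured as a lookup with an MP4 fallback. The file suffixes and the helper name `suffix_for_mime` are assumptions made for this illustration; only the MIME types and the fallback behavior come from the description above.

```python
from typing import Optional

# MIME types listed above; the suffixes are illustrative assumptions.
MIME_TO_SUFFIX = {
    "video/mp4": ".mp4",
    "video/x-msvideo": ".avi",
    "video/quicktime": ".mov",
    "video/webm": ".webm",
}


def suffix_for_mime(mime_type: Optional[str]) -> str:
    """Pick a temp-file suffix; unrecognized types are treated as MP4."""
    key = (mime_type or "").split(";")[0].strip().lower()
    return MIME_TO_SUFFIX.get(key, ".mp4")
```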


Configuration

Lanes

| Lane in | Lane out | Description |
|---|---|---|
| video | documents | One document per extracted frame, with an embedding |

Each output document contains:

  • type: Image, with page_content holding the frame as a base64-encoded PNG
  • embedding: the frame's embedding vector (list of floats), plus embedding_model (the model identifier string)
  • metadata: time_stamp (seconds from the start of the video), frame_number (frame index in the source), and a per-video chunkId counter
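As an illustration of the shape listed above, the sketch below assembles one frame document as a plain dictionary. The exact nesting and the helper name `build_frame_document` are assumptions; the node's real document class is not shown in this page.

```python
import base64
from typing import Any, Dict, Sequence


def build_frame_document(
    png_bytes: bytes,
    embedding: Sequence[float],
    model_id: str,
    frame_number: int,
    time_stamp: float,
    chunk_id: int,
) -> Dict[str, Any]:
    """Assemble a frame document (field layout is an assumption)."""
    return {
        "type": "Image",
        # The frame is carried as a base64-encoded PNG string.
        "page_content": base64.b64encode(png_bytes).decode("ascii"),
        "embedding": [float(x) for x in embedding],
        "embedding_model": model_id,
        "metadata": {
            "time_stamp": time_stamp,      # seconds from the start of the video
            "frame_number": frame_number,  # frame index in the source
            "chunkId": chunk_id,           # per-video counter
        },
    }
```

A consumer can recover the image with `base64.b64decode(doc["page_content"])`.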

Fields

| Field | Type | Description |
|---|---|---|
| model | string | Hugging Face model to use for frame embedding |
| profile | string | Default "openai-patch16". Embedding model for video frames |
| interval | number | Default 5. Time in seconds between extracted frames |
| max_frames | number | Default 50. Limit the total number of frames extracted from the video. Set to 0 for unlimited. |
| start_time | number | Default 0. Start time in seconds for frame extraction (0 = beginning of the video) |
| duration | number | Default 0. Duration in seconds for frame extraction (0 = until the end of the video) |
| maxVideoSizeMB | number | Default 500. Maximum allowed video file size in megabytes. Videos exceeding this limit will be rejected. |

The extraction window runs from start_time to start_time + duration, clamped to the actual video length. A duration of 0 extends the window to the end of the video.
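The window arithmetic can be sketched as below. The function names are hypothetical, and treating the end of the window as exclusive is an assumption of this sketch rather than documented behavior.

```python
from typing import List, Tuple


def extraction_window(start_time: float, duration: float, video_length: float) -> Tuple[float, float]:
    """Return (start, end) in seconds, clamped to the video length."""
    start = min(max(start_time, 0.0), video_length)
    # A duration of 0 means "until the end of the video".
    end = video_length if duration <= 0 else min(start + duration, video_length)
    return start, end


def sample_times(start_time: float, duration: float, video_length: float,
                 interval: float = 5.0) -> List[float]:
    """Timestamps at which a frame would be taken inside the window."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    start, end = extraction_window(start_time, duration, video_length)
    times = []
    k = 0
    while start + k * interval < end:  # assumption: the window end is exclusive
        times.append(start + k * interval)
        k += 1
    return times
```

For a 60-second video with start_time 10, duration 20, and interval 5, this yields samples at 10, 15, 20, and 25 seconds.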


Profiles

| Profile ID | Model | Notes |
|---|---|---|
| openai-patch16 (default) | openai/clip-vit-base-patch16 | Good performance, lower memory |
| openai-patch32 | openai/clip-vit-base-patch32 | Lower performance, better recognition |
| google16x224 | google/vit-base-patch16-224 | Fast, accurate, general-purpose |
| custom | user-specified via model | Any Hugging Face vision model |
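Resolving a profile to a model identifier can be sketched as a dictionary lookup. The function name `resolve_model` and the error handling for unknown profiles or a missing custom model are assumptions; the page does not say how the node reacts in those cases.

```python
from typing import Optional

# Profile IDs and model identifiers as listed in the table above.
PROFILES = {
    "openai-patch16": "openai/clip-vit-base-patch16",
    "openai-patch32": "openai/clip-vit-base-patch32",
    "google16x224": "google/vit-base-patch16-224",
}
DEFAULT_PROFILE = "openai-patch16"


def resolve_model(profile: str = DEFAULT_PROFILE, model: Optional[str] = None) -> str:
    """Map a profile ID to a Hugging Face model identifier."""
    if profile == "custom":
        if not model:
            # Assumed behavior: a custom profile needs an explicit model name.
            raise ValueError("profile 'custom' requires the model field")
        return model
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return PROFILES[profile]
```

In this sketch the model field is only consulted for the custom profile, matching the "user-specified via model" note in the table.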

Schema

| Field | Type | Description | Default |
|---|---|---|---|
| embedding.duration | number | Duration (in seconds) for frame extraction (0=end of video) | 0 |
| embedding.interval | number | Interval (in seconds) between frames. Time in seconds between extracted frames | 5 |
| embedding.maxVideoSizeMB | number | Maximum video file size (MB). Maximum allowed video file size in megabytes. Videos exceeding this limit will be rejected. | 500 |
| embedding.max_frames | number | Maximum number of frames to extract (0=unlimited). Limit the total number of frames extracted from the video. Set to 0 for unlimited. | 50 |
| embedding.model | string | Model name. Hugging Face model to use for frame embedding | |
| embedding.profile | string | Model. Embedding model for video frames | "openai-patch16" |
| embedding.start_time | number | Start time (in seconds) for frame extraction (0=beginning) | 0 |

Dependencies

  • transformers
  • accelerate