Transcribe

A RocketRide audio filter node that transcribes spoken audio or video to text using OpenAI Whisper.

What it does

Receives an audio or video stream, extracts the audio track as 16 kHz mono PCM, buffers it in 60-second chunks (forced flush at 120 seconds), and runs Whisper with built-in voice activity detection (VAD). Segments are merged until they end in terminal punctuation (., ?, !), so output arrives as whole sentences, each carrying the timestamp of its first segment.
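The sentence-merging step described above can be sketched roughly as follows. This is an illustration, not the node's actual code: the `Segment` type and `merge_into_sentences` function are hypothetical names, and flushing leftover text at the end of the stream is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

TERMINAL_PUNCTUATION = (".", "?", "!")


@dataclass
class Segment:
    start: float  # seconds from stream start
    text: str


def merge_into_sentences(segments: Iterable[Segment]) -> Iterator[Segment]:
    """Join consecutive segments until one ends in terminal punctuation."""
    pending: List[str] = []
    pending_start: Optional[float] = None
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue
        if pending_start is None:
            # The merged sentence keeps the timestamp of its first segment.
            pending_start = seg.start
        pending.append(text)
        if text.endswith(TERMINAL_PUNCTUATION):
            yield Segment(pending_start, " ".join(pending))
            pending, pending_start = [], None
    # Assumption: leftover text without terminal punctuation is emitted at end of stream.
    if pending and pending_start is not None:
        yield Segment(pending_start, " ".join(pending))
```

For example, the segments "Hello" (0.0 s), "world." (1.2 s), and "How are you?" (2.5 s) would come out as two sentences, stamped 0.0 and 2.5.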

Uses ai.common.models.Whisper: transcription routes to the model server when the engine is started with --modelserver, otherwise it runs locally via faster-whisper. No API key is required either way. Whisper is invoked with beam_size=5 and vad_filter=True, and transcription calls are serialized through a global lock so a single loaded model is shared safely across instances.
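To illustrate why a global lock makes a shared model safe, here is a minimal sketch of serialized calls. `FakeWhisper`, `transcribe_serialized`, and `run_concurrently` are hypothetical stand-ins, and the real `ai.common.models.Whisper` interface may differ. Only the `beam_size=5` and `vad_filter=True` options come from the description above.

```python
import threading
import time

# One lock per process, shared by every node instance (assumed module-level).
_TRANSCRIBE_LOCK = threading.Lock()


class FakeWhisper:
    """Hypothetical stand-in for a loaded model; tracks overlapping calls."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0

    def transcribe(self, audio, beam_size: int, vad_filter: bool) -> str:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # simulate inference time
        self.active -= 1
        return f"transcribed {len(audio)} samples"


def transcribe_serialized(model, audio) -> str:
    """Run one transcription at a time, using the options named above."""
    with _TRANSCRIBE_LOCK:
        return model.transcribe(audio, beam_size=5, vad_filter=True)


def run_concurrently(model, n_threads: int = 8) -> int:
    """Call the shared model from several threads; return peak concurrency."""
    threads = [
        threading.Thread(target=transcribe_serialized, args=(model, [0.0] * 16))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model.max_active
```

Because the lock is held for the whole call, `run_concurrently` should never see more than one caller inside the model at once. The trade-off is that instances queue behind each other instead of running in parallel.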

Models are downloaded from HuggingFace on first use. GPU is used automatically when available (compute_type defaults to float16).

Configuration

Lanes

Input lane	Output lane	Behaviour
`audio`	`text`	Transcribed text, emitted one merged sentence at a time
`video`	`text`	Audio track is extracted from the video and transcribed

When a documents listener is attached, the node also emits one document per merged sentence with chunkId (sequential per stream, reset on each new stream) and time_stamp (seconds from stream start) in the document metadata.
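The per-stream document metadata might be produced along these lines. This is a sketch under assumptions: the `StreamDocumentBuilder` class and the dictionary layout are hypothetical, and starting `chunkId` at 0 is a guess. Only the `chunkId` and `time_stamp` keys come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class StreamDocumentBuilder:
    """Hypothetical helper that numbers documents within one stream."""

    next_chunk_id: int = 0  # assumption: numbering starts at 0

    def begin_stream(self) -> None:
        # chunkId is sequential per stream, so it resets on each new stream.
        self.next_chunk_id = 0

    def build(self, text: str, time_stamp: float) -> Dict[str, Any]:
        doc = {
            "text": text,
            "metadata": {
                "chunkId": self.next_chunk_id,
                "time_stamp": time_stamp,  # seconds from stream start
            },
        }
        self.next_chunk_id += 1
        return doc
```

A consumer can then order sentences by `chunkId` within a stream and use `time_stamp` to seek back to the matching position in the source audio.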

Fields

Field	Type	Description
`model`	string	Default "base". The Whisper model to use for transcription.
`silence_threshold`	number	Default 0.25. Threshold, in seconds, used to detect silence in speech.
`min_seconds`	number	Default 240. Minimum seconds of audio to buffer before processing a batch and looking for silence.
`max_seconds`	number	Default 300. Maximum seconds of audio to buffer before processing.
`vad_level`	number	Default 1. VAD level used for silence detection (0-3); see VAD levels.
`profile`	string	Default "default". Preset to apply; see Profiles.
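One plausible reading of how `min_seconds` and `max_seconds` interact is sketched below. The rule is an assumption drawn from the field descriptions, not confirmed behaviour: once `min_seconds` of audio is buffered, the batch is cut at the next detected silence, and at `max_seconds` it is flushed regardless. `BufferPolicy` and `should_flush` are hypothetical names.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # the node works on 16 kHz mono PCM


@dataclass
class BufferPolicy:
    min_seconds: float = 240
    max_seconds: float = 300

    def should_flush(self, buffered_samples: int, at_silence: bool) -> bool:
        """Decide whether the buffered audio should be sent to Whisper now."""
        seconds = buffered_samples / SAMPLE_RATE
        if seconds >= self.max_seconds:
            return True  # forced flush, even in the middle of speech
        # Assumed rule: past the minimum, wait for a pause so words are not split.
        return seconds >= self.min_seconds and at_silence
```

Under this reading, a smaller `min_seconds` lowers latency, while the gap between the two values controls how long the node waits for a clean pause before cutting.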

VAD levels

Level	Behaviour
`0`	Most permissive: detects the most audio as speech (risk: includes noise)
`1`	Slightly more aggressive: skips minor background noise (default)
`2`	Balanced: moderate filtering of non-speech
`3`	Most aggressive: filters aggressively, may cut off quiet or short speech

Models

Model	Notes
`tiny`	Fastest, least accurate
`base`	Fast, low accuracy (default)
`small`	Medium speed and accuracy
`medium`	Slower, high accuracy
`large-v3`	Slowest, highest accuracy

Profiles

The node ships one profile per model size (tiny, base, small, medium, large-v3) plus default, which is an alias for base. Every profile uses the same defaults: language: en, silence_threshold: 0.25, min_seconds: 240, max_seconds: 300, vad_level: 1. Only the model differs between profiles.
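The profile scheme can be pictured as a small lookup table. In this sketch the default values are taken from the paragraph above, while `resolve_config` is hypothetical, as is the rule that explicitly set fields override the profile's values.

```python
from typing import Any, Dict

BASE_DEFAULTS: Dict[str, Any] = {
    "language": "en",
    "silence_threshold": 0.25,
    "min_seconds": 240,
    "max_seconds": 300,
    "vad_level": 1,
}
MODELS = ("tiny", "base", "small", "medium", "large-v3")

# One profile per model size; only the model differs between them.
PROFILES: Dict[str, Dict[str, Any]] = {
    name: {**BASE_DEFAULTS, "model": name} for name in MODELS
}
PROFILES["default"] = PROFILES["base"]  # "default" is an alias for "base"


def resolve_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a user config over its profile (assumed precedence: user wins)."""
    profile_name = config.get("profile", "default")
    if profile_name not in PROFILES:
        raise ValueError(f"unknown profile: {profile_name!r}")
    resolved = dict(PROFILES[profile_name])
    resolved.update({k: v for k, v in config.items() if k != "profile"})
    if not 0 <= resolved["vad_level"] <= 3:
        raise ValueError("vad_level must be between 0 and 3")
    return resolved
```

For instance, `resolve_config({"profile": "small", "language": "de"})` would select the `small` model, switch the language to German, and keep the remaining defaults.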

Language

Defaults to English (en). Change the language config value to transcribe other languages. Any language supported by Whisper is accepted.

Schema

Field	Type	Description	Default
`transcribe.max_seconds`	`number`	Maximum Seconds: maximum seconds of audio to buffer before processing	`300`
`transcribe.min_seconds`	`number`	Minimum Seconds: minimum seconds of audio to buffer before processing a batch and looking for silence	`240`
`transcribe.model`	`string`	Model: the Whisper model to use for transcription	`"base"`
`transcribe.profile`	`string`		`"default"`
`transcribe.silence_threshold`	`number`	Silence Threshold: threshold, in seconds, used to detect silence in speech	`0.25`
`transcribe.vad_level`	`number`	VAD Level: the VAD level used for silence detection (0-3)	`1`

Dependencies

faster-whisper
ctranslate2
av
tokenizers
huggingface-hub
tqdm
onnxruntime-gpu ==1.20.1; platform_system != 'Darwin'
onnxruntime ==1.20.1; platform_system == 'Darwin'