Transcribe
A RocketRide audio filter node that transcribes spoken audio or video to text using OpenAI Whisper.
What it does
Receives an audio or video stream, extracts the audio track as 16 kHz mono PCM, buffers it in 60-second chunks (forced flush at 120 seconds), and runs Whisper with built-in voice activity detection (VAD). Segments are merged until they end in terminal punctuation (., ?, !), so output arrives as whole sentences, each carrying the timestamp of its first segment.
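The sentence-merging step described above can be sketched as follows. This is a minimal illustration, not the node's actual code: the `Segment` type and `merge_into_sentences` function are hypothetical names, and flushing leftover text at end of stream is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

TERMINAL_PUNCTUATION = (".", "?", "!")


@dataclass
class Segment:
    """Hypothetical stand-in for one Whisper segment."""
    start: float  # seconds from stream start
    text: str


def merge_into_sentences(segments: Iterable[Segment]) -> Iterator[Segment]:
    """Yield one Segment per sentence, stamped with its first segment's start."""
    parts: list[str] = []
    first_start = 0.0
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # ignore empty segments
        if not parts:
            first_start = seg.start  # remember where this sentence began
        parts.append(text)
        if text.endswith(TERMINAL_PUNCTUATION):
            yield Segment(first_start, " ".join(parts))
            parts = []
    # Assumption: text that never reaches terminal punctuation is
    # flushed as-is when the stream ends.
    if parts:
        yield Segment(first_start, " ".join(parts))
```

Given the segments "Hello there," (0.0 s), "how are you?" (1.5 s) and "Fine." (3.0 s), this yields two sentences: "Hello there, how are you?" at 0.0 s and "Fine." at 3.0 s.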
Uses ai.common.models.Whisper: transcription routes to the model server when the engine is started with --modelserver, otherwise it runs locally via faster-whisper. No API key is required either way. Whisper is invoked with beam_size=5 and vad_filter=True, and transcription calls are serialized through a global lock so a single loaded model is shared safely across instances.
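The locking behaviour can be illustrated with a small sketch. The names `_TRANSCRIBE_LOCK`, `FakeModel` and `transcribe_serialized` are hypothetical, and `FakeModel` is only a stand-in for a loaded model so the example runs without Whisper installed. Only the `beam_size=5` and `vad_filter=True` arguments come from the description above.

```python
import threading
import time

# Hypothetical module-level lock: one lock for every node instance.
_TRANSCRIBE_LOCK = threading.Lock()


class FakeModel:
    """Stand-in for a loaded model; records how many calls overlap."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0

    def transcribe(self, audio, beam_size: int, vad_filter: bool) -> str:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate inference time
        self.active -= 1
        return f"{len(audio)} samples, beam_size={beam_size}, vad_filter={vad_filter}"


def transcribe_serialized(model, audio) -> str:
    """Run one transcription at a time, whichever instance calls it."""
    with _TRANSCRIBE_LOCK:
        return model.transcribe(audio, beam_size=5, vad_filter=True)
```

Because every caller goes through the same lock, several node instances can share one loaded model without their calls overlapping. The trade-off is that concurrent streams wait for each other.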
Models are downloaded from HuggingFace on first use. GPU is used automatically when available (compute_type defaults to float16).
Configuration
Lanes
| Input lane | Output lane | Behaviour |
|---|---|---|
| audio | text | Transcribed sentences, one per segment |
| video | text | Audio track is extracted from the video and transcribed |
When a documents listener is attached, the node also emits one document per merged sentence with chunkId (sequential per stream, reset on each new stream) and time_stamp (seconds from stream start) in the document metadata.
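A rough sketch of how the per-stream numbering might work is shown below. The `DocumentEmitter` class, the dictionary layout, and starting `chunkId` at 0 are assumptions for illustration. Only the `chunkId` and `time_stamp` metadata keys come from the description above.

```python
from itertools import count


class DocumentEmitter:
    """Hypothetical helper that builds one document per merged sentence."""

    def __init__(self) -> None:
        self._ids = count()  # assumption: chunkId starts at 0

    def begin_stream(self) -> None:
        """Reset the chunkId sequence when a new stream starts."""
        self._ids = count()

    def make_document(self, text: str, time_stamp: float) -> dict:
        """Wrap a sentence with its sequential chunkId and start time."""
        return {
            "text": text,
            "metadata": {
                "chunkId": next(self._ids),
                "time_stamp": time_stamp,  # seconds from stream start
            },
        }
```

Calling `begin_stream()` before each new stream keeps `chunkId` values sequential within a stream and restarts them for the next one.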
Fields
| Field | Type | Description |
|---|---|---|
| model | string | Default "base". The Whisper model to use for transcription |
| silence_threshold | number | Default 0.25. Silence threshold, in seconds, used to detect silence in speech |
| min_seconds | number | Default 240. Minimum seconds of audio to buffer before processing a batch and looking for silence |
| max_seconds | number | Default 300. Maximum seconds of audio to buffer before processing |
| vad_level | number | Default 1. The VAD level to use for silence detection (0-3) |
| profile | string | Default "default". The profile to load (see Profiles) |
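The fields and defaults above can be modelled as a small validated structure. This is a sketch under stated assumptions: `TranscribeConfig` is a hypothetical class, and the checks that `silence_threshold` is non-negative and that `min_seconds` does not exceed `max_seconds` are assumptions, not documented behaviour. The 16 kHz sample rate comes from the node description.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # the node extracts audio as 16 kHz mono PCM


@dataclass(frozen=True)
class TranscribeConfig:
    """Hypothetical mirror of the node's fields, with documented defaults."""
    model: str = "base"
    silence_threshold: float = 0.25
    min_seconds: float = 240
    max_seconds: float = 300
    vad_level: int = 1
    profile: str = "default"

    def __post_init__(self) -> None:
        if self.vad_level not in (0, 1, 2, 3):
            raise ValueError("vad_level must be between 0 and 3")
        # Assumption: the node itself may not enforce the two checks below.
        if self.silence_threshold < 0:
            raise ValueError("silence_threshold must be non-negative")
        if not 0 < self.min_seconds <= self.max_seconds:
            raise ValueError("expected 0 < min_seconds <= max_seconds")

    def buffer_samples(self) -> tuple[int, int]:
        """Return (min, max) buffer sizes in samples at 16 kHz mono."""
        return (
            int(self.min_seconds * SAMPLE_RATE_HZ),
            int(self.max_seconds * SAMPLE_RATE_HZ),
        )
```

With the default values, `TranscribeConfig().buffer_samples()` returns `(3840000, 4800000)`, that is 240 s and 300 s of audio at 16,000 samples per second.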
VAD levels
| Level | Behaviour |
|---|---|
| 0 | Most permissive: detects the most audio as speech (risk: includes noise) |
| 1 | Slightly more aggressive: skips minor background noise (default) |
| 2 | Balanced: moderate filtering of non-speech |
| 3 | Most aggressive: filters aggressively, may cut off quiet or short speech |
Models
| Model | Notes |
|---|---|
| tiny | Fastest, least accurate |
| base | Fast, low accuracy (default) |
| small | Medium speed and accuracy |
| medium | Slower, high accuracy |
| large-v3 | Slowest, highest accuracy |
Profiles
The node ships one profile per model size (tiny, base, small, medium, large-v3) plus default, which is an alias for base. Every profile uses the same defaults: language: en, silence_threshold: 0.25, min_seconds: 240, max_seconds: 300, vad_level: 1. Only the model differs between profiles.
Language
Defaults to English (en). Change the language config value to transcribe other languages. Any language supported by Whisper is accepted.
Schema
| Field | Type | Description | Default |
|---|---|---|---|
| transcribe.max_seconds | number | Maximum Seconds: maximum seconds of audio to buffer before processing | 300 |
| transcribe.min_seconds | number | Minimum Seconds: minimum seconds of audio to buffer before processing a batch and looking for silence | 240 |
| transcribe.model | string | Model: the Whisper model to use for transcription | "base" |
| transcribe.profile | string | | "default" |
| transcribe.silence_threshold | number | Silence Threshold: silence threshold, in seconds, used to detect silence in speech | 0.25 |
| transcribe.vad_level | number | VAD Level: the VAD level to use for silence detection (0-3) | 1 |
Dependencies
- faster-whisper
- ctranslate2
- av
- tokenizers
- huggingface-hub
- tqdm
- onnxruntime