Transcribe

A RocketRide audio filter node that transcribes spoken audio or video to text using OpenAI Whisper.

What it does

Receives an audio or video stream, extracts the audio track as 16 kHz mono PCM, buffers it in 60-second chunks (forced flush at 120 seconds), and runs Whisper with built-in voice activity detection (VAD). Segments are merged until they end in terminal punctuation (., ?, !), so output arrives as whole sentences, each carrying the timestamp of its first segment.
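The sentence-merging step described above can be sketched as follows. This is a minimal illustration, not the node's actual code: the `Segment` and `Sentence` types and the `merge_segments` function are hypothetical names, and flushing any unterminated trailing text at the end of the stream is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Characters that close a sentence, as listed in the description above.
TERMINAL_PUNCTUATION = (".", "?", "!")


@dataclass
class Segment:
    """Hypothetical stand-in for one Whisper segment."""
    start: float  # seconds from stream start
    text: str


@dataclass
class Sentence:
    """Hypothetical merged output: text plus the first segment's timestamp."""
    time_stamp: float
    text: str


def merge_segments(segments: Iterable[Segment]) -> Iterator[Sentence]:
    """Merge consecutive segments until one ends in terminal punctuation."""
    parts = []
    first_start = 0.0
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # ignore empty segments
        if not parts:
            first_start = seg.start  # timestamp of the sentence's first segment
        parts.append(text)
        if text.endswith(TERMINAL_PUNCTUATION):
            yield Sentence(first_start, " ".join(parts))
            parts = []
    if parts:
        # Assumption: leftover text without terminal punctuation is still emitted.
        yield Sentence(first_start, " ".join(parts))
```

Feeding it segments such as `Segment(0.0, "Hello there,")` followed by `Segment(1.5, "how are you?")` yields a single sentence stamped with 0.0.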

The node uses ai.common.models.Whisper. Transcription routes to the model server when the engine is started with --modelserver; otherwise it runs locally via faster-whisper. No API key is required either way. Whisper is invoked with beam_size=5 and vad_filter=True, and transcription calls are serialized through a global lock so that a single loaded model is shared safely across instances.

Models are downloaded from HuggingFace on first use. GPU is used automatically when available (compute_type defaults to float16).


Configuration

Lanes

| Input lane | Output lane | Behaviour |
| --- | --- | --- |
| audio | text | Transcribed text, emitted one merged sentence at a time |
| video | text | Audio track is extracted from the video and transcribed |

When a documents listener is attached, the node also emits one document per merged sentence with chunkId (sequential per stream, reset on each new stream) and time_stamp (seconds from stream start) in the document metadata.
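The per-sentence document metadata could be modelled along these lines. This is only a sketch: `SentenceDocument` and `DocumentEmitter` are hypothetical names rather than RocketRide types, and starting chunkId at 0 is an assumption, since the text above only says the counter is sequential and resets per stream.

```python
from dataclasses import dataclass


@dataclass
class SentenceDocument:
    """Hypothetical document shape: sentence text plus metadata."""
    text: str
    metadata: dict


class DocumentEmitter:
    """Assigns a sequential chunkId per stream, reset on each new stream."""

    def __init__(self):
        self._next_chunk_id = 0

    def begin_stream(self):
        # A new stream restarts the counter (assumed to start at 0).
        self._next_chunk_id = 0

    def emit(self, text, time_stamp):
        """Build one document for a merged sentence."""
        doc = SentenceDocument(
            text=text,
            metadata={
                "chunkId": self._next_chunk_id,
                "time_stamp": time_stamp,  # seconds from stream start
            },
        )
        self._next_chunk_id += 1
        return doc
```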

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | "base" | Whisper model to use for transcription |
| silence_threshold | number | 0.25 | Threshold, in seconds, used to detect silence in speech |
| min_seconds | number | 240 | Minimum number of seconds of audio to process in a batch when looking for silence |
| max_seconds | number | 300 | Maximum number of seconds of audio to buffer before processing |
| vad_level | number | 1 | VAD level to use for silence detection (0-3) |
| profile | string | "default" | |
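One plausible reading of how min_seconds and max_seconds interact is sketched below: nothing is flushed before min_seconds, a flush happens at the next silence after that, and a flush is forced at max_seconds. This is an interpretation, not the node's confirmed logic. `ChunkBuffer` is a hypothetical class, and the `ends_in_silence` flag stands in for whatever silence detection the node really performs.

```python
from typing import List, Optional


class ChunkBuffer:
    """Hypothetical buffer illustrating min/max flush thresholds."""

    def __init__(self, sample_rate=16000, min_seconds=240.0, max_seconds=300.0):
        if min_seconds > max_seconds:
            raise ValueError("min_seconds must not exceed max_seconds")
        self.min_samples = int(min_seconds * sample_rate)
        self.max_samples = int(max_seconds * sample_rate)
        self._samples = []

    def push(self, samples: List[float], ends_in_silence: bool) -> Optional[List[float]]:
        """Add samples; return a chunk to transcribe, or None to keep buffering."""
        self._samples.extend(samples)
        buffered = len(self._samples)
        forced = buffered >= self.max_samples  # hard limit reached
        natural = buffered >= self.min_samples and ends_in_silence
        if forced or natural:
            chunk, self._samples = self._samples, []
            return chunk
        return None
```

Cutting at a silence rather than at a fixed length avoids splitting a word across two transcription batches, which is presumably why a minimum and a maximum are configured separately.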

VAD levels

| Level | Behaviour |
| --- | --- |
| 0 | Most permissive: detects the most audio as speech (risk: includes noise) |
| 1 | Slightly more aggressive: skips minor background noise (default) |
| 2 | Balanced: moderate filtering of non-speech |
| 3 | Most aggressive: filters aggressively, may cut off quiet or short speech |

Models

| Model | Notes |
| --- | --- |
| tiny | Fastest, least accurate |
| base | Fast, low accuracy (default) |
| small | Medium speed and accuracy |
| medium | Slower, high accuracy |
| large-v3 | Slowest, highest accuracy |

Profiles

The node ships one profile per model size (tiny, base, small, medium, large-v3) plus default, which is an alias for base. Every profile uses the same defaults: language: en, silence_threshold: 0.25, min_seconds: 240, max_seconds: 300, vad_level: 1. Only the model differs between profiles.
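The profile rules above amount to a small lookup, sketched here for illustration. `resolve_profile`, `MODELS`, and `SHARED_DEFAULTS` are hypothetical names, and the `overrides` argument is an assumption about how explicit field values might be layered on top of a profile; the node's real configuration loader is not shown in this document.

```python
MODELS = ("tiny", "base", "small", "medium", "large-v3")

# Defaults shared by every profile, as listed above.
SHARED_DEFAULTS = {
    "language": "en",
    "silence_threshold": 0.25,
    "min_seconds": 240,
    "max_seconds": 300,
    "vad_level": 1,
}


def resolve_profile(name="default", overrides=None):
    """Return the effective settings for a profile name."""
    model = "base" if name == "default" else name  # "default" aliases "base"
    if model not in MODELS:
        raise ValueError(f"unknown profile: {name!r}")
    config = {"model": model, **SHARED_DEFAULTS}
    config.update(overrides or {})  # assumption: explicit values win
    if not 0 <= config["vad_level"] <= 3:
        raise ValueError("vad_level must be between 0 and 3")
    return config
```

For example, `resolve_profile("large-v3", {"language": "de"})` keeps every shared default except the model and the language.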


Language

Defaults to English (en). Change the language config value to transcribe other languages. Any language supported by Whisper is accepted.


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| transcribe.max_seconds | number | Maximum Seconds: maximum number of seconds of audio to buffer before processing | 300 |
| transcribe.min_seconds | number | Minimum Seconds: minimum number of seconds of audio to process in a batch when looking for silence | 240 |
| transcribe.model | string | Model: Whisper model to use for transcription | "base" |
| transcribe.profile | string | | "default" |
| transcribe.silence_threshold | number | Silence Threshold: threshold, in seconds, used to detect silence in speech | 0.25 |
| transcribe.vad_level | number | VAD Level: VAD level to use for silence detection (0-3) | 1 |

Dependencies

  • faster-whisper
  • ctranslate2
  • av
  • tokenizers
  • huggingface-hub
  • tqdm
  • onnxruntime