# audio_transcribe

A RocketRide audio filter node that transcribes spoken audio or video to text using OpenAI Whisper.

## What it does

Receives an audio or video stream, extracts the audio track as 16 kHz mono PCM, buffers it in 60-second chunks (forced flush at 120 seconds), and runs Whisper with built-in voice activity detection (VAD). Segments are merged until they end in terminal punctuation (`.`, `?`, `!`), so output arrives as whole sentences, each carrying the timestamp of its first segment.

Uses `ai.common.models.Whisper`: transcription routes to the model server when the engine is started with `--modelserver`, otherwise it runs locally via `faster-whisper`. No API key is required either way. Whisper is invoked with `beam_size=5` and `vad_filter=True`, and transcription calls are serialized through a global lock so a single loaded model is shared safely across instances.

Models are downloaded from HuggingFace on first use. GPU is used automatically when available (`compute_type` defaults to `float16`).

---

## Configuration

### Lanes

| Input lane | Output lane | Behaviour |
|------------|-------------|-----------|
| `audio` | `text` | Transcribed sentences, one per segment |
| `video` | `text` | Audio track is extracted from the video and transcribed |

When a `documents` listener is attached, the node also emits one document per merged sentence with `chunkId` (sequential per stream, reset on each new stream) and `time_stamp` (seconds from stream start) in the document metadata.

### Fields

| Field | Type | Description |
|---|---|---|
| `model` | string | Default "base". The Whisper model to use for transcription |
| `silence_threshold` | number | Default 0.25. The silence threshold to detect silence in speech (in seconds) |
| `min_seconds` | number | Default 240. The minimum seconds of audio to process in a batch and looking for silence |
| `max_seconds` | number | Default 300. The maximum seconds of audio to buffer to process |
| `vad_level` | number | Default 1. The VAD level to use for silence detection (0-3) |
| `profile` | string | Default "default". |

### VAD levels

| Level | Behaviour |
|-------|-----------|
| `0` | Most permissive: detects the most audio as speech (risk: includes noise) |
| `1` | Slightly more aggressive: skips minor background noise (default) |
| `2` | Balanced: moderate filtering of non-speech |
| `3` | Most aggressive: filters aggressively, may cut off quiet or short speech |

---

## Models

| Model | Notes |
|------------|-------|
| `tiny` | Fastest, least accurate |
| `base` | Fast, low accuracy (default) |
| `small` | Medium speed and accuracy |
| `medium` | Slower, high accuracy |
| `large-v3` | Slowest, highest accuracy |

---

## Profiles

The node ships one profile per model size (`tiny`, `base`, `small`, `medium`, `large-v3`) plus `default`, which is an alias for `base`. Every profile uses the same defaults: `language: en`, `silence_threshold: 0.25`, `min_seconds: 240`, `max_seconds: 300`, `vad_level: 1`. Only the model differs between profiles.

---

## Language

Defaults to English (`en`). Change the `language` config value to transcribe other languages. Any language supported by Whisper is accepted.

---

## Schema

| Field | Type | Description | Default |
|---|---|---|---|
| `transcribe.max_seconds` | `number` | **Maximum Seconds**<br>The maximum seconds of audio to buffer to process | `300` |
| `transcribe.min_seconds` | `number` | **Minimum Seconds**<br>The minimum seconds of audio to process in a batch and looking for silence | `240` |
| `transcribe.model` | `string` | **Model**<br>The Whisper model to use for transcription | `"base"` |
| `transcribe.profile` | `string` | | `"default"` |
| `transcribe.silence_threshold` | `number` | **Silence Threshold**<br>The silence threshold to detect silence in speech (in seconds) | `0.25` |
| `transcribe.vad_level` | `number` | **VAD Level**<br>The VAD level to use for silence detection (0-3) | `1` |

## Dependencies

- `faster-whisper`
- `ctranslate2`
- `av`
- `tokenizers`
- `huggingface-hub`
- `tqdm`
- `onnxruntime`

## Source

[View source](https://github.com/rocketride-org/rocketride-server/tree/develop/nodes/src/nodes/audio_transcribe)
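
## Sketch: sentence merging

The merging rule described under "What it does" (segments are joined until one ends in `.`, `?`, or `!`, and the sentence keeps the timestamp of its first segment) can be illustrated with a minimal Python sketch. This is not the node's actual implementation: `Segment` and `merge_into_sentences` are hypothetical names introduced here, and flushing leftover text at the end of the stream is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

# Terminal punctuation named in the node description.
TERMINAL_PUNCTUATION = (".", "?", "!")


@dataclass
class Segment:
    """Hypothetical container: a piece of text and its start time in seconds."""
    start: float
    text: str


def merge_into_sentences(segments: Iterable[Segment]) -> Iterator[Segment]:
    """Join consecutive segments until one ends in terminal punctuation.

    Each emitted sentence carries the start time of its first segment.
    """
    parts: List[str] = []
    first_start: Optional[float] = None

    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # skip empty segments
        if first_start is None:
            first_start = seg.start  # remember where the sentence began
        parts.append(text)
        if text.endswith(TERMINAL_PUNCTUATION):
            yield Segment(start=first_start, text=" ".join(parts))
            parts, first_start = [], None

    # Assumption: unfinished text is emitted when the stream ends.
    if parts:
        yield Segment(start=first_start, text=" ".join(parts))
```

For example, feeding in segments `(0.0, "Hello there,")`, `(1.4, "how are you?")`, and `(3.0, "Fine.")` yields two sentences: "Hello there, how are you?" at 0.0 seconds and "Fine." at 3.0 seconds.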