Text To Speech
A RocketRide pipe node that converts incoming text into spoken audio using the Kokoro-82M text-to-speech engine.
What it does
Takes text arriving on any of its input lanes, synthesizes it with Kokoro-82M (hexgrad/Kokoro-82M), and emits the result on the audio lane as WAV bytes (MIME audio/wav) via the writeAudio BEGIN / WRITE / END sequence. Locally generated audio is mono, 16-bit, 24 kHz; synthesis speed is fixed at 1.
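A consumer of the audio lane may want to confirm that the bytes it receives match the format described above. The following is a minimal sketch using only Python's standard `wave` module. The helper names `make_test_wav` and `describe_wav` are hypothetical and not part of the node, and the sine tone merely stands in for real Kokoro output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, the rate stated for locally generated audio


def make_test_wav(seconds: float = 0.1, freq: float = 440.0) -> bytes:
    """Build a short mono, 16-bit, 24 kHz WAV in memory (stand-in for TTS output)."""
    n_samples = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 2 bytes per sample = 16-bit
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return buf.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read the WAV header and report the properties a consumer would check."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,
            "rate": w.getframerate(),
            "frames": w.getnframes(),
        }
```

Calling `describe_wav` on bytes collected from the audio lane should report one channel, 16 bits, and a 24000 Hz rate if the audio was generated locally.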
The node runs in one of two modes, chosen automatically at startup:
- Local: when no model server is configured, the node installs its own requirements (numpy, kokoro, soundfile) at runtime and constructs a kokoro.KPipeline in-process. The spaCy en_core_web_sm model (needed by Kokoro's misaki G2P) is downloaded and installed automatically, matched to the installed spaCy version.
- Model server (--modelserver): when a model server address is available, the node connects a ModelClient and loads the kokoro loader on the server instead. The heavy local dependencies are skipped entirely; audio comes back base64-encoded over the inference command.
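The mode decision and the base64 step can be pictured with a small sketch. It does not reproduce the node's code: `choose_mode` and `decode_server_audio` are hypothetical names, and the assumption that the server's payload is a plain base64 string of WAV bytes is ours.

```python
import base64
from typing import Optional


def choose_mode(model_server_address: Optional[str]) -> str:
    """Return "model_server" when an address is configured, otherwise "local"."""
    if model_server_address and model_server_address.strip():
        return "model_server"
    return "local"


def decode_server_audio(payload_b64: str) -> bytes:
    """Decode base64-encoded audio (assumed payload shape) back into raw bytes."""
    return base64.b64decode(payload_b64)
```

Keeping the decision in one place means the heavy local imports only need to happen on the "local" branch.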
Audio is written to a temporary WAV file during synthesis and deleted as soon as the bytes have been streamed, including on error, so no orphan files are left on disk. Empty or whitespace-only input is silently skipped and produces no output. If no voice is configured, startup fails with the error "Kokoro: choose a voice from the list".
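The temporary-file lifecycle and the empty-input rule can be sketched with a `try`/`finally` block. This is an illustration under assumptions, not the node's implementation: `synthesize_bytes` is a hypothetical helper, and `render` is a hypothetical callback standing in for whatever writes the WAV file.

```python
import os
import tempfile
from typing import Callable, Optional


def synthesize_bytes(text: str, render: Callable[[str, str], None]) -> Optional[bytes]:
    """Render text to a temp WAV, return its bytes, and always delete the file."""
    # Empty or whitespace-only input is skipped and produces no output.
    if not text or not text.strip():
        return None

    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # the renderer reopens the file by path
    try:
        render(text, path)  # hypothetical: writes WAV data to `path`
        with open(path, "rb") as f:
            return f.read()
    finally:
        # Runs on success and on error, so no orphan file is left behind.
        if os.path.exists(path):
            os.remove(path)
```

Because the cleanup lives in `finally`, an exception raised by the renderer still propagates to the caller after the file has been removed.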
Configuration
Lanes
All four input lanes produce output on the audio lane.
| Input lane | What gets synthesized |
|---|---|
| text | The raw text, as-is. |
| documents | The page_content of each document, joined with newlines. Documents of type Image, Audio, or Video are skipped. |
| questions | The text of every question, joined with spaces. |
| answers | The answer text (via getText()). |
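The joining rules for the documents and questions lanes can be sketched as follows. The `Doc` dataclass, its default `type` value, and the function names are hypothetical stand-ins for the pipeline's real objects; only the joining and skipping behavior comes from the table above.

```python
from dataclasses import dataclass
from typing import Iterable

# Document types that carry no speakable text, per the lanes table.
SKIPPED_TYPES = {"Image", "Audio", "Video"}


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    page_content: str
    type: str = "Document"  # assumed default, for illustration only


def text_from_documents(docs: Iterable[Doc]) -> str:
    """Join page_content with newlines, skipping non-text document types."""
    return "\n".join(d.page_content for d in docs if d.type not in SKIPPED_TYPES)


def text_from_questions(questions: Iterable[str]) -> str:
    """Join the text of every question with single spaces."""
    return " ".join(questions)
```

For example, two text documents separated by an Image document would yield their two page_content strings joined by a single newline.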
Fields
The node has a single profile, kokoro (the default), selected by the profile field.
| Field | Type | Description |
|---|---|---|
| kokoro_voice | string | Default "af_heart". Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.). |
| profile | string | Default "kokoro". |
Voices and language
The language is derived automatically from the first character of the voice id (af_* / am_* is American English, bf_* / bm_* is British English, ef_* / em_* is Spanish, and so on). Available voice families:
| Prefix | Language | Examples |
|---|---|---|
| af_ / am_ | American English | af_heart (default), af_bella, am_adam |
| bf_ / bm_ | British English | bf_emma, bm_george, bm_fable |
| jf_ / jm_ | Japanese | jf_alpha, jm_kumo |
| zf_ / zm_ | Mandarin | zf_xiaoxiao, zm_yunxi |
| ef_ / em_ | Spanish | ef_dora, em_alex |
| ff_ | French | ff_siwis |
| hf_ / hm_ | Hindi | hf_alpha, hm_omega |
| if_ / im_ | Italian | if_sara, im_nicola |
| pf_ / pm_ | Portuguese | pf_dora, pm_alex |
The full list of voice ids is defined in services.json.
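The first-character rule can be expressed as a small lookup. This sketch mirrors the table above; `LANGUAGE_BY_CODE` and `language_for_voice` are hypothetical names, and raising `ValueError` for an unknown prefix is our choice rather than documented behavior.

```python
# First character of the voice id -> language, taken from the voice table.
LANGUAGE_BY_CODE = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Portuguese",
}


def language_for_voice(voice_id: str) -> str:
    """Derive the language from the first character of a voice id."""
    if not voice_id:
        raise ValueError("voice id must not be empty")
    code = voice_id[0]
    try:
        return LANGUAGE_BY_CODE[code]
    except KeyError:
        raise ValueError(f"unknown voice prefix: {code!r}") from None
```

Note that only the first character selects the language; the second character (f or m) distinguishes voices within a family.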
Troubleshooting (Exception: 1 / wasabi)
If misaki/spaCy initialization fails (for example with Exception: 1 or a missing wasabi dependency), check that the spaCy English model is installed. The node downloads en_core_web_sm automatically from the official spaCy GitHub release wheel, matched to the installed spaCy version. Also verify that numpy, kokoro, and soundfile from requirements.txt are installed, and that the model download was not blocked by network restrictions.
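A quick way to see which pieces are missing is to ask Python whether each module can be found, without importing it. The sketch below uses the standard `importlib.util.find_spec`; the helper name `missing_modules` and the exact module list are our assumptions for local mode.

```python
import importlib.util
from typing import Iterable, List

# Top-level modules local mode is expected to need (assumed list).
REQUIRED_MODULES = ("numpy", "kokoro", "soundfile", "spacy", "wasabi", "en_core_web_sm")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If `en_core_web_sm` appears in the output while `spacy` does not, the automatic model download is the likely culprit, for instance because of a network restriction.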
Schema
| Field | Type | Description | Default |
|---|---|---|---|
| audio_tts.kokoro_voice | string | Voice: Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.). | "af_heart" |
| audio_tts.profile | string | TTS profile | "kokoro" |
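How the schema defaults and the missing-voice check might fit together is sketched below. The flat dictionary keyed by schema field name, the `resolve_config` helper, and the use of `ValueError` are all assumptions made for illustration; only the default values and the error text come from this document.

```python
from typing import Dict, Optional

# Defaults taken from the schema table.
DEFAULTS: Dict[str, str] = {
    "audio_tts.profile": "kokoro",
    "audio_tts.kokoro_voice": "af_heart",
}


def resolve_config(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge user overrides onto the defaults and reject an empty voice."""
    config = {**DEFAULTS, **(overrides or {})}
    voice = config.get("audio_tts.kokoro_voice")
    if not isinstance(voice, str) or not voice.strip():
        # Error text as documented; the exception type is an assumption.
        raise ValueError("Kokoro: choose a voice from the list")
    return config
```

With no overrides the result uses af_heart; an override that blanks the voice triggers the documented error message.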
Dependencies
- numpy
- kokoro>=0.9.4
- soundfile>=0.13.1