Text To Speech

A RocketRide pipe node that converts incoming text into spoken audio using the Kokoro-82M text-to-speech engine.

What it does

Takes text arriving on any of its input lanes, synthesizes it with Kokoro-82M (hexgrad/Kokoro-82M), and emits the result on the audio lane as WAV bytes (MIME audio/wav) via the writeAudio BEGIN / WRITE / END sequence. Locally generated audio is mono, 16-bit, 24 kHz; synthesis speed is fixed at 1.
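
To confirm that a payload received from the audio lane matches the format described above, the WAV header can be read with Python's standard `wave` module. The following is a minimal sketch: `describe_wav` and `make_silence` are hypothetical helper names, not part of the node, and `make_silence` only builds a stand-in payload so the example runs without Kokoro installed.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read the header of an in-memory WAV payload (hypothetical helper)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def make_silence(seconds: float = 0.5, rate: int = 24000) -> bytes:
    """Build a stand-in payload: mono, 16-bit PCM silence at `rate` Hz."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_silence()))
```

For locally generated audio, the expected header values are 1 channel, 16 bits per sample, and a 24000 Hz sample rate.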

The node runs in one of two modes, chosen automatically at startup:

  • Local: when no model server is configured, the node installs its own requirements (numpy, kokoro, soundfile) at runtime and constructs a kokoro.KPipeline in-process. The spaCy en_core_web_sm model (needed by Kokoro's misaki G2P) is downloaded and installed automatically, matched to the installed spaCy version.
  • Model server (--modelserver): when a model server address is available, the node connects a ModelClient and loads the kokoro loader on the server instead. The heavy local dependencies are skipped entirely; audio comes back base64-encoded over the inference command.
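
The mode decision and the base64 handling on the model-server path can be sketched as follows. This is an illustration only: `choose_mode`, `decode_audio`, and the `audio_b64` response key are assumptions made for the example and do not reflect the actual ModelClient API or response layout.

```python
import base64
from typing import Optional


def choose_mode(modelserver_address: Optional[str]) -> str:
    """Return "modelserver" when an address is configured, else "local".

    Hypothetical helper; the node makes this choice internally at startup.
    """
    return "modelserver" if modelserver_address else "local"


def decode_audio(response: dict) -> bytes:
    """Decode base64-encoded audio from a server response.

    The "audio_b64" key is an assumed field name used for illustration.
    """
    return base64.b64decode(response["audio_b64"])


if __name__ == "__main__":
    print(choose_mode(None))
    fake_response = {"audio_b64": base64.b64encode(b"RIFF").decode("ascii")}
    print(decode_audio(fake_response))
```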

Audio is written to a temporary WAV file during synthesis, and the file is deleted as soon as its bytes have been streamed, including on error, so no orphan files are left on disk. Empty or whitespace-only input is silently skipped and produces no output. If no voice is configured, startup fails with the error "Kokoro: choose a voice from the list".
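
The cleanup guarantee described above corresponds to a try/finally pattern around the temporary file. The sketch below assumes a hypothetical `render(text, path)` callable standing in for the actual synthesis step; `synthesize_to_bytes` is likewise an illustrative name, not the node's own function.

```python
import os
import tempfile
from typing import Callable, Optional


def synthesize_to_bytes(
    text: str, render: Callable[[str, str], None]
) -> Optional[bytes]:
    """Render `text` to a temp WAV file, return its bytes, always clean up.

    `render` is a hypothetical stand-in for the synthesis step: it receives
    the text and a file path and is expected to write a WAV file there.
    """
    # Empty or whitespace-only input is skipped and produces no output.
    if not text or not text.strip():
        return None

    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # only the path is needed; render() opens the file itself
    try:
        render(text, path)
        with open(path, "rb") as wav_file:
            return wav_file.read()
    finally:
        # Runs on success and on error, so no orphan file is left behind.
        if os.path.exists(path):
            os.remove(path)
```

Because the removal sits in the `finally` clause, an exception raised inside `render` still propagates to the caller, but only after the temporary file has been deleted.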


Configuration

Lanes

All four input lanes produce output on the audio lane.

| Input lane | What gets synthesized |
| --- | --- |
| text | The raw text, as-is. |
| documents | The page_content of each document, joined with newlines. Documents of type Image, Audio, or Video are skipped. |
| questions | The text of every question, joined with spaces. |
| answers | The answer text (via getText()). |
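
The joining rules for the documents and questions lanes can be sketched as below. The `Doc` class and both function names are hypothetical stand-ins for illustration; the real lane objects in RocketRide may carry different attributes.

```python
from dataclasses import dataclass
from typing import Iterable

# Document types that are skipped rather than synthesized.
SKIPPED_TYPES = {"Image", "Audio", "Video"}


@dataclass
class Doc:
    """Hypothetical stand-in for a document arriving on the documents lane."""
    page_content: str
    type: str = "Text"


def text_from_documents(docs: Iterable[Doc]) -> str:
    """Join page_content with newlines, skipping non-text document types."""
    return "\n".join(d.page_content for d in docs if d.type not in SKIPPED_TYPES)


def text_from_questions(questions: Iterable[str]) -> str:
    """Join the text of every question with single spaces."""
    return " ".join(questions)


if __name__ == "__main__":
    docs = [Doc("First page."), Doc("<binary>", type="Image"), Doc("Second page.")]
    print(text_from_documents(docs))
    print(text_from_questions(["What is Kokoro?", "Which voices exist?"]))
```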

Fields

The node has a single profile, kokoro (the default), selected by the profile field.

| Field | Type | Description |
| --- | --- | --- |
| kokoro_voice | string | Default "af_heart". Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.). |
| profile | string | Default "kokoro". |

Voices and language

The language is derived automatically from the first character of the voice id (af_* / am_* is American English, bf_* / bm_* is British English, ef_* / em_* is Spanish, and so on). Available voice families:

| Prefix | Language | Examples |
| --- | --- | --- |
| af_ / am_ | American English | af_heart (default), af_bella, am_adam |
| bf_ / bm_ | British English | bf_emma, bm_george, bm_fable |
| jf_ / jm_ | Japanese | jf_alpha, jm_kumo |
| zf_ / zm_ | Mandarin | zf_xiaoxiao, zm_yunxi |
| ef_ / em_ | Spanish | ef_dora, em_alex |
| ff_ | French | ff_siwis |
| hf_ / hm_ | Hindi | hf_alpha, hm_omega |
| if_ / im_ | Italian | if_sara, im_nicola |
| pf_ / pm_ | Portuguese | pf_dora, pm_alex |

The full list of voice ids is defined in services.json.
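
The first-character rule can be expressed as a small lookup. This sketch mirrors the voice families listed in this section; `LANGUAGE_BY_PREFIX` and `language_for_voice` are illustrative names, and the node's own implementation may differ.

```python
# Language keyed by the first character of the voice id, taken from the
# voice families documented in this section.
LANGUAGE_BY_PREFIX = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Portuguese",
}


def language_for_voice(voice_id: str) -> str:
    """Return the language implied by a voice id such as "af_heart"."""
    if not voice_id:
        raise ValueError("voice id must not be empty")
    try:
        return LANGUAGE_BY_PREFIX[voice_id[0]]
    except KeyError:
        raise ValueError(f"unknown voice prefix in {voice_id!r}") from None


if __name__ == "__main__":
    for voice in ("af_heart", "bm_george", "ff_siwis"):
        print(voice, "->", language_for_voice(voice))
```

Only the first character matters for the lookup; the second character (f or m) distinguishes voices within a family and does not affect the language.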


Troubleshooting (Exception: 1 / wasabi)

If misaki/spaCy initialization fails (for example with Exception: 1 or a missing wasabi dependency), check that the spaCy English model is installed. The node downloads en_core_web_sm automatically from the official spaCy GitHub release wheel, matched to the installed spaCy version, so first confirm that the download was not blocked by network restrictions. Also verify that numpy, kokoro, and soundfile from requirements.txt are installed.
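
A quick way to see which of these pieces is missing is to probe for each module without importing it. The sketch below uses only the standard library; the module list is an assumption based on the packages named in this section, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Top-level modules mentioned in this section (assumed import names).
REQUIRED = ("numpy", "kokoro", "soundfile", "spacy", "wasabi", "en_core_web_sm")


def missing_modules(names=REQUIRED) -> list:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run this with the same Python interpreter the node uses; a module installed into a different environment will still be reported as missing.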


Schema

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| audio_tts.kokoro_voice | string | Voice. Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.). | "af_heart" |
| audio_tts.profile | string | TTS profile | "kokoro" |

Dependencies

  • numpy
  • kokoro >=0.9.4
  • soundfile >=0.13.1