Text To Speech

A RocketRide pipe node that converts incoming text into spoken audio using the Kokoro-82M text-to-speech engine.

What it does

Takes text arriving on any of its input lanes, synthesizes it with Kokoro-82M (hexgrad/Kokoro-82M), and emits the result on the audio lane as WAV bytes (MIME audio/wav) via the writeAudio BEGIN / WRITE / END sequence. Locally generated audio is mono, 16-bit, 24 kHz; synthesis speed is fixed at 1.
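The BEGIN / WRITE / END sequence can be pictured with a minimal sketch. The `emit` callback and its `(action, mime_type, chunk)` shape are hypothetical stand-ins: the real writeAudio signature is not shown in this document, and the chunk size is an arbitrary choice.

```python
from typing import Callable, Optional

# Hypothetical callback shape: (action, mime_type, chunk).
# This is an illustration only, not the node's actual writeAudio signature.
Emit = Callable[[str, str, Optional[bytes]], None]


def stream_wav(wav_bytes: bytes, emit: Emit, chunk_size: int = 64 * 1024) -> int:
    """Send wav_bytes as BEGIN, zero or more WRITEs, then END.

    Returns the number of WRITE events that were emitted.
    """
    mime = "audio/wav"
    emit("BEGIN", mime, None)
    writes = 0
    for start in range(0, len(wav_bytes), chunk_size):
        emit("WRITE", mime, wav_bytes[start:start + chunk_size])
        writes += 1
    emit("END", mime, None)
    return writes
```

A consumer that concatenates the WRITE chunks between BEGIN and END recovers the original WAV bytes unchanged.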

The node runs in one of two modes, chosen automatically at startup:

Local: when no model server is configured, the node installs its own requirements (numpy, kokoro, soundfile) at runtime and constructs a kokoro.KPipeline in-process. The spaCy en_core_web_sm model (needed by Kokoro's misaki G2P) is downloaded and installed automatically, matched to the installed spaCy version.
Model server (--modelserver): when a model server address is available, the node connects a ModelClient and loads the kokoro loader on the server instead. The heavy local dependencies are skipped entirely; audio comes back base64-encoded over the inference command.

Audio is written to a temporary WAV file during synthesis and deleted as soon as the bytes have been streamed, including on error, so no orphan files are left on disk. Empty or whitespace-only input is silently skipped and produces no output. If no voice is configured, startup fails with the error "Kokoro: choose a voice from the list".
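The temporary-file handling described above can be sketched as follows. This is an illustration under stated assumptions, not the node's code: it uses the standard-library `wave` module in place of soundfile, and the `samples` argument stands in for audio that Kokoro would produce. The function name is hypothetical.

```python
import os
import struct
import tempfile
import wave
from typing import Optional, Sequence

SAMPLE_RATE = 24_000  # Hz; local output is mono, 16-bit, 24 kHz


def samples_to_wav_bytes(text: str, samples: Sequence[int]) -> Optional[bytes]:
    """Write 16-bit mono samples to a temp WAV, read it back, always delete it."""
    if not text.strip():
        return None  # empty or whitespace-only input is skipped

    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # reopen by path below; the descriptor itself is not needed
    try:
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)   # mono
            wav.setsampwidth(2)   # 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
        with open(path, "rb") as fh:
            return fh.read()
    finally:
        # Runs on success and on error, so no orphan file is left behind.
        os.remove(path)
```

Putting the cleanup in `finally` is what guarantees deletion when synthesis or reading raises.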

Configuration

Lanes

All four input lanes produce output on the audio lane.

Input lane	What gets synthesized
`text`	The raw text, as-is.
`documents`	The `page_content` of each document, joined with newlines. Documents of type `Image`, `Audio`, or `Video` are skipped.
`questions`	The text of every question, joined with spaces.
`answers`	The answer text (via `getText()`).
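The joining rules for the documents and questions lanes can be sketched like this. The `Doc` and `Question` classes are hypothetical stand-ins for the pipeline's own objects; only the `page_content` field, the skipped types, and the separators come from the table above.

```python
from dataclasses import dataclass
from typing import Iterable

# Document types that carry no text to speak.
SKIPPED_TYPES = {"Image", "Audio", "Video"}


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    page_content: str
    type: str = "Text"


@dataclass
class Question:
    """Hypothetical stand-in for a pipeline question."""
    text: str


def text_from_documents(docs: Iterable[Doc]) -> str:
    """Join page_content with newlines, skipping non-text documents."""
    return "\n".join(d.page_content for d in docs if d.type not in SKIPPED_TYPES)


def text_from_questions(questions: Iterable[Question]) -> str:
    """Join the text of every question with single spaces."""
    return " ".join(q.text for q in questions)
```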

Fields

The node has a single profile, kokoro (the default), selected by the profile field.

Field	Type	Description
`kokoro_voice`	string	Default "af_heart". Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.).
`profile`	string	Default "kokoro".
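As a rough sketch of the voice check at startup, assuming the configuration reaches the node as a flat dictionary keyed by the field names above (that shape is an assumption, as is the function name):

```python
from typing import Mapping


def require_voice(config: Mapping[str, str]) -> str:
    """Return the configured voice id, or fail the way startup is described to."""
    # Assumption: config is a flat mapping such as {"kokoro_voice": "af_heart"}.
    voice = (config.get("kokoro_voice") or "").strip()
    if not voice:
        raise ValueError("Kokoro: choose a voice from the list")
    return voice
```

For example, `require_voice({"profile": "kokoro", "kokoro_voice": "af_heart"})` returns `"af_heart"`, while a configuration without a voice raises the error.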

Voices and language

The language is derived automatically from the first character of the voice id (af_* / am_* is American English, bf_* / bm_* is British English, ef_* / em_* is Spanish, and so on). Available voice families:

Prefix	Language	Examples
`af_` / `am_`	American English	`af_heart` (default), `af_bella`, `am_adam`
`bf_` / `bm_`	British English	`bf_emma`, `bm_george`, `bm_fable`
`jf_` / `jm_`	Japanese	`jf_alpha`, `jm_kumo`
`zf_` / `zm_`	Mandarin	`zf_xiaoxiao`, `zm_yunxi`
`ef_` / `em_`	Spanish	`ef_dora`, `em_alex`
`ff_`	French	`ff_siwis`
`hf_` / `hm_`	Hindi	`hf_alpha`, `hm_omega`
`if_` / `im_`	Italian	`if_sara`, `im_nicola`
`pf_` / `pm_`	Portuguese	`pf_dora`, `pm_alex`

The full list of voice ids is defined in services.json.
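The prefix rule can be expressed as a small lookup. This is a minimal sketch built from the table above; the function name and the error behaviour for unknown prefixes are assumptions, not taken from the node's source.

```python
# First character of the voice id -> language, per the table above.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Portuguese",
}


def language_for_voice(voice: str) -> str:
    """Derive the language from the first character of a voice id."""
    key = voice[:1].lower()
    if key not in LANGUAGES:
        raise ValueError(f"unknown voice prefix in {voice!r}")
    return LANGUAGES[key]
```

The second character of the prefix (f or m) does not affect the language, which is why `af_heart` and `am_adam` both resolve to American English.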

Troubleshooting (`Exception: 1` / wasabi)

If misaki/spaCy initialization fails (for example with Exception: 1 or a missing wasabi dependency), check that the spaCy English model is installed. The node downloads en_core_web_sm automatically from the official spaCy GitHub release wheel, matched to the installed spaCy version, so confirm that the download was not blocked by network restrictions. Also verify that numpy, kokoro, and soundfile from requirements.txt are installed.
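One quick way to narrow this down is to check which of the relevant modules Python can locate in the node's environment. The sketch below uses only the standard library; the list of module names is an assumption drawn from the packages mentioned in this document.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names of the packages mentioned in this document (assumed list).
REQUIRED = ["numpy", "kokoro", "soundfile", "misaki", "spacy", "wasabi", "en_core_web_sm"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

If en_core_web_sm is the only name reported, the automatic model download is the likely place to look.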

Schema

Field	Type	Description	Default
`audio_tts.kokoro_voice`	`string`	Kokoro voice. The language is derived automatically from the voice prefix (af_/am_ → American, bf_/bm_ → British, ef_/em_ → Spanish, etc.).	`"af_heart"`
`audio_tts.profile`	`string`	TTS profile	`"kokoro"`

Dependencies

numpy
--only-binary docopt
--only-binary num2words
kokoro >=0.9.4
soundfile >=0.13.1
