Converts text to speech, with support for multiple voices, emotion control, speed adjustment, and more. The input text must be shorter than 10,000 characters; for text longer than 3,000 characters, streaming output is recommended.
The text to be synthesized into speech. It must be shorter than 10,000 characters; for text longer than 3,000 characters, streaming output is recommended. Supports paragraph breaks (newline characters), pause control (<#x#> markers), and interjection tags such as (laughs) and (coughs). Interjection tags are supported only by speech-2.8-hd/turbo.
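The length rules above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the limits are counted in Python string characters (the reference does not say how characters are counted). `plan_synthesis` and the `use_streaming` key are hypothetical names used for illustration, not part of the API.

```python
MAX_CHARS = 10_000        # text must be shorter than this
STREAM_THRESHOLD = 3_000  # streaming is recommended above this length


def plan_synthesis(text: str) -> dict:
    """Reject over-long text and flag whether streaming is advisable."""
    length = len(text)
    if length >= MAX_CHARS:
        raise ValueError(
            f"text has {length} characters; it must be shorter than {MAX_CHARS}"
        )
    # Streaming is only a recommendation, so this is a hint, not a hard rule.
    return {"text": text, "use_streaming": length > STREAM_THRESHOLD}
```

A caller could use the returned flag to choose between a streaming and a non-streaming request.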
Pitch adjustment (deep/bright). Range: [-100, 100]. Values closer to -100 produce a deeper sound; values closer to 100 produce a brighter sound.
Timbre adjustment (rich/crisp). Range: [-100, 100]. Values closer to -100 produce a richer sound; values closer to 100 produce a crisper sound.
Intensity adjustment (powerful/soft). Range: [-100, 100]. Values closer to -100 produce a more powerful sound; values closer to 100 produce a softer sound.
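The three adjustments share the same [-100, 100] range, so one check can cover them all. This is a sketch only: the key names `pitch`, `timbre`, and `intensity` are assumptions taken from the descriptions above (the reference does not show the actual field names), and `check_voice_modify` is a hypothetical helper.

```python
from typing import Optional


def check_voice_modify(
    *,
    pitch: Optional[int] = None,
    timbre: Optional[int] = None,
    intensity: Optional[int] = None,
) -> dict:
    """Return only the adjustments that were set, after range-checking them."""
    settings = {}
    for name, value in (("pitch", pitch), ("timbre", timbre), ("intensity", intensity)):
        if value is None:
            continue  # adjustment not requested; leave it out of the result
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be in [-100, 100], got {value}")
        settings[name] = value
    return settings
```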
Sound effect settings. Only one can be selected at a time. Possible values: spacious_echo (Spacious Echo), auditorium_echo (Auditorium Broadcast), lofi_telephone (Telephone Distortion), robotic (Electronic Voice).
The bitrate of the generated audio. Possible values: 32000, 64000, 128000, 256000. Default: 128000. This parameter applies only to mp3 audio.
Controls constant bitrate (CBR) encoding for audio. When set to true, audio will be encoded at a constant bitrate. Note: This parameter only takes effect when streaming output is enabled and the audio format is mp3.
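The conditions under which the bitrate and CBR settings take effect can be summarized in code. The sketch below is illustrative: `effective_audio_options` and its argument names are hypothetical, and the format string "wav" in the usage note stands in for any non-mp3 format (the reference does not list the supported formats here).

```python
ALLOWED_BITRATES = (32000, 64000, 128000, 256000)


def effective_audio_options(
    audio_format: str,
    stream: bool,
    bitrate: int = 128000,
    force_cbr: bool = False,
) -> dict:
    """Report which of the two settings would actually take effect."""
    if bitrate not in ALLOWED_BITRATES:
        raise ValueError(f"bitrate must be one of {ALLOWED_BITRATES}, got {bitrate}")
    is_mp3 = audio_format == "mp3"
    return {
        # Bitrate applies only to mp3 output.
        "bitrate": bitrate if is_mp3 else None,
        # CBR takes effect only for streaming mp3 output.
        "force_cbr": force_cbr and is_mp3 and stream,
    }
```

For example, `effective_audio_options("wav", stream=True, force_cbr=True)` reports that neither setting applies, while the same call with "mp3" keeps both.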
Controls the output format. Possible values: url, hex. Default: hex. This parameter takes effect only in non-streaming scenarios; streaming scenarios support hex output only. The returned URL is valid for 24 hours.
Controls the emotion of the synthesized speech. Possible values (9 emotions): happy, sad, angry, fearful, disgusted, surprised, calm (neutral), fluent (vivid), whisper. The model automatically matches appropriate emotions based on the input text, so manual specification is generally unnecessary.
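With hex output, the audio arrives as a hexadecimal string that has to be decoded into bytes before it can be saved or played. A minimal sketch using the standard library follows; `hex_to_audio_bytes` is a hypothetical helper name, and where the hex string sits in the response body is not covered here.

```python
def hex_to_audio_bytes(hex_audio: str) -> bytes:
    """Decode a hex-encoded audio payload into raw bytes."""
    # bytes.fromhex raises ValueError on malformed input, which is useful
    # for catching a truncated or corrupted payload early.
    return bytes.fromhex(hex_audio)
```

The resulting bytes can then be written to a file opened in binary mode, for example `open("speech.mp3", "wb")`.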
The voice ID for the synthesized audio. To set mixed voices, use the timber_weights parameter and set this parameter to an empty value. Supports system voices, cloned voices, and text-generated voices.
Controls whether to read LaTeX formulas aloud. Default: false. Only supports Chinese; when enabled, the language_boost parameter will be set to Chinese.
Enables Chinese and English text normalization. When enabled, it can improve performance in number reading scenarios but slightly increases latency. Default: false.
Controls whether to add an audio rhythm identifier at the end of the synthesized audio. Default: false. This parameter only applies to non-streaming synthesis.
Whether to enhance recognition of specified minority languages and dialects. Default: null. Can be set to auto to let the model determine the language automatically. Possible values: Chinese, Chinese,Yue (Cantonese; a single value that contains a comma), English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto.
Controls whether the concatenated speech hex data is left out of the last chunk. Default: false, meaning the last chunk contains the complete concatenated speech hex data.
The weight of each voice in the synthesized audio. Must be provided together with voice_id. Range: [1, 100]. Up to 4 voices can be mixed. The higher a voice's weight, the more closely the result resembles that voice.
The voice ID for the synthesized audio. Must be provided together with the weight parameter. Supports system voices, cloned voices, and text-generated voices.
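The mixing rules (up to 4 voices, each weight in [1, 100], top-level voice_id left empty) can be sketched as a small builder. The names `voice_id`, `weight`, and `timber_weights` come from the descriptions above; `build_voice_mix`, the placeholder voice IDs, and the placement of these fields in the request body are assumptions.

```python
from typing import Dict


def build_voice_mix(weights: Dict[str, int]) -> dict:
    """Build a request fragment that mixes several voices by weight."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("between 1 and 4 voices can be mixed")
    for voice_id, weight in weights.items():
        if not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id!r} must be in [1, 100], got {weight}")
    return {
        # The top-level voice ID is left empty when mixed voices are used.
        "voice_id": "",
        "timber_weights": [
            {"voice_id": voice_id, "weight": weight}
            for voice_id, weight in weights.items()
        ],
    }
```

For instance, `build_voice_mix({"voice-a": 70, "voice-b": 30})` produces a mix that leans toward the placeholder voice "voice-a".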
Controls whether to enable the subtitle service. Default: false. This parameter is only effective in non-streaming output scenarios and only applies to speech-2.6-hd, speech-2.6-turbo, speech-02-turbo, speech-02-hd, speech-01-turbo, speech-01-hd models.
Defines pronunciation annotation or replacement rules for characters or symbols that require special marking. In Chinese text, tones are represented by numbers: 1st tone is 1, 2nd tone is 2, 3rd tone is 3, 4th tone is 4, neutral tone is 5. Example: ["omg/oh my god"]
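Going by the example above, each rule is a string with the written form and the spoken form separated by a slash. The sketch below builds such a list from a mapping. `pronunciation_rules` is a hypothetical helper, and rejecting a slash in the written form is a choice made in this sketch, not a documented rule.

```python
from typing import Dict, List


def pronunciation_rules(replacements: Dict[str, str]) -> List[str]:
    """Turn {written: spoken} pairs into "written/spoken" rule strings."""
    rules = []
    for written, spoken in replacements.items():
        if "/" in written:
            # A slash in the written form would make the rule ambiguous.
            raise ValueError(f"written form must not contain '/': {written!r}")
        rules.append(f"{written}/{spoken}")
    return rules
```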