This API supports synchronous text-to-speech generation, with a maximum of 10,000 characters per request. It supports 100+ system voices and cloned voices; volume, pitch, speed, and output format adjustments; proportional voice mixing and fixed-interval time control; and multiple audio specifications and formats, including mp3, pcm, flac, wav, and streaming output. After submitting a long-text speech synthesis request, note that the returned URL is valid for 24 hours from the time it is returned, so download the audio within that window.
Suitable for short-sentence generation, voice chat, and online social scenarios. Latency is low, but the text must be shorter than 10,000 characters. For longer texts, use Async Text-to-Speech instead.
The text to be synthesized; it must be shorter than 10,000 characters. Use newline characters for paragraph breaks. To control the pause duration in the speech, insert <#x#> between characters, where x is the pause in seconds (0.01-99.99, with up to two decimal places). Each interval must sit between two text segments that can be vocalized, and multiple consecutive intervals are not allowed.
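The pause rules above can be checked before a request is sent. The following is a minimal sketch, not part of the API: `pause_marker` and `join_with_pause` are hypothetical helper names, and the validation simply mirrors the stated range (0.01-99.99 seconds, at most two decimal places) and the rule that an interval must sit between two vocalizable segments.

```python
from decimal import Decimal


def pause_marker(seconds) -> str:
    """Format a pause marker such as <#1.5#> (hypothetical helper)."""
    value = Decimal(str(seconds))
    if not (Decimal("0.01") <= value <= Decimal("99.99")):
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Reject values with more than two decimal places.
    if value != value.quantize(Decimal("0.01")):
        raise ValueError("pause supports at most two decimal places")
    return f"<#{value}#>"


def join_with_pause(segments, seconds) -> str:
    """Join text segments with one pause marker between each pair.

    Empty or whitespace-only segments are dropped, so a marker never
    starts or ends the text and two markers never sit next to each other.
    """
    spoken = [s for s in segments if s.strip()]
    return pause_marker(seconds).join(spoken)


text = join_with_pause(["Hello", "", "world"], 1.5)
```

Because blank segments are filtered out first, the result always has spoken text on both sides of every marker.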
The voice ID for the request. Either this or timbre_weights is required. Supports system voice IDs and cloned voice IDs. The available system voice IDs are as follows:
Youthful Young Man: male-qn-qingse
Elite Young Man: male-qn-jingying
Assertive Young Man: male-qn-badao
College Student: male-qn-daxuesheng
Young Girl: female-shaonv
Mature Lady: female-yujie
Mature Woman: female-chengshu
Sweet Woman: female-tianmei
Male Presenter: presenter_male
Female Presenter: presenter_female
Male Audiobook 1: audiobook_male_1
Male Audiobook 2: audiobook_male_2
Female Audiobook 1: audiobook_female_1
Female Audiobook 2: audiobook_female_2
Youthful Young Man (beta): male-qn-qingse-jingpin
Elite Young Man (beta): male-qn-jingying-jingpin
Assertive Young Man (beta): male-qn-badao-jingpin
College Student (beta): male-qn-daxuesheng-jingpin
This parameter enables English text normalization, which can improve performance in number-reading scenarios but may slightly increase latency. Default: false.
Bitrate of the generated audio. Possible values: [32000, 64000, 128000, 256000]. Optional; default: 128000. This parameter only applies to mp3 format audio.
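As an illustration of the bitrate rule, here is a small sketch that builds an audio settings object and includes the bitrate only for mp3. The `format` key follows the `audio_setting.format` name used in this reference; the `bitrate` key name and the `build_audio_setting` helper are assumptions made for the example.

```python
ALLOWED_BITRATES = (32000, 64000, 128000, 256000)


def build_audio_setting(fmt: str = "mp3", bitrate: int = 128000) -> dict:
    """Build an audio settings dict (hypothetical helper).

    The bitrate is validated and included only for mp3, since the
    parameter is documented as applying to mp3 audio only.
    """
    setting = {"format": fmt}
    if fmt == "mp3":
        if bitrate not in ALLOWED_BITRATES:
            raise ValueError(f"bitrate must be one of {ALLOWED_BITRATES}")
        setting["bitrate"] = bitrate  # key name is an assumption
    return setting
```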
Replacement rules for text or symbols whose pronunciation needs special annotation. Use this for pronunciation replacement (adjusting tones or substituting the pronunciation of other characters). Format: ["omg/oh my god"]. For Chinese text, tones are represented by numbers: 1st tone (high level) is 1, 2nd tone (rising) is 2, 3rd tone (dipping) is 3, 4th tone (falling) is 4, and the neutral tone is 5.
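The "source/replacement" format above is easy to generate from a mapping. A minimal sketch, assuming each entry is a single string with the original text before the slash and the spoken form after it; `build_pronunciation_entries` is a hypothetical helper, not an API function.

```python
def build_pronunciation_entries(replacements: dict) -> list:
    """Turn {"omg": "oh my god"} into ["omg/oh my god"] (hypothetical helper)."""
    entries = []
    for source, spoken in replacements.items():
        if not source or not spoken:
            raise ValueError("both the source text and its replacement are required")
        # A slash in the source text would make the entry ambiguous.
        if "/" in source:
            raise ValueError("source text must not contain '/'")
        entries.append(f"{source}/{spoken}")
    return entries


entries = build_pronunciation_entries({"omg": "oh my god"})
```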
Weight of the voice in the mix. Range: [1, 100]. Must be provided together with voice_id. Up to 4 voices can be mixed. Values must be integers; the higher the proportion given to a single voice, the more the synthesized voice resembles it.
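The mixing constraints (integer weights in [1, 100], each paired with a voice_id, at most 4 voices) can be validated client-side. This is a sketch under assumptions: the entry shape `{"voice_id": ..., "weight": ...}` is inferred from the parameter descriptions, and `build_timbre_weights` is a hypothetical helper.

```python
def build_timbre_weights(mix: dict) -> list:
    """Build a timbre_weights list from {voice_id: weight} (hypothetical helper)."""
    if not 1 <= len(mix) <= 4:
        raise ValueError("a mix supports between 1 and 4 voices")
    for voice_id, weight in mix.items():
        if not voice_id:
            raise ValueError("every weight needs a voice_id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, int):
            raise ValueError("weight must be an integer")
        if not 1 <= weight <= 100:
            raise ValueError("weight must be in the range [1, 100]")
    return [{"voice_id": v, "weight": w} for v, w in mix.items()]


weights = build_timbre_weights({"male-qn-qingse": 70, "female-shaonv": 30})
```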
When set to True, the last chunk in streaming will not contain the concatenated complete audio hex data. Default: False, meaning the last chunk includes the concatenated complete audio hex data.
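How a client assembles streamed audio depends on this setting. A minimal sketch, assuming the hex strings of all chunks have already been collected in order; `assemble_stream_audio` and its `last_chunk_is_full_audio` argument are hypothetical names (the argument is True for the default behavior, where the last chunk carries the complete audio).

```python
def assemble_stream_audio(hex_chunks, last_chunk_is_full_audio: bool = True) -> bytes:
    """Return the complete audio bytes from streamed hex chunks (hypothetical helper)."""
    if not hex_chunks:
        return b""
    if last_chunk_is_full_audio:
        # Default behavior: the final chunk already holds the concatenated audio.
        return bytes.fromhex(hex_chunks[-1])
    # Otherwise every chunk is a partial segment and must be joined in order.
    return b"".join(bytes.fromhex(chunk) for chunk in hex_chunks)
```

Joining all chunks while the last one also contains the full audio would duplicate the content, which is why the two cases are handled separately.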
Enhances recognition capability for specified minority languages and dialects. When set, it can improve speech performance for the specified language or dialect. If the language type is unclear, select 'auto' and the model will determine the language type automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'Bulgarian', 'Danish', 'Hebrew', 'Malay', 'Persian', 'Slovak', 'Swedish', 'Croatian', 'Filipino', 'Hungarian', 'Norwegian', 'Slovenian', 'Catalan', 'Nynorsk', 'Tamil', 'Afrikaans', 'auto'
Controls the output format. Possible values: url, hex. Default: hex. This parameter only takes effect in non-streaming scenarios; streaming only supports hex format. The returned URL is valid for 24 hours.
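The interaction between streaming and the output format can be expressed as a small function. This is only a sketch of the rule described above; `effective_output_format` is a hypothetical helper, and the assumption is that a non-hex value is simply ignored when streaming rather than rejected.

```python
def effective_output_format(stream: bool, output_format: str = "hex") -> str:
    """Return the format the response will actually use (hypothetical helper)."""
    if output_format not in ("url", "hex"):
        raise ValueError("output_format must be 'url' or 'hex'")
    # Streaming only supports hex, so the requested value has no effect there.
    return "hex" if stream else output_format
```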
Intensity adjustment (powerful/soft). Range: [-100, 100]. Values closer to -100 produce a more forceful voice; values closer to 100 produce a softer voice.
The synthesized audio segment, hex-encoded, in the audio format specified by audio_setting.format in the request (mp3/pcm/flac). The return format is determined by the output_format setting. When stream is true, only hex format is supported.
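Since the audio arrives as a hex string, it has to be decoded to raw bytes before it can be played or saved. A minimal sketch using only the Python standard library; `decode_audio_hex` and `save_audio` are hypothetical helper names, and the file extension should match the format that was requested.

```python
from pathlib import Path


def decode_audio_hex(audio_hex: str) -> bytes:
    """Decode a hex-encoded audio string into raw bytes."""
    return bytes.fromhex(audio_hex)


def save_audio(audio_hex: str, path: str) -> int:
    """Write the decoded audio to a file and return the number of bytes written."""
    return Path(path).write_bytes(decode_audio_hex(audio_hex))
```

For example, `save_audio(audio_hex, "speech.mp3")` would write an mp3 response to disk, where `audio_hex` is the hex string taken from the response.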