Use this API to create an asynchronous text-to-speech task. It accepts either text or file input: text is limited to 50,000 characters and a file to 100,000 characters.
The call returns only the task_id of the task. Pass that task_id to the Get Async Task Result API to retrieve the generated result.
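The input constraints above can be checked on the client before a task is submitted. The following is a minimal sketch, not the official client. The field name `text_file_id` and the helper `build_input_fields` are assumptions made for illustration; only `text` and the 50,000-character limit come from this page. The sketch also assumes that exactly one of the two inputs should be sent and that the limit is inclusive.

```python
MAX_TEXT_CHARS = 50_000  # text input limit stated above (assumed inclusive)


def build_input_fields(text=None, text_file_id=None):
    """Return the input part of a request body, or raise ValueError.

    `text_file_id` is an assumed field name; check the API reference
    for the exact key. Exactly one of the two inputs is accepted here.
    """
    if (text is None) == (text_file_id is None):
        raise ValueError("provide exactly one of text or text_file_id")
    if text is not None:
        # len() counts Unicode code points; the service may count differently.
        if len(text) > MAX_TEXT_CHARS:
            raise ValueError(
                f"text has {len(text)} characters; limit is {MAX_TEXT_CHARS}"
            )
        return {"text": text}
    return {"text_file_id": text_file_id}
```

Failing early on the client avoids creating a task that the service would reject anyway.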
The ID of the text file to synthesize. Either this parameter or text is required; the format is validated automatically on submission. Supported file formats: txt, zip. A single file must be shorter than 100,000 characters.
txt file: Length limit <100,000 characters. Supports custom pauses using <#x#> markers, where x is the pause duration in seconds, range [0.01, 99.99], with up to two decimal places. A pause must be placed between two vocalizable text segments; consecutive pause markers are not allowed.
zip file: The archive must contain txt or json files of the same format.
json file format: Supports three fields, [title, content, extra], representing the title, body, and additional information respectively. If all three fields are present, 3 sets of results are produced, totaling 9 files stored in a single folder. If a field is missing or empty, no result is generated for that field.
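The pause-marker rules for txt input can be checked locally before uploading a file. This is a rough sketch under stated assumptions: `check_pauses` is a hypothetical helper, and "vocalizable text" is approximated as any non-whitespace text, which is looser than whatever the service actually enforces.

```python
import re

# Any <#...#> marker; the captured content is validated separately.
MARKER_RE = re.compile(r"<#([^#<>]*)#>")
# A number with at most two decimal places, e.g. 0.5 or 12.25.
NUMBER_RE = re.compile(r"\d+(?:\.\d{1,2})?")


def check_pauses(text):
    """Return a list of problems found with the pause markers in `text`."""
    problems = []
    # With one capture group, split yields: segment, value, segment, ...
    pieces = MARKER_RE.split(text)
    segments, values = pieces[0::2], pieces[1::2]
    for i, value in enumerate(values):
        if not NUMBER_RE.fullmatch(value):
            problems.append(f"<#{value}#>: not a number with at most two decimals")
        elif not 0.01 <= float(value) <= 99.99:
            problems.append(f"<#{value}#>: outside [0.01, 99.99]")
        # Assumption: non-whitespace text counts as a vocalizable segment.
        if not segments[i].strip() or not segments[i + 1].strip():
            problems.append(f"<#{value}#>: must sit between two text segments")
    return problems
```

For example, `check_pauses("Hello<#0.5#>world")` returns an empty list, while two adjacent markers are each reported because the segment between them is empty.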
Pitch adjustment (deep/bright), range [-100, 100]. Values closer to -100 produce a deeper sound; values closer to 100 produce a brighter sound.
Timbre adjustment (rich/crisp), range [-100, 100]. Values closer to -100 produce a richer sound; values closer to 100 produce a crisper sound.
Intensity adjustment (powerful/soft), range [-100, 100]. Values closer to -100 produce a more powerful sound; values closer to 100 produce a softer sound.
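All three adjustments share the same [-100, 100] range, so one check covers them. The sketch below is illustrative only: the key names `pitch`, `timbre`, and `intensity`, the default of 0, and the helper `validate_adjustments` are assumptions, not names confirmed by this page.

```python
ADJUST_MIN, ADJUST_MAX = -100, 100  # range stated for all three adjustments


def validate_adjustments(pitch=0, timbre=0, intensity=0):
    """Return the settings as a dict, or raise ValueError if one is out of range.

    Key names and the default of 0 are assumptions for illustration.
    """
    settings = {"pitch": pitch, "timbre": timbre, "intensity": intensity}
    for name, value in settings.items():
        if not ADJUST_MIN <= value <= ADJUST_MAX:
            raise ValueError(
                f"{name}={value} is outside [{ADJUST_MIN}, {ADJUST_MAX}]"
            )
    return settings
```

A call such as `validate_adjustments(pitch=-40, intensity=25)` would describe a deeper and slightly softer voice under the semantics given above.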
Controls the emotion of the synthesized speech. Possible values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "fluent", "whisper"], corresponding to 9 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral, vivid, whisper.
The model automatically matches appropriate emotions based on input text; manual specification is generally unnecessary.
This parameter only applies to speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo models.
Options fluent and whisper only apply to speech-2.6-turbo and speech-2.6-hd models.
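The model restrictions on emotion values can be expressed as a small lookup. This is a sketch only: the value and model lists are copied from this page, while the helper `check_emotion` and its argument names are hypothetical.

```python
EMOTIONS = {
    "happy", "sad", "angry", "fearful", "disgusted",
    "surprised", "calm", "fluent", "whisper",
}
# Models that accept the emotion parameter at all.
EMOTION_MODELS = {
    "speech-2.6-hd", "speech-2.6-turbo",
    "speech-02-hd", "speech-02-turbo",
    "speech-01-hd", "speech-01-turbo",
}
# Values limited to the speech-2.6 models.
RESTRICTED_EMOTIONS = {"fluent", "whisper"}
RESTRICTED_MODELS = {"speech-2.6-hd", "speech-2.6-turbo"}


def check_emotion(model, emotion):
    """Return `emotion` if it is valid for `model`, else raise ValueError.

    None means "let the model choose", which is the usual case.
    """
    if emotion is None:
        return None
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if model not in EMOTION_MODELS:
        raise ValueError(f"{model} does not support the emotion parameter")
    if emotion in RESTRICTED_EMOTIONS and model not in RESTRICTED_MODELS:
        raise ValueError(f"{emotion!r} is not available on {model}")
    return emotion
```

Because the model normally picks an emotion from the text, passing `None` is treated as valid rather than as a missing value.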
The voice ID for the synthesized audio. To set mixed voices, use the timber_weights parameter and set this parameter to an empty value. Supports system voices, cloned voices, and text-generated voices. Below are some of the latest system voice IDs; see the official documentation for the full list of supported voices.
Chinese: moss_audio_ce44fc67-7ce3-11f0-8de5-96e35d26fb85, moss_audio_aaa1346a-7ce7-11f0-8e61-2e6e3c7ee85d, Chinese (Mandarin)_Lyrical_Voice, Chinese (Mandarin)_HK_Flight_Attendant
English: English_Graceful_Lady, English_Insightful_Speaker, English_radiant_girl, English_Persuasive_Man, moss_audio_6dc281eb-713c-11f0-a447-9613c873494c, moss_audio_570551b1-735c-11f0-b236-0adeeecad052, moss_audio_ad5baf92-735f-11f0-8263-fe5a2fe98ec8, English_Lucky_Robot
Japanese: Japanese_Whisper_Belle, moss_audio_24875c4a-7be4-11f0-9359-4e72c55db738, moss_audio_7f4ee608-78ea-11f0-bb73-1e2a4cfcd245, moss_audio_c1a6a3ac-7be6-11f0-8e8e-36b92fbb4f95
Enables English text normalization. When enabled, it can improve performance in number reading scenarios but slightly increases latency. Default: false.
Controls whether to add an audio rhythm identifier at the end of the synthesized audio. Default: false. This parameter only applies to non-streaming synthesis.
Whether to enhance recognition of specified minority languages and dialects. Default: null. Can be set to auto to let the model determine the language automatically.
Possible values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto
Define pronunciation annotation or replacement rules for characters or symbols that require special marking. In Chinese text, tones are represented by numbers:
1st tone is 1, 2nd tone is 2, 3rd tone is 3, 4th tone is 4, neutral tone is 5.
Example:
["omg/oh my god"]
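Each rule in the example pairs the text to match with its replacement, separated by a slash. The sketch below parses such rules locally so they can be inspected before sending. It assumes the first "/" is the separator, and `parse_pronunciation_rules` is a hypothetical helper; it does not reproduce how the service applies the rules.

```python
def parse_pronunciation_rules(rules):
    """Map each rule's source text to its replacement.

    Assumption: the first "/" in a rule separates source from replacement.
    """
    mapping = {}
    for rule in rules:
        source, separator, replacement = rule.partition("/")
        if not separator or not source or not replacement:
            raise ValueError(f"malformed rule: {rule!r}")
        mapping[source] = replacement
    return mapping


rules = parse_pronunciation_rules(["omg/oh my god"])
print(rules)  # {'omg': 'oh my god'}
```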
The ID of the corresponding audio file returned after successful task creation.
After the task is completed, you can query the result using the file_id. This field is not returned when the request fails.
Note: The returned download URL is valid for 9 hours (32,400 seconds) from generation. After it expires, the file becomes invalid and the generated content is lost, so download it before then.
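Since the download URL expires 9 hours after generation, a client may want to track the deadline. The following sketch assumes the caller records the generation time itself; `url_expires_at` and `is_url_expired` are hypothetical helpers, and nothing here implies that the API returns a timestamp.

```python
from datetime import datetime, timedelta, timezone

URL_TTL = timedelta(seconds=32_400)  # 9 hours, as stated in the note above


def url_expires_at(generated_at):
    """Return the moment the download URL stops being valid."""
    return generated_at + URL_TTL


def is_url_expired(generated_at, now=None):
    """Return True once the 9-hour validity window has passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= url_expires_at(generated_at)
```

Using timezone-aware datetimes throughout avoids mixing local and UTC times when the deadline is compared on a different machine.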