MiniMax Speech 2.8 Turbo Async Text-to-Speech

curl --request POST \
  --url https://api.myrouter.ai/v3/async/minimax-speech-2.8-turbo \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "text_file_id": 123,
  "voice_modify": {
    "pitch": 123,
    "timbre": 123,
    "intensity": 123,
    "sound_effects": "<string>"
  },
  "audio_setting": {
    "format": "<string>",
    "bitrate": 123,
    "channel": 123,
    "audio_sample_rate": 123
  },
  "voice_setting": {
    "vol": 123,
    "pitch": 123,
    "speed": 123,
    "emotion": "<string>",
    "voice_id": "<string>",
    "english_normalization": true
  },
  "aigc_watermark": true,
  "language_boost": "<string>",
  "continuous_sound": true,
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  }
}
'

{
  "file_id": 123,
  "task_id": "<string>",
  "base_resp": {
    "status_msg": "<string>",
    "status_code": 123
  },
  "task_token": "<string>",
  "usage_characters": 123
}

POST /v3/async/minimax-speech-2.8-turbo

Use this API to create an async text-to-speech task. Input can be supplied either as text (up to 50,000 characters) or as an uploaded text file (under 100,000 characters).

This is an async API: the call creates a task and returns its task_id rather than the audio itself. Use the task_id to call the Get Async Task Result API to retrieve the generated result.

Request Headers

Content-Type

string

required

Enum: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.
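
As an illustration of the two required headers, the sketch below builds (but does not send) a request using only Python's standard library. The build_request helper, the placeholder key, and the minimal payload are assumptions made for this example; only the URL and the header formats come from this page.

```python
import json
import urllib.request

API_URL = "https://api.myrouter.ai/v3/async/minimax-speech-2.8-turbo"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request carrying the two required headers (hypothetical helper)."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


# Placeholder key and a minimal payload; nothing is sent over the network here.
req = build_request(
    "YOUR_API_KEY",
    {"text": "Hello", "voice_setting": {"voice_id": "English_Graceful_Lady"}},
)
```

Passing req to urllib.request.urlopen would submit the task; that step is left out so the example runs without credentials.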

Request Body

text

string

The text to be synthesized into audio, limited to a maximum of 50,000 characters. Either this or text_file_id is required.

Interjection tags: Interjection tags can be inserted into the text only when the model is speech-2.8-hd or speech-2.8-turbo. Supported tags:
(laughs): laughter
(chuckle): chuckle
(coughs): cough
(clear-throat): throat clearing
(groans): groan
(breath): normal breathing
(pant): panting
(inhale): inhale
(exhale): exhale
(gasps): gasp
(sniffs): sniff
(sighs): sigh
(snorts): snort
(burps): burp
(lip-smacking): lip smacking
(humming): humming
(hissing): hissing
(emm): umm
(whistles): whistle
(sneezes): sneeze
(crying): sobbing
(applause): applause
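
The length limit and the tag list above can be checked on the client before a task is submitted. The following is a minimal sketch, not part of the API: check_text is a hypothetical helper, it assumes the limit counts Python string characters, and it treats any lowercase word in parentheses as a candidate tag, so ordinary parentheticals such as "(optional)" would also be reported.

```python
import re

MAX_TEXT_CHARS = 50_000  # limit stated for the text field

# Interjection tags listed in the text field description.
SUPPORTED_TAGS = {
    "laughs", "chuckle", "coughs", "clear-throat", "groans", "breath", "pant",
    "inhale", "exhale", "gasps", "sniffs", "sighs", "snorts", "burps",
    "lip-smacking", "humming", "hissing", "emm", "whistles", "sneezes",
    "crying", "applause",
}


def check_text(text: str) -> list[str]:
    """Return a list of problems found in text (an empty list means none found)."""
    problems = []
    if len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} characters; limit is {MAX_TEXT_CHARS}")
    # Assumption: any lowercase token in parentheses is meant as an interjection tag.
    for tag in re.findall(r"\(([a-z-]+)\)", text):
        if tag not in SUPPORTED_TAGS:
            problems.append(f"unsupported interjection tag: ({tag})")
    return problems
```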

text_file_id

integer

The ID of the text file to synthesize. A single file must contain fewer than 100,000 characters. Supported file formats: txt, zip. Either this or text is required; the file format is validated automatically on submission.
txt file: Must contain fewer than 100,000 characters. Custom pauses can be inserted with <#x#> markers, where x is the pause duration in seconds, in the range [0.01, 99.99] with up to two decimal places. A pause marker must sit between two segments of speakable text; consecutive pause markers are not allowed.
zip file: The archive must contain txt or json files, all in the same format.
json file format: Supports three fields, [title, content, extra], representing the title, body, and additional information respectively. If all three fields are present, 3 sets of results are produced, totaling 9 files stored in a single folder. If a field is missing or empty, no result is generated for that field.
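
The pause-marker rules for txt files can be sketched as two small helpers. Both pause_marker and join_with_pauses are hypothetical names introduced here; the sketch assumes no whitespace is needed around a marker, which this page does not specify.

```python
def pause_marker(seconds: float) -> str:
    """Format a pause marker such as <#1.5#> (hypothetical helper)."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause duration must be within [0.01, 99.99] seconds")
    # Keep at most two decimal places, then drop trailing zeros.
    value = f"{seconds:.2f}".rstrip("0").rstrip(".")
    return f"<#{value}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Place exactly one pause marker between each pair of non-empty segments."""
    spoken = [s.strip() for s in segments if s.strip()]
    return pause_marker(seconds).join(spoken)
```

Because empty segments are dropped before joining, the result never starts or ends with a marker and never contains two markers in a row.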

voice_modify

object

pitch

integer

Pitch adjustment (deep/bright), range [-100, 100]. Values closer to -100 produce a deeper sound; values closer to 100 produce a brighter sound.

timbre

integer

Timbre adjustment (rich/crisp), range [-100, 100]. Values closer to -100 produce a richer sound; values closer to 100 produce a crisper sound.

intensity

integer

Intensity adjustment (powerful/soft), range [-100, 100]. Values closer to -100 produce a more powerful sound; values closer to 100 produce a softer sound.

sound_effects

string

Sound effect settings. Only one can be selected at a time. Possible values:

spacious_echo (Spacious Echo)
auditorium_echo (Auditorium Broadcast)
lofi_telephone (Telephone Distortion)
robotic (Electronic Voice)
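
A client can enforce the voice_modify ranges and the sound effect list before sending a request. The sketch below is one way to do that; build_voice_modify is a hypothetical helper, and omitting unset fields from the object is a choice made for this example.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def build_voice_modify(pitch=None, timbre=None, intensity=None, sound_effects=None) -> dict:
    """Assemble a voice_modify object, rejecting out-of-range values."""
    result = {}
    for name, value in (("pitch", pitch), ("timbre", timbre), ("intensity", intensity)):
        if value is None:
            continue  # unset fields are simply left out
        if isinstance(value, bool) or not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        result[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unknown sound effect: {sound_effects}")
        result["sound_effects"] = sound_effects
    return result
```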

audio_setting

object

format

string

default:"mp3"

The format of the generated audio. Options: [mp3, pcm, flac]. Default: mp3.

bitrate

integer

default:128000

The bitrate of the generated audio. Options: [32000, 64000, 128000, 256000], Default: 128000. This parameter only applies to mp3 format audio.

channel

integer

default:2

The number of audio channels. Options: [1, 2], where 1 is mono and 2 is stereo. Default: 1

audio_sample_rate

integer

default:32000

The sample rate of the generated audio. Options: [8000, 16000, 22050, 24000, 32000, 44100], Default: 32000
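
The audio_setting options form a small set of allowed values per field, which lends itself to a lookup-table check. In this sketch, build_audio_setting is a hypothetical helper. Rejecting a bitrate alongside a non-mp3 format is a design choice of the example; this page only says the parameter applies to mp3 and does not say how the API reacts otherwise.

```python
AUDIO_OPTIONS = {
    "format": {"mp3", "pcm", "flac"},
    "bitrate": {32000, 64000, 128000, 256000},
    "channel": {1, 2},
    "audio_sample_rate": {8000, 16000, 22050, 24000, 32000, 44100},
}


def build_audio_setting(**settings) -> dict:
    """Validate audio_setting fields against the documented options."""
    for name, value in settings.items():
        allowed = AUDIO_OPTIONS.get(name)
        if allowed is None:
            raise ValueError(f"unknown audio_setting field: {name}")
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    # The format defaults to mp3, so a missing format is treated as mp3 here.
    if "bitrate" in settings and settings.get("format", "mp3") != "mp3":
        raise ValueError("bitrate only applies to mp3 output")
    return dict(settings)
```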

voice_setting

object

required

vol

number

default:1

The volume of the synthesized audio. Higher values result in louder volume. Range: (0, 10]. Default: 1.0.

pitch

integer

default:0

The pitch of the synthesized audio. Range: [-12, 12]. Default: 0, which outputs the original voice.

speed

number

default:1

The speech rate of the synthesized audio. Higher values result in faster speech. Range: [0.5, 2]. Default: 1.0.

emotion

string

Controls the emotion of the synthesized speech. Possible values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "fluent", "whisper"], corresponding to 9 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral, vivid, whisper.
The model automatically matches appropriate emotions based on the input text; manual specification is generally unnecessary.
This parameter only applies to the speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, and speech-01-turbo models.
The fluent and whisper options only apply to the speech-2.6-turbo and speech-2.6-hd models.

voice_id

string

required

The voice ID for the synthesized audio. To set mixed voices, use the timber_weights parameter and set this parameter to an empty value. Supports system voices, cloned voices, and text-generated voices. Below are some of the latest system voice IDs; see the official documentation for the full list of supported voices.
Chinese:
moss_audio_ce44fc67-7ce3-11f0-8de5-96e35d26fb85
moss_audio_aaa1346a-7ce7-11f0-8e61-2e6e3c7ee85d
Chinese (Mandarin)_Lyrical_Voice
Chinese (Mandarin)_HK_Flight_Attendant
English:
English_Graceful_Lady
English_Insightful_Speaker
English_radiant_girl
English_Persuasive_Man
moss_audio_6dc281eb-713c-11f0-a447-9613c873494c
moss_audio_570551b1-735c-11f0-b236-0adeeecad052
moss_audio_ad5baf92-735f-11f0-8263-fe5a2fe98ec8
English_Lucky_Robot
Japanese:
Japanese_Whisper_Belle
moss_audio_24875c4a-7be4-11f0-9359-4e72c55db738
moss_audio_7f4ee608-78ea-11f0-bb73-1e2a4cfcd245
moss_audio_c1a6a3ac-7be6-11f0-8e8e-36b92fbb4f95

english_normalization

boolean

default:false

Enables English text normalization. When enabled, it can improve performance in number reading scenarios but slightly increases latency. Default: false.
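
The voice_setting defaults and ranges can be captured in a single builder. This is a sketch under stated assumptions: build_voice_setting is a hypothetical helper, the volume check follows the (0, 10] range given in the prose description, and an empty voice_id is allowed because the voice_id description permits it for mixed voices.

```python
def build_voice_setting(
    voice_id: str,
    vol: float = 1.0,
    pitch: int = 0,
    speed: float = 1.0,
    english_normalization: bool = False,
) -> dict:
    """Assemble a voice_setting object using the documented defaults and ranges."""
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be in [-12, 12]")
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    return {
        "voice_id": voice_id,
        "vol": vol,
        "pitch": pitch,
        "speed": speed,
        "english_normalization": english_normalization,
    }
```

For example, build_voice_setting("English_Graceful_Lady", speed=1.25) keeps every other field at its default.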

aigc_watermark

boolean

default:false

Controls whether an audio rhythm identifier is added at the end of the synthesized audio. Default: false. This parameter only applies to non-streaming synthesis.

language_boost

string

Whether to enhance recognition of specified minority languages and dialects. Default: null. Set to auto to let the model determine the language automatically.
Possible values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto

continuous_sound

boolean

default:false

Enable this parameter to make clause transitions more natural. Only supported for speech-2.8-hd and speech-2.8-turbo models.

pronunciation_dict

object

tone

array

Defines pronunciation annotation or replacement rules for characters or symbols that require special marking. In Chinese text, tones are represented by numbers: 1st tone is 1, 2nd tone is 2, 3rd tone is 3, 4th tone is 4, neutral tone is 5. Example: ["omg/oh my god"]
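
The documented example suggests each rule is a single string of the form source/replacement. The sketch below builds a pronunciation_dict on that assumption; tone_rule is a hypothetical helper, and the exact rule grammar beyond the one example on this page is not confirmed here.

```python
def tone_rule(source: str, replacement: str) -> str:
    """Format one rule as "source/replacement", the shape of the documented example."""
    if not source or not replacement:
        raise ValueError("both source and replacement are required")
    if "/" in source:
        raise ValueError("source must not contain '/', which separates the two parts")
    return f"{source}/{replacement}"


# Reproduces the documented example value.
pronunciation_dict = {"tone": [tone_rule("omg", "oh my god")]}
```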

Response

file_id

integer

The ID of the corresponding audio file returned after successful task creation.

Once the task is complete, the result can be queried with the file_id. This field is not returned when the request fails.
Note: The returned download URL is valid for 9 hours (32,400 seconds) from generation. After it expires, the file becomes invalid and the generated content is lost, so download it in time.

task_id

string

Use the task_id to call the Get Async Task Result API to retrieve the generated output.

base_resp

object

status_msg

string

required

Status details

status_code

integer

required

Status code

0: Normal
1002: Rate limited
1004: Authentication failed
1039: TPM rate limit triggered
1042: Illegal characters exceed 10%
2013: Parameter error
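
A caller typically inspects base_resp before using the rest of the response. The sketch below turns a non-zero status_code into an exception. TaskCreationError and check_base_resp are hypothetical names, and treating the two rate-limit codes as retryable is an assumption of this example, not something this page states.

```python
STATUS_MESSAGES = {
    0: "Normal",
    1002: "Rate limited",
    1004: "Authentication failed",
    1039: "TPM rate limit triggered",
    1042: "Illegal characters exceed 10%",
    2013: "Parameter error",
}

RETRYABLE_CODES = {1002, 1039}  # assumption: rate limits may succeed on a later retry


class TaskCreationError(Exception):
    """Raised when base_resp reports a non-zero status code."""

    def __init__(self, code, message, retryable):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.retryable = retryable


def check_base_resp(response: dict) -> dict:
    """Return the response unchanged on success, otherwise raise TaskCreationError."""
    base = response.get("base_resp") or {}
    code = base.get("status_code")
    if code == 0:
        return response
    # Prefer the server's own message; fall back to the documented description.
    message = base.get("status_msg") or STATUS_MESSAGES.get(code, "Unknown status")
    raise TaskCreationError(code, message, code in RETRYABLE_CODES)
```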

task_token

string

The key information used to complete the current task

usage_characters

integer

Billed character count
