MiniMax Speech 2.8 HD Sync Text-to-Speech

curl --request POST \
  --url https://api.myrouter.ai/v3/minimax-speech-2.8-hd \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "stream": true,
  "voice_modify": {
    "pitch": 123,
    "timbre": 123,
    "intensity": 123,
    "sound_effects": "<string>"
  },
  "audio_setting": {
    "format": "<string>",
    "bitrate": 123,
    "channel": 123,
    "force_cbr": true,
    "sample_rate": 123
  },
  "output_format": "<string>",
  "voice_setting": {
    "vol": 123,
    "pitch": 123,
    "speed": 123,
    "emotion": "<string>",
    "voice_id": "<string>",
    "latex_read": true,
    "text_normalization": true
  },
  "aigc_watermark": true,
  "language_boost": "<string>",
  "stream_options": {
    "exclude_aggregated_audio": true
  },
  "timber_weights": [
    {
      "weight": 123,
      "voice_id": "<string>"
    }
  ],
  "subtitle_enable": true,
  "continuous_sound": true,
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  }
}
'

{
  "data": {},
  "trace_id": "<string>",
  "base_resp": {},
  "extra_info": {}
}

Converts text to speech, with support for multiple voices, emotion control, speed adjustment, and more. The text must be shorter than 10,000 characters. For text longer than 3,000 characters, streaming output is recommended.

Request Headers

Content-Type

string

required

Enum: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.
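As a minimal sketch of the two required headers, the following builds (but does not send) a request with Python's standard library. The endpoint URL is taken from the curl sample on this page; `YOUR_API_KEY` and `example-voice` are placeholders, and the helper name `build_request` is illustrative rather than part of any SDK.

```python
import json
import urllib.request

# Endpoint taken from the curl sample on this page.
API_URL = "https://api.myrouter.ai/v3/minimax-speech-2.8-hd"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the two required headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# "YOUR_API_KEY" and "example-voice" are placeholders, not real values.
request = build_request(
    "YOUR_API_KEY",
    {"text": "Hello.", "voice_setting": {"voice_id": "example-voice"}},
)
# To send it: urllib.request.urlopen(request)
```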

Request Body

text

string

required

The text to be synthesized into speech. It must be shorter than 10,000 characters. For text longer than 3,000 characters, streaming output is recommended. Supports paragraph breaks (newline characters), pause control (<#x#> markers), and interjection tags such as (laughs) and (coughs). Interjection tags are supported only by speech-2.8-hd and speech-2.8-turbo.
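The length rules above could be enforced on the client side roughly as follows. This is a sketch under assumptions: it counts characters with Python's `len()`, which may not match how the service counts them, and the helper name `build_text_request` and the voice ID are hypothetical.

```python
MAX_CHARS = 10_000         # documented limit: text must be shorter than this
STREAM_HINT_CHARS = 3_000  # documented threshold above which streaming is recommended


def build_text_request(text: str, voice_id: str) -> dict:
    """Return a minimal request body, turning on streaming for long text."""
    if not text:
        raise ValueError("text is required")
    # Assumption: len() approximates the service's character count.
    if len(text) >= MAX_CHARS:
        raise ValueError(f"text must be shorter than {MAX_CHARS} characters")
    return {
        "text": text,
        "stream": len(text) > STREAM_HINT_CHARS,
        "voice_setting": {"voice_id": voice_id},
    }


# "example-voice" is a placeholder voice ID.
body = build_text_request("Hello there. (laughs)", "example-voice")
```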

stream

boolean

default:false

Controls whether to enable streaming output. Default: false (streaming disabled).

voice_modify

object

pitch

integer

Pitch adjustment (deep/bright). Range: [-100, 100]. Values closer to -100 produce a deeper sound; values closer to 100 produce a brighter sound.

timbre

integer

Timbre adjustment (rich/crisp). Range: [-100, 100]. Values closer to -100 produce a richer sound; values closer to 100 produce a crisper sound.

intensity

integer

Intensity adjustment (powerful/soft). Range: [-100, 100]. Values closer to -100 produce a more powerful sound; values closer to 100 produce a softer sound.

sound_effects

string

Sound effect setting. Only one effect can be selected at a time. Possible values: spacious_echo (Spacious Echo), auditorium_echo (Auditorium Broadcast), lofi_telephone (Telephone Distortion), robotic (Electronic Voice).
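The voice_modify ranges and allowed effects listed above can be checked before sending a request. The helper below is an illustrative sketch (the name `make_voice_modify` is hypothetical); it omits any field that is not supplied, since this page does not state defaults for these fields.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def make_voice_modify(pitch=None, timbre=None, intensity=None, sound_effects=None) -> dict:
    """Build a voice_modify object, validating each supplied field."""
    result = {}
    for name, value in (("pitch", pitch), ("timbre", timbre), ("intensity", intensity)):
        if value is None:
            continue  # leave unspecified fields out of the payload
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be in [-100, 100]")
        result[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"sound_effects must be one of {sorted(SOUND_EFFECTS)}")
        result["sound_effects"] = sound_effects
    return result


# A deeper voice with the electronic-voice effect.
voice_modify = make_voice_modify(pitch=-20, sound_effects="robotic")
```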

audio_setting

object

format

string

default:"mp3"

The format of the generated audio. Possible values: mp3, pcm, flac, wav. wav is supported only in non-streaming output.

bitrate

integer

default:128000

The bitrate of the generated audio. Possible values: 32000, 64000, 128000, 256000. Default: 128000. This parameter applies only to mp3 audio.

channel

integer

default:1

The number of audio channels. Possible values: 1 (mono), 2 (stereo). Default: 1.

force_cbr

boolean

default:false

Controls constant bitrate (CBR) encoding for audio. When set to true, audio will be encoded at a constant bitrate. Note: This parameter only takes effect when streaming output is enabled and the audio format is mp3.

sample_rate

integer

default:32000

The sample rate of the generated audio. Possible values: 8000, 16000, 22050, 24000, 32000, 44100. Default: 32000.
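The audio_setting fields interact with each other and with stream: wav is non-streaming only, bitrate applies only to mp3, and force_cbr only takes effect for streamed mp3. One way to encode those rules on the client is sketched below. The helper name `make_audio_setting` is hypothetical, and dropping fields that would have no effect is a design choice of this sketch, not documented API behavior.

```python
FORMATS = {"mp3", "pcm", "flac", "wav"}
BITRATES = {32000, 64000, 128000, 256000}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}


def make_audio_setting(fmt="mp3", sample_rate=32000, bitrate=128000,
                       channel=1, force_cbr=False, stream=False) -> dict:
    """Build an audio_setting object using the documented defaults and constraints."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {sorted(FORMATS)}")
    if fmt == "wav" and stream:
        raise ValueError("wav is only supported in non-streaming output")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(SAMPLE_RATES)}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")

    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    if fmt == "mp3":
        # bitrate applies to mp3 only, so it is omitted for other formats.
        if bitrate not in BITRATES:
            raise ValueError(f"bitrate must be one of {sorted(BITRATES)}")
        setting["bitrate"] = bitrate
        # force_cbr only takes effect for streamed mp3.
        if stream and force_cbr:
            setting["force_cbr"] = True
    return setting


default_setting = make_audio_setting()
```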

output_format

string

default:"hex"

Controls the output format. Possible values: url, hex. Default: hex. This parameter takes effect only in non-streaming scenarios; streaming scenarios support hex output only. A returned URL is valid for 24 hours.
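With hex output, the audio arrives as a hexadecimal string that has to be decoded into bytes before it can be played or saved. A minimal sketch of that step follows. This page does not show where the hex string sits inside the response, so the functions take the string directly; the function names and the output path are illustrative.

```python
from pathlib import Path


def hex_to_audio_bytes(hex_audio: str) -> bytes:
    """Decode a hex-encoded audio string into raw bytes."""
    return bytes.fromhex(hex_audio)


def save_hex_audio(hex_audio: str, path: str) -> int:
    """Decode hex audio and write it to `path`; returns the number of bytes written."""
    return Path(path).write_bytes(hex_to_audio_bytes(hex_audio))


# "494433" is the hex encoding of the three ASCII bytes "ID3",
# used here only as a tiny stand-in for real audio data.
sample = hex_to_audio_bytes("494433")
```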

voice_setting

object

vol

number

default:1

The volume of the synthesized audio. Higher values are louder. Range: (0, 10]. Default: 1.0.

pitch

integer

default:0

The pitch of the synthesized audio. Range: [-12, 12]. Default: 0, which outputs the original voice.

speed

number

default:1

The speech rate of the synthesized audio. Higher values result in faster speech. Range: [0.5, 2]. Default: 1.0.

emotion

string

Controls the emotion of the synthesized speech. Possible values: happy, sad, angry, fearful, disgusted, surprised, calm (neutral), fluent (vivid), whisper. The model automatically matches an appropriate emotion to the input text, so manual specification is generally unnecessary.

voice_id

string

required

The voice ID for the synthesized audio. To set mixed voices, use the timber_weights parameter and set this parameter to an empty value. Supports system voices, cloned voices, and text-generated voices.

latex_read

boolean

default:false

Controls whether to read LaTeX formulas aloud. Default: false. Only supports Chinese; when enabled, the language_boost parameter will be set to Chinese.

text_normalization

boolean

default:false

Enables Chinese and English text normalization. When enabled, it can improve performance in number reading scenarios but slightly increases latency. Default: false.

aigc_watermark

boolean

default:false

Controls whether to add an audio rhythm identifier at the end of the synthesized audio. Default: false. This parameter only applies to non-streaming synthesis.

language_boost

string

Enhances recognition of the specified minority language or dialect. Default: null. Set to auto to let the model determine the language automatically. Possible values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto

stream_options

object

exclude_aggregated_audio

boolean

default:false

Controls whether the last chunk includes the concatenated speech hex data. Default: false, meaning the last chunk contains the complete concatenated speech hex data.

timber_weights

array

Mixed voice settings. Supports mixing up to 4 voices.

weight

integer

required

The weight of each voice in the synthesized audio. Must be provided together with voice_id. Range: [1, 100]. Supports mixing up to 4 voices. A higher proportion for a single voice results in greater similarity to that voice.

voice_id

string

required

The voice ID for the synthesized audio. Must be provided together with the weight parameter. Supports system voices, cloned voices, and text-generated voices.
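A mixed-voice request body might be assembled as in the sketch below. It assumes that the "empty value" for voice_setting.voice_id means an empty string, which this page does not spell out. The voice IDs and the helper name `make_mixed_voice_body` are hypothetical.

```python
MAX_MIXED_VOICES = 4  # documented limit on the number of mixed voices


def make_mixed_voice_body(text: str, weights: dict) -> dict:
    """Build a request body that mixes voices; `weights` maps voice_id -> weight."""
    if not 1 <= len(weights) <= MAX_MIXED_VOICES:
        raise ValueError(f"provide between 1 and {MAX_MIXED_VOICES} voices")
    entries = []
    for voice_id, weight in weights.items():
        if not voice_id:
            raise ValueError("each entry needs a voice_id")
        if isinstance(weight, bool) or not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError("weight must be an integer in [1, 100]")
        entries.append({"voice_id": voice_id, "weight": weight})
    return {
        "text": text,
        # Assumption: "empty value" is interpreted as an empty string.
        "voice_setting": {"voice_id": ""},
        "timber_weights": entries,
    }


# "voice-a" and "voice-b" are placeholder voice IDs.
mixed = make_mixed_voice_body("Hello.", {"voice-a": 70, "voice-b": 30})
```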

subtitle_enable

boolean

default:false

Controls whether to enable the subtitle service. Default: false. This parameter is only effective in non-streaming output scenarios and only applies to speech-2.6-hd, speech-2.6-turbo, speech-02-turbo, speech-02-hd, speech-01-turbo, speech-01-hd models.

continuous_sound

boolean

default:false

Enable this parameter to make clause transitions more natural. Only supported for speech-2.8-hd and speech-2.8-turbo models.

pronunciation_dict

object

tone

array

Defines pronunciation annotation or replacement rules for characters or symbols that require special marking. In Chinese text, tones are represented by numbers: 1st tone is 1, 2nd tone is 2, 3rd tone is 3, 4th tone is 4, neutral tone is 5. Example: ["omg/oh my god"]

Response

data

object

The returned synthesis data object. May be null; null-check is required.

trace_id

string

The session ID for this request, used to help locate issues during inquiries or feedback.

base_resp

object

The status code and details of this request.

extra_info

object

Additional information about the audio.
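Because data may be null, a client should check it before reading any audio fields. The sketch below raises an error that carries trace_id and base_resp so the failure can be reported. The exception class and function name are illustrative, and no assumption is made about the fields inside data or base_resp.

```python
class SpeechResponseError(Exception):
    """Raised when a response carries no synthesis data."""


def require_data(response: dict) -> dict:
    """Return the `data` object, or raise with trace_id and base_resp for diagnosis."""
    data = response.get("data")
    if data is None:
        raise SpeechResponseError(
            "no synthesis data returned "
            f"(trace_id={response.get('trace_id')!r}, base_resp={response.get('base_resp')!r})"
        )
    return data


# A stand-in response shaped like the sample at the top of this page.
ok = require_data({"data": {}, "trace_id": "example-trace", "base_resp": {}, "extra_info": {}})
```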
