Gemini 2.5 Flash TTS Text-to-Speech

curl --request POST \
  --url https://api.myrouter.ai/v3/gemini-2.5-flash-tts \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '
{
  "contents": {
    "role": "<string>",
    "parts": {
      "text": "<string>"
    }
  },
  "generation_config": {
    "temperature": 123,
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "<string>"
        }
      },
      "language_code": "<string>",
      "multi_speaker_voice_config": {
        "speaker_voice_configs": [
          {
            "speaker": "<string>",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "<string>"
              }
            }
          }
        ]
      }
    }
  }
}
'

{
  "audioContent": "<string>",
  "usageMetadata": {
    "totalTokenCount": 123,
    "promptTokenCount": 123,
    "candidatesTokenCount": 123
  }
}

POST

gemini-2.5-flash-tts

curl --request POST \
  --url https://api.myrouter.ai/v3/gemini-2.5-flash-tts \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '
{
  "contents": {
    "role": "<string>",
    "parts": {
      "text": "<string>"
    }
  },
  "generation_config": {
    "temperature": 123,
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "<string>"
        }
      },
      "language_code": "<string>",
      "multi_speaker_voice_config": {
        "speaker_voice_configs": [
          {
            "speaker": "<string>",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "<string>"
              }
            }
          }
        ]
      }
    }
  }
}
'

{
  "audioContent": "<string>",
  "usageMetadata": {
    "totalTokenCount": 123,
    "promptTokenCount": 123,
    "candidatesTokenCount": 123
  }
}

Converts text to speech based on the Vertex AI generateContent API. The request body format is fully consistent with the official Vertex AI API. Supports both synchronous (single request, single response) and streaming (single request, streamed response) modes. Output is in LINEAR16 PCM format (24kHz, mono, 16-bit signed little-endian) without a WAV header.

Request Headers

Content-Type

string

required

Enum: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.

Request Body

contents

object

required

Hide properties

role

string

default:"user"

required

Role, fixed as userPossible values: user

parts

object

required

Hide properties

text

string

required

The text content to synthesize into speech. The Vertex AI API combines prompt and text in a single field, formatted as ’: ’, e.g., ‘Say the following in a curious way: OK, so… tell me about this AI thing.’. Maximum size is 8000 bytes; audio exceeding 655 seconds will be truncated. Supports inline markup tags: [sigh], [laughing], [uhm], [sarcasm], [robotic], [shouting], [whispering], [extremely fast], [short pause], [medium pause], [long pause]Length limit: 0 - 8000

generation_config

object

required

Hide properties

temperature

number

default:2

Temperature parameter that controls randomness and creativity in speech generation. Higher values produce more creative and diverse output; lower values produce more predictable and focused output. Valid range is (0.0, 2.0], recommended value is 2.0Range: [0, 2]

speech_config

object

required

Hide properties

voice_config

object

Single-speaker voice configuration. Mutually exclusive with multi_speaker_voice_config

Hide properties

prebuilt_voice_config

object

Hide properties

voice_name

string

Prebuilt voice name (case-insensitive). 30 available voices (both male and female)Possible values: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi

language_code

string

required

Language code (BCP-47 format, case-insensitive). GA languages: ar-EG, bn-BD, nl-NL, en-IN, en-US, fr-FR, de-DE, hi-IN, id-ID, it-IT, ja-JP, ko-KR, mr-IN, pl-PL, pt-BR, ro-RO, ru-RU, es-ES, ta-IN, te-IN, th-TH, tr-TR, uk-UA, vi-VN. Preview languages include cmn-CN (Mandarin Chinese) and 63 othersPossible values: af-ZA, am-ET, ar-001, ar-EG, az-AZ, be-BY, bg-BG, bn-BD, ca-ES, ceb-PH, cmn-CN, cmn-TW, cs-CZ, da-DK, de-DE, el-GR, en-AU, en-GB, en-IN, en-US, es-419, es-ES, es-MX, et-EE, eu-ES, fa-IR, fi-FI, fil-PH, fr-CA, fr-FR, gl-ES, gu-IN, he-IL, hi-IN, hr-HR, ht-HT, hu-HU, hy-AM, id-ID, is-IS, it-IT, ja-JP, jv-JV, ka-GE, kn-IN, ko-KR, kok-IN, la-VA, lb-LU, lo-LA, lt-LT, lv-LV, mai-IN, mg-MG, mk-MK, ml-IN, mn-MN, mr-IN, ms-MY, my-MM, nb-NO, ne-NP, nl-NL, nn-NO, or-IN, pa-IN, pl-PL, ps-AF, pt-BR, pt-PT, ro-RO, ru-RU, sd-IN, si-LK, sk-SK, sl-SI, sq-AL, sr-RS, sv-SE, sw-KE, ta-IN, te-IN, th-TH, tr-TR, uk-UA, ur-PK, vi-VN

multi_speaker_voice_config

object

Multi-speaker voice configuration. Mutually exclusive with voice_config. Note: gemini-2.5-flash-lite-preview-tts does not support multi-speaker synthesis

Hide properties

speaker_voice_configs

array

List of speaker voice configurations

Hide properties

speaker

string

required

Speaker alias, must consist of alphanumeric characters only, no spaces. Must match the speaker identifier in contents.parts.text

voice_config

object

required

Hide properties

prebuilt_voice_config

object

Hide properties

voice_name

string

Response

audioContent

string

Base64 encoded audio content. Format is LINEAR16 PCM (24kHz, mono, 16-bit signed little-endian) without a WAV header. Clients can convert using ffmpeg: ffmpeg -f s16le -ar 24k -ac 1 -i input.raw output.wav

usageMetadata

object

Hide properties

totalTokenCount

integer

Total token count (promptTokenCount + candidatesTokenCount)

promptTokenCount

integer

Number of tokens consumed by the input text

candidatesTokenCount

integer

Number of tokens consumed by the output audio (approximately 25 tokens per second of audio)

Fish Audio Voice Cloning

​Request Headers

​Request Body

​Response

Request Headers

Request Body

Response