POST /v4beta/txt2speech
Fish Audio Text-to-Speech
curl --request POST \
  --url https://api.myrouter.ai/v4beta/txt2speech \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '
{
  "text": "<string>",
  "temperature": 123,
  "top_p": 123,
  "references": {
    "text": "<string>"
  },
  "reference_id": {},
  "prosody": {
    "speed": 123,
    "volume": 123
  },
  "chunk_length": 123,
  "normalize": true,
  "format": {},
  "sample_rate": {},
  "mp3_bitrate": {},
  "opus_bitrate": {},
  "latency": {}
}
'
For best results, upload reference audio with the Voice Cloning API before calling this endpoint. Doing so improves voice quality and reduces latency.
Fish Audio converts text to speech. Supported audio formats:
  • WAV / PCM
    • Sample rates: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
    • Default sample rate: 44.1kHz
    • 16-bit, mono
  • MP3
    • Sample rates: 32kHz, 44.1kHz
    • Default sample rate: 44.1kHz
    • Mono
    • Bitrates: 64kbps, 128kbps (default), 192kbps
  • Opus
    • Sample rate: 48kHz
    • Default sample rate: 48kHz
    • Mono
    • Bitrates: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
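
The format list above can be turned into a small client-side check before a request is sent. The following is a minimal sketch, not part of the API: the table is transcribed from the list, the helper name check_audio_options is hypothetical, and it assumes sample rates are passed in Hz (for example 44100 for 44.1kHz), which this page does not state explicitly.

```python
# Supported output options, transcribed from the list above.
# Assumption: sample rates are expressed in Hz (44.1kHz -> 44100).
_PCM_RATES = (8000, 16000, 24000, 32000, 44100)

SUPPORTED = {
    "wav": {"rates": _PCM_RATES, "bitrate_field": None, "bitrates": ()},
    "pcm": {"rates": _PCM_RATES, "bitrate_field": None, "bitrates": ()},
    "mp3": {
        "rates": (32000, 44100),
        "bitrate_field": "mp3_bitrate",
        "bitrates": (64, 128, 192),
    },
    "opus": {
        "rates": (48000,),
        "bitrate_field": "opus_bitrate",
        "bitrates": (-1000, 24, 32, 48, 64),  # -1000 means "auto"
    },
}


def check_audio_options(fmt, sample_rate=None, bitrate=None):
    """Return a list of problems; an empty list means the combination is listed."""
    spec = SUPPORTED.get(fmt)
    if spec is None:
        return [f"unsupported format: {fmt!r}"]

    problems = []
    if sample_rate is not None and sample_rate not in spec["rates"]:
        problems.append(f"{fmt} does not list sample rate {sample_rate}")
    if bitrate is not None:
        if spec["bitrate_field"] is None:
            problems.append(f"{fmt} has no bitrate setting")
        elif bitrate not in spec["bitrates"]:
            problems.append(f"{spec['bitrate_field']} does not list {bitrate}")
    return problems
```

Leaving sample_rate or bitrate as None skips that check, which mirrors letting the service apply its documented default.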

Request Headers

Content-Type
string
required
Enum: application/json
Authorization
string
required
Bearer authentication format: Bearer {{API Key}}.
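
As an illustration of the two required headers, here is a minimal sketch that builds them from an API key. The function name build_headers is hypothetical, and how the key is stored or loaded is left to the caller.

```python
def build_headers(api_key):
    """Build the two required request headers described above."""
    if not api_key:
        raise ValueError("an API key is required")
    return {
        # The only documented value for Content-Type.
        "Content-Type": "application/json",
        # Bearer authentication: the literal word "Bearer", a space, then the key.
        "Authorization": f"Bearer {api_key}",
    }
```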

Request Body

text
string
required
The text to be converted to speech.
temperature
number
Controls the randomness of speech generation. Higher values (e.g., 1.0) make the output more random, lower values (e.g., 0.1) make it more deterministic. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1
top_p
number
Controls diversity through nucleus sampling. Lower values (e.g., 0.1) make the output more focused, higher values (e.g., 1.0) allow more diversity. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1
references
ReferenceAudio · object[] | null
Reference audio for the voice. Sending reference audio requires MessagePack serialization. When provided, it overrides reference_voices and reference_texts.
reference_id
string | null
Reference model ID for the voice.
prosody
ProsodyControl · object
Prosody control for the voice.
chunk_length
integer
default:200
Chunk length for the voice. Required range: 100 <= x <= 300
normalize
boolean
default:true
Whether to normalize the voice. This will reduce latency but may decrease performance on numbers and dates.
format
enum<string>
default:"mp3"
Format for the voice. Possible values: wav, pcm, mp3, opus
sample_rate
integer | null
Sample rate for the voice.
mp3_bitrate
enum<integer>
default:128
MP3 bitrate for the voice. Possible values: 64, 128, 192
opus_bitrate
enum<integer>
default:32
Opus bitrate for the voice. Possible values: -1000, 24, 32, 48, 64
latency
enum<string>
default:"normal"
Latency setting for the voice. The balanced setting reduces latency but may result in decreased performance. Possible values: normal, balanced
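
The documented ranges and enums can be enforced on the client before the body is serialized. The sketch below is one possible way to do that, not an official client: build_payload and _in_range are hypothetical names, only a subset of the fields is covered, and any field left as None is omitted so that the service applies its own default.

```python
import json


def _in_range(name, value, low, high):
    """Raise ValueError unless low <= value <= high, else return value."""
    if not (low <= value <= high):
        raise ValueError(f"{name} must satisfy {low} <= x <= {high}, got {value}")
    return value


def build_payload(text, temperature=None, top_p=None, chunk_length=None,
                  fmt=None, latency=None):
    """Assemble a request body, checking the ranges documented above."""
    if not text:
        raise ValueError("text is required")
    payload = {"text": text}

    if temperature is not None:
        payload["temperature"] = _in_range("temperature", temperature, 0, 1)
    if top_p is not None:
        payload["top_p"] = _in_range("top_p", top_p, 0, 1)
    if chunk_length is not None:
        payload["chunk_length"] = _in_range("chunk_length", chunk_length, 100, 300)
    if fmt is not None:
        if fmt not in ("wav", "pcm", "mp3", "opus"):
            raise ValueError(f"format must be wav, pcm, mp3 or opus, got {fmt!r}")
        payload["format"] = fmt
    if latency is not None:
        if latency not in ("normal", "balanced"):
            raise ValueError(f"latency must be normal or balanced, got {latency!r}")
        payload["latency"] = latency
    return payload


# Example: the recommended sampling settings for the s1 model, serialized as JSON.
body = json.dumps(build_payload("Hello, world.", temperature=0.9, top_p=0.9,
                                chunk_length=200, fmt="mp3"))
```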

Response

The API returns an audio stream in the format specified by the format parameter (default: mp3).
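
Because the response body is the audio itself rather than JSON, a client would typically stream it straight to a file. The following is a minimal sketch using only the Python standard library. The names build_request and save_speech, the 60-second timeout, and the output path are illustrative assumptions. Error handling is limited to what urllib does by default, which is to raise HTTPError on a non-2xx status.

```python
import json
import shutil
import urllib.request

API_URL = "https://api.myrouter.ai/v4beta/txt2speech"


def build_request(api_key, payload):
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def save_speech(api_key, payload, out_path):
    """Send the request and stream the returned audio bytes to out_path."""
    request = build_request(api_key, payload)
    # The timeout value is an arbitrary choice for this sketch.
    with urllib.request.urlopen(request, timeout=60) as response, \
            open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return out_path
```

A call such as save_speech(api_key, {"text": "Hello", "format": "mp3"}, "hello.mp3") would write the stream to hello.mp3. The file extension should match the requested format, since the response carries raw audio in that format.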