Text-to-Speech API | Fish Audio

Fish Audio Text-to-Speech

curl --request POST \
  --url https://api.myrouter.ai/v4beta/txt2speech \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --output speech.mp3 \
  --data '
{
  "text": "<string>",
  "temperature": 0.9,
  "top_p": 0.9,
  "reference_id": "<string>",
  "prosody": {
    "speed": 1,
    "volume": 0
  },
  "chunk_length": 200,
  "normalize": true,
  "format": "mp3",
  "sample_rate": null,
  "mp3_bitrate": 128,
  "latency": "normal"
}
'

For best results, upload reference audio with the Voice Cloning API before calling this API. Doing so improves voice quality and reduces latency.

Fish Audio converts text to speech. Supported audio formats:

WAV / PCM
- Sample rates: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
- Default sample rate: 44.1kHz
- 16-bit, mono
MP3
- Sample rates: 32kHz, 44.1kHz
- Default sample rate: 44.1kHz
- Mono
- Bitrates: 64kbps, 128kbps (default), 192kbps
Opus
- Sample rate: 48kHz
- Default sample rate: 48kHz
- Mono
- Bitrates: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
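
The format, sample rate, and bitrate combinations listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the `SUPPORTED` table and `resolve_audio_settings` helper are hypothetical names, and the code assumes `sample_rate` is passed as an integer in Hz (for example 44100 for 44.1kHz), which this page does not state explicitly.

```python
# Supported output settings, transcribed from the list above.
# Assumption: sample rates are expressed in Hz in the request body.
_PCM_LIKE = {"sample_rates": (8000, 16000, 24000, 32000, 44100), "default_rate": 44100}

SUPPORTED = {
    "wav": _PCM_LIKE,
    "pcm": _PCM_LIKE,
    "mp3": {
        "sample_rates": (32000, 44100),
        "default_rate": 44100,
        "bitrates": (64, 128, 192),
        "default_bitrate": 128,
    },
    "opus": {
        "sample_rates": (48000,),
        "default_rate": 48000,
        "bitrates": (-1000, 24, 32, 48, 64),  # -1000 means "auto"
        "default_bitrate": 32,
    },
}


def resolve_audio_settings(fmt="mp3", sample_rate=None, bitrate=None):
    """Return request-body fields for a valid format combination.

    Raises ValueError when the combination is not in the table above.
    """
    if fmt not in SUPPORTED:
        raise ValueError(f"unsupported format: {fmt!r}")
    spec = SUPPORTED[fmt]

    rate = spec["default_rate"] if sample_rate is None else sample_rate
    if rate not in spec["sample_rates"]:
        raise ValueError(f"{fmt} does not support a sample rate of {rate} Hz")

    fields = {"format": fmt, "sample_rate": rate}
    if "bitrates" in spec:
        chosen = spec["default_bitrate"] if bitrate is None else bitrate
        if chosen not in spec["bitrates"]:
            raise ValueError(f"{fmt} does not support a bitrate of {chosen}")
        # Produces the mp3_bitrate or opus_bitrate field name.
        fields[f"{fmt}_bitrate"] = chosen
    elif bitrate is not None:
        raise ValueError(f"{fmt} does not take a bitrate")
    return fields
```

The returned dictionary can be merged into the request body, so an invalid combination such as MP3 at 8kHz fails locally instead of at the server.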

Request Headers

Content-Type

string

required

Enum: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.

Request Body

text

string

required

The text to be converted to speech.

temperature

number

Controls the randomness of speech generation. Higher values (e.g., 1.0) make the output more random; lower values (e.g., 0.1) make it more deterministic. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1

top_p

number

Controls diversity through nucleus sampling. Lower values (e.g., 0.1) make the output more focused; higher values (e.g., 1.0) allow more diversity. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1

references

ReferenceAudio · object[] | null

Reference audio for the voice. Sending this field requires MessagePack serialization. When provided, it overrides reference_voices and reference_texts.

audio

file

required

Reference audio file.

text

string

required

Reference text corresponding to the audio.

reference_id

string | null

Reference model ID for the voice.

prosody

ProsodyControl · object

Prosody control for the voice.

speed

number

default:1

Voice speed control.

volume

number

default:0

Voice volume control.

chunk_length

integer

default:200

Chunk length for the voice. Required range: 100 <= x <= 300

normalize

boolean

default:true

Whether to normalize the voice. This will reduce latency but may decrease performance on numbers and dates.

format

enum<string>

default:"mp3"

Format for the voice. Possible values: wav, pcm, mp3, opus

sample_rate

integer | null

Sample rate for the voice.

mp3_bitrate

enum<integer>

default:128

MP3 bitrate for the voice. Possible values: 64, 128, 192

opus_bitrate

enum<integer>

default:32

Opus bitrate for the voice. Possible values: -1000, 24, 32, 48, 64

latency

enum<string>

default:"normal"

Latency setting for the voice. The balanced setting reduces latency but may decrease performance. Possible values: normal, balanced
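
The ranges and enums documented for the body fields can be enforced while assembling the JSON payload. The sketch below is illustrative only: `build_tts_payload` is a hypothetical helper, and it assumes that leaving an optional field out of the body makes the server fall back to its documented default.

```python
def build_tts_payload(text, *, temperature=None, top_p=None, reference_id=None,
                      speed=1.0, volume=0.0, chunk_length=200, normalize=True,
                      latency="normal"):
    """Assemble a request body, checking the documented ranges and enums."""
    if not text:
        raise ValueError("text is required")

    # temperature and top_p share the same documented range: 0 <= x <= 1.
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if value is not None and not 0 <= value <= 1:
            raise ValueError(f"{name} must satisfy 0 <= x <= 1, got {value}")

    if not 100 <= chunk_length <= 300:
        raise ValueError(f"chunk_length must satisfy 100 <= x <= 300, got {chunk_length}")

    if latency not in ("normal", "balanced"):
        raise ValueError(f"latency must be 'normal' or 'balanced', got {latency!r}")

    payload = {
        "text": text,
        "prosody": {"speed": speed, "volume": volume},
        "chunk_length": chunk_length,
        "normalize": normalize,
        "latency": latency,
    }

    # Assumption: optional fields that are left unset can simply be omitted.
    if temperature is not None:
        payload["temperature"] = temperature
    if top_p is not None:
        payload["top_p"] = top_p
    if reference_id is not None:
        payload["reference_id"] = reference_id
    return payload
```

For example, `build_tts_payload("Hello", temperature=0.9, top_p=0.9)` yields a body that uses the values recommended above for the s1 model and the default prosody settings.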

Response

The API returns an audio stream in the format specified by the format parameter (default: mp3).
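
Because the response is an audio stream rather than JSON, a client typically copies the response body straight to a file. The following is a minimal sketch using only the Python standard library; `build_request` and `synthesize_to_file` are hypothetical helper names, and the environment variable in the usage note is likewise an assumption.

```python
import json
import shutil
import urllib.request

API_URL = "https://api.myrouter.ai/v4beta/txt2speech"


def build_request(api_key, payload):
    """Create the POST request with the two required headers."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize_to_file(api_key, payload, out_path):
    """Send the request and stream the returned audio to out_path.

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        # Copy in chunks so the whole audio stream is never held in memory.
        shutil.copyfileobj(response, out)
    return out_path
```

A call might look like `synthesize_to_file(os.environ["MYROUTER_API_KEY"], {"text": "Hello", "format": "mp3"}, "speech.mp3")`, where the variable name is a placeholder for wherever the API key is stored. The output file extension should match the requested format.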
