ElevenLabs Scribe v1 Speech to Text

curl --request POST \
  --url https://api.myrouter.ai/v3/elevenlabs-scribe-v1 \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "seed": 123,
  "diarize": true,
  "file_format": "<string>",
  "temperature": 0,
  "language_code": "<string>",
  "tag_audio_events": true,
  "cloud_storage_url": "<string>",
  "use_multi_channel": true,
  "diarization_threshold": 0.22,
  "timestamps_granularity": "<string>"
}
'

POST

elevenlabs-scribe-v1

Transcribes an audio or video file. When use_multi_channel is true and the audio has multiple channels, the response contains a ‘transcripts’ array with one transcription per channel. Otherwise, the response is a single transcription result.
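Because the response shape depends on use_multi_channel, client code may find it convenient to normalize both shapes into a list. The following is a minimal sketch, not part of the API: the helper name as_transcripts is hypothetical, and it assumes the response has already been parsed from JSON into a dict.

```python
def as_transcripts(response: dict) -> list[dict]:
    """Return a list of transcription objects for either response type.

    A multi-channel response carries a 'transcripts' array; a
    single-channel response is itself the transcription object.
    """
    if "transcripts" in response:
        return list(response["transcripts"])
    return [response]


# Single-channel response: wrapped in a one-element list.
single = {"text": "hello", "words": [], "language_code": "eng"}
print(len(as_transcripts(single)))

# Multi-channel response: one entry per channel.
multi = {"transcripts": [{"text": "hi"}, {"text": "hey"}]}
print([t["text"] for t in as_transcripts(multi)])
```

Downstream code can then loop over the returned list without checking which response type it received.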

Request Headers

Content-Type

string

required

Enum: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.
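As an illustration of the two required headers, the sketch below prepares a POST request with Python's standard library. It only builds the request object and does not send it. The function name build_request and the example.com file URL are placeholders, not part of the API; the endpoint URL is the one used in the curl example.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
API_URL = "https://api.myrouter.ai/v3/elevenlabs-scribe-v1"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request with both required headers."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(API_URL, data=data, headers=headers, method="POST")


# Placeholder key and file URL, for illustration only.
req = build_request("YOUR_API_KEY", {"cloud_storage_url": "https://example.com/audio.mp3"})
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request.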

Request Body

seed

integer

If specified, the system will make a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result, but determinism is not guaranteed. Range: [0, 2147483647]

diarize

boolean

default:false

Whether to annotate which speaker is currently speaking in the uploaded file.

file_format

string

default:"other"

Input audio format. ‘pcm_s16le_16’ requires 16-bit integer, little-endian, mono PCM audio at a 16 kHz sample rate, and has lower latency than encoded waveforms. Possible values: pcm_s16le_16, other

temperature

number

Controls the randomness of the transcription output. Higher values produce more diverse and less certain results. If omitted, the default temperature of the selected model is used (typically 0). Range: [0, 2]

num_speakers

integer

Maximum number of speakers in the uploaded file. Setting it can help the model distinguish speakers. Up to 32 speakers are supported. Range: [1, 32]

language_code

string

The ISO-639-1 or ISO-639-3 language code of the audio file. Specifying it in advance can sometimes improve transcription performance. Defaults to null, in which case the language is detected automatically.

tag_audio_events

boolean

default:true

Whether to tag audio events such as (laughter), (footsteps), etc. in the transcription.

cloud_storage_url

string

required

HTTPS URL of the file to transcribe. Either file or cloud_storage_url must be provided. The file must be accessible via HTTPS and smaller than 2GB. Supports any valid HTTPS address, including cloud storage (AWS S3, GCS, Cloudflare R2, etc.), CDNs, or other HTTPS sources. Supports pre-signed URLs with tokens or URL query parameter authentication.

use_multi_channel

boolean

default:false

Whether the audio file is multi-channel with each channel containing only a single speaker. When enabled, each channel will be transcribed independently and the results will be combined. Each word in the output will include a channel_index field. Up to 5 channels supported.

diarization_threshold

number

Speaker diarization threshold. A higher value makes it less likely that one person is split into multiple speakers, but more likely that different people are merged into one speaker (fewer speakers identified). A lower value has the opposite effect (more speakers identified). Can only be set when diarize is true and num_speakers is not set. Defaults to null, which selects a threshold based on the model ID (typically 0.22). Range: [0.1, 0.4]

timestamps_granularity

string

default:"word"

Granularity of timestamps in the transcription. ‘word’ provides word-level timestamps; ‘character’ provides character-level timestamps. Possible values: none, word, character
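The ranges, enumerations, and the diarization_threshold constraint listed above can be checked on the client before a request is sent. The following is a minimal sketch under that reading of the field descriptions; validate_body is a hypothetical helper, and the server remains the authority on what it accepts.

```python
def validate_body(body: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    errors = []

    url = body.get("cloud_storage_url")
    if not isinstance(url, str) or not url.startswith("https://"):
        errors.append("cloud_storage_url is required and must be an HTTPS URL")

    # Numeric ranges as documented for each field.
    ranges = {
        "seed": (0, 2147483647),
        "temperature": (0, 2),
        "num_speakers": (1, 32),
        "diarization_threshold": (0.1, 0.4),
    }
    for name, (low, high) in ranges.items():
        value = body.get(name)
        if value is None:
            continue
        if not isinstance(value, (int, float)) or not low <= value <= high:
            errors.append(f"{name} must be in [{low}, {high}]")

    # Enumerated string fields as documented.
    enums = {
        "file_format": {"pcm_s16le_16", "other"},
        "timestamps_granularity": {"none", "word", "character"},
    }
    for name, allowed in enums.items():
        value = body.get(name)
        if value is not None and value not in allowed:
            errors.append(f"{name} must be one of {sorted(allowed)}")

    # diarization_threshold needs diarize enabled and num_speakers unset.
    if body.get("diarization_threshold") is not None:
        if not body.get("diarize", False) or body.get("num_speakers") is not None:
            errors.append(
                "diarization_threshold requires diarize=true and no num_speakers"
            )

    return errors


# Placeholder URL for illustration only.
print(validate_body({"cloud_storage_url": "https://example.com/audio.wav",
                     "diarize": True, "diarization_threshold": 0.3}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue at once.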

Response

The response may be one of the following response types:

Response Type 1

text

string

required

The raw transcribed text.

words

array

required

List of words and their timing information.

end

number

End time (in seconds) of this word or sound in the audio.

text

string

required

The transcribed word or sound content.

type

string

required

The type of this word or sound. ‘audio_event’ is used for non-word sounds such as laughter or footsteps. Possible values: word, spacing, audio_event

start

number

Start time (in seconds) of this word or sound in the audio.

logprob

number

required

The log probability with which this word was predicted. Values range from -infinity to 0; higher values indicate greater model confidence.

characters

array

Characters that make up the word and their corresponding timing information.

end

number

End time (in seconds) of the character in the audio.

text

string

required

The transcribed character content.

start

number

Start time (in seconds) of the character in the audio.

speaker_id

string

Unique identifier of the speaker for this word.

channel_index

integer

Channel index corresponding to this transcription (applicable for multi-channel audio).

language_code

string

required

Detected language code (e.g., ‘eng’ for English).

transcription_id

string

Unique transcription ID for this response.

language_probability

number

required

Language detection confidence (between 0 and 1).
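Since logprob is a natural-log probability in the range described above, exponentiating it gives a value between 0 and 1 that is easier to compare against a threshold. The sketch below flags words whose probability falls under a cutoff. The helper names and the 0.5 cutoff are illustrative assumptions, not values recommended by the API.

```python
import math


def word_confidence(word: dict) -> float:
    """Convert a word's log probability into a probability in [0, 1]."""
    return math.exp(word["logprob"])


def low_confidence_words(words: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the text of 'word' entries whose probability is below threshold.

    Entries of type 'spacing' and 'audio_event' are skipped.
    """
    return [
        w["text"]
        for w in words
        if w["type"] == "word" and word_confidence(w) < threshold
    ]


# Hand-written sample entries shaped like items of the 'words' array.
sample = [
    {"text": "clear", "type": "word", "logprob": -0.1},
    {"text": " ", "type": "spacing", "logprob": 0.0},
    {"text": "mumble", "type": "word", "logprob": -2.0},
]
print(low_confidence_words(sample))
```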

Response Type 2

transcripts

array

required

List of transcriptions for each audio channel. Each transcription contains the text and word-level details for its respective channel.

text

string

required

The raw transcribed text.

words

array

required

List of words and their timing information.

end

number

End time (in seconds) of this word or sound in the audio.

text

string

required

The transcribed word or sound content.

type

string

required

The type of this word or sound. ‘audio_event’ is used for non-word sounds such as laughter or footsteps. Possible values: word, spacing, audio_event

start

number

Start time (in seconds) of this word or sound in the audio.

logprob

number

required

The log probability with which this word was predicted. Values range from -infinity to 0; higher values indicate greater model confidence.

characters

array

Characters that make up the word and their corresponding timing information.

end

number

End time (in seconds) of the character in the audio.

text

string

required

The transcribed character content.

start

number

Start time (in seconds) of the character in the audio.

speaker_id

string

Unique identifier of the speaker for this word.

channel_index

integer

Channel index corresponding to this transcription (applicable for multi-channel audio).

language_code

string

required

Detected language code (e.g., ‘eng’ for English).

transcription_id

string

Unique transcription ID for this response.

language_probability

number

required

Language detection confidence (between 0 and 1).

transcription_id

string

Unique transcription ID for this response.
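When diarize is enabled, the speaker_id on each entry of a words array can be used to group consecutive words into speaker turns. The following is a rough sketch that works on the words array of either response type. It assumes that entries of type ‘spacing’ carry the whitespace between words, so texts are concatenated directly. The function name speaker_turns and the speaker_0/speaker_1 identifiers in the sample are hypothetical.

```python
def speaker_turns(words: list[dict]) -> list[tuple[str, str]]:
    """Group consecutive entries with the same speaker_id into (speaker, text) turns."""
    turns: list[tuple[str, list[str]]] = []
    for w in words:
        # speaker_id is optional, so fall back to a placeholder label.
        speaker = w.get("speaker_id", "unknown")
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(w["text"])
        else:
            turns.append((speaker, [w["text"]]))
    # Join each turn's pieces and trim whitespace at the turn boundaries.
    return [(speaker, "".join(parts).strip()) for speaker, parts in turns]


# Hand-written sample shaped like a diarized 'words' array.
sample = [
    {"text": "Hello", "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "there", "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_1"},
    {"text": "Hi", "type": "word", "speaker_id": "speaker_1"},
]
for speaker, text in speaker_turns(sample):
    print(f"{speaker}: {text}")
```

For a multi-channel response, the same function can be applied to the words array of each item in transcripts.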
