POST /v3/elevenlabs-scribe-v1
ElevenLabs Scribe v1 Speech to Text
curl --request POST \
  --url https://api.myrouter.ai/v3/elevenlabs-scribe-v1 \
  --header 'Authorization: Bearer <api-key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "seed": 123,
  "diarize": true,
  "file_format": "other",
  "temperature": 0.5,
  "num_speakers": 2,
  "language_code": "eng",
  "tag_audio_events": true,
  "cloud_storage_url": "<string>",
  "use_multi_channel": true,
  "diarization_threshold": 0.22,
  "timestamps_granularity": "word"
}
'
Transcribes audio or video files. When use_multi_channel is true and the uploaded audio has multiple channels, returns a 'transcripts' object with one transcription per channel. Otherwise returns a single transcription result.
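The same request can be sketched in Python. This is a minimal sketch, not an official client: the endpoint, header format, and field names come from this reference, while the API key and audio URL are placeholders.

```python
import json

API_KEY = "YOUR_API_KEY"  # placeholder; use your real key
URL = "https://api.myrouter.ai/v3/elevenlabs-scribe-v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# Only cloud_storage_url is required; the rest are optional tuning fields.
payload = {
    "diarize": True,
    "num_speakers": 2,               # at most 32
    "language_code": "eng",          # omit to auto-detect the language
    "tag_audio_events": True,
    "cloud_storage_url": "https://example.com/audio.mp3",  # placeholder URL
    "timestamps_granularity": "word",
}

body = json.dumps(payload)

# To actually send the request (requires the third-party `requests` package):
# import requests
# resp = requests.post(URL, headers=headers, data=body)
# print(resp.json())
```

The network call is left commented out so the payload construction can be inspected on its own.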

Request Headers

Content-Type
string
required
Enum: application/json
Authorization
string
required
Bearer authentication format: Bearer {{API Key}}.

Request Body

seed
integer
If specified, the system will make a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result, but determinism is not guaranteed. Must be an integer between 0 and 2147483647.
Range: [0, 2147483647]
diarize
boolean
default:false
Whether to annotate which speaker is currently speaking in the uploaded file.
file_format
string
default:"other"
Input audio format. Options are 'pcm_s16le_16' or 'other'. pcm_s16le_16 requires audio at a 16 kHz sample rate, 16-bit integer, mono, little-endian, and has lower latency than encoded waveforms.
Possible values: pcm_s16le_16, other
temperature
number
Controls the randomness of the transcription output. Value range is 0.0 to 2.0; higher values produce more diverse and less certain results. If omitted, the default temperature of the selected model will be used (typically 0).
Range: [0, 2]
num_speakers
integer
Maximum number of speakers in the uploaded file. Can be used to help distinguish speakers. Up to 32 speakers supported.
Range: [1, 32]
language_code
string
Specifies the ISO-639-1 or ISO-639-3 language code of the audio file. Specifying it in advance can sometimes improve transcription performance. Defaults to null, which will automatically detect the language.
tag_audio_events
boolean
default:true
Whether to tag audio events such as (laughter), (footsteps), etc. in the transcription.
cloud_storage_url
string
required
HTTPS URL of the file to transcribe. Either file or cloud_storage_url must be provided. The file must be accessible via HTTPS and smaller than 2GB. Supports any valid HTTPS address, including cloud storage (AWS S3, GCS, Cloudflare R2, etc.), CDNs, or other HTTPS sources. Supports pre-signed URLs with tokens or URL query parameter authentication.
use_multi_channel
boolean
default:false
Whether the audio file is multi-channel with each channel containing only a single speaker. When enabled, each channel will be transcribed independently and the results will be combined. Each word in the output will include a channel_index field. Up to 5 channels supported.
diarization_threshold
number
Speaker diarization threshold. A higher value makes it less likely that one speaker is split into multiple speakers, but more likely that different speakers are merged into one (fewer speakers identified); a lower value has the opposite effect (more speakers identified). Can only be set when diarize=True and num_speakers=None. Defaults to None, which selects a threshold based on the model ID (typically 0.22).
Range: [0.1, 0.4]
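The constraints above (diarization_threshold requires diarize to be set, is incompatible with num_speakers, and must fall in [0.1, 0.4]) can be checked client-side before sending a request. This validator is a sketch of my own, not part of the API:

```python
def validate_diarization(payload: dict) -> None:
    """Raise ValueError if diarization fields violate the documented rules."""
    thr = payload.get("diarization_threshold")
    if thr is not None:
        if not payload.get("diarize"):
            raise ValueError("diarization_threshold can only be set when diarize=true")
        if payload.get("num_speakers") is not None:
            raise ValueError("diarization_threshold cannot be combined with num_speakers")
        if not 0.1 <= thr <= 0.4:
            raise ValueError("diarization_threshold must be in [0.1, 0.4]")
    n = payload.get("num_speakers")
    if n is not None and not 1 <= n <= 32:
        raise ValueError("num_speakers must be in [1, 32]")

# Valid: threshold set, diarize on, num_speakers omitted.
validate_diarization({"diarize": True, "diarization_threshold": 0.22})
```

Catching these locally avoids a round trip for a request the server would reject anyway.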
timestamps_granularity
string
default:"word"
Granularity of timestamps in the transcription. 'word' provides word-level timestamps; 'character' provides character-level timestamps.
Possible values: none, word, character

Response

The response is one of the following types: a single transcription result, or, when use_multi_channel is true and the audio has multiple channels, a multi-channel result whose transcripts array holds one transcription per channel.
text
string
required
The raw transcribed text.
words
array
required
List of words and their timing information.
channel_index
integer
Channel index corresponding to this transcription (applicable for multi-channel audio).
language_code
string
required
Detected language code (e.g., 'eng' for English).
transcription_id
string
Unique transcription ID for this response.
language_probability
number
required
Language detection confidence (between 0 and 1).
transcripts
array
required
List of transcriptions for each audio channel. Each transcription contains the text and word-level details for its respective channel.
transcription_id
string
Unique transcription ID for this response.
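Because the response shape depends on use_multi_channel, client code should handle both variants. A small sketch, using only field names from this reference (the sample response values are invented for illustration):

```python
def extract_texts(response: dict) -> list:
    """Return the transcribed text per channel (one entry if single-channel)."""
    if "transcripts" in response:
        # Multi-channel result: one transcription object per channel.
        return [t["text"] for t in response["transcripts"]]
    # Single-channel result: text sits at the top level.
    return [response["text"]]

# Hypothetical sample responses for illustration:
single = {
    "text": "hello world",
    "words": [],
    "language_code": "eng",
    "language_probability": 0.98,
}
multi = {"transcripts": [{"text": "channel one"}, {"text": "channel two"}]}

print(extract_texts(single))  # ['hello world']
print(extract_texts(multi))   # ['channel one', 'channel two']
```

Checking for the transcripts key is enough to distinguish the two documented response types.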