Transcribes audio or video files. When use_multi_channel is true and the uploaded audio has multiple channels, returns a ‘transcripts’ object with one transcription per channel. Otherwise returns a single transcription result.
If specified, the system will make a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result, but determinism is not guaranteed. Must be an integer. Range: [0, 2147483647]
Input audio format. Options are ‘pcm_s16le_16’ or ‘other’. pcm_s16le_16 requires the audio to be 16 kHz sample rate, 16-bit integer, mono, little-endian, and has lower latency than encoded waveforms. Possible values: pcm_s16le_16, other
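The pcm_s16le_16 requirements above can be checked locally before choosing that format. The following is a minimal sketch using Python's standard `wave` module; the helper name `is_pcm_s16le_16` is hypothetical, and it inspects a WAV container only. Whether the endpoint expects raw samples or a WAV file is not stated here, so treat this purely as a property check.

```python
import wave


def is_pcm_s16le_16(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono PCM.

    Hypothetical helper, not part of any SDK. WAV (RIFF) files store
    samples little-endian, so endianness is implied by the container.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000      # 16 kHz sample rate
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wav.getnchannels() == 1      # mono
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )
```

If the check fails, the audio would need to be resampled or sent with the ‘other’ format instead.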
Controls the randomness of the transcription output. Higher values produce more diverse and less certain results. If omitted, the default temperature of the selected model will be used (typically 0). Range: [0, 2]
Specifies the ISO-639-1 or ISO-639-3 language code of the audio file. Specifying it in advance can sometimes improve transcription performance. Defaults to null, in which case the language is detected automatically.
HTTPS URL of the file to transcribe. Either file or cloud_storage_url must be provided. The file must be accessible via HTTPS and smaller than 2GB. Supports any valid HTTPS address, including cloud storage (AWS S3, GCS, Cloudflare R2, etc.), CDNs, or other HTTPS sources. Supports pre-signed URLs with tokens or URL query parameter authentication.
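The source rules above (at least one of file or cloud_storage_url, and an HTTPS address for the URL) could be checked client-side before sending a request. This is a rough sketch only: the function name `check_source` is hypothetical, it does not verify the 2GB size limit or that the URL is reachable, and it does not decide what happens when both inputs are given, since that is not specified here.

```python
from typing import Optional
from urllib.parse import urlparse


def check_source(file: Optional[bytes], cloud_storage_url: Optional[str]) -> None:
    """Raise ValueError if the request has no usable audio source.

    Hypothetical helper; parameter names follow the descriptions above.
    """
    if file is None and cloud_storage_url is None:
        raise ValueError("provide either file or cloud_storage_url")
    if cloud_storage_url is not None:
        parsed = urlparse(cloud_storage_url)
        # urlparse lowercases the scheme, so "HTTPS://..." also passes.
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("cloud_storage_url must be an HTTPS URL")


# A pre-signed URL with query parameters is still a plain HTTPS URL.
check_source(None, "https://example.com/audio.mp3?token=abc123")
```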
Whether the audio file is multi-channel with each channel containing only a single speaker. When enabled, each channel is transcribed independently and the results are combined. Each word in the output includes a channel_index field. Up to 5 channels are supported.
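Because each word carries a channel_index, a combined word list can be split back into per-channel text. The sketch below assumes each word is a dictionary with a "text" key, which is a guess about the response shape; only channel_index comes from the description above.

```python
from collections import defaultdict
from typing import Dict, List


def words_by_channel(words: List[dict]) -> Dict[int, str]:
    """Group word texts by channel_index and join them per channel.

    Hypothetical helper; the "text" key is an assumed field name.
    """
    channels: Dict[int, List[str]] = defaultdict(list)
    for word in words:
        channels[word["channel_index"]].append(word["text"])
    # Sort by channel index so the result order is stable.
    return {idx: " ".join(texts) for idx, texts in sorted(channels.items())}


sample = [
    {"text": "hello", "channel_index": 0},
    {"text": "hi", "channel_index": 1},
    {"text": "there", "channel_index": 0},
]
print(words_by_channel(sample))
```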
Speaker diarization threshold. A higher value makes it less likely that one speaker is split into several, but more likely that different speakers are merged into one (fewer speakers identified). A lower value does the opposite: one speaker is more likely to be split, and different speakers are less likely to be merged (more speakers identified). Can only be set when diarize=True and num_speakers=None. Defaults to None, which selects a threshold based on the model ID (typically 0.22). Range: [0.1, 0.4]
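The constraints on this threshold (only with diarize=True and num_speakers=None, and within [0.1, 0.4]) can be expressed as a small pre-flight check. This is an illustrative sketch; the function name and the `threshold` argument name are hypothetical, while diarize and num_speakers follow the description above.

```python
from typing import Optional


def check_diarization_threshold(
    threshold: Optional[float],
    diarize: bool,
    num_speakers: Optional[int],
) -> None:
    """Raise ValueError if the threshold is set in an unsupported combination."""
    if threshold is None:
        return  # None lets the service pick a model-specific default
    if not diarize or num_speakers is not None:
        raise ValueError(
            "threshold can only be set when diarize=True and num_speakers=None"
        )
    if not 0.1 <= threshold <= 0.4:
        raise ValueError("threshold must be within [0.1, 0.4]")
```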
Granularity of timestamps in the transcription. ‘word’ provides word-level timestamps, ‘character’ provides character-level timestamps. Possible values: none, word, character