Audio Transcriber

Audio Transcriber converts spoken audio into accurate text with speaker labels, word-level timestamps, and audio event tags. It supports 99+ languages with automatic language detection, which makes it suitable for meetings, interviews, podcasts, lectures, and any other multi-speaker recording.

Clean, structured transcripts usually require a separate transcription service followed by manual cleanup. This tool handles language detection, speaker diarization, timestamps, and non-speech event tagging (laughter, applause, music) in a single call. The output includes both a full transcript string and a per-word array with timing and speaker IDs, ready for subtitles, summaries, or further analysis.
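As an illustration of how the per-word array might feed subtitle generation, here is a minimal Python sketch that groups words into SRT cues. The exact shape of the output is not documented here, so the field names text, start, end (in seconds), and speaker_id are assumptions, as are the helper names format_srt_time and words_to_srt; adapt them to the actual response.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 8) -> str:
    """Group words into cues of at most max_words, starting a new cue
    whenever the speaker changes.

    Assumed (hypothetical) word shape:
    {"text": str, "start": float, "end": float, "speaker_id": str}
    """
    cues: List[List[Dict]] = []
    for word in words:
        current = cues[-1] if cues else None
        if (
            current is None
            or len(current) >= max_words
            or current[-1]["speaker_id"] != word["speaker_id"]
        ):
            cues.append([word])
        else:
            current.append(word)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(w["text"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n[{cue[0]['speaker_id']}] {text}")
    return "\n\n".join(blocks) + "\n"
```

Breaking cues on speaker changes keeps each subtitle attributed to a single speaker; the eight-word limit is an arbitrary choice for readability.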

What you can do

Transcribe any audio file from a URL in 99+ languages with automatic language detection
Label each speaker separately (diarization), whether or not you know the speaker count in advance
Get word-level or character-level timestamps for subtitle generation
Tag non-speech audio events like laughter, applause, and background music
Process MP3, WAV, M4A, FLAC, OGG, and other common formats

Who it's for

Podcast producers generating episode transcripts. Journalists and researchers transcribing interviews. Teams needing meeting notes with speaker attribution. Developers building transcription pipelines or subtitle generation workflows.

How to use it

Call transcribe_audio with the URL of your audio file; the language is detected automatically if you don't specify one
Set diarize: true to get separate speaker labels; if you know the speaker count, add num_speakers for better accuracy
Set timestamps_granularity: "word" if you need per-word timing for subtitle generation
Enable tag_audio_events: true to capture laughter, applause, music, and other non-speech sounds
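
The steps above can be sketched as a small Python helper that assembles the arguments for a transcribe_audio call. How the tool is invoked depends on your client, so this only builds the payload. The option names diarize, num_speakers, timestamps_granularity, and tag_audio_events come from this page; the audio_url and language keys, the "character" granularity value, and the helper name build_transcribe_args are assumptions for illustration.

```python
from typing import Any, Dict, Optional


def build_transcribe_args(
    audio_url: str,
    language: Optional[str] = None,
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    timestamps_granularity: Optional[str] = None,
    tag_audio_events: bool = False,
) -> Dict[str, Any]:
    """Build an argument payload for a transcribe_audio call.

    Optional values left as None are omitted so the tool can fall back
    to its own behavior (for example, automatic language detection).
    """
    # "character" is assumed from the mention of character-level timestamps.
    if timestamps_granularity not in (None, "word", "character"):
        raise ValueError("timestamps_granularity must be 'word' or 'character'")

    # "audio_url" is a hypothetical key name for the file URL.
    args: Dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "tag_audio_events": tag_audio_events,
    }
    if language is not None:
        args["language"] = language  # hypothetical key; omit to auto-detect
    if num_speakers is not None:
        args["num_speakers"] = num_speakers
    if timestamps_granularity is not None:
        args["timestamps_granularity"] = timestamps_granularity
    return args


if __name__ == "__main__":
    # A three-person meeting with per-word timing and event tags.
    print(build_transcribe_args(
        "https://example.com/meeting.mp3",
        diarize=True,
        num_speakers=3,
        timestamps_granularity="word",
        tag_audio_events=True,
    ))
```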

Getting started

For noisy recordings, run the audio through Audio Isolator first for cleaner transcription results. Then call transcribe_audio with the cleaned file URL.
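
A rough Python sketch of that two-step flow, assuming your client exposes some way to call a tool by name. Everything except transcribe_audio is hypothetical here: the call_tool callable, the isolate_audio tool name for Audio Isolator, and the audio_url key in both the arguments and the isolator's result are placeholders to replace with the real names.

```python
from typing import Any, Callable, Dict

# Hypothetical client hook: takes a tool name and arguments, returns a result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def transcribe_noisy_audio(call_tool: ToolCaller, audio_url: str) -> Dict[str, Any]:
    """Clean a noisy recording first, then transcribe the cleaned file."""
    # Step 1: isolate speech. "isolate_audio" and "audio_url" are assumed names.
    isolated = call_tool("isolate_audio", {"audio_url": audio_url})
    cleaned_url = isolated["audio_url"]

    # Step 2: transcribe the cleaned file URL rather than the original.
    return call_tool("transcribe_audio", {"audio_url": cleaned_url})
```

Passing call_tool in as a parameter keeps the sketch independent of any particular client library and makes the flow easy to exercise with a stub.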