Audio Transcriber converts spoken audio into accurate text with speaker labels, word-level timestamps, and audio event tagging. It supports 99+ languages with automatic detection, making it suitable for meetings, interviews, podcasts, lectures, and any multi-speaker recordings.
Getting clean, structured transcripts normally requires a separate service and manual cleanup. This tool handles language detection, speaker diarization, timestamp granularity, and non-speech event tagging (laughter, applause, music) in a single call. The output includes both a full transcript string and a per-word array with timing and speaker IDs — ready for subtitles, summaries, or further analysis.
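The exact response schema depends on the service, but a sketch of the shape described above might look like the following. The field names (`text`, `words`, `start`, `end`, `speaker_id`, `language_code`) are illustrative assumptions, not the documented schema:

```python
# Hypothetical response shape -- field names are illustrative assumptions,
# not the tool's documented schema.
response = {
    "language_code": "en",                 # auto-detected language
    "text": "Hello there. General Kenobi!",
    "words": [
        {"text": "Hello",   "start": 0.12, "end": 0.48, "speaker_id": "speaker_0"},
        {"text": "there.",  "start": 0.52, "end": 0.90, "speaker_id": "speaker_0"},
        {"text": "General", "start": 1.30, "end": 1.72, "speaker_id": "speaker_1"},
        {"text": "Kenobi!", "start": 1.76, "end": 2.20, "speaker_id": "speaker_1"},
    ],
}

# Full transcript string for summaries; per-word array for subtitles or analysis.
transcript = response["text"]
speakers = {w["speaker_id"] for w in response["words"]}
```

Both views of the same transcription come back in one call, so you can feed the string to a summarizer and the word array to a subtitle generator without a second request.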
What you can do
- Transcribe any audio file from a URL in 99+ languages with automatic language detection
- Label different speakers separately (diarization) with or without knowing the speaker count
- Get word-level or character-level timestamps for subtitle generation
- Tag non-speech audio events like laughter, applause, and background music
- Process MP3, WAV, M4A, FLAC, OGG, and other common formats
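Word-level timestamps are what make the subtitle use case practical. A minimal sketch that groups a per-word array into numbered SRT cues (the `text`/`start`/`end` field names are assumptions about the output shape, as above):

```python
def fmt(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group consecutive words into SRT cues of up to max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(cues)
```

A fixed words-per-cue split keeps the sketch short; a production version would also break cues on speaker changes and long pauses.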
Who it's for
Podcast producers generating episode transcripts. Journalists and researchers transcribing interviews. Teams needing meeting notes with speaker attribution. Developers building transcription pipelines or subtitle generation workflows.
How to use it
- Call transcribe_audio with the URL of your audio file; the language is auto-detected unless you specify one
- Set diarize: true to get separate speaker labels; add num_speakers if you know the count for better accuracy
- Set timestamps_granularity: "word" if you need per-word timing for subtitle generation
- Enable tag_audio_events: true to capture laughter, applause, music, and other non-speech sounds
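Putting the steps above together, a hedged sketch of the parameters such a call might carry. The parameter names follow the list above; the request shape itself (a flat dict with an `audio_url` field) is an assumption for illustration:

```python
# Illustrative request parameters -- names follow the docs above;
# the overall request shape is an assumption.
params = {
    "audio_url": "https://example.com/interview.mp3",
    # Omit a language code to let the tool auto-detect among 99+ languages.
    "diarize": True,                   # label each speaker separately
    "num_speakers": 2,                 # optional hint for better accuracy
    "timestamps_granularity": "word",  # per-word timing for subtitles
    "tag_audio_events": True,          # laughter, applause, music, ...
}
```

Leave out `num_speakers` when the count is unknown; diarization still works, the hint just improves accuracy when you have it.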
Getting started
For noisy recordings, run the audio through Audio Isolator first for cleaner transcription results. Then call transcribe_audio with the cleaned file URL.
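The two-step workflow can be sketched as follows. The `isolate_audio` and `transcribe_audio` functions here are stand-ins for the actual tool calls; their stub bodies only illustrate the chaining, not real behavior:

```python
def isolate_audio(audio_url: str) -> str:
    """Stand-in for Audio Isolator: returns a URL to the cleaned audio."""
    return audio_url + "?cleaned=true"  # placeholder, not a real API

def transcribe_audio(audio_url: str, diarize: bool = True) -> dict:
    """Stand-in for the transcription call."""
    return {"source": audio_url, "diarize": diarize, "text": "..."}

# Noisy recording: clean first, then transcribe the cleaned file URL.
cleaned_url = isolate_audio("https://example.com/noisy-meeting.wav")
result = transcribe_audio(cleaned_url, diarize=True)
```

For clean recordings, skip the isolation step and pass the original URL straight to transcribe_audio.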