A fun exercise in AI transcription with faster-whisper, stable-ts, and whisper-timestamped

As part of my role in the open source/AI/gaming communities I’m involved in, I’ve been handling transcriptions of video and audio. Most of them have been recorded interviews or conference presentations.

I wanted to run a little test on 3 of these tools today, all of them built upon OpenAI’s Whisper:

TL;DR: They’re all great, but none are 100% correct.

To set this test up, I wanted to run the tools against something short but interesting, with fairly clean (if not super high quality) audio, and I found this gem. It’s a small portion of a longer presentation given by Admiral Grace Hopper, in which she explains nanoseconds:
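(As a quick aside, her famous prop in this talk is a piece of wire just under a foot long: the distance light travels in one nanosecond. If you want to sanity-check that number yourself, a few lines of Python will do it. The constants are standard; only the rounding in the comments is mine.)

```python
# Distance light travels in one nanosecond, the length of
# Grace Hopper's famous wire prop.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by SI definition

distance_m = SPEED_OF_LIGHT_M_PER_S * 1e-9  # metres per nanosecond
distance_cm = distance_m * 100
distance_in = distance_m / 0.0254  # metres to inches

print(f"{distance_cm:.2f} cm")  # just under 30 cm
print(f"{distance_in:.2f} in")  # roughly 11.8 inches
```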

You can download the YouTube file and follow along at home :slight_smile:

Demo

With the 3 tools installed, I ran each with the following commands:

faster-whisper --model large-v3 --output_format srt --output_dir faster-whisper Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].srt (2.5 KB)

stable-ts --model large-v3 --output_format srt --output_dir stable-ts Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].srt (40.0 KB)

whisper_timestamped --model large-v3 --output_format srt --output_dir whisper_timestamped Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].mkv.srt (2.9 KB)
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].mkv.words.srt (13.3 KB)


You can see that the files differ in size, which comes down to the different default rendering each tool uses.
Faster-whisper takes a clean approach, with just the text. Stable-ts adds color to each word, similar to what you’d see in karaoke. Whisper-timestamped outputs 2 files: one clean like faster-whisper, and one word-for-word, like the captions you see on TikTok and YouTube Shorts.

To see these in action yourself, you can download the video above using a tool such as yt-dlp, then open it with VLC, and then go to Subtitle -> Add Subtitle File....

Result

All 3 tools are fantastic at what they do, especially paired with the large-v3 model. In my experience, they also handle transcriptions well when the speaker has a heavy accent.

Homework

Here’s a little exercise for you all. When watching the video with the attached subtitles, see if you can spot the error in each one. They range from small to large, but each file contains one error that would throw off a viewer. If you find one, reply here with a spoiler tag :slight_smile:

Jonas, nice work doing this analysis and comparison. FWIW, I heard Admiral Hopper give a similar talk about the “nanosecond” when I was with Digital Equipment in the late ’80s.

Thank you, and that’s so cool! It must have been a fun lecture :slight_smile: