A fun exercise in AI transcription with faster-whisper, stable-ts, and whisper-timestamped

As part of my role in the open source/AI/gaming communities I’m involved in, I’ve been handling transcriptions of video and audio. Most of them have been recorded interviews or conference presentations.

I wanted to run a little test on 3 of these tools today, all of them built upon OpenAI’s Whisper:

TL;DR: They’re all great, but none are 100% correct.

To set this test up, I wanted to run the tools against something short but interesting, with fairly clean (if not super high quality) audio, and I found this gem. It’s a small portion of a longer presentation given by Admiral Grace Hopper, in which she explains nanoseconds:
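(As a quick aside, her famous prop in this talk is a piece of wire just under a foot long: the distance light travels in one nanosecond. If you want to sanity-check that number yourself, a few lines of Python will do it. The constants are standard; only the rounding in the comments is mine.)

```python
# Distance light travels in one nanosecond, the length of
# Grace Hopper's famous wire prop.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by SI definition

distance_m = SPEED_OF_LIGHT_M_PER_S * 1e-9  # metres per nanosecond
distance_cm = distance_m * 100
distance_in = distance_m / 0.0254  # metres to inches

print(f"{distance_cm:.2f} cm")  # just under 30 cm
print(f"{distance_in:.2f} in")  # roughly 11.8 inches
```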

You can download the YouTube file and follow along at home :slight_smile:

Demo

With the 3 tools installed, I ran each with the following commands:

faster-whisper --model large-v3 --output_format srt --output_dir faster-whisper Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].srt (2.5 KB)

stable-ts --model large-v3 --output_format srt --output_dir stable-ts Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].srt (40.0 KB)

whisper_timestamped --model large-v3 --output_format srt --output_dir whisper_timestamped Admiral\ Grace\ Hopper\ Explains\ the\ Nanosecond\ \[9eyFDBPk4Yw\].mkv

Result
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].mkv.srt (2.9 KB)
Admiral Grace Hopper Explains the Nanosecond [9eyFDBPk4Yw].mkv.words.srt (13.3 KB)


You can see that the files differ in size, which comes down to the different default rendering each tool uses.
Faster-whisper takes a clean approach, with just the text. Stable-ts adds color to each word, similar to what you’d see in karaoke. Whisper-timestamped outputs 2 files: one clean like faster-whisper, and one word-for-word, like the captions you see on TikTok and YouTube Shorts.

To see these in action yourself, you can download the video above using a tool such as yt-dlp, then open it with VLC, and then go to Subtitle -> Add Subtitle File....

Result

All 3 tools are fantastic at what they do, especially paired with the large-v3 model. In my experience, they also handle transcriptions well when the speaker has a heavy accent.

Homework

Here’s a little exercise for you all. When watching the video with the attached subtitles, see if you can spot the error in each one. They range from small to large, but each file contains one error that would throw off a viewer. If you find one, reply here with a spoiler tag :slight_smile:

Jonas, nice work doing this analysis and comparison. FWIW, I heard Admiral Hopper give a similar talk about the “nanosecond” when I was with Digital Equipment in the late ’80s.

Thank you, and that’s so cool! It must have been a fun lecture :slight_smile: