OpenAI Whisper: My Take After Two Years of Local and API Transcription

In short: OpenAI Whisper is an MIT-licensed speech-to-text model you can run locally or through the OpenAI API. Running it locally keeps audio files off the cloud, which matters for confidential meetings and regulated sectors. It comes in several sizes from tiny to large-v3, and pairs well with Faster-Whisper for speed and WhisperX for speaker diarization.

I started using Whisper shortly after OpenAI released it in September 2022. Back then, I was looking for a way to transcribe long interviews for articles without paying a cloud service by the hour. Since then, Whisper has become a core piece of my workflow: transcribing podcasts, writing up client meeting notes, subtitling videos. Here's what I've actually taken away from it in practice, rather than a rundown of the specs.

Why Whisper changed everything for me

Whisper is released under the MIT license and can be downloaded directly from OpenAI's GitHub repository. That means you can run it on your own machine, without sending a single audio file to the cloud. For someone transcribing meetings full of strategic data, or consultations that touch on confidentiality (lawyers, doctors, HR), that difference is structural.

Before Whisper, I was paying per transcribed hour, at a rate that varied with the service. Over 80 hours of transcription a year, the math adds up fast. Local Whisper let me pay in electricity what I used to pay in subscription fees.

The model sizes, in practice

Whisper comes in several sizes. Here's what I use depending on the context:

tiny and base: I use these for personal voice notes or "just to get the gist" transcriptions of badly recorded audio. Very fast even on CPU, with acceptable quality but errors on proper nouns.
small: my default on a machine without a GPU. Good quality/speed ratio for standard French.
medium: when I have a GPU available. A very good compromise, plenty for most professional content.
large-v3: for my publishable podcasts and client subtitles. The quality is clearly a notch above, especially on proper nouns and technical passages.

The jump from large-v2 to large-v3 was a real step up in quality on French, particularly on transitions and punctuation.
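The habits above can be written down as a small selection rule. This is only a sketch of my own defaults: the function name `pick_model` and its flags are hypothetical, and choosing large-v3 for publishable work regardless of hardware is an assumption, not a recommendation from OpenAI.

```python
def pick_model(has_gpu: bool, publishable: bool = False, quick_draft: bool = False) -> str:
    """Return a Whisper model size following the habits described above."""
    if publishable:
        # Podcasts and client subtitles: quality first (assumed even without a GPU).
        return "large-v3"
    if quick_draft:
        # Voice notes, "just the gist" transcriptions.
        return "base"
    # Everyday work: medium when a GPU is available, small otherwise.
    return "medium" if has_gpu else "small"


if __name__ == "__main__":
    print(pick_model(has_gpu=False))                    # everyday work on CPU
    print(pick_model(has_gpu=True, publishable=True))   # publishable episode
```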

When I reach for local Whisper

I pull out the local version when:

The content is confidential (steering meetings, strategic briefs, HR interviews).
The volume is high (a weekly podcast plus several hours of calls per week).
I want to be able to rerun the transcription multiple times to compare parameters.

For GDPR constraints in a company setting, my compliance checklist lays out the concrete questions to ask yourself before defaulting to a cloud service.

When I go through the OpenAI API

When the content isn't sensitive and I want to move fast, I send the audio to OpenAI's Whisper API. It's simpler to wire into a Make or n8n script, it doesn't require a local GPU, and the automatic language detection works really well.

One limit to know about: the file has to be under 25 MB. For long audio, I split it with ffmpeg before sending.
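To make the splitting step concrete, here is a rough sketch of how a chunk length can be derived from the file size and duration, and how the matching ffmpeg command can be built. The helper names, the 10% safety margin, and the output pattern are my own assumptions; the limit is taken as 25,000,000 bytes to stay on the safe side. The command relies on ffmpeg's segment muxer with stream copy, so chunk boundaries are approximate.

```python
import math

API_LIMIT_BYTES = 25_000_000  # the 25 MB cap, read conservatively


def segment_seconds(file_bytes: int, duration_s: float, margin: float = 0.9) -> int:
    """Longest chunk length, in whole seconds, that keeps each chunk under the cap.

    Assumes a roughly constant bitrate, which is an approximation for VBR files.
    """
    if file_bytes <= 0 or duration_s <= 0:
        raise ValueError("file_bytes and duration_s must be positive")
    bytes_per_second = file_bytes / duration_s
    return max(1, math.floor(API_LIMIT_BYTES * margin / bytes_per_second))


def ffmpeg_split_command(src: str, seconds: int, pattern: str = "chunk_%03d.mp3") -> list[str]:
    """Build an ffmpeg call that cuts `src` into pieces of about `seconds` each."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment", "-segment_time", str(seconds),
        "-c", "copy",  # no re-encoding, so the split is fast and lossless
        pattern,
    ]


if __name__ == "__main__":
    secs = segment_seconds(file_bytes=100_000_000, duration_s=4000)
    print(" ".join(ffmpeg_split_command("interview.mp3", secs)))
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Cutting on a fixed duration can split a sentence in two, so it is worth checking the transcript around chunk boundaries.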

My alternatives when Whisper isn't enough

Faster-Whisper: an optimized implementation that gets me a 3x to 4x speedup on local transcription, at equivalent quality. It's become my default base for batch work.
WhisperX: adds diarization, meaning speaker identification. Essential for multi-voice podcasts or meetings where I want to know who said what.
Deepgram: a cloud service, account-based, but excellent at real-time streaming. I use it when a client wants live transcription during an event.
AssemblyAI: a cloud service with entity extraction and summarization. Handy when you chain transcription then text processing in a single pipeline.
NVIDIA's Parakeet: very fast on NVIDIA GPUs, interesting for massive volumes.

What tripped me up in practice

Whisper sometimes hallucinates on long silences: it invents plausible sentences that were never spoken. This is documented (a 2024 AP/Cornell study), and I've seen the phenomenon several times on my own files. My workaround: cut out the long silences upstream with a voice-activity-detection threshold (VAD), or use Faster-Whisper, which offers a built-in VAD mode.
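For the built-in VAD route, a minimal sketch looks like the following. It uses the `vad_filter` and `vad_parameters` arguments of Faster-Whisper's `transcribe` method; the wrapper name `transcribe_skipping_silence` and the 500 ms default are my own choices for illustration, not tuned values. The model is passed in as an argument so the function itself has no hard dependency.

```python
def transcribe_skipping_silence(model, audio_path: str, min_silence_ms: int = 500):
    """Transcribe with voice-activity detection so long silences are dropped first.

    `model` is expected to behave like faster_whisper.WhisperModel.
    Returns a list of (start, end, text) tuples.
    """
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,  # remove non-speech audio before decoding
        vad_parameters={"min_silence_duration_ms": min_silence_ms},
    )
    # `segments` is a lazy iterator; consuming it runs the actual transcription.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


# Typical use (requires the faster-whisper package and, here, a CUDA GPU):
#
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3", device="cuda", compute_type="float16")
#   for start, end, text in transcribe_skipping_silence(model, "episode.mp3"):
#       print(f"[{start:.1f}s -> {end:.1f}s] {text}")
```

This reduces the problem in my experience but does not guarantee its absence, so a proofreading pass on quiet passages remains worthwhile.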

Another point: on very strong accents or significant background noise, large-v3 drops the ball. For podcasts recorded outdoors, I first clean up the audio with a denoiser (Auphonic or a local plugin) before transcribing.

My typical pipeline

For a one-hour podcast episode:

Audio cleanup (denoising, normalization).
Transcription with Faster-Whisper large-v3 locally (15 to 20 minutes on my machine).
A pass through Claude to proofread and fix proper nouns and domain-specific jargon.
Export to SRT for YouTube subtitling.

The gain compared to a purely human transcription: a factor of 5 on time, for a comparable final quality after proofreading.
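The last step of the pipeline, the SRT export, is simple enough to sketch in a few lines. This assumes the transcription is available as (start, end, text) tuples with times in seconds; the function names are hypothetical and the code does not handle line wrapping or maximum subtitle length.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Turn (start, end, text) tuples into the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 2.0, "Hello"), (2.0, 4.25, "World")]
    print(to_srt(demo))
```

Writing the result with `open("episode.srt", "w", encoding="utf-8")` gives a file that can be uploaded as a subtitle track.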

My verdict for Trust-Vault

Whisper ticks the boxes I look at first: open source code (transparency), the ability to run fully local (privacy), an active community around optimized variants. That's rare in the AI landscape. For voice synthesis, which is the exact opposite of transcription, my take on ElevenLabs is the complementary read.

For those just starting out who want a turnkey interface with no installation, Otter.ai remains a good entry point. It does mean a trade-off on confidentiality, one that everyone has to make with their eyes open.
