OpenAI Whisper: My Take After Two Years of Local and API Transcription
I've used Whisper since 2023 to transcribe podcasts, meetings and interviews. Here's what works locally, what works via the API, and where I actually saved time.
In short: OpenAI Whisper is an MIT-licensed speech-to-text model you can run locally or through the OpenAI API. Running it locally keeps audio files off the cloud, which matters for confidential meetings and regulated sectors. It comes in several sizes from tiny to large-v3, and pairs well with Faster-Whisper for speed and WhisperX for speaker diarization.
I started using Whisper shortly after OpenAI released it in September 2022. Back then, I was looking for a way to transcribe long interviews for articles without paying a cloud service by the hour. Since then, Whisper has become a core piece of my workflow: transcribing podcasts, writing up client meeting notes, subtitling videos. Here's what I've actually taken away from it in practice, rather than a rundown of the specs.
Why Whisper changed everything for me
Whisper is released under the MIT license and can be downloaded directly from OpenAI's GitHub repository. That means you can run it on your own machine, without sending a single audio file to the cloud. For someone transcribing meetings full of strategic data, or consultations that touch on confidentiality (lawyers, doctors, HR), that difference is structural.
Before Whisper, I was paying per transcribed hour, at a rate that varied with the service. Over 80 hours of transcription a year, the math adds up fast. With local Whisper, I now pay in electricity what I used to pay in subscription fees.
The model sizes, in practice
Whisper comes in several sizes. Here's what I use depending on the context:
- tiny and base: I use these for personal voice notes or "just to get the gist" transcriptions of badly recorded audio. Very fast even on CPU, with acceptable quality but frequent errors on proper nouns.
- small: my default on a machine without a GPU. Good quality/speed ratio for standard French.
- medium: when I have a GPU available. A very good compromise, plenty for most professional content.
- large-v3: for my publishable podcasts and client subtitles. The quality is clearly a notch above, especially on proper nouns and technical passages.
The jump from large-v2 to large-v3 was a real step up in quality on French, particularly on transitions and punctuation.
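The rules of thumb above can be written down as a small helper. This is only a sketch of my own selection habits: the function name `pick_model_size` and its flags are hypothetical, and the precedence between the rules (a quick gist first, then publishable work, then GPU availability) is an assumption rather than anything Whisper prescribes. The returned strings are the model names as Whisper ships them.

```python
def pick_model_size(has_gpu: bool, publishable: bool = False,
                    quick_gist: bool = False) -> str:
    """Map the rules of thumb above to a Whisper model name.

    Hypothetical helper: the precedence of the checks is an assumption.
    """
    if quick_gist:
        return "base"      # voice notes, rough audio: speed over accuracy
    if publishable:
        return "large-v3"  # podcasts and client subtitles
    if not has_gpu:
        return "small"     # CPU-only default
    return "medium"        # GPU available, everyday professional content


if __name__ == "__main__":
    print(pick_model_size(has_gpu=False))                   # CPU-only machine
    print(pick_model_size(has_gpu=True, publishable=True))  # client subtitles
```

Keeping the choice in one function makes it easy to rerun the same file with another size and compare the results.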
When I reach for local Whisper
I pull out the local version when:
- The content is confidential (steering meetings, strategic briefs, HR interviews).
- The volume is high (a weekly podcast plus several hours of calls per week).
- I want to be able to rerun the transcription multiple times to compare parameters.
For GDPR constraints in a company setting, my compliance checklist lays out the concrete questions to ask yourself before defaulting to a cloud service.
When I go through the OpenAI API
When the content isn't sensitive and I want to move fast, I send the audio to OpenAI's Whisper API. It's simpler to wire into a Make or n8n script, it doesn't require a local GPU, and the automatic language detection works really well.
One limit to know about: the file has to be under 25 MB. For long audio, I split it with ffmpeg before sending.
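Here is a minimal sketch of how that split can be planned, assuming a constant-bitrate file. The helper names, the 10% safety margin, and the reading of "25 MB" as 25 x 1024 x 1024 bytes are my assumptions. The command uses ffmpeg's segment muxer with stream copy, which cuts on packet boundaries, so chunk durations are approximate.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed reading of the 25 MB limit


def segment_seconds(bitrate_kbps: int, safety: float = 0.9) -> int:
    """Longest chunk duration (in seconds) that stays under the limit,
    assuming a constant bitrate and keeping a safety margin."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(MAX_UPLOAD_BYTES * safety / bytes_per_second)


def chunk_count(total_seconds: float, seconds: int) -> int:
    """Number of chunks needed to cover the whole recording."""
    return math.ceil(total_seconds / seconds)


def ffmpeg_split_command(src: str, out_pattern: str, seconds: int) -> list[str]:
    """Build (but do not run) an ffmpeg command that splits without re-encoding."""
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(seconds), "-c", "copy", out_pattern]


if __name__ == "__main__":
    secs = segment_seconds(128)            # a 128 kbps MP3
    print(secs, chunk_count(3600, secs))   # chunk length, chunks for one hour
    print(" ".join(ffmpeg_split_command("episode.mp3", "part_%03d.mp3", secs)))
```

The command is returned as a list so it can be passed straight to `subprocess.run` once you have checked it against your own files.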
My alternatives when Whisper isn't enough
- Faster-Whisper: an optimized implementation that gets me a 3x to 4x speedup on local transcription, at equivalent quality. It's become my default base for batch work.
- WhisperX: adds diarization, meaning speaker identification. Essential for multi-voice podcasts or meetings where I want to know who said what.
- Deepgram: a cloud service, account-based, but excellent at real-time streaming. I use it when a client wants live transcription during an event.
- AssemblyAI: a cloud service with entity extraction and summarization. Handy when you chain transcription then text processing in a single pipeline.
- NVIDIA's Parakeet: very fast on NVIDIA GPUs, interesting for massive volumes.
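To make the diarization point concrete, here is a sketch of the general idea behind "who said what": each transcript segment is given the speaker whose turn overlaps it the most in time. This is not WhisperX's actual code. The function names and the tuple layouts are hypothetical, and the speaker turns are assumed to come from a separate diarization step.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the most-overlapping speaker.

    segments: list of (start, end, text)
    turns:    list of (start, end, speaker) from a diarization step
    Returns a list of (speaker, text); speaker is None when no turn overlaps.
    """
    labelled = []
    for s_start, s_end, text in segments:
        totals = {}  # speaker -> seconds of overlap with this segment
        for t_start, t_end, speaker in turns:
            shared = overlap(s_start, s_end, t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else None
        labelled.append((best, text))
    return labelled


if __name__ == "__main__":
    segments = [(0, 4, "Hi"), (4, 9, "Hello"), (20, 22, "Bye")]
    turns = [(0, 5, "A"), (5, 10, "B")]
    for speaker, text in assign_speakers(segments, turns):
        print(speaker, text)
```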
What tripped me up in practice
Whisper sometimes hallucinates on long silences: it invents plausible sentences that were never spoken. This is documented (a 2024 AP/Cornell study), and I've seen the phenomenon several times on my own files. My workaround: cut out the long silences upstream with a voice-activity-detection threshold (VAD), or use Faster-Whisper, which offers a built-in VAD mode.
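The silence-cutting workaround can be illustrated with a toy energy threshold. This is a deliberately naive sketch, not what Faster-Whisper does internally (its built-in filter relies on a trained VAD model). The function name, the per-frame energy list, and both parameters are assumptions made for the example.

```python
def speech_spans(energies, threshold, max_silence_frames):
    """Return (start, end) frame spans worth transcribing; end is exclusive.

    A frame counts as voiced when its energy is >= threshold. Silences of up
    to max_silence_frames frames stay inside a span; longer ones are cut out.
    """
    spans = []
    start = None        # first frame of the span being built
    last_voiced = None  # most recent voiced frame
    for i, energy in enumerate(energies):
        if energy < threshold:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 > max_silence_frames:
            # The silence just crossed was too long: close the span before it.
            spans.append((start, last_voiced + 1))
            start = i
        last_voiced = i
    if start is not None:
        spans.append((start, last_voiced + 1))
    return spans


if __name__ == "__main__":
    frames = [0, 5, 6, 0, 7, 0, 0, 0, 0, 8, 9, 0]
    print(speech_spans(frames, threshold=1, max_silence_frames=2))
```

Only the returned spans would be sent to the model, so the long silent stretch in the middle never reaches it.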
Another point: on very strong accents or significant background noise, large-v3 drops the ball. For podcasts recorded outdoors, I first clean up the audio with a denoiser (Auphonic or a local plugin) before transcribing.
My typical pipeline
For a one-hour podcast episode:
- Audio cleanup (denoising, normalization).
- Transcription with Faster-Whisper large-v3 locally (15 to 20 minutes on my machine).
- A pass through Claude to proofread and fix proper nouns and domain-specific jargon.
- Export to SRT for YouTube subtitling.
The gain compared with fully manual transcription: the work goes roughly five times faster, with comparable final quality after proofreading.
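The last step of the pipeline, the SRT export, is simple enough to sketch by hand. The example assumes the transcription step yields segments as (start, end, text) tuples with times in seconds. That layout and the function names are my own choices for illustration, not the output format of any particular library.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) segments as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues


if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, " Hello "), (2.5, 4.0, "World")]))
```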
My verdict for Trust-Vault
Whisper ticks the boxes I look at first: open source code (transparency), the ability to run fully local (privacy), an active community around optimized variants. That's rare in the AI landscape. For voice synthesis, which is the exact opposite of transcription, my take on ElevenLabs is the complementary read.
For those just starting out who want a turnkey interface with no installation, Otter.ai remains a good entry point. It comes with a trade-off on confidentiality, though, and everyone has to make that choice with their eyes open.
Laurent Duplat
Editor-in-Chief — Trust-Vault