Whisper
OpenAI's open-source speech recognition model, released in September 2022, about two months before ChatGPT. Trained on 680,000 hours of audio collected from the web and supporting 99 languages, it dramatically lowered the bar for "transcribe audio in any language" applications. For many use cases, the open release quickly displaced commercial speech-to-text APIs (Google, AWS Transcribe, Rev, Otter).
Why it matters
Whisper has been the speech-recognition layer underneath most "AI listens to audio" products since 2023. Without it, the meeting-AI category, the voice-mode-AI category, and much of consumer-facing AI audio infrastructure would have been gated by commercial-API economics rather than built on commodity open weights.
Core Capabilities
Context Window
Context window not disclosed.
What it feels like
- Trained on 680,000 hours of weakly supervised audio, among the largest supervised speech datasets ever
- About 55% fewer errors than fine-tuned ASR models when evaluated zero-shot on broad, diverse data; best zero-shot ASR at release
- Robust to background noise: degrades more gracefully than competitors and outperforms them once SNR falls below about 10 dB
- 99 languages officially supported, but only 50 of the 82 evaluated have a word error rate (WER) below 20%; multilingual quality varies widely
- Best on Romance languages, German, and Japanese; weak on under-resourced languages
- Hallucinations and proper-noun mistakes are the main failure modes; not perfect, but free
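To make the 20% WER threshold concrete, here is a minimal sketch of how word error rate is usually computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name and the normalisation (lowercasing and whitespace splitting) are simplifying assumptions for illustration; published evaluations typically apply a more careful text normaliser before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercasing + whitespace split stands in for real normalisation.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of five reference words is a WER of 20%.
    print(word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps"))
```

At a 20% WER, roughly one word in five needs correcting, which is why that figure is a common rough cut-off for "usable without heavy editing".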
Best use cases
- Self-hosted transcription replacing Google / AWS / Rev / Otter for many workloads
- Podcast / video subtitle generation in 50+ languages
- Multilingual call-center analytics where sending audio to a third-party API is a privacy concern
- Research baselines for ASR: open weights plus a paper made it the canonical citation
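As a rough illustration of the self-hosted transcription and subtitle use cases, the sketch below turns timestamped segments into SRT subtitle text. It assumes the open-source `openai-whisper` Python package, whose `transcribe` call returns a dictionary with a `segments` list of `start`, `end`, and `text` entries. The helper names (`format_timestamp`, `segments_to_srt`, `transcribe_to_srt`) and the default `"base"` model size are illustrative choices, not part of the library.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds in the HH:MM:SS,mmm form used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert dicts with 'start', 'end', 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Each block ends in a newline, so joining with "\n" leaves a blank line
    # between consecutive subtitle cues.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Hypothetical wrapper: transcribe a local file and return SRT text."""
    # Imported lazily so the helpers above work without the package installed.
    # Requires `pip install openai-whisper` and ffmpeg on the PATH.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 5.0, "text": " This is a test."},
    ]
    print(segments_to_srt(demo))
```

Everything runs locally, so no audio leaves the machine. Larger model sizes generally trade speed for accuracy.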
Not ideal for
- Real-time transcription at scale, where latency is a problem without aggressive optimisation
- Under-resourced languages where WER is still above 20%
- Use cases needing speaker diarization out of the box (Whisper does ASR, not diarization)
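Because Whisper outputs timestamped text but no speaker labels, a common workaround is to run a separate diarization tool and merge the two outputs by time overlap. The sketch below shows only the merge step. The `turns` format (dicts with `start`, `end`, and `speaker`) and the function name `assign_speakers` are assumptions for illustration, standing in for whatever a diarization tool actually produces.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: dicts with 'start', 'end', 'text' (times in seconds)
    turns:    dicts with 'start', 'end', 'speaker' (hypothetical diarizer output)
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            # Length of the time interval shared by the segment and the turn;
            # negative when the two do not overlap at all.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        # The speaker stays None when no turn overlaps the segment.
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.0, "text": "Hi there."},
        {"start": 2.0, "end": 5.0, "text": "Hello, how are you?"},
    ]
    turns = [
        {"start": 0.0, "end": 2.2, "speaker": "A"},
        {"start": 2.2, "end": 6.0, "speaker": "B"},
    ]
    for row in assign_speakers(segments, turns):
        print(row["speaker"], row["text"])
```

Maximum-overlap assignment is a simple heuristic. It can mislabel segments that span a speaker change, so word-level timestamps or shorter segments help where turns are short.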