AUDIO MODEL OpenAI

Whisper

Open Multilingual Speech Recognition

OpenAI's open-source speech recognition model, released in September 2022, about two months before ChatGPT. Trained on 680,000 hours of web audio covering 99 languages, it dramatically lowered the bar for "transcribe audio in any language" applications. For many use cases, the open release quickly became a self-hosted alternative to commercial speech-to-text services (Google, AWS Transcribe, Rev, Otter).

Cost
Free
Open weights — self-host

Why it matters

Since 2023, Whisper has been the speech-recognition layer underneath many "AI listens to audio" products. Without it, the meeting-AI and voice-mode-AI categories, along with much of consumer-facing AI audio infrastructure, would have been gated by commercial-API economics rather than built on commodity open weights.

Core Capabilities

Audio
Speech, music, or other audio understanding/synthesis.
Generative
Produces images, video, audio, or other media.
Multimodal
Combines text, vision, and audio in one model.

Context Window

No token context window is listed. Whisper processes audio in 30-second segments.

Availability

API
Not available
Product / App
Not available
Open Source
Released
Enterprise

Pricing Model

Free / self-host
Open weights — pay only for compute.
Self-host
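
As a rough illustration of what self-hosting involves, here is a minimal sketch that assumes the open-source `openai-whisper` Python package and `ffmpeg` are installed. The VRAM figures are the approximate values from the project README and should be treated as guidance only. `pick_model` and `transcribe` are hypothetical helper names introduced for this example.

```python
# Approximate VRAM needs (GB) per checkpoint, as listed in the openai/whisper
# README. These are rough guidance, not guarantees.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_vram_gb: float) -> str:
    """Return the largest checkpoint whose approximate VRAM need fits."""
    fitting = [name for name, need in APPROX_VRAM_GB.items()
               if need <= available_vram_gb]
    # The dict is ordered smallest to largest, so the last fit is the largest.
    return fitting[-1] if fitting else "tiny"


def transcribe(path: str, available_vram_gb: float = 2.0) -> str:
    """Transcribe one audio file locally and return the plain text."""
    import whisper  # lazy import so pick_model works without the package

    model = whisper.load_model(pick_model(available_vram_gb))
    result = model.transcribe(path)
    return result["text"]
```

Calling `transcribe("meeting.mp3")` (a placeholder file name) downloads the chosen checkpoint on first use and then runs entirely on local hardware, which is where the "pay only for compute" framing comes from.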

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

  • Quality: no data reported (placeholder 5.0)
  • Speed: no data reported (placeholder 5.0)
  • Control: no data reported (placeholder 5.0)
  • Consistency: no data reported (placeholder 5.0)

Legend: "Lower 20%" is the 20th percentile (20% of models score below it). "Upper 80%" is the 80th percentile (only 20% of models score above it). "This model" is where the current model lands. Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

  • Trained on 680,000 hours of weakly-supervised audio — among the largest supervised speech datasets ever
  • About 55% fewer errors on average than a supervised LibriSpeech-trained model when both are evaluated on other, more diverse datasets; best zero-shot ASR at release
  • Robust to background noise: holds up at signal-to-noise ratios below 10 dB, where the compared models degrade faster
  • 99 languages officially supported, but only 50 of 82 evaluated have <20% WER — multilingual quality varies wildly
  • Best on Romance languages, German, Japanese; weak on under-resourced languages
  • Hallucinations and proper-noun mistakes are the main failure modes — not perfect, but free
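
The WER and "fewer errors" figures above rest on two simple formulas. The sketch below shows one common way to compute them: word error rate as word-level edit distance divided by reference length, and relative error reduction between two WER values. It assumes plain whitespace tokenization. Published evaluations typically normalize text (case, punctuation, number formats) first, which this example does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, a hypothetical drop from 20% WER to 9% WER gives `relative_error_reduction(0.20, 0.09)`, roughly 0.55, which is the sense in which "55% fewer errors" is meant.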

Best use cases

  • Self-hosted transcription replacing Google / AWS / Rev / Otter for many workloads
  • Podcast / video subtitle generation in 50+ languages
  • Multilingual call-center analytics where API privacy matters
  • Research baselines for ASR — open weights + paper made it the canonical citation
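
For the subtitle use case, a transcription result has to be turned into a subtitle file. The sketch below converts a list of timed segments into SubRip (SRT) text. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string, which matches the shape of the `segments` list returned by the `openai-whisper` package's `transcribe` call. `srt_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render timed segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

Writing the returned string to a `.srt` file next to the video is usually enough for common players to pick it up.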

Not ideal for

  • Real-time transcription at scale without aggressive optimisation (latency)
  • Under-resourced languages where WER is still >20%
  • Use cases needing speaker diarization out of the box (Whisper does ASR, not diarization)
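
One common workaround for the diarization gap is to run a separate diarization tool and merge its speaker turns with Whisper's timed segments. The sketch below shows only the merge step, labelling each segment with the speaker whose turn overlaps it most. The input shapes (`start`, `end`, `text` for segments; `start`, `end`, `speaker` for turns) and the function names are assumptions made for illustration, not the output format of any particular diarization library.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the most-overlapping speaker turn.

    Segments that overlap no turn get speaker None.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```

Majority overlap is a simple heuristic. It will mislabel segments that span a speaker change, so splitting at word-level timestamps may be worth considering where that matters.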

Radford, A. · Kim, J. W. · Xu, T. · Brockman, G. · McLeavey, C. · Sutskever, I.