AUDIO MODEL · Sep 2022 · OpenAI

Whisper

Open Multilingual Speech Recognition

OpenAI's open-source speech recognition model, released in September 2022, about two months before ChatGPT. Trained on 680,000 hours of web audio covering 99 languages, it dramatically lowered the bar for "transcribe audio in any language" applications. For many use cases, the open release quickly became a self-hosted alternative to commercial speech-to-text APIs (Google, AWS Transcribe, Rev, Otter).


Cost

Free

Open weights — self-host

How are Intelligence, Speed & Cost bucketed?

Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).

Intelligence

Top 1%: ≤ 1%
Top 5%: ≤ 5%
Top 10%: ≤ 10%
Good: ≤ 25%
Medium: ≤ 50%
Below avg: > 50%

Speed

Top 1%: ≥ 345 tok/s
Top 5%: ≥ 237 tok/s
Top 10%: ≥ 196 tok/s
Good: ≥ 146 tok/s
Medium: ≥ 90 tok/s
Slow: < 90 tok/s

Cost

Free: open weights · self-host
Low: < $1 / M out
Moderate: $1–5 / M out
High: ≥ $5 / M out


Why it matters

Whisper has been the speech-recognition layer underneath many "AI listens to audio" products since 2023. Without it, meeting assistants, voice-mode assistants, and much of the consumer-facing AI audio stack would have been gated by commercial API pricing rather than built on commodity open weights.

Core Capabilities

Audio

Speech, music, or other audio understanding/synthesis.

Generative

Produces images, video, audio, or other media.

Multimodal

Combines text, vision, and audio in one model.

Context Window

Not disclosed as a token count. Whisper processes audio in 30-second windows.

Availability

API

Not available

Product / App

Not available

Open Source

Released

Enterprise

—

Pricing Model

Free / self-host

Open weights — pay only for compute.

Self-host

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Quality: no data reported (placeholder score 5.0)
Speed: no data reported (placeholder score 5.0)
Control: no data reported (placeholder score 5.0)
Consistency: no data reported (placeholder score 5.0)

Chart legend: "Lower 20%" marks the 20th percentile (20% of models score below it), "Upper 80%" marks the 80th percentile (only 20% of models score above it), and "This model" marks where the current model lands. Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

Trained on 680,000 hours of weakly supervised audio, among the largest supervised speech datasets ever assembled
About 55% fewer errors than fine-tuned ASR models on broad, diverse data; the best zero-shot ASR at release
Robust to background noise: holds up below 10 dB SNR, where competing models degrade sharply
99 languages officially supported, but only 50 of 82 evaluated have <20% WER, so multilingual quality varies widely
Best on Romance languages, German, and Japanese; weak on under-resourced languages
Hallucinations and proper-noun mistakes are the main failure modes: not perfect, but free
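
The WER (word error rate) figures quoted above can be made concrete with a small sketch. WER is commonly defined as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name `word_error_rate` is illustrative, not part of any Whisper package, and the sketch only splits on whitespace; published figures usually apply text normalization first, so treat this as an approximation of how such numbers are computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not a Whisper API). No text normalization is
    applied: words are compared exactly as split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one way hallucinated text shows up in the metric.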

Reviews: OpenAI — Introducing Whisper ↗ · Whisper paper (PDF) ↗ · GitHub — openai/whisper ↗

Best use cases

Self-hosted transcription replacing Google / AWS / Rev / Otter for many workloads
Podcast / video subtitle generation in 50+ languages
Multilingual call-center analytics where sending audio to a third-party API is a privacy concern
Research baselines for ASR — open weights + paper made it the canonical citation
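
As a rough illustration of the subtitle use case above, the sketch below converts timestamped transcript segments into SRT text. It assumes each segment is a dict with `start`, `end` (in seconds) and `text` keys, which is the shape returned under `result["segments"]` by the open-source `whisper` Python package. The helper names `format_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are illustrative, and the model size `"base"` is only an example default.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {'start', 'end', 'text'} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Each block ends with a newline, so joining with "\n" leaves the
    # blank line SRT expects between cues.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Self-hosted path: requires `pip install openai-whisper` and ffmpeg."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 4.0, "text": " Second line."},
    ]
    print(segments_to_srt(demo))
```

Keeping the formatting separate from the model call means the same helper can be reused with any transcription backend that produces start and end times.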

Tools to try

ChatGPT · Codex CLI · Cursor · GitHub Copilot · Continue.dev

Not ideal for

Real-time transcription at scale, where latency becomes a problem without aggressive optimisation
Under-resourced languages where WER is still >20%
Use cases needing speaker diarization out of the box (Whisper does ASR, not diarization)
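
Since Whisper produces text without speaker labels, a common workaround is to run a separate diarization tool and merge its output with the transcript. The sketch below shows only the merge step, under stated assumptions: transcript segments are dicts with `start`, `end`, and `text`, and the diarizer output is a hypothetical list of `(start, end, speaker)` tuples. The function `assign_speakers` is illustrative and not tied to any particular diarization library.

```python
def assign_speakers(segments, speaker_turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments:      list of dicts with 'start', 'end', 'text' (seconds)
    speaker_turns: list of (start, end, speaker) tuples from an external
                   diarizer (hypothetical format, assumed for this sketch)
    Segments that overlap no turn get speaker None.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Length of the time interval shared by the segment and the turn;
            # negative when they do not intersect.
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 3.0, "text": "Hi there."},
        {"start": 3.0, "end": 6.0, "text": "Hello!"},
    ]
    turns = [(0.0, 2.5, "A"), (2.5, 7.0, "B")]
    for seg in assign_speakers(segments, turns):
        print(seg["speaker"], seg["text"])
```

Maximum-overlap assignment is a simple heuristic; it mislabels segments that span a speaker change, which is one reason diarization is better treated as a separate component than as an afterthought.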

Radford, A. · Kim, J. W. · Xu, T. · Brockman, G. · McLeavey, C. · Sutskever, I.