AUDIO MODEL · Apr 2025 · NVIDIA · Last updated: Apr 29, 2026

NVIDIA Parakeet-TDT

Open ASR Leader

NVIDIA's open speech-recognition family. Parakeet-TDT-0.6B topped the Hugging Face Open ASR Leaderboard on release with a 6.05% average word error rate (WER). Its Token-and-Duration Transducer (TDT) architecture is cheap to deploy (real-time on a single GPU), and the weights are released under a permissive license through NVIDIA's NeMo toolkit.
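Word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration, not the leaderboard's scoring code. Leaderboards typically normalize text (casing, punctuation) before scoring, but this sketch only splits on whitespace. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (rolling one-row Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```

Because insertions also count as errors, WER can exceed 100% when the output is much longer than the reference.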


Cost

Free

Open weights — self-host

How are Intelligence, Speed & Cost bucketed?

Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds on output-token price, measured in dollars per million output tokens ($/M out).

Intelligence

Top 1%: ≤ 1%
Top 5%: ≤ 5%
Top 10%: ≤ 10%
Good: ≤ 25%
Medium: ≤ 50%
Below avg: > 50%

Speed

Top 1%: ≥ 345 tok/s
Top 5%: ≥ 237 tok/s
Top 10%: ≥ 196 tok/s
Good: ≥ 146 tok/s
Medium: ≥ 90 tok/s
Slow: < 90 tok/s

Cost

Free: open weights · self-host
Low: < $1 / M out
Moderate: $1–5 / M out
High: ≥ $5 / M out
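The bucket rules above reduce to ordered threshold lookups. This is a minimal sketch, assuming the Intelligence percentile is expressed as "percent from the top" (1.0 means top 1%) and that a missing price (`None`) marks a self-hosted open-weights model. The function names are hypothetical and are not part of any Artificial Analysis API.

```python
from typing import Optional

# Thresholds copied from the bucket tables, ordered from best to worst.
INTELLIGENCE_BUCKETS = [(1, "Top 1%"), (5, "Top 5%"), (10, "Top 10%"), (25, "Good"), (50, "Medium")]
SPEED_BUCKETS = [(345, "Top 1%"), (237, "Top 5%"), (196, "Top 10%"), (146, "Good"), (90, "Medium")]


def intelligence_bucket(percentile_rank: float) -> str:
    """percentile_rank: position from the top, in percent."""
    for limit, label in INTELLIGENCE_BUCKETS:
        if percentile_rank <= limit:
            return label
    return "Below avg"


def speed_bucket(tokens_per_second: float) -> str:
    """First bucket whose floor the measured output speed meets."""
    for floor, label in SPEED_BUCKETS:
        if tokens_per_second >= floor:
            return label
    return "Slow"


def cost_bucket(usd_per_million_output: Optional[float]) -> str:
    """None stands for open weights that are self-hosted (no per-token price)."""
    if usd_per_million_output is None:
        return "Free"
    if usd_per_million_output < 1:
        return "Low"
    if usd_per_million_output < 5:
        return "Moderate"  # $1 up to (but not including) $5
    return "High"


print(speed_bucket(200), cost_bucket(None))  # Top 10% Free
```

Checking the strictest bucket first means each function returns the best label a value qualifies for. This mirrors how the tables read from top to bottom.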

Official ↗ GitHub ↗ Hugging Face ↗

Why it matters

Parakeet-TDT has replaced Whisper as the default open-source ASR model in many production deployments. Canary-Qwen-2.5B followed months later with a record 5.63% WER. Together, the two models put NVIDIA at the front of open ASR through 2025–2026.
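The gap between the two leaderboard figures can be stated as both an absolute and a relative improvement. This is simple arithmetic on the reported average WERs and ignores per-dataset variation:

```latex
\Delta_{\text{abs}} = 6.05\% - 5.63\% = 0.42\ \text{percentage points}, \qquad
\Delta_{\text{rel}} = \frac{6.05 - 5.63}{6.05} \approx 6.9\%
```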

Core Capabilities

Audio

Speech, music, or other audio understanding/synthesis.

Transcription

Converts spoken audio into text; it does not generate audio or other media.

Context Window

Context window not disclosed.

Availability

API

Not available

Product / App

Not available

Open Source

Released

Enterprise

—

Pricing Model

Free / self-host

Open weights — pay only for compute.

Self-host

What it feels like

Speech-recognition model from NVIDIA; see the linked sources for benchmark and review coverage
Speech-to-text transcription, per the published model card

Best use cases

Speech-to-text transcription tasks, per the model card
See the model spec and sources block for benchmarked use cases

Tools to try

NIM Microservices · Build catalog

Not ideal for

Tasks outside speech-to-text, such as audio generation or music understanding
Workflows where a more recent successor in the same family scores higher