AUDIO MODEL Stability AI Last updated:

Stable Audio 2.0

Open Latent-Diffusion Audio

Stability AI's open text-to-audio generator, which produces stereo tracks up to 3 minutes long at 44.1 kHz from text prompts. It was less viral than Suno, which had launched v3 two weeks earlier as a polished consumer product, but it is the only open-weight option in the category, a role analogous to the one Stable Diffusion played in image generation while DALL-E was closed.

Cost
Free
Open weights — self-host
How are Intelligence, Speed & Cost bucketed?
Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).
Intelligence
  • Top 1%: ≤ 1%
  • Top 5%: ≤ 5%
  • Top 10%: ≤ 10%
  • Good: ≤ 25%
  • Medium: ≤ 50%
  • Below avg: > 50%
Speed
  • Top 1%: ≥ 345 tok/s
  • Top 5%: ≥ 237 tok/s
  • Top 10%: ≥ 196 tok/s
  • Good: ≥ 146 tok/s
  • Medium: ≥ 90 tok/s
  • Slow: < 90 tok/s
Cost
  • Free: open weights · self-host
  • Low: < $1 / M out
  • Moderate: $1–5 / M out
  • High: ≥ $5 / M out
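
The bucket thresholds above can be read as a small lookup. The following is a minimal sketch, not the site's actual implementation: the function names are hypothetical, and because the legend lists both "$1–5" and "≥ $5", the sketch assumes that a price of exactly $5 falls into High.

```python
from typing import Optional

# Speed cutoffs in tokens/second, copied from the legend (highest first).
SPEED_CUTOFFS = [
    (345, "Top 1%"),
    (237, "Top 5%"),
    (196, "Top 10%"),
    (146, "Good"),
    (90, "Medium"),
]


def speed_bucket(tok_per_s: float) -> str:
    """Return the first bucket whose floor the measured speed reaches."""
    for floor, label in SPEED_CUTOFFS:
        if tok_per_s >= floor:
            return label
    return "Slow"


def cost_bucket(usd_per_m_out: Optional[float], open_weights: bool = False) -> str:
    """Bucket by output-token price in $/M tokens; open weights count as Free."""
    if open_weights:
        return "Free"
    if usd_per_m_out is None:
        raise ValueError("a price is required for models without open weights")
    if usd_per_m_out < 1:
        return "Low"
    if usd_per_m_out < 5:  # assumption: exactly $5 is treated as High
        return "Moderate"
    return "High"
```

Under this reading an open-weight model such as this one lands in Free regardless of any token price, which matches the Cost label shown for it.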

Why it matters

Stable Audio 2.0 is the open baseline that makes the closed audio-AI category contestable. Without it, Suno, Udio, and ElevenLabs would have unchallenged pricing power.

Core Capabilities

Audio
Speech, music, or other audio understanding/synthesis.
Generative
Produces images, video, audio, or other media.
Multimodal
Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API
Not available
Product / App
Not available
Open Source
Released
Enterprise

Pricing Model

Free / self-host
Open weights — pay only for compute.

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

  • Quality: 5.0 (placeholder, no data reported)
  • Speed: 5.0 (placeholder, no data reported)
  • Control: 5.0 (placeholder, no data reported)
  • Consistency: 5.0 (placeholder, no data reported)

Chart legend:
  • Lower 20%: the 20th percentile; 20% of models score below this.
  • This model: where the current model lands.
  • Upper 80%: the 80th percentile; only 20% of models score above this.

Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
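
The percentile boundaries described above could be computed along the following lines. This is a sketch under stated assumptions: the page does not say which interpolation method it uses, so the choice of Python's `statistics.quantiles` with the "inclusive" method is an assumption, and the function names are hypothetical.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def percentile_band(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the (20th, 80th) percentile of the reported 0-10 scores."""
    # n=5 yields four cut points, at 20%, 40%, 60% and 80%.
    cuts = quantiles(scores, n=5, method="inclusive")
    return cuts[0], cuts[3]


def position(score: float, scores: Sequence[float]) -> str:
    """Place one model's score relative to the middle 60% band."""
    p20, p80 = percentile_band(scores)
    if score < p20:
        return "below lower 20%"
    if score > p80:
        return "above upper 80%"
    return "middle 60%"
```

Only models that report the underlying benchmark should be passed in as `scores`. A placeholder value such as 5.0 would be positioned by `position` but should not be included when computing the band.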

What it feels like

  • Stability's audio diffusion sibling to Stable Diffusion, with the same lab heritage.
  • That sibling, Stable Diffusion, was the first text-to-image model with DALL·E 2-class quality and permissive open weights.
  • Its latent diffusion approach (denoising in a compressed latent space rather than pixel space) made consumer-GPU inference viable.
  • Stable Diffusion ran on under 8 GB of VRAM at release, arguably making it the first generative model regular people could use locally.
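
The latent-diffusion point above can be made concrete for audio with some back-of-the-envelope arithmetic. The track length, sample rate, and channel count come from the summary at the top of this page. The downsampling factor and latent channel count are hypothetical values chosen only for illustration, not confirmed settings of this model.

```python
import math

# From the page summary: 3-minute, 44.1 kHz, stereo output.
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2
DURATION_S = 180

# Hypothetical autoencoder settings (illustrative assumptions only).
DOWNSAMPLE_FACTOR = 2048
LATENT_CHANNELS = 64


def raw_values(duration_s: int) -> int:
    """Number of waveform values the model would denoise without compression."""
    return duration_s * SAMPLE_RATE_HZ * CHANNELS


def latent_values(duration_s: int) -> int:
    """Number of values after compressing time by DOWNSAMPLE_FACTOR."""
    frames = math.ceil(duration_s * SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR)
    return frames * LATENT_CHANNELS


if __name__ == "__main__":
    raw = raw_values(DURATION_S)
    latent = latent_values(DURATION_S)
    print(f"raw waveform values: {raw:,}")
    print(f"latent values:       {latent:,}")
    print(f"reduction factor:    {raw / latent:.1f}x")
```

With these assumed settings the diffusion process works over roughly 64 times fewer values than the raw waveform, which is the same reason latent diffusion made image generation practical on consumer GPUs.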

Best use cases

  • Self-hosted audio generation pipelines (privacy / volume / customisation)
  • Custom-style fine-tuning, in the manner of LoRA / textual inversion / Dreambooth for image models
  • ControlNet-style guided generation requiring weight access

Not ideal for

  • Out-of-the-box photorealistic aesthetics (Midjourney is still the default for that)
  • Reliable in-image text rendering (FLUX.2 and later models leapfrogged here)
