IMAGE MODEL Stability AI

Stable Diffusion

Open-Weight Latent Diffusion for Images

An image generator released in August 2022 with open weights and a permissive license (CreativeML OpenRAIL-M). Within weeks it was running on consumer graphics cards around the world. It was the first generative AI model with quality comparable to closed competitors (DALL·E 2, Midjourney) that anyone could download, modify, and commercialize without asking permission.

Cost
Free
Open weights — self-host
How are Intelligence, Speed & Cost bucketed?
Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).
Intelligence
  • Top 1%: ≤ 1%
  • Top 5%: ≤ 5%
  • Top 10%: ≤ 10%
  • Good: ≤ 25%
  • Medium: ≤ 50%
  • Below avg: > 50%
Speed
  • Top 1%: ≥ 345 tok/s
  • Top 5%: ≥ 237 tok/s
  • Top 10%: ≥ 196 tok/s
  • Good: ≥ 146 tok/s
  • Medium: ≥ 90 tok/s
  • Slow: < 90 tok/s
Cost
  • Free: open weights · self-host
  • Low: < $1 / M out
  • Moderate: $1–5 / M out
  • High: ≥ $5 / M out

Why it matters

Stable Diffusion is to image generation what Llama became for text generation a year later. It proved that open-weight models could ship at frontier quality, a precedent that now shapes every conversation about LLM regulation, competitiveness with China, and the viability of closed-model moats.

Core Capabilities

Generative
Produces images, video, audio, or other media.
Vision
Understands images, scenes, and visual context.
Multimodal
Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API
Not available
Product / App
Available
Open Source
Released
Enterprise
Contact sales

Pricing Model

Free / self-host
Open weights — pay only for compute.
Self-host
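
"Pay only for compute" reduces to simple arithmetic: GPU rental price multiplied by the time each image takes. The sketch below shows that calculation. The function names are hypothetical, and the price and latency in the usage line are placeholder inputs for illustration, not measured figures for Stable Diffusion.

```python
def cost_per_image(gpu_usd_per_hour: float, seconds_per_image: float) -> float:
    """Compute cost of one image on a rented GPU, in US dollars."""
    if gpu_usd_per_hour < 0 or seconds_per_image < 0:
        raise ValueError("price and latency must be non-negative")
    return gpu_usd_per_hour * seconds_per_image / 3600.0


def cost_per_thousand_images(gpu_usd_per_hour: float, seconds_per_image: float) -> float:
    """Same calculation scaled to a batch of 1,000 images."""
    return 1000.0 * cost_per_image(gpu_usd_per_hour, seconds_per_image)


# Placeholder inputs: substitute your own GPU price and measured latency.
example = cost_per_thousand_images(gpu_usd_per_hour=0.50, seconds_per_image=4.0)
print(f"${example:.2f} per 1,000 images")
```

The estimate ignores idle time, storage, and engineering effort, so treat it as a lower bound on the real cost of self-hosting.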

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Quality: 5.0 (placeholder; no data reported)
Speed: 5.0 (placeholder; no data reported)
Control: 5.0 (placeholder; no data reported)
Consistency: 5.0 (placeholder; no data reported)

  • Lower 20%: 20th percentile; 20% of models score below this
  • This model: where the current model lands
  • Upper 80%: 80th percentile; only 20% of models score above this

Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

  • First text-to-image model with DALL·E 2-class quality and permissive open weights
  • Latent diffusion innovation — denoising in compressed latent space, not pixel space — made consumer-GPU inference viable
  • Ran on <8GB VRAM at release — first generative model regular people could use locally
  • Spawned an enormous ecosystem (LoRAs, ControlNet, AUTOMATIC1111, ComfyUI) within months
  • Triggered the 'AI art' cultural moment of late 2022 along with Midjourney
  • Lower default photorealism than DALL·E 2 / Midjourney v3 — but customisability completely flipped the trade-off
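
The saving from denoising in latent space rather than pixel space can be made concrete with a little shape arithmetic. The sketch below assumes the Stable Diffusion v1-series configuration: an autoencoder that downsamples each side by a factor of 8 and produces 4 latent channels. Other versions may differ, and the function names are hypothetical.

```python
from math import prod


def latent_shape(height: int, width: int, downsample: int = 8, latent_channels: int = 4):
    """Shape (channels, height, width) of the latent for an RGB image.

    Defaults assume the v1-series autoencoder: 8x downsampling, 4 channels.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height: int, width: int, downsample: int = 8, latent_channels: int = 4) -> float:
    """How many times fewer values the denoiser processes in latent space."""
    pixel_values = 3 * height * width  # RGB image
    latent_values = prod(latent_shape(height, width, downsample, latent_channels))
    return pixel_values / latent_values


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, every denoising step works on 48 times fewer values than a pixel-space model at 512×512 would, which is the main reason memory use and latency fit on consumer GPUs.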

Best use cases

  • Self-hosted image generation pipelines (privacy / volume / customisation)
  • Custom-style fine-tuning via LoRA / textual inversion / Dreambooth
  • ControlNet-style guided generation requiring weight access
  • Research and academic experimentation on diffusion models
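
Fine-tuning via LoRA depends on having the weights, because it adds a low-rank update to existing weight matrices. The pure-Python sketch below shows the general LoRA idea, W' = W + (alpha / r) · B·A, on a toy matrix, along with the parameter saving. It is not the code of any particular training tool, and the helper names and the 768×768 layer size are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the weight with the adapter merged in."""
    scale = alpha / rank
    delta = matmul(b, a)  # same shape as W: (d_out x r) @ (r x d_in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_counts(d_out, d_in, rank):
    """(full fine-tune parameters, LoRA adapter parameters) for one matrix."""
    return d_out * d_in, rank * (d_out + d_in)


# Toy example: a 2x3 frozen weight with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]     # d_out x r
a = [[0.5, 0.0, 1.0]]  # r x d_in
print(lora_merge(w, b, a, alpha=2.0, rank=1))  # [[2.0, 0.0, 2.0], [2.0, 1.0, 4.0]]

# Illustrative layer size (an assumption, not taken from the model config).
print(lora_param_counts(768, 768, rank=4))     # (589824, 6144)
```

Only B and A are trained and shipped, which is why LoRA files are small enough to share widely while the base checkpoint stays unchanged.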

Not ideal for

  • Out-of-the-box photorealistic aesthetics (Midjourney is still the default for that)
  • Reliable in-image text rendering (FLUX.2 and later models leapfrogged here)
  • Production deployments on the base weights alone, without a community fine-tuned checkpoint or LoRA and its model card

Model Evolution

Rombach, R. · Blattmann, A. · Lorenz, D. · Esser, P. · Ommer, B.