AUDIO MODEL · OpenAI

Sora 2

Native Audio + Improved Physics

OpenAI's September 2025 Sora successor — the first version with native synchronized audio generation (sound effects, speech, ambient audio matched to visuals) and substantially improved physical consistency (objects fall correctly, fluids flow plausibly, characters maintain identity across cuts). Distributed initially through a TikTok-style consumer app where users could generate and remix short AI videos with friends.

Why it matters

Sora 2 represents the moment AI video moved from research demo to ambient consumer product. The downstream implications — for content moderation, deepfakes, advertising labor markets, and the social-media platform landscape — are still being absorbed in 2026.

Core Capabilities

Generative
Produces images, video, audio, or other media.
Audio
Speech, music, or other audio understanding/synthesis.
Multimodal
Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API
Not available
Product / App
Available
Open Source
Not released
Enterprise
Contact sales

Pricing Model

Subscription
Bundled inside the host product.

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Legend: Lower 20% = 20th percentile (20% of models score below this). Upper 80% = 80th percentile (only 20% of models score above this). This model = where the current model lands.

Quality: 5.0 (placeholder; no data reported)
Speed: 5.0 (placeholder; no data reported)
Control: 5.0 (placeholder; no data reported)
Consistency: 5.0 (placeholder; no data reported)

Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
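As a rough illustration of how the 20th and 80th percentile boundaries could be derived, here is a minimal Python sketch. The page does not say which interpolation method it uses, so the "inclusive" method is an assumption, and the score list and the helper names `percentile_band` and `placement` are hypothetical.

```python
from statistics import quantiles


def percentile_band(scores, low=20, high=80):
    """Return the (low, high) percentile cut points for a list of scores.

    Assumption: linear interpolation with the "inclusive" method; the
    page does not state how its boundaries are interpolated.
    """
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return cuts[low - 1], cuts[high - 1]


def placement(score, band):
    """Describe where a score falls relative to the middle band."""
    lower, upper = band
    if score < lower:
        return "below the middle 60%"
    if score > upper:
        return "above the middle 60%"
    return "inside the middle 60%"


# Hypothetical 0-10 scores for models that report a given benchmark.
scores = [2.0, 3.5, 4.0, 5.0, 5.5, 6.0, 7.0, 7.5, 8.0, 9.0, 9.5]
band = percentile_band(scores)
print(band, placement(5.0, band))
```

Only models that report the underlying benchmark would go into `scores`, so a placeholder value such as 5.0 would be compared against the band rather than contributing to it.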

What it feels like

  • First Sora generation with native synchronized audio — speech, sound effects, ambient soundscape from one model
  • More realistic physics: a missed basketball shot rebounds off the backboard; objects respect buoyancy and rigidity
  • Olympic gymnastics, paddleboard backflips, ice-skating triple axels — motion that prior systems couldn't render
  • Cameo feature: insert real team members from a reference video into any generated scene with accurate voice
  • NYT called the September 2025 launch 'jaw-dropping (for better and worse)' — TikTok-style social app launched alongside
  • Per OpenAI's own framing: errors look like mistakes of the implicit agent, not the model — failure modes are more 'physical' than 'glitch'

Best use cases

  • Short-form social video generation (the Sora app's whole purpose)
  • Storyboards / previz where physical accuracy matters more than fine-grained creative control
  • Custom-character video using cameo for personalisation
  • Sound + video pipelines that previously needed separate models stitched together

Not ideal for

  • Frame-perfect creative control — no full keyframe editing the way Runway / Kling offer
  • Long-form (>60s) coherent narrative — best on short clips
  • Production work where IP ownership / training-data provenance matters legally
