NVIDIA Cosmos-Predict 2.5
A video diffusion model trained on 200M curated clips and designed for physical AI rather than entertainment: robots, autonomous vehicles, and simulators. Given a text prompt, image, or seed video, it predicts how a scene evolves under physically plausible dynamics. The 2.5 release unifies Text2World, Image2World, and Video2World in one model, with action-conditioned variants for robotics policies.
Why it matters
Cosmos-Predict 2.5 marks the point where world models stopped being a research curiosity and became deployable infrastructure for robotics. The Cosmos family had 2M+ downloads by January 2026, along with partnerships with most major robotics labs. The split between "video models for entertainment" (Sora, Veo) and "video models for embodied AI" (Cosmos, Genie) crystallized around this release.
Core Capabilities
Context Window
Context window not disclosed.
Best use cases
- Robot policy training (RoboCasa, LIBERO benchmarks) (NVIDIA)
- Autonomous driving simulation (NVIDIA)
Not ideal for
- Turnkey hosted reliability (you’ll need deployment/ops).
- Text-heavy reasoning and coding workloads (use an LLM).