AUDIO MODEL May 2025 Google/DeepMind Last updated: Apr 29, 2026

Veo 3

Google's Native-Audio Video Model

Google's third-generation text-to-video model, released in May 2025 and integrated into Gemini and Google Vertex AI by that summer. Headline capability: native audio (speech, music, sound effects) generated in sync with the video. Veo 3 shipped this 4 months before Sora 2, which added native audio in September 2025.

Why it matters

Veo 3 demonstrated that Google's model labs could ship at the frontier of a major modality on Google's own schedule rather than in response to OpenAI. For longer-term competitive positioning, that ability to ship matters more than the specific Veo 3 release.

Core Capabilities

Generative

Produces images, video, audio, or other media.

Audio

Speech, music, or other audio understanding/synthesis.

Multimodal

Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API

Available

Product / App

Available

Open Source

Not released

Enterprise

Contact sales

Pricing Model

Pay per token

Input and output billed separately.

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Quality: 5.0 (no data reported; placeholder)
Speed: 5.0 (no data reported; placeholder)
Control: 5.0 (no data reported; placeholder)
Consistency: 5.0 (no data reported; placeholder)

Lower 20%: 20th percentile; 20% of models score below this.
This model: where the current model lands.
Upper 80%: 80th percentile; only 20% of models score above this.
Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

First Google video model with native synced audio: dialogue, effects, and ambient soundscape from one model
Sound is generated from the raw pixels of the video itself, not stitched on as a post-process
Lip-syncing and dubbing are the standout strengths, giving realistic conversation in generated scenes
Clips of up to 8s in 1080p, with 4K in some configurations; covers both draft and finished-quality use cases
Audio sync works on the first attempt for only ~25% of generations, so expect to iterate
Scene composition and camera direction are best when prompts are detailed and structured
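
One way to keep prompts "detailed and structured" is to assemble them from named parts. The sketch below is only an illustration: the `ShotPrompt` class and its field names are hypothetical, not a Veo 3 prompt schema, and the rendered text is simply a plain string that could be pasted into any of the tools listed on this page.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical container for the parts of a detailed video prompt."""
    subject: str
    setting: str
    camera: str
    audio: str
    dialogue: str = ""  # optional spoken line

    def render(self) -> str:
        # Emit one labeled line per part so nothing is left implicit.
        parts = [
            f"Subject: {self.subject}",
            f"Setting: {self.setting}",
            f"Camera: {self.camera}",
            f"Audio: {self.audio}",
        ]
        if self.dialogue:
            parts.append(f'Dialogue: "{self.dialogue}"')
        return "\n".join(parts)


prompt = ShotPrompt(
    subject="a barista handing over a coffee",
    setting="small cafe at dawn, warm light",
    camera="slow push-in, shallow depth of field",
    audio="espresso machine hiss, quiet street ambience",
    dialogue="Here you go, careful, it's hot.",
)
text = prompt.render()
print(text)
```

Keeping camera and audio as separate fields makes it easier to change one aspect between iterations while holding the rest of the shot constant.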

Reviews: Google DeepMind — Veo product page ↗ · TechCrunch — Veo 3 generates videos and soundtracks ↗ · Cybernews — Google Veo 3 review ↗

Best use cases

Cinematic short-form content where audio integration matters (ads, trailers, social)
Character dialogue scenes — best lip-sync of any major video model at release
Storyboard-to-finished-shot pipelines via Vertex AI for studio workflows
Replacing two-model stacks (video generation plus voice generation) with a single model

Tools to try

Gemini app · AI Studio · Vertex AI

Not ideal for

Long-form (>8s) coherent narrative shots
Frame-perfect creative control — limited keyframe editing vs Runway / Kling
Workloads where the ~25% first-attempt audio-sync success rate is unacceptable
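
To judge whether the ~25% first-attempt figure is acceptable for a workload, it helps to turn it into an iteration budget. The sketch below assumes each generation is an independent attempt with the same success probability, which the source does not state; under that assumption the number of attempts follows a geometric distribution. The function names are illustrative only.

```python
import math


def expected_attempts(p: float) -> float:
    """Mean number of attempts until the first success (geometric distribution)."""
    return 1.0 / p


def prob_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def attempts_for_confidence(p: float, target: float) -> int:
    """Smallest n such that prob_at_least_one(p, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


p = 0.25  # ~25% first-attempt audio sync rate quoted above
print(expected_attempts(p))             # 4.0
print(prob_at_least_one(p, 4))          # 0.68359375
print(attempts_for_confidence(p, 0.9))  # 9
```

Under these assumptions a usable clip takes about four generations on average, and roughly nine attempts are needed to be 90% sure of getting at least one, so per-generation cost and latency should be multiplied accordingly.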