Gemini Robotics
VLA from DeepMind
DeepMind's vision-language-action (VLA) family: Gemini Robotics, Robotics-ER, On-Device, and Robotics 1.5/ER 1.5 (Sept 2025). Same Gemini multimodal brain, fine-tuned to output robot motor commands instead of text. Deployed on Boston Dynamics Atlas, Apptronik Apollo, and partner humanoid platforms with multi-embodiment Motion Transfer.
Why it matters
Established VLA as a real product category, not just research. Together with the Physical Intelligence π series, NVIDIA GR00T, and Figure Helix, it puts embodied AI at the same "early commercialization" point that LLMs reached in 2022.
Core Capabilities
Agent Workflows
Built for tool use and autonomous tasks.
Multimodal
Combines text, vision, and audio in one model.
Generative
Produces images, video, audio, or other media.
Audio
Speech, music, or other audio understanding/synthesis.
Context Window
Context window not disclosed.
Availability
API: Available
Product / App: Not available
Open Source: Not released
Enterprise: Contact sales
Pricing Model
Pay per token
Input and output billed separately.
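As a rough illustration of what "input and output billed separately" means in practice, the sketch below adds two independent charges, one for input tokens and one for output tokens. The rates, the `TokenRates` type, and the `request_cost` helper are hypothetical placeholders; no actual prices for this model are given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical per-million-token prices in USD (not published figures)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, rates: TokenRates) -> float:
    """Return the cost of one request, charging input and output separately."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * rates.input_per_million
    output_cost = output_tokens / 1_000_000 * rates.output_per_million
    return input_cost + output_cost


# Placeholder rates chosen only for illustration; substitute the real price sheet.
EXAMPLE_RATES = TokenRates(input_per_million=1.0, output_per_million=4.0)
example_cost = request_cost(2_000_000, 500_000, EXAMPLE_RATES)
```

Because the two rates are independent, a workload that sends large inputs and receives short outputs can cost very differently from one with the opposite shape, even at the same total token count.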
What it feels like
- Vision-language-action (VLA) model from Google DeepMind; see the linked sources below for benchmark and review coverage
- Tool-use and agent loops are the typical fit per the published model card
- Vision and multimodal tasks are the typical fit per the published model card
- Audio synthesis or transcription per the published model card
Best use cases
- Agent / tool-use workflows that match the model's published benchmarks
- Vision tasks (charts, documents, images) per the model card
- Audio synthesis / transcription tasks per the model card
- See the model spec and sources block for benchmarked use cases
Not ideal for
- Tasks far outside the modalities listed in this model's spec
- Workflows where a more recent successor in the same family scores higher