AUDIO MODEL · Google/DeepMind

Gemini 1

Google's Native Multimodal Frontier

Google's first model family launched under the Gemini brand, designed natively for text, images, audio, and video as a single integrated system. Released in three sizes: Ultra (flagship), Pro (balanced, deployed in Bard), and Nano (1.8B and 3.25B parameters, on Pixel devices). Marketed as the first model to surpass GPT-4 on MMLU, though under contested benchmarking conditions: the headline Ultra score used a chain-of-thought, 32-sample setup (CoT@32), while GPT-4's reported figure was 5-shot.

Intelligence: Good
Speed: Top 5% (296 tok/s output)
Cost: Moderate ($0.25 in / $1.50 out)
Context: 32K (up to 32,768 tokens)
How are Intelligence, Speed & Cost bucketed?
Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).
Intelligence
  • Top 1%: ≤ 1%
  • Top 5%: ≤ 5%
  • Top 10%: ≤ 10%
  • Good: ≤ 25%
  • Medium: ≤ 50%
  • Below avg: > 50%
Speed
  • Top 1%: ≥ 345 tok/s
  • Top 5%: ≥ 237 tok/s
  • Top 10%: ≥ 196 tok/s
  • Good: ≥ 146 tok/s
  • Medium: ≥ 90 tok/s
  • Slow: < 90 tok/s
Cost
  • Free: open weights · self-host
  • Low: < $1 / M out
  • Moderate: $1–5 / M out
  • High: ≥ $5 / M out
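The bucketing rules above can be sketched as two small lookup functions. This is only an illustration of the listed thresholds, not the site's actual implementation; the function names `speed_bucket` and `cost_bucket` and the `open_weights` flag are hypothetical.

```python
def speed_bucket(tok_per_s: float) -> str:
    """Map output throughput (tokens/s) to the Speed bucket listed above."""
    # Thresholds copied from the Speed list, checked from fastest to slowest.
    thresholds = [
        (345, "Top 1%"),
        (237, "Top 5%"),
        (196, "Top 10%"),
        (146, "Good"),
        (90, "Medium"),
    ]
    for floor, label in thresholds:
        if tok_per_s >= floor:
            return label
    return "Slow"


def cost_bucket(usd_per_m_out: float, open_weights: bool = False) -> str:
    """Map output-token price ($ per million tokens) to the Cost bucket."""
    # ASSUMPTION: "Free" is decided by an open-weights flag, not by price.
    if open_weights:
        return "Free"
    if usd_per_m_out < 1:
        return "Low"
    if usd_per_m_out < 5:
        return "Moderate"
    return "High"


print(speed_bucket(296))  # Top 5%
print(cost_bucket(1.50))  # Moderate
```

With this card's figures (296 tok/s, $1.50 per million output tokens) the functions reproduce the "Top 5%" and "Moderate" labels shown at the top.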

Why it matters

Gemini 1 was less a technical breakthrough than a strategic reset: Google consolidated its AI efforts under Demis Hassabis after the April 2023 merger of Google Brain and DeepMind, and Gemini was the first major product of that consolidation.

Core Capabilities

Long Documents
Handles entire codebases, books, and multi-doc RAG.
Multimodal
Combines text, vision, and audio in one model.
Generative
Produces images, video, audio, or other media.
Agent Workflows
Built for tool use and autonomous tasks.

Context Window

32k tokens
≈ 25 pages
4k Chat
32k This model
128k Books
400k Multi-doc
1M Codebase
10M
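A quick way to reason about the 32,768-token limit is a pre-flight check before sending a prompt. The sketch below is an assumption-heavy estimate: the 4-characters-per-token ratio is a rough rule of thumb for English text, not the model's real tokenizer, and the names `estimate_tokens`, `fits_in_context`, and `reserved_output_tokens` are hypothetical.

```python
CONTEXT_LIMIT = 32_768   # tokens, taken from this card
CHARS_PER_TOKEN = 4      # ASSUMPTION: rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer will give different counts."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(prompt: str, reserved_output_tokens: int = 1_024) -> bool:
    """Check whether the prompt plus a reserved reply budget stays within the limit."""
    # ASSUMPTION: the reply budget is counted against the same window.
    return estimate_tokens(prompt) + reserved_output_tokens <= CONTEXT_LIMIT


short_prompt = "Summarise this paragraph. " * 100
print(fits_in_context(short_prompt))  # True
```

For anything close to the limit, counting tokens with the provider's own tokenizer is the safer choice; the heuristic is only meant for a first-pass sanity check.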

Availability

API
Available
Product / App
Available
Open Source
Not released
Enterprise
Contact sales

Pricing Model

Pay per token
Input and output billed separately.
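Because input and output are billed separately, a per-request cost is the sum of two products. The sketch below uses the prices shown on this card and assumes they are quoted per million tokens (the card's cost buckets use $/M out); `request_cost` is a hypothetical helper name.

```python
INPUT_USD_PER_M = 0.25   # from this card; ASSUMED to be per million input tokens
OUTPUT_USD_PER_M = 1.50  # from this card; ASSUMED to be per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request, billing input and output separately."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# A near-full-context request: 30,000 tokens in, 2,000 tokens out.
print(f"${request_cost(30_000, 2_000):.4f}")  # $0.0105
```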

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Values per row, in order: Lower 20% / Upper 80% / This model
Quality (AA Intelligence Index · scaled to 10): 1.7 / 5.6 / 4.8
Speed (output throughput · log-scaled): 10.0
Cost efficiency (input price in $/M tokens · cheaper scores higher): 6.2 / 10.0 / 10.0
Consistency (no data reported · placeholder): 5.0
  • Lower 20%: the 20th percentile; 20% of models score below this.
  • This model: where the current model lands.
  • Upper 80%: the 80th percentile; only 20% of models score above this.
Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
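The percentile band described here can be illustrated with a small helper that finds the 20th and 80th percentiles of a set of scores and reports where one score falls. The card does not say which interpolation method it uses, so linear interpolation is an assumption, and the sample scores and function names are hypothetical.

```python
def percentile(values, p):
    """Percentile with linear interpolation between ranks, p in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores reported")
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def band_position(score, all_scores):
    """Place a score relative to the middle 60% (20th to 80th percentile)."""
    lower = percentile(all_scores, 20)
    upper = percentile(all_scores, 80)
    if score < lower:
        return "below the middle 60%"
    if score > upper:
        return "above the middle 60%"
    return "within the middle 60%"


# Hypothetical 0-10 scores for six models, not real benchmark data.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(percentile(scores, 20), percentile(scores, 80))  # 2.0 5.0
print(band_position(4.8, scores))                      # within the middle 60%
```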

What it feels like

  • Original Gemini 1: historical, superseded by 1.5 and later; its own context window tops out at 32,768 tokens.
  • Its successor, Gemini 1.5 Pro, was the first model where 1M-token context was a real product feature, not a benchmark headline.
  • Gemini 1.5 Pro: 99% needle-in-a-haystack accuracy at 1M tokens; 99.2% even at 10M in research configurations.
  • Multi-needle recall on Gemini 1.5 Pro drops to ~60%: single-fact retrieval is solid, multi-fact retrieval is harder.

Best use cases

  • Whole-codebase analysis (30K+ lines) without tedious chunking pipelines
  • Document QA over 1,500-page PDFs / batches of 100 emails
  • Hour-long video summarisation and audio transcription QA

Not ideal for

  • Tasks requiring multi-fact retrieval across long contexts (recall drops sharply)
  • Pure short-context chat — Flash variants are cheaper and faster
