AUDIO MODEL · May 2024 · OpenAI · Last updated: Apr 29, 2026

GPT-4o

Natively Multimodal · OpenAI · Frontier

OpenAI's successor to GPT-4: a single unified model that handles text, audio, and images natively instead of chaining separate models together. The "o" in the name stands for "omni" and refers to this multimodal integration. It launched with a real-time voice mode designed to mimic natural conversation, including interruption handling and emotional tone, and it made flagship-tier capability free to consumer ChatGPT users for the first time.


Intelligence

Below avg

Speed

Medium

114 tok/s output

Cost

High

$2.50 in / $10.00 out

Context

128K

Up to 128,000 tokens

How are Intelligence, Speed & Cost bucketed?

Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).

Intelligence

Top 1%: ≤ 1%
Top 5%: ≤ 5%
Top 10%: ≤ 10%
Good: ≤ 25%
Medium: ≤ 50%
Below avg: > 50%

Speed

Top 1%: ≥ 345 tok/s
Top 5%: ≥ 237 tok/s
Top 10%: ≥ 196 tok/s
Good: ≥ 146 tok/s
Medium: ≥ 90 tok/s
Slow: < 90 tok/s

Cost

Free: open weights · self-host
Low: < $1 / M out
Moderate: $1–5 / M out
High: ≥ $5 / M out
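
The bucket rules above can be sketched as two small lookup functions. This is a minimal illustration, not the site's actual implementation: the thresholds are copied from the Speed and Cost lists, while the function names and the `open_weights` flag are hypothetical.

```python
# Speed floors (tok/s) copied from the Speed list above, highest first.
SPEED_BUCKETS = [
    (345, "Top 1%"),
    (237, "Top 5%"),
    (196, "Top 10%"),
    (146, "Good"),
    (90, "Medium"),
]


def speed_bucket(tok_per_s: float) -> str:
    """Return the first bucket whose floor the throughput meets."""
    for floor, label in SPEED_BUCKETS:
        if tok_per_s >= floor:
            return label
    return "Slow"


def cost_bucket(usd_per_m_out: float, open_weights: bool = False) -> str:
    """Bucket a model by its output-token price in $ per million tokens."""
    if open_weights:  # assumption: self-hostable models are labeled Free
        return "Free"
    if usd_per_m_out < 1:
        return "Low"
    if usd_per_m_out < 5:
        return "Moderate"
    return "High"


print(speed_bucket(114), cost_bucket(10.00))  # prints "Medium High"
```

With this model's figures (114 tok/s, $10.00 / M out) the sketch reproduces the Medium and High labels shown on the card.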


Why it matters

GPT-4o ended the era when "using the best AI" meant paying $20/mo. Once frontier capability became free with an email address, the user population expanded by an order of magnitude — reshaping regulatory scrutiny, labor-market debate, and education policy everywhere.

Core Capabilities

Long Documents

Handles entire codebases, books, and multi-doc RAG.

Multimodal

Combines text, vision, and audio in one model.

Generative

Produces images, video, audio, or other media.

Agent Workflows

Built for tool use and autonomous tasks.

Context Window

128k tokens

≈ 98 pages

4k Chat

32k Long docs

128k This model

400k Multi-doc

1M Codebase

10M
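
A rough way to check whether a document fits in the 128,000-token window is sketched below. The 4-characters-per-token ratio is an assumed heuristic for English text, not a real tokenizer, and the function names and the output-token reserve are hypothetical; an actual tokenizer would give a more reliable count.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # from the card above
CHARS_PER_TOKEN = 4              # assumption: rough heuristic, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, reserved_output_tokens: int = 4_000) -> bool:
    """True if the estimated prompt plus a reply reserve fits in the window."""
    return estimate_tokens(text) + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS


print(fits_in_context("a" * 400_000))  # ~100k tokens -> prints True
print(fits_in_context("a" * 600_000))  # ~150k tokens -> prints False
```

Reserving room for the reply reflects the assumption that prompt and output share the same window.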

Availability

API

Available

Product / App

Available

Open Source

Not released

Enterprise

Contact sales

Pricing Model

Pay per token

Input and output billed separately.
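
Because input and output are billed separately, a per-request estimate needs both token counts. The sketch below assumes the listed prices ($2.50 in / $10.00 out) are per million tokens, as the cost buckets suggest; the function name and the sample token counts are illustrative.

```python
INPUT_USD_PER_M = 2.50    # listed input price, assumed per million tokens
OUTPUT_USD_PER_M = 10.00  # listed output price, assumed per million tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Example: a 10,000-token prompt with a 2,000-token reply.
cost = estimate_cost(10_000, 2_000)
print(f"${cost:.4f}")  # prints "$0.0450"
```

In this example the output tokens account for almost half the bill despite being a fifth of the prompt's size, which is why the cost buckets are keyed off the output price.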


Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Quality

AA Intelligence Index · scaled to 10

Lower 20%: 1.7 · Upper 80%: 5.6 · This model: 2.7

Speed

Output throughput · log-scaled

10.0

Cost efficiency

Input price ($/M tokens) · cheaper scores higher

Lower 20%: 6.2 · Upper 80%: 10.0 · This model: 5.6

Consistency

No data reported · placeholder

5.0

Lower 20%: the 20th percentile; 20% of models score below this.
This model: where the current model lands.
Upper 80%: the 80th percentile; only 20% of models score above this.
Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
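
The percentile band described here can be sketched with Python's standard library. This is only an illustration of the idea: the page does not state which interpolation method it uses, so `method="inclusive"` is an assumption, and the function names are hypothetical.

```python
from statistics import quantiles


def percentile_band(scores):
    """Return the (20th, 80th) percentile boundaries of a list of 0-10 scores."""
    # n=5 yields four cut points: the 20th, 40th, 60th and 80th percentiles.
    cuts = quantiles(scores, n=5, method="inclusive")
    return cuts[0], cuts[-1]


def position(score, band):
    """Describe where a score falls relative to the middle 60% band."""
    lower, upper = band
    if score < lower:
        return "below the middle 60%"
    if score > upper:
        return "above the middle 60%"
    return "inside the middle 60%"


# Using the Quality and Cost efficiency figures shown on this card:
print(position(2.7, (1.7, 5.6)))   # prints "inside the middle 60%"
print(position(5.6, (6.2, 10.0)))  # prints "below the middle 60%"
```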

What it feels like

First end-to-end omni model — text, vision, audio share one neural net (not stitched pipelines)
2x faster, half the price, and 5x higher rate limits than GPT-4 Turbo
Native voice-to-voice latency averages around 320ms, close to human conversational response time (210ms)
Realtime API (Oct 2024) opened up always-on voice assistants for developers
Reasoning is solid but not o1-class — by 2025 it's the 'speed/cost tier', not the 'IQ tier'
GPT-4o mini (Jul 2024) became the GPT-3.5 Turbo replacement at much better quality

Reviews: OpenAI — Hello GPT-4o ↗ · TechCrunch — OpenAI debuts GPT-4o omni model ↗ · Wikipedia — GPT-4o ↗

Best use cases

Voice-first applications and real-time multimodal interfaces
Cost-sensitive bulk inference where GPT-4-class quality is enough
Image understanding workflows — strong vision pipeline at API price
Replacing GPT-3.5 Turbo deployments with GPT-4-tier quality at similar cost

Tools to try

ChatGPT · Codex CLI · Cursor · GitHub Copilot · Continue.dev

Not ideal for

Hard reasoning, math, or research — o1/o3/GPT-5 are the right tier
Frontier-leaderboard coding (Claude 3.5 Sonnet+ outscored GPT-4o on SWE-bench by mid-2024)
Self-hosted / open-weights workflows

Model Evolution

GPT (Generative Pre-trained Transformer) is OpenAI's main model family; GPT-4o is the member that added native audio.
