Grok 4
xAI's July 2025 reasoning flagship, available as Grok 4 and Grok 4 Heavy. The "Heavy" variant runs multiple reasoning agents in parallel and synthesizes their outputs, which produced a step change on the hardest benchmarks (Humanity's Last Exam, ARC-AGI v2). It was released about five months after Grok 3, showing that xAI can keep shipping at frontier cadence after the Colossus cluster build-out.
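The "run several agents in parallel, then synthesize" idea can be illustrated with a minimal sketch. xAI has not published how Grok 4 Heavy combines its agents, so everything here is an assumption for illustration: the `Agent` type, the `run_parallel` and `synthesize` names, and the use of a simple majority vote as the synthesis step are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical agent type: takes a prompt, returns a final answer string.
Agent = Callable[[str], str]


def synthesize(answers: List[str]) -> str:
    """Pick the most common answer; ties go to the answer seen first.

    Majority voting is only one possible synthesis rule, chosen here
    because it is easy to follow. It is not claimed to be xAI's method.
    """
    return Counter(answers).most_common(1)[0][0]


def run_parallel(prompt: str, agents: List[Agent]) -> str:
    """Run every agent on the same prompt concurrently, then combine."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(prompt), agents))
    return synthesize(answers)


# Toy stand-ins for independent reasoning runs that may disagree.
agents = [lambda p: "42", lambda p: "41", lambda p: "42"]
result = run_parallel("What is 6 * 7?", agents)
print(result)
```

The trade-off the sketch makes visible is cost: every extra agent is another full inference run, so accuracy gained through agreement is paid for in compute.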
How are Intelligence, Speed & Cost bucketed?
Intelligence
- Top 1%: ≤ 1%
- Top 5%: ≤ 5%
- Top 10%: ≤ 10%
- Good: ≤ 25%
- Medium: ≤ 50%
- Below avg: > 50%
Speed
- Top 1%: ≥ 345 tok/s
- Top 5%: ≥ 237 tok/s
- Top 10%: ≥ 196 tok/s
- Good: ≥ 146 tok/s
- Medium: ≥ 90 tok/s
- Slow: < 90 tok/s
Cost
- Free: open weights · self-host
- Low: < $1 / M out
- Moderate: $1–5 / M out
- High: ≥ $5 / M out
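The legend above amounts to three threshold lookups, which a short sketch can make explicit. The function names are hypothetical, and two readings are assumptions: the Intelligence percentages are treated as a percentile rank where lower is better, and the "Free" bucket is represented by passing `None` as the price.

```python
from typing import Optional

# Thresholds copied from the legend; checked in order, first match wins.
INTELLIGENCE_BUCKETS = [(1, "Top 1%"), (5, "Top 5%"), (10, "Top 10%"),
                        (25, "Good"), (50, "Medium")]
SPEED_BUCKETS = [(345, "Top 1%"), (237, "Top 5%"), (196, "Top 10%"),
                 (146, "Good"), (90, "Medium")]


def intelligence_bucket(percentile_rank: float) -> str:
    """percentile_rank of 1.0 means 'in the top 1%' (assumed reading)."""
    for upper, label in INTELLIGENCE_BUCKETS:
        if percentile_rank <= upper:
            return label
    return "Below avg"


def speed_bucket(tokens_per_second: float) -> str:
    """Bucket by output speed in tokens per second."""
    for lower, label in SPEED_BUCKETS:
        if tokens_per_second >= lower:
            return label
    return "Slow"


def cost_bucket(usd_per_million_output: Optional[float]) -> str:
    """None stands for open weights that can be self-hosted (assumption)."""
    if usd_per_million_output is None:
        return "Free"
    if usd_per_million_output < 1:
        return "Low"
    if usd_per_million_output < 5:
        return "Moderate"
    return "High"


# The 75 tok/s figure quoted on this page falls below the 90 tok/s cutoff.
print(speed_bucket(75))
```

Under these thresholds, the 75 tok/s quoted for Grok 4 lands in the "Slow" bucket.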
Why it matters
Grok 4 Heavy sits at the frontier of reasoning benchmarks in late 2025 / early 2026: it is the model cited as the upper bound in debates over the hardest thing AI can do today. The open competitive question is whether the multi-agent inference architecture is durable or gets matched by single-model successors (Opus 4.7, GPT-5.5).
Capability / Performance
Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).
What it feels like
- Artificial Analysis (AA) Intelligence Index of 73 at release, the top score at the time, ahead of o3 (70), Gemini 2.5 Pro (70), Claude 4 Opus (64)
- GPQA Diamond all-time high of 88%; Humanity's Last Exam 24%; AIME 2024 94%; MMLU-Pro 87%
- Leads Coding Index (LiveCodeBench + SciCode) and Math Index (AIME24 + MATH-500)
- 256K-token context in API (128K in app); 75 tok/s — slower than o3 but faster than Claude 4 Opus Thinking
- Tool-integrated reasoning trained via large-scale RL with verifiable rewards
- Launched under controversy: concerns about favoritism toward Elon Musk's views and trust issues from earlier Grok 3 benchmark claims still weigh on reception
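The context and speed figures in the list above translate into simple planning arithmetic. The sketch below is a rough estimate only: it reads "256K" and "128K" as 256,000 and 128,000 tokens, assumes output tokens count against the same window as the prompt, and uses hypothetical helper names.

```python
API_CONTEXT_TOKENS = 256_000  # "256K" read as 256,000; the exact limit may differ
APP_CONTEXT_TOKENS = 128_000  # "128K" read as 128,000
TOKENS_PER_SECOND = 75        # output speed quoted in the list above


def generation_seconds(output_tokens: int,
                       tok_per_s: float = TOKENS_PER_SECOND) -> float:
    """Estimated time to stream the output, ignoring time to first token."""
    return output_tokens / tok_per_s


def fits(prompt_tokens: int, output_tokens: int, context_tokens: int) -> bool:
    """True if prompt plus output stay within the context window (assumed shared)."""
    return prompt_tokens + output_tokens <= context_tokens


# A 3,000-token answer at 75 tok/s takes 3000 / 75 = 40 seconds.
print(generation_seconds(3_000))
# A 150,000-token prompt fits the API window but not the app window.
print(fits(150_000, 4_000, API_CONTEXT_TOKENS))
print(fits(150_000, 4_000, APP_CONTEXT_TOKENS))
```

Reasoning models also spend tokens on hidden reasoning before the visible answer, so real wall-clock time will usually be longer than this estimate.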
Best use cases
- Math, science, and reasoning research where Grok 4 leads on raw benchmark numbers
- Real-time information lookup inside the X (Twitter) product with full timeline access
- Tool-using agents that benefit from extended reasoning + verifiable-rewards training
- Less-restricted prompting than ChatGPT/Claude defaults
Not ideal for
- Customer-facing products where political-bias risk is a brand concern
- Workflows requiring system-prompt-free behaviour — RL tuning shows different defaults depending on scaffold
- Self-hosted or air-gapped deployments — closed weights, hosted only via X / API
Model Evolution
Grok is xAI's language model family.