Grok 3
xAI's third-generation flagship model, released in February 2025 after xAI built the largest known single-site GPU training cluster (Colossus, 200,000 H100s in Memphis) in under 12 months. Its benchmark scores approached or matched contemporaries such as GPT-4o, Claude 3.7, and Gemini 2.5 Pro, and it was integrated into X (Twitter) as the default conversational AI for X Premium subscribers.
How are Intelligence, Speed & Cost bucketed?
Intelligence
- Top 1%: ≤ 1%
- Top 5%: ≤ 5%
- Top 10%: ≤ 10%
- Good: ≤ 25%
- Medium: ≤ 50%
- Below avg: > 50%
Speed
- Top 1%: ≥ 345 tok/s
- Top 5%: ≥ 237 tok/s
- Top 10%: ≥ 196 tok/s
- Good: ≥ 146 tok/s
- Medium: ≥ 90 tok/s
- Slow: < 90 tok/s
Cost
- Free: open weights · self-host
- Low: < $1 / M out
- Moderate: $1–5 / M out
- High: ≥ $5 / M out
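The thresholds above can be read as simple lookup rules. The following is a minimal Python sketch, not the site's actual implementation. It assumes the first group of tiers applies to Intelligence (as a rank from the top, in percent), the second to Speed, and the third to Cost, and that open weights always map to "Free". All function names are hypothetical.

```python
from typing import Optional


def intelligence_bucket(rank_from_top_pct: float) -> str:
    """Map a rank from the top in percent (1.0 = top 1%) to a tier label."""
    tiers = [(1, "Top 1%"), (5, "Top 5%"), (10, "Top 10%"), (25, "Good"), (50, "Medium")]
    for upper_limit, label in tiers:
        if rank_from_top_pct <= upper_limit:
            return label
    return "Below avg"


def speed_bucket(tokens_per_second: float) -> str:
    """Map output throughput in tokens per second to a tier label."""
    tiers = [(345, "Top 1%"), (237, "Top 5%"), (196, "Top 10%"), (146, "Good"), (90, "Medium")]
    for lower_limit, label in tiers:
        if tokens_per_second >= lower_limit:
            return label
    return "Slow"


def cost_bucket(usd_per_million_out: Optional[float], open_weights: bool = False) -> str:
    """Map the price per million output tokens (USD) to a tier label."""
    if open_weights:
        # Assumption: self-hostable open weights count as "Free" regardless of API price.
        return "Free"
    if usd_per_million_out is None:
        raise ValueError("a price is required for models without open weights")
    if usd_per_million_out < 1:
        return "Low"
    if usd_per_million_out < 5:
        return "Moderate"
    return "High"
```

The boundaries follow the legend literally: 90 tok/s is "Medium" rather than "Slow", and exactly $5 per million output tokens is "High" rather than "Moderate".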
Why it matters
Grok 3 demonstrated that a startup founded in 2023 could reach frontier capability in under 24 months, given sufficient capital and a willingness to build infrastructure at scale. The competitive implication is that the frontier-model market likely sustains 4-6 well-funded labs in 2026-2030 rather than consolidating to 2-3. Whether xAI's specific positioning as the "most pro-free-speech AI" converts to durable customer preference remains to be seen.
Capability / Performance
Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).
What it feels like
- First xAI model trained on the Colossus supercluster — 10x compute of Grok-2
- AIME 2025 93.3% and GPQA 84.6% — credibly frontier-tier on math and graduate science
- Chatbot Arena Elo 1402 at release — not a benchmark stunt, real human-preference signal
- TechCrunch flagged that xAI's released graphs omitted o3-mini-high's cons@64 score, sparking a debate over benchmark trust
- Trained with reinforcement learning at unprecedented scale to refine chain-of-thought
- Most useful inside the X (Twitter) product — outside that surface the alternatives feel more polished
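The cons@64 dispute mentioned above comes down to how answers are scored. cons@k (consensus at k) samples k answers per problem and grades only the most common one, which usually scores higher than grading a single sample. The sketch below illustrates the difference on made-up toy data with k = 4. The function names and data are hypothetical, and the tie-breaking rule (first answer seen wins) is an assumption, not a documented detail of any lab's evaluation harness.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def cons_at_k(samples_per_problem: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Fraction of problems whose majority-vote answer matches the reference."""
    hits = sum(
        consensus_answer(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return hits / len(references)


def mean_single_sample(samples_per_problem: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Expected accuracy when one sampled answer per problem is graded."""
    per_problem = [
        sum(answer == ref for answer in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, references)
    ]
    return sum(per_problem) / len(per_problem)


# Toy data: three problems, four sampled answers each (illustrative only).
toy_samples = [
    ["42", "42", "41", "42"],
    ["7", "8", "7", "7"],
    ["3", "5", "5", "9"],
]
toy_references = ["42", "7", "3"]

print(f"cons@4:        {cons_at_k(toy_samples, toy_references):.3f}")
print(f"single sample: {mean_single_sample(toy_samples, toy_references):.3f}")
```

On this toy data, majority voting gets two of three problems right, while the single-sample average is lower. This is why comparing one model's cons@64 figure against another model's single-sample figure is not a like-for-like comparison.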
Best use cases
- Research and scientific Q&A where chain-of-thought matters
- Real-time information retrieval inside the X platform with full timeline access
- Math and competition-style coding (LiveCodeBench 79.4%)
- Users who want a less-restricted model than ChatGPT/Claude defaults
Not ideal for
- Coding-agent leaderboards — Claude Opus 4.5 and DeepSeek V4 score higher on SWE-bench
- Workflows where benchmark transparency matters — release controversy is a real concern
- Multimodal-first applications — Grok 3 lags on image tasks vs Gemini 2.5 Pro / GPT-5
Model Evolution
Grok is xAI's large language model family.