Llama 4
Meta's first MoE-native Llama generation, announced in April 2025 in three sizes: Scout (109B total / 17B active parameters, 10M-token context), Maverick (400B total / 17B active), and Behemoth (~2T total, still in preview at launch). The weights are open, but the license is stricter than Llama 3's, with an expanded acceptable use policy. Reception was mixed: benchmark results fell short of the expectations set by internal Meta commentary, which had hyped Behemoth as "GPT-5 class."
How are Intelligence, Speed & Cost bucketed?
Intelligence
- Top 1%: ≤ 1%
- Top 5%: ≤ 5%
- Top 10%: ≤ 10%
- Good: ≤ 25%
- Medium: ≤ 50%
- Below avg: > 50%
Speed
- Top 1%: ≥ 345 tok/s
- Top 5%: ≥ 237 tok/s
- Top 10%: ≥ 196 tok/s
- Good: ≥ 146 tok/s
- Medium: ≥ 90 tok/s
- Slow: < 90 tok/s
Cost
- Free: open weights · self-host
- Low: < $1 / M out
- Moderate: $1–5 / M out
- High: ≥ $5 / M out
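The bucket thresholds above can be read as simple lookup rules. The sketch below is one possible reading, not the site's actual implementation: the function names are hypothetical, the intelligence input is assumed to be a percentile rank (lower is better), and treating any open-weights model as "Free" is an assumption based on the legend's wording.

```python
# Thresholds copied from the legend; everything else is illustrative.
INTELLIGENCE_BUCKETS = [(1, "Top 1%"), (5, "Top 5%"), (10, "Top 10%"),
                        (25, "Good"), (50, "Medium")]
SPEED_BUCKETS = [(345, "Top 1%"), (237, "Top 5%"), (196, "Top 10%"),
                 (146, "Good"), (90, "Medium")]


def intelligence_bucket(percentile_rank: float) -> str:
    """Map a percentile rank (assumed: 1 means top 1%) to a label."""
    for ceiling, label in INTELLIGENCE_BUCKETS:
        if percentile_rank <= ceiling:
            return label
    return "Below avg"


def speed_bucket(tokens_per_second: float) -> str:
    """Map output throughput in tok/s to a label."""
    for floor, label in SPEED_BUCKETS:
        if tokens_per_second >= floor:
            return label
    return "Slow"


def cost_bucket(usd_per_million_out, open_weights: bool = False) -> str:
    """Map price per million output tokens to a label.

    Assumption: self-hostable open-weights models are labelled "Free"
    regardless of any hosted API price.
    """
    if open_weights:
        return "Free"
    if usd_per_million_out < 1:
        return "Low"
    if usd_per_million_out < 5:
        return "Moderate"
    return "High"


if __name__ == "__main__":
    print(intelligence_bucket(7), speed_bucket(150), cost_bucket(0.6))
```

Under this reading the boundaries are inclusive where the legend uses ≤ or ≥, so exactly 345 tok/s counts as "Top 1%" and exactly $5 / M out counts as "High".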
Why it matters
Llama 4 illustrates that open-weight leadership is no longer a single-lab story. Meta's structural advantages (compute, data, reach) didn't translate to an obvious win in the generation after Llama 3 — pointing to either a methodological gap or a talent / strategy gap that Meta is now actively trying to close.
Capability / Performance
Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).
What it feels like
- Meta's first MoE-native Llama and first natively multimodal open-weights generation
- Maverick (17B active / 128 experts) was reported by Meta to beat GPT-4o and Gemini 2.0 Flash on most benchmarks at release
- Scout (17B active / 16 experts) fits on a single H100 (with Int4 quantization, per Meta) and has a claimed 10M-token context window
- Benchmark numbers landed under a cloud — community questioned whether the LMArena score reflected the open-weight checkpoint
- Real-world testers found gaps between announcement claims and independent reproduction
- Despite controversy, the open weights gave the ecosystem a viable post-Llama-3 baseline
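The single-H100 claim for Scout can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, under stated assumptions: every expert stays resident in GPU memory (so the 109B total, not the 17B active, sets the footprint), an H100 has 80 GB of memory, and KV cache, activations, and runtime overhead are ignored. The helper name is hypothetical.

```python
SCOUT_TOTAL_PARAMS = 109e9   # total parameters, from the description above
H100_MEMORY_GB = 80          # assumption: 80 GB H100 variant


def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
        needed = weight_memory_gb(SCOUT_TOTAL_PARAMS, bits)
        verdict = "fits" if needed <= H100_MEMORY_GB else "does not fit"
        print(f"{name}: ~{needed:.1f} GB of weights, {verdict} in {H100_MEMORY_GB} GB")
```

On these assumptions only the 4-bit case leaves headroom on one card. A long context would then compete with the weights for the remaining memory, so the 10M-token window and the single-GPU claim are unlikely to hold at the same time.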
Best use cases
- Self-hosted multimodal applications where API models can't go
- Long-context retrieval and document QA (especially Scout's 10M window)
- Fine-tuning on private data within the terms of Meta's Llama 4 Community License
- Cost-sensitive multimodal inference at scale
Not ideal for
- Frontier-leaderboard reasoning — Claude 4, GPT-5, DeepSeek R1 score higher
- Edge / single-consumer-GPU deployments (even Scout needs an H100-class GPU)
- Workflows where the controversy over the LMArena submission is a credibility risk