DINOv3
Meta's Self-Supervised Vision Foundation
Meta's August 2025 self-supervised vision encoder: a vision transformer trained on 1.7 billion images without any labels or captions. The model learns visual structure on its own, then transfers to detection, segmentation, depth, and feature-matching tasks. It is widely treated as a default backbone for computer-vision research in late 2025.
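To make the feature-matching use concrete, here is a minimal sketch of mutual-nearest-neighbor matching over patch embeddings from two images. It does not load DINOv3: the short vectors are toy stand-ins for the patch features a frozen encoder would return, and the helper names (`l2_normalize`, `cosine_similarity_matrix`, `mutual_nearest_neighbors`) are hypothetical, chosen only for illustration. Mutual nearest neighbors is one common matching heuristic, not necessarily the procedure used in Meta's own evaluations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(feats_a, feats_b):
    """Return sim[i][j] = cosine similarity between feats_a[i] and feats_b[j]."""
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(va, vb)) for vb in b] for va in a]


def mutual_nearest_neighbors(feats_a, feats_b):
    """Keep pairs (i, j) where j is i's best match and i is j's best match.

    Assumes both inputs are non-empty lists of equal-length vectors.
    """
    sim = cosine_similarity_matrix(feats_a, feats_b)
    best_b_for_a = [max(range(len(row)), key=row.__getitem__) for row in sim]
    best_a_for_b = [
        max(range(len(sim)), key=lambda i: sim[i][j])
        for j in range(len(feats_b))
    ]
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]


# Toy stand-ins for patch embeddings (assumption: real features would come
# from a frozen encoder and have hundreds of dimensions per patch).
patches_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
patches_b = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

matches = mutual_nearest_neighbors(patches_a, patches_b)
print(matches)  # [(0, 1), (1, 0)]
```

Patch 2 of the first image is dropped because its best candidate in the second image already prefers a different patch. The mutual check is there to filter out that kind of one-sided match.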
Cost
Free
Open weights — self-host
How are Intelligence, Speed & Cost bucketed?
Intelligence and Speed buckets are percentile ranks on Artificial Analysis. Cost buckets are fixed dollar thresholds keyed off output-token price ($/M out).
Intelligence
- Top 1%: ≤ 1%
- Top 5%: ≤ 5%
- Top 10%: ≤ 10%
- Good: ≤ 25%
- Medium: ≤ 50%
- Below avg: > 50%
Speed
- Top 1%: ≥ 345 tok/s
- Top 5%: ≥ 237 tok/s
- Top 10%: ≥ 196 tok/s
- Good: ≥ 146 tok/s
- Medium: ≥ 90 tok/s
- Slow: < 90 tok/s
Cost
- Free: open weights · self-host
- Low: < $1 / M out
- Moderate: $1–5 / M out
- High: ≥ $5 / M out
Why it matters
DINOv3 established that self-supervised pretraining at scale beats image-text contrastive pretraining (CLIP) on most non-zero-shot CV tasks. With SAM 3 (segmentation) and DINOv3 (general-purpose features), Meta now anchors the academic CV stack the way it anchored NLP with Llama.
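One common reading of "non-zero-shot" is that the backbone stays frozen and a small task-specific head is fit on its features using labeled examples. The sketch below illustrates that pattern with a nearest-centroid classifier. The two-dimensional vectors are toy stand-ins for frozen encoder features, the labels are invented, and the helpers `class_centroids` and `predict` are hypothetical names; this is not the evaluation protocol from the DINOv3 release.

```python
def class_centroids(features, labels):
    """Average the feature vectors of each class into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Toy stand-ins for frozen image-level features (assumption: in practice
# these would be computed once by the encoder and cached).
train_features = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = class_centroids(train_features, train_labels)
prediction = predict(centroids, [0.8, 0.3])
print(prediction)  # cat
```

Only the centroids are learned here; the features themselves never change. That is why this style of transfer is cheap compared with fine-tuning the whole backbone.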
Core Capabilities
Vision
Understands images, scenes, and visual context.
Research
Foundational paper or scientific contribution.
Context Window
Context window not disclosed.
Availability
API
Not available
Product / App
Not available
Open Source
Released
Enterprise
—
Pricing Model
Free / self-host
Open weights — pay only for compute.
Self-host
What it feels like
- Self-supervised vision encoder from Meta AI, trained without labels or captions; see the linked sources below for benchmark and review coverage
- Dense vision tasks (detection, segmentation, depth, feature matching) are the typical fit
Best use cases
- Vision tasks such as detection, segmentation, depth estimation, and feature matching
- See the model spec and sources block for benchmarked use cases
Not ideal for
- Tasks far outside the modalities listed in this model's spec
- Workflows where a more recent successor in the same family scores higher