CLIP
Contrastive Language-Image Pretraining
A model trained on 400 million image-caption pairs from the web that learned to match images to their descriptions. The breakthrough: it could classify images it had never been trained to classify, just by being told the candidate labels in plain English ("a photo of a cat" vs "a photo of a dog").
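The label-matching step described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the 3-dimensional embeddings are made-up stand-ins for what CLIP's image and text encoders would produce, `zero_shot_classify` is a hypothetical helper name, and the `logit_scale` of 100 is an assumed value for the temperature that CLIP learns during training.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    logit_scale is an assumed value; it sharpens the softmax over similarities.
    """
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical embeddings: real CLIP vectors have hundreds of dimensions
# and come from the image and text encoders, not from hand-written lists.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
image_emb = [0.8, 0.3, 0.1]

best_label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(best_label, [round(p, 3) for p in probs])
```

Adding a new category only requires embedding one more prompt; no classifier weights are retrained, which is what makes the approach "zero-shot".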
Cost
Free
Open weights — self-host
How are Intelligence, Speed & Cost bucketed?
Intelligence and Speed buckets are percentile ranks on
Artificial Analysis. Cost buckets are fixed dollar
thresholds keyed off output-token price ($/M out).
Intelligence
- Top 1%: ≤ 1%
- Top 5%: ≤ 5%
- Top 10%: ≤ 10%
- Good: ≤ 25%
- Medium: ≤ 50%
- Below avg: > 50%
Speed
- Top 1%: ≥ 345 tok/s
- Top 5%: ≥ 237 tok/s
- Top 10%: ≥ 196 tok/s
- Good: ≥ 146 tok/s
- Medium: ≥ 90 tok/s
- Slow: < 90 tok/s
Cost
- Free: open weights · self-host
- Low: < $1 / M out
- Moderate: $1–5 / M out
- High: ≥ $5 / M out
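As a rough sketch of how the Speed and Cost buckets listed above could be applied, the following maps a measured value to its bucket label. The function names are hypothetical, and the treatment of the $5 boundary (exactly $5 counts as High, so Moderate covers $1 up to but not including $5) is an assumption made to resolve the overlap in the legend. The speed thresholds are the ones shown on this page; because they are percentile-derived, they may change over time.

```python
# Speed thresholds in tokens/second, highest first, as listed in the legend.
SPEED_BUCKETS = [
    (345, "Top 1%"),
    (237, "Top 5%"),
    (196, "Top 10%"),
    (146, "Good"),
    (90, "Medium"),
]

def speed_bucket(tokens_per_second):
    """Return the first bucket whose lower bound the speed meets."""
    for lower_bound, label in SPEED_BUCKETS:
        if tokens_per_second >= lower_bound:
            return label
    return "Slow"

def cost_bucket(output_price_per_million=None, open_weights=False):
    """Bucket by output-token price in dollars per million tokens.

    Assumption: open-weights models are always "Free", and a price of
    exactly $5 falls in "High" (the legend lists it under both ranges).
    """
    if open_weights:
        return "Free"
    if output_price_per_million < 1:
        return "Low"
    if output_price_per_million < 5:
        return "Moderate"
    return "High"

print(cost_bucket(open_weights=True))  # CLIP's case: open weights, self-hosted
print(speed_bucket(150))
```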
Why it matters
Before CLIP, vision and language were mostly handled by separate, task-specific models. CLIP showed at scale that images and text can share a single representation space, an idea that underpins how models such as Claude, GPT-4o, and Gemini accept images today.
Core Capabilities
Vision
Understands images, scenes, and visual context.
Multimodal
Combines text and vision in one model.
Research
Foundational paper or scientific contribution.
Context Window
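Not a generative language model, so there is no conversational context window; the text encoder accepts up to 77 tokens per input.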
Availability
- API: Not available
- Product / App: Not available
- Open Source: Released
- Enterprise: —
Pricing Model
Free / self-host
Open weights — pay only for compute.
Capability / Performance
[Chart not reproduced: each capability score (0–10, higher is better) is plotted against the 20th and 80th percentiles of all models in the tree that report the underlying benchmark.]
What it feels like
- Zero-shot ImageNet accuracy of 76.2%: matches the original fully supervised ResNet-50 without training on any of ImageNet's 1.28 million labeled examples
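- Established the contrastive image-text pretraining recipe: many later multimodal systems, from text-to-image generators such as Stable Diffusion and DALL·E 2 to vision-language models such as GPT-4V, build on or were influenced by it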
- 400M (image, text) pairs from the open web: a widely cited demonstration of internet-scale natural-language supervision
- Open weights and inference code released alongside the paper; the training set and training code were not released, which helped motivate OpenCLIP, LAION, and the wider open-source vision-language ecosystem
- Used as the text-conditioning or image-text alignment component in Stable Diffusion, DALL·E 2, and many other generative models
- Robustness +5.3 pts over ResNet-50 across 27 datasets — strong out-of-distribution generalization
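The contrastive recipe mentioned in the list above can be illustrated with a small symmetric cross-entropy loss over a batch of matched pairs. This is a sketch under stated assumptions rather than the original training code: the function name is hypothetical, the 2-dimensional embeddings are toy values, inputs are assumed to be already L2-normalized, and the temperature is fixed at 0.07 here although CLIP learns it during training.

```python
import math

def log_softmax(row):
    """Log-softmax of a list of floats, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    image_embs[i] and text_embs[i] form a matched pair; every other
    combination in the batch is treated as a negative. Rows are assumed
    to be unit length, so dot products are cosine similarities.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: row i should assign the most mass to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: column j should assign the most mass to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: pair 0 points along x, pair 1 points along y.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(images, matched_texts)
misaligned_loss = symmetric_contrastive_loss(images, swapped_texts)
print(aligned_loss < misaligned_loss)
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch, which is why large batches supply many negatives at no extra labeling cost.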
Best use cases
- Image-text retrieval and search at scale
- Zero-shot image classification with novel categories
- Backbone for diffusion-style image generators and any text-conditioned vision system
- Foundation for understanding modern multimodal architectures
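For the retrieval use case above, a minimal brute-force search might look like the following. The file names and 3-dimensional vectors are hypothetical placeholders for embeddings that the image and text encoders would produce, and `build_index` and `search` are illustrative names. At real scale an approximate nearest-neighbor index would typically replace the linear scan.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_index(image_embs):
    """Normalize every image embedding once, keyed by image id."""
    return {image_id: normalize(v) for image_id, v in image_embs.items()}

def search(index, query_emb, k=2):
    """Return the k (similarity, image_id) pairs most similar to the query."""
    q = normalize(query_emb)
    scored = (
        (sum(a * b for a, b in zip(q, v)), image_id)
        for image_id, v in index.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical image embeddings; in practice these come from the image encoder.
index = build_index({
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.2, 0.9],
})

# Hypothetical text-query embedding, e.g. for "a sunny beach".
results = search(index, [1.0, 0.2, 0.0], k=2)
print([image_id for _, image_id in results])
```

Because image embeddings are computed once and stored, each new text query costs one text-encoder pass plus the similarity search.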
Not ideal for
- Fine-grained visual reasoning (modern VLMs like Molmo, Pixtral, Qwen-VL outperform)
- Generation tasks — CLIP is encoder-only by design