LANGUAGE MODEL OpenAI

CLIP

Contrastive Language-Image Pretraining

A model trained on 400 million image-caption pairs from the web that learned to match images to their descriptions. The breakthrough: it could classify images it had never been trained to classify, just by being told the candidate labels in plain English ("a photo of a cat" vs "a photo of a dog").
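The label-matching step described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the 3-dimensional vectors are hypothetical stand-ins for the outputs of CLIP's image and text encoders, the function names are invented for this example, and the `logit_scale` of 100 is an assumed temperature.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return {label: probability} from cosine similarity between one image
    embedding and the embedding of each candidate label prompt."""
    image = normalize(image_embedding)
    labels = list(label_embeddings)
    logits = []
    for label in labels:
        text = normalize(label_embeddings[label])
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)  # assumed temperature scaling
    return dict(zip(labels, softmax(logits)))

# Hypothetical toy embeddings; a real system would get these from the encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Adding a new category only requires embedding one more prompt; no classifier weights are retrained, which is what makes the approach zero-shot.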

Cost
Free
Open weights — self-host

Why it matters

Before CLIP, vision and language models were mostly built and trained separately, one per modality. CLIP showed at scale that images and text could share a single representation space, an idea that underpins how multimodal assistants such as Claude, GPT-4o, and Gemini handle images today.

Core Capabilities

Vision
Understands images, scenes, and visual context.
Multimodal
Combines text and vision in one model.
Research
Foundational paper or scientific contribution.

Context Window

Not a chat-style context window: the text encoder accepts captions of up to 77 tokens.

Availability

API
Not available
Product / App
Not available
Open Source
Released
Enterprise

Pricing Model

Free / self-host
Open weights — pay only for compute.
Self-host

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Lower 20%
20th percentile: 20% of models score below this.
Upper 80%
80th percentile: only 20% of models score above this.
This model
Where the current model lands.

Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

  • Zero-shot ImageNet 76.2%: matched the original fully supervised ResNet-50 without training on any of ImageNet's 1.28 million labeled examples
  • Established the contrastive image-text pretraining recipe that many later vision-language and text-to-image systems build on
  • 400M (image, text) pairs from the open web: a widely cited demonstration of internet-scale natural-language supervision
  • Public weights and inference code: sparked OpenCLIP, LAION, and the open-source vision-language ecosystem
  • Its encoders are used for text-image alignment inside Stable Diffusion, DALL·E 2, and many other generative models
  • Robustness +5.3 pts over ResNet-50 across 27 datasets — strong out-of-distribution generalization
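
The contrastive recipe mentioned in the list above can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a simplified illustration under stated assumptions: embeddings are taken to be unit-normalized already, the function name `clip_style_loss` is invented here, and the default `logit_scale` of 1/0.07 is an assumed starting temperature rather than a trained value.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]

def clip_style_loss(image_embs, text_embs, logit_scale=1.0 / 0.07):
    """Symmetric contrastive loss for a batch of N matched pairs.

    image_embs[i] and text_embs[i] are assumed to be unit-normalized
    embeddings of the same (image, text) pair.
    """
    n = len(image_embs)
    # N x N matrix of scaled similarities: row = image, column = text.
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same objective applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Hypothetical 2-dimensional embeddings for a batch of two pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])     # correct pairing
mismatched = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # swapped captions
print(aligned < mismatched)
```

The loss is low when each image is most similar to its own caption and high when captions are swapped, which is the signal that pulls matched pairs together in the shared space.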

Best use cases

  • Image-text retrieval and search at scale
  • Zero-shot image classification with novel categories
  • Backbone for diffusion-style image generators and any text-conditioned vision system
  • Foundation for understanding modern multimodal architectures
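
For the retrieval use case above, a brute-force search over precomputed embeddings might look like the following sketch. The file names, vectors, and the `search` helper are hypothetical; in practice the index would hold image-encoder outputs and the query would come from the text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, index, top_k=2):
    """Return the top_k image ids ranked by similarity to a text query embedding."""
    scored = [(cosine(query_embedding, emb), image_id) for image_id, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:top_k]]

# Hypothetical precomputed image embeddings keyed by file name.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # stands in for the embedding of a text query

results = search(query, index)
print(results)
```

A linear scan like this is fine for small collections; at larger scale the same similarity is usually served from an approximate nearest-neighbor index.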

Not ideal for

  • Fine-grained visual reasoning (modern VLMs like Molmo, Pixtral, Qwen-VL outperform)
  • Generation tasks: CLIP has only encoders and no decoder, so it cannot produce text or images by itself

Radford, A. · Kim, J. W. · Hallacy, C. · Ramesh, A. · et al.