VISION-LANGUAGE MODEL · Jan 2021 · OpenAI

CLIP

Contrastive Language-Image Pretraining

A model trained on 400 million image-caption pairs from the web that learned to match images to their descriptions. The breakthrough: it could classify images it had never been trained to classify, just by being told the candidate labels in plain English ("a photo of a cat" vs "a photo of a dog").
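The label-matching step described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the 3-dimensional embeddings are made-up stand-ins for real encoder outputs, `zero_shot_classify` is a hypothetical helper name, and the `logit_scale` of 100 is an assumed temperature (the released models learn this value during training).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    logit_scale is an assumed temperature, not a value taken from this page.
    """
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings, invented for illustration only.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(label, [round(p, 4) for p in probs])
```

With a real model, the two embedding lists would come from the image encoder and the text encoder. Adding a new class then only requires writing a new prompt, with no retraining.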

Cost

Free

Open weights — self-host

Why it matters

Before CLIP, vision and language were usually handled by separate models, each trained for a single modality. CLIP showed at scale that images and text can share one embedding space, an idea that later multimodal systems such as Claude, GPT-4o, and Gemini build on when they accept images as input.

Core Capabilities

Vision

Understands images, scenes, and visual context.

Multimodal

Combines text and vision in one model (CLIP has no audio input).

Research

Foundational paper or scientific contribution.

Context Window

Text encoder context length: 77 tokens per caption in the released openai/CLIP models. There is no chat-style context window.

Availability

API

Not available

Product / App

Not available

Open Source

Released

Enterprise

—

Pricing Model

Free / self-host

Open weights — pay only for compute.

Self-host

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Chart legend (the plot itself is not reproduced here):
Lower 20%: 20th percentile; 20% of models score below this.
This model: where the current model lands.
Upper 80%: 80th percentile; only 20% of models score above this.
Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

Zero-shot ImageNet top-1 accuracy of 76.2%: matched the original fully supervised ResNet-50 without using any of ImageNet's 1.28 million labeled training examples
Established the contrastive image-text pretraining recipe: many later vision-language and text-to-image systems, including Stable Diffusion and DALL·E 2, build directly on CLIP encoders
400M (image, text) pairs from the open web: an early and widely cited demonstration of internet-scale weak supervision for vision
Open weights and inference code released on GitHub, though not the training data or training code; this helped spark OpenCLIP, LAION, and the open-source vision-language ecosystem
Used as the text encoder inside Stable Diffusion and as the shared embedding space in DALL·E 2; CLIP similarity is also a common guidance and evaluation signal for other generative models
Robustness +5.3 pts over ResNet-50 across 27 datasets — strong out-of-distribution generalization
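The contrastive recipe mentioned in the list above is usually described as a symmetric cross-entropy over a batch similarity matrix. The sketch below illustrates that idea under stated assumptions: embeddings are already L2-normalized, row i of each list is a matched pair, `clip_loss` is a hypothetical name, and the temperature of 0.07 is an assumed fixed value (in training it is a learned parameter).

```python
import math

def log_softmax(row):
    """Log-softmax of a list of floats, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for N matched (image, text) pairs.

    Assumes unit-length embeddings; temperature is an assumed constant here.
    """
    n = len(image_embs)
    # N x N matrix of scaled cosine similarities: rows are images, columns are texts.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_embs] for img in image_embs]
    # Image-to-text: each image should score highest on its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same objective over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: aligned embeddings versus deliberately swapped captions.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_loss(aligned, aligned)
loss_mismatched = clip_loss(aligned, swapped)
print(loss_matched < loss_mismatched)
```

The loss is near zero when each image is closest to its own caption and grows when pairs are mismatched, which is the signal that pulls matching images and texts together in the shared space.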

Reviews: OpenAI — CLIP announcement ↗ · CLIP paper (arXiv) ↗ · GitHub — openai/CLIP ↗

Best use cases

Image-text retrieval and search at scale
Zero-shot image classification with novel categories
Backbone for diffusion-style image generators and any text-conditioned vision system
Foundation for understanding modern multimodal architectures
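For the retrieval use case above, search reduces to ranking stored image embeddings by cosine similarity to a text query embedding. The following is a minimal sketch with invented data: the file names, the 3-dimensional vectors, and the `search` helper are all hypothetical, and a production system would likely use an approximate nearest-neighbor index rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_emb, index, k=2):
    """Return the k (image_id, score) pairs most similar to the query embedding."""
    scored = [(image_id, cosine(query_emb, emb)) for image_id, emb in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical precomputed image embeddings (made-up values).
index = {
    "beach.jpg": [0.9, 0.1, 0.2],
    "city.jpg": [0.1, 0.9, 0.3],
    "forest.jpg": [0.2, 0.2, 0.9],
}
# Stand-in for the text-encoder output of a query such as "a photo of a beach".
query_emb = [0.85, 0.15, 0.25]

results = search(query_emb, index)
for image_id, score in results:
    print(image_id, round(score, 3))
```

Because image embeddings can be computed once and stored, only the query text needs to be encoded at search time.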

Not ideal for

Fine-grained visual reasoning (modern VLMs like Molmo, Pixtral, and Qwen-VL outperform it)
Generation tasks: CLIP consists of two encoders and no decoder, so it outputs embeddings rather than text or images

Radford, A. · Kim, J. W. · Hallacy, C. · Ramesh, A. · et al.