LANGUAGE MODEL Cognition

Devin

First Autonomous Software Engineer Agent

Cognition AI's autonomous software engineering (SWE) agent, announced as a demo in March 2024 and made generally available in December 2024. Devin runs in a sandboxed Linux environment with a shell, a browser, and a code editor, and is designed to work through multi-hour software engineering tasks without supervision. At launch, Cognition reported that Devin resolved 13.86% of SWE-bench issues unassisted, roughly 7× the previous unassisted best of 1.96%, which kicked off the agentic-coding arms race.

Why it matters

Devin defined the autonomous-coding-agent category. Whether or not Devin itself dominates long-term, the coding-agent products that followed (Cursor Composer, Windsurf Cascade, Claude Code, Codex CLI) can all be read as responses to Devin's framing of "agent in a sandbox" as the right product shape.
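The "agent in a sandbox" shape can be illustrated with a very small loop: a planner proposes a command, the command runs inside an isolated working directory, and the result is fed back to the planner. The sketch below is not Devin's implementation, which Cognition has not published. The names `run_in_sandbox`, `scripted_planner`, and `agent_loop` are hypothetical, the planner is a fixed script standing in for a model, and a temporary directory is only a stand-in for real isolation such as a container or VM.

```python
import subprocess
import sys
import tempfile


def run_in_sandbox(argv, workdir, timeout=10):
    """Run one command with its working directory set to `workdir`.

    Assumption: a temp directory stands in for a real sandbox here.
    It does NOT provide actual isolation.
    """
    result = subprocess.run(
        argv, cwd=workdir, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout.strip()


def scripted_planner(history):
    """Hypothetical stand-in for a model: return the next command, or None when done."""
    plan = [
        [sys.executable, "-c", "open('hello.txt', 'w').write('hi')"],
        [sys.executable, "-c", "print(open('hello.txt').read())"],
    ]
    return plan[len(history)] if len(history) < len(plan) else None


def agent_loop(planner, max_steps=5):
    """Alternate between planning and acting until the planner stops or steps run out."""
    history = []
    with tempfile.TemporaryDirectory() as workdir:
        for _ in range(max_steps):
            action = planner(history)
            if action is None:
                break
            # Record the action together with its observed (exit code, stdout).
            history.append((action, run_in_sandbox(action, workdir)))
    return history


if __name__ == "__main__":
    for action, (code, output) in agent_loop(scripted_planner):
        print(code, repr(output))
```

The point of the design is that every action's observable result goes back into `history`, so the planner always decides its next step from what actually happened in the environment.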

Core Capabilities

  • Agent Workflows: Built for tool use and autonomous tasks.
  • Generative: Produces images, video, audio, or other media.
  • Coding: Strong real-world software engineering.
  • Multimodal: Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

  • API: Not available
  • Product / App: Available
  • Open Source: Not released
  • Enterprise: Contact sales

Pricing Model

Subscription
Bundled inside the host product.

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

[Chart: this model's score per capability, plotted against the lower and upper percentile boundaries]

  • Lower 20%: 20th percentile; 20% of models score below this.
  • Upper 80%: 80th percentile; only 20% of models score above this.
  • This model: where the current model lands.

Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
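The percentile boundaries described above can be sketched as follows. This is a minimal illustration, not the site's actual computation: the function names, the peer scores, and the choice of the "inclusive" interpolation method are all assumptions. Models that do not report the benchmark are represented as `None` and skipped.

```python
from statistics import quantiles


def percentile_band(scores):
    """Return the (20th, 80th) percentile cut points for 0-10 capability scores.

    Entries of None (model does not report the benchmark) are ignored.
    """
    reported = [s for s in scores if s is not None]
    # n=5 yields four cut points, at 20%, 40%, 60%, and 80%.
    cuts = quantiles(reported, n=5, method="inclusive")
    return cuts[0], cuts[-1]


def place(score, scores):
    """Say where `score` falls relative to the middle 60% of `scores`."""
    lower, upper = percentile_band(scores)
    if score < lower:
        return "below the middle 60%"
    if score > upper:
        return "above the middle 60%"
    return "inside the middle 60%"


# Hypothetical peer scores; None marks a model with no reported benchmark.
peer_scores = [2.0, None, 3.5, 4.0, 5.0, 5.5, 6.0, 7.0, 7.5, 8.0, 9.0, 9.5]

if __name__ == "__main__":
    print(percentile_band(peer_scores))
    print(place(6.5, peer_scores))
```

Filtering before computing the cut points matters: the boundaries for each capability depend only on the models that report that capability's benchmark, so the band can differ from one capability to the next.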

What it feels like

  • Language model from Cognition — see the linked sources below for benchmark and review coverage
  • Code-leaning workloads are the typical fit per the published model card
  • Tool-use and agent loops are the typical fit per the published model card

Best use cases

  • Coding workflows that match the model's published benchmarks
  • Agent / tool-use workflows that match the model's published benchmarks
  • See the model spec and sources block for benchmarked use cases

Not ideal for

  • Tasks far outside the modalities listed in this model's spec
  • Workflows where a more recent successor in the same family scores higher