LANGUAGE MODEL · Jun 2017 · Google Brain

Attention Is All You Need (Transformer)

A neural network architecture that processes all words in a sentence in parallel, rather than one at a time, using an "attention" mechanism that lets each word directly consider every other word. This made training substantially faster and also improved translation quality.
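
The idea of each word directly considering every other word can be sketched as single-head scaled dot-product self-attention. This is a minimal pure-Python illustration, not the paper's implementation: the helper names, the toy 3-token input, and the identity projection matrices are all assumptions chosen for readability, and real systems use an array library and multiple heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: n_tokens rows of d_model floats; w_q, w_k, w_v: d_model x d_k projections.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Score every query against every key: no loop over time steps is needed,
    # which is what allows all positions to be processed in parallel.
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy input (hypothetical values): 3 tokens, d_model = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors of all tokens in the sentence.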

Context

Up to 512 tokens

Why it matters

If you are investing in, competing with, regulating, or being professionally affected by any generative AI company in 2026, this 2017 paper is the technical root of what you are dealing with. Every debate about "AI safety," "AI value," or "AI moats" implicitly assumes a transformer.

Core Capabilities

Long Documents

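The attention mechanism underlies later models that handle entire codebases, books, and multi-doc RAG; this entry itself is listed at up to 512 tokens.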

Research

Foundational paper or scientific contribution.

Context Window

512 tokens (short prompt)

Scale: 4k (chat) · 32k (long docs) · 128k (books) · 400k (multi-doc) · 1M (codebase) · 10M

Availability

API: Not available

Product / App: Not available

Open Source: Not released

Enterprise: not listed

Pricing Model

Research artifact. Not commercially released.

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Context / memory (context window size, log-scaled)

This model: 0.0 · Lower 20% (20th percentile): 6.0 · Upper 80% (80th percentile): 9.0

Lower 20% is the 20th percentile: 20% of models score below it. Upper 80% is the 80th percentile: only 20% of models score above it. Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.
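
The page does not say how the log-scaled context score is computed. As a purely illustrative assumption, the sketch below maps a context window to a 0–10 score that is linear in the logarithm of the token count, anchored at 512 tokens (0.0) and 10M tokens (10.0). The function name, the anchor points, and the rounding are hypothetical and may not match the scoring actually used here.

```python
import math

def log_scaled_context_score(n_tokens, low=512, high=10_000_000):
    """Hypothetical 0-10 score: linear in log(tokens) between low and high.

    The anchors (512 and 10M tokens) are assumptions taken from the ends of
    the context-window scale, not a documented formula.
    """
    fraction = (math.log(n_tokens) - math.log(low)) / (math.log(high) - math.log(low))
    # Clamp so windows outside the assumed range stay within 0-10.
    return round(10 * min(max(fraction, 0.0), 1.0), 1)

score_512 = log_scaled_context_score(512)
score_10m = log_scaled_context_score(10_000_000)
```

Under these assumptions a 512-token window lands at the bottom of the scale, and each multiplicative increase in window size adds the same number of points.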

What it feels like

The architecture that enabled the modern AI era: nearly every major LLM since 2018 is a Transformer descendant
Replaced RNN/LSTM sequential processing with parallel self-attention; the paper's largest translation model trained in 3.5 days on eight GPUs
200K+ citations on Google Scholar — among the most cited ML papers of all time
Originally framed for machine translation; the impact spread to virtually every sequence task
Encoder-decoder design later split into encoder-only (BERT) and decoder-only (GPT) lineages
Position encodings + multi-head attention + layer norm became the default kit for sequence modelling
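
Two of the pieces named above can be sketched briefly. The first function follows the fixed sinusoidal position encoding described in the paper (sine on even dimensions, cosine on odd ones, with a base of 10000). The second builds the causal mask that decoder-style self-attention uses, which encoder-only models do not apply because they attend in both directions. The function names and toy sizes are assumptions made for illustration.

```python
import math

def sinusoidal_position_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(n_positions):
        row = []
        for dim in range(d_model):
            pair = dim // 2  # each sin/cos pair shares one frequency
            angle = pos / (10000 ** (2 * pair / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def causal_mask(n_tokens):
    """True where attention is allowed: position i may see positions 0..i only."""
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]

# Toy sizes (hypothetical): 4 positions with 6-dimensional encodings, 3-token mask.
pe = sinusoidal_position_encoding(n_positions=4, d_model=6)
mask = causal_mask(3)
```

The encoding is added to the token embeddings so that attention, which is otherwise order-agnostic, can distinguish positions. Masked-out entries are excluded from attention before the softmax is taken.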

Reviews: Attention Is All You Need (arXiv) · Google AI Blog: Transformer announcement · Annotated Transformer (Harvard NLP)

Best use cases

Foundational paper to read before any other ML architecture work
Citation in any work involving sequence-to-sequence modelling, language, or attention
Teaching introductory NLP / deep learning courses
Understanding why every frontier model in this tree exists

Not ideal for

Use as a deployable model — this is an architecture paper, not a model checkpoint
Cost-sensitive long-context inference without the post-2023 efficiency improvements
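
The long-context cost comes from dense self-attention comparing every token with every other token, so the score matrix grows with the square of the sequence length. The arithmetic below is a rough sketch that counts the entries of one head's score matrix in one layer. The function name is hypothetical, and the longer lengths assume that "4k", "32k", and "128k" mean 4,096, 32,768, and 131,072 tokens.

```python
def attention_score_entries(n_tokens):
    """Entries in one head's n x n attention score matrix for one layer."""
    return n_tokens * n_tokens

baseline = attention_score_entries(512)
for n in (512, 4_096, 32_768, 131_072):
    entries = attention_score_entries(n)
    ratio = entries // baseline
    print(f"{n:>7} tokens: {entries:>14,} score entries ({ratio:,}x the 512-token case)")
```

An 8x longer input therefore needs 64x as many score entries per head and layer.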

Vaswani, A. · Shazeer, N. · Parmar, N. · Uszkoreit, J. · et al.