LANGUAGE MODEL · Google Brain

Attention Is All You Need (Transformer)

A neural network architecture that processes all the words in a sentence in parallel, rather than one at a time, using an "attention" mechanism that lets each word directly weigh every other word. This made training substantially faster and also improved translation quality.
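The idea of each word weighing every other word can be sketched as scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. This is a minimal, dependency-free illustration rather than the paper's implementation: the function names and toy vectors are assumptions, and the token vectors are used directly as queries, keys, and values, whereas a real Transformer first applies learned linear projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of vectors, one per token.
    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    # Each query row is independent of the others, which is what lets
    # real implementations compute all rows at once as a matrix product.
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[d] for wj, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input (assumption): three tokens with 2-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of all the value vectors in the sentence.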

Context
Up to 512 tokens

Why it matters

If you are investing in, competing with, regulating, or being professionally affected by any generative AI company in 2026, this 2017 paper is the technical root of what you are dealing with. Every debate about "AI safety," "AI value," or "AI moats" implicitly assumes a transformer.

Core Capabilities

Long Documents
Handles entire codebases, books, and multi-doc RAG.
Research
Foundational paper or scientific contribution.

Context Window

512 tokens (short prompt)
Scale: 4k chat · 32k long docs · 128k books · 400k multi-doc · 1M codebase · 10M

Availability

API
Not available
Product / App
Not available
Open Source
Not released
Enterprise

Pricing Model

Research artifact
Not commercially released.
Research

Capability / Performance

Where this model sits relative to the middle 60% of models in the tree. All scores are 0–10 (higher is better).

Context / memory (context window size · log-scaled)
Lower 20%: 6.0 · Upper 80%: 9.0 · This model: 0.0

Lower 20%: the 20th percentile; 20% of models score below this.
This model: where the current model lands.
Upper 80%: the 80th percentile; only 20% of models score above this.
Percentile boundaries are computed across every model in the tree that reports the underlying benchmark for each capability.

What it feels like

  • The architecture that enabled the modern AI era — every major LLM since 2018 is a Transformer descendant
  • Replaced RNN/LSTM sequential processing with parallel self-attention; the paper's largest model trained in 3.5 days on eight P100 GPUs
  • 200K+ citations on Google Scholar — among the most cited ML papers of all time
  • Originally framed for machine translation; the impact spread to virtually every sequence task
  • Encoder-decoder design later split into encoder-only (BERT) and decoder-only (GPT) lineages
  • Position encodings + multi-head attention + layer norm became the default kit for sequence modelling
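Because self-attention on its own ignores word order, the position encodings mentioned in the last bullet supply it. The fixed sinusoidal scheme described in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched as follows. The function name and the toy sizes (512 positions to mirror the context figure on this card, width 8) are assumptions for illustration.

```python
import math

def sinusoidal_position_encoding(num_positions, d_model):
    """Build a num_positions x d_model table of sinusoidal encodings.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following
    odd column holds the matching cosine, where i is the even column index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos columns")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

# Toy sizes (assumption): 512 positions, model width 8.
pe = sinusoidal_position_encoding(512, 8)
```

In the paper's design these vectors are added to the token embeddings before the first attention layer, so two identical words at different positions produce different inputs.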

Best use cases

  • Foundational paper to read before any other ML architecture work
  • Citation in any work involving sequence-to-sequence modelling, language, or attention
  • Teaching introductory NLP / deep learning courses
  • Understanding why every frontier model in this tree exists

Not ideal for

  • Use as a deployable model — this is an architecture paper, not a model checkpoint
  • Cost-sensitive long-context inference without the post-2023 efficiency improvements
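The long-context caveat follows from how full self-attention scales: every token scores every other token, so the score matrix grows with the square of the context length. The rough sketch below counts score entries for a single head in a single layer. The function name and the power-of-two sizes chosen for "4k", "32k", and so on are illustrative assumptions, and real cost also depends on model width, depth, and implementation.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Entries in one full attention score matrix (one head, one layer)."""
    return context_tokens * context_tokens

# Context sizes; mapping labels such as "4k" to 4096 is an assumption.
CONTEXT_SIZES = {
    "512": 512,
    "4k": 4_096,
    "32k": 32_768,
    "128k": 131_072,
    "1M": 1_048_576,
}

baseline = attention_score_entries(512)
for label, n in CONTEXT_SIZES.items():
    entries = attention_score_entries(n)
    ratio = entries // baseline
    print(f"{label:>4}: {entries:,} score entries, {ratio:,}x the 512-token count")
```

Growing the context 8x, from 512 to 4,096 tokens, multiplies the number of score entries by 64, which is why unmodified full attention becomes expensive at long context lengths.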

Vaswani, A. · Shazeer, N. · Parmar, N. · Uszkoreit, J. · et al.