Transformers
The architecture that powers nearly every modern foundation model, and why attention changed everything
The transformer architecture, introduced in "Attention Is All You Need" (2017), is the substrate of nearly every frontier model in use today. The core idea is deceptively simple: replace recurrence with attention, and let every token in a sequence look at every other token directly.
Self-Attention
Self-attention is the operation that defines the architecture. For each token, it computes:
- A query vector — what this token is looking for.
- A key vector — what this token offers.
- A value vector — what this token contributes if attended to.
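Attention scores are dot products of queries and keys, scaled by 1/sqrt(d_k) (where d_k is the key dimension) and softmaxed to get weights, which are then used to take a weighted sum of values. The result: every token's new representation is a learned mixture of the value vectors of all tokens in the sequence, its own included. In decoder-only language models, a causal mask restricts that mixture to the current and earlier tokens.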
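The query/key/value computation described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration rather than a production implementation: the names `self_attention`, `W_q`, `W_k` and `W_v` are assumptions made for the example, the weights are random instead of learned, and the division by sqrt(d_k) follows the scaling used in the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X of shape (n_tokens, d_model)."""
    Q = X @ W_q                      # what each token is looking for
    K = X @ W_k                      # what each token offers
    V = X @ W_v                      # what each token contributes
    d_k = Q.shape[-1]
    # Scaled dot products, as in the original paper.
    scores = Q @ K.T / np.sqrt(d_k)  # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Illustrative sizes and random (untrained) projections.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over the sequence, so each row of `out` is a convex combination of the value vectors.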
Multi-Head Attention
Instead of one attention operation, transformers run several in parallel ("heads"), each with its own learned projections; the head outputs are concatenated and passed through a final linear projection. Different heads tend to specialize: some attend to syntactic neighbors, some to long-range dependencies, some to specific tokens.
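As a rough sketch of how the heads run in parallel, the example below splits the model dimension into `n_heads` equal slices, attends within each slice independently, then concatenates the results and applies an output projection. The function name `multi_head_attention`, the output matrix `W_o`, and the choice of equal-sized heads are assumptions for illustration; the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n_tokens, d_model). Returns the output and per-head weights."""
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # One (n, n) score matrix per head, computed in a single batched matmul.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)           # (n_heads, n, n)
    heads = weights @ V                          # (n_heads, n, d_head)
    # Concatenate the heads back into (n, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Because each head has its own slice of the projections, `weights[h]` can differ freely from head to head, which is what leaves room for specialization.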
The Transformer Block
A standard block is:
- Multi-head self-attention
- Residual connection + LayerNorm
- Feed-forward network (typically a 4x expansion)
- Residual connection + LayerNorm
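Stack dozens to hundreds of these and you have a modern LLM. The ordering above is the original post-norm layout; most current LLMs instead apply normalization before each sublayer (pre-norm), which tends to train more stably at depth.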
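The four steps in the list above can be sketched as a single function, with normalization applied after each residual addition as listed. This is an illustrative sketch under several assumptions: the parameter names (`W_q`, `W_1`, `b_1`, and so on) and the helper `init_params` are hypothetical, the layer norm omits its learned gain and bias, the feed-forward activation is ReLU as in the original paper, and all weights are random.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplification: no learned gain/bias parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(X, p, n_heads):
    n, d = X.shape
    d_head = d // n_heads

    def split(M):
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ p["W_q"]), split(X @ p["W_k"]), split(X @ p["W_v"])
    w = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    return (w @ V).transpose(1, 0, 2).reshape(n, d) @ p["W_o"]

def feed_forward(X, p):
    # Expand to 4 * d_model, apply ReLU, project back to d_model.
    hidden = np.maximum(0.0, X @ p["W_1"] + p["b_1"])
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(X, p, n_heads):
    X = layer_norm(X + attention(X, p, n_heads))   # attention + residual + norm
    X = layer_norm(X + feed_forward(X, p))         # FFN + residual + norm
    return X

def init_params(rng, d_model):
    # Hypothetical helper: random weights scaled by 1/sqrt(d_model).
    s = 1.0 / np.sqrt(d_model)
    p = {k: rng.normal(scale=s, size=(d_model, d_model))
         for k in ("W_q", "W_k", "W_v", "W_o")}
    p["W_1"] = rng.normal(scale=s, size=(d_model, 4 * d_model))
    p["b_1"] = np.zeros(4 * d_model)
    p["W_2"] = rng.normal(scale=s, size=(4 * d_model, d_model))
    p["b_2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
d_model, n_heads, n_layers = 16, 4, 3
X = rng.normal(size=(5, d_model))
layers = [init_params(rng, d_model) for _ in range(n_layers)]

Y = X
for p in layers:          # stacking blocks is just repeated application
    Y = transformer_block(Y, p, n_heads)
```

The block maps a `(n_tokens, d_model)` array to another array of the same shape, which is what makes stacking trivial.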
Positional Information
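Self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so the operation has no idea what order tokens come in. Position is injected explicitly, originally as sinusoidal embeddings added to the token embeddings, today most commonly as rotary position embeddings (RoPE), which encode position via rotations in query/key space, make attention scores depend on relative offsets, and tend to generalize better to longer contexts.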
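To make the rotation idea concrete, here is a small sketch of rotary embeddings applied to a query and a key. It assumes the interleaved-pair layout (features 0 and 1 form one rotated pair, 2 and 3 the next) and a frequency base of 10000; real implementations differ in how they pair features, and the function name `rope` and helper `score` are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x: (n, d) with d even; positions: (n,) token positions.
    """
    n, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = positions[:, None] * freqs[None, :]    # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to each (even, odd) feature pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(pos_q, pos_k):
    # Attention logit between a query at pos_q and a key at pos_k.
    q_rot = rope(q, np.array([pos_q]))
    k_rot = rope(k, np.array([pos_k]))
    return (q_rot @ k_rot.T).item()

score_a = score(3, 1)     # offset of 2
score_b = score(10, 8)    # same offset, shifted by 7 positions
```

The two scores agree up to floating-point error because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n.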
Why Transformers Won
- Parallelism — unlike RNNs, the entire sequence is processed at once during training.
- Long-range context — direct token-to-token connections, with no fixed-size hidden state acting as an information bottleneck.
- Scale — the architecture continues to improve with more parameters, data, and compute, without obvious saturation.
What to Watch For
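Attention is O(n²) in sequence length, in compute and, if the score matrix is materialized, in memory. The frontier of efficient transformers is about cutting that cost while preserving quality: sliding window and sparse attention limit which token pairs are scored, state-space hybrids replace some attention layers with linear-time sequence mixers, and grouped-query attention shrinks the key/value cache by sharing key/value heads across several query heads.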
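As one illustration of how the quadratic cost can be cut, the sketch below builds a causal sliding-window mask in which each token attends only to itself and the `window - 1` tokens before it. The names `sliding_window_mask` and `masked_softmax` are assumptions for this example. Note that it still builds a dense n x n matrix for clarity, so it shows which pairs are scored rather than delivering the saving; an efficient implementation would skip the masked pairs entirely.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def masked_softmax(scores, mask):
    # Masked-out pairs get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n, window = 8, 3
mask = sliding_window_mask(n, window)

rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=(n, n)), mask)

full_pairs = n * n                 # pairs scored by full attention
window_pairs = int(mask.sum())     # pairs scored with the window
```

With a fixed window, the number of scored pairs grows roughly as n times the window size rather than n², at the price of losing direct connections between distant tokens within a single layer.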