Transformers

The architecture that powers every modern foundation model, and why attention changed everything

The transformer architecture, introduced in "Attention Is All You Need" (2017), is the substrate of nearly every frontier model in use today. The core idea is deceptively simple: replace recurrence with attention, and let every token in a sequence look at every other token directly.

Self-Attention

Self-attention is the operation that defines the architecture. For each token, it computes:

A query vector — what this token is looking for.
A key vector — what this token offers.
A value vector — what this token contributes if attended to.

Attention scores are dot products of queries and keys, softmaxed to get weights, then used to take a weighted sum of values. The result: every token's new representation is a learned mixture of every other token's value.

Multi-Head Attention

Instead of one attention operation, transformers run several in parallel ("heads"), each with its own learned projections. Different heads end up specializing — some attend to syntactic neighbors, some to long-range dependencies, some to specific tokens.

The Transformer Block

A standard block is:

Multi-head self-attention
Residual connection + LayerNorm
Feed-forward network (typically a 4x expansion)
Residual connection + LayerNorm

Stack dozens to hundreds of these and you have a modern LLM.

Positional Information

Self-attention is permutation-invariant on its own — it has no idea what order tokens come in. Position is injected explicitly, originally as sinusoidal embeddings, today most commonly as rotary position embeddings (RoPE), which encode position via rotations in query/key space and generalize better to longer contexts.

Why Transformers Won

Parallelism — unlike RNNs, the entire sequence is processed at once during training.
Long-range context — direct token-to-token connections, no information bottleneck.
Scale — the architecture continues to improve with more parameters, data, and compute, without obvious saturation.

What to Watch For

Attention is O(n²) in sequence length. The frontier of efficient transformers — sliding window attention, grouped-query attention, state-space hybrids, sparse attention — is all about cutting that cost while preserving quality.