Understanding transformer variants
The original split
The original 2017 transformer ("Attention Is All You Need") has two halves: an encoder and a decoder. What's fascinating is that the two most influential model families each threw away one half.
- BERT (2018): Encoder only. Bidirectional — each token attends to all other tokens. Great for understanding (classification, NER, Q&A). Bad for generation.
- GPT (2018-present): Decoder only. Autoregressive — each token attends only to previous tokens. Great for generation. Understanding emerges as a byproduct of next-token prediction.
- T5 (2019): Kept both. Frames every task as text-to-text (input goes through encoder, output is generated by decoder).
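The encoder/decoder distinction largely comes down to the mask applied to attention scores. A minimal NumPy sketch (shapes and names are illustrative, not from any particular library):

```python
import numpy as np

seq_len = 4

# Encoder-style (BERT): every token may attend to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (GPT): token i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked positions get -inf before the softmax, so they receive zero weight.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -np.inf)
weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=-1, keepdims=True)
```

Row i of `causal_mask` has i+1 ones: each token sees itself and everything before it, which is what makes left-to-right generation possible.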
BERT
Bidirectional Encoder Representations from Transformers. The training objective is Masked Language Modeling (MLM): randomly mask 15% of the input tokens and predict them from the surrounding context. A secondary objective, Next Sentence Prediction (NSP), turned out to be mostly useless and was dropped in later variants such as RoBERTa.
Key insight: by masking tokens and predicting them from both left and right context, BERT learns representations that encode meaning in a way that left-to-right models can't. The word "bank" gets different representations depending on whether the context is "river bank" or "bank account". Note that in "bank account" the disambiguating word comes after "bank", so a left-to-right model cannot see it when encoding "bank".
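The corruption step itself is simple. A hedged sketch of BERT-style masking (the 80/10/10 split — replace with [MASK], replace with a random token, keep unchanged — is from the original paper; the toy tokens and vocabulary here are made up):

```python
import random

def mlm_corrupt(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted tokens, prediction targets) for masked LM training."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok               # model must predict the original
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_token  # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: leave the token unchanged
    return corrupted, targets

tokens = "the river bank was steep and muddy".split()
corrupted, targets = mlm_corrupt(tokens, vocab=tokens)
```

The loss is computed only on the positions in `targets`; the other 85% of tokens serve purely as context.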
BERT-base: 110M parameters, 12 layers, 768 hidden dimensions, 12 attention heads.
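Those hyperparameters roughly account for the 110M figure. A back-of-the-envelope check (the vocabulary size of 30,522 and the [CLS] pooler layer are from the original BERT release; the bookkeeping of biases and LayerNorms is approximate):

```python
V, H, L, F, P = 30522, 768, 12, 3072, 512  # vocab, hidden, layers, FFN, max positions

embeddings = V * H + P * H + 2 * H + 2 * H      # token + position + segment + LayerNorm
attention  = 4 * (H * H + H)                    # Q, K, V, output projections (+ biases)
ffn        = H * F + F + F * H + H              # two dense layers (+ biases)
layer      = attention + ffn + 2 * (2 * H)      # plus two LayerNorms
pooler     = H * H + H                          # [CLS] pooler head
total = embeddings + L * layer + pooler
print(f"{total / 1e6:.1f}M")  # ≈ 109.5M, matching the quoted ~110M
```

Note that the embedding table alone is roughly a quarter of the model.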
GPT lineage
GPT-1 (2018): 117M parameters. Showed that unsupervised pre-training on text followed by supervised fine-tuning works surprisingly well.
GPT-2 (2019): 1.5B parameters. The “too dangerous to release” model (they released it eventually). Demonstrated that scaling up produces qualitative improvements in generation quality.
GPT-3 (2020): 175B parameters. In-context learning — the model can perform tasks from a few examples in the prompt, without gradient updates. This was the “wait, what?” moment.
GPT-4 (2023): Architecture details not published. Rumored to be a mixture-of-experts model. Multimodal (text + images).
The trend: same architecture, more parameters, more data, emergent capabilities. Whether this continues to scale or hits diminishing returns is the central debate in the field.
Architectural improvements since 2017
- Rotary Position Embeddings (RoPE): Replace absolute sinusoidal encodings with rotation-based relative encodings. Better length generalization. Used in LLaMA, Mistral.
- Grouped Query Attention (GQA): Share key-value heads across multiple query heads. Reduces memory and compute with minimal quality loss. Used in LLaMA 2.
- Flash Attention: Not an architectural change but an implementation one. Reorders the attention computation to be memory-efficient (avoids materializing the full attention matrix). 2-4x speedup.
- Mixture of Experts (MoE): Replace the dense feedforward layers with a set of “expert” sub-networks, only activating a subset per token. More total parameters but same compute per token. Used in Mixtral, possibly GPT-4.
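RoPE's defining property — attention scores depend only on relative offsets — can be verified in a few lines. A sketch assuming the standard formulation (consecutive dimension pairs rotated by angles m·θ_i with θ_i = 10000^(−2i/d)); nothing here is tied to a specific library:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The query-key dot product depends only on the relative offset, not on
# absolute positions -- the same offset 3 gives the same score:
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 103) @ rope(k, 100)
print(np.isclose(s1, s2))  # True
```

Because each pair is rotated, not scaled, the transformation also preserves vector norms, so it composes cleanly with softmax attention.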
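Top-k routing, the core of MoE, also fits in a few lines. A sketch with made-up sizes (2 of 8 experts per token, as in Mixtral); the gate here is a plain softmax over the selected logits, ignoring the load-balancing losses real systems add:

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route token vector x to its top-k experts; combine by gate weight."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]                            # k best experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized softmax
    # Only k expert FFNs run: compute per token stays flat as experts grow.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
# Each "expert" is a tiny feedforward net (weights made up for the sketch).
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: np.tanh(W @ x) for W in weights]

y = moe_layer(rng.normal(size=d), gate_w, experts)
```

This is how a model can have 8x the feedforward parameters while each token only pays for 2 experts' worth of compute.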
Open questions
- Is the transformer the final architecture? State-space models (Mamba) show that you can match transformer quality on some tasks with linear (not quadratic) scaling in sequence length. But transformers have massive infrastructure momentum.
- What’s the right training objective? MLM, next-token prediction, denoising, contrastive learning — each produces different representations. No consensus on which is “best.”
- How far does scaling go? The scaling laws (Kaplan et al., 2020) suggest predictable improvement with more compute. But are we learning or memorizing? Does generalization scale the same way?
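The parameter scaling law is concrete enough to sketch: Kaplan et al. fit loss as a power law in non-embedding parameter count, L(N) = (N_c / N)^α_N, with reported constants roughly N_c ≈ 8.8e13 and α_N ≈ 0.076 (these values depend on their dataset and tokenization, so treat them as illustrative):

```python
# Predicted cross-entropy loss as a function of parameter count, using the
# power-law form and constants reported by Kaplan et al. (2020).
N_C = 8.8e13     # fitted constant (in parameters)
ALPHA_N = 0.076  # fitted exponent

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in [117e6, 1.5e9, 175e9]:  # GPT-1, GPT-2, GPT-3 sizes
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The small exponent is the whole story: each constant-factor improvement in loss costs an order of magnitude more parameters, which is why the memorization-vs-generalization question matters so much.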