Understanding transformer variants

The original split

The 2017 transformer has two halves: encoder and decoder. What’s fascinating is that the two most influential model families each threw away one half.

  • BERT (2018): Encoder only. Bidirectional — each token attends to all other tokens. Great for understanding (classification, NER, Q&A). Bad for generation.
  • GPT (2018-present): Decoder only. Autoregressive — each token attends only to previous tokens. Great for generation. Understanding emerges as a byproduct of next-token prediction.
  • T5 (2019): Kept both. Frames every task as text-to-text (input goes through encoder, output is generated by decoder).
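The split comes down to the attention mask. A minimal NumPy sketch (a toy illustration, not any particular model's code):

```python
import numpy as np

seq_len = 5

# Encoder-style (BERT): every token attends to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (GPT): token i attends only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# In attention, disallowed positions get -inf before the softmax,
# so they receive exactly zero weight.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -np.inf)
```

The causal mask is what makes autoregressive generation possible: at inference time, position i's output never depended on anything after i, so tokens can be produced one at a time.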

BERT

Bidirectional Encoder Representations from Transformers. The training objective is Masked Language Modeling (MLM): randomly select 15% of tokens, corrupt them (mostly by replacing them with a [MASK] placeholder), and predict the originals from context. Plus a secondary objective, Next Sentence Prediction (NSP), which turned out to be mostly useless and was dropped in later variants (RoBERTa).
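A rough sketch of the MLM corruption step, using BERT's 80/10/10 recipe (of the selected tokens, 80% become [MASK], 10% a random token, 10% stay unchanged). Simplified: real BERT operates on WordPiece subwords, not whole words.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions; corrupt them with BERT's
    80/10/10 split and record the originals as prediction targets."""
    rng = random.Random(seed)
    out = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            roll = rng.random()
            if roll < 0.8:
                out[i] = MASK                 # 80%: mask
            elif roll < 0.9:
                out[i] = rng.choice(tokens)   # 10%: random replacement
            # else: 10%: keep the original token
    return out, targets
```

The 10% "keep" case exists so the model can't assume every non-[MASK] token is trustworthy, which keeps its representations honest at fine-tuning time, when no [MASK] tokens appear.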

Key insight: by masking tokens and predicting them from both left and right context, BERT learns representations conditioned on the whole sentence, whereas a left-to-right model only ever sees the prefix. The word “bank” gets different representations depending on whether the context is “river bank” or “bank account.”

BERT-base: 110M parameters, 12 layers, 768 hidden dimensions, 12 attention heads.
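Those numbers roughly reconcile. A back-of-the-envelope count (ignoring biases and layer norms; 30,522 is BERT-base's WordPiece vocabulary size):

```python
vocab, hidden, layers, ffn = 30_522, 768, 12, 3_072

embeddings = vocab * hidden + 512 * hidden + 2 * hidden  # token + position + segment
per_layer = 4 * hidden * hidden        # Q, K, V, and output projections
per_layer += 2 * hidden * ffn          # feedforward up- and down-projections
total = embeddings + layers * per_layer

print(f"{total / 1e6:.0f}M")  # ≈ 109M; biases and layer norms make up the rest
```

Note that the embedding table alone is ~23M parameters, over a fifth of the model.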

GPT lineage

GPT-1 (2018): 117M parameters. Showed that unsupervised pre-training on text followed by supervised fine-tuning works surprisingly well.

GPT-2 (2019): 1.5B parameters. The “too dangerous to release” model (they released it eventually). Demonstrated that scaling up produces qualitative improvements in generation quality.

GPT-3 (2020): 175B parameters. In-context learning — the model can perform tasks from a few examples in the prompt, without gradient updates. This was the “wait, what?” moment.

GPT-4 (2023): Architecture details not published. Rumored to be a mixture-of-experts model. Multimodal (text + images).

The trend: same basic architecture, more parameters, more data, emergent capabilities. Whether scaling keeps delivering these gains or hits diminishing returns is the central debate in the field.

Architectural improvements since 2017

  • Rotary Position Embeddings (RoPE): Replace absolute sinusoidal encodings with rotation-based relative encodings. Better length generalization. Used in LLaMA, Mistral.
  • Grouped Query Attention (GQA): Share a small number of key-value heads across multiple query heads. Shrinks the KV cache, cutting inference memory and bandwidth with minimal quality loss. Used in LLaMA 2.
  • Flash Attention: Not an architectural change but an implementation one. Reorders the attention computation to be memory-efficient (avoids materializing the full attention matrix). 2-4x speedup.
  • Mixture of Experts (MoE): Replace the dense feedforward layers with a set of “expert” sub-networks, only activating a subset per token. More total parameters but same compute per token. Used in Mixtral, possibly GPT-4.
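Of these, GQA is the easiest to show concretely. A toy NumPy sketch (random values, shapes only): 8 query heads share 2 key-value heads, so each KV head serves a group of 4 query heads and the KV cache shrinks 4x.

```python
import numpy as np

seq, head_dim = 6, 16
n_q_heads, n_kv_heads = 8, 2       # standard MHA would use 8 KV heads; GQA keeps 2
group = n_q_heads // n_kv_heads    # 4 query heads per KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)   # only these get cached
v = np.random.randn(n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads.
k_expanded = np.repeat(k, group, axis=0)   # (8, seq, head_dim)
v_expanded = np.repeat(v, group, axis=0)

# Scaled dot-product attention (no causal mask, for brevity).
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
out = weights @ v_expanded                       # (8, seq, head_dim)
```

Only k and v need to live in the KV cache; the expansion is a cheap view at compute time. Multi-query attention (MQA) is the n_kv_heads = 1 extreme.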

Open questions

  1. Is the transformer the final architecture? State-space models (Mamba) show that you can match transformer quality on some tasks with linear (not quadratic) scaling in sequence length. But transformers have massive infrastructure momentum.
  2. What’s the right training objective? MLM, next-token prediction, denoising, contrastive learning — each produces different representations. No consensus on which is “best.”
  3. How far does scaling go? The scaling laws (Kaplan et al., 2020) suggest predictable improvement with more compute. But are we learning or memorizing? Does generalization scale the same way?
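On question 3, the power-law form makes "predictable improvement" concrete. A sketch using the approximate parameter-scaling fit from Kaplan et al. (L(N) ≈ (N_c/N)^αN with αN ≈ 0.076, N_c ≈ 8.8e13; the constants are dataset-dependent, so treat them as illustrative):

```python
ALPHA_N = 0.076        # Kaplan et al. (2020) fit, approximate
N_C = 8.8e13           # critical parameter count from the same fit

def loss(n_params):
    """Predicted test loss (nats/token) as a function of model size,
    assuming data and compute are not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

# Doubling the parameter count cuts loss by only ~5%:
ratio = loss(2e9) / loss(1e9)
```

The exponent is tiny, which is the whole story: each constant-factor improvement in loss demands a multiplicative increase in scale. Whether the qualitative capabilities track this smooth curve is exactly what the "emergence" debate is about.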