TLDR:
Reinforcement Learning from Human Feedback (RLHF) is the technique used to align LLMs with human preferences after pre-training. RLHF transformed GPT-3 into ChatGPT and remains a foundational alignment technique, though newer approaches (DPO, RLAIF, Constitutional AI) increasingly replace classical RLHF.
The RLHF Pipeline
Classical RLHF has three stages. First, a pre-trained LLM is fine-tuned on demonstrations of desired behavior (supervised fine-tuning). Second, human annotators rank multiple model outputs for the same prompt by quality, and these rankings train a separate “reward model” that predicts human preferences. Third, the LLM is further trained using reinforcement learning—typically PPO (Proximal Policy Optimization)—with the reward model providing the reward signal, producing outputs humans rate as better.
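To make stages two and three concrete, here is a minimal sketch, assuming PyTorch; the function and variable names are illustrative rather than taken from any particular library. The pairwise loss is the standard Bradley-Terry reward-model objective, and the KL-penalized reward is a simplified per-sequence stand-in for what full PPO implementations compute per token (with clipping, advantages, and a value baseline).

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Stage two: Bradley-Terry pairwise loss that pushes the scalar reward
    of the human-preferred response above that of the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def rl_reward(rm_score: torch.Tensor,
              logprob_policy: torch.Tensor,
              logprob_sft: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Stage three: the signal PPO maximizes is the reward-model score minus
    a KL-style penalty that keeps the policy close to the SFT reference."""
    return rm_score - kl_coef * (logprob_policy - logprob_sft)
```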
Why RLHF Matters
LLMs pre-trained on raw web text are not aligned with what users want—they may produce verbose, off-topic, harmful, or unhelpful responses. RLHF teaches models to follow instructions, be helpful and harmless, refuse harmful requests, and produce outputs in preferred styles. Without alignment training, the dramatic capabilities of modern foundation models would not translate into useful products.
Limitations and Alternatives
RLHF has well-known limitations: it requires extensive human labeling, can incentivize sycophancy or surface-level pleasing behavior, and may not generalize to novel scenarios. Newer methods include Direct Preference Optimization (DPO, simpler and more stable than PPO-based RLHF; sketched below), RLAIF (Reinforcement Learning from AI Feedback, which uses AI critics to scale preference labeling), and Constitutional AI (which trains against explicit written principles rather than only learned preferences). Most modern frontier models combine multiple alignment techniques.
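For comparison, a minimal sketch of the DPO objective under the same assumptions (PyTorch; illustrative names): the inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (typically the SFT model), so no separate reward model or RL loop is needed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: a classification-style loss on log-probability margins relative
    to a frozen reference model, optimizing preferences directly."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```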