What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is a training technique that aligns LLMs with human preferences, making them more helpful, less harmful and better at following instructions than raw next-token prediction alone would. RLHF was the breakthrough behind ChatGPT (2022) and remains a dominant alignment method for production LLMs, though direct preference optimisation (DPO) and reinforcement learning from AI feedback (RLAIF) are increasingly used as alternatives.
How RLHF works
- Supervised fine-tuning (SFT): the base model is fine-tuned on human-written example responses.
- Reward model training: humans rank multiple model outputs; a separate reward model learns to predict the human ranking.
- Reinforcement learning: the LLM is further trained using PPO (Proximal Policy Optimisation) or similar, optimised to maximise the reward model’s score.
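The reward-model and reinforcement-learning steps above can be made concrete with a small numerical sketch. It assumes a pairwise (Bradley-Terry style) loss for the reward model and a KL-penalised reward for the RL step. Both are common choices in published RLHF pipelines, but the source does not say which variant is used. The function names and the `beta` coefficient are illustrative, not taken from any particular library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(
    reward_model_score: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # assumed penalty weight, tuned per project
) -> float:
    """Reward fed to the RL optimiser (e.g. PPO) for one sampled response.

    The reward model's score is reduced by a penalty proportional to how far
    the policy's log-probability has drifted from the reference (SFT) model.
    """
    kl_estimate = logprob_policy - logprob_reference
    return reward_model_score - beta * kl_estimate
```

In a real pipeline these scalars would be tensors produced by neural networks and the loss would be averaged over a batch of comparisons; the arithmetic is the same.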
What RLHF achieves
- Instruction-following: models produce responses to user queries rather than just completing text.
- Tone alignment: models adopt the polite, helpful conversational style users expect.
- Refusal patterns: models decline harmful or out-of-scope requests.
- Bias mitigation: systematic harmful outputs are reduced (though not eliminated).
RLHF limitations and risks
- Sycophancy: models can learn to flatter or agree with users rather than be accurate.
- Reward hacking: the policy learns to exploit flaws in the reward model rather than becoming genuinely helpful.
- Annotator bias: the values of the (often small) team that ranks outputs propagate to the model.
- Cost and scale: human annotation is expensive; RLAIF and synthetic feedback partially address this.
Turkish and regulatory context
The EU AI Act imposes training-data governance and human oversight requirements on high-risk AI systems; for models trained with RLHF, annotator composition and process documentation become part of the compliance file. The ISO/IEC 42001 AI management system standard likewise calls for RLHF processes to be documented.
Do: document RLHF training data sources and annotator demographics; audit for bias propagation regularly.
Don’t: assume RLHF eliminates all unsafe outputs — adversarial prompts and edge cases regularly bypass alignment training.
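The documentation and audit advice above could be supported by a record format and a simple divergence check along the following lines. This is a minimal sketch, not a compliance tool. The `PreferenceRecord` fields, the group labels and the `threshold` value are all hypothetical. A real audit would need statistically sound sample sizes and privacy-aware handling of demographic data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceRecord:
    """One annotator judgement on one A/B comparison (hypothetical schema)."""
    comparison_id: str
    annotator_group: str  # assumed self-reported demographic bucket
    data_source: str      # where the prompt came from, kept for documentation
    choice: str           # "a" or "b"


def group_choice_rates(records):
    """Map comparison_id -> {group: share of that group choosing response 'a'}."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for record in records:
        cell = tallies[record.comparison_id][record.annotator_group]
        cell[0] += int(record.choice == "a")  # votes for 'a'
        cell[1] += 1                          # total votes from this group
    return {
        cid: {group: a_votes / total for group, (a_votes, total) in groups.items()}
        for cid, groups in tallies.items()
    }


def flag_divergent(records, threshold=0.5):
    """Return comparison ids where groups' preference rates differ by >= threshold."""
    flagged = []
    for cid, rates in group_choice_rates(records).items():
        if max(rates.values()) - min(rates.values()) >= threshold:
            flagged.append(cid)
    return sorted(flagged)
```

Comparisons flagged this way are candidates for manual review: large gaps between groups suggest the "preferred" label depends on who was asked, which is how annotator bias can reach the reward model.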