Reinforcement Learning from Human Feedback (RLHF)
A training technique where human raters rank model outputs and those preferences train a reward model that guides further LLM fine-tuning.
Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed raw LLMs into the helpful, harmless, and honest assistants we use today. Without RLHF, a language model simply predicts the most likely next token given its training data, which includes harmful, biased, or off-topic content. RLHF steers the model toward outputs humans prefer.
**The three-phase process**
1. *Supervised fine-tuning (SFT)*: Human annotators write ideal responses to sample prompts. The base LLM is fine-tuned on these examples to establish a baseline helpful model.
2. *Reward model training*: The SFT model generates multiple responses to each prompt. Human raters rank these responses (A is better than B). A separate model (the reward model) is trained on these rankings to assign a scalar score that predicts which response humans would prefer.
3. *RL optimization*: The SFT model is further trained using Proximal Policy Optimization (PPO) — a reinforcement learning algorithm — maximizing the reward model's score while not drifting too far from the SFT baseline (via KL divergence penalty).
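Phases 2 and 3 each reduce to a simple formula. A minimal sketch of both, using hypothetical scalar helpers rather than a real training loop: the reward model is typically fit with a Bradley-Terry pairwise loss, and the PPO phase optimizes the reward-model score minus a KL penalty toward the SFT baseline. The function names and the `kl_coef` value here are illustrative assumptions, not a specific library's API.

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one ranked pair: -log sigmoid(r_chosen - r_rejected).
    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def shaped_reward(rm_score: float, logp_policy: float, logp_sft: float,
                  kl_coef: float = 0.1) -> float:
    """Reward signal used in the PPO phase: the reward model's score minus
    a KL-style penalty (policy log-prob minus SFT log-prob) that discourages
    the policy from drifting far from the SFT baseline."""
    return rm_score - kl_coef * (logp_policy - logp_sft)
```

Note how the KL term cuts the reward whenever the policy assigns its output much higher probability than the SFT model would, which is exactly the "drifting too far" signal described above.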
**What RLHF achieves**
The resulting model is much better at following instructions, being helpful without being harmful, and declining inappropriate requests — even on prompts the human annotators never saw.
**Variations**
*RLAIF (RL from AI Feedback)*: Another AI model (often a stronger one) provides the rankings instead of human raters. Cheaper and more scalable.
*DPO (Direct Preference Optimization)*: A simpler alternative that skips the separate reward model and RL loop, optimizing the policy directly on preference pairs with a classification-style loss. It often matches PPO-based RLHF at lower cost and complexity.
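The DPO objective can be sketched per preference pair: it is a logistic loss on the policy-vs-reference log-probability ratios of the chosen and rejected responses, scaled by a temperature `beta`. This standalone function (the signature and default `beta` are illustrative assumptions) shows the formula without any reward model or RL rollout:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (log-ratio margin)),
    where each log-ratio compares the trained policy to a frozen
    reference model (usually the SFT checkpoint)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy raises the chosen response's probability relative to the reference more than the rejected one's, the margin grows and the loss shrinks, so gradient descent on this loss plays the same role as the reward-plus-PPO pipeline.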
**Pitfalls**
RLHF encodes the preferences of its annotators: if the annotators have biases, the model can amplify them. The trained model can also "game" the reward model in unexpected ways (reward hacking). And RLHF tends to make models more conservative, sometimes refusing reasonable requests.