Reinforcement Learning from Human Feedback (RLHF)
A training technique where human raters rank model outputs and those preferences train a reward model that guides further LLM fine-tuning.
Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed raw LLMs into the helpful, harmless, and honest assistants we use today. Without RLHF, a language model simply predicts the most likely next token given its training data, which includes harmful, biased, or off-topic content. RLHF steers the model toward outputs humans prefer.
**The three-phase process**
1. *Supervised fine-tuning (SFT)*: Human annotators write ideal responses to sample prompts. The base LLM is fine-tuned on these examples to establish a baseline helpful model.
2. *Reward model training*: The SFT model generates multiple responses to each prompt. Human raters rank these responses (A is better than B). A separate model (the reward model) is trained on these rankings to assign a scalar score that predicts which response humans would prefer.
3. *RL optimization*: The SFT model is further trained using Proximal Policy Optimization (PPO) — a reinforcement learning algorithm — maximizing the reward model's score while not drifting too far from the SFT baseline (via KL divergence penalty).
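Phases 2 and 3 each reduce to a simple formula. A minimal sketch of both, using hypothetical scalar helpers rather than a real training loop: the reward model is typically fit with a Bradley-Terry pairwise loss, and the PPO phase optimizes the reward-model score minus a KL penalty toward the SFT baseline. The function names and the `kl_coef` value here are illustrative assumptions, not a specific library's API.

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one ranked pair: -log sigmoid(r_chosen - r_rejected).
    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def shaped_reward(rm_score: float, logp_policy: float, logp_sft: float,
                  kl_coef: float = 0.1) -> float:
    """Reward signal used in the PPO phase: the reward model's score minus
    a KL-style penalty (policy log-prob minus SFT log-prob) that discourages
    the policy from drifting far from the SFT baseline."""
    return rm_score - kl_coef * (logp_policy - logp_sft)
```

Note how the KL term cuts the reward whenever the policy assigns its output much higher probability than the SFT model would, which is exactly the "drifting too far" signal described above.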
**What RLHF achieves**
The resulting model is much better at following instructions, being helpful without being harmful, and declining inappropriate requests — even on prompts the human annotators never saw.
**Variations**
*RLAIF (RL from AI Feedback)*: Another AI model (often a stronger one) provides the rankings instead of human raters. Cheaper and more scalable.
*DPO (Direct Preference Optimization)*: A simpler alternative that skips the separate reward model and RL loop, optimizing the policy directly on preference pairs with a classification-style loss. It often matches PPO-based RLHF at lower cost and complexity.
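The DPO objective can be sketched per preference pair: it is a logistic loss on the policy-vs-reference log-probability ratios of the chosen and rejected responses, scaled by a temperature `beta`. This standalone function (the signature and default `beta` are illustrative assumptions) shows the formula without any reward model or RL rollout:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (log-ratio margin)),
    where each log-ratio compares the trained policy to a frozen
    reference model (usually the SFT checkpoint)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy raises the chosen response's probability relative to the reference more than the rejected one's, the margin grows and the loss shrinks, so gradient descent on this loss plays the same role as the reward-plus-PPO pipeline.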
**Pitfalls**
RLHF encodes the preferences of its annotators: if the annotators have biases, the model can amplify them. The trained model can also "game" the reward model in unexpected ways (reward hacking). And RLHF tends to make models more conservative, sometimes refusing reasonable requests.