Evaluating RLHF models

Reinforcement Learning from Human Feedback (RLHF)

Mina Parham

AI Engineer

Automated metrics

  • Classification task: Accuracy, F1 score (see the scoring sketch after the table)
classification_results.head(3)
| ID | Feedback_Text                         | True_Category | Predicted_Category |
|----|---------------------------------------|---------------|--------------------|
| 1  | "Arrived on time and works great."    | Positive      | Positive           |
| 2  | "I had issues with customer service." | Negative      | Neutral            |
| 3  | "The website is easy to navigate."    | Positive      | Positive           |

  • Text generation, summarization: ROUGE, BLEU (see the BLEU sketch after the table)
text_generation.head(3)
| ID | Prompt               | True_Completion  | Pred_Completion   |
|----|----------------------|------------------|-------------------|
| 1  | "Customer service"   | "can help you."  | "will assist."    |
| 2  | "To get a refund,"   | "contact us."    | "reach out."      |
| 3  | "Support team is"    | "here 24/7."     | "available 24/7." |
Reference statement:

  • RLHF improves model alignment with human values.

Statement to compare:

  • RLHF aligns models with human values.

ROUGE score: 0.83
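
A minimal sketch of computing such a ROUGE score, again assuming the evaluate library; the exact value depends on the ROUGE variant and implementation, so it will not necessarily match the 0.83 shown above.

import evaluate
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["RLHF aligns models with human values."],
    references=["RLHF improves model alignment with human values."])
# rouge1 is the unigram-overlap variant; rouge2 and rougeL are also returned
print(scores["rouge1"])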

Artifact curves

from trl import PPOConfig
import wandb
wandb.init()  # start a Weights & Biases run to record the training curves
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb", learning_rate=1.41e-5, log_with="wandb")

A screenshot of the terminal output in Weights and Biases.

  • Reward increases as the model learns.

A curve showing an upward trend in the reward, meaning the model is improving.

  • The KL curve should increase gradually; a sudden spike means the policy is drifting too far from the reference model (see the training sketch below).

A curve showing a gradual upward trend in the KL loss.
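
A minimal sketch of where these curves come from, assuming the classic trl PPOTrainer API and the config defined earlier; dataset (tokenized prompts) and compute_reward are hypothetical stand-ins for setup not shown here.

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer
# Policy with a value head, plus a frozen reference copy for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset)
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False)
    # compute_reward (assumed) returns a scalar tensor score per pair
    rewards = [compute_reward(q, r) for q, r in zip(query_tensors, response_tensors)]
    # One PPO step; the returned stats feed the reward and KL curves in W&B
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)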

Human-centered evaluation

  • Human evaluation: best suited to subjective judgments or tasks requiring a deep understanding of context

A human evaluator at her laptop.

  • Model-based evaluation: offers scalability and consistency (see the judge sketch below)

A robot with speech bubbles representing a model evaluator.
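
To illustrate model-based evaluation, here is a minimal LLM-as-a-judge sketch; the OpenAI client and the judge model name are assumptions rather than course material, and any capable model could play the judge.

from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
def judge(response, criterion="helpfulness"):
    # Ask the judge model for a 1-5 rating of the response
    prompt = (f"Rate the following response for {criterion} on a scale "
              f"of 1 to 5. Reply with the number only.\n\n{response}")
    result = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return result.choices[0].message.content
print(judge("You can request a refund by contacting our support team."))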

Let's practice!
