Incorporating diverse feedback sources

Reinforcement Learning from Human Feedback (RLHF)

Mina Parham

AI Engineer

Improved model generalization

 

  • Represents different viewpoints and contexts
  • Helps the model generalize across preferences and values

An image of hands with different callout bubbles representing diverse opinions.


Reduced bias

  • Mitigates individual biases
  • Creates a more balanced and fair model output

A bar chart showing the difference between a dataset biased towards a male group, and a balanced dataset with equal distribution for men and women


Better alignment with human values

  • Captures complex human preferences
  • Represents a range of cultures and backgrounds

Icons representing a diverse group of people


Enhanced adaptability

  • Model responds to a wider range of user needs and preferences
  • Reflects different viewpoints in its outputs

An icon showing a person with emojis indicating different viewpoints.


Increased robustness

  • More resilient to different types of inputs
  • Improves performance across varied contexts

A diagram showing improved quality thanks to different inputs and contexts.


Integrating preference data from multiple sources

Preference data preference_df with sources 'Journalist', 'Social Media Influencer', and 'Marketing Professional':

A table showing structured data from three different sources
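
The exact rows are not shown here, so the sketch below uses hypothetical values; only the column names ('id', 'source', 'chosen', 'rejected') are taken from the code on the following slides:

import pandas as pd

# Hypothetical preference data: each source picks a chosen and a rejected
# response for every prompt id (sample values are illustrative only)
preference_df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'source': ['Journalist', 'Social Media Influencer', 'Marketing Professional'] * 2,
    'chosen': ['response_a', 'response_a', 'response_b',
               'response_c', 'response_c', 'response_c'],
    'rejected': ['response_b', 'response_b', 'response_a',
                 'response_d', 'response_d', 'response_d'],
})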


Majority voting

This sample data can be integrated by grouping by 'id' and applying a majority_vote function:

df_majority = preference_df.groupby(['id']).apply(majority_vote)

The majority_vote function counts how often each (chosen, rejected) pair appears within a group and returns the most common one:

from collections import Counter

def majority_vote(df):
    # Count each (chosen, rejected) pair within the group
    votes = Counter(zip(df['chosen'], df['rejected']))
    # Return the pair with the most votes
    return max(votes, key=votes.get)
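
As a usage sketch on the hypothetical preference_df above: df_majority is a Series mapping each id to its winning (chosen, rejected) pair, and it can be expanded back into a consolidated preference table (the expansion step is an assumption, not part of the original example):

# One winning (chosen, rejected) pair per prompt id
df_majority = preference_df.groupby(['id']).apply(majority_vote)

# Optionally expand the winning pairs into a tidy table again
df_consolidated = pd.DataFrame(
    df_majority.tolist(), columns=['chosen', 'rejected'], index=df_majority.index
).reset_index()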

Unreliable preference data sources

Preference data preference_df2 from the same three sources:

A table showing structured data from three different sources


Unreliable preference data sources

  • Iterate over the rows of preference_df2 and count how often each source disagrees with the majority vote, to flag the least reliable source:

df_majority = preference_df2.groupby('id').apply(majority_vote)

# Count each source's disagreements with the majority vote
disagreements = {source: 0 for source in preference_df2['source'].unique()}

for _, row in preference_df2.iterrows():
    if (row['chosen'], row['rejected']) != df_majority[row['id']]:
        disagreements[row['source']] += 1

# The source with the most disagreements is the least reliable
detect_unreliable_source = max(disagreements, key=disagreements.get)
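
One possible follow-up, not shown on the slide, is to drop the flagged source's rows before the preference data is used further; a minimal sketch assuming the variables defined above:

# Keep only rows from sources other than the flagged one
reliable_df = preference_df2[preference_df2['source'] != detect_unreliable_source]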

Let's practice!
