Training with PPO

Reinforcement Learning from Human Feedback (RLHF)

Mina Parham

AI Engineer

Fine-tuning with reinforcement learning

The initial LLM and the reward model in the RLHF process.


Fine-tuning with reinforcement learning

The complete RLHF process.


Fine-Tuning Language Models with PPO


Diagram of a query sent to an LLM to generate a completion.


Fine-Tuning Language Models with PPO


Diagram of a query to the LLM, which completes the query: 'we're half way there, oh livin' on a prayer'.


Fine-Tuning Language Models with PPO


Diagram of a query to the LLM, which completes it with 'we're half way there, oh livin' on a prayer', while a second LLM evaluates the completion.


Fine-Tuning Language Models with PPO

  • PPO: incremental updates to the model
  • Avoids overfitting to the feedback

A robot and a snail symbolize the algorithm's slow, gradual improvement.
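The "incremental updates" above come from PPO's clipped surrogate objective, which bounds how far each update can move the policy. A minimal sketch of that clipping (an illustration only, not TRL's implementation; the function name and arguments are hypothetical):

```python
import math

def ppo_clip_loss(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for one action (to be minimized)."""
    ratio = math.exp(logprob_new - logprob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    # Clip the ratio to [1 - eps, 1 + eps] so large policy shifts gain nothing
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    return -min(unclipped, clipped)  # maximize the pessimistic (min) estimate

# A big policy shift (ratio = e^2 ≈ 7.39) with positive advantage is
# capped at 1 + clip_eps = 1.2, so the incentive to overshoot disappears:
loss = ppo_clip_loss(logprob_new=0.0, logprob_old=-2.0, advantage=1.0)
```

Because the clipped term wins the `min`, the gradient vanishes once the new policy drifts past the clip range, which is exactly what keeps updates small and prevents overfitting to noisy feedback.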


Implementing PPOTrainer with TRL

from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

config = PPOConfig(model_name="gpt2", learning_rate=1.4e-5)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

ppo_trainer = PPOTrainer(model=model, config=config, dataset=dataset,
                         tokenizer=tokenizer)

Starting the training loop

from tqdm import tqdm

for epoch in tqdm(range(10), "epoch: "):
    for batch in tqdm(ppo_trainer.dataloader):
        query_tensors = batch["input_ids"]

        # Get responses
        response_tensors = ppo_trainer.generate(query_tensors)
        batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

        # Compute reward score
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        rewards = reward_model(texts)

        # Run a PPO optimization step and log the statistics
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
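The loop assumes a `reward_model` callable that returns one score per generated text. A toy stand-in is sketched below purely for illustration; a real RLHF setup would use a trained reward model, and TRL's `step` expects each score wrapped in a `torch.tensor`:

```python
# Hypothetical stand-in for the reward model used in the training loop:
# returns one scalar score per text. A real setup would use a trained
# preference/reward model instead.
def reward_model(texts):
    """Toy reward: favor longer, less repetitive continuations."""
    rewards = []
    for text in texts:
        words = text.split()
        # Fraction of distinct words penalizes degenerate repetition
        unique_ratio = len(set(words)) / max(len(words), 1)
        rewards.append(len(words) * unique_ratio)
    return rewards

# Varied text earns a higher reward than repetitive text of the same length
scores = reward_model(["we're half way there oh livin' on a prayer",
                       "prayer prayer prayer prayer prayer prayer prayer prayer prayer"])
```

Any scalar signal can slot in here, which is what makes the PPO loop flexible: the reward model, not the loop, encodes the human preferences.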

Let's practice!

