Introduction to LLMs in Python
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
"What makes an LLM good?", "What is the LLM user looking for?"
- Objective, subjective and context-dependent criteria
- Truthfulness, originality, fine-grained detail vs. concise responses, etc.
Reinforcement Learning (RL): an agent learns to make decisions upon feedback (rewards), adapting its behavior to maximize cumulative reward over time
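As a minimal, self-contained illustration of this feedback loop (a toy two-armed bandit, not TRL code; all names here are hypothetical), the sketch below shows an agent adapting its choices to maximize cumulative reward:

import random

# Toy illustration of the RL feedback loop (hypothetical example, not TRL code)
reward_probs = [0.2, 0.8]      # hidden probability that each action pays off
value_estimates = [0.0, 0.0]   # agent's running estimate of each action's value
counts = [0, 0]
total_reward = 0.0

for step in range(1000):
    # epsilon-greedy: mostly exploit the best-known action, occasionally explore
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: value_estimates[a])
    reward = 1.0 if random.random() < reward_probs[action] else 0.0  # feedback from the environment
    counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]
    total_reward += reward

print(total_reward)  # grows faster over time as the agent learns to favor the better action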
TRL: a library to train transformer-based LLMs using a variety of RL approaches
Proximal Policy Optimization (PPO): optimizes the LLM based on <prompt, response, reward> triplets
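As a purely illustrative sketch, one PPO training example can be thought of as the triplet below (the dictionary layout is hypothetical; TRL passes prompts, responses, and rewards as separate lists of tensors, as in the set-up example later):

# Hypothetical illustration of a single <prompt, response, reward> triplet
triplet = {
    "prompt": "My plan today is to ",
    "response": "go for a run and then read about reinforcement learning.",
    "reward": 1.0,  # scalar feedback, e.g. from a reward model or a human rating
}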
- AutoModelForCausalLMWithValueHead: it incorporates a value head for RL scenarios
- model_ref: reference model, e.g. the loaded pre-trained model before optimizing
- respond_to_batch: similar purpose as model.generate(), adapted to RL
- PPOTrainer: class used to create the PPO trainer instance

PPO set-up example:
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, create_reference_model, AutoModelForCausalLMWithValueHead
from trl.core import respond_to_batch

# Load the model with a value head, create a frozen reference copy, and load the tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = create_reference_model(model)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Encode a prompt and generate a response with the model
prompt = "My plan today is to "
input_ids = tokenizer.encode(prompt, return_tensors="pt")
response = respond_to_batch(model, input_ids)

# Configure and instantiate the PPO trainer
ppo_config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)

# Assign a reward to the <prompt, response> pair and run one PPO optimization step
reward = [torch.tensor(1.0)]
train_stats = ppo_trainer.step([input_ids[0]], [response[0]], reward)
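In practice, the scalar reward usually comes from a reward model rather than being hard-coded. The sketch below is one possible variant (an assumption for illustration, not from the slides): it scores the generated response with an off-the-shelf sentiment-analysis pipeline and reuses the objects created above:

from transformers import pipeline

# Hedged sketch: derive the reward from a sentiment classifier instead of a fixed value
# (the choice of scoring model is an assumption; any function mapping a response to a scalar works)
sentiment_pipe = pipeline("sentiment-analysis")      # loads a default sentiment model
response_txt = tokenizer.decode(response[0])
result = sentiment_pipe(response_txt)[0]             # e.g. {'label': 'POSITIVE', 'score': 0.93}
reward_value = result["score"] if result["label"] == "POSITIVE" else -result["score"]

reward = [torch.tensor(reward_value)]
train_stats = ppo_trainer.step([input_ids[0]], [response[0]], reward)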