Entropy bonus and PPO

Deep Reinforcement Learning in Python

Timothée Carayol

Principal Machine Learning Engineer, Komment

Entropy bonus

 

 

  • Policy gradient algorithms may collapse into deterministic policies
  • Solution: add entropy bonus
  • Entropy measures the uncertainty of a distribution!

A Mars rover unable to progress because of a huge rock lying immediately in front of it.


Entropy of a probability distribution

 

The entropy of a discrete random variable $X$, measured in bits, is defined as $H(X) = -\sum_{x} p(x) \log_2 p(x)$

  • If $\ln$ is used instead of $\log_2$: result measured in nats.
  • $1\text{ nat} = \frac{1}{\ln 2}\text{ bits} \approx 1.44\text{ bits}$

Three bar charts showing different stochastic policies over 4 actions:

  • Uniform policy: each action has probability 0.25; entropy 2 bits
  • Probability concentrated equally on two actions, 0 elsewhere; entropy 1 bit
  • All probability concentrated on a single action; entropy 0 bits
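These values can be verified directly. The snippet below is a small sketch (assuming PyTorch's torch.distributions.Categorical, whose entropy() returns nats) that reproduces the three entropies in bits:

import math
import torch
from torch.distributions import Categorical

policies = [
  [0.25, 0.25, 0.25, 0.25],  # uniform: maximum uncertainty
  [0.5, 0.5, 0.0, 0.0],      # split equally over two actions
  [1.0, 0.0, 0.0, 0.0],      # deterministic: no uncertainty
]
for probs in policies:
  # Categorical.entropy() returns nats; dividing by ln(2) converts to bits
  entropy_bits = (Categorical(torch.tensor(probs)).entropy() / math.log(2)).item()
  print(f"{probs} -> {entropy_bits:.1f} bits")  # 2.0, 1.0, 0.0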


Implementing the entropy bonus

from torch.distributions import Categorical

def select_action(policy_network, state):
  action_probs = policy_network(state)
  action_dist = Categorical(action_probs)
  action = action_dist.sample()
  log_prob = action_dist.log_prob(action)
  # Obtain the entropy of the policy
  entropy = action_dist.entropy()
  return (action.item(), log_prob.reshape(1), entropy)
  • Subtract the entropy bonus from the actor loss: actor_loss -= c_entropy * entropy, where c_entropy scales the strength of the bonus
  • Note: Categorical.entropy() is in nats; divide by math.log(2) for bits

PPO training loop

for episode in range(10):
  state, info = env.reset()
  done = False
  while not done:
    action, action_log_prob, entropy = select_action(actor, state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # Updating at every step: the old policy is the current policy, so the
    # detached log probability serves as the "old" log probability
    actor_loss, critic_loss = calculate_losses(critic, action_log_prob,
                                               action_log_prob.detach(),
                                               reward, state, next_state, done)
    # Entropy bonus: encourage the policy to stay stochastic
    actor_loss -= c_entropy * entropy
    actor_optimizer.zero_grad(); actor_loss.backward(); actor_optimizer.step()
    critic_optimizer.zero_grad(); critic_loss.backward(); critic_optimizer.step()
    state = next_state
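The function calculate_losses is defined earlier in the course; as a rough sketch only, a version based on the clipped surrogate objective with a one-step TD advantage might look like this (the gamma and epsilon defaults and the choice of advantage estimator are assumptions, not the course's reference code):

import torch

def calculate_losses(critic, action_log_prob, action_log_prob_old,
                     reward, state, next_state, done,
                     gamma=0.99, epsilon=0.2):
  # One-step TD target; no gradient flows through the bootstrap value
  value = critic(state)
  with torch.no_grad():
    td_target = reward + gamma * critic(next_state) * (1 - done)
  critic_loss = (value - td_target).pow(2).mean()
  advantage = (td_target - value).detach()
  # Probability ratio between current and old policies
  ratio = torch.exp(action_log_prob - action_log_prob_old)
  clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
  # Clipped surrogate objective (negated, since optimizers minimize)
  actor_loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
  return actor_loss, critic_loss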

Towards PPO with batch updates

 

  • Updating the parameters at every step does not take full advantage of the PPO objective function
  • At each step, $\theta$ actually coincides with $\theta_{old}$, so the probability ratio is 1 and clipping never activates
  • Full PPO implementations decouple the two update frequencies (see the sketch below):
    • Parameter updates (at every minibatch)
    • Policy updates (at every rollout)
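As a minimal sketch of this decoupling, the skeleton below reuses select_action and calculate_losses from the previous slides; the rollout storage, the minibatch slicing, and the hyperparameter values (n_rollouts, n_epochs, batch_size) are illustrative assumptions rather than the course's reference implementation:

import random
import torch
from torch.distributions import Categorical

n_rollouts, n_epochs, batch_size = 10, 4, 64  # illustrative values

for rollout_idx in range(n_rollouts):
  # Policy update: freeze the current policy as pi_old and collect
  # one full episode with it
  rollout = []
  state, info = env.reset()
  done = False
  while not done:
    action, log_prob, entropy = select_action(actor, state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    rollout.append((state, action, log_prob.detach(), reward, next_state, done))
    state = next_state
  # Parameter updates: several epochs of minibatch gradient steps on the
  # same rollout; theta drifts away from theta_old, so the clipped
  # probability ratio is no longer trivially 1
  for epoch in range(n_epochs):
    random.shuffle(rollout)
    for start in range(0, len(rollout), batch_size):
      actor_loss, critic_loss = 0.0, 0.0
      for s, a, log_prob_old, r, s_next, d in rollout[start:start + batch_size]:
        # Re-evaluate the stored action under the current policy theta
        dist = Categorical(actor(s))
        log_prob_new = dist.log_prob(torch.tensor(a)).reshape(1)
        a_loss, c_loss = calculate_losses(critic, log_prob_new, log_prob_old,
                                          r, s, s_next, d)
        actor_loss += a_loss - c_entropy * dist.entropy()
        critic_loss += c_loss
      actor_optimizer.zero_grad(); actor_loss.backward(); actor_optimizer.step()
      critic_optimizer.zero_grad(); critic_loss.backward(); critic_optimizer.step()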

Let's practice!

