Batch updates in policy gradient

Deep Reinforcement Learning in Python

Timothée Carayol

Principal Machine Learning Engineer, Komment

Stepwise vs batch gradient updates

A big box representing an episode.


In the large box, a smaller box appears, representing step 1. Within it, another box with the text 'select action.'


In the step 1 box, another small box appears with the text 'iterate environment'


Underneath the step 1 box, another box with the labels 'calculate loss' and 'gradient descent'


An identical pair of boxes appears for the second step, with the same content.


Step 3 and step 4 appear as well.
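The stepwise scheme in the diagram can be sketched as a toy loop in plain Python. Nothing here is real A2C code: `toy_env_step` is a hypothetical stand-in for the environment, and the counter just shows that gradient descent runs once per step.

```python
def toy_env_step(step):
    """Hypothetical environment transition; the episode ends after 4 steps."""
    reward = 1.0
    done = step >= 3
    return reward, done

updates = 0
step = 0
done = False
while not done:
    reward, done = toy_env_step(step)  # iterate environment
    loss = -reward                     # placeholder per-step loss
    updates += 1                       # gradient descent on this single step
    step += 1

print(updates)  # 4 steps -> 4 gradient updates
```

With one update per step, a 4-step episode triggers 4 separate gradient descent passes.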


Batching the A2C / PPO updates

A large episode box; taking up half its area, another box labelled 'rollout 1'; within it, two empty boxes labelled 'step 1' and 'step 2'.


In the step 1 box, the labels 'select action' and 'iterate environment' appear.


Same for step 2.


Underneath the step 1 and step 2 boxes, appears a single 'calculate loss' label, and a single 'gradient descent' label.


The remaining half of the episode area is now taken up by another identical rollout box with two steps, labelled 'rollout 2'.
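With batching, the per-step losses are accumulated and a single update is made once the rollout is full. A minimal sketch with placeholder losses (no real networks; rollout length 2 over a 4-step toy episode):

```python
rollout_length = 2
losses = []
updates = 0

for step in range(4):                  # one 4-step toy episode
    losses.append(float(-step))        # placeholder per-step loss
    if len(losses) >= rollout_length:  # rollout full
        batch_loss = sum(losses) / len(losses)  # batch average loss
        updates += 1                   # single gradient descent pass
        losses = []                    # reinitialize for the next rollout

print(updates)  # 4 steps / rollout of 2 -> 2 updates
```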


The A2C training loop with batch updates

 

# Set rollout length
rollout_length = 10

# Initialize loss batches
actor_losses = torch.tensor([])
critic_losses = torch.tensor([])
  • Initialize loss batches
  • Iterate through episodes and steps as usual

 

for episode in range(10):
  state, info = env.reset()
  done = False
  while not done:
    action, action_log_prob = select_action(actor, state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    actor_loss, critic_loss = calculate_losses(
        critic, action_log_prob, 
        reward, state, next_state, done)
    ...

    ...
    actor_losses = torch.cat((actor_losses, actor_loss))
    critic_losses = torch.cat((critic_losses, critic_loss))

    # If rollout is full, update the networks
    if len(actor_losses) >= rollout_length:
      actor_loss_batch = actor_losses.mean()
      critic_loss_batch = critic_losses.mean()

      actor_optimizer.zero_grad()
      actor_loss_batch.backward()
      actor_optimizer.step()
      critic_optimizer.zero_grad()
      critic_loss_batch.backward()
      critic_optimizer.step()

      # Reinitialize the loss batches
      actor_losses = torch.tensor([])
      critic_losses = torch.tensor([])

    state = next_state

 

  • Append step loss to the loss batches
  • When rollout is full:
    • Take the batch average loss with .mean()
    • Perform gradient descent
    • Reinitialize the batch losses

A2C / PPO with multiple agents

 

Two horizontal strips represent agent 1 and agent 2, who experience 4 and 3 episodes respectively, of varying lengths. Within each episode, step boxes are visible as in the previous slides. Under the two strips, three rollout boxes are visible, each covering an interval of 8 steps. Within each rollout box, the labels 'calculate loss' and 'gradient descent' are visible. At the top of the graph, a legend indicates "rollout length: 8 steps; number of agents: 2".
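The multi-agent diagram can be sketched the same way: each agent contributes one transition per environment iteration, so a shared rollout fills up twice as fast with 2 agents. All names here are hypothetical placeholders:

```python
num_agents = 2
rollout_length = 8
rollout = []
updates = 0

for iteration in range(12):             # 12 parallel environment iterations
    for agent in range(num_agents):     # each agent adds one transition
        rollout.append((agent, iteration))
    if len(rollout) >= rollout_length:  # rollout full: one shared update
        updates += 1
        rollout = []

print(updates)  # 24 transitions / 8 per rollout -> 3 updates
```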


Rollouts and minibatches

Two agent strips identical to the previous slide. Underneath, 3 rollout boxes are again visible, but their content has changed. Each now has, on top, a long box labelled 'shuffle'. Under that, each is divided lengthwise into 4 boxes labelled 'minibatch'; within each minibatch are a 'calculate loss' and a 'gradient descent' box. On top of the drawing, a legend indicates: 'Rollout length: 8 steps; minibatch size: 4 (2x2); number of agents: 2'.
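The shuffle-then-split step in the diagram can be sketched with the standard library; the rollout contents here are placeholder integers rather than real transitions:

```python
import random

random.seed(0)
rollout = list(range(8))     # 8 collected transitions (placeholders)
minibatch_size = 4

random.shuffle(rollout)      # the 'shuffle' box in the diagram
minibatches = [rollout[i:i + minibatch_size]
               for i in range(0, len(rollout), minibatch_size)]

# one 'calculate loss' + 'gradient descent' pass per minibatch
print(len(minibatches))  # 8 transitions / minibatch of 4 -> 2 minibatches
```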


PPO with multiple epochs

A drawing very similar to the one before, except the rollout boxes are now also split vertically into 4 areas: the top one is a 'shuffle' label; the second is a large box labelled 'epoch 1' containing 4 minibatches spread lengthwise; the third is a 'reshuffle' label; the last is a large box labelled 'epoch 2', also containing 4 minibatches. The legend says: 'Rollout length: 8 steps; minibatch size: 4 (2x2); number of agents: 2; number of epochs: 2'.
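Multiple epochs reuse the same rollout: it is reshuffled and re-split into minibatches once per epoch, multiplying the number of gradient steps. A toy sketch with placeholder data:

```python
import random

random.seed(0)
rollout = list(range(8))          # 8 collected transitions (placeholders)
minibatch_size = 4
num_epochs = 2
gradient_steps = 0

for epoch in range(num_epochs):
    random.shuffle(rollout)       # shuffle, then reshuffle for epoch 2
    for i in range(0, len(rollout), minibatch_size):
        minibatch = rollout[i:i + minibatch_size]
        gradient_steps += 1       # one update per minibatch

print(gradient_steps)  # 2 epochs x 2 minibatches -> 4 gradient steps
```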


Let's practice!

