Introduction to policy gradient

Deep Reinforcement Learning in Python

Timothée Carayol

Principal Machine Learning Engineer, Komment

Introduction to policy methods in DRL

 

Q-learning:

  • Learn the action value function Q

A Q-network, with the state as input and the action values as output

  • Policy: select action with highest value

 

Policy learning:

  • Learn the policy directly

A policy network, with the state as input and the action probabilities as output


Policy learning

 

  • Policies can be stochastic
  • Can handle continuous action spaces
  • Directly optimize the objective (expected return)
  • Gradient estimates have high variance
  • Less sample-efficient than value-based methods

 

  • In deep Q-learning, policies are deterministic (contrast sketched below)

 

$\pi_\theta(a_t | s_t)$:

  • Probability distribution for $a_t$ in state $s_t$, with:
    • $a_t$, $s_t$: action and state at step $t$
    • $\theta$: policy parameters (network weights)
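
To make the contrast concrete, a minimal sketch of action selection in both settings (the helper names and the q_network / policy_network arguments are illustrative assumptions, not course code):

import torch

def select_action_greedy(q_network, state):
  # Deep Q-learning style: deterministic, always the highest-valued action
  return torch.argmax(q_network(state))

def select_action_stochastic(policy_network, state):
  # Policy learning style: stochastic, sample from pi_theta(. | s_t)
  action_probs = policy_network(state)
  return torch.distributions.Categorical(action_probs).sample()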

The policy network (discrete actions)

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
  def __init__(self, state_size, action_size):
    super(PolicyNetwork, self).__init__()
    self.fc1 = nn.Linear(state_size, 64)
    self.fc2 = nn.Linear(64, 64)
    self.fc3 = nn.Linear(64, action_size)

  def forward(self, state):
    # Two hidden layers with ReLU, then a softmax over the action dimension
    x = torch.relu(self.fc1(torch.tensor(state)))
    x = torch.relu(self.fc2(x))
    action_probs = torch.softmax(self.fc3(x), dim=-1)
    return action_probs

action_probs = policy_network(state)
print('Action probabilities:', action_probs)
Action probabilities: tensor([0.21, 0.02, 0.74, 0.03])

Action  Index  Probability
up      0      0.21
right   1      0.02
down    2      0.74
left    3      0.03

action_dist = (
    torch.distributions.Categorical(action_probs))

action = action_dist.sample()
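
The distribution object also exposes the log probability of the sampled action, which the policy gradient (introduced below) will need; a one-line addition to the slide code:

log_prob = action_dist.log_prob(action)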

The objective function

 

  • Policy must maximize expected returns

    • Assuming the agent follows $\pi_\theta$
    • By optimizing policy parameter $\theta$
  • Objective function:

$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R_\tau\right]$, where $R_\tau$ is the episode return

 

  • To maximize $J$: need gradient with respect to $\theta$:

$\nabla_\theta J(\pi_\theta)$: the policy gradient
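
As a rough sketch of what $J(\pi_\theta)$ means in practice, the expectation can be approximated by averaging returns over sampled episodes (assuming a Gymnasium-style env and the policy_network defined earlier; the function name and episode count are illustrative):

import torch

def estimate_objective(env, policy_network, num_episodes=10):
  returns = []
  for _ in range(num_episodes):
    state, _ = env.reset()
    done, episode_return = False, 0.0
    while not done:
      # Follow pi_theta: sample each action from the policy network's output
      action_probs = policy_network(state)
      action = torch.distributions.Categorical(action_probs).sample()
      state, reward, terminated, truncated, _ = env.step(action.item())
      done = terminated or truncated
      episode_return += reward
    returns.append(episode_return)
  # Monte Carlo estimate of J(pi_theta) = E_tau[R_tau]
  return sum(returns) / len(returns)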

The policy gradient theorem

 

  • Gives a tractable expression for $\nabla_\theta J(\pi_\theta)$
  • Expectation over trajectories following $\pi_\theta$
    • Collect trajectories and observe the returns
  • For each trajectory: consider return $R_\tau$
  • Multiply by the sum of gradients of log probabilities of selected actions
  • Intuition: nudge $\theta$ in ways that increase the probability of all actions taken in a 'good' episode

 

The policy gradient theorem: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R_\tau \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$
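
A minimal sketch of how this estimate becomes an update (assuming log_probs is the list of log probabilities of the actions taken in one episode, collected with action_dist.log_prob as above, and episode_return is its return $R_\tau$; the function name and optimizer usage are illustrative):

import torch

def reinforce_loss(log_probs, episode_return):
  # -R_tau * sum_t log pi_theta(a_t | s_t): minimizing this with gradient
  # descent nudges theta along the policy gradient from the theorem above
  return -episode_return * torch.stack(log_probs).sum()

# After collecting one episode (optimizer built over the policy network's parameters):
# optimizer.zero_grad()
# reinforce_loss(log_probs, episode_return).backward()
# optimizer.step()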


 

(Animation: a game of Pong)


Let's practice!
