Introduction to deep Q learning

Deep Reinforcement Learning in Python

Timothée Carayol

Principal Machine Learning Engineer, Komment

What is Deep Q Learning?

 

 

An image representing Q(state, action), with the state represented as the Earth and the action represented as a joystick


Q-Learning refresher

 

Action-value function $Q_\pi(s, a)$: expected sum of future rewards if action $a$ is taken in state $s$, assuming that policy $\pi$ is followed afterwards:

$$ Q_\pi(s, a) = \mathbb{E}_\tau \left[ R_\tau \mid s_t = s, a_t = a \right] $$

 

 

  • Knowledge of $Q$ would enable optimal policy: $$ \pi(s_t) = {\arg\max}_a Q(s_t, a) $$

  • Goal of Q-learning: learn $Q$ over time
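The argmax policy above can be sketched in a few lines. This is an illustrative NumPy example, not from the slides: the Q-table values and the `greedy_policy` name are assumptions chosen for demonstration.

```python
import numpy as np

# Hypothetical Q-table: 4 states x 2 actions (values chosen for illustration)
Q = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.0, 0.3],
              [0.7, 0.7]])

def greedy_policy(state):
    # pi(s) = argmax_a Q(s, a)
    return int(np.argmax(Q[state]))

print(greedy_policy(0))  # action 1 has the higher Q-value in state 0
```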


Q-Learning refresher

Bellman equation (in Q-learning) in a deterministic environment:

$$ Q_\pi(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q_\pi(s_{t+1}, a_{t+1}) $$

Temporal difference target (a.k.a. TD-target, Q-target, or target Q-value): the right-hand side of the Bellman equation, used as the target value for the Q-learning update rule:

$$ r_{t+1} + \gamma \max_{a_{t+1}} Q_\pi(s_{t+1}, a_{t+1}) $$

  • Bellman equation: recursive formula for $Q$
  • Right-hand side of the Bellman equation: the "TD-target"
  • Use the TD-target to update $\hat{Q}$ after each step

Q-learning update rule:

$$ Q_{\text{new}} = (1 - \alpha)\, Q_{\text{old}} + \alpha \cdot \text{TD-target} $$
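The refresher above can be sketched as a single tabular update step. A minimal NumPy illustration, where the environment shape, `alpha`, `gamma`, and the `q_learning_update` helper are assumptions for demonstration only:

```python
import numpy as np

# Illustrative setup: 4 states, 2 actions, all Q-values start at zero
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    # TD-target: r + gamma * max_a' Q(s', a')
    td_target = r + gamma * np.max(Q[s_next])
    # Update rule: blend the old estimate with the TD-target
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * td_target

# One transition: from state 0, action 1 yields reward 1.0 and lands in state 2
q_learning_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.9 * 0 + 0.1 * (1.0 + 0.99 * 0) = 0.1
```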


The Q-Network

A Q-table with 4 states and 4 actions, so 16 cells to fill


The Q-Network

A Q-table with 9 states and 4 actions, so 36 cells to fill


The Q-Network

A Q-table with dozens of states and 4 actions, so on the order of 100 cells to fill


The Q-Network

  • At the heart of Deep Q Learning: a neural network

Illustration of a fully connected neural network with two hidden layers


The Q-Network

  • At the heart of Deep Q Learning: a neural network

Illustration of a fully connected neural network with two hidden layers, with the Earth image from the previous slide feeding into the input layer


The Q-Network

  • At the heart of Deep Q Learning: a neural network mapping state to Q-values

The illustration from the previous slide, with each node in the output layer associated with an action, represented as a direction on the joystick: up = action 0, right = 1, down = 2, left = 3.

  • A network approximating the action-value function is called a 'Q-network'
  • Q-networks are commonly used in Deep Q Learning algorithms, such as DQN

Implementing the Q-network

import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(torch.tensor(state)))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

q_network = QNetwork(8, 4)
optimizer = optim.Adam(q_network.parameters(), lr=0.0001)
  • Input dimension determined by state
  • Output dimension determined by number of possible actions

  • In this example:

    • 2 hidden layers with 64 nodes each
    • ReLU activation function
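Acting with a Q-network is a forward pass followed by an argmax over the output Q-values. A self-contained sketch of the same architecture using `nn.Sequential` (the 8-dimensional state and random input are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Same shape as the QNetwork above: 8-dim state in, 4 Q-values out,
# two 64-unit hidden layers with ReLU activations
q_network = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

state = torch.rand(8)             # hypothetical observation vector
q_values = q_network(state)       # one Q-value per action
action = int(torch.argmax(q_values))
print(q_values.shape)             # torch.Size([4])
```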

Let's practice!

