Introduction to deep reinforcement learning

Deep Reinforcement Learning in Python

Timothée Carayol

Principal Machine Learning Engineer, Komment

Why Deep Reinforcement Learning

Traditional RL is suitable for low-dimensional tasks

Many applications have high-dimensional state and/or action spaces

An agent exploring the Frozen Lake environment

An agent playing the classic Space Invaders video game

The ingredients of DRL

Reinforcement Learning concepts
Deep Learning and PyTorch

DRL uses these concepts with deep neural networks

Pixel art illustration of a cook mixing ingredients

The RL framework

Step t:
- Agent observes state $s_t$
- Agent takes action $a_t$
Step t+1:
- Environment gives reward $r_t$
- State evolves to $s_{t+1}$
Repeat until episode is complete

A large box with two smaller boxes inside it, labelled 'Agent' and 'Environment'. State and reward arrows go from environment to agent; an action arrow goes from agent to environment.
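
The interaction loop above can be sketched in plain Python. Everything here is an assumption made for illustration: `CorridorEnv` is a hypothetical toy environment (not part of any library) with a simplified `reset`/`step` interface, and the agent simply picks random actions.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of `length` cells."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Every episode starts in the leftmost cell
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left;
        # the position is clipped so the agent stays inside the corridor
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, rng, max_steps=1000):
    """Run one episode with a random agent; return the (state, action, reward) steps."""
    state = env.reset()                              # agent observes s_0
    history = []
    for _ in range(max_steps):
        action = rng.choice([0, 1])                  # agent takes action a_t
        next_state, reward, done = env.step(action)  # environment gives reward
        history.append((state, action, reward))
        state = next_state                           # state evolves to s_{t+1}
        if done:                                     # repeat until episode is complete
            break
    return history


history = run_episode(CorridorEnv(), random.Random(0))
print(len(history), history[-1])
```

The episode ends when the agent reaches the last cell, which is the only step that yields a non-zero reward in this sketch.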

Policy $\pi(s_t)$

Mapping from state to action, describing how the agent behaves in a given state $s_t$

Deterministic:
- Returns the selected action
Stochastic:
- Returns a probability distribution over possible actions, from which the action is sampled
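
The difference between the two kinds of policy can be illustrated with a small sketch. The states, actions and probabilities below are made up for the example; in DRL the distribution would typically come from a neural network rather than a lookup table.

```python
import random

# Hypothetical action preferences for two states of a toy problem
ACTIONS = ["left", "right"]
ACTION_PROBS = {
    "start": [0.2, 0.8],
    "middle": [0.5, 0.5],
}


def deterministic_policy(state):
    """Return the single most probable action for the state."""
    probs = ACTION_PROBS[state]
    return ACTIONS[probs.index(max(probs))]


def stochastic_policy(state, rng):
    """Return the action distribution and one action sampled from it."""
    probs = ACTION_PROBS[state]
    action = rng.choices(ACTIONS, weights=probs, k=1)[0]
    return probs, action


rng = random.Random(0)
print(deterministic_policy("start"))
print(stochastic_policy("start", rng))
```

Calling the deterministic policy twice on the same state always gives the same action, whereas the stochastic policy may return different actions on repeated calls.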

Trajectory and episode return

Trajectory $\tau$: sequence of all states and actions in an episode; $\tau = ((s_0, a_0), (s_1, a_1), \ldots, (s_T, a_T))$

Episode return $R_\tau$: total (discounted) reward accumulated along trajectory $\tau$; $R_\tau = \sum_{t=0}^{T} \gamma^t r_t$
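
As a small worked example of the episode return, the sketch below sums the discounted rewards of one episode. The function name `episode_return`, the reward values and the discount factor are illustrative assumptions.

```python
def episode_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over the rewards of one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


# Three steps with rewards r_0 = 1, r_1 = 0, r_2 = 2 and gamma = 0.5:
# 1.0 * 1 + 0.5 * 0 + 0.25 * 2 = 1.5
rewards = [1.0, 0.0, 2.0]
print(episode_return(rewards, gamma=0.5))
```

With gamma equal to 1 the return is just the sum of the rewards; smaller values of gamma give less weight to rewards that arrive later.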

Setting up the environment

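import gymnasium as gym
import torch.nn as nn
import torch.optim as optim

# Create the environment (Atari environments need the ale-py package installed)
env = gym.make("ALE/SpaceInvaders-v5")

# Define neural network architecture
class Network(nn.Module):
    def __init__(self, dim_inputs, dim_outputs):
        super().__init__()
        self.linear = nn.Linear(dim_inputs, dim_outputs)

    def forward(self, x):
        return self.linear(x)

# Instantiate network
# (dim_inputs and dim_outputs must match the environment's state and action sizes)
network = Network(dim_inputs, dim_outputs)

# Instantiate optimizer
optimizer = optim.Adam(network.parameters(), lr=0.0001)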
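
To see what the network does with a state, the sketch below passes one random vector through it and checks the output shape. The sizes 8 and 4 are placeholders chosen for illustration only, not values taken from Space Invaders; a single linear layer expects a flat vector, so image observations would need flattening or a different architecture.

```python
import torch
import torch.nn as nn


class Network(nn.Module):
    def __init__(self, dim_inputs, dim_outputs):
        super().__init__()
        self.linear = nn.Linear(dim_inputs, dim_outputs)

    def forward(self, x):
        return self.linear(x)


# Placeholder sizes chosen for illustration only
dim_inputs, dim_outputs = 8, 4
network = Network(dim_inputs, dim_outputs)

# A batch containing one random "state" vector
state = torch.rand(1, dim_inputs)
output = network(state)

# One output value per action, for each state in the batch
print(output.shape)
```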

The basic loop

for episode in range(1000):
    state, info = env.reset()
    done = False

    while not done:
        # Select an action
        action = select_action(network, state)

        # Observe new state and reward
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Calculate the loss and update the network
        loss = calculate_loss(network, state, action,
                              next_state, reward, done)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the state
        state = next_state

Outer loop: iterate through episodes
Inner loop: iterate through steps
- Select an action
- Observe new state and reward
- Calculate the loss and update the network
- Update the state
(select_action and calculate_loss are placeholders; how the loss is defined depends on the algorithm)
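
The loop leaves `select_action` undefined. One possible version, shown here purely as an assumption, treats the network outputs as logits and samples an action index from the resulting distribution. Whether to sample, take the highest-scoring action, or explore in some other way depends on the algorithm. The stand-in network and its sizes are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


def select_action(network, state):
    """Sample an action index from a softmax over the network outputs."""
    # Convert the state to a float tensor and add a batch dimension
    state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    # No gradients are needed just to pick an action in this sketch
    with torch.no_grad():
        logits = network(state_tensor)
    return Categorical(logits=logits).sample().item()


# Stand-in network: 8 state features in, 4 actions out (illustrative sizes)
network = nn.Linear(8, 4)
action = select_action(network, [0.0] * 8)
print(action)
```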

Coming next

DRL is powerful!
Value-based and policy-based approaches
DQN and refinements
Policy gradient methods

A DataCamp learner diving deep into the sea to discover the secrets of Deep Reinforcement Learning

Let's practice!
