Deep Reinforcement Learning in Python
Timothée Carayol
Principal Machine Learning Engineer, Komment
for episode in range(num_episodes):
    # 1. Initialize episode
    while not done:
        # 2. Select action
        # 3. Play action and obtain next state and reward
        # 4. Add (discounted) reward to return
        # 5. Update state
    # 6. Calculate loss
    # 7. Update policy network by gradient descent
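The snippets that follow assume that env, policy_network, optimizer and gamma already exist. As a minimal setup sketch, assuming a Gymnasium CartPole environment and a small softmax policy network (the environment choice, architecture and hyperparameters here are illustrative, not prescribed by the course):

import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")  # illustrative environment choice

# Small policy network: maps a state to a probability distribution over actions
policy_network = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
    nn.Softmax(dim=-1),
)

optimizer = torch.optim.Adam(policy_network.parameters(), lr=0.001)  # illustrative learning rate
gamma = 0.99  # illustrative discount factor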
from torch.distributions import Categorical

def select_action(policy_network, state):
    # Convert the state to a tensor if needed (Gymnasium returns NumPy arrays)
    state = torch.as_tensor(state, dtype=torch.float32)
    action_probs = policy_network(state)
    # Build a categorical distribution over actions and sample from it
    action_dist = Categorical(action_probs)
    action = action_dist.sample()
    log_prob = action_dist.log_prob(action)
    return action.item(), log_prob.reshape(1)
action, log_prob = select_action(policy_network, state)
Sampled action index: 1
Log probability of sampled action: -1.38
Recall the policy gradient theorem, here in the form that matches the loss below:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[ R \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

In Python, episode_return holds the discounted episode return $R$, and episode_log_probs holds the log probabilities $\log \pi_\theta(a_t \mid s_t)$ of the actions taken during the episode. Gradient descent minimizes, so the loss is the negated objective:

loss = -episode_return * episode_log_probs.sum()
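As a toy sketch of the sign convention (the numbers here are made up for illustration):

episode_return = 1.5                                      # hypothetical discounted return
episode_log_probs = torch.tensor([-1.38, -0.92, -1.10])   # hypothetical per-step log probs
loss = -episode_return * episode_log_probs.sum()
print(loss)  # tensor(5.1000)

With a positive return, the loss shrinks as the sampled actions become more likely, so minimizing it pushes the policy toward the actions that produced that return.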
for episode in range(50):
    state, info = env.reset()
    done = False
    step = 0
    episode_log_probs = torch.tensor([])
    R = 0
    while not done:
        action, log_prob = select_action(policy_network, state)
        # Play the action and observe the outcome
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Discount by gamma ** step, so the first reward is undiscounted
        R += (gamma ** step) * reward
        episode_log_probs = torch.cat((episode_log_probs, log_prob))
        state = next_state
        step += 1
    # Loss: negated return-weighted sum of log probabilities
    loss = -R * episode_log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
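Note that this loss scales every log probability in the episode by the same whole-episode return, which is the simplest REINFORCE variant. After training, one quick sanity check is to act greedily and watch the undiscounted episode return; a minimal sketch (this evaluation loop is an assumption, not part of the course code):

state, info = env.reset()
done, total_reward = False, 0.0
while not done:
    with torch.no_grad():
        action_probs = policy_network(torch.as_tensor(state, dtype=torch.float32))
    action = action_probs.argmax().item()  # greedy action for evaluation
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Evaluation return: {total_reward}")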