Temporal difference learning

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

TD learning vs. Monte Carlo

 

TD learning
  • Model-free
  • Estimate Q-table based on interaction
  • Update Q-table at each step within an episode
  • Suitable for tasks with long/indefinite episodes

 

Monte Carlo
  • Model-free
  • Estimate Q-table based on interaction
  • Update Q-table only after a complete episode (see the contrast of update targets below)
  • Suitable for short episodic tasks
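
The practical difference is the update target: TD learning bootstraps from the current Q-estimate after a single step, while Monte Carlo waits for the full return. In standard notation (these symbols are not defined on the slides):

$$\text{TD target: } r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \qquad \text{MC target: } G_t = \sum_{k=0}^{T-t-1} \gamma^k\, r_{t+k+1}$$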

TD learning as weather forecasting

Image showing different weather conditions at different times in the same place.


SARSA

  • TD algorithm
  • On-policy method: learns from the actions the current policy actually takes

Image showing that SARSA stands for the current state, the action taken, the reward received, the observed next state, and the next action.
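
In symbols, one update consumes the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, which gives the algorithm its name.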


SARSA update rule

Image showing the mathematical formula of the SARSA update rule.
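
Written out, and matching the implementation later in this lesson:

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \right]$$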

  • $\alpha$: learning rate
  • $\gamma$: discount factor
  • Both between 0 and 1

Frozen Lake

Image showing the Frozen Lake environment.


Initialization

import gymnasium as gym
import numpy as np

# Create a deterministic (non-slippery) Frozen Lake environment
env = gym.make("FrozenLake-v1", is_slippery=False)

num_states = env.observation_space.n
num_actions = env.action_space.n

# Q-table initialized to zeros: one row per state, one column per action
Q = np.zeros((num_states, num_actions))

alpha = 0.1          # learning rate
gamma = 1            # discount factor
num_episodes = 1000

SARSA loop

for episode in range(num_episodes):
    state, info = env.reset()
    action = env.action_space.sample()
    terminated = truncated = False
    while not terminated and not truncated:
        # Take the action and observe the outcome
        next_state, reward, terminated, truncated, info = env.step(action)
        # Choose the next action before updating: SARSA needs (s, a, r, s', a')
        next_action = env.action_space.sample()
        update_q_table(state, action, reward, next_state, next_action)
        state, action = next_state, next_action
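
This loop selects every action uniformly at random. Because SARSA is on-policy, a common refinement (not shown on these slides) is to pick actions epsilon-greedily from the current Q-table instead; a minimal sketch, reusing the Q, env, and np defined earlier:

def epsilon_greedy(state, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the current Q-table
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

Replacing the two env.action_space.sample() calls in the loop with epsilon_greedy(state) and epsilon_greedy(next_state) keeps the update on-policy while steering exploration toward promising actions.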

SARSA updates

def update_q_table(state, action, reward, next_state, next_action):
    old_value = Q[state, action]
    next_value = Q[next_state, next_action]
    # Blend the old estimate with the one-step TD target
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_value)
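
This is algebraically identical to the more common form $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$.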

  Image showing the mathematical formula of the SARSA update rule.


Deriving the optimal policy

policy = get_policy()
print(policy)
{ 0: 1,  1: 2,  2: 1,  3: 0, 
  4: 1,  5: 0,  6: 1,  7: 0, 
  8: 2,  9: 1, 10: 1, 11: 0, 
 12: 0, 13: 2, 14: 2, 15: 0}

Image showing the optimal policy in the Frozen Lake environment, with actions represented as arrows; the agent's path avoids the holes.
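
get_policy() is not defined in this excerpt; a minimal sketch consistent with the printed output would pick the greedy (highest-valued) action in each state:

def get_policy():
    # Map each state to the action with the largest learned Q-value
    return {state: int(np.argmax(Q[state])) for state in range(num_states)}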


Let's practice!
