Expected SARSA

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Expected SARSA

  • TD method
  • Model-free technique
  • Updates the Q-table differently from SARSA and Q-learning

Diagram showing the steps involved in expected SARSA including initializing a Q-table, choosing an action to perform, receiving a reward from the environment, and updating the table. The agent continues this loop until convergence is achieved after a certain number of episodes.


Expected SARSA update

SARSA

SARSA update rule: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]

Q-learning

Q-learning update rule: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

Expected SARSA

Expected SARSA update rule: Q(s, a) ← Q(s, a) + α [r + γ Σ_a' π(a'|s') Q(s', a') − Q(s, a)]
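The three methods differ only in which Q-value of the next state they bootstrap from. A minimal NumPy sketch (all numbers below are made up for illustration) makes the contrast concrete:

```python
import numpy as np

# Hypothetical values for one transition (illustrative only)
alpha, gamma, reward = 0.1, 0.99, 1.0
q_old = 0.3                              # current estimate Q(s, a)
Q_next = np.array([0.2, 0.5, 0.1, 0.4])  # Q(s', a') for the 4 next actions
a_next = 1                               # next action actually sampled (for SARSA)
probs = np.full(4, 0.25)                 # uniform policy probabilities pi(a'|s')

sarsa_target = reward + gamma * Q_next[a_next]                   # sampled next action
qlearning_target = reward + gamma * Q_next.max()                 # greedy next action
expected_sarsa_target = reward + gamma * (probs * Q_next).sum()  # policy-weighted average

# All three targets plug into the same TD update:
q_new = q_old + alpha * (expected_sarsa_target - q_old)
```

Expected SARSA averages over all next actions instead of committing to a single sampled or greedy one, which reduces the variance of the update.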


Expected value of next state

Expected SARSA update rule: Q(s, a) ← Q(s, a) + α [r + γ Σ_a' π(a'|s') Q(s', a') − Q(s, a)]

  • Takes into account all actions

Expected Q-value of the next state: E[Q(s', ·)] = Σ_a' π(a'|s') Q(s', a')

  • Random actions → equal probabilities

With equal probabilities π(a'|s') = 1/|A|, the expectation reduces to the mean: E[Q(s', ·)] = (1/|A|) Σ_a' Q(s', a')
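A short check (with hypothetical Q-values) confirms that the probability-weighted sum under a uniform policy is exactly the mean of the row:

```python
import numpy as np

Q_next = np.array([1.0, 2.0, 3.0, 4.0])      # hypothetical Q(s', a') values
n_actions = len(Q_next)
probs = np.full(n_actions, 1.0 / n_actions)  # equal probability per action

expected_q = (probs * Q_next).sum()          # sum over a' of pi(a'|s') * Q(s', a')
# With a uniform policy this equals Q_next.mean()
```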


Implementation with Frozen Lake

import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1',
               is_slippery=False)

num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

gamma = 0.99
alpha = 0.1
num_episodes = 1000

Image showing the Frozen Lake environment


Expected SARSA update rule

def update_q_table(state, action, next_state, reward):
    expected_q = np.mean(Q[next_state])
    Q[state, action] = (1 - alpha) * Q[state, action] + \
                       alpha * (reward + gamma * expected_q)

Expected SARSA update rule: Q(s, a) ← Q(s, a) + α [r + γ Σ_a' π(a'|s') Q(s', a') − Q(s, a)]
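As a quick sanity check, the update can be exercised on a hypothetical transition (the state and action indices below are illustrative, not taken from an actual rollout):

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q = np.zeros((16, 4))  # 16 states x 4 actions, as in Frozen Lake

def update_q_table(state, action, next_state, reward):
    expected_q = np.mean(Q[next_state])  # uniform policy: expectation = mean
    Q[state, action] = (1 - alpha) * Q[state, action] + \
                       alpha * (reward + gamma * expected_q)

# Hypothetical transition into the goal state with reward 1
update_q_table(state=14, action=2, next_state=15, reward=1.0)
# Q[15] is still all zeros, so the new value is alpha * reward = 0.1
```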


Training

for i in range(num_episodes):
    state, info = env.reset()
    terminated = False
    while not terminated:
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, info = env.step(action)
        update_q_table(state, action, next_state, reward)
        state = next_state

Agent's policy

policy = {state: np.argmax(Q[state]) 
          for state in range(num_states)}
print(policy)
{ 0: 1,  1: 2,  2: 1,  3: 0, 
  4: 1,  5: 0,  6: 1,  7: 0, 
  8: 2,  9: 2, 10: 1, 11: 0, 
 12: 0, 13: 2, 14: 2, 15: 0}

Image showing the policy learned by the agent, showing which action to perform in every state.
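The same dictionary comprehension works on any Q-table. A toy example (made-up values) shows how the greedy policy is extracted and how ties are resolved:

```python
import numpy as np

# Hypothetical Q-table for a 4-state, 2-action toy problem
Q = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.0, 0.0],
              [0.5, 0.5]])

policy = {state: int(np.argmax(Q[state])) for state in range(len(Q))}
# np.argmax breaks ties by returning the first index,
# so states 2 and 3 both map to action 0
```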


Let's practice!
