Action-value functions

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Action-value functions (Q-values)

  • Expected return of:
    • Starting at a state $s$
    • Taking action $a$
    • Following the policy
  • Estimates desirability of actions within states

Image showing the formula for the Q-value of a state and action: $Q(s, a) = r + \gamma V(s')$, the sum of the immediate reward $r$ received after performing the action and the discounted value $V(s')$ of the new state $s'$ under a specific policy.
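The definition above can be sketched as a one-line helper. This is only an illustration for a deterministic transition; the names `q_value`, `reward`, and `next_state_value` are hypothetical and do not come from the course code.

```python
def q_value(reward, next_state_value, gamma=1.0):
    """Immediate reward plus the discounted value of the next state."""
    return reward + gamma * next_state_value


# Made-up numbers: reward of -1, next state worth 10, no discounting
print(q_value(-1, 10))  # 9.0
```

With `gamma` below 1, the value of the next state counts for less relative to the immediate reward.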

Grid world

Image showing the custom environment with 2 mountains and a diamond.

  • State-values

Image showing the state-values we computed earlier for all states.

Q-values - state 4

Image showing the agent in state 4.

  • Agent starts in state 4

Image showing the state-values we computed earlier for all states.

Q-values - state 4

Image showing the 4 actions an agent can do in state 4 along with their rewards.

  • Agent can move up, down, left, right

Image showing the state-values we computed earlier for all states.

State 4 - action down

Image showing the agent moving down from state 4 to be in state 7.

  • Reward: -2, state-value: 5

Image showing the state-values we computed earlier for all states.

State 4 - action down

Image showing the agent moving down from state 4 to be in state 7, and the corresponding q-value of 3.

  • $Q(4, \text{down}) = -2 + 1 \times 5 = 3$

Image showing the state-values we computed earlier for all states.

State 4 - action left

Image showing the agent moving left from state 4 to be in state 3, and the corresponding q-value of 1.

  • $Q(4, \text{left}) = -1 + 1 \times 2 = 1$

Image showing the state-values we computed earlier for all states.

State 4 - action up

Image showing the agent moving up from state 4 to be in state 1, and the corresponding q-value of 7.

  • $Q(4, \text{up}) = -1 + 1 \times 8 = 7$

Image showing the state-values we computed earlier for all states.

State 4 - action right

Image showing the agent moving right from state 4 to be in state 5, and the corresponding q-value of 9.

  • $Q(4, \text{right}) = -1 + 1 \times 10 = 9$

Image showing the state-values we computed earlier for all states.
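The four calculations for state 4 can be reproduced in a few lines. This sketch hard-codes the rewards and state-values quoted on the slides above rather than reading them from the environment; the dictionary name `moves_from_state_4` is illustrative.

```python
gamma = 1  # discount factor used in the worked example

# (reward, state-value of the state reached) for each action from state 4,
# copied from the slides above
moves_from_state_4 = {
    "left": (-1, 2),
    "down": (-2, 5),
    "right": (-1, 10),
    "up": (-1, 8),
}

# Q(4, action) = reward + gamma * value of the next state
q_state_4 = {action: reward + gamma * value
             for action, (reward, value) in moves_from_state_4.items()}
print(q_state_4)  # {'left': 1, 'down': 3, 'right': 9, 'up': 7}

# The most desirable action in state 4 is the one with the highest Q-value
best_action = max(q_state_4, key=q_state_4.get)
print(best_action)  # right
```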

All Q-values

Image showing the q-values we computed for all state-action pairs.

Computing Q-values

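def compute_q_value(state, action):
    # No Q-value is defined for the terminal state
    if state == terminal_state:
        return None
    # Deterministic environment: use the single transition for (state, action)
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)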

Image showing the formula for the Q-value of a state and action: $Q(s, a) = r + \gamma V(s')$, the sum of the immediate reward $r$ received after performing the action and the discounted value $V(s')$ of the new state $s'$ under a specific policy.
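To try the function without the custom grid world, here is a self-contained sketch on a made-up three-state chain. Everything in it is an assumption for illustration: the plain dictionary `P` stands in for `env.unwrapped.P` and follows the same `(probability, next_state, reward, terminated)` tuple layout, the `policy` is invented, and `compute_state_value` is a simple recursive version that may differ from the one defined earlier in the course.

```python
gamma = 1
terminal_state = 2

# Hypothetical deterministic transitions: P[state][action] is a list holding
# one (probability, next_state, reward, terminated) tuple.
# Action 0 moves left (or stays at the left edge), action 1 moves right.
P = {
    0: {0: [(1.0, 0, -1, False)], 1: [(1.0, 1, -1, False)]},
    1: {0: [(1.0, 0, -1, False)], 1: [(1.0, 2, 10, True)]},
    2: {0: [(1.0, 2, 0, True)], 1: [(1.0, 2, 0, True)]},
}

# Hypothetical policy being evaluated: always move right
policy = {0: 1, 1: 1}


def compute_state_value(state):
    """Return obtained by following the policy from `state` to the end."""
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state)


def compute_q_value(state, action):
    """Take `action` once in `state`, then follow the policy."""
    if state == terminal_state:
        return None
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * compute_state_value(next_state)


print(compute_q_value(0, 0))  # 8: reward -1, then V(0) = 9
print(compute_q_value(0, 1))  # 9: reward -1, then V(1) = 10
print(compute_q_value(1, 1))  # 10: reward 10, then the terminal state
```

The recursion in `compute_state_value` only terminates because this policy always reaches the terminal state; a policy that loops forever would need an iterative method instead.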

Computing Q-values

Q = {(state, action): compute_q_value(state, action)
     for state in range(num_states) 
     for action in range(num_actions)}

print(Q)

Computing Q-values

{(0, 0): 0, (0, 1): 1, (0, 2): 7, (0, 3): 0, 
 (1, 0): 0, (1, 1): 5, (1, 2): 8, (1, 3): 7, 
 (2, 0): 7, (2, 1): 9, (2, 2): 8, (2, 3): 8, 
 (3, 0): 1, (3, 1): 2, (3, 2): 5, (3, 3): 0, 
 (4, 0): 1, (4, 1): 3, (4, 2): 9, (4, 3): 7, 
 (5, 0): 5, (5, 1): 10, (5, 2): 9, (5, 3): 8, 
 (6, 0): 2, (6, 1): 2, (6, 2): 3, (6, 3): 1, 
 (7, 0): 2, (7, 1): 3, (7, 2): 10, (7, 3): 5, 
 (8, 0): None, (8, 1): None, (8, 2): None, (8, 3): None}

Image showing the q-values we computed for all state-action pairs.

Improving the policy

Image showing the q-values we computed for all state-action pairs.

Improving the policy

  • For each state, select the action with the highest Q-value

Image showing the q-values we computed for all state-action pairs, with the maximum q-value for each state circled.

Improving the policy

Image showing the old policy and its corresponding Q-values, labeled "Old policy".

Image showing the new policy when taking the action leading to the maximum q-value.

Improving the policy

improved_policy = {}

for state in range(num_states-1):
    max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
    improved_policy[state] = max_action
print(improved_policy)

{0: 2, 1: 2, 2: 1, 
 3: 2, 4: 2, 5: 1, 
 6: 2, 7: 2}
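The improvement step can be checked without the environment by pasting in the Q-values printed earlier. The sketch below does that and also translates action indices into names; the mapping `action_names` is inferred from the state 4 example (where left, down, right, and up have Q-values 1, 3, 9, and 7) and is an assumption rather than something stated in the course code.

```python
# Q-values copied from the printed output above (state 8 is terminal)
Q = {(0, 0): 0, (0, 1): 1, (0, 2): 7, (0, 3): 0,
     (1, 0): 0, (1, 1): 5, (1, 2): 8, (1, 3): 7,
     (2, 0): 7, (2, 1): 9, (2, 2): 8, (2, 3): 8,
     (3, 0): 1, (3, 1): 2, (3, 2): 5, (3, 3): 0,
     (4, 0): 1, (4, 1): 3, (4, 2): 9, (4, 3): 7,
     (5, 0): 5, (5, 1): 10, (5, 2): 9, (5, 3): 8,
     (6, 0): 2, (6, 1): 2, (6, 2): 3, (6, 3): 1,
     (7, 0): 2, (7, 1): 3, (7, 2): 10, (7, 3): 5,
     (8, 0): None, (8, 1): None, (8, 2): None, (8, 3): None}
num_states, num_actions = 9, 4

# Greedy improvement: pick the highest-valued action in every
# non-terminal state (the last state is skipped because its Q-values are None)
improved_policy = {
    state: max(range(num_actions), key=lambda action: Q[(state, action)])
    for state in range(num_states - 1)
}
print(improved_policy)
# {0: 2, 1: 2, 2: 1, 3: 2, 4: 2, 5: 1, 6: 2, 7: 2}

# Assumed index-to-name mapping, inferred from the state 4 example
action_names = {0: "left", 1: "down", 2: "right", 3: "up"}
readable_policy = {s: action_names[a] for s, a in improved_policy.items()}
print(readable_policy[4])  # right
```

When two actions tie for the highest Q-value, `max` returns the first one it encounters, so the lower action index wins.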

Let's practice!
