Policies and state-value functions

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Policies

  • RL objective → formulate effective policies
  • Specify which action to take in each state to maximize return

Image showing a large roadmap.

Grid world example

  • Agent aims to reach diamond while avoiding mountains
  • Nine states
  • Deterministic movements

Image showing the custom environment: a grid of 9 cells, 2 of which are mountains and 1 of which holds the diamond.

Grid world example - rewards

  • Rewards depend on the state the agent moves into:
    • Diamond: +10

Image showing that movements from neighboring cells to the diamond yield a reward of +10.

Grid world example - rewards

  • Rewards depend on the state the agent moves into:
    • Diamond: +10
    • Mountain: -2

Image showing that actions leading to mountains result in a reward of -2.

Grid world example - rewards

  • Rewards depend on the state the agent moves into:
    • Diamond: +10
    • Mountain: -2
    • Other states: -1

Image showing that any other movement results in a reward of -1.
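
The reward rule above can be sketched as a small function. This is only an illustration: the diamond is placed at state 8 (the terminal state used later in these slides), and the mountain positions 4 and 7 are an assumption inferred from the state values listed later, not something the slides state explicitly. The name `reward_for` is hypothetical.

```python
# Assumed layout: states 0-8, numbered row by row on the 3x3 grid.
DIAMOND = 8          # terminal state used later in the slides
MOUNTAINS = {4, 7}   # assumption, inferred from the listed state values

def reward_for(next_state: int) -> int:
    """Reward received when a move lands in `next_state`."""
    if next_state == DIAMOND:
        return 10
    if next_state in MOUNTAINS:
        return -2
    return -1

# Reward for entering each of the nine states
print([reward_for(s) for s in range(9)])
```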

Grid world example: policy

# 0: left, 1: down, 2: right, 3: up
policy = {
    0:1, 1:2, 2:1, 
    3:1, 4:3, 5:1,
    6:2, 7:3
}

state, info = env.reset()
terminated = False
while not terminated:
    action = policy[state]
    state, reward, terminated, _, _ = env.step(action)

Image showing the policy with arrows to represent movements between the states.
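
Because the custom environment is not included here, the loop above can be tried against a minimal stand-in. The sketch below assumes a 3x3 grid numbered row by row, a start in state 0, mountains at states 4 and 7, and that a move off the grid leaves the agent in place. The `step` function is a hypothetical replacement for `env.step`, not the course environment.

```python
# 0: left, 1: down, 2: right, 3: up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
GOAL, MOUNTAINS = 8, {4, 7}  # mountain positions are an assumption

def step(state, action):
    """Deterministic move on a 3x3 grid; moves off the grid keep the agent in place."""
    row, col = divmod(state, 3)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < 3 and 0 <= new_col < 3:
        state = new_row * 3 + new_col
    reward = 10 if state == GOAL else -2 if state in MOUNTAINS else -1
    return state, reward, state == GOAL

policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}

# Follow the policy from the assumed start state until the goal is reached
state, terminated = 0, False
path, total_reward = [state], 0
while not terminated:
    state, reward, terminated = step(state, policy[state])
    path.append(state)
    total_reward += reward

print(path)
print(total_reward)
```

Under these assumptions the agent visits every state once before reaching the diamond, and the rewards it collects add up to the value that the later slides report for state 0.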

State-value functions

  • Estimate the state's worth
  • Expected return starting from state, following policy

Image showing the formula of the state value function as the discounted return of starting at state s and following the policy.
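
Since the formula image is not reproduced here, the description can be written out in standard notation. The symbols are assumptions rather than a copy of the slide: $\pi$ is the policy, $\gamma$ the discount factor, and $r_{t+k+1}$ the reward received $k+1$ steps after time $t$.

```latex
V_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, s_t = s \right]
```

In a deterministic environment with a deterministic policy, such as the grid world, the expectation reduces to a single discounted sum along one trajectory.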

Grid world example: State-values

Image showing the policy with arrows to represent movements between the states.

  • Nine states → nine state-values
  • Discount factor: $\gamma = 1$

Value of goal state

Image showing the agent in the goal state.

  • Starting in goal state, agent doesn't move
  • $V(\text{goal state}) = 0$

Image showing that the value of the goal state is 0

Value of state 5

Image showing the agent in state 5.

  • Starting in 5, agent moves to goal
  • $V(5) = 10$

Image showing that the value of state 5 is 10

Value of state 2

Image showing the agent in state 2.

  • Starting in 2, rewards: $-1, 10$
  • $V(2) = (1 \times (-1)) + (1 \times 10) = 9$

Image showing that the value of state 2 is 9.
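
The arithmetic for state 2 generalizes to any reward sequence: each reward is weighted by the discount factor raised to its step index. A minimal sketch, where the helper name `discounted_return` is hypothetical and the second call uses a made-up discount of 0.9 purely for comparison:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its step index."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected from state 2 under the policy: -1, then +10
print(discounted_return([-1, 10], gamma=1))

# Hypothetical discount factor below 1: the later reward counts for less
print(discounted_return([-1, 10], gamma=0.9))
```

With $\gamma = 1$ the first call reproduces $V(2) = 9$.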

All state values

Image showing all state values for the 9 states of the environment.

Bellman equation

  • Recursive formula
  • Computes state-values

Image showing the Bellman equation as the sum of the immediate reward of the current state with the discounted value of the next state.
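
For a deterministic environment and policy, the equation described in the image can be written as below. The notation is an assumption, not a copy of the slide: $s'$ is the state reached by taking action $\pi(s)$ in state $s$, and $r$ is the reward received on that move.

```latex
V_\pi(s) = r + \gamma \, V_\pi(s')
```

For example, with $\gamma = 1$, state 2 moves to state 5 for a reward of $-1$, so $V(2) = -1 + 1 \times V(5) = -1 + 10 = 9$.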

Computing state-values

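def compute_state_value(state):
    # The terminal state has no future rewards
    if state == terminal_state:
        return 0
    action = policy[state]
    # Each entry of P[state][action] is (probability, next_state, reward, terminated)
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)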

Image showing the Bellman equation as the sum of the immediate reward of the current state with the discounted value of the next state.

Computing state-values

terminal_state = 8
num_states = 9  # nine states in the grid
gamma = 1

V = {state: compute_state_value(state) for state in range(num_states)}
print(V)
{0: 1, 1: 8, 2: 9, 
 3: 2, 4: 7, 5: 10, 
 6: 3, 7: 5, 8: 0}

Image showing all state values for the 9 states of the environment.

Changing policies

# 0: left, 1: down, 2: right, 3: up
policy_two = {
    0:2, 1:2, 2:1,
    3:2, 4:2, 5:1,
    6:2, 7:2
}

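# compute_state_value reads the global `policy`, so point it at policy_two first
policy = policy_two
V_2 = {state: compute_state_value(state) for state in range(num_states)}
print(V_2)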

Image showing the policy with arrows to represent movements between the states.

Comparing policies

State-values for policy 1

{0: 1, 1: 8, 2: 9, 
 3: 2, 4: 7, 5: 10, 
 6: 3, 7: 5, 8: 0}

Image showing state values of policy 1.

State-values for policy 2

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Image showing state values of policy 2.
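
Both value tables can be reproduced without the course environment. The sketch below builds a stand-in for `env.unwrapped.P` under stated assumptions: a 3x3 grid numbered row by row, mountains at states 4 and 7, and moves off the grid leaving the agent in place. Unlike the version in the slides, this variant of `compute_state_value` takes the policy and transition table as arguments, and `build_transitions` is a hypothetical helper.

```python
GOAL, MOUNTAINS = 8, {4, 7}  # mountain positions are an assumption
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up

def build_transitions():
    """Build P[state][action] = [(probability, next_state, reward, terminated)]."""
    P = {}
    for state in range(9):
        row, col = divmod(state, 3)
        P[state] = {}
        for action, (d_row, d_col) in MOVES.items():
            new_row, new_col = row + d_row, col + d_col
            inside = 0 <= new_row < 3 and 0 <= new_col < 3
            next_state = new_row * 3 + new_col if inside else state
            reward = 10 if next_state == GOAL else -2 if next_state in MOUNTAINS else -1
            P[state][action] = [(1.0, next_state, reward, next_state == GOAL)]
    return P

def compute_state_value(state, policy, P, gamma=1):
    """Bellman recursion for a deterministic policy in a deterministic environment."""
    if state == GOAL:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state, policy, P, gamma)

policy_one = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
policy_two = {0: 2, 1: 2, 2: 1, 3: 2, 4: 2, 5: 1, 6: 2, 7: 2}

P = build_transitions()
V_1 = {s: compute_state_value(s, policy_one, P) for s in range(9)}
V_2 = {s: compute_state_value(s, policy_two, P) for s in range(9)}
print(V_1)
print(V_2)

# Policy 2 is at least as good as policy 1 in every state
print(all(V_2[s] >= V_1[s] for s in range(9)))
```

Passing the policy explicitly avoids relying on a global variable when several policies are compared side by side.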

Let's practice!
