Policies and state-value functions

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Policies

RL objective → formulate effective policies
Specify which action to take in each state to maximize return

Image showing a large roadmap.

Grid world example

Agent aims to reach diamond while avoiding mountains
Nine states
Deterministic movements

Image showing the custom environment with 9 grid cells: 2 contain mountains and 1 contains the diamond.

Grid world example - rewards

Rewards depend on the state the agent moves into:
- Diamond: +10

Image showing that movements from neighboring cells to the diamond yield a reward of +10.

Grid world example - rewards

Rewards depend on the state the agent moves into:
- Diamond: +10
- Mountain: -2

Image showing that actions leading to mountains result in a reward of -2.

Grid world example - rewards

Rewards depend on the state the agent moves into:
- Diamond: +10
- Mountain: -2
- Other states: -1

Image showing that any other movement results in a reward of -1.
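
The reward scheme above can be sketched in plain Python. This is a hypothetical stand-in, not the course's custom environment: the names `reward_for` and `step` are made up for illustration, the mountain positions (states 4 and 7) are inferred from the state-values printed later in the deck, and the "bumping into a wall leaves the agent in place" rule is an assumption.

```python
# Hypothetical stand-in for the custom grid world (not the real environment).
# States are numbered row by row:
#   0 1 2
#   3 4 5
#   6 7 8
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
SIZE = 3
GOAL = 8              # diamond
MOUNTAINS = {4, 7}    # assumed positions, inferred from the printed state-values


def reward_for(state):
    """Reward for moving into `state`."""
    if state == GOAL:
        return 10
    if state in MOUNTAINS:
        return -2
    return -1


def step(state, action):
    """Deterministic move; a move into a wall keeps the agent in place (assumed)."""
    row, col = divmod(state, SIZE)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, SIZE - 1)
    elif action == RIGHT:
        col = min(col + 1, SIZE - 1)
    elif action == UP:
        row = max(row - 1, 0)
    next_state = row * SIZE + col
    return next_state, reward_for(next_state), next_state == GOAL
```

For example, `step(5, DOWN)` moves the agent onto the diamond and yields a reward of +10, while `step(1, DOWN)` moves it onto a mountain for -2.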

Grid world example: policy

# 0: left, 1: down, 2: right, 3: up
policy = {
    0:1, 1:2, 2:1, 
    3:1, 4:3, 5:1,
    6:2, 7:3
}


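state, info = env.reset()
terminated, truncated = False, False
# Stop when the episode ends, either at the goal or at a time limit
while not (terminated or truncated):
    # Look up the action the policy prescribes for the current state
    action = policy[state]
    state, reward, terminated, truncated, info = env.step(action)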

Image showing the policy with arrows to represent movements between the states.

State-value functions

Estimate the state's worth
Expected return starting from state, following policy

Image showing the formula of the state value function as the discounted return of starting at state s and following the policy.
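
The formula in the image is not reproduced in the text. One common way to write it, assuming the usual notation ($r_{t+1}$ for the reward received after the action taken at step $t$, $\pi$ for the policy, $\gamma$ for the discount factor), is:

$$V_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s\right]$$

When both the environment and the policy are deterministic, as in the grid world, the expectation reduces to a single discounted sum along one trajectory.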

Grid world example: State-values

Image showing the policy with arrows to represent movements between the states.

Nine states → nine state-values
Discount factor: $\gamma = 1$

Value of goal state

Image showing the agent in the goal state.

Starting in goal state, agent doesn't move
$V(\text{goal state}) = 0$

Image showing that the value of the goal state is 0

Value of state 5

Image showing the agent in state 5.

Starting in 5, agent moves to goal
$V(5) = 10$

Image showing that the value of state 5 is 10

Value of state 2

Image showing the agent in state 2.

Starting in 2, rewards: $-1, 10$
$V(2) = (1 \times (-1)) + (1 \times 10) = 9$

Image showing that the value of state 2 is 9.
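
The arithmetic for state 2 is a discounted sum of the rewards collected along the path. A minimal sketch, where `discounted_return` is a hypothetical helper name and the second call uses $\gamma = 0.9$ purely for contrast (the deck uses $\gamma = 1$):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over the rewards collected along a trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Path from state 2: move to state 5 (-1), then to the goal (+10)
print(discounted_return([-1, 10], gamma=1))    # 9, matching V(2)
print(discounted_return([-1, 10], gamma=0.9))  # later rewards count for less
```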

All state values

Image showing all state values for the 9 states of the environment.

Bellman equation

Recursive formula
Computes state-values

Image showing the Bellman equation as the sum of the immediate reward of the current state with the discounted value of the next state.
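
Written out for a deterministic environment and policy, and assuming $s'$ denotes the state reached from $s$ by taking action $\pi(s)$ and $r$ the reward for that move, the equation described in the image takes the form:

$$V_\pi(s) = r + \gamma \, V_\pi(s'), \qquad V_\pi(\text{goal state}) = 0$$

For example, with $\gamma = 1$: $V(2) = -1 + 1 \times V(5) = -1 + 10 = 9$.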

Computing state-values

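def compute_state_value(state):
    # Base case: no further reward is collected from the terminal state
    if state == terminal_state:
        return 0

    # Action prescribed by the policy in this state
    action = policy[state]

    # Deterministic environment: one entry per (state, action),
    # unpacked as (probability, next_state, reward, terminated)
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]

    # Bellman equation: immediate reward + discounted value of the next state
    return reward + gamma * compute_state_value(next_state)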

Image showing the Bellman equation as the sum of the immediate reward of the current state with the discounted value of the next state.

Computing state-values

terminal_state = 8
gamma = 1
num_states = 9  # the grid has nine cells (states 0 to 8)

V = {state: compute_state_value(state)
     for state in range(num_states)}


print(V)

{0: 1, 1: 8, 2: 9, 
 3: 2, 4: 7, 5: 10, 
 6: 3, 7: 5, 8: 0}

Image showing all state values for the 9 states of the environment.
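
To try the recursion without the course environment, the sketch below builds a hypothetical transition table in the same layout the slide code unpacks, `P[state][action] -> [(probability, next_state, reward, terminated)]`. The `build_P` helper, the mountain positions (states 4 and 7), and the wall behaviour are assumptions. Here the policy and table are passed as arguments rather than read from globals.

```python
# Hypothetical stand-in for env.unwrapped.P on the 3x3 grid (not the real environment).
SIZE, GOAL = 3, 8
MOUNTAINS = {4, 7}  # assumed positions, inferred from the printed state-values
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def build_P():
    P = {}
    for state in range(SIZE * SIZE):
        row, col = divmod(state, SIZE)
        P[state] = {}
        for action, (d_row, d_col) in MOVES.items():
            # Clamp to the grid: a move into a wall keeps the agent in place (assumed)
            new_row = min(max(row + d_row, 0), SIZE - 1)
            new_col = min(max(col + d_col, 0), SIZE - 1)
            next_state = new_row * SIZE + new_col
            if next_state == GOAL:
                reward = 10
            elif next_state in MOUNTAINS:
                reward = -2
            else:
                reward = -1
            P[state][action] = [(1.0, next_state, reward, next_state == GOAL)]
    return P


def compute_state_value(state, policy, P, gamma=1, terminal_state=GOAL):
    # Base case: the terminal state has value 0
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    # Bellman equation for a deterministic environment and policy
    return reward + gamma * compute_state_value(next_state, policy, P, gamma, terminal_state)


policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
P = build_P()
V = {s: compute_state_value(s, policy, P) for s in range(SIZE * SIZE)}
print(V)
```

Under these assumptions the result matches the values printed above. Note that the recursion only terminates if following the policy eventually reaches the terminal state; a policy that cycles would recurse without end.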

Changing policies

# 0: left, 1: down, 2: right, 3: up
policy_two = {
    0:2, 1:2, 2:1,
    3:2, 4:2, 5:1,
    6:2, 7:2
}

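# compute_state_value reads the global `policy`, so point it at the new policy first
policy = policy_two
V_2 = {state: compute_state_value(state)
       for state in range(num_states)}
print(V_2)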

Image showing the policy with arrows to represent movements between the states.

Comparing policies

State-values for policy 1

{0: 1, 1: 8, 2: 9, 
 3: 2, 4: 7, 5: 10, 
 6: 3, 7: 5, 8: 0}

Image showing state values of policy 1.

State-values for policy 2

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Image showing state values of policy 2.
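
One way to make the comparison concrete is to subtract the two value tables state by state. A small sketch using the values printed above (the variable names are illustrative):

```python
V_1 = {0: 1, 1: 8, 2: 9, 3: 2, 4: 7, 5: 10, 6: 3, 7: 5, 8: 0}
V_2 = {0: 7, 1: 8, 2: 9, 3: 7, 4: 9, 5: 10, 6: 8, 7: 10, 8: 0}

# Per-state gain of policy 2 over policy 1
improvement = {s: V_2[s] - V_1[s] for s in V_1}

# Policy 2 is at least as good everywhere if no gain is negative
policy_two_at_least_as_good = all(gain >= 0 for gain in improvement.values())

print(improvement)
print(policy_two_at_least_as_good)
```

Policy 2 matches or exceeds policy 1 in every state, with the largest gains in states 0, 3, 6 and 7.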

Let's practice!
