Reinforcement Learning with Gymnasium in Python
Fouad Trad
Machine Learning Engineer
# Deterministic policy: maps each non-terminal state to an action
policy = {
    0: 1, 1: 2, 2: 1,
    3: 1, 4: 3, 5: 1,
    6: 2, 7: 3
}
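The code on these slides assumes an environment and a few globals set up earlier in the course. A minimal sketch of that setup, using deterministic FrozenLake as a stand-in (the environment id, the terminal-state choice, and the gamma value here are assumptions, not necessarily the course's exact configuration):

import gymnasium as gym

# Stand-in setup (assumed): any small deterministic environment that
# exposes its model through a P transition table works here.
env = gym.make("FrozenLake-v1", is_slippery=False).unwrapped
num_states = env.observation_space.n
num_actions = env.action_space.n
terminal_state = num_states - 1  # assumed: the goal is the last state
gamma = 1                        # discount factor (assumed)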
def policy_evaluation(policy):
    # Evaluate the policy: compute the value of every state under it
    V = {state: compute_state_value(state, policy) for state in range(num_states)}
    return V
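policy_evaluation() relies on compute_state_value(), defined on an earlier slide of the course. A minimal recursive sketch, assuming a deterministic model exposed via env.P (the recursion terminates as long as the policy eventually reaches the terminal state):

def compute_state_value(state, policy):
    # Bellman expectation for a deterministic policy and model:
    # take the policy's action, then recurse from the next state.
    if state == terminal_state:
        return 0
    action = policy[state]
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)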
def policy_improvement(policy):
    improved_policy = {s: 0 for s in range(num_states - 1)}
    # Q-value of every (state, action) pair under the current policy
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(num_states) for action in range(num_actions)}
    # Act greedily: pick the highest-valued action in each non-terminal state
    for state in range(num_states - 1):
        max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action
    return improved_policy
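Here compute_q_value(state, action, policy) takes the policy as its third argument, while the value-iteration variant further down takes V instead. A sketch of the policy-based version, under the same deterministic-model assumption (take the action once, then follow the policy):

def compute_q_value(state, action, policy):
    # Q(s, a): immediate reward for taking action a, plus the
    # discounted value of following the policy afterwards.
    if state == terminal_state:
        return None
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)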
def policy_iteration():
    policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}  # initial policy
    while True:
        V = policy_evaluation(policy)
        improved_policy = policy_improvement(policy)
        # Stop once greedy improvement no longer changes the policy
        if improved_policy == policy:
            break
        policy = improved_policy
    return policy, V
policy, V = policy_iteration()
print(policy, V)
{0: 2, 1: 2, 2: 1, 3: 1, 4: 2, 5: 1, 6: 2, 7: 2}
{0: 7, 1: 8, 2: 9, 3: 7, 4: 9, 5: 10, 6: 8, 7: 10, 8: 0}
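Value iteration folds evaluation and improvement into a single sweep, repeatedly applying the Bellman optimality update V(s) ← max_a [ r(s, a) + γ V(s') ] until the values stop changing.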
V = {state: 0 for state in range(num_states)}
policy = {state: 0 for state in range(num_states - 1)}
threshold = 0.001

while True:
    new_V = {state: 0 for state in range(num_states)}
    for state in range(num_states - 1):
        # Greedy Bellman update: back up the best Q-value in each state
        max_action, max_q_value = get_max_action_and_value(state, V)
        new_V[state] = max_q_value
        policy[state] = max_action
    # Stop once the value function has converged within the threshold
    if all(abs(new_V[state] - V[state]) < threshold for state in V):
        break
    V = new_V
def get_max_action_and_value(state, V):
    # Q-value of each action from this state, given the current estimate V
    Q_values = [compute_q_value(state, action, V) for action in range(num_actions)]
    max_action = max(range(num_actions), key=lambda a: Q_values[a])
    max_q_value = Q_values[max_action]
    return max_action, max_q_value
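A quick usage example of the helper (state 0 and the all-zeros value table here are just for illustration, and assume the compute_q_value defined in the next snippet):

V_init = {state: 0 for state in range(num_states)}
best_action, best_q = get_max_action_and_value(0, V_init)
print(best_action, best_q)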
def compute_q_value(state, action, V):
    if state == terminal_state:
        return None  # terminal states are skipped by the loops above
    # Deterministic model: env.P[state][action] holds a single
    # (probability, next_state, reward, done) tuple
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * V[next_state]
print(policy, V)
{0: 2, 1: 2, 2: 1, 3: 1, 4: 2, 5: 1, 6: 2, 7: 2}
{0: 7, 1: 8, 2: 9, 3: 7, 4: 9, 5: 10, 6: 8, 7: 10, 8: 0}

Value iteration converges to the same greedy policy and state values as policy iteration, as the identical outputs show.