Policy iteration and value iteration

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Policy iteration

  • Iterative process to find optimal policy

Image showing the first step, initializing the policy.

Policy iteration

  • Iterative process to find optimal policy

Image showing two steps: initializing and evaluating a policy.

Policy iteration

  • Iterative process to find optimal policy

Image showing the three steps: initializing, evaluating, and improving the policy.

Policy iteration

  • Iterative process to find optimal policy

Image showing that the process of evaluating and improving the policy is iterative and repeats until the policy stops changing.

Policy iteration

  • Iterative process to find optimal policy

Image showing the flow of policy iteration: initialize a policy, then alternate between evaluating and improving it until it converges to an optimal policy.

Grid world

policy = {
    0:1, 1:2, 2:1, 
    3:1, 4:3, 5:1,
    6:2, 7:3
}

Image showing the policy with arrows to represent the move in every state.
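
To make the dictionary easier to read, the policy can be printed as a grid of arrows. This is only a sketch: the 3x3 layout, the goal marker `G` for the state without an action, and the action encoding (0 = left, 1 = down, 2 = right, 3 = up, as in Gymnasium's FrozenLake) are assumptions, and `render_policy` is a hypothetical helper that is not part of the slides.

```python
# Assumed action encoding (FrozenLake convention): 0=left, 1=down, 2=right, 3=up
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}

policy = {
    0: 1, 1: 2, 2: 1,
    3: 1, 4: 3, 5: 1,
    6: 2, 7: 3
}

def render_policy(policy, n_rows=3, n_cols=3, goal="G"):
    """Return one string per grid row; states missing from the policy show `goal`."""
    rows = []
    for r in range(n_rows):
        cells = []
        for c in range(n_cols):
            state = r * n_cols + c
            cells.append(ARROWS[policy[state]] if state in policy else goal)
        rows.append(" ".join(cells))
    return rows

for row in render_policy(policy):
    print(row)
```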

Policy evaluation

def policy_evaluation(policy):
    # Compute the value of every state when following the given policy
    V = {state: compute_state_value(state, policy) for state in range(num_states)}
    return V
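
The slide calls `compute_state_value` without showing it. The sketch below is one way it could look, assuming deterministic transitions and a policy that always reaches the terminal state (otherwise the recursion would never end). Everything about the grid is an assumption standing in for the course environment: the 3x3 layout, the rewards (+10 for reaching state 8, -2 for entering the assumed mountain states 4 and 7, -1 otherwise), `gamma = 1`, and the hypothetical helper `build_transitions`. The table `P` only imitates the `(probability, next_state, reward, terminated)` tuples found in `env.P`.

```python
num_states, num_actions = 9, 4
terminal_state = 8
mountains = {4, 7}   # assumed penalty cells
gamma = 1            # assumed discount factor

def build_transitions():
    """Hypothetical stand-in for env.P on a deterministic 3x3 grid."""
    moves = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up
    P = {}
    for state in range(num_states):
        row, col = divmod(state, 3)
        P[state] = {}
        for action, (d_row, d_col) in moves.items():
            # Moving into a wall leaves the agent where it is
            new_row = min(max(row + d_row, 0), 2)
            new_col = min(max(col + d_col, 0), 2)
            next_state = new_row * 3 + new_col
            if next_state == terminal_state:
                reward = 10
            elif next_state in mountains:
                reward = -2
            else:
                reward = -1
            P[state][action] = [(1.0, next_state, reward, next_state == terminal_state)]
    return P

P = build_transitions()

def compute_state_value(state, policy):
    """Sum the discounted rewards collected by following `policy` from `state`."""
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state, policy)

def policy_evaluation(policy):
    return {state: compute_state_value(state, policy) for state in range(num_states)}

policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
V = policy_evaluation(policy)
print(V)  # {0: 1, 1: 8, 2: 9, 3: 2, 4: 7, 5: 10, 6: 3, 7: 5, 8: 0}
```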

Policy improvement

def policy_improvement(policy):
    # The terminal state (the last one) needs no action
    improved_policy = {s: 0 for s in range(num_states-1)}
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(num_states) for action in range(num_actions)}
    for state in range(num_states-1):
        # Pick the action with the highest Q-value in this state
        max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action
    return improved_policy

Policy iteration

def policy_iteration():
    policy = {0:1, 1:2, 2:1, 3:1, 4:3, 5:1, 6:2, 7:3}
    while True:
        V = policy_evaluation(policy)
        improved_policy = policy_improvement(policy)
        # Stop once improvement no longer changes the policy
        if improved_policy == policy:
            break
        policy = improved_policy
    return policy, V
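
To see the three functions work together, here is a self-contained sketch that runs policy iteration end to end. It does not use the course environment: the deterministic 3x3 grid, its rewards (+10 for reaching state 8, -2 for entering the assumed mountain states 4 and 7, -1 otherwise), `gamma = 1`, and the hypothetical `build_transitions` helper are assumptions, and the policy-based `compute_state_value` and `compute_q_value` are guesses at helpers the slides do not show.

```python
num_states, num_actions = 9, 4
terminal_state = 8
mountains = {4, 7}   # assumed penalty cells
gamma = 1            # assumed discount factor

def build_transitions():
    """Hypothetical stand-in for env.P on a deterministic 3x3 grid."""
    moves = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up
    P = {}
    for state in range(num_states):
        row, col = divmod(state, 3)
        P[state] = {}
        for action, (d_row, d_col) in moves.items():
            new_row = min(max(row + d_row, 0), 2)   # walls keep the agent in place
            new_col = min(max(col + d_col, 0), 2)
            next_state = new_row * 3 + new_col
            reward = 10 if next_state == terminal_state else (-2 if next_state in mountains else -1)
            P[state][action] = [(1.0, next_state, reward, next_state == terminal_state)]
    return P

P = build_transitions()

def compute_state_value(state, policy):
    # Assumes the policy always reaches the terminal state
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state, policy)

def compute_q_value(state, action, policy):
    # Take `action` once, then follow `policy` afterwards
    if state == terminal_state:
        return None
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)

def policy_evaluation(policy):
    return {state: compute_state_value(state, policy) for state in range(num_states)}

def policy_improvement(policy):
    improved_policy = {s: 0 for s in range(num_states - 1)}
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(num_states) for action in range(num_actions)}
    for state in range(num_states - 1):
        improved_policy[state] = max(range(num_actions), key=lambda a: Q[(state, a)])
    return improved_policy

def policy_iteration():
    policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
    while True:
        V = policy_evaluation(policy)
        improved_policy = policy_improvement(policy)
        if improved_policy == policy:
            break
        policy = improved_policy
    return policy, V

policy, V = policy_iteration()
print(policy)
print(V)
```

Under these assumed rewards the loop stops after the policy repeats, and ties between equally good actions go to the lowest action index because `max` returns the first maximum.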

Optimal policy

policy, V = policy_iteration()
print(policy, V)
{0: 2, 1: 2, 2: 1, 
 3: 1, 4: 2, 5: 1, 
 6: 2, 7: 2} 

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Image showing the optimal policy (optimal.png).

Value iteration

  • Combines policy evaluation and improvement in one step
    • Computes optimal state-value function
    • Derives policy from it

Image showing the first step, initializing the state-values V with zeros.

Value iteration

  • Combines policy evaluation and improvement in one step.
    • Computes optimal state-value function
    • Derives policy from it

Image showing an additional step of computing Q-values using the V table.

Value iteration

  • Combines policy evaluation and improvement in one step.
    • Computes optimal state-value function
    • Derives policy from it

Image showing an additional step of updating V by selecting the best action in every state.

Value iteration

  • Combines policy evaluation and improvement in one step.
    • Computes optimal state-value function
    • Derives policy from it

Image showing that the process of computing Q-values from V and updating V repeats until V stops changing.

Value iteration

  • Combines policy evaluation and improvement in one step.
    • Computes optimal state-value function
    • Derives policy from it

Image showing that once the iterative process is done, we get the optimal policy and V.
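
The steps in these diagrams can be written as update rules. This is a sketch for the deterministic case only; the symbols are assumptions rather than notation from the slides: $s'(s, a)$ is the state reached by taking action $a$ in state $s$, $r(s, a)$ the reward received, $\gamma$ the discount factor, $k$ the iteration counter, and $\theta$ the convergence threshold.

```latex
\begin{aligned}
Q_k(s, a) &= r(s, a) + \gamma \, V_k\big(s'(s, a)\big) \\
V_{k+1}(s) &= \max_{a} Q_k(s, a) \\
\pi(s) &= \arg\max_{a} Q_k(s, a) \\
\text{stop when } & \max_{s} \big| V_{k+1}(s) - V_k(s) \big| < \theta
\end{aligned}
```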

Implementing value iteration

V = {state: 0 for state in range(num_states)}
policy = {state: 0 for state in range(num_states-1)}
threshold = 0.001

while True:
    new_V = {state: 0 for state in range(num_states)}
    for state in range(num_states-1):
        # Best action and its Q-value, given the current estimate V
        max_action, max_q_value = get_max_action_and_value(state, V)
        new_V[state] = max_q_value
        policy[state] = max_action
    # Stop once no state value changes by more than the threshold
    if all(abs(new_V[state] - V[state]) < threshold for state in V):
        break
    V = new_V

Getting optimal actions and values

def get_max_action_and_value(state, V):
    Q_values = [compute_q_value(state, action, V) for action in range(num_actions)]
    # Index of the highest Q-value is the best action
    max_action = max(range(num_actions), key=lambda a: Q_values[a])
    max_q_value = Q_values[max_action]
    return max_action, max_q_value

Computing Q-values

def compute_q_value(state, action, V):
    if state == terminal_state:
        return None
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * V[next_state]
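
Putting the value iteration pieces together, the sketch below runs the whole procedure without the course environment. The table `P` is a hypothetical stand-in for `env.P` built by an assumed helper, `build_transitions`. The 3x3 layout, the rewards (+10 for reaching state 8, -2 for entering the assumed mountain states 4 and 7, -1 otherwise), and `gamma = 1` are assumptions as well, and wrapping the loop in a `value_iteration` function is a choice made here for convenience.

```python
num_states, num_actions = 9, 4
terminal_state = 8
mountains = {4, 7}   # assumed penalty cells
gamma = 1            # assumed discount factor

def build_transitions():
    """Hypothetical stand-in for env.P on a deterministic 3x3 grid."""
    moves = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up
    P = {}
    for state in range(num_states):
        row, col = divmod(state, 3)
        P[state] = {}
        for action, (d_row, d_col) in moves.items():
            new_row = min(max(row + d_row, 0), 2)   # walls keep the agent in place
            new_col = min(max(col + d_col, 0), 2)
            next_state = new_row * 3 + new_col
            reward = 10 if next_state == terminal_state else (-2 if next_state in mountains else -1)
            P[state][action] = [(1.0, next_state, reward, next_state == terminal_state)]
    return P

P = build_transitions()

def compute_q_value(state, action, V):
    if state == terminal_state:
        return None
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * V[next_state]

def get_max_action_and_value(state, V):
    Q_values = [compute_q_value(state, action, V) for action in range(num_actions)]
    max_action = max(range(num_actions), key=lambda a: Q_values[a])
    return max_action, Q_values[max_action]

def value_iteration(threshold=0.001):
    V = {state: 0 for state in range(num_states)}
    policy = {state: 0 for state in range(num_states - 1)}
    while True:
        new_V = {state: 0 for state in range(num_states)}
        for state in range(num_states - 1):
            max_action, max_q_value = get_max_action_and_value(state, V)
            new_V[state] = max_q_value
            policy[state] = max_action
        # Converged: no state value moved by more than the threshold
        if all(abs(new_V[state] - V[state]) < threshold for state in V):
            break
        V = new_V
    return policy, V

policy, V = value_iteration()
print(policy)
print(V)
```

The terminal state is skipped in the loop, so `compute_q_value` never returns `None` to `max` here and the terminal value stays at 0.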

Optimal policy

print(policy, V)
{0: 2, 1: 2, 2: 1, 
 3: 1, 4: 2, 5: 1, 
 6: 2, 7: 2} 

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Image showing the state-values of the optimal policy.

Let's practice!
