Policy iteration and value iteration

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Policy iteration

Iterative process for finding an optimal policy

Image showing the flow of policy iteration:
- Initialize a policy.
- Evaluate the policy.
- Improve the policy.
- Repeat evaluation and improvement until the policy no longer changes.
- The result is an optimal policy.

Gridworld

policy = {
    0:1, 1:2, 2:1, 
    3:1, 4:3, 5:1,
    6:2, 7:3
}

Image showing the policy; arrows indicate the action taken in each state.
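The arrow picture can be approximated in text. The sketch below assumes a 3x3 grid with states numbered row by row and the FrozenLake-style action encoding (0 = left, 1 = down, 2 = right, 3 = up); neither is stated on the slide, and `render_policy` is a hypothetical helper, not part of Gymnasium.

```python
# Assumed action encoding (FrozenLake convention): 0=left, 1=down, 2=right, 3=up
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}

def render_policy(policy, n_rows=3, n_cols=3, terminal_symbol="G"):
    """Return the policy as text, one line per grid row (hypothetical helper)."""
    rows = []
    for r in range(n_rows):
        cells = []
        for c in range(n_cols):
            state = r * n_cols + c  # assumed row-major state numbering
            # States without an entry in the policy are drawn as the goal
            cells.append(ARROWS[policy[state]] if state in policy else terminal_symbol)
        rows.append(" ".join(cells))
    return "\n".join(rows)

policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
grid = render_policy(policy)
print(grid)
```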

Policy evaluation

def policy_evaluation(policy):
    # Compute the value of every state when following the given policy
    V = {state: compute_state_value(state, policy) for state in range(num_states)}

    return V
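`compute_state_value` is called here but not defined in this section. A minimal sketch of one possible definition follows, assuming deterministic transitions stored in the `env.P` format `(probability, next_state, reward, terminated)` and a policy that always reaches the terminal state (otherwise the recursion never ends). The four-state MDP `P`, its rewards, and `gamma = 1.0` are hypothetical values chosen for illustration.

```python
# Hypothetical toy MDP in the env.P format:
# P[state][action] = [(probability, next_state, reward, terminated)]
P = {
    0: {0: [(1.0, 1, -1, False)], 1: [(1.0, 2, -1, False)]},
    1: {0: [(1.0, 3, 10, True)], 1: [(1.0, 3, 10, True)]},
    2: {0: [(1.0, 3, 5, True)], 1: [(1.0, 3, 5, True)]},
    3: {0: [(1.0, 3, 0, True)], 1: [(1.0, 3, 0, True)]},
}
num_states = 4
terminal_state = 3
gamma = 1.0  # assumed discount factor

def compute_state_value(state, policy):
    """Return collected by following `policy` from `state` until termination."""
    if state == terminal_state:
        return 0
    action = policy[state]
    # Deterministic transitions: only the first listed outcome is used
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)

def policy_evaluation(policy):
    return {state: compute_state_value(state, policy) for state in range(num_states)}

policy = {0: 1, 1: 0, 2: 0}
V = policy_evaluation(policy)
print(V)
```

Under this policy, state 0 moves to state 2, so its value is -1 + 5 = 4.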

Policy improvement

def policy_improvement(policy):
    # Start with a placeholder action for every non-terminal state
    improved_policy = {s: 0 for s in range(num_states-1)}

    # Q-value of every state-action pair under the current policy
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(num_states) for action in range(num_actions)}

    # In each non-terminal state, pick the action with the highest Q-value
    for state in range(num_states-1):
        max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action

    return improved_policy
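The policy-based `compute_q_value(state, action, policy)` is not shown in this section. One way it could look, as a sketch: take the given action once, then follow the policy. The code assumes deterministic transitions in the `env.P` format and a policy that always reaches the terminal state; the four-state MDP `P` and `gamma = 1.0` are hypothetical.

```python
# Hypothetical toy MDP: P[state][action] = [(probability, next_state, reward, terminated)]
P = {
    0: {0: [(1.0, 1, -1, False)], 1: [(1.0, 2, -1, False)]},
    1: {0: [(1.0, 3, 10, True)], 1: [(1.0, 3, 10, True)]},
    2: {0: [(1.0, 3, 5, True)], 1: [(1.0, 3, 5, True)]},
    3: {0: [(1.0, 3, 0, True)], 1: [(1.0, 3, 0, True)]},
}
num_states, num_actions = 4, 2
terminal_state = 3
gamma = 1.0  # assumed discount factor

def compute_state_value(state, policy):
    """Return collected by following `policy` from `state` until termination."""
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state, policy)

def compute_q_value(state, action, policy):
    """Value of taking `action` in `state`, then following `policy`."""
    if state == terminal_state:
        return None
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)

def policy_improvement(policy):
    improved_policy = {}
    for state in range(num_states - 1):  # the last state is terminal
        q_values = [compute_q_value(state, a, policy) for a in range(num_actions)]
        # Ties are broken in favour of the lowest action index
        improved_policy[state] = max(range(num_actions), key=lambda a: q_values[a])
    return improved_policy

policy = {0: 1, 1: 0, 2: 0}
improved = policy_improvement(policy)
print(improved)
```

In state 0 the original policy picks action 1 (Q = 4), while action 0 has Q = 9, so the improved policy switches to action 0 there.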

Policy iteration

def policy_iteration():
    # Initial policy from the Gridworld slide
    policy = {0:1, 1:2, 2:1, 3:1, 4:3, 5:1, 6:2, 7:3}

    while True:
        V = policy_evaluation(policy)
        improved_policy = policy_improvement(policy)

        # Stop once improvement no longer changes the policy
        if improved_policy == policy:
            break
        policy = improved_policy

    return policy, V

Optimal policy

policy, V = policy_iteration()
print(policy, V)

{0: 2, 1: 2, 2: 1, 
 3: 1, 4: 2, 5: 1, 
 6: 2, 7: 2} 

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Value iteration

Combines policy evaluation and improvement in a single step
- Computes the optimal state-value function
- Derives the policy from it

Image showing the flow of value iteration:
- Initialize V with zeros.
- Compute Q-values using the V table.
- Update V by choosing the best action in each state.
- Repeat computing Q-values and updating V until V no longer changes.
- After the iterations, we obtain the optimal policy and V.
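For the deterministic case, the loop can be written as the updates below. The symbols r(s, a) (immediate reward), s' (the state reached from s by action a), and θ (the stopping threshold) are notation chosen here for illustration, not taken from the slides.

```latex
\begin{aligned}
Q_k(s, a) &= r(s, a) + \gamma \, V_k(s') \\
V_{k+1}(s) &= \max_a Q_k(s, a) \\
\pi(s) &= \arg\max_a Q_k(s, a) \\
\text{stop when } & \max_s \lvert V_{k+1}(s) - V_k(s) \rvert < \theta
\end{aligned}
```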

Implementing value iteration

V = {state: 0 for state in range(num_states)}
policy = {state: 0 for state in range(num_states-1)}
threshold = 0.001

while True:
    new_V = {state: 0 for state in range(num_states)}

    # Skip the last state (the terminal state); its value stays 0
    for state in range(num_states-1):
        max_action, max_q_value = get_max_action_and_value(state, V)

        new_V[state] = max_q_value
        policy[state] = max_action

    # Stop when no state value changed by the threshold or more
    if all(abs(new_V[state] - V[state]) < threshold for state in V):
        break
    V = new_V

Getting optimal actions and values

def get_max_action_and_value(state, V):
    Q_values = [compute_q_value(state, action, V) for action in range(num_actions)]

    max_action = max(range(num_actions), key=lambda a: Q_values[a])

    max_q_value = Q_values[max_action]

    return max_action, max_q_value

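Computing Q-values

def compute_q_value(state, action, V):
    if state == terminal_state:
        return None
    # Use the first listed outcome; this assumes deterministic transitions.
    # env.unwrapped reaches the transition table even when env is wrapped.
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * V[next_state]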
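To try value iteration end to end without the course environment, here is a self-contained sketch. It swaps `env.unwrapped.P` for a hypothetical four-state transition table `P` in the same `(probability, next_state, reward, terminated)` format and assumes `gamma = 1.0`; the `value_iteration` wrapper function is also an addition for illustration.

```python
# Hypothetical toy MDP: P[state][action] = [(probability, next_state, reward, terminated)]
P = {
    0: {0: [(1.0, 1, -1, False)], 1: [(1.0, 2, -1, False)]},
    1: {0: [(1.0, 3, 10, True)], 1: [(1.0, 3, 10, True)]},
    2: {0: [(1.0, 3, 5, True)], 1: [(1.0, 3, 5, True)]},
    3: {0: [(1.0, 3, 0, True)], 1: [(1.0, 3, 0, True)]},
}
num_states, num_actions = 4, 2
terminal_state = 3
gamma = 1.0        # assumed discount factor
threshold = 0.001  # stop when every value changes by less than this

def compute_q_value(state, action, V):
    # Deterministic transitions: only the first listed outcome is used
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * V[next_state]

def value_iteration():
    V = {s: 0.0 for s in range(num_states)}
    policy = {s: 0 for s in range(num_states) if s != terminal_state}
    while True:
        new_V = dict(V)  # the terminal state keeps its value of 0
        for s in policy:
            q_values = [compute_q_value(s, a, V) for a in range(num_actions)]
            best = max(range(num_actions), key=lambda a: q_values[a])
            new_V[s], policy[s] = q_values[best], best
        # Largest change of any state value in this sweep
        delta = max(abs(new_V[s] - V[s]) for s in V)
        V = new_V
        if delta < threshold:
            return policy, V

policy, V = value_iteration()
print(policy, V)
```

Unlike the loop on the slide, this variant assigns `new_V` to `V` before checking convergence, so the returned values come from the final sweep.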

Optimal policy

print(policy, V)

{0: 2, 1: 2, 2: 1, 
 3: 1, 4: 2, 5: 1, 
 6: 2, 7: 2} 

{0: 7, 1: 8, 2: 9, 
 3: 7, 4: 9, 5: 10, 
 6: 8, 7: 10, 8: 0}

Image showing the state values under the optimal policy.

Let's practice!
