Reinforcement Learning with Gymnasium in Python
Fouad Trad
Machine Learning Engineer
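The code on these slides assumes NumPy, a Gymnasium environment env, and its state and action counts are already defined. A minimal setup sketch, with FrozenLake-v1 as a hypothetical stand-in (the course's actual environment, which the printed policies below suggest has 6 states and 3 actions, is defined elsewhere):

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")       # stand-in; the course uses its own environment
num_states = env.observation_space.n   # size of the discrete state space
num_actions = env.action_space.n       # size of the discrete action space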
def generate_episode():
    episode = []
    state, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):   # also stop if the episode is truncated
        action = env.action_space.sample()              # random policy
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode
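Each sampled episode is a list of (state, action, reward) tuples, one per step; for example (values are illustrative, not from the course environment):

episode = generate_episode()
print(episode[:3])   # e.g. [(0, 2, 0.0), (0, 1, 0.0), (1, 1, 0.0)]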
First-visit MC averages the return following only the first occurrence of each (state, action) pair within an episode; every-visit MC (below) averages over every occurrence.

def first_visit_mc(num_episodes):
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))
    for i in range(num_episodes):
        episode = generate_episode()
        visited_states_actions = set()
        for j, (state, action, reward) in enumerate(episode):
            if (state, action) not in visited_states_actions:
                returns_sum[state, action] += sum([x[2] for x in episode[j:]])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q
def every_visit_mc(num_episodes):
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))
    for i in range(num_episodes):
        episode = generate_episode()
        for j, (state, action, reward) in enumerate(episode):
            returns_sum[state, action] += sum([x[2] for x in episode[j:]])
            returns_count[state, action] += 1
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q
def get_policy():
    # Greedy policy: pick the highest-valued action in each state
    policy = {state: np.argmax(Q[state]) for state in range(num_states)}
    return policy
Q = first_visit_mc(1000)
policy_first_visit = get_policy()
print("First-visit policy: \n", policy_first_visit)
Q = every_visit_mc(1000)
policy_every_visit = get_policy()
print("Every-visit policy: \n", policy_every_visit)
First-visit policy:
{0: 2, 1: 2, 2: 1,
3: 2, 4: 2, 5: 0}
Every-visit policy:
{0: 2, 1: 2, 2: 1,
3: 2, 4: 2, 5: 0}