Monte Carlo methods

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Recap: model-based learning

 

  • Rely on knowledge of environment dynamics
  • No interaction with environment

Image showing the diagrams of policy iteration and value iteration algorithms, seen in the previous video.


Model-free learning

 

  • Doesn't rely on knowledge of environment dynamics
  • Agent interacts with environment
  • Learns policy through trial and error
  • More suitable for real-world applications

Image of a robot interacting with a chess environment.


Monte Carlo methods

  • Model-free techniques
  • Estimate Q-values based on episodes
  • Two methods: first-visit, every-visit

Image showing the skeleton of a collected episode, comprising states, actions, rewards, and returns.

Image showing the second step of estimating the Q-values and what a Q-table looks like, having a number of rows equal to the number of states and a number of columns equal to the number of actions.

Image showing the final step of deriving the optimal policy, which essentially maps each state to the optimal action.

Custom grid world

Image showing the custom grid world including 6 states, with 2 rows and 3 columns, numbered from the upper left (0) to the bottom right (5). The agent is in state 3, a mountain is in state 4, and the goal is in state 5.
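The code later in this video assumes that an env object, num_states, and num_actions are already defined. As a rough stand-in (this is an assumption, not the course's actual environment, and it omits the mountain's effect), a similar 2x3 layout can be built with Gymnasium's FrozenLake and a custom map:

import gymnasium as gym

# Approximate 2x3 stand-in: start (S) in state 3, goal (G) in state 5.
# The mountain in state 4 is treated as ordinary ground ("F") here,
# since FrozenLake has no penalizing-but-passable tile.
env = gym.make("FrozenLake-v1", desc=["FFF", "SFG"], is_slippery=False)

num_states = env.observation_space.n   # 6 states
num_actions = env.action_space.n       # 4 actions (0=left, 1=down, 2=right, 3=up)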


Collecting two episodes

Image showing the first episode collected in terms of states, actions, rewards, and returns.

Image showing the second episode collected in terms of states, actions, rewards, and returns.
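Each step's return is the sum of rewards collected from that step to the end of the episode (the course uses undiscounted returns, as the later code confirms). A minimal helper to compute them, assuming episodes are lists of (state, action, reward) tuples as produced by generate_episode() later in this video:

def compute_returns(episode):
    # episode: list of (state, action, reward) tuples
    # Returns G_j for each step j: the undiscounted sum of rewards from j onward
    returns = []
    G = 0
    for (_, _, reward) in reversed(episode):
        G += reward
        returns.append(G)
    returns.reverse()
    return returns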


Estimating Q-values

Image showing the states, actions, rewards, and returns collected for the two episodes.

  • Q-table: a table holding the Q-value of every (state, action) pair

Image showing an empty Q-table that we have to fill.
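In code, the Q-table is just a 2-D array with one row per state and one column per action; a minimal sketch, using the 6-state, 4-action dimensions assumed above:

import numpy as np

num_states, num_actions = 6, 4
Q = np.zeros((num_states, num_actions))  # every Q-value starts at 0
print(Q.shape)  # (6, 4)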


Q(4, left), Q(4, up), and Q(1, down)

Image showing the states, actions, rewards, and returns collected for the two episodes with (4, left), (4, up), and (1, down) highlighted.

  • (s,a) appears once -> fill with return

Q-table with the values of (4, left), (4, up), and (1, down) filled.


Q(4, right)

Image showing the states, actions, rewards, and returns collected for the two episodes with (4, right) highlighted in both episodes

  • (s,a) occurs once per episode -> average

Q-table with the value of (4, right) filled by the average of returns from both episodes.


Q(3, right) - first-visit Monte Carlo

Image showing the states, actions, rewards, and returns collected for the two episodes with (3, right) highlighted only for the first occurrence in both episodes

  • Average the returns from the first visit to (s,a) within each episode

Q-table with the value of (3, right) filled by the average of returns from the highlighted rows (the first occurrences of (3, right)).


Q(3, right) - every-visit Monte Carlo

Image showing the states, actions, rewards, and returns collected for the two episodes with (3, right) highlighted for every occurrence in both episodes

  • Average the returns from every visit to (s,a) within each episode

Q-table with the value of (3, right) filled by the average of returns from the highlighted rows (every occurrence of (3, right)).
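To make the difference concrete, here is a toy calculation with made-up returns (illustrative numbers, not the values from the slides): suppose (3, right) is visited twice in one episode, with returns 4 and 2, and once in a second episode, with return 6.

# Made-up returns observed after (s, a) = (3, right); illustrative only
episode_returns = [
    [4, 2],  # episode 1: (3, right) visited twice
    [6],     # episode 2: (3, right) visited once
]

# First-visit: average only the first return per episode
first_visit = sum(ep[0] for ep in episode_returns) / len(episode_returns)

# Every-visit: average the returns from all visits
all_returns = [g for ep in episode_returns for g in ep]
every_visit = sum(all_returns) / len(all_returns)

print(first_visit)  # 5.0
print(every_visit)  # 4.0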


Generating an episode

def generate_episode():
    episode = []
    state, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        # Follow a random policy: sample an action uniformly at random
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode
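A quick sanity check, assuming the env defined earlier (output varies per run, since actions are sampled uniformly at random):

episode = generate_episode()
print(len(episode))  # number of steps until the episode ended
print(episode[0])    # first (state, action, reward) tuple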

First-visit Monte Carlo

def first_visit_mc(num_episodes):
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for i in range(num_episodes):
        episode = generate_episode()
        visited_states_actions = set()
        for j, (state, action, reward) in enumerate(episode):
            # Only the first visit to (state, action) in an episode counts
            if (state, action) not in visited_states_actions:
                # Return = sum of rewards from step j to the end of the episode
                returns_sum[state, action] += sum([x[2] for x in episode[j:]])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))

    # Average the returns for every (state, action) pair seen at least once
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q
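Keeping running sums and counts is the simplest bookkeeping; an equivalent alternative is the incremental mean update Q(s,a) ← Q(s,a) + (G - Q(s,a)) / N(s,a), which updates the estimate in place after each observed return G and avoids storing the sums separately.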

Every-visit Monte Carlo

def every_visit_mc(num_episodes):
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for i in range(num_episodes):
        episode = generate_episode()
        for j, (state, action, reward) in enumerate(episode):
            # Every visit to (state, action) counts, not just the first
            returns_sum[state, action] += sum([x[2] for x in episode[j:]])
            returns_count[state, action] += 1

    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q

Getting the optimal policy

def get_policy():
    # Greedy policy: pick the action with the highest Q-value in each state
    # (reads the global Q computed by first_visit_mc or every_visit_mc)
    policy = {state: np.argmax(Q[state]) for state in range(num_states)}
    return policy

Putting things together

Q = first_visit_mc(1000)
policy_first_visit = get_policy()
print("First-visit policy: \n", policy_first_visit)

Q = every_visit_mc(1000)
policy_every_visit = get_policy()
print("Every-visit policy: \n", policy_every_visit)

First-visit policy:
 {0: 2, 1: 2, 2: 1,
  3: 2, 4: 2, 5: 0}

Every-visit policy:
 {0: 2, 1: 2, 2: 1,
  3: 2, 4: 2, 5: 0}

Image showing the optimal policy with the optimal action to take in every state in the form of arrows.


Let's practice!
