Markov Decision Processes

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

MDP

Models RL environments mathematically

Image showing a complex environment (smart city) and the components we should extract from it: states, actions, rewards, and transition probabilities.

Diagram showing how from a complex environment we extract MDP components (states, actions, rewards, and transition probabilities) to solve the environment with model-based RL techniques.

Markov property

Future state depends only on current state and action

Image showing a chess board with some arrows for possible moves.
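Stated as a formula, the Markov property says that conditioning on the full history adds nothing beyond the current state and action. The notation here is an assumption, not taken from the slides: s_t and a_t denote the state and action at time step t.

```latex
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)
```

In the chess picture, this corresponds to the idea that the current board position is enough to decide the next move, regardless of how the position was reached.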

Frozen Lake as MDP

Agent must reach goal without falling into holes

Image showing the frozen lake environment.

Frozen Lake as MDP - states

Positions agent can occupy

Image showing three different positions of the agent within the Frozen Lake environment.
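A minimal sketch of how a state index relates to a grid position, assuming the default 4x4 grid with 16 states numbered row by row from the top-left corner. The helper names `state_to_position` and `position_to_state` are hypothetical and introduced only for illustration.

```python
# Assumption: default 4x4 Frozen Lake grid, states numbered row by row,
# starting at 0 in the top-left corner.
N_COLS = 4


def state_to_position(state: int) -> tuple[int, int]:
    """Return the (row, column) of a state index."""
    return divmod(state, N_COLS)


def position_to_state(row: int, col: int) -> int:
    """Return the state index of a (row, column) position."""
    return row * N_COLS + col


print(state_to_position(6))     # (1, 2)
print(position_to_state(3, 3))  # 15
```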

Frozen Lake as MDP - terminal states

Lead to episode termination

Image showing the terminal states in the frozen lake environment.
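The terminal states can be located from the grid layout. This sketch assumes Gymnasium's default 4x4 map (S = start, F = frozen, H = hole, G = goal); the function name `find_terminal_states` is hypothetical.

```python
# Assumption: Gymnasium's default 4x4 map.
# S = start, F = frozen surface, H = hole, G = goal.
DEFAULT_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def find_terminal_states(grid: list[str]) -> list[int]:
    """Return indices of holes and the goal, numbering cells row by row."""
    n_cols = len(grid[0])
    return [
        row * n_cols + col
        for row, line in enumerate(grid)
        for col, cell in enumerate(line)
        if cell in "HG"
    ]


terminal_states = find_terminal_states(DEFAULT_MAP)
print(terminal_states)  # [5, 7, 11, 12, 15]
```

Holes and the goal both end the episode; only the goal (state 15 here) counts as success.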

Frozen Lake as MDP - actions

Up, down, left, right

Image showing the actions to perform in Frozen Lake along with their associated labels: 0-left, 1-down, 2-right, and 3-up.
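As a rough illustration, the action labels can be paired with grid offsets to compute where a move is meant to lead. The offsets, the 4x4 grid size, and the helper `intended_next_state` are assumptions made for this sketch; it ignores slipperiness entirely.

```python
# Action labels as listed on the slide.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}
# (row, column) offsets per action; an assumption about the grid geometry.
ACTION_DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
GRID_SIZE = 4  # assumed default 4x4 grid


def intended_next_state(state: int, action: int) -> int:
    """State reached if the move succeeds; moving into a wall keeps the agent in place."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = ACTION_DELTAS[action]
    new_row = min(max(row + d_row, 0), GRID_SIZE - 1)
    new_col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return new_row * GRID_SIZE + new_col


for action, name in ACTION_NAMES.items():
    print(name, "->", intended_next_state(0, action))
```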

Frozen Lake as MDP - transitions

Actions don't necessarily lead to expected outcomes

Image showing agent at the top left corner of the frozen lake grid aiming to move right.

Image showing that the agent can move to the right.

Image showing that the agent can also move down.

Image showing that the agent might also stay in the same place.

Image showing that when the agent decides to move right, there are probabilities for the agent to go right, down, or stay in the same place.

Transition probabilities: likelihood of reaching a state given a state and action
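To make the idea concrete, the sketch below samples outcomes for the top-left state when the agent tries to move right. The equal 1/3 probabilities are an assumption for the slippery 4x4 grid, and the dictionary `transition_probs` is written by hand for illustration rather than read from the environment.

```python
import random
from collections import Counter

# Assumed distribution for state 0 (top-left) with action 2 (right):
# next_state -> probability. Slipping up hits the wall, so the agent
# stays in state 0.
transition_probs = {1: 1 / 3, 4: 1 / 3, 0: 1 / 3}

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000
samples = rng.choices(
    population=list(transition_probs),
    weights=list(transition_probs.values()),
    k=n_samples,
)

# Empirical frequencies should be close to the assumed probabilities.
frequencies = {s: c / n_samples for s, c in Counter(samples).items()}
for next_state in sorted(frequencies):
    print(f"P(next={next_state} | state=0, action=right) ~ {frequencies[next_state]:.3f}")
```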

Frozen Lake as MDP - rewards

Reward (+1) only given on reaching the goal state; all other transitions give 0

Image showing the agent in the goal state.

Gymnasium states and actions

import gymnasium as gym

# Create the slippery Frozen Lake environment (explicit versioned ID)
env = gym.make('FrozenLake-v1', is_slippery=True)

print(env.action_space)
print(env.observation_space)
print("Number of actions:", env.action_space.n)
print("Number of states:", env.observation_space.n)

Discrete(4)

Discrete(16)

Number of actions: 4

Number of states: 16

Gymnasium rewards and transitions

env.unwrapped.P: nested dictionary indexed first by state, then by action; each entry is a list of possible transitions

print(env.unwrapped.P[state][action])

[
  (probability_1, next_state_1, reward_1, is_terminal_1), 
  (probability_2, next_state_2, reward_2, is_terminal_2), 
  etc.
]

Gymnasium rewards and transitions - example

state = 6
action = 0

print(env.unwrapped.P[state][action])

[(0.3333333333333333, 2, 0.0, False), 
(0.3333333333333333, 5, 0.0, True), 
(0.3333333333333333, 10, 0.0, False)]

Image showing action numbers: 0-left, 1-down, 2-right, 3-up.

Image showing the agent in state number 6 with states being numbered from the top left to the lower right, line by line.
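Reading the output: moving left from state 6 is meant to reach state 5 (a hole, hence terminal), but the agent may also slip up to state 2 or down to state 10. The sketch below takes the printed list as a literal and derives two summary quantities from it; the names `expected_reward` and `termination_prob` are illustrative.

```python
# Transitions printed above for state 6, action 0 (left).
transitions = [
    (0.3333333333333333, 2, 0.0, False),
    (0.3333333333333333, 5, 0.0, True),
    (0.3333333333333333, 10, 0.0, False),
]

# Probability-weighted immediate reward for this state-action pair.
expected_reward = sum(prob * reward for prob, _, reward, _ in transitions)
# Total probability of landing in a terminal state.
termination_prob = sum(prob for prob, _, _, done in transitions if done)

for prob, next_state, reward, done in transitions:
    print(f"p={prob:.2f} -> state {next_state}, reward {reward}, terminal={done}")
print("Expected immediate reward:", expected_reward)
print(f"Probability of ending the episode: {termination_prob:.2f}")
```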

Let's practice!
