Navigating the RL framework

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

RL framework

Agent: learner and decision-maker
Environment: the challenge to be solved
State: snapshot of the environment at a given time
Action: agent's choice in response to a state
Reward: feedback on the agent's action

Image showing that the agent responds to the environment's state by executing an action, and receives a reward from the environment based on the executed action.

RL interaction loop

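# Conceptual sketch: create_environment, choose_action, and update_knowledge are placeholders
env = create_environment()
state = env.get_initial_state()

for i in range(n_iterations):
    # Agent picks an action based on the current state
    action = choose_action(state)

    # Environment returns the next state and a reward
    state, reward = env.execute(action)

    # Agent learns from the outcome
    update_knowledge(state, action, reward)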
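
The loop above is conceptual, so here is one way it could be made runnable. This is only a sketch: `CorridorEnv`, its corridor layout, the random policy, and the `history` list standing in for `update_knowledge` are all hypothetical names invented for illustration. They are not part of Gymnasium.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reward at the last one."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def get_initial_state(self):
        self.position = 0
        return self.position

    def execute(self, action):
        # action is -1 (left) or +1 (right); clamp to the corridor bounds
        self.position = max(0, min(self.length - 1, self.position + action))
        reward = 1 if self.position == self.length - 1 else 0
        return self.position, reward


def choose_action(state):
    # Placeholder policy: ignore the state and pick a direction at random
    return random.choice([-1, 1])


def run(n_iterations=20, seed=0):
    random.seed(seed)
    env = CorridorEnv()
    state = env.get_initial_state()
    history = []
    for _ in range(n_iterations):
        action = choose_action(state)
        next_state, reward = env.execute(action)
        # Stand-in for update_knowledge: just record the transition
        history.append((state, action, reward, next_state))
        state = next_state
    return history


history = run()
print(f"Total reward collected: {sum(r for _, _, r, _ in history)}")
```

Recording both the state before the action and the state after it keeps the full transition available, which a learning rule would typically need.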


Episodic vs. continuous tasks

Episodic tasks

Tasks segmented into episodes
Each episode has a beginning and an end
Example: agent playing chess

Image showing a cat playing chess.

Continuous tasks

Continuous interaction
No distinct episodes
Example: adjusting traffic lights

Image showing a frog riding a bike and waiting for traffic lights to turn green.
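
To make the contrast concrete, here is a rough sketch of both kinds of task. The random-walk task, the `run_episode` helper, and the step caps are hypothetical and chosen only for illustration.

```python
import random


def run_episode(rng, goal=3, max_steps=50):
    """One episode of a hypothetical toy task: walk from 0 until reaching `goal`."""
    position, steps = 0, 0
    while position != goal and steps < max_steps:
        position = max(0, position + rng.choice([-1, 1]))
        steps += 1
    return steps


rng = random.Random(0)

# Episodic: every episode restarts from the initial state and has an end.
episode_lengths = [run_episode(rng) for _ in range(3)]
print(f"Episode lengths: {episode_lengths}")

# Continuous: one unbroken stream of interaction with no reset.
# The loop is capped at 10 steps only so that the example finishes.
position = 0
for step in range(10):
    position += rng.choice([-1, 1])
print(f"Position after 10 uninterrupted steps: {position}")
```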

Return

Actions have long-term consequences
Agent aims to maximize total reward over time
Return: sum of all expected rewards

Image showing that the return is the sum of individual rewards r_1 through r_n.
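
Written as a formula, the return described above could look as follows. The symbol $G$ for the return is an assumption; the rewards $r_1$ through $r_n$ follow the image description.

$$G = r_1 + r_2 + \dots + r_n = \sum_{t=1}^{n} r_t$$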

Discounted return

Immediate rewards are more valuable than future ones
Discounted return: gives more weight to nearer rewards
Discount factor ($\gamma$): discounts future rewards

Image showing the formula of the discounted return as the sum of rewards, each multiplied by the discount factor, raised to the power of its respective time step.
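
One plausible way to write the discounted return is shown below. It assumes the symbol $G$ for the return and assumes the first reward is left undiscounted, which matches the numerical example in this deck, where the discounts start at 1.

$$G = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots + \gamma^{n-1} r_n = \sum_{t=1}^{n} \gamma^{t-1} r_t$$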

Discount factor

Between zero and one
Balances immediate vs. long-term rewards
- Lower value → immediate gains
- Higher value → long-term benefits

Image showing the influence of discount factor's extreme values where a value of zero favors only immediate gains, and a value of one favors future gains without discount.
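
A small sketch can illustrate how the discount factor shifts this balance. The `discounted_return` helper and the reward values are assumptions made for illustration, and the first reward is taken to be undiscounted.

```python
def discounted_return(rewards, discount_factor):
    # The reward at index i is weighted by discount_factor ** i
    return sum(r * discount_factor ** i for i, r in enumerate(rewards))


rewards = [1, 6, 3]
for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: {discounted_return(rewards, gamma)}")
# gamma=0.0: 1.0   (only the immediate reward counts)
# gamma=0.5: 4.75
# gamma=1.0: 10.0  (plain sum, no discounting)
```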

Numerical example

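import numpy as np

# Rewards the agent expects at consecutive time steps
expected_rewards = np.array([1, 6, 3])

discount_factor = 0.9

# One discount per step: discount_factor ** 0, ** 1, ** 2
discounts = np.array([discount_factor ** i for i in range(len(expected_rewards))])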

print(f"Discounts: {discounts}")

Discounts: [1.   0.9  0.81]

discounted_return = np.sum(expected_rewards * discounts)
print(f"The discounted return is {discounted_return}")

The discounted return is 8.83

Let's practice!
