Balancing exploration and exploitation

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Training with random actions

Agent explores environment
No strategy optimization based on learned knowledge during training
Agent uses learned knowledge only once training is done

Image showing an agent within an environment

Exploration-exploitation trade-off

Balances exploration and exploitation
Continuous exploration prevents strategy refinement
Exclusive exploitation misses undiscovered opportunities

Image showing an agent exploring new actions to discover more rewards, and exploiting its knowledge while possibly missing out on some rewards.

Dining choices

Image showing a restaurant table.

Epsilon-greedy strategy

Explore with probability epsilon
Exploit with probability 1-epsilon
Ensures continuous exploration while using knowledge

Diagram showing that with a probability epsilon, the agent explores by choosing a random action, and with a probability of 1 - epsilon, it exploits by selecting the best known action.

Decayed epsilon-greedy strategy

Reduces epsilon over time
More exploration initially
More exploitation later on
Agent increasingly relies on its accumulated knowledge

Image showing how epsilon decreases over time.
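The decay described above can be sketched as a multiplicative schedule with a lower bound. This is a minimal illustration, not necessarily the only way to decay epsilon; the function name `epsilon_schedule` and its parameters are hypothetical, and the numeric values are illustrative.

```python
def epsilon_schedule(start, decay, minimum, episodes):
    """Return the epsilon used in each episode under multiplicative decay."""
    values = []
    epsilon = start
    for _ in range(episodes):
        values.append(epsilon)
        # Shrink epsilon after each episode, but never below the floor
        epsilon = max(minimum, epsilon * decay)
    return values


# Illustrative values: start fully random, decay by 0.1% per episode
schedule = epsilon_schedule(start=1.0, decay=0.999, minimum=0.01, episodes=10000)
print(schedule[0], schedule[1000], schedule[-1])
```

With these values the agent explores on every step at first, and by the final episodes epsilon sits at the floor, so almost every action is greedy.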

Implementation with Frozen Lake

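import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=True)

action_size = env.action_space.n
state_size = env.observation_space.n
Q = np.zeros((state_size, action_size))  # One row per state, one column per action

alpha = 0.1            # Learning rate
gamma = 0.99           # Discount factor
total_episodes = 10000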

Image showing a snapshot of the Frozen Lake environment.

Implementing epsilon_greedy()

def epsilon_greedy(state):
    # Reads the current values of the global epsilon and Q
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # Explore
    else:
        action = np.argmax(Q[state, :])  # Exploit
    return action
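The function above depends on the global `Q`, `epsilon`, and `env`. As a hedged alternative that is easier to test in isolation, the same rule can take everything as arguments. The name `epsilon_greedy_action`, the `rng` parameter, and the demo Q-table are assumptions made for this sketch.

```python
import numpy as np


def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the best known one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # Explore
    return int(np.argmax(Q[state]))          # Exploit


rng = np.random.default_rng(0)
Q_demo = np.array([[0.0, 0.2, 0.9, 0.1]])  # Hypothetical table: one state, four actions

# epsilon = 0 always exploits; epsilon = 1 always explores
greedy_picks = [epsilon_greedy_action(Q_demo, 0, 0.0, rng) for _ in range(100)]
random_picks = [epsilon_greedy_action(Q_demo, 0, 1.0, rng) for _ in range(1000)]
```

Here every entry of `greedy_picks` is action 2, the highest-valued column, while `random_picks` spreads across all four actions.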

Training epsilon-greedy

epsilon = 0.9   # Exploration rate

rewards_eps_greedy = []

for episode in range(total_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = epsilon_greedy(state)
        new_state, reward, terminated, truncated, info = env.step(action)
        Q[state, action] = update_q_table(state, action, new_state)
        state = new_state
        episode_reward += reward
        # End the episode on a terminal state or when the step limit is hit
        done = terminated or truncated
    rewards_eps_greedy.append(episode_reward)
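The training loop calls `update_q_table`, which is not defined on these slides. Assuming it applies the standard Q-learning rule with the `alpha` and `gamma` set earlier, the update might look like the sketch below. The name `q_learning_update` and its signature are hypothetical; in particular it takes the reward explicitly, which the call on the slide does not.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, new_state, alpha, gamma):
    """Return the updated Q-value for (state, action) under the Q-learning rule."""
    old_value = Q[state, action]
    best_next = np.max(Q[new_state])  # Value of the best action in the next state
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target
    return (1 - alpha) * old_value + alpha * target


# Hypothetical two-state, two-action table to show a single update
Q_demo = np.zeros((2, 2))
Q_demo[0, 1] = q_learning_update(Q_demo, 0, 1, reward=1.0, new_state=1,
                                 alpha=0.1, gamma=0.99)
```

Starting from an all-zero table, a reward of 1.0 with `alpha = 0.1` moves the entry one tenth of the way toward the target.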

Training decayed epsilon-greedy

epsilon = 1.0   # Exploration rate
epsilon_decay = 0.999
min_epsilon = 0.01
Q = np.zeros((state_size, action_size))  # Fresh Q-table for a fair comparison

rewards_decay_eps_greedy = []
for episode in range(total_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = epsilon_greedy(state)
        new_state, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        Q[state, action] = update_q_table(state, action, new_state)
        state = new_state
        # End the episode on a terminal state or when the step limit is hit
        done = terminated or truncated
    rewards_decay_eps_greedy.append(episode_reward)
    # Decay epsilon once per episode, never going below min_epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

Comparing strategies

import matplotlib.pyplot as plt

avg_eps_greedy = np.mean(rewards_eps_greedy)
avg_decay = np.mean(rewards_decay_eps_greedy)
plt.bar(['Epsilon Greedy', 'Decayed Epsilon Greedy'],
        [avg_eps_greedy, avg_decay],
        color=['blue', 'green'])
plt.title('Average Reward per Episode')
plt.ylabel('Average Reward')
plt.show()

Image of a bar plot showing that the average reward achieved with epsilon-greedy is around 0.02 while the one achieved with decayed epsilon-greedy is around 0.55.
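A single average over all episodes mixes early, mostly random episodes with later, mostly greedy ones. One optional way to see how each strategy improves over time is a moving average of the per-episode rewards. This is a sketch only; the helper name `moving_average` and the window size are assumptions, and the short reward list is made up for illustration.

```python
import numpy as np


def moving_average(rewards, window):
    """Average each run of `window` consecutive episode rewards."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits
    return np.convolve(rewards, kernel, mode="valid")


# Made-up rewards: two failed episodes followed by three successes
smoothed = moving_average([0, 0, 1, 1, 1], window=2)
print(smoothed)
```

Applied to `rewards_eps_greedy` and `rewards_decay_eps_greedy` with a larger window, the two curves could be drawn with `plt.plot` for a side-by-side view.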

Let's practice!
