Multi-armed bandits

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Multi-armed bandits

 

  • Gambler facing slot machines
  • Challenge → maximize winnings
  • Solution → balance exploration and exploitation

Image showing a man facing a row of slot machines


Slot machines

Image showing 4 slot machines with different winning probabilities (45%, 35%, 85%, and 62%), which are unknown to the user.

  • Reward from an arm is 0 or 1
  • Agent's goal → Accumulate maximum reward
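Each arm behaves like a Bernoulli trial: a pull pays 1 with the machine's hidden probability, else 0. A minimal sketch of one such machine (the probability and seed below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator, chosen here for reproducibility
true_prob = 0.85                 # hypothetical winning probability of one machine

# Each pull pays 1 with probability true_prob, otherwise 0
pulls = (rng.random(10_000) < true_prob).astype(int)
print(pulls[:10])    # a few individual rewards, each 0 or 1
print(pulls.mean())  # empirical win rate, close to true_prob
```

Averaging many pulls recovers the machine's hidden probability, which is exactly what the agent's value estimates will do.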
Reinforcement Learning with Gymnasium in Python


Solving the problem

 

  • Decayed epsilon-greedy
  • Epsilon → select random machine
  • 1 - epsilon → select best machine so far
  • Epsilon decreases over time

Diagram showing that with a probability of epsilon, the agent explores by selecting a machine at random, and with a probability of 1 - epsilon it exploits by selecting the best-known machine.
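The "decays over time" part can be sketched in isolation: epsilon shrinks geometrically each step but is clamped at a floor, so some exploration always remains. The constants match the initialization shown later in this lesson:

```python
epsilon = 1.0          # start fully exploratory
min_epsilon = 0.01     # floor: never stop exploring entirely
epsilon_decay = 0.999  # multiplicative decay per step

history = []
for _ in range(10_000):
    history.append(epsilon)
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

print(history[0], history[-1])  # starts at 1.0, ends clamped at the floor
```

Since 0.999 ** 10000 is far below 0.01, the schedule spends its later iterations almost entirely exploiting.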


Initialization

import numpy as np

n_bandits = 4
true_bandit_probs = np.random.rand(n_bandits)

n_iterations = 100000   # Number of arm pulls
epsilon = 1.0           # Initial exploration rate
min_epsilon = 0.01      # Floor for epsilon
epsilon_decay = 0.999   # Multiplicative decay per iteration

counts = np.zeros(n_bandits)      # How many times each bandit was played
values = np.zeros(n_bandits)      # Estimated winning probability of each bandit
rewards = np.zeros(n_iterations)  # Reward history
selected_arms = np.zeros(n_iterations, dtype=int)  # Arm selection history

Interaction loop

for i in range(n_iterations):
    arm = epsilon_greedy()
    reward = int(np.random.rand() < true_bandit_probs[arm])  # Bernoulli reward
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # Incremental mean update
    epsilon = max(min_epsilon, epsilon * epsilon_decay)  # Decay exploration rate

Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))


Diagram showing the first step of the process: a sample array of size (iterations, n_bandits) filled with zeros.


Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))

for i in range(n_iterations):
    selections_percentage[i, selected_arms[i]] = 1

Diagram showing the second step of the process, which marks the selected arm in each iteration with a value of 1 inside the array.


Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))

for i in range(n_iterations):
    selections_percentage[i, selected_arms[i]] = 1

selections_percentage = np.cumsum(selections_percentage, axis=0) \
    / np.arange(1, n_iterations + 1).reshape(-1, 1)

Diagram showing the last steps of the process, where a cumulative sum of the chosen bandits is performed, and then we divide by the iteration number to obtain the percentage of selection of each arm in each iteration.
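The cumsum-then-divide step is easy to verify on a toy history (the sizes and arm choices below are made up for illustration):

```python
import numpy as np

n_iterations, n_bandits = 4, 2
selected_arms = np.array([0, 1, 1, 1])  # toy selection history

selections = np.zeros((n_iterations, n_bandits))
selections[np.arange(n_iterations), selected_arms] = 1  # one-hot row per step

# Running fraction of pulls given to each arm up to each iteration
pct = np.cumsum(selections, axis=0) / np.arange(1, n_iterations + 1).reshape(-1, 1)
print(pct)
```

Each row sums to 1, and the final row gives the overall share of pulls per arm (here 1/4 for arm 0 and 3/4 for arm 1).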


Analyzing selections

  Image showing the selection-percentage curve for each bandit: as iterations progress, the agent selects Bandit #2 more often than the others.

for arm in range(n_bandits):
    plt.plot(selections_percentage[:, arm], label=f'Bandit #{arm+1}')
plt.xscale('log')
plt.title('Bandit Action Choices Over Time')
plt.xlabel('Episode Number')
plt.ylabel('Percentage of Bandit Selections (%)')
plt.legend()
plt.show()

for i, prob in enumerate(true_bandit_probs, 1):
    print(f"Bandit #{i} -> {prob:.2f}")
Bandit #1 -> 0.37
Bandit #2 -> 0.95
Bandit #3 -> 0.73
Bandit #4 -> 0.60
  • Agent learns to select the bandit with the highest winning probability

Let's practice!

