Multi-armed bandits

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Multi-armed bandits

 

  • Gambler facing slot machines
  • Challenge → maximize winnings
  • Solution → balance exploration and exploitation

Image showing a man facing a row of slot machines


Slot machines

Image showing 4 slot machines with different winning probabilities (45%, 35%, 85%, and 62%), which are unknown to the user.

  • Reward from an arm is 0 or 1
  • Agent's goal → Accumulate maximum reward
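Each arm behaves like a Bernoulli trial: a pull pays 1 with the machine's hidden probability, else 0. A minimal sketch of one such machine (the probability and seed below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator, chosen here for reproducibility
true_prob = 0.85                 # hypothetical winning probability of one machine

# Each pull pays 1 with probability true_prob, otherwise 0
pulls = (rng.random(10_000) < true_prob).astype(int)
print(pulls[:10])    # a few individual rewards, each 0 or 1
print(pulls.mean())  # empirical win rate, close to true_prob
```

Averaging many pulls recovers the machine's hidden probability, which is exactly what the agent's value estimates will do.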
Reinforcement Learning with Gymnasium in Python


Solving the problem

 

  • Decayed epsilon-greedy
  • Epsilon → select random machine
  • 1 - epsilon → select best machine so far
  • Epsilon decreases over time

Diagram showing that with a probability of epsilon, the agent explores by selecting a machine at random, and with a probability of 1 - epsilon it exploits by selecting the best-known machine.
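The "decays over time" part can be sketched in isolation: epsilon shrinks geometrically each step but is clamped at a floor, so some exploration always remains. The constants match the initialization shown later in this lesson:

```python
epsilon = 1.0          # start fully exploratory
min_epsilon = 0.01     # floor: never stop exploring entirely
epsilon_decay = 0.999  # multiplicative decay per step

history = []
for _ in range(10_000):
    history.append(epsilon)
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

print(history[0], history[-1])  # starts at 1.0, ends clamped at the floor
```

Since 0.999 ** 10000 is far below 0.01, the schedule spends its later iterations almost entirely exploiting.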


Initialization

import numpy as np

n_bandits = 4
true_bandit_probs = np.random.rand(n_bandits)

n_iterations = 100000   # Number of arm pulls
epsilon = 1.0           # Initial exploration rate
min_epsilon = 0.01      # Floor for epsilon
epsilon_decay = 0.999   # Multiplicative decay per iteration

counts = np.zeros(n_bandits)      # How many times each bandit was played
values = np.zeros(n_bandits)      # Estimated winning probability of each bandit
rewards = np.zeros(n_iterations)  # Reward history
selected_arms = np.zeros(n_iterations, dtype=int)  # Arm selection history

Interaction loop

for i in range(n_iterations):
    arm = epsilon_greedy()
    reward = int(np.random.rand() < true_bandit_probs[arm])  # Bernoulli reward
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # Incremental mean update
    epsilon = max(min_epsilon, epsilon * epsilon_decay)  # Decay exploration rate

Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))


Diagram showing the first step of the process: a sample array of size (iterations, n_bandits) filled with zeros.


Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))

for i in range(n_iterations):
    selections_percentage[i, selected_arms[i]] = 1

Diagram showing the second step of the process, which marks the selected arm in each iteration with a value of 1 inside the array.


Analyzing selections

selections_percentage = np.zeros((n_iterations, n_bandits))

for i in range(n_iterations):
    selections_percentage[i, selected_arms[i]] = 1

selections_percentage = np.cumsum(selections_percentage, axis=0) \
    / np.arange(1, n_iterations + 1).reshape(-1, 1)

Diagram showing the last steps of the process, where a cumulative sum of the chosen bandits is performed, and then we divide by the iteration number to obtain the percentage of selection of each arm in each iteration.
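The cumsum-then-divide step is easy to verify on a toy history (the sizes and arm choices below are made up for illustration):

```python
import numpy as np

n_iterations, n_bandits = 4, 2
selected_arms = np.array([0, 1, 1, 1])  # toy selection history

selections = np.zeros((n_iterations, n_bandits))
selections[np.arange(n_iterations), selected_arms] = 1  # one-hot row per step

# Running fraction of pulls given to each arm up to each iteration
pct = np.cumsum(selections, axis=0) / np.arange(1, n_iterations + 1).reshape(-1, 1)
print(pct)
```

Each row sums to 1, and the final row gives the overall share of pulls per arm (here 1/4 for arm 0 and 3/4 for arm 1).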


Analyzing selections

  Image showing the selection-percentage curve for each bandit: as iterations progress, the agent selects Bandit #2 more often than the others.

for arm in range(n_bandits):
    plt.plot(selections_percentage[:, arm], label=f'Bandit #{arm+1}')
plt.xscale('log')
plt.title('Bandit Action Choices Over Time')
plt.xlabel('Episode Number')
plt.ylabel('Percentage of Bandit Selections (%)')
plt.legend()
plt.show()

for i, prob in enumerate(true_bandit_probs, 1):
    print(f"Bandit #{i} -> {prob:.2f}")
Bandit #1 -> 0.37
Bandit #2 -> 0.95
Bandit #3 -> 0.73
Bandit #4 -> 0.60
  • Agent learns to select the bandit with the highest winning probability

Let's practice!

