Q-learning

Reinforcement Learning with Gymnasium in Python

Fouad Trad

Machine Learning Engineer

Introduction to Q-learning

  • Short for quality learning
  • A model-free technique
  • Learns the optimal Q-table through interaction

Diagram showing the Q-learning steps: initialize the Q-table, choose an action to execute, receive a reward from the environment, and update the table. The agent repeats this until convergence after a number of episodes.


Q-learning vs. SARSA

SARSA

Image shows the SARSA update formula.

  • Updates based on the action actually taken
  • On-policy learner
Q-learning

Image shows the Q-learning update formula.

  • Updates independently of the action taken
  • Off-policy learner
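The update formulas referenced in the images above are the standard ones; written out, the only difference is how the next state's value is estimated:

```latex
\begin{aligned}
\text{SARSA:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\bigr] \\
\text{Q-learning:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr]
\end{aligned}
```

SARSA bootstraps from the value of the action it will actually take next (on-policy), while Q-learning bootstraps from the maximum over all next actions regardless of which one is taken (off-policy).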

Implementing Q-learning

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)

num_episodes = 1000
alpha = 0.1
gamma = 1

num_states, num_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((num_states, num_actions))
reward_per_random_episode = []

Implementing Q-learning

for episode in range(num_episodes):
    state, info = env.reset()
    terminated = False
    episode_reward = 0

    while not terminated:
        # Random action selection
        action = env.action_space.sample()
        # Take action and observe the new state and reward
        new_state, reward, terminated, truncated, info = env.step(action)
        # Update the Q-table
        update_q_table(state, action, reward, new_state)
        episode_reward += reward
        state = new_state

    reward_per_random_episode.append(episode_reward)

The Q-learning update

Image shows the Q-learning update formula.

def update_q_table(state, action, reward, new_state):
    old_value = Q[state, action]
    next_max = max(Q[new_state])
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
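As a quick sanity check of this update, a single call with hypothetical values (old Q-value 0.5, best next Q-value 0.8, reward 1, and the alpha = 0.1, gamma = 1 used above) works out as follows:

```python
import numpy as np

alpha, gamma = 0.1, 1

# Hypothetical 2-state, 2-action Q-table, for illustration only
Q = np.array([[0.5, 0.2],
              [0.8, 0.3]])

def update_q_table(state, action, reward, new_state):
    old_value = Q[state, action]
    next_max = max(Q[new_state])
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# Transition from state 0 (action 0) to state 1 with reward 1:
# new value = 0.9 * 0.5 + 0.1 * (1 + 1 * 0.8) = 0.45 + 0.18 = 0.63
update_q_table(0, 0, 1, 1)
print(Q[0, 0])  # ≈ 0.63
```

The old estimate is blended with the bootstrapped target, so the Q-value moves only a fraction alpha of the way toward the new information on each step.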

Using the policy

reward_per_learned_episode = []
policy = get_policy()

for episode in range(num_episodes):
    state, info = env.reset()
    terminated = False
    episode_reward = 0
    while not terminated:
        # Select the best action based on the learned Q-table
        action = policy[state]
        # Take action and observe the new state
        new_state, reward, terminated, truncated, info = env.step(action)
        state = new_state
        episode_reward += reward
    reward_per_learned_episode.append(episode_reward)
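`get_policy` is not shown on the slide; a minimal sketch, assuming the policy is simply the greedy (highest-value) action per state in the learned Q-table:

```python
import numpy as np

# Hypothetical learned Q-table (4 states, 2 actions), for illustration only
Q = np.array([[0.1, 0.7],
              [0.5, 0.2],
              [0.0, 0.9],
              [0.3, 0.3]])
num_states = Q.shape[0]

def get_policy():
    # Greedy policy: for each state, choose the action with the highest Q-value
    return {state: int(np.argmax(Q[state])) for state in range(num_states)}

policy = get_policy()
print(policy)  # {0: 1, 1: 0, 2: 1, 3: 0}
```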

Evaluating Q-learning

import numpy as np
import matplotlib.pyplot as plt

avg_random_reward = np.mean(reward_per_random_episode)
avg_learned_reward = np.mean(reward_per_learned_episode)

plt.bar(['Random Policy', 'Learned Policy'],
        [avg_random_reward, avg_learned_reward],
        color=['blue', 'green'])
plt.title('Average Reward per Episode')
plt.ylabel('Average Reward')
plt.show()

Bar chart showing that the learned policy achieves a much higher average return (about 0.26) than the random policy (about 0.01).


Let's practice!
