[Tex]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)[/Tex]
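For example, assuming [Tex]\alpha = 0.1[/Tex], [Tex]\gamma = 0.99[/Tex], a current estimate [Tex]Q(s_t, a_t) = 0[/Tex], a step reward [Tex]r_{t+1} = -1[/Tex] and [Tex]\max_{a'} Q(s_{t+1}, a') = 0[/Tex] (values matching the CliffWalking setup used later), a single update gives
[Tex]Q(s_t, a_t) \leftarrow 0 + 0.1 \left( -1 + 0.99 \cdot 0 - 0 \right) = -0.1[/Tex]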
| Feature | Q-learning | SARSA |
|---|---|---|
| Policy Type | Off-policy | On-policy |
| Update Rule | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right][/Tex] | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right][/Tex] |
| Learning Approach | Learns the value of the optimal policy | Learns the value of the current policy |
| Stability | Potentially less stable due to off-policy updates | More stable due to on-policy updates |
| Convergence Speed | Typically faster convergence to the optimal policy | Typically slower convergence to the optimal policy |
| Exploration Impact | Exploration policy can differ from the learning policy | Exploration directly influences learning updates |
| Action Selection for Update | Updates based on the maximum future reward | Updates based on the action actually taken |
| Use Case Suitability | Suitable for environments where efficiency is critical | Suitable for environments where stability is critical |
| Example Scenarios | Gaming, robotics, financial trading | Healthcare, adaptive traffic management, personalized learning |
| Handling of Exploratory Actions | More efficient but can be less aligned with actual experiences | More cautious and aligned with actual experiences |
| Algorithm Focus | Focuses on finding the best possible actions | Focuses on the actions currently taken by the agent |
| Risk Tolerance | Higher tolerance for risk and instability | Lower tolerance for risk, prioritizing safety |
Demonstrating the Difference between Q-learning and SARSA
Let's break down the code step by step to understand how the Q-tables for both algorithms are generated:
Step 1: Import Libraries
import gym
import numpy as np
Step 2: Initialize Environment
Initializes the CliffWalking-v0 environment. Note that the code below uses the classic Gym API (gym < 0.26), where env.reset() returns a state and env.step() returns four values; with gymnasium or newer gym releases, reset() returns (state, info) and step() returns five values.
env = gym.make('CliffWalking-v0')
Step 3: Define Hyperparameters
Sets the learning rate, discount factor, epsilon-greedy parameter, and the number of episodes for training.
alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
epsilon = 0.1 # Epsilon-greedy parameter
episodes = 500 # Number of episodes
Step 4: Initialize Q-tables
Creates two Q-tables, one for Q-learning and one for SARSA, initialized to zeros.
q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))
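For CliffWalking-v0, the observation space has 48 states (a 4 × 12 grid) and the action space has 4 actions, so each table has shape (48, 4). A quick sanity check:
print(q_table_q_learning.shape)  # expected: (48, 4)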
Step 5: Define Epsilon-greedy Policy
Defines a policy that chooses an action based on the epsilon-greedy approach.
def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()  # explore: random action
    else:
        return np.argmax(q_table[state])  # exploit: best known action
Step 6: Q-learning Algorithm
Trains the agent using the Q-learning algorithm by updating the Q-table based on the maximum expected future rewards.
def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: value of the best action in the next state
            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error
            state = next_state
Step 7: SARSA Algorithm
Trains the agent using the SARSA algorithm, updating the Q-table based on the actions actually taken under the current epsilon-greedy policy: the next action is selected with the same policy and carried over into the following step.
def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # On-policy target: value of the action actually chosen for the next step
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)
            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error
            state = next_state
            action = next_action
Step 8: Train the Agents
q_learning()
sarsa()
Step 9: Print Q-tables
Prints the Q-tables learned by Q-learning and SARSA.
print("Q-table from Q-learning:")
print(q_table_q_learning)
print("\nQ-table from SARSA:")
print(q_table_sarsa)
Step 10: Test the Policies
Tests the learned policies by running a single episode using the learned Q-tables and prints the total rewards received.
def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = np.argmax(q_table[state])  # act greedily with respect to the learned Q-table
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")
print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")
Complete Code
Python
import gym
import numpy as np

# Initialize the CliffWalking environment
env = gym.make('CliffWalking-v0')

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Epsilon-greedy parameter
episodes = 500   # Number of episodes

# Initialize Q-tables for Q-learning and SARSA
q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))

# Epsilon-greedy action selection
def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()  # explore: random action
    else:
        return np.argmax(q_table[state])  # exploit: best known action

# Q-learning algorithm (off-policy)
def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: value of the best action in the next state
            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error
            state = next_state

# SARSA algorithm (on-policy)
def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # On-policy target: value of the action actually chosen for the next step
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)
            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error
            state = next_state
            action = next_action

# Train the agents
q_learning()
sarsa()

# Compare the Q-tables
print("Q-table from Q-learning:")
print(q_table_q_learning)
print("\nQ-table from SARSA:")
print(q_table_sarsa)

# Testing the learned policies
def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = np.argmax(q_table[state])  # act greedily
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")

print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")
Output:
Q-table from Q-learning:
[[ -10.59211215 -10.56938101 -10.67238889 -10.63737823]
[ -10.12685364 -10.10867596 -10.12577074 -10.34405853]
.
.
.
[ -2.61016867 -1.76042566 -1. -2.48510098]
[ -12.2478977 -101.76477563 -12.81475214 -12.66146631]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Q-table from SARSA:
[[-12.81783185 -12.71444982 -12.77600325 -12.82976718]
[-12.10365824 -11.98888096 -12.00448456 -12.02939867]
[-11.22475027 -11.19298546 -11.25557992 -11.25058041]
[-10.50817883 -10.37003716 -10.37359796 -10.52296819]
.
.
.
[ -2.98975185 -1.84812003 -1. -2.17498711]
[-14.84919232 -94.75377767 -15.46421167 -15.33976999]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Testing Q-learning policy:
Total reward: -13
Testing SARSA policy:
Total reward: -15
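These totals reflect the classic CliffWalking behaviour: the greedy Q-learning policy follows the shortest path along the cliff edge (13 steps, total reward -13), while the SARSA policy takes a slightly longer, safer route further from the cliff (15 steps, total reward -15). To see this directly, a small helper (hypothetical, not part of the tutorial code above) can print the greedy action for each cell of the 4 × 12 grid, assuming the standard CliffWalking-v0 action order 0 = up, 1 = right, 2 = down, 3 = left:
ARROWS = ['^', '>', 'v', '<']  # assumed action order: 0=up, 1=right, 2=down, 3=left

def print_greedy_policy(q_table, rows=4, cols=12):
    # One arrow per grid cell, showing argmax_a Q(s, a) for each state s
    for r in range(rows):
        print(''.join(ARROWS[int(np.argmax(q_table[r * cols + c]))] for c in range(cols)))

print("Greedy policy from Q-learning:")
print_greedy_policy(q_table_q_learning)
print("\nGreedy policy from SARSA:")
print_greedy_policy(q_table_sarsa)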
Strengths and Weaknesses
Q-learning
Strengths:
- Often converges faster to the optimal policy.
- Learns the optimal (greedy) values even while following an exploratory behaviour policy, which can be beneficial in complex scenarios.
Weaknesses:
- Can be less stable, because its updates ignore the exploratory actions the agent actually takes.
- Might converge to suboptimal policies if the exploration is not properly managed.
SARSA
Strengths:
- More stable learning process due to the on-policy approach.
- Safer in risky or highly uncertain environments, because it accounts for the exploration it actually performs.
Weaknesses:
- Can be slower to converge, since it learns the value of the exploratory policy rather than the greedy one.
- May settle on a safer but slightly longer route, as in the CliffWalking example, rather than the strictly optimal policy.
Conclusion
In conclusion, Q-learning and SARSA are fundamental reinforcement learning algorithms with distinct approaches. Q-learning is off-policy: it learns the Q-values of the optimal policy regardless of the actions the agent actually takes, which makes it well suited to dynamic environments such as gaming and robotics. SARSA, on the other hand, is on-policy: it learns from the actions the agent actually takes, which makes it safer and more stable in domains such as healthcare and adaptive traffic management. Choosing the right algorithm requires understanding these differences in learning strategy, update rule, and use case, and balancing fast convergence against safe exploration in the target environment.
What are the key differences between Q-learning and SARSA? - FAQs
What is the main difference in the update rule between Q-learning and SARSA?
Q-learning updates the Q-value using the maximum estimated value of the next state, regardless of which action is actually selected (off-policy). SARSA, by contrast, updates the Q-value using the action the agent actually takes in the next state (on-policy).
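In code, the difference is a single line in how the TD target is built. A minimal sketch, assuming a NumPy Q-table q, a transition (state, action, reward, next_state), and next_action being the action the behaviour policy actually chooses next:
td_target_q_learning = reward + gamma * np.max(q[next_state])      # off-policy: best next action
td_target_sarsa = reward + gamma * q[next_state][next_action]      # on-policy: action actually taken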
How do Q-learning and SARSA differ in terms of policy learning?
Q-learning learns the value of the optimal policy, independently of the actions the agent takes. SARSA learns the value of the policy the agent is actually following at that time, including its exploratory actions.
What does "off-policy" mean in the context of Q-learning?
Off-policy means the Q-value is updated toward the value of the best possible action in the next state, regardless of the action the agent's behaviour policy actually goes on to choose.
Which algorithm converges to the optimal policy faster?
Q-learning generally converges to the optimal policy faster because it updates toward the maximum estimated future reward, though this can come at the cost of stability.