Differences between Q-learning and SARSA

Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. Among the various algorithms in RL, Q-learning and SARSA (State-Action-Reward-State-Action) are two of the most popular ones. Though both algorithms aim to find the optimal policy for decision-making, they have distinct differences in their approaches.

This article delves into the key differences between Q-learning and SARSA, exploring their mechanisms, advantages, and use cases.

Overview of Q-learning

Q-learning is an off-policy RL algorithm that learns the value of the optimal action independently of the policy being followed. It aims to learn the optimal action-value function, [Tex]Q^*(s, a)[/Tex], which gives the maximum expected future reward for taking action a in state s. The update rule for Q-learning is:

[Tex]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right][/Tex]
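
As a quick illustration, here is a minimal sketch of a single Q-learning update on a tabular Q-function using NumPy. The table shape, state and action indices, reward, and hyperparameters below are arbitrary placeholders chosen for the sketch, not values from the full example later in this article.

import numpy as np

# Placeholder Q-table: 48 states x 4 actions (sizes chosen only for illustration)
Q = np.zeros((48, 4))
alpha, gamma = 0.1, 0.99          # learning rate and discount factor
s, a, r, s_next = 0, 1, -1.0, 12  # one observed transition (s, a, r, s')

# Off-policy target: bootstrap from the greedy (max) action in the next state,
# regardless of which action the behaviour policy will actually take there.
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])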

Overview of SARSA

SARSA, on the other hand, is an on-policy RL algorithm. It also learns an action-value function, but it updates its estimates based on the action actually taken by the current policy. The update rule for SARSA is:

[Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right][/Tex]

where:

  • a′ is the action taken in the next state s′ according to the current policy.
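
For comparison, here is the same style of sketch for a single SARSA update. Again, the Q-table shape, the transition values, and the chosen next action are illustrative placeholders; the only change from the Q-learning sketch is that the target bootstraps from the action the current policy actually selects in the next state.

import numpy as np

# Placeholder Q-table and transition, mirroring the Q-learning sketch above
Q = np.zeros((48, 4))
alpha, gamma = 0.1, 0.99
s, a, r, s_next = 0, 1, -1.0, 12
a_next = 2  # action chosen in s_next by the current (e.g. epsilon-greedy) policy

# On-policy target: bootstrap from the action actually taken next
td_target = r + gamma * Q[s_next, a_next]
Q[s, a] += alpha * (td_target - Q[s, a])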

Key Differences between Q-learning and SARSA

1. Exploration vs. Exploitation

  • Q-learning: As an off-policy method, Q-learning updates its Q-values using the maximum possible future reward, regardless of the action taken. This can lead to more aggressive exploration of the environment.
  • SARSA: As an on-policy method, SARSA updates its Q-values based on the actions actually taken by the policy. This typically results in a more cautious approach, balancing exploration and exploitation more conservatively.

2. Update Rules

  • Q-learning: Uses the max operator to update Q-values, focusing on the best possible action.
  • SARSA: Uses the action taken by the current policy, making the learning process more dependent on the policy’s behavior.

3. On-policy vs. Off-policy Learning

  • Q-learning: Off-policy, meaning it learns the value of the optimal policy independently of the agent’s actions.
  • SARSA: On-policy, meaning it learns the value of the policy being followed by the agent.

Feature | Q-learning | SARSA
--- | --- | ---
Policy Type | Off-policy | On-policy
Update Rule | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right][/Tex] | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right][/Tex]
Learning Approach | Learns the value of the optimal policy | Learns the value of the current policy
Stability | Potentially less stable due to off-policy updates | More stable due to on-policy updates
Convergence Speed | Typically faster convergence to the optimal policy | Typically slower convergence to the optimal policy
Exploration Impact | Exploration policy can differ from the learned policy | Exploration directly influences learning updates
Action Selection for Update | Updates based on the maximum future reward | Updates based on the action actually taken
Use Case Suitability | Suitable for environments where efficiency is critical | Suitable for environments where stability is critical
Example Scenarios | Gaming, robotics, financial trading | Healthcare, adaptive traffic management, personalized learning
Handling of Exploratory Actions | More efficient but can be less aligned with actual experience | More cautious and aligned with actual experience
Algorithm Focus | Focuses on finding the best possible actions | Focuses on the actions currently taken by the agent
Risk Tolerance | Higher tolerance for risk and instability | Lower tolerance for risk, prioritizing safety

Demonstrating the Difference between Q-learning and SARSA

To see the difference in practice, we can train both algorithms on Gym's CliffWalking-v0 environment and compare the Q-tables they learn. Let's break down the code step by step:

Step 1: Import Libraries

import gym
import numpy as np

Step 2: Initialize Environment

Initializes the CliffWalking-v0 environment.

env = gym.make('CliffWalking-v0')

Step 3: Define Hyperparameters

Sets the learning rate, discount factor, epsilon-greedy parameter, and the number of episodes for training.

alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
epsilon = 0.1 # Epsilon-greedy parameter
episodes = 500 # Number of episodes

Step 4: Initialize Q-tables

Creates two Q-tables, one for Q-learning and one for SARSA, initialized to zeros.

q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))

Step 5: Define Epsilon-greedy Policy

Defines a policy that chooses an action based on the epsilon-greedy approach.

def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(q_table[state])

Step 6: Q-learning Algorithm

Trains the agent using the Q-learning algorithm by updating the Q-table based on the maximum expected future rewards.

def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)

            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error

            state = next_state

Step 7: SARSA Algorithm

Trains the agent using the SARSA algorithm by updating the Q-table based on the actual actions taken following the current policy.

def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)

            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error

            state = next_state
            action = next_action

Step 8: Train the Agents

q_learning()
sarsa()

Step 9: Print Q-tables

Prints the Q-tables learned by Q-learning and SARSA.

print("Q-table from Q-learning:")
print(q_table_q_learning)

print("\nQ-table from SARSA:")
print(q_table_sarsa)

Step 10: Test the Policies

Tests the learned policies by running a single episode using the learned Q-tables and prints the total rewards received.

def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, _ = env.step(action)
        total_reward += reward

    return total_reward

print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")

print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")

Complete Code

Python

import gym
import numpy as np

# Initialize the CliffWalking environment
env = gym.make('CliffWalking-v0')

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Epsilon-greedy parameter
episodes = 500   # Number of episodes

# Initialize Q-tables for Q-learning and SARSA
q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))

def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(q_table[state])

# Q-learning algorithm
def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)

            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error

            state = next_state

# SARSA algorithm
def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)

            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error

            state = next_state
            action = next_action

# Train the agents
q_learning()
sarsa()

# Compare the Q-tables
print("Q-table from Q-learning:")
print(q_table_q_learning)

print("\nQ-table from SARSA:")
print(q_table_sarsa)

# Testing the learned policies
def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, _ = env.step(action)
        total_reward += reward

    return total_reward

print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")

print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")

Output:

Q-table from Q-learning:
[[ -10.59211215 -10.56938101 -10.67238889 -10.63737823]
[ -10.12685364 -10.10867596 -10.12577074 -10.34405853]
.
.
.
[ -2.61016867 -1.76042566 -1. -2.48510098]
[ -12.2478977 -101.76477563 -12.81475214 -12.66146631]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]

Q-table from SARSA:
[[-12.81783185 -12.71444982 -12.77600325 -12.82976718]
[-12.10365824 -11.98888096 -12.00448456 -12.02939867]
[-11.22475027 -11.19298546 -11.25557992 -11.25058041]
[-10.50817883 -10.37003716 -10.37359796 -10.52296819]

.
.
.
[ -2.98975185 -1.84812003 -1. -2.17498711]
[-14.84919232 -94.75377767 -15.46421167 -15.33976999]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]

Testing Q-learning policy:
Total reward: -13

Testing SARSA policy:
Total reward: -15

These results match the classic CliffWalking outcome: the Q-learning policy takes the shortest path along the cliff edge (13 steps, total reward -13), while the SARSA policy follows a slightly longer but safer route further from the cliff (15 steps, total reward -15), reflecting its on-policy, risk-averse behaviour under epsilon-greedy exploration.

Strengths and Weaknesses

Q-learning

Strengths:

  • Often converges faster to the optimal policy.
  • More aggressive in exploring the environment, which can be beneficial in complex scenarios.

Weaknesses:

  • Can be less stable due to aggressive exploration.
  • Might converge to suboptimal policies if the exploration is not properly managed.

SARSA

Strengths:

  • More stable learning process due to the on-policy approach.
  • Better at handling environments with high levels of uncertainty.

Weaknesses:

  • Can be slower to converge as it balances exploration and exploitation.
  • May not explore as thoroughly as Q-learning, potentially missing optimal policies.

Conclusion

In conclusion, Q-learning and SARSA are fundamental reinforcement learning algorithms with distinct approaches. Q-learning is off-policy: it learns the Q-values of the optimal policy directly, which pays off in dynamic environments such as gaming and robotics. SARSA is on-policy: it learns from the actions the agent actually takes, which makes it safer and more stable in domains such as healthcare and traffic control and management. Anyone choosing between the two should understand the differences in their learning strategies, update rules, and typical use cases, and weigh the trade-off between fast convergence and safe exploration in the target environment.

Differences between Q-learning and SARSA - FAQs

What is the main difference in the update rule between Q-learning and SARSA?

Q-learning updates the Q-value using the maximum estimated value over actions in the next state, regardless of which action is actually selected (off-policy). SARSA, by contrast, updates the Q-value using the action that the agent actually takes in the next state (on-policy).

How do Q-learning and SARSA differ in terms of policy learning?

Q-learning learns the value of the optimal policy and does not depend on the actions the agent happens to take. SARSA learns the value of the policy the agent is currently following.

What does off-policy mean in the context of Q-learning?

Off-policy means that the Q-value is updated toward the value of the best possible action in the next state, independently of the action the agent's behaviour policy actually chooses.

Which algorithm converges to the optimal policy faster?

Q-learning generally converges to the optimal policy faster because it bootstraps from the maximum expected future reward, though this aggressiveness can affect stability.



