[Tex]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)[/Tex]
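For example, assuming [Tex]\alpha = 0.1[/Tex], [Tex]\gamma = 0.99[/Tex], a current estimate [Tex]Q(s_t, a_t) = 0[/Tex], a step reward [Tex]r_{t+1} = -1[/Tex] and [Tex]\max_{a'} Q(s_{t+1}, a') = 0[/Tex] (values matching the CliffWalking setup used later), a single update gives
[Tex]Q(s_t, a_t) \leftarrow 0 + 0.1 \left( -1 + 0.99 \cdot 0 - 0 \right) = -0.1[/Tex]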
| Feature | Q-learning | SARSA |
|---|---|---|
| Policy Type | Off-policy | On-policy |
| Update Rule | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right][/Tex] | [Tex]Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right][/Tex] |
| Learning Approach | Learns the value of the optimal policy | Learns the value of the current policy |
| Stability | Potentially less stable due to off-policy updates | More stable due to on-policy updates |
| Convergence Speed | Typically faster convergence to the optimal policy | Typically slower convergence to the optimal policy |
| Exploration Impact | Exploration policy can differ from the learning policy | Exploration directly influences learning updates |
| Action Selection for Update | Updates based on the maximum future reward | Updates based on the action actually taken |
| Use Case Suitability | Suitable for environments where efficiency is critical | Suitable for environments where stability is critical |
| Example Scenarios | Gaming, robotics, financial trading | Healthcare, adaptive traffic management, personalized learning |
| Handling of Exploratory Actions | More efficient but can be less aligned with actual experiences | More cautious and aligned with actual experiences |
| Algorithm Focus | Focuses on finding the best possible actions | Focuses on the actions currently taken by the agent |
| Risk Tolerance | Higher tolerance for risk and instability | Lower tolerance for risk, prioritizing safety |
Demonstrating the Difference between Q-learning and SARSA
Let's break down the code step by step to understand how the Q-tables for both algorithms are generated:
Step 1: Import Libraries
import gym
import numpy as np
Step 2: Initialize Environment
Initializes the CliffWalking-v0 environment. Note that the code below uses the classic Gym API (gym < 0.26), where env.reset() returns a state and env.step() returns four values; with gymnasium or newer gym releases, reset() returns (state, info) and step() returns five values.
env = gym.make('CliffWalking-v0')
Step 3: Define Hyperparameters
Sets the learning rate, discount factor, epsilon-greedy parameter, and the number of episodes for training.
alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
epsilon = 0.1 # Epsilon-greedy parameter
episodes = 500 # Number of episodes
Step 4: Initialize Q-tables
Creates two Q-tables, one for Q-learning and one for SARSA, initialized to zeros.
q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))
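For CliffWalking-v0, the observation space has 48 states (a 4 × 12 grid) and the action space has 4 actions, so each table has shape (48, 4). A quick sanity check:
print(q_table_q_learning.shape)  # expected: (48, 4)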
Step 5: Define Epsilon-greedy Policy
Defines a policy that chooses an action based on the epsilon-greedy approach.
def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()  # explore: random action
    else:
        return np.argmax(q_table[state])  # exploit: best known action
Step 6: Q-learning Algorithm
Trains the agent using the Q-learning algorithm by updating the Q-table based on the maximum expected future rewards.
def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: value of the best action in the next state
            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error
            state = next_state
Step 7: SARSA Algorithm
Trains the agent using the SARSA algorithm, updating the Q-table based on the actions actually taken under the current epsilon-greedy policy: the next action is selected with the same policy and carried over into the following step.
def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # On-policy target: value of the action actually chosen for the next step
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)
            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error
            state = next_state
            action = next_action
Step 8: Train the Agents
q_learning()
sarsa()
Step 9: Print Q-tables
Prints the Q-tables learned by Q-learning and SARSA.
print("Q-table from Q-learning:")
print(q_table_q_learning)
print("\nQ-table from SARSA:")
print(q_table_sarsa)
Step 10: Test the Policies
Tests the learned policies by running a single episode using the learned Q-tables and prints the total rewards received.
def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = np.argmax(q_table[state])  # act greedily with respect to the learned Q-table
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")
print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")
Complete Code
Python
import gym
import numpy as np

# Initialize the CliffWalking environment
env = gym.make('CliffWalking-v0')

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Epsilon-greedy parameter
episodes = 500   # Number of episodes

# Initialize Q-tables for Q-learning and SARSA
q_table_q_learning = np.zeros((env.observation_space.n, env.action_space.n))
q_table_sarsa = np.zeros((env.observation_space.n, env.action_space.n))

# Epsilon-greedy action selection
def epsilon_greedy_policy(state, q_table):
    if np.random.rand() < epsilon:
        return env.action_space.sample()  # explore: random action
    else:
        return np.argmax(q_table[state])  # exploit: best known action

# Q-learning algorithm (off-policy)
def q_learning():
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_policy(state, q_table_q_learning)
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: value of the best action in the next state
            best_next_action = np.argmax(q_table_q_learning[next_state])
            td_target = reward + gamma * q_table_q_learning[next_state][best_next_action]
            td_error = td_target - q_table_q_learning[state][action]
            q_table_q_learning[state][action] += alpha * td_error
            state = next_state

# SARSA algorithm (on-policy)
def sarsa():
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(state, q_table_sarsa)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # On-policy target: value of the action actually chosen for the next step
            next_action = epsilon_greedy_policy(next_state, q_table_sarsa)
            td_target = reward + gamma * q_table_sarsa[next_state][next_action]
            td_error = td_target - q_table_sarsa[state][action]
            q_table_sarsa[state][action] += alpha * td_error
            state = next_state
            action = next_action

# Train the agents
q_learning()
sarsa()

# Compare the Q-tables
print("Q-table from Q-learning:")
print(q_table_q_learning)
print("\nQ-table from SARSA:")
print(q_table_sarsa)

# Testing the learned policies
def test_policy(q_table):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = np.argmax(q_table[state])  # act greedily
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

print("\nTesting Q-learning policy:")
q_learning_reward = test_policy(q_table_q_learning)
print(f"Total reward: {q_learning_reward}")

print("\nTesting SARSA policy:")
sarsa_reward = test_policy(q_table_sarsa)
print(f"Total reward: {sarsa_reward}")
Output:
Q-table from Q-learning:
[[ -10.59211215 -10.56938101 -10.67238889 -10.63737823]
[ -10.12685364 -10.10867596 -10.12577074 -10.34405853]
.
.
.
[ -2.61016867 -1.76042566 -1. -2.48510098]
[ -12.2478977 -101.76477563 -12.81475214 -12.66146631]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Q-table from SARSA:
[[-12.81783185 -12.71444982 -12.77600325 -12.82976718]
[-12.10365824 -11.98888096 -12.00448456 -12.02939867]
[-11.22475027 -11.19298546 -11.25557992 -11.25058041]
[-10.50817883 -10.37003716 -10.37359796 -10.52296819]
.
.
.
[ -2.98975185 -1.84812003 -1. -2.17498711]
[-14.84919232 -94.75377767 -15.46421167 -15.33976999]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Testing Q-learning policy:
Total reward: -13
Testing SARSA policy:
Total reward: -15
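These totals reflect the classic CliffWalking behaviour: the greedy Q-learning policy follows the shortest path along the cliff edge (13 steps, total reward -13), while the SARSA policy takes a slightly longer, safer route further from the cliff (15 steps, total reward -15). To see this directly, a small helper (hypothetical, not part of the tutorial code above) can print the greedy action for each cell of the 4 × 12 grid, assuming the standard CliffWalking-v0 action order 0 = up, 1 = right, 2 = down, 3 = left:
ARROWS = ['^', '>', 'v', '<']  # assumed action order: 0=up, 1=right, 2=down, 3=left

def print_greedy_policy(q_table, rows=4, cols=12):
    # One arrow per grid cell, showing argmax_a Q(s, a) for each state s
    for r in range(rows):
        print(''.join(ARROWS[int(np.argmax(q_table[r * cols + c]))] for c in range(cols)))

print("Greedy policy from Q-learning:")
print_greedy_policy(q_table_q_learning)
print("\nGreedy policy from SARSA:")
print_greedy_policy(q_table_sarsa)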
Strengths and Weaknesses
Q-learning
Strengths:
- Often converges faster to the optimal policy.
- Learns the optimal (greedy) values even while following an exploratory behaviour policy, which can be beneficial in complex scenarios.
Weaknesses:
- Can be less stable, because its updates ignore the exploratory actions the agent actually takes.
- Might converge to suboptimal policies if the exploration is not properly managed.
SARSA
Strengths:
- More stable learning process due to the on-policy approach.
- Safer in risky or highly uncertain environments, because it accounts for the exploration it actually performs.
Weaknesses:
- Can be slower to converge, since it learns the value of the exploratory policy rather than the greedy one.
- May settle on a safer but slightly longer route, as in the CliffWalking example, rather than the strictly optimal policy.
Conclusion
In conclusion, Q-learning and SARSA are fundamental reinforcement learning algorithms with distinct approaches. Q-learning is off-policy: it learns the Q-values of the optimal policy regardless of the actions the agent actually takes, which makes it well suited to dynamic environments such as gaming and robotics. SARSA, on the other hand, is on-policy: it learns from the actions the agent actually takes, which makes it safer and more stable in domains such as healthcare and adaptive traffic management. Choosing the right algorithm requires understanding these differences in learning strategy, update rule, and use case, and balancing fast convergence against safe exploration in the target environment.
What are the key differences between Q-learning and SARSA? - FAQs
What is the main difference in the update rule between Q-learning and SARSA?
Q-learning updates the Q-value using the maximum estimated value of the next state, regardless of which action is actually selected (off-policy). SARSA, by contrast, updates the Q-value using the action the agent actually takes in the next state (on-policy).
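In code, the difference is a single line in how the TD target is built. A minimal sketch, assuming a NumPy Q-table q, a transition (state, action, reward, next_state), and next_action being the action the behaviour policy actually chooses next:
td_target_q_learning = reward + gamma * np.max(q[next_state])      # off-policy: best next action
td_target_sarsa = reward + gamma * q[next_state][next_action]      # on-policy: action actually taken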
How do Q-learning and SARSA differ in terms of policy learning?
Q-learning learns the value of the optimal policy, independently of the actions the agent takes. SARSA learns the value of the policy the agent is actually following at that time, including its exploratory actions.
What does "off-policy" mean in the context of Q-learning?
Off-policy means the Q-value is updated toward the value of the best possible action in the next state, regardless of the action the agent's behaviour policy actually goes on to choose.
Which algorithm converges to the optimal policy faster?
Q-learning generally converges to the optimal policy faster because it updates toward the maximum estimated future reward, though this can come at the cost of stability.