## Are Q-learning and SARSA the same when action selection is greedy?

I'm currently studying reinforcement learning and I'm having difficulties with question 6.12 in Sutton and Barto's book.

Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as SARSA? Will they make exactly the same action selections and weight updates?

I think this is true, because the main difference between the two algorithms is how they behave when the agent explores, and under a greedy policy the agent never explores. But I am not sure.

If we write out the pseudo-code for the SARSA algorithm, we first initialise our hyper-parameters etc., then initialise $$S_t$$, which we use to choose $$A_t$$ from our policy $$\pi(a|s)$$. Then, for each step $$t$$ of the episode, we do the following:

1. Take action $$A_t$$ and observe $$R_{t+1}$$, $$S_{t+1}$$
2. Choose $$A_{t+1}$$ using $$S_{t+1}$$ in our policy
3. $$Q(S_t, A_t) = Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t, A_t)]$$
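The three steps above can be sketched as a tabular SARSA loop. This is a minimal illustration on a hypothetical 3-state chain environment (states 0–2, actions 0 = left and 1 = right, reward 1 on reaching the rightmost state); the environment, constants, and helper names are my own, not from the book.

```python
import numpy as np

# Hypothetical toy chain: states 0..2, reward 1 for reaching state 2.
n_states, n_actions = 3, 2
alpha, gamma, eps = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic toy transition: action 1 moves right, action 0 moves left."""
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return (1.0 if s2 == n_states - 1 else 0.0), s2

def policy(s):
    """epsilon-greedy; set eps = 0 for the purely greedy case discussed here."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for episode in range(50):
    s = 0
    a = policy(s)                           # choose A_t from the policy
    done = False
    while not done:
        r, s2 = step(s, a)                  # 1. take A_t, observe R_{t+1}, S_{t+1}
        a2 = policy(s2)                     # 2. choose A_{t+1} using S_{t+1}
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])  # 3. SARSA update
        done = s2 == n_states - 1
        s, a = s2, a2
```

Note that step 2 happens *before* the update in step 3; this ordering is what the edge case below hinges on.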

Now, in Q-learning we replace $$Q(S_{t+1},A_{t+1})$$ in line 3 with $$\max_aQ(S_{t+1},a)$$. Recall that in SARSA we chose $$A_{t+1}$$ using our policy $$\pi$$. If the policy is greedy with respect to the action-value function, then it selects $$A_{t+1} = \arg\max_aQ(S_{t+1},a)$$, so the bootstrap term becomes $$Q(S_{t+1},A_{t+1}) = \max_aQ(S_{t+1},a)$$, which is exactly the target used in the Q-learning update.

To answer the question: yes, in this case they make the same weight updates, and if both algorithms follow the greedy policy then they also make the same action selections, so they become the same algorithm.

Edit 1: I forgot to consider an edge case, so they are not always the same algorithm.

Consider where we transition from $$s$$ to $$s'$$ where $$s'=s$$. I will outline the updates for SARSA and Q-learning indexing the $$Q$$ functions with $$t$$ to demonstrate the difference.

For each case I will assume we are at the start of the episode as this is the easiest way to illustrate the difference.

SARSA

1. We initialise $$S_0 = s$$ and choose $$A_0 = \arg\max_a Q_0(s,a)$$
2. Take action $$A_0$$ and observe $$R_{1}$$ and $$S_{1} = s' = s$$.
3. Choose action $$A_{1} = \arg\max_aQ_{0}(s,a)$$

Q-Learning

1. Initialise $$S_0 = s$$
2. Choose action $$A_0 = \arg\max_aQ_0(s,a)$$, observe $$R_{1}, S_{1} = s' = s$$
3. $$Q_{1}(S_0,A_0) = Q_0(S_0,A_0) + \alpha [R_{1} + \gamma \max_aQ_0(s,a) - Q_0(S_0,A_0)]$$
4. Choose action $$A_1 = \arg\max_aQ_1(s,a)$$

The key to understanding this edge case is that when we transition back into the same state, Q-learning updates the Q-function *before* choosing $$A_1$$, whereas SARSA chose $$A_1$$ from the pre-update $$Q_0$$. I have indexed the actions and Q-functions by the episode step; indexing the Q-function this way is usually unnecessary, but because two successive states are identical here, the index makes the timing of the update explicit.
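The divergence can be demonstrated numerically. In this sketch the single-state MDP, the Q-values, and the reward are hypothetical, chosen so that the Q-learning update flips the greedy action before $$A_1$$ is selected:

```python
import numpy as np

alpha, gamma, r = 1.0, 0.9, -1.0   # hypothetical constants; r < 0 flips the greedy action
s = 0                               # single state that transitions back to itself
Q0 = np.array([[1.0, 0.9]])         # Q_0: action 0 is barely preferred

# Both algorithms choose A_0 greedily from Q_0.
a0 = int(np.argmax(Q0[s]))          # action 0

# SARSA: A_1 is chosen from Q_0, *before* any update.
a1_sarsa = int(np.argmax(Q0[s]))    # still action 0

# Q-learning: the update to Q happens first, then A_1 is chosen from Q_1.
Q1 = Q0.copy()
Q1[s, a0] += alpha * (r + gamma * np.max(Q1[s]) - Q1[s, a0])  # Q_1(s,0) = -0.1
a1_q = int(np.argmax(Q1[s]))        # action 1: the greedy action has changed

print(a1_sarsa, a1_q)               # 0 1 -- different action selections
```

Because the bad reward drags $$Q_1(s, 0)$$ below $$Q_1(s, 1)$$, Q-learning greedily picks action 1 for $$A_1$$ while SARSA, having committed to $$A_1$$ before the update, picks action 0.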

Thank you for your answer. I was linked to an unofficial solution manual (https://github.com/LyWangPX/Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions/blob/master/Chapter%206/Solutions_to_Reinforcement_Learning_by_Sutton_Chapter_6_rx.pdf) where someone argues that the algorithms can behave differently even when action selection is greedy. The conclusion is that they are more or less the same, but there are some limit cases in which they behave differently.

– hyuj – 2020-05-11T14:31:49.963

Thanks - I hadn't thought of this edge case; I will edit my answer to account for this. – David Ireland – 2020-05-11T14:49:07.973

Great feedback from both answers. Thanks. – ddaedalus – 2020-06-03T14:58:22.960