Writing out pseudo-code for SARSA: we first initialise our hyper-parameters and the action-value function, then initialise $S_t$ and use it to choose $A_t$ from our policy $\pi(a|s)$. Then, for each $t$ in the episode, we do the following:

- Take action $A_t$ and observe $R_{t+1}$, $S_{t+1}$
- Choose $A_{t+1}$ using $S_{t+1}$ in our policy
- $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t, A_t)]$
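
The loop body above can be sketched as a single update function over a tabular $Q$; the tabular setup and the toy numbers below are illustrative, not from the original answer:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy TD target: bootstrap from the action the policy actually chose.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, Q initialised to zero.
Q = np.zeros((2, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(Q[0, 1])  # 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```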

Now, in Q-learning we replace $Q(S_{t+1},A_{t+1})$ in the update rule above with $\max_a Q(S_{t+1},a)$. Recall that in SARSA we chose $A_{t+1}$ using our policy $\pi$; if that policy is greedy with respect to the action-value function, i.e. $\pi(s) = \arg\max_a Q(s,a)$, then the bootstrap term $Q(S_{t+1}, A_{t+1})$ equals $\max_a Q(S_{t+1}, a)$, which is exactly the term used in the Q-learning update.
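
The corresponding Q-learning update differs only in the bootstrap term; here is a sketch in the same hypothetical tabular setup as before:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy TD target: bootstrap from the greedy action at s_next,
    # regardless of which action the behaviour policy will actually take.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q[1, 1] = 0.5                       # make the greedy value at s' = 1 nonzero
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
print(Q[0, 0])  # 0.1 * (1.0 + 0.9 * 0.5 - 0.0) = 0.145
```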

To answer the question: yes, they always make the same weight update. If both algorithms follow a greedy policy then they also become the same algorithm.

Edit 1: I forgot to consider an edge case, so they are not *always* the same algorithm.

Consider a transition from $s$ to $s'$ where $s' = s$, i.e. a self-transition. I will outline the updates for SARSA and Q-learning, indexing the $Q$ functions with $t$ to demonstrate the difference.

For each case I will assume we are at the start of the episode as this is the easiest way to illustrate the difference.

**SARSA**

- We initialise $S_0 = s$ and choose $A_0 = \arg\max_a Q_0(s,a)$
- Take action $A_0$ and observe $R_{1}$ and $S_{1} = s' = s$.
- Choose action $A_{1} = \arg\max_a Q_{0}(s,a)$

**Q-Learning**

- Initialise $S_0 = s$
- Choose action $A_0 = \arg\max_a Q_0(s,a)$, observe $R_{1}$, $S_{1} = s' = s$
- $Q_{1}(S_0,A_0) = Q_0(S_0,A_0) + \alpha [R_{1} + \gamma \max_aQ_0(s,a) - Q_0(S_0,A_0)]$
- Choose action $A_1 = \arg\max_a Q_1(s,a)$

The key to understanding this edge case is that, when we transition back into the same state, Q-learning updates the Q-function *before* choosing $A_1$, whereas SARSA chooses $A_1$ first. I have indexed both actions and Q-functions by the episode step; indexing the Q-function by time step would not usually be meaningful, but because two successive states are the same here, it makes clear which version of $Q$ each algorithm uses to select $A_1$.
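
The edge case can be checked numerically. This sketch uses one state with two actions, greedy tie-breaking by lowest index, and illustrative numbers ($\alpha = 0.5$, $\gamma = 0.9$, $r = -1$) that are my own, not from the answer:

```python
import numpy as np

alpha, gamma, r = 0.5, 0.9, -1.0    # a negative reward can flip the greedy action

# SARSA: A_1 is chosen from Q_0, *before* any update happens.
Q_sarsa = np.array([[0.1, 0.0]])    # one state s, two actions
a0 = int(np.argmax(Q_sarsa[0]))     # A_0 = 0
a1_sarsa = int(np.argmax(Q_sarsa[0]))  # A_1 = 0, still greedy w.r.t. Q_0

# Q-learning: Q(s, A_0) is updated first, then A_1 is chosen from Q_1.
Q_q = np.array([[0.1, 0.0]])
Q_q[0, a0] += alpha * (r + gamma * np.max(Q_q[0]) - Q_q[0, a0])
a1_q = int(np.argmax(Q_q[0]))       # greedy action has flipped

print(a1_sarsa, a1_q)  # SARSA picks action 0, Q-learning picks action 1
```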


Thank you for your answer. I have been linked to an unofficial solution manual (https://github.com/LyWangPX/Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions/blob/master/Chapter%206/Solutions_to_Reinforcement_Learning_by_Sutton_Chapter_6_rx.pdf) where someone states that, even when using greedy action selection, the algorithms can behave differently. The conclusion is that they are more or less the same, but there are some limit cases where they might behave differently.

– hyuj – 2020-05-11T14:31:49.963

Thanks - I hadn't thought of this edge case; I will edit my answer to account for this. – David Ireland – 2020-05-11T14:49:07.973

Great feedback from both answers. Thanks. – ddaedalus – 2020-06-03T14:58:22.960