How do updates in SARSA and Q-learning differ in code?


The update rules for Q-learning and SARSA are as follows:

Q-learning:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right]$$

SARSA:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$$

I understand the theory that SARSA performs 'on-policy' updates, and Q-learning performs 'off-policy' updates.

At the moment I perform Q-learning by calculating the target as follows:

target = reward + self.y * np.max(self.action_model.predict(state_prime))

Here you can see I pick the maximum of the Q-function for state prime (i.e. greedy selection, as defined by $\max_{a'} Q$ in the update rule). If I were to do a SARSA update instead and use the same policy as I use when selecting an action, e.g. ϵ-greedy, would I basically change to this:

if np.random.random() < self.eps:
    target = reward + self.y * self.action_model.predict(state_prime)[np.random.randint(10)]
else:
    target = reward + self.y * np.max(self.action_model.predict(state_prime))

So sometimes it will pick a random future reward based on my epsilon greedy policy?

BigBadMe

Posted 2019-02-09T15:33:02.100

Reputation: 395

Answers


Picking actions and making updates should be treated as separate things. For Q-learning you also need to explore by using some exploration strategy (e.g. $\epsilon$-greedy).

Steps for Q-learning:
1) initialize state $S$
For every step of the episode:
2) choose action $A$ by some exploratory policy (e.g. $\epsilon$-greedy) from state $S$
3) take action $A$ and observe $R$ and $S'$
4) do the update $Q(S, A) = Q(S, A) + \alpha(R + \gamma \max_a Q(S', a) - Q(S, A))$
5) update the state $S = S'$ and keep looping from step 2 until the end of episode
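
The Q-learning loop above can be sketched in code. This is a minimal tabular sketch, assuming a toy chain environment with a gym-style `reset()`/`step()` interface; the `ChainEnv` class and all hyperparameters are illustrative assumptions, not part of the question's neural-network setup:

```python
import numpy as np

class ChainEnv:
    """Toy corridor: action 1 moves right toward a goal, action 0 moves left.
    Reaching the last state ends the episode with reward 1."""
    def __init__(self, n_states=4):
        self.n_states = n_states
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = self.s + 1 if a == 1 else max(0, self.s - 1)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, eps, rng):
    # Exploratory choice: random action with probability eps, else greedy.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_episode(env, Q, alpha=0.2, gamma=0.9, eps=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    s = env.reset()                                   # 1) initialize S
    done = False
    while not done:
        a = epsilon_greedy(Q, s, eps, rng)            # 2) choose A exploratorily
        s_prime, r, done = env.step(a)                # 3) observe R and S'
        # 4) update toward the greedy value of S' (off-policy): the max is
        #    used regardless of which action will actually be taken in S'.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
        s = s_prime                                   # 5) S = S'
```

Note that the ϵ-greedy exploration only appears in step 2 (action choice); the update in step 4 always uses the max, which is exactly the separation between acting and updating described above.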

Steps for Sarsa:
1) initialize state $S$
2) initialize first action $A$ from state $S$ by some exploratory policy (e.g. $\epsilon$-greedy)
For every step of the episode:
3) take action $A$ and observe $R$ and $S'$
4) choose action $A'$ from state $S'$ by some exploratory policy (e.g. $\epsilon$-greedy)
5) do the update $Q(S, A) = Q(S, A) + \alpha(R + \gamma Q(S', A') - Q(S, A))$
6) update state and action $S = S'$, $A = A'$ and keep looping from step 3 until end of the episode
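
The Sarsa loop can be sketched the same way. Again a minimal tabular sketch under the same illustrative assumptions (a toy chain environment with a gym-style `reset()`/`step()` interface; names and hyperparameters are assumptions, not from the question):

```python
import numpy as np

class ChainEnv:
    """Toy corridor: action 1 moves right toward a goal, action 0 moves left.
    Reaching the last state ends the episode with reward 1."""
    def __init__(self, n_states=4):
        self.n_states = n_states
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = self.s + 1 if a == 1 else max(0, self.s - 1)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, eps, rng):
    # Exploratory choice: random action with probability eps, else greedy.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha=0.2, gamma=0.9, eps=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    s = env.reset()                                   # 1) initialize S
    a = epsilon_greedy(Q, s, eps, rng)                # 2) initialize first A
    done = False
    while not done:
        s_prime, r, done = env.step(a)                # 3) observe R and S'
        a_prime = epsilon_greedy(Q, s_prime, eps, rng)  # 4) choose A' from S'
        # 5) on-policy update: uses Q(S', A') for the action actually chosen,
        #    not the max over actions.
        Q[s, a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s, a])
        s, a = s_prime, a_prime                       # 6) S = S', A = A'
```

The key contrast with Q-learning is that $A'$ is sampled once per step and then used both for the update (step 5) and as the action actually executed on the next loop iteration (step 6).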

Brale

Posted 2019-02-09T15:33:02.100

Reputation: 1 664

Thanks for that, much appreciated. So with Sarsa, you actually follow the action A' that was chosen by the ϵ-greedy policy while performing the update. What do you do if it's a board game, and S' isn't actually the same because the other player has changed the state by taking their turn after my agent has performed its action? S' isn't what you thought it would be, and the chosen A' might not even be a legal move anymore. Can you not use Sarsa in that situation? – BigBadMe – 2019-02-09T23:23:22.160

With board games you would consider something called afterstates. When you make a move, you don't immediately transfer to the new state $S'$; you only move to the new state after your opponent makes a move as well. So the new state happens only after an entire turn of play. In that case you would pick $A'$ in the completely new state after one turn of play. – Brale – 2019-02-10T08:50:46.727

So to be clear, in that scenario, when the other players have performed their actions and it comes back round to my agent's turn, the state at that moment would be $S'$? At that point, if it were to always pick the most-greedy action $\max_a Q(S', a)$ at that $S'$ then that would be Q-learning, and if I were to choose an ϵ-greedy action then it would be Sarsa? – BigBadMe – 2019-02-10T09:49:56.623

Yes, when the turn returns to you, you are in $S'$. Again, don't confuse updates with action choices: for the update, Q-learning uses the Q-value of the maximizing action in state $S'$, while Sarsa picks an action with $\epsilon$-greedy and uses that action's value. For the action choice, Q-learning uses $\epsilon$-greedy in state $S'$ on the next loop iteration, after the update has been made, while Sarsa uses the action $A'$ it already picked on the last loop iteration. – Brale – 2019-02-10T10:33:59.950
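
In the question's own notation, that distinction comes down to a single line when computing the target. A minimal sketch, where `q_s_prime` is assumed to stand in for the model's predicted Q-values for $S'$ (the array, reward, and chosen action are illustrative numbers, not from a real model):

```python
import numpy as np

# Hypothetical Q-value predictions for state S', one entry per action
# (stand-in for something like action_model.predict(state_prime)).
q_s_prime = np.array([0.1, 0.5, 0.3])
reward, gamma = 1.0, 0.9

# Q-learning target: greedy value of S', independent of the action taken there.
target_q = reward + gamma * np.max(q_s_prime)

# Sarsa target: value of the action A' the eps-greedy policy actually chose
# (suppose here the policy happened to pick action 2).
a_prime = 2
target_sarsa = reward + gamma * q_s_prime[a_prime]
```

Whenever the policy's chosen $A'$ happens to be the greedy action, the two targets coincide; they differ only on exploratory steps.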