## What are the conditions for the convergence of SARSA to the optimal value function?


Is it correct that, for SARSA to converge to the optimal value function (and policy), the following conditions must be satisfied?

1. The learning rate parameter $$\alpha$$ must satisfy the conditions: $$\sum_k \alpha_{n_k(s,a)} =\infty \quad \text{and}\quad \sum_k \alpha_{n_k(s,a)}^{2} <\infty \quad \forall (s,a) \in \mathcal{S} \times \mathcal{A},$$ where $$n_k(s,a)$$ denotes the $$k^\text{th}$$ time $$(s,a)$$ is visited.

2. $$\epsilon$$ (of the $$\epsilon$$-greedy policy) must be decayed so that the policy converges to a greedy policy.

3. Every state-action pair is visited infinitely many times.
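For concreteness (this is just an illustrative aside, not part of the conditions themselves), the classic schedule $\alpha_n = 1/n$ satisfies condition 1: the partial sums grow without bound (roughly like $\log N$), while the partial sums of squares stay bounded (converging to $\pi^2/6$).

```python
import math

def partial_sums(N):
    """Partial sums of alpha_n = 1/n and of alpha_n^2, up to N terms."""
    s = sum(1.0 / n for n in range(1, N + 1))       # grows like log N (diverges)
    s2 = sum(1.0 / n**2 for n in range(1, N + 1))   # bounded, -> pi^2 / 6
    return s, s2

for N in (10**2, 10**4, 10**6):
    s, s2 = partial_sums(N)
    print(f"N={N:>7}: sum={s:8.3f} (log N = {math.log(N):.3f}), "
          f"sum of squares={s2:.6f} (pi^2/6 = {math.pi**2 / 6:.6f})")
```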

Are any of these conditions redundant?

There is probably some relation to the fact that the TD(0) estimate converges to the true value function with probability 1 when the learning rate is decayed as above. There also exists some interval of learning rates $\alpha \in (0,p)$ such that the TD estimate converges in expected value to the true value function. – KaneM – 2020-02-27T13:17:15.150


The paper *Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms* by Satinder Singh et al. proves that SARSA(0), in the case of a tabular representation of the value functions, converges to the optimal value function, provided that certain assumptions are met:

1. Infinite visits to every state-action pair
2. The learning policy becomes greedy in the limit

The properties are more formally stated in lemma 1 (page 7 of the pdf) and theorem 1 (page 8). The Robbins–Monro conditions should ensure that each state-action pair is visited infinitely often.
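A common way to satisfy both assumptions at once (a sketch of my own, not taken from the paper) is a GLIE-style schedule such as $\epsilon_k = 1/k$, where $k$ counts visits to the current state: exploration never stops entirely, so every action keeps being tried, yet $\epsilon \to 0$, so the policy becomes greedy in the limit.

```python
import random

def glie_epsilon_greedy(q_values, visit_count, rng=random):
    """Pick an action epsilon-greedily with epsilon = 1 / visit_count.

    q_values: list of Q(s, a) values for the current state s.
    visit_count: number of times state s has been visited so far (>= 1).
    Since sum(1/k) diverges, exploration continues forever, while
    epsilon -> 0 makes the policy greedy in the limit.
    """
    epsilon = 1.0 / visit_count
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit
```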

I was already writing this answer before the other answer was published, but these answers are equivalent. I am only citing the paper that originally proved this.

– nbro – 2020-02-27T13:40:42.483

Thanks for the answer. Sorry about my poor formatting of posts here. Just learning about markdown. – KaneM – 2020-02-27T13:45:30.737

@KaneM No problem! Have a look at the source code of my answer and of yours (now, after my edits) to better understand how markdown works :) Anyway, I am not completely sure that the Robbins-Monro conditions ensure that each state is visited infinitely often. Maybe it's the other way around: if each state-action pair is visited infinitely often, then the Robbins-Monro conditions are satisfied. – nbro – 2020-02-27T13:48:40.740

I believe they are separate conditions. From my understanding, the RM conditions ensure that the Q function eventually converges to some value, while the condition that every state-action pair is visited infinitely often ensures that it converges to the correct value. – KaneM – 2020-02-27T13:59:25.467

@KaneM These conditions are very related. I'm linking to a paper that seems to relate the two in a more precise way, but I haven't yet fully read this paper. – nbro – 2020-02-27T14:03:57.477

I wonder whether, if the Robbins-Monro conditions do not hold, there exists some range of $\alpha$ such that the policy converges, in some notion of expectation, to the optimal policy. – KaneM – 2020-02-27T14:32:34.420

@KaneM This is an interesting question. Maybe ask on the site! I wouldn't currently be able to answer it, but maybe there's already some research work that goes in this direction. In all convergence results I've seen so far, the RM conditions are assumed (essentially, because these RL theorems and proofs follow from the original results by Robbins and Monro, AFAIK). – nbro – 2020-02-27T14:43:54.107


I found the conditions for convergence in the notes SARSA convergence by Nahum Shimkin:

1. The Robbins-Monro conditions above hold for $$\alpha_t$$.

2. Every state-action pair is visited infinitely often.

3. The policy becomes greedy with respect to $$Q$$ in the limit.

4. The controlled Markov chain is communicating: every state can be reached from any other with positive probability (under some policy).

5. $$\operatorname{Var}[R(s, a)] < \infty$$, where $$R$$ is the reward function.
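Conditions 1 and 2 can be satisfied together by using a per-pair learning rate $\alpha = 1/n(s,a)$. The following is a minimal sketch of the resulting tabular SARSA(0) update (the discount factor and the interface are my own assumptions, not part of Shimkin's notes):

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular action-value estimates Q[(s, a)]
n = defaultdict(int)     # visit counts n(s, a)
gamma = 0.9              # assumed discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """One tabular SARSA(0) update with alpha = 1 / n(s, a).

    This per-pair schedule satisfies the Robbins-Monro conditions for
    every pair (s, a) that is visited infinitely often.
    """
    n[(s, a)] += 1
    alpha = 1.0 / n[(s, a)]
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```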