
The conditions for convergence of SARSA(0) to the optimal policy are:

- The learning rates $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

- Every state-action pair is visited infinitely often.

- The policy becomes greedy with respect to $Q$ in the limit (i.e., the GLIE condition: greedy in the limit with infinite exploration).

- The controlled Markov chain is communicating: every state can be reached from every other state with positive probability under some policy.

- $\operatorname{Var}[R(s, a)] < \infty$ for every pair $(s, a)$, where $R$ is the reward function.
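To make these conditions concrete, here is a minimal sketch (my own, not from any reference) of tabular SARSA(0) on a hypothetical two-state chain: the per-pair step size $\alpha = 1/n(s,a)$ satisfies the Robbins-Monro conditions, and the $\varepsilon$-greedy schedule $\varepsilon = 1/\text{episode}$ is a GLIE schedule, so every pair keeps being visited while the policy turns greedy in the limit.

```python
import random

ACTIONS = [0, 1]  # 0 = "left", 1 = "right" (toy MDP, my own construction)
GAMMA = 0.9

def step(state, action):
    """Deterministic toy dynamics: action 1 moves toward state 1,
    and only the pair (state=1, action=1) pays reward 1."""
    if action == 1:
        return 1, (1.0 if state == 1 else 0.0)
    return 0, 0.0

def eps_greedy(Q, state, eps):
    """Behaviour policy: random with probability eps, else greedy in Q."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa0(episodes=2000, horizon=20, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    n = {(s, a): 0 for s in (0, 1) for a in ACTIONS}  # visit counts
    for ep in range(1, episodes + 1):
        eps = 1.0 / ep  # GLIE exploration schedule
        s = 0
        a = eps_greedy(Q, s, eps)
        for _ in range(horizon):
            s2, r = step(s, a)
            a2 = eps_greedy(Q, s2, eps)
            n[(s, a)] += 1
            alpha = 1.0 / n[(s, a)]  # Robbins-Monro: sum = inf, sum sq < inf
            # SARSA(0) update: bootstrap on the action actually taken next
            Q[(s, a)] += alpha * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```

On this chain the greedy policy recovered from the learned $Q$ is "always go right"; the estimates themselves approach their true values (e.g., $Q^*(1,1) = 1/(1-\gamma) = 10$) only slowly under the $1/n$ schedule, which is part of why step-size choice matters in practice.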

The original proof of convergence for TD(0) prediction (page 24 of the paper Learning to Predict by the Methods of Temporal Differences) established convergence in the mean of the estimate to the true value function. That proof did not require the learning-rate parameter to satisfy the Robbins-Monro conditions.

I was wondering: if the Robbins-Monro conditions are removed from the SARSA(0) assumptions, would the policy still converge to the optimal policy in some notion of expectation?