## Convergence of a delayed policy update Q-learning

I thought of an algorithm that slightly twists standard Q-learning, but I am not sure whether convergence to the optimal Q-values can be guaranteed.

The algorithm starts with an initial policy. Within each episode, the algorithm conducts policy evaluation and does NOT update the policy. Once the episode is done, the policy is updated to the greedy policy with respect to the currently learnt Q-values. The process then repeats. I attached the algorithm as a picture.

Just to emphasize: the behaviour policy does not change within an episode. The policy at each state is updated AFTER the episode is done, using the current Q-table.
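To make the idea concrete, here is a minimal tabular sketch of what I mean. The chain environment, the hyperparameters, and the exploring starts are my own placeholders for illustration; the behaviour is epsilon-greedy around the policy, which stays fixed for the whole episode, and the greedy improvement happens only between episodes.

```python
import numpy as np

# Toy deterministic chain MDP (a placeholder, not part of the algorithm itself):
# states 0..4, actions 0 = left, 1 = right; reward 1 for reaching state 4.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == GOAL), s_next == GOAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
policy = np.zeros(N_STATES, dtype=int)  # arbitrary initial policy (all "left")
alpha, gamma, epsilon = 0.5, 0.9, 0.3   # placeholder hyperparameters

for episode in range(500):
    s = int(rng.integers(GOAL))         # exploring starts (my own choice here)
    for _ in range(50):
        # Behaviour: epsilon-greedy around the policy, which is FIXED all episode.
        a = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else policy[s]
        s_next, r, done = step(s, a)
        # On-policy evaluation: bootstrap with the fixed policy's own action.
        target = r if done else r + gamma * Q[s_next, policy[s_next]]
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break
    # Delayed improvement: greedy policy update only AFTER the episode ends.
    policy = Q.argmax(axis=1)

print(policy[:GOAL])
```

For this chain the learned policy should settle on moving right in every non-terminal state. Note the update target uses $Q(s_{t+1}, \pi(s_{t+1}))$ rather than $\max_a Q(s_{t+1}, a)$, which is exactly the on-policy detail discussed in the comments below.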

Has anyone seen this kind of Q-learning before? If so, could you please kindly guide me to some resources regarding the convergence? Thank you!

It's on-policy generalised policy iteration as written - kind of a hybrid of Monte Carlo control and SARSA - so I am surprised to see it called Q learning. Also, no exploration, which could be a weakness in some environments. Unfortunately I have not seen it before and would not know where to find convergence resources. Could you perhaps link to where you found it, because that may give some clues? – Neil Slater – 2020-05-22T23:20:37.867

@NeilSlater Thanks for the comment. This is something I had in mind and wrote down myself, so I do not have a link. As for the lack of exploration, I don't think the action selection here is a big deal: one can just replace the sampling with an epsilon-greedy selection. Also, could you please explain why this looks like SARSA? – Scott Guan – 2020-05-22T23:32:05.653

To make it off-policy you could change the Q value update step to take a max over possible actions in $s_{t+1}$ instead of making the on-policy update. That doesn't fix the lack of exploration, but it does mean you would be estimating a target policy different from the current behaviour policy. – Neil Slater – 2020-05-22T23:35:35.633

Yes, it would be simple enough to make the behaviour policy epsilon-greedy. I suggest you write it like that, though, because it is an important detail if you want to talk about convergence. Without exploration, the convergence guarantees will be weaker. – Neil Slater – 2020-05-22T23:37:36.183

@NeilSlater I will update it. Thanks. – Scott Guan – 2020-05-23T00:47:16.923

Thanks for the update. The new algorithm is essentially Q-learning with separate batched evaluation steps and policy update steps. Could you clarify what the $\epsilon$-greedy is evaluated over: is the greedy action considered to be $\text{argmax}_a Q_t(s_t, a)$ or is it $\pi_i(s_t)$? This will make a difference to the speed of convergence in some cases. – Neil Slater – 2020-05-23T08:34:48.207

@NeilSlater We can use either. Personally, I don't care about the convergence rate or sample complexity for now; I just want to know whether it converges. – Scott Guan – 2020-05-23T12:56:04.500

I suggest picking one in the algorithm, to make it concrete what the behaviour policy is based on. Although it might seem to affect only the convergence rate, it may actually affect the convergence guarantees too. These kinds of details can be important. – Neil Slater – 2020-05-23T13:39:28.483