I have been thinking about an algorithm that twists standard Q-learning slightly, but I am not sure whether convergence to the optimal Q-values can be guaranteed.
The algorithm starts with an initial policy. Within each episode, it performs policy evaluation and does NOT update the policy. Once the episode ends, the policy is replaced by the greedy policy with respect to the currently learnt Q-values. The process then repeats. I attached the algorithm as a picture.
Just to emphasize: the policy being followed does not change within an episode. The policy at each state is updated only AFTER an episode finishes, using the Q-table.
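To make the scheme concrete, here is a minimal tabular sketch of what I mean. The 3-state chain environment, the epsilon-greedy exploration, and all hyperparameters are just illustrative assumptions on my part, not taken from the attached algorithm:

```python
import numpy as np

# Illustrative 3-state chain MDP: actions 0 = left, 1 = right,
# reward 1 on reaching the rightmost (terminal) state.
N_STATES, N_ACTIONS = 3, 2

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
policy = rng.integers(N_ACTIONS, size=N_STATES)  # arbitrary initial policy
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(200):
    s, done = 0, False
    while not done:
        # behave eps-greedily around the FROZEN policy, for exploration
        a = policy[s] if rng.random() > eps else rng.integers(N_ACTIONS)
        s2, r, done = step(s, a)
        # policy evaluation: the TD(0) target follows the frozen policy's
        # action at s2, NOT the greedy max -- no improvement mid-episode
        target = r if done else r + gamma * Q[s2, policy[s2]]
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
    # policy improvement only AFTER the episode: greedy w.r.t. current Q
    policy = Q.argmax(axis=1)

print(policy)  # learned greedy action per state
```

The only difference from standard Q-learning, as I understand it, is that the `max` in the TD target is replaced by the frozen policy's action, with the greedy improvement deferred to the episode boundary.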
Has anyone seen this kind of Q-learning before? If so, could you please point me to some resources on its convergence? Thank you!