Imagine that the agent receives a positive reward upon reaching some state s. Once s has been reached, the reward associated with it vanishes and reappears somewhere else in the state space, say at state s′. The reward at s′ likewise vanishes once the agent visits that state, reappearing back at s. This alternation repeats forever. Will discounted Q-learning converge to the optimal policy in this setup? If yes, is there a proof out there? I couldn't find anything.
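To make the setup concrete, here is a minimal sketch of what I have in mind (a toy formalization of my own, not taken from any paper): a chain of positions where a single reward alternates between the two ends each time it is collected, and a tabular Q-learner that only observes its position. Since the observation omits the reward's current location, the reward is non-stationary (non-Markovian) from the agent's point of view, which is exactly what makes me doubt the standard convergence guarantees apply.

```python
import random

N_STATES = 5          # positions 0..4; the reward alternates between the two ends
ACTIONS = [-1, +1]    # move left / move right

def step(pos, action, reward_at):
    """One environment transition. The reward teleports to the
    opposite end of the chain whenever it is collected."""
    pos = max(0, min(N_STATES - 1, pos + action))
    r = 0.0
    if pos == reward_at:
        r = 1.0
        reward_at = (N_STATES - 1) - reward_at  # reward jumps to the other end
    return pos, r, reward_at

def q_learning(episodes=200, steps=50, alpha=0.1, gamma=0.9, eps=0.1):
    """Standard tabular Q-learning with an epsilon-greedy policy.
    It (wrongly, for this setting) treats the observed position alone
    as a Markov state, ignoring where the reward currently sits."""
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        pos, reward_at = 0, N_STATES - 1
        for _ in range(steps):
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max(range(2), key=lambda i: Q[pos][i])
            nxt, r, reward_at = step(pos, ACTIONS[a], reward_at)
            # Q-learning update with discount factor gamma
            Q[pos][a] += alpha * (r + gamma * max(Q[nxt]) - Q[pos][a])
            pos = nxt
    return Q

random.seed(0)
Q = q_learning()
```

The learned Q-values stay bounded, but since each observed position is sometimes rewarding and sometimes not, the learner is effectively chasing a moving target.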