## Will Q-learning converge to the optimal state-action function when the reward periodically changes?


Imagine that the agent receives a positive reward upon reaching a state $s$. Once the state has been reached, the positive reward associated with it vanishes and appears somewhere else in the state space, say at state $s'$. The reward associated with $s'$ also vanishes when the agent visits that state once, and re-appears at state $s$. This repeats periodically forever. Will discounted Q-learning converge to the optimal policy in this setup? If yes, is there any proof out there? I couldn't find anything.

"Once the state has been reached, the positive reward associated with it vanishes and appears somewhere else in the state space, say at state $s'$": why should the reward vanish? The reward is just a scalar value you receive from the environment after having performed an action. The reward can't really vanish. Maybe you mean the "return" or "value" of a state? – nbro – 2019-02-06T16:31:39.780

No, assume that the reward is dynamic and the environment gives the agent a positive scalar value only for the first visit. – Perissiane – 2019-02-06T16:32:30.763

How is "assume that the reward is dynamic and the environment gives the agent a positive scalar value only for the first visit" related to your question? Actually, in your question, you're saying that "sometimes you receive a reward for entering state $s$ and sometimes for entering state $s'$". This can be just a regular scenario if the environment is stochastic: in general, if the environment is stochastic and you enter state $s$ once and receive a reward $r$, on future visits of state $s$ the environment may give you a different reward, say $r'$ (or $-r$). – nbro – 2019-02-06T16:35:56.987

I should have been clearer: assume the agent visits state $s$ for the first time. The reward for this transition is positive. However, the agent will not receive any positive reward if it visits state $s$ a second time, and so on. The positive reward is now given to the agent if it visits state $s'$. Once the agent visits state $s'$, it gets a positive reward and the positive reward goes back to state $s$. So, somehow, the agent has to learn a cyclic behavior between states $s$ and $s'$. – Perissiane – 2019-02-06T17:10:40.093
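The setup described above can be sketched as a toy environment (a hypothetical illustration, not code from the question; the chain length, state indices, and left/right action set are made-up assumptions):

```python
# Hypothetical sketch of the questioner's setup: a small chain of states
# with two special states s and s_prime. A positive reward sits at exactly
# one of them at a time; collecting it moves the reward to the other state.

class AlternatingRewardEnv:
    def __init__(self, n_states=5, s=1, s_prime=3):
        self.n_states = n_states
        self.s, self.s_prime = s, s_prime
        self.reward_at = s          # where the reward currently sits
        self.pos = 0

    def reset(self):
        self.pos = 0
        self.reward_at = self.s
        return self.pos

    def step(self, action):         # action: -1 (move left) or +1 (move right)
        self.pos = max(0, min(self.n_states - 1, self.pos + action))
        reward = 0.0
        if self.pos == self.reward_at:
            reward = 1.0
            # the reward "vanishes" here and reappears at the other state
            self.reward_at = self.s_prime if self.reward_at == self.s else self.s
        return self.pos, reward
```

Note that the observation returned is just `self.pos`: the hidden variable `reward_at` is exactly the information the agent cannot see, which is what causes the trouble discussed in the answer below.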


No, it will not converge in the general case (it might in extremely convenient special cases, but I haven't thought hard enough about that to be sure).

Practically everything in Reinforcement Learning theory (including convergence proofs) relies on the Markov property: the assumption that the current state $s_t$ includes all relevant information, and that the history leading up to $s_t$ is no longer relevant. In your case, this property is violated; it is important to remember whether or not we visited $s$ more recently than $s'$.

I suppose if you "enhance" your states such that they include that piece of information, then it should converge again. This means that you'd essentially double your state space. For every state that you have in your "normal" state space, you'd have to add a separate copy that would be used in cases where $s$ was visited more recently than $s'$.
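The "enhancement" could be sketched as a wrapper like the following (a hypothetical illustration, not from the answer; it assumes the underlying environment exposes the two special states as attributes `env.s` and `env.s_prime` and returns `(position, reward)` from `step`):

```python
# Restore the Markov property by appending one bit to the observation:
# which of s / s' was visited more recently. This doubles the state space.

def augment(pos, last_was_s):
    """Map (position, flag) to a single tabular state index."""
    return pos * 2 + (1 if last_was_s else 0)

class MarkovWrapper:
    def __init__(self, env):
        self.env = env
        self.last_was_s = False    # assume the reward initially sits at s

    def reset(self):
        pos = self.env.reset()
        self.last_was_s = False
        return augment(pos, self.last_was_s)

    def step(self, action):
        pos, reward = self.env.step(action)
        if pos == self.env.s:
            self.last_was_s = True
        elif pos == self.env.s_prime:
            self.last_was_s = False
        return augment(pos, self.last_was_s), reward
```

Tabular Q-learning on the augmented indices then sees a stationary MDP, so the usual convergence arguments apply again.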

@Perissiane I can't really comment on that without specifics... maybe they "enhanced" the states as I described. Or maybe by "handle" they mean "turns out to perform decently well empirically", which is of course very different from having a theoretical convergence proof. Often, when there are "mild" violations of things like the Markov property, most RL algorithms can still perform well empirically; it's just the theory that breaks down first. – Dennis Soemers – 2019-02-07T10:06:42.860