As far as I'm aware, it is still somewhat of an open problem to get a really clear, formal understanding of exactly why / when we get a lack of convergence -- or, worse, sometimes a danger of divergence. It is typically attributed to the **"deadly triad"** (see 11.3 of the second edition of Sutton and Barto's book), the combination of:

- Function approximation, AND
- Bootstrapping (using our own value estimates in the computation of our training targets, as done by $Q$-learning), AND
- Off-policy training ($Q$-learning is indeed off-policy).

That only gives us a (possibly non-exhaustive) description of cases in which we have a lack of convergence and/or a danger of divergence, but still doesn't tell us **why** it happens in those cases.

John's answer already provides the intuition that part of the problem is simply that function approximation can easily lead to situations where your function approximator isn't powerful enough to represent the true $Q^*$ function; there may always be approximation errors that are impossible to get rid of without switching to a different function approximator.

Personally, I think this intuition does help to understand why the algorithm cannot guarantee convergence to the optimal solution, but I'd still intuitively expect it to maybe be capable of "converging" to some "stable" solution that is the best possible approximation given the restrictions inherent in the chosen function representation. Indeed, this is what we observe in practice when we switch to on-policy training (e.g. Sarsa), at least in the case with linear function approximators.

My own intuition with respect to this question has generally been that an important source of the problem is **generalisation**. In the tabular setting, we have completely isolated entries $Q(s, a)$ for all $(s, a)$ pairs. Whenever we update our estimate for one entry, it leaves all other entries unmodified (at least initially -- there may be some effects on other entries in future updates due to bootstrapping in the update rule). Update rules for algorithms like $Q$-learning and Sarsa may sometimes update in the "wrong" direction if we get "unlucky", but **in expectation** they generally update in the correct direction. Intuitively, this means that, in the tabular setting, **in expectation** we will slowly, gradually fix any mistakes in any entries in isolation, without risking harm to other entries.
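This isolation is easy to see in a small sketch (a toy setup with illustrative sizes and hyperparameters; `tabular_q_update` is just a hypothetical helper name): a single tabular update touches exactly one entry of the table.

```python
import numpy as np

# Hypothetical toy setup: 3 states, 2 actions, illustrative hyperparameters.
n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def tabular_q_update(s, a, r, s_next):
    """One tabular Q-learning step: only the (s, a) entry changes."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

before = Q.copy()
tabular_q_update(s=0, a=1, r=1.0, s_next=2)

# Every entry except (0, 1) is untouched by this single update.
changed = (Q != before)
print(changed.sum())  # 1
```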

With function approximation, when we update our $Q(s, a)$ estimate for one $(s, a)$ pair, it can potentially also affect **all** of our other estimates for **all** other state-action pairs. Intuitively, this means that we no longer have the nice isolation of entries as in the tabular setting, and "fixing" mistakes in one entry may have a risk of adding new mistakes to other entries. However, like John's answer, this whole intuition would really also apply to on-policy algorithms, so it still doesn't explain what's special about $Q$-learning (and other off-policy approaches).
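The contrast with the tabular case can be sketched with a linear approximator $Q(s, a) = w^\top \phi(s, a)$ (the dense random features here are purely illustrative, chosen so that all state-action pairs share weights):

```python
import numpy as np

# Hypothetical linear approximator: Q(s, a) = w . phi(s, a),
# with dense random features so every pair shares the weight vector.
rng = np.random.default_rng(0)
n_states, n_actions, n_features = 3, 2, 4
gamma, alpha = 0.9, 0.1
phi = rng.normal(size=(n_states, n_actions, n_features))
w = np.zeros(n_features)

def q_values():
    return phi @ w  # all Q estimates, shape (n_states, n_actions)

def linear_q_update(s, a, r, s_next):
    """One semi-gradient Q-learning step on the shared weights w."""
    global w
    target = r + gamma * np.max(q_values()[s_next])
    td_error = target - q_values()[s, a]
    w = w + alpha * td_error * phi[s, a]

before = q_values().copy()
linear_q_update(s=0, a=1, r=1.0, s_next=2)

# Because w is shared, updating (0, 1) generally moves the estimates
# for (almost) all other state-action pairs as well.
changed = ~np.isclose(q_values(), before)
print(changed.sum())  # more than 1 entry moved
```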

A very interesting recent paper on this topic is *Non-delusional Q-learning and Value Iteration*. They point out a problem of "delusional bias" in algorithms that combine function approximation with update rules involving a $\max$ operator, such as $Q$-learning (it's probably not unique to the $\max$ operator, and may apply to off-policy training in general?).

The problem is as follows. Suppose we run this $Q$-learning update for a state-action pair $(s, a)$:

$$Q(s, a) \gets Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right].$$

The value estimate $\max_{a'} Q(s', a')$ used here is based on the assumption that we execute a policy that is greedy with respect to older versions of our $Q$ estimates over a -- possibly very long -- trajectory. As already discussed in some of the previous answers, our function approximator has a limited representational capacity, and updates to one state-action pair may affect value estimates for other state-action pairs. This means that, after triggering our update to $Q(s, a)$, **our function approximator may no longer be able to simultaneously express the policy that leads to the high returns that our $\max_{a'} Q(s', a')$ estimate was based on**. The authors of this paper say that the algorithm is "delusional". It performs an update under the assumption that, down the line, it can still obtain large returns, but it may no longer actually be powerful enough to obtain those returns with the new version of the function approximator's parameters.
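A minimal hand-built instance of this effect (all numbers and names are illustrative, not from the paper): with a single shared weight, $Q(s, a) = w \cdot \phi(s, a)$, one update can flip the sign of $w$, so the greedy action at $s'$ that the bootstrapped target assumed is no longer the greedy action after the update.

```python
# Hand-built one-feature example: Q(s, a) = w * phi[s, a], so a single
# shared weight couples every state-action value estimate.
gamma, alpha = 0.9, 0.5
phi = {('s', 'a'): 1.0, ('s2', 'a0'): 1.0, ('s2', 'a1'): 2.0}
w = 1.0

def q(sa):
    return w * phi[sa]

# The target for (s, a) bootstraps from the greedy action at s2 under
# the CURRENT weights: 'a1', with assumed value w * 2 = 2.
assumed_greedy = max(['a0', 'a1'], key=lambda a: q(('s2', a)))
target = -10.0 + gamma * q(('s2', assumed_greedy))

# Semi-gradient step on the shared weight flips its sign.
w += alpha * (target - q(('s', 'a'))) * phi[('s', 'a')]

# The greedy action at s2 has changed: the high value the target
# "promised" is no longer expressible by the updated approximator.
new_greedy = max(['a0', 'a1'], key=lambda a: q(('s2', a)))
print(assumed_greedy, new_greedy)  # a1 a0
```

The update was computed as if the agent could still collect the value of `'a1'` at `s2`, but the very same update made that policy unrepresentable under the new parameters; this is the "delusion" the paper describes.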

Finally, another (even more recent) paper that I suspect is relevant to this question is Diagnosing Bottlenecks in Deep Q-learning Algorithms, but unfortunately I have not yet had the time to read it in sufficient detail and adequately summarise it.

Great question! – John Doucette – 2019-04-05T19:26:00.673

The book that you referenced talks about this problem in chapter 11, so you might want to read it. Also, I don't think there's a formal proof of why this happens, but there are a few examples that show divergence even in simple environments (e.g. Tsitsiklis and van Roy). – Brale – 2019-04-05T20:21:20.297