## Is the Bellman equation that uses sampling weighted by the Q values (instead of max) a contraction?


It is proved that the Bellman update is a contraction (1).

Here is the Bellman update that is used for Q-Learning:

$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha*(r(s, a, s') + \gamma \max_{a^*} (Q_{t}(s', a^*)) - Q_t(s,a)) \tag{1} \label{1}$$

The proof that (\ref{1}) is a contraction relies on several facts; the one relevant to this question is that the max operation is non-expansive, that is:

$$\lvert \max_a f(a)- \max_a g(a) \rvert \leq \max_a \lvert f(a) - g(a) \rvert \tag{2}\label{2}$$
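
As a quick sanity check of (\ref{2}), the inequality can be tested numerically on random vectors. This is only a sketch: the helper names `max_gap` and `sup_gap` are made up for illustration, and a random search can support the inequality but does not prove it.

```python
import random


def max_gap(f, g):
    """Left-hand side of (2): |max_a f(a) - max_a g(a)|."""
    return abs(max(f) - max(g))


def sup_gap(f, g):
    """Right-hand side of (2): max_a |f(a) - g(a)|."""
    return max(abs(x - y) for x, y in zip(f, g))


rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(10_000):
    n = rng.randint(1, 6)  # number of actions
    f = [rng.uniform(-10, 10) for _ in range(n)]
    g = [rng.uniform(-10, 10) for _ in range(n)]
    if max_gap(f, g) > sup_gap(f, g):
        violations += 1

print("violations of (2):", violations)
```

No violations should show up, which is consistent with the max being non-expansive in the sup norm.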

This is also proved in a lot of places and it is pretty intuitive.

Consider the following Bellman update:

$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha*(r(s, a, s') + \gamma SAMPLE_{a^*} (Q_{t}(s', a^*)) - Q_t(s,a)) \tag{3}\label{3}$$

where $SAMPLE_a(Q(s, a))$ samples an action in that state with probability weighted by the Q values of the actions, and returns the Q value of the sampled action.
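
To make the SAMPLE operation concrete, here is a minimal sketch of one possible reading of (\ref{3}). It assumes the Q values in a state are nonnegative with a positive sum and that the sampling probabilities are directly proportional to them; the question does not fix the normalization, so that choice is an assumption, and the function names are hypothetical.

```python
import random


def sample_value(q_values, rng):
    """Sample an action with probability proportional to its Q value
    and return the Q value of that action.
    Assumption: all Q values are >= 0 and their sum is > 0."""
    action = rng.choices(range(len(q_values)), weights=q_values, k=1)[0]
    return q_values[action]


def expected_sample_value(q_values):
    """Expectation of sample_value: sum_a p(a) * Q(a) with p(a) = Q(a) / sum(Q)."""
    total = sum(q_values)
    return sum(q * q / total for q in q_values)


def sampled_update(q, s, a, r, s_next, alpha, gamma, rng):
    """One step of update (3) on a table q[state][action], modified in place."""
    target = r + gamma * sample_value(q[s_next], rng)
    q[s][a] += alpha * (target - q[s][a])


rng = random.Random(0)
q = [[0.0, 0.0], [0.0, 2.0]]  # two states, two actions
sampled_update(q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9, rng=rng)
print(q[0][0])
```

Under this reading the sampling distribution is itself a function of the current Q values, which matters when comparing two different Q functions.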

Is this new Bellman operation still a contraction?

Is the SAMPLE operation non-expansive? It is, of course, possible to generate individual samples that do not satisfy inequality (\ref{2}). What I am asking is whether it is non-expansive in expectation.

My approach is:

$$\left\lvert \mathbb{E}_{a \sim Q}[f(a)] - \mathbb{E}_{a \sim Q}[g(a)] \right\rvert \leq \mathbb{E}_{a \sim Q}\left[\lvert f(a) - g(a) \rvert\right] \tag{4} \label{4}$$

Equivalently:

$$\left\lvert \mathbb{E}_{a \sim Q}[f(a) - g(a)] \right\rvert \leq \mathbb{E}_{a \sim Q}\left[\lvert f(a) - g(a) \rvert\right]$$

(\ref{4}) is true since:

$$\lvert \mathbb{E}[X] \rvert \leq \mathbb{E}\left[\lvert X \rvert\right]$$
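
One point a numerical check can probe is that (\ref{4}) averages $f$ and $g$ under the same distribution, whereas in (\ref{3}) the sampling weights come from the Q values themselves, so two different Q functions would induce two different distributions. The sketch below compares both cases on one hand-picked pair of vectors. It assumes nonnegative values and probabilities directly proportional to them, which is an assumed normalization, and the helper names are hypothetical.

```python
def expectation(values, probs):
    """E_{a ~ probs}[values(a)] for a finite action set."""
    return sum(p * v for p, v in zip(probs, values))


def proportional(values):
    """Distribution with p(a) proportional to values(a); assumes values >= 0."""
    total = sum(values)
    return [v / total for v in values]


f = [1.0, 10.0]
g = [0.5, 10.5]

# Sup-norm distance between f and g (the bound a non-expansion must respect).
sup_gap = max(abs(x - y) for x, y in zip(f, g))

# Case A: one shared distribution for both functions, as in (4).
p = proportional(f)
shared_gap = abs(expectation(f, p) - expectation(g, p))

# Case B: each function is averaged under the distribution it induces itself.
own_gap = abs(expectation(f, proportional(f)) - expectation(g, proportional(g)))

print("sup-norm gap:", sup_gap)
print("gap with shared distribution:", shared_gap)
print("gap with own distributions:", own_gap)
```

For this pair, the shared-distribution gap stays below the sup-norm gap of 0.5, while the gap with self-induced distributions comes out larger than 0.5. So, at least under these assumptions, (\ref{4}) alone does not cover the case where the distribution changes with the function.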

However, I am not sure whether proving (\ref{4}) proves the theorem. Do you think this is a legitimate proof that (\ref{3}) is a contraction?

(If so, this would mean that Q-learning with a stochastic policy converges in theory, and we could use stochastic policies with regular Q-learning; this is why I am interested.)

Both intuitive answers and mathematical proofs are welcome.

(1) is a Bellman update; it was a copy-paste error that the RHS had t+1 (sorry about that). Thanks for noticing; I have fixed the error now. – sirfroggy – 2020-07-24T11:29:55.557

Your question is not very clear to me, since $f(a)$ and $g(a)$ are not clear to me. The formulas are intuitive and individually correct, but I am not sure that arriving at the intuitive forms you mention is so easy. Check this link for example: http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf As a side note, I do not think proving convergence is so easy. There is a topic called concentration inequalities that has to be studied to prove convergence. I think you can use it to prove your theorems.

– DuttaA – 2020-07-25T08:22:00.887