
In the textbook "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto, the concept of Maximization Bias is introduced in section 6.7, and an example is used to discuss how Q-learning "over-estimates" action-values. However, the textbook does not present a formal proof of this, and I couldn't find one anywhere on the internet either.

After reading the paper on Double Q-learning by Hado van Hasselt (link), I could understand to some extent why Q-learning "over-estimates" action values. Here is my (vague, informal) construction of a mathematical proof:

We know that Temporal-Difference methods (just like Monte Carlo methods) use sample returns instead of true expected returns as estimates to find the optimal policy. These sample returns, when averaged, converge to the true expected returns over infinitely many trials, provided all the state-action pairs are visited. Thus the following notation is used,

$$\mathbb{E}[Q()] \rightarrow q_\pi()$$ where $Q()$ is calculated from the sample return $G_t$ observed at every time-step. Over infinite trials, this sample return, when averaged, converges to its expected value, which is the true $Q$-value under the policy $\pi$. Thus $Q()$ is really an estimate of the true $Q$-value $q_\pi$.
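To make the "averaged sample returns converge to the expected value" statement concrete, here is a minimal sketch. It assumes a single state-action pair whose returns are drawn from a Gaussian with a made-up true value of 1.0; the names `running_average` and `true_value` are purely illustrative and do not come from the book or the paper.

```python
import random

def running_average(samples):
    """Incrementally average a sequence of sampled returns G_t."""
    estimate = 0.0
    for n, g in enumerate(samples, start=1):
        # Standard incremental-mean update: new = old + (sample - old) / n
        estimate += (g - estimate) / n
    return estimate

rng = random.Random(0)   # fixed seed so the run is repeatable
true_value = 1.0         # hypothetical q_pi(s, a), assumed for this sketch
returns = [rng.gauss(true_value, 1.0) for _ in range(100_000)]

estimate = running_average(returns)
print(f"Q estimate: {estimate:.3f}, true value: {true_value}")
```

With many samples the estimate should land close to the assumed true value, which is all that $\mathbb{E}[Q()] \rightarrow q_\pi()$ is meant to express here.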

In section 3 on page 4 of the paper, van Hasselt describes how, in Q-learning, the quantity $\max_a Q(s_{t+1}, a)$ approximates $\mathbb{E}[\max_a Q(s_{t+1}, a)]$, which in turn approximates the quantity $\max_a \mathbb{E}[Q(s_{t+1},a)]$. Now, we know that the $\max$ function is a convex function (proof). From Jensen's inequality, we have $$\phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]$$ where $X$ is a random variable (here, the vector of estimates $Q(s_{t+1}, \cdot)$ over all actions), and the function $\phi$ is a convex function. Thus, $$\max_a \mathbb{E}[Q(s_{t+1},a)] \leq \mathbb{E}[\max_a Q(s_{t+1}, a)]$$
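As a cross-check that does not rely on Jensen's inequality, the same inequality can also be sketched directly. This assumes a finite action set and that all the expectations exist; $a'$ is just a label for an arbitrary fixed action. $$\begin{aligned} \max_a Q(s_{t+1}, a) &\geq Q(s_{t+1}, a') && \text{for every fixed } a' \\ \Rightarrow \quad \mathbb{E}[\max_a Q(s_{t+1}, a)] &\geq \mathbb{E}[Q(s_{t+1}, a')] && \text{for every fixed } a' \\ \Rightarrow \quad \mathbb{E}[\max_a Q(s_{t+1}, a)] &\geq \max_{a'} \mathbb{E}[Q(s_{t+1}, a')] \end{aligned}$$ The second line uses monotonicity of expectation, and the third holds because the left-hand side does not depend on $a'$.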

$$\therefore \max_a Q(s_{t+1}, a) \approx \max_a \mathbb{E}[Q(s_{t+1},a)] \leq \mathbb{E}[\max_a Q(s_{t+1}, a)]$$

The quantity on the LHS of the above equation appears (along with $R_{t+1}$) as an estimate of the next action-value in the Q-learning update equation: $$Q(S_t,A_t) \leftarrow (1-\alpha)Q(S_t, A_t) + \alpha[R_{t+1} + \gamma\max_aQ(S_{t+1}, a)] $$
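For reference, here is how a single application of this update might look for a tabular $Q$. This is only a sketch: the function name `q_learning_update`, the list-of-lists table and the numbers in the example are my own assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place; return the target used."""
    # The max over next-state estimates is where the bias discussed here enters.
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return target

# Hypothetical table: 2 states x 2 actions, indexed as Q[state][action].
Q = [[0.0, 0.0],
     [1.0, 2.0]]
target = q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(target, Q[0][0])
```

In this made-up example the target is $1 + 0.9 \cdot 2 = 2.8$, so $Q(0, 0)$ moves halfway from $0$ to $2.8$.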

Lastly, we note that the bias of an estimate $T$ of a quantity $\theta$ is given by: $$b(T) = \mathbb{E}[T] - \theta$$ Thus the bias of $\max_a Q(s_{t+1},a)$, viewed as an estimate of $\max_a \mathbb{E}[Q(s_{t+1},a)]$, is never negative: $$b(\max_a Q(s_{t+1},a)) = \mathbb{E}[\max_a Q(s_{t+1},a)] - \max_a \mathbb{E}[Q(s_{t+1},a)] \geq 0$$ In the statistics literature, an estimate whose bias is positive is said to be an "over-estimate". Thus the action values are over-estimated by the Q-learning algorithm due to the $\max$ operator, resulting in a "maximization bias".
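A quick numerical sanity check of this bias can be sketched as follows. It assumes five actions whose true values are all zero and Gaussian noise on each sampled return; the function name `max_estimator_bias` and all parameter values are hypothetical, chosen only for illustration.

```python
import random
import statistics

def max_estimator_bias(true_values, noise_std=1.0, samples_per_action=1,
                       trials=5000, seed=0):
    """Monte Carlo estimate of E[max_a Q(a)] - max_a q(a)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        # Q(a): average of noisy sampled returns around the true value q(a).
        estimates = [
            statistics.fmean(rng.gauss(q, noise_std)
                             for _ in range(samples_per_action))
            for q in true_values
        ]
        maxima.append(max(estimates))
    return statistics.fmean(maxima) - max(true_values)

true_q = [0.0] * 5  # assumed: every action is equally good
bias_few = max_estimator_bias(true_q, samples_per_action=1)
bias_many = max_estimator_bias(true_q, samples_per_action=25)
print(f"bias with 1 sample per action:   {bias_few:.3f}")
print(f"bias with 25 samples per action: {bias_many:.3f}")
```

Under these assumptions both printed values should come out positive, and the bias should shrink as more samples are averaged per action, which is consistent with $Q()$ converging to its expected value.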

Are the arguments made above valid? I am a student with no rigorous knowledge of random processes, so please forgive me if any of the steps above are unrelated or don't make sense from a more mathematically rigorous standpoint. Please let me know if there is a better proof than this attempt.

Thank you so much for your precious time. Any help/suggestions/corrections are greatly appreciated!

When you take $\mathbb{E}$ of the $Q$ function, what are you taking expectation with respect to? The $Q$ function already is an expectation. – David Ireland – 2020-06-15T19:10:08.340

@DavidIreland, In TD methods, the Q function will not have any expectation, as we are just using sample returns instead of true expected values, i.e., instead of using $\mathbb{E}_\pi[G_t|S_t = s, A_t = a]$, we just use $G_t$, and then average them, which is then assigned to Q(). Thus, here $\mathbb{E}[Q()] = q_\pi()$. The approximation in the Q-learning update equation occurs as we are using $\gamma\max_a Q()$ instead of $\gamma\max_a q_\pi()$ – Nishanth Rao – 2020-06-16T04:00:54.577

Right, then your notation doesn’t make sense. You should write $\mathbb{E}[Q(s_{t+1}, a)] \rightarrow q(s_{t+1}, a)$ – David Ireland – 2020-06-16T08:49:31.820

@DavidIreland Thank you for the suggestion. Like you mentioned, I think it will be better to use different symbols. I have updated my post accordingly. – Nishanth Rao – 2020-06-16T09:15:28.997

I think the rest of your proof looks good as far as I can tell, good job :) – David Ireland – 2020-06-16T09:16:08.687

@DavidIreland Following your suggestion, I have restructured my entire proof to make it as clear as possible. I have also added another line from the paper that I referred to, to convey what exactly $Q()$ is. I am truly thankful for your valuable time. Please let me know if you have any other suggestions / corrections that would help other readers. Thank you! – Nishanth Rao – 2020-06-16T09:50:10.300