**Note:** you mentioned in the comments that you are reading the old, pre-print version of the paper describing AlphaZero on arXiv. My answer will be for the "official", peer-reviewed, more recent publication in Science (which nbro linked to in his comment). I'm not only focusing on the official version of the paper just because it is official, but also because I found it to be much better / more complete, and I would recommend reading it instead of the preprint if you are able to (I understand it's behind a paywall so it may be difficult for some to get access). The answer to this specific question would probably be identical for the arXiv version anyway.

**No, AlphaZero does not use $Q$-learning**.

The neural network in AlphaZero is trained to minimise the following loss function:

$$(z - \nu)^2 - \pi^{\top} \log \mathbf{p} + c \| \theta \|^2,$$

where:

- $z \in \{-1, 0, +1\}$ is the real outcome observed in a game of self-play.
- $\nu$ is a predicted outcome / value.
- $\pi$ is a distribution over actions derived from the visit counts of MCTS.
- $\mathbf{p}$ is the distribution over actions output by the network / the policy.
- $c \| \theta \|^2$ is a regularisation term, not really interesting for this question/answer.
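To make the loss concrete, here's a minimal NumPy sketch for a single position. The numbers (`z`, `v`, `pi`, `p`, `theta`, and the regularisation weight `c`) are all made up for illustration; this is not the official implementation, just the three terms of the loss written out.

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """AlphaZero-style loss for one position (illustrative sketch):
    squared value error + cross-entropy between the MCTS visit
    distribution pi and the network policy p + L2 regularisation."""
    value_loss = (z - v) ** 2            # (z - nu)^2
    policy_loss = -np.dot(pi, np.log(p)) # -pi^T log p
    reg = c * np.sum(theta ** 2)         # c * ||theta||^2
    return value_loss + policy_loss + reg

# Toy example: a won game (z = +1), predicted value 0.8,
# and distributions over three actions.
z, v = 1.0, 0.8
pi = np.array([0.7, 0.2, 0.1])      # from MCTS visit counts
p = np.array([0.6, 0.3, 0.1])       # network policy output
theta = np.array([0.5, -0.5])       # stand-in for network weights
loss = alphazero_loss(z, v, pi, p, theta)
```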

In this loss function, the term $(z - \nu)^2$ is exactly what we would have if we were performing plain Monte-Carlo updates, based on Monte-Carlo returns, in a traditional Reinforcement Learning setting (with function approximation). Note that there is no "bootstrapping": no combination of a single-step observed reward with predicted future rewards, as we would have in $Q$-learning. We play all the way to the end of a game (an "episode") and use the final outcome $z$ as the target for our value function updates. That is what makes this Monte-Carlo learning, rather than $Q$-learning.
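The difference between the two kinds of update targets can be written out in a couple of lines. All numbers here are invented purely to illustrate the shapes of the targets:

```python
# Monte-Carlo target (AlphaZero's value head): wait until the
# episode ends, then use the final outcome z directly.
z = 1.0                   # observed final game outcome
mc_target = z             # no bootstrapping involved

# One-step Q-learning target: immediate reward plus the discounted
# *current estimate* of the best next-state value (bootstrapping).
r = 0.0                   # immediate reward for this step
gamma = 0.99              # discount factor
max_q_next = 0.6          # current estimate of max_a' Q(s', a')
q_target = r + gamma * max_q_next
```

The Q-learning target depends on the agent's own (possibly wrong) value estimates for future states; the Monte-Carlo target does not.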

As for the $- \pi^{\top} \log \mathbf{p}$ term, this is definitely not $Q$-learning (we're directly learning a policy in this part, not a value function).

> If it's Supervised Learning, then why is it said that AZ uses Reinforcement Learning? Is the "reinforcement" part primarily a result of using Monte-Carlo Tree Search?

The policy learning looks very similar to Supervised Learning; indeed it's the cross-entropy loss which is frequently used in Supervised Learning (classification) settings. **I'd still argue it is Reinforcement Learning, because the update targets are completely determined by self-play, by the agent's own experience. There are no training targets / labels that are provided externally** (i.e. no learning from a database of human expert games, as was done in 2016 in AlphaGo).
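This point can be made concrete by sketching where the policy "label" $\pi$ comes from: it is computed from the agent's own MCTS visit counts, optionally sharpened with a temperature $\tau$. The visit counts below are made up; the normalisation is the part that matters.

```python
import numpy as np

# Hypothetical MCTS visit counts for three actions after search.
visit_counts = np.array([80.0, 15.0, 5.0])

# Temperature tau controls how greedy the target distribution is
# (tau = 1 simply normalises the counts).
tau = 1.0
pi = visit_counts ** (1.0 / tau)
pi = pi / pi.sum()

# pi is then used as the "label" in the cross-entropy term, so the
# supervision signal comes entirely from self-play, not from any
# external dataset.
```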

Dennis, can I ask about the convention of using || around the theta? – DukeZhou – 2019-11-08T22:09:15.047

@DukeZhou Sure. The two bars on each side of the vector denote that we take the $\ell_2$-norm, or Euclidean norm, of the vector: $\sqrt{\theta_0^2 + \theta_1^2 + \dots + \theta_n^2}$. This could've been made a bit more clear with an additional subscript $2$, but I just followed the notation from the paper :) The superscript $2$ then means that we square the result, which cancels out the square root in the Euclidean norm, so we just end up with the sum of the squared elements of the vector. – Dennis Soemers – 2019-11-09T09:29:02.960