Chess policy network


I am interested in making a simple chess engine using neural networks. I already have a fairly good value network, but I can't figure out how to train a policy network. I know that Leela Chess Zero outputs a probability for each of the roughly 1800 possible moves. But how do you train such a network? How do you calculate the loss when you only have the one move that was played in the game to work with?

Bojidar Ivanov

Posted 2018-11-05T17:42:12.927

Reputation: 112



It may come as a surprise, after learning all about dynamic programming, temporal difference learning, and SARSA/Q-learning, to discover that there is yet another whole dimension to reinforcement learning (on top of the choices of on-policy vs off-policy, model-based vs model-free, bootstrapping vs Monte Carlo, etc.): value-based vs policy-based methods. Policy-based methods are usually taught after value-based methods because they are more complex.

You can learn the parameters of a policy function by training it with a policy gradient method. The archetypal policy gradient method is REINFORCE, although it is not very efficient. You may have heard of more recently developed policy gradient methods: A3C, A2C, DDPG, TRPO, PPO . . . there are quite a few.

> How do you calculate the loss when you only have the 1 move that was played in the game to work with?

You can pre-train a policy network using supervised learning (perhaps using the moves of winning players in high-quality games) - that uses the multi-class cross-entropy loss you may be familiar with from supervised classification problems.
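As a rough NumPy sketch of that supervised loss (the move count of 1858 follows Leela Chess Zero's fixed move encoding; the logits and move index here are illustrative, not from a real network):

```python
import numpy as np

NUM_MOVES = 1858  # size of Leela Chess Zero's fixed move encoding


def softmax(logits):
    """Convert raw network outputs into a probability distribution."""
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def cross_entropy_loss(logits, played_move):
    """Supervised loss for one position: the negative log-probability
    the network assigns to the single move that was actually played."""
    probs = softmax(logits)
    return -np.log(probs[played_move])


# hypothetical example: random logits standing in for network output
rng = np.random.default_rng(0)
logits = rng.normal(size=NUM_MOVES)
loss = cross_entropy_loss(logits, played_move=42)
```

Even though only one move per position is labelled, minimising this loss pushes probability mass toward the played move and (via the softmax normalisation) away from all the others.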

Policy gradient methods work with a performance objective defined as the expected return under the distribution of states. If your network parameters are $\theta$, then it may look like this:

$$J(\theta) = \sum_{s \in \mathcal{S}} \rho_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta)q_{\pi}(s,a)$$

where $\rho_{\pi}(s)$ is the expected proportion of time steps spent in state $s$. There is a way to take a sample gradient of this that can be used for gradient ascent - the derivation is called the Policy Gradient Theorem. It's a bit long to include in this answer, but the upshot is that you can use your sampled single step to generate an approximate gradient towards improving the policy. There are a few variations, but for instance advantage actor critic uses this:

$$\nabla J(\theta) = \hat{A}(s,a)\,\nabla \log \pi(a|s,\theta)$$

where $\hat{A}(s,a)$ is your current estimate of the advantage (or $Q(s,a) - V(s)$) for taking a specific action in state s.

The related loss function is

$$\mathcal{L}(\theta) = -\hat{A}(s,a)\log \pi(a|s,\theta)$$

The $\log$ function looks like an odd addition, but it is just a consequence of adjusting $\nabla J$ to account for the ratios in which actions are taken under the current policy. In fact $\nabla \log \pi(a|s,\theta) = \frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)}$, and it may help your intuition to keep it in that form (the $\log$ form is concise, and is used elsewhere in statistics as the "score function", but it is not necessary for anything specific to RL).
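A minimal NumPy sketch of that single-sample loss and its gradient (the five "legal moves", the sampled action, and the advantage value are all illustrative):

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def pg_loss(logits, action, advantage):
    """-A(s,a) * log pi(a|s,theta): single-sample policy gradient loss."""
    return -advantage * np.log(softmax(logits)[action])


def pg_grad(logits, action, advantage):
    """Closed-form gradient of pg_loss w.r.t. the logits:
    A(s,a) * (softmax(logits) - one_hot(action))."""
    grad = advantage * softmax(logits)
    grad[action] -= advantage
    return grad


# hypothetical example: 5 legal moves, sampled move 2, positive advantage
logits = np.array([0.1, -0.5, 1.2, 0.0, 0.3])
grad = pg_grad(logits, action=2, advantage=0.7)
```

With a positive advantage, the gradient step raises the logit of the sampled action and lowers the others; a negative advantage does the reverse, which is exactly the "reinforce what worked" behaviour the theorem promises.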

The variations of policy gradient methods may use functions other than the advantage function, and it is not clear that any one of them is "best". The policy gradient theory essentially gives us ways of estimating the relative benefits of actions, and it allows any offset (a baseline) to be subtracted from the estimated return from $(s,a)$, provided the offset does not further depend on the choice of action $a$. So you can use any method of estimating the return, and offset it with anything you think might normalise the updates - common choices for the latter include subtracting the average reward, or subtracting the state value function.
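The reason an action-independent baseline is allowed is that the score function has zero expectation under the policy: $\sum_a \pi(a)\,\nabla \log \pi(a) = \sum_a \nabla \pi(a) = \nabla 1 = 0$. A tiny NumPy check of that identity (the four-action policy is illustrative):

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


# arbitrary policy over 4 actions, for illustration
logits = np.array([0.5, -1.0, 2.0, 0.0])
probs = softmax(logits)

# row a of `score` is the gradient of log pi(a) w.r.t. the logits:
# one_hot(a) - pi
score = np.eye(len(logits)) - probs

# expectation of the score under the policy itself:
# sum_a pi(a) * grad log pi(a), which should be the zero vector
expected_score = probs @ score
```

Because this expectation is zero, any constant baseline $b$ contributes $b \cdot 0$ to the expected gradient: subtracting it cannot bias the estimator, but a well-chosen baseline (such as $V(s)$) can greatly reduce its variance.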

Neil Slater

Posted 2018-11-05T17:42:12.927

Reputation: 14 632