It may come as a surprise, after learning about dynamic programming, temporal difference learning, and SARSA/Q-learning, to discover that there is yet another whole dimension to reinforcement learning (on top of the choices of on-policy vs. off-policy, model-based vs. model-free, bootstrapping vs. Monte Carlo, etc.): value-based vs. policy-based methods. Policy-based methods are often taught *after* value-based methods because they are more complex.

You can learn the parameters of a policy function by training it with a policy gradient method. The archetypal policy gradient method is REINFORCE, although it is not very efficient. You may have heard of more recently developed policy gradient methods: A3C, A2C, DDPG, TRPO, PPO... there are quite a few.

How do you calculate the loss when you only have the single move that was actually played in the game to work with?

You can *pre-train* a policy network using supervised learning (perhaps using the moves of winning players in high-quality games) - that would use the multi-class cross-entropy loss that you may be familiar with from supervised classification problems.
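As a minimal sketch of that pre-training idea, here is a softmax policy over a tiny discrete state/action space, fitted to (state, expert action) pairs with cross-entropy. The linear "network" `W`, the state and action counts, and the synthetic expert data are all illustrative assumptions, not part of any specific library or method from the answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
# One row of logits per state: the policy parameters theta
W = rng.normal(scale=0.1, size=(n_states, n_actions))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy(s):
    # pi(a | s, theta) for a tabular state s
    return softmax(W[s])

# Synthetic "expert" data: in state s, the winning player chose action s % 3
data = [(s, s % n_actions) for s in range(n_states)]

lr = 0.1
for epoch in range(500):
    for s, a in data:
        p = policy(s)
        # Cross-entropy loss -log pi(a|s); its gradient w.r.t. the
        # logits of a softmax is simply (p - onehot(a))
        grad = p.copy()
        grad[a] -= 1.0
        W[s] -= lr * grad

# After training, each state's policy concentrates on the expert action
```

The convenient `p - onehot(a)` gradient is exactly why cross-entropy pairs so naturally with a softmax output layer.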

Policy gradient methods optimise a reward summation function defined as the expected reward under the on-policy distribution of states. If your network parameters are $\theta$, then that may look like this:

$$J(\theta) = \sum_{s \in \mathcal{S}} \rho_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta)q_{\pi}(s,a)$$

where $\rho_{\pi}(s)$ is the expected proportion of time steps spent in state $s$. There is a way to take a sample gradient of this that can be used for gradient *ascent* - the derivation is called the Policy Gradient Theorem. It is a bit long to include in this answer, but the upshot is that you *can* use your sampled single step to generate an approximate gradient towards improving the policy. There are a few variations, but for instance advantage actor-critic uses this:

$$\nabla J(\theta) = \hat{A}(s,a)\nabla\text{log}(\pi(a|s,\theta))$$

where $\hat{A}(s,a)$ is your current estimate of the *advantage* (or $Q(s,a) - V(s)$) for taking a specific action in state $s$.
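A single update of this kind can be sketched in a few lines for a softmax policy over a small discrete action set. The linear logit parameterisation and the hard-coded advantage estimate are illustrative assumptions (in practice $\hat{A}$ would come from a critic):

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
theta = rng.normal(scale=0.1, size=n_actions)  # logits for one state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)
a = 1            # the single action actually taken this time step
A_hat = 2.0      # advantage estimate for (s, a), assumed supplied by a critic

# Gradient of log pi(a|s,theta) w.r.t. softmax logits: onehot(a) - pi
grad_log_pi = -pi.copy()
grad_log_pi[a] += 1.0

# Gradient *ascent* on J(theta): step along A_hat * grad log pi
lr = 0.1
theta_new = theta + lr * A_hat * grad_log_pi
```

With a positive advantage, the update raises the probability of the sampled action; with a negative advantage it would lower it.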

The related loss function is

$$\mathcal{L}(\theta) = -A(s,a)\text{log}(\pi(a|s,\theta))$$

The $\text{log}$ function looks like an odd addition, but it is just a consequence of adjusting $\nabla J$ to account for the ratios in which actions are taken under the current policy. In fact $\nabla\text{log}(\pi(a|s,\theta)) = \frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)}$, and it may help your intuition to keep it in that form (the $\text{log}$ form is concise and used elsewhere in statistics as the "score function", but it is not necessary for anything specific to RL).
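You can check the identity $\nabla\text{log}(\pi) = \nabla\pi / \pi$ numerically with finite differences on a softmax policy; everything below (the particular $\theta$, the chosen action) is just a made-up test case:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.2, -0.5, 1.0])
a = 2          # check the gradient of pi(a|s,theta) for this action
eps = 1e-6

grad_pi = np.zeros_like(theta)
grad_log_pi = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    # Central finite differences for pi and for log pi
    grad_pi[i] = (softmax(theta + d)[a] - softmax(theta - d)[a]) / (2 * eps)
    grad_log_pi[i] = (np.log(softmax(theta + d)[a])
                      - np.log(softmax(theta - d)[a])) / (2 * eps)

# grad log pi should equal (grad pi) / pi, entry by entry
```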

The variations in policy gradients may use functions other than the advantage function, and it is not clear that there is any single "best" one. The policy gradient theorem basically gives us ways of estimating the *relative* benefits of actions, and allows any offset to the estimated return from $(s,a)$ that does not further depend on the choice of action $a$. So you can use any method for getting an estimated return, and offset it with anything that you think might normalise the updates - common choices for the latter include subtracting the average reward, or subtracting the state value function.
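The claim that any action-independent offset is allowed can be demonstrated directly: for a softmax policy, $\sum_a \pi(a|s)\nabla\text{log}\,\pi(a|s) = \nabla\sum_a \pi(a|s) = \nabla 1 = 0$, so subtracting a baseline $b(s)$ leaves the *expected* gradient unchanged (it only reduces variance). The particular $\theta$, action values and baseline below are made-up numbers for a single state:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.2, 0.7])
pi = softmax(theta)
q = np.array([1.0, 3.0, -2.0])   # made-up action values for one state
b = 5.0                          # arbitrary baseline, e.g. a V(s) estimate

def expected_grad(returns):
    # Exact expectation over actions of returns[a] * grad log pi(a|s,theta)
    g = np.zeros_like(theta)
    for a, p in enumerate(pi):
        grad_log = -pi.copy()
        grad_log[a] += 1.0       # grad log pi for a softmax: onehot(a) - pi
        g += p * returns[a] * grad_log
    return g

# expected_grad(q) and expected_grad(q - b) agree for any constant b
```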