## Building 'evaluation' neural networks for Go, Reversi, Checkers etc.: how to train them?


I'm trying to build neural networks for games like Go, Reversi/Othello, Checkers, or even tic-tac-toe, not by having them calculate a move directly, but by making them evaluate a position.

The input is any board situation. The output is a score: an estimate of the probability of winning, or of how favorable the given position is, where 1 = guaranteed to win and 0 = guaranteed to lose.

On any given turn I can then loop over all possible moves for the current player, evaluate each resulting game situation, and pick the one with the highest score.
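As a sketch, that selection loop could look like the following. All the helper functions and the board encoding here are placeholders of my own (a flat tic-tac-toe-sized board, +1 for my pieces, -1 for the opponent's), and `evaluate` is a dummy heuristic standing in for the trained network, so the example runs on its own:

```python
# Dummy stand-in for the network's forward pass: a fixed linear score
# of the position (the center cell is weighted higher, just so the
# example has something to prefer).
WEIGHTS = [1, 1, 1, 1, 3, 1, 1, 1, 1]

def evaluate(board):
    # Placeholder for the trained evaluation network.
    return sum(w * c for w, c in zip(WEIGHTS, board))

def legal_moves(board):
    # Empty cells are legal moves.
    return [i for i, cell in enumerate(board) if cell == 0]

def apply_move(board, move):
    new_board = list(board)
    new_board[move] = 1  # the player to move always plays +1
    return new_board

def pick_best_move(board):
    # Evaluate every resulting position and pick the highest-scoring one.
    return max(legal_moves(board), key=lambda m: evaluate(apply_move(board, m)))

print(pick_best_move([0] * 9))  # the dummy heuristic prefers the center: 4
```

With a real network, only `evaluate` would change; the greedy loop stays the same.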

Hopefully, by letting this neural network play a trillion games vs itself, it can develop a sensible scoring function resulting in strong play.

Question: How do I train such a network?

In every game, I can keep evaluating and making moves back and forth until one of the AI players wins. In that case the last game situation (right before the winning move) for the winning player should have a target value of 1, and the opposite situation (for the losing player) has a target value of 0.

Note that I don't intend to make the evaluation network two-sided. I encode the game situation as always "my own" pieces vs "the opponent", and then evaluate a score from my own (i.e. the current player's) side or perspective. Then after I pick a move, I flip sides so to speak, so the opponent pieces now become my own and vice versa, and then evaluate scores again (now from the other side's perspective) for the next counter move.

So the input to such a network does not explicitly encode black and white pieces, or noughts and crosses (in the case of tic-tac-toe), but just my pieces vs. their pieces. It then evaluates how favorable the given game situation is for me, always assuming it's my turn.
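A minimal sketch of that side-flipping, assuming the same flat +1/-1 board encoding as above (the function name is illustrative):

```python
def flip_perspective(board):
    # After the current player moves, hand the board to the opponent:
    # negating every cell turns "my" pieces into "theirs" and vice versa,
    # so the network always scores from the player-to-move's point of view.
    return [-c for c in board]

# Example: X (me, +1) vs O (opponent, -1) in tic-tac-toe.
board = [1, -1, 0,
         0,  1, 0,
         0,  0, -1]
seen_by_opponent = flip_perspective(board)
```

Flipping twice returns the original position, which is a cheap sanity check for the encoding.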

I can obviously assign a desired score or truth value to the last move in the game (1 for a win, 0 for a loss), but how do I propagate that back towards earlier situations in a played game?

Should I somehow distribute the 1 or 0 result back a few steps, with a decaying adjustment factor or learning rate? In a game with 40 turns, it might make sense to consider the last few situations as good or bad (being close to winning or losing) but I guess that shouldn't reflect all the way back to the first few moves in the game.

Or am I completely mistaken with this approach and is this not how it's supposed to be done?

This idea is on the right track! You have pretty much defined AlphaZero, where they train a NN to evaluate each move, and then use Monte Carlo tree search for the actual playthroughs! – mshlis – 2019-12-02T14:57:44.653

Ok thanks, and do you know if I can find some explanation or further documentation on the training process? – RocketNuts – 2019-12-02T20:45:24.647


The evaluation of the last steps in the game can use the 1 and 0 targets, as you said. For all the other steps, the target should be the evaluation of the best next step, with a small decay.
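In code, the rule described here might look like the sketch below: terminal positions get the hard 1/0 label, and every earlier position bootstraps from the current evaluation of its best successor. The function name and the 0.99 decay factor are illustrative choices, not fixed values:

```python
DECAY = 0.99  # example decay factor; a tunable hyperparameter

def training_target(best_next_value, terminal=False, won=False):
    # Terminal position: hard label (1 = win, 0 = loss).
    # Non-terminal position: the best successor's current evaluation,
    # slightly decayed, so credit fades with distance from the outcome.
    if terminal:
        return 1.0 if won else 0.0
    return DECAY * best_next_value
```

This is essentially temporal-difference-style bootstrapping: earlier positions are trained towards the network's own estimate of the best reachable position rather than directly towards the final game result.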

Thanks, what do you mean by the evaluation of the best next step? For example, suppose a simulated game of the neural network against itself proceeds like this. The last move of A (in step 7) should obviously be scored 1 because it wins the game, and the preceding move of B (in step 6) should score 0 because it allowed the opponent to win. But how do I continue from there? The step directly before that, move 5 by player A, was actually a very bad move, as player B could have won after it. And in fact player A started the game very badly. So how do I assign the remaining scores?

– RocketNuts – 2019-12-05T22:14:33.797

You should also score it 1 (maybe with decay, so 0.99), because it led to a win. This will make it take this wrong decision more often in the future, though. Eventually B will learn how to defeat this move, and at that point you will start scoring step 5 with 0 because it lost that game. – Lustwelpintje – 2019-12-07T09:25:36.390

This means that the early steps won't be properly trained at the start of the training sessions, but given enough training games, the network will eventually learn how to handle the early steps correctly. – Lustwelpintje – 2019-12-07T09:29:01.813

OK I see, so this means I should rate A's moves 7, 5, 3, 1 with a score of 1, 0.99, 0.98, 0.97 respectively (or 0.99², 0.99³, etc) and B's moves 6, 4, 2 with a score of 0, 0.01, 0.02 respectively (or 1-0.99ⁿ). But one more thing: in this case A and B are actually the same neural network. So do I take the average of all the weight adjustments, i.e. learning rate η times 0.99ⁿ for the moves on the winning side (with n = number of remaining moves until the end of the game for that side) and η(1-0.99ⁿ) for the moves on the losing side. And then backpropagate the resulting average score adjustment? – RocketNuts – 2019-12-08T19:07:04.780
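The per-move target scheme discussed in this comment could be sketched as follows. The function and variable names are made up for illustration; the only assumption is that the player who made the last move in the list is the winner:

```python
DECAY = 0.99

def game_targets(movers):
    # movers: who made each move, in order, e.g. ['A', 'B', ..., 'A'];
    # the player who made the last move is the winner.
    # Winner's positions get DECAY**k, loser's get 1 - DECAY**k, where k
    # counts how many more moves that side made before the game ended.
    winner = movers[-1]
    targets = []
    for i, player in enumerate(movers):
        k = sum(1 for p in movers[i + 1:] if p == player)  # side's remaining moves
        targets.append(DECAY ** k if player == winner else 1 - DECAY ** k)
    return targets

# Seven-move game A B A B A B A, won by A:
print(game_targets(list("ABABABA")))
```

So A's moves 1, 3, 5, 7 get 0.99³, 0.99², 0.99, 1 and B's moves 2, 4, 6 get 1 − 0.99², 1 − 0.99, 0, matching the scheme above.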

You just train on every move once, right? I don't really see why you would want to average the weight adjustments. – Lustwelpintje – 2019-12-09T10:29:39.023

Ah OK, no, I thought you meant some kind of (mini-)batch training where the weight adjustments are applied once per game. Although now that I think about it, I guess it boils down to the same thing. So yes, I can just adjust the weights for every individual move. Thanks again, much appreciated. – RocketNuts – 2019-12-09T10:37:56.257