
I'm trying to build neural networks for games like Go, Reversi, Othello, Checkers, or even tic-tac-toe, not by calculating a move, but by making them **evaluate a position**.

The input is any board situation. The output is a score: an estimate of the probability of winning, or of how favorable the given position is, where 1 = guaranteed to win and 0 = guaranteed to lose.

On any given turn I can then loop over all possible moves for the current player, evaluate the resulting positions, and pick the one with the highest score.
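That selection loop can be sketched as follows. This is a minimal, hypothetical sketch: the names `legal_moves`, `apply_move`, and `evaluate` are assumed placeholders for your game API and network, not anything from a specific library.

```python
def pick_move(board, legal_moves, apply_move, evaluate):
    """Return the move whose resulting position scores highest for us.

    `legal_moves(board)` yields candidate moves, `apply_move(board, move)`
    returns the resulting position, and `evaluate(position)` is the
    network's scoring function (all assumed names).
    """
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):
        score = evaluate(apply_move(board, move))
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```

Note this is a 1-ply greedy search; it works with any evaluator, but systems like AlphaZero combine the evaluator with a deeper tree search instead of picking greedily.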

Hopefully, by letting this neural network play a trillion games vs itself, it can develop a sensible scoring function resulting in strong play.

Question: **How do I train such a network?**

In every game, I can keep evaluating and making moves back and forth until one of the AI players wins. In that case the last game situation (right before the winning move) for the winning player should have a target value of 1, and the opposite situation (for the losing player) has a target value of 0.

Note that I don't intend to make the evaluation network two-sided. I encode the game situation as always "my own" pieces vs. "the opponent's", and evaluate a score from my own (i.e. the current player's) perspective. After I pick a move, I flip sides, so to speak: the opponent's pieces become my own and vice versa, and I evaluate scores again (now from the other side's perspective) for the next counter-move.

So the input to such a network does not explicitly encode black and white pieces, or noughts and crosses (in the case of tic-tac-toe), but just my pieces vs. their pieces. The network then evaluates how favorable the given position is for me, always assuming it's my turn.
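For a board encoded this way, "flipping sides" is just negating the encoding. A minimal sketch, assuming a tic-tac-toe board stored as a flat list where +1 = a current-player piece, -1 = an opponent piece, and 0 = empty (this representation is my assumption, not prescribed anywhere):

```python
def flip_sides(board):
    """Re-encode the position from the other player's perspective.

    With +1 = my pieces and -1 = the opponent's, swapping perspectives
    is simply negating every cell; empties (0) are unaffected.
    """
    return [-cell for cell in board]
```

Flipping twice returns the original position, which is a handy sanity check for whatever encoding you choose.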

I can obviously assign a desired score or truth value for the last position in the game (1 for a win, 0 for a loss), but how do I propagate that result back towards earlier situations in a played game?

Should I somehow distribute the 1 or 0 result back a few steps, with a decaying adjustment factor or learning rate? In a game with 40 turns, it might make sense to consider the last few situations as good or bad (being close to winning or losing) but I guess that shouldn't reflect all the way back to the first few moves in the game.
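The "decaying adjustment" idea above can be made concrete by blending the final result toward a neutral 0.5 the farther a position is from the end of the game. A sketch under my own assumptions (the decay rate `gamma` and the blend-toward-0.5 scheme are illustrative choices, closely related to discounted returns in temporal-difference learning, not a specific published recipe):

```python
def decayed_targets(num_positions, final_result, gamma=0.9):
    """Training targets for one game's positions, ordered first to last.

    final_result: 1.0 for a win, 0.0 for a loss (from the perspective
    of the player to move in each stored position, already side-flipped).
    Positions near the end get a target near final_result; early
    positions decay toward the uninformative value 0.5.
    """
    targets = []
    for steps_from_end in reversed(range(num_positions)):
        weight = gamma ** steps_from_end
        targets.append(weight * final_result + (1 - weight) * 0.5)
    return targets
```

With `gamma=0.9` and a 40-turn game, the first position's target is already very close to 0.5, matching the intuition that the opening moves should barely be credited or blamed for the final outcome.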

Or am I completely mistaken with this approach and is this not how it's supposed to be done?

This idea is on the right track! You have pretty much defined AlphaZero, where they train a NN to evaluate each move, and then use Monte Carlo tree search for the actual playthroughs! – mshlis – 2019-12-02T14:57:44.653

Ok thanks, and do you know if I can find some explanation or further documentation on the training process? – RocketNuts – 2019-12-02T20:45:24.647