What I notice is that the network's fitness keeps climbing and then collapsing again. It seems that my current approach only evolves particular patterns for placing marks on the board, and once a random mutation disrupts the current pattern, a new one emerges. My network goes in circles without ever evolving an actual strategy. I suspect the solution would be to pit the network against a tic-tac-toe AI, but is there any way to evolve an actual strategy purely by having it play against itself?

The likely cause of this phenomenon is that your fitness function evaluates an agent by letting it play against lots of other agents (the entire population), many of which are likely very poor players.

Because Tic-Tac-Toe is such a simple game, we know that optimal play from both sides leads to a draw. Suppose we have a population of the following three strategies:

- $\pi_1$: an optimal player
- $\pi_2$: a sub-optimal player
- $\pi_3$: a different sub-optimal player

There can easily be situations where the optimal player $\pi_1$ gets a draw against both of the sub-optimal players (if they're not **extremely** bad), because right from the get-go an optimal player will play "safe" enough such that it can still guarantee a draw against an optimal opponent, which may not be the fastest way to win against a sub-optimal player.

In the same situation, the sub-optimal player $\pi_2$ for example may be able to consistently win against a slightly worse sub-optimal player $\pi_3$. In this example situation, your fitness function ranks the agents $f(\pi_2) > f(\pi_1) > f(\pi_3)$, which is wrong.
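As a toy illustration of this inversion (the head-to-head scores below are invented purely to exhibit the problem, scoring win = 1, draw = 0.5, loss = 0):

```python
# Hypothetical round-robin results: win=1, draw=0.5, loss=0. Invented numbers:
# pi1 (optimal) only ever draws; pi2 consistently beats the weaker pi3.
results = {
    ("pi1", "pi2"): 0.5, ("pi1", "pi3"): 0.5,
    ("pi2", "pi1"): 0.5, ("pi2", "pi3"): 1.0,
    ("pi3", "pi1"): 0.5, ("pi3", "pi2"): 0.0,
}

def fitness(player):
    # Population-based fitness: total score against every other member.
    return sum(s for (p, _), s in results.items() if p == player)

# fitness("pi2") = 1.5 > fitness("pi1") = 1.0 > fitness("pi3") = 0.5
```

The optimal player ends up ranked second, even though nothing about its play is wrong.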

As you already suggested yourself, the most straightforward way to address this problem would be to evaluate the fitness of strategies by having them play against an optimal minimax agent, rather than against a population of many strategies (including poor ones).
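For a game as small as Tic-Tac-Toe, such an optimal opponent fits in a few lines. A minimal sketch (my own, not code from the question; the board is a 9-tuple with entries `'X'`, `'O'`, or `None`):

```python
# A minimal minimax opponent for 3x3 Tic-Tac-Toe.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, best_move) for the player to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        # The previous player just completed a line, so 'player' has lost.
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0, None  # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    best_score, best_move = -2, None
    for m in moves:
        child = board[:m] + (player,) + board[m + 1:]
        score = -minimax(child, opponent)[0]  # opponent's gain is our loss
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move
```

Fitness of an evolved agent could then be measured by playing it against `minimax` as both first and second player and scoring wins/draws/losses.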

If you really want to use only evolution, with no tree search, you'll have to find a way to **fix the fitness function such that the problem described above can no longer occur**. One way you could try to do that (*not 100% sure it would work, but I imagine it might*) would be to set up some larger tournament bracket where agents progress through the brackets if they're able to beat the others they were paired up with. Getting further in the tournament would then increase the fitness of an agent. Very bad sub-optimal players should not be able to progress far in the tournament if they get beaten by other (sub-)optimal agents, while sub-optimal agents that do manage to advance still shouldn't be able to get more than a draw against an optimal agent. Some things to keep in mind with this idea:

- You'll likely want to **repeat the entire tournament many times with different, randomized initial pairings in the bracket**, and compute fitness based on the average (or maybe median or max) ranking across all repetitions of the tournament. This filters out members of the population getting particularly lucky or unlucky with the opponents they encounter in the brackets.
- You'll have to think about which actions the agents select in such a tournament. Do they deterministically play the action given by the $\arg\max$ of the softmax outputs (in which case you could also just use the linear outputs rather than the softmax outputs, since the softmax function does not change the ranking of the outputs)? Or do they nondeterministically sample actions according to the softmax distribution? Sampling according to the softmax distribution seems attractive because it leads to more variance in the game states encountered, and it is important for agents to be robust and capable of handling many different game states. On the other hand, it introduces noise and may make an optimal agent lose by accident. **I suppose I would lean towards deterministic play with the $\arg\max$ over the softmax outputs**. In a large tournament, there will still be sufficient variety due to encountering many different opponents.
- You'll have to put thought into handling the situation where two agents may keep drawing against each other indefinitely. Which agent advances? If this happens, I think I would gradually shift from deterministically playing $\arg\max$ actions to playing according to the softmax distribution. This guarantees that someone eventually loses. Averaging results over multiple different games like this, plus averaging over multiple repetitions of the complete tournament, should yield accurate results.
- You'll want to think about how to handle **illegal actions**. You describe including punishments for illegal actions in the fitness function. This introduces extra noise / variance into a fitness function that is already difficult enough to get right, so I wouldn't do it. I'd recommend a **manual post-processing step in which you set the probabilities of illegal actions to $0$**, renormalizing so that the probabilities of the remaining actions add up to $1$ again. This is also what's done (on a much larger scale) by DeepMind in their state-of-the-art Go/Chess/Shogi agents, for example. If you really don't want to do this, and you want evolved strategies to learn on their own never to assign high outputs to illegal actions, I would recommend making them lose the game immediately if they suggest an illegal action. That way you still have a "clean" fitness function based only on wins/draws/losses.
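The masking step is a small post-processing function on top of the network's outputs. A sketch (`masked_policy` is a name I made up; it assumes the network emits 9 values, one per cell, and a boolean mask marking the empty cells):

```python
import numpy as np

def masked_policy(logits, legal_mask):
    """Zero out illegal actions and renormalize the softmax distribution.

    logits: raw network outputs, shape (9,)
    legal_mask: boolean array, True where the cell is empty (move is legal)
    """
    # Softmax over all outputs first (what the network actually produces)...
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    # ...then set illegal moves to zero probability and renormalize to sum to 1.
    probs = probs * legal_mask
    return probs / probs.sum()
```

You can then either take the $\arg\max$ of the masked distribution or sample from it; either way, an illegal move can never be selected.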
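A rough sketch of the repeated-bracket evaluation described above (all names here are my own; `play(i, j)` is assumed to return the index of the winning agent and to resolve endless draws, e.g. by gradually switching to softmax sampling):

```python
import random

def tournament_fitness(n_agents, play, repeats=100):
    """Average number of knockout rounds each agent survives (= matches won),
    over many single-elimination tournaments with randomized pairings."""
    wins = [0] * n_agents
    for _ in range(repeats):
        field = list(range(n_agents))
        random.shuffle(field)  # randomized initial pairings each repetition
        while len(field) > 1:
            nxt = [field[-1]] if len(field) % 2 else []  # odd one out gets a bye
            for a, b in zip(field[0::2], field[1::2]):
                w = play(a, b)
                wins[w] += 1
                nxt.append(w)   # winner advances to the next round
            field = nxt
    return [w / repeats for w in wins]
```

With a population of 50, you would run this once per generation and use the returned averages as the fitness values for selection.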

Are you using any form of cross-over between two parents in reproduction? – Neil Slater – 2018-08-24T09:43:23.763

I do. I tried using mutation only and crossover + mutation, yet the results are essentially the same. I doubt that this is the problem. – Perpetuum – 2018-08-24T10:41:35.443

Yes, I was asking because cross-over in neural networks may work better using a system like NEAT, which can track compatibility between architectures as they grow. It is hard to tell where your problem might be, but combining GAs and NNs is quite tough: the search space is complex due to co-dependencies between weights (making cross-over fail unless care is taken), and the space also becomes very large very quickly (making random mutations inefficient for searching). I guess your architecture has 180 parameters, weights + biases, to find optimal values for? – Neil Slater – 2018-08-24T11:41:41.900

Yes. I understand why this approach doesn't work, but it's hard for me to put it in technical terms. Basically, there is no emerging correct strategy since the network only plays against different variations of itself. Playing against an AI that knows the optimal strategy would teach the network to make correct moves, but I want to find a solution without this approach, so I could use the network in situations where there is no known optimal strategy. – Perpetuum – 2018-08-24T12:28:39.727

The design should work in principle for self-play. It is a matter of details, and of what you can change to enable it in a reasonable time. I don't know the answer here. But one thing you could try is to see whether the network can learn your desired function. Create a table of all the correct moves, and use normal supervised learning with gradient descent. Can the network get a high accuracy when told the answer? If it can, then you have ruled out the network architecture as a problem. If it cannot, then I would probably try a deeper network - keep adding layers until it works. – Neil Slater – 2018-08-24T12:38:10.383

Of course in a non-trivial game, you won't be able to create this supervised learning data, but you may start to get a feel for the correct NN architecture. – Neil Slater – 2018-08-24T12:39:03.557

One thing: How are you matching up genomes to play each other and get a fitness rating in each generation? Is it all-play-all (so an evaluation round is 4950 games), a single random pairing (evaluation round is 50 games), or something in-between or different? Please [edit] that into the question. – Neil Slater – 2018-08-24T13:02:16.873

From where I stand, it seems that you are missing some new blood in your strategy. Have a look here: https://www.tutorialspoint.com/genetic_algorithms/genetic_algorithms_parent_selection.htm If you always select the top elements to reproduce, you will converge too quickly, which seems to be the case here. Hope this helps

– MaxouMask – 2018-08-24T14:32:32.283

http://eng.uber.com/wp-content/uploads/2017/12/improving-es-arxiv.pdf To explore more when stuck in a local optimum – caissalover – 2018-10-25T18:27:32.363