Q-learning tic-tac-toe


I have a tic-tac-toe game with a Q-learning algorithm, and the AI trains by playing against the same algorithm (the two players don't share a Q-matrix). But after 200,000 games, I still beat the AI very easily and it plays rather poorly. Moves are selected with an epsilon-greedy policy.

What could cause the AI not to learn?

Here is how I do it (pseudo code):

for(int i = 0; i < 200000; ++i){
    // Game is restarted here
}

And in my ticTacToe I have a simple loop:

    swapPlaying(); //Change the players' turn
    Position toPlay = playing.whereToMove();


// Here I just update my players depending on whether they won, drew, or lost.

In my players, I select the move with epsilon-greedy, implemented as below:

Moves moves = getMoves(); // Return every move available
Qvalues qValues = getQValues(moves); // return only qvalues of interest
//also create the state and add it to the Q-matrix if not already in.

if(!optimal) {
     updateEpsilon(); // I update epsilon with epsilon = 1/k (inverse decay, not linear), with k being the number of games played.
     double r = (double) rand() / RAND_MAX; // Random between 0 and 1
     if(r < epsilon) { // Exploration
         return randomMove(moves); // Selection of a random move among every move available.
     } else {
         return moveWithMaxQValue(qValues);
     }
} else { // If I'm not in the training part anymore
     return moveWithMaxQValue(qValues);
}
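
For reference, a minimal self-contained sketch of the selection step described above. The function name `selectMove`, the use of `std::mt19937` instead of `rand()`, and the random tie-breaking are assumptions made for illustration and are not taken from the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: epsilon-greedy choice over the legal moves.
// qValues[i] is the Q-value of the i-th legal move in the current state.
// Returns an index into qValues.
std::size_t selectMove(const std::vector<double>& qValues, double epsilon,
                       std::mt19937& rng) {
    assert(!qValues.empty());
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        // Exploration: uniform random legal move.
        std::uniform_int_distribution<std::size_t> pick(0, qValues.size() - 1);
        return pick(rng);
    }
    // Exploitation: find the best Q-value...
    double best = qValues[0];
    for (double q : qValues) {
        if (q > best) best = q;
    }
    // ...and break ties at random (assumption: not in the original code).
    std::vector<std::size_t> ties;
    for (std::size_t i = 0; i < qValues.size(); ++i) {
        if (qValues[i] == best) ties.push_back(i);
    }
    std::uniform_int_distribution<std::size_t> pick(0, ties.size() - 1);
    return ties[pick(rng)];
}
```

The tie-breaking may matter early in training: when every Q-value of a state is still zero, a greedy step that always returns the first maximum would keep replaying the same move.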

And I update with the following :

double reward = getReward(); // Returns 1 if game won, -1 if game lost, 0 otherwise
double thisQ, maxQ, newQ;
Grid prevGrid = Grid(*grid); // I have a shared_ptr on the grid for simplicity
prevGrid.removeAt(position); // We remove the action executed before

string state = stateToString(prevGrid);
thisQ = qTable[state][action];
maxQ = maxQValues();

newQ = thisQ + alpha * (reward + gamma*maxQ - thisQ);
qTable[state][action] = newQ;

As mentioned above, both AIs use the same algorithm, but they are two distinct instances, so they don't have the same Q-matrix. I read somewhere on Stack Overflow that I should take into account the move of the opposing player, but I update a state after both the player's move and the opponent's move, so I don't think it's necessary.
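
A minimal sketch of that tabular update, written so that terminal transitions do not bootstrap from a next state. The names `QTable`, `maxQ`, and `updateQ`, the fixed 9 actions per state, and the explicit `terminal` flag are assumptions for illustration; `nextState` is assumed to be the board this player sees on its next turn, after the opponent has replied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout: board string -> one Q-value per cell (9 cells).
using QTable = std::unordered_map<std::string, std::vector<double>>;

// Largest Q-value stored for a state; 0 for a state never seen before.
double maxQ(const QTable& q, const std::string& state) {
    auto it = q.find(state);
    if (it == q.end() || it->second.empty()) return 0.0;
    return *std::max_element(it->second.begin(), it->second.end());
}

// One Q-learning step: Q(s,a) += alpha * (target - Q(s,a)).
// For a terminal transition the target is just the reward.
void updateQ(QTable& q, const std::string& state, std::size_t action,
             double reward, const std::string& nextState, bool terminal,
             double alpha, double gamma) {
    assert(action < 9);
    std::vector<double>& row = q[state];
    if (row.empty()) row.assign(9, 0.0);  // first visit: create a zero row
    double target = reward;
    if (!terminal) target += gamma * maxQ(q, nextState);
    row[action] += alpha * (target - row[action]);
}
```

Whether the original `maxQValues()` looks at the next state or at the current one cannot be told from the pseudo code, so that is worth checking against this form.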


Posted 2017-04-12T13:32:49.070

Reputation: 39

Are you sure of the implementation? Tic-tac-toe has a very limited set of alternative scenarios, and Q-learning can even "memorize" the best game policy after a small number of iterations – Alireza – 2017-04-12T18:37:16.247

Well, I'm not very sure, but when it plays against me, it learns how to avoid losing against my strategy. It still needs a couple of training sets against me to learn properly, though. I thought I could have a self-trained program with both AIs using a Q-learning algorithm. I figured something was wrong in the training part rather than the implementation. Maybe I can train it against a minimax algorithm? – Irindul – 2017-04-16T15:07:47.873

I guess minimax would not be a good opponent for your algorithm: since minimax always plays the best move, your QL algorithm would always lose, and QL learns from both losing and winning, not only from losing. Self-trained QLs should work. Check your explore-exploit ratio. It should start from around 0.95 and decrease over time. – Alireza – 2017-04-16T20:07:46.840

I actually have a fixed ratio, I will try to make it decrease over time. – Irindul – 2017-04-18T09:25:03.543

I've checked, and after 200,000 games the AI has visited only approximately 3000 states. This means there is a problem in my implementation, doesn't it? – Irindul – 2017-04-18T14:11:44.243

Maybe it would be useful to add your pseudo-code to your question. I guess you could use a sigmoid function instead of a linear function for the explore/exploit ratio, to stay longer in exploring behavior and experience more states, then pass quickly through the transient phase and switch to exploit mode, if this is the problem – Alireza – 2017-04-19T04:43:15.603
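
To make the difference between schedules concrete, here is a minimal sketch comparing the `epsilon = 1/k` rule from the question with a sigmoid-shaped schedule. The function names and the parameters `epsMax`, `epsMin`, and `steepness` are hypothetical choices for illustration, not values anyone in this thread has tested.

```cpp
#include <cassert>
#include <cmath>

// The rule from the question: epsilon = 1/k, k = number of games played.
double inverseEpsilon(int game) {
    assert(game >= 1);
    return 1.0 / game;
}

// Hypothetical sigmoid schedule: stays near epsMax early in training,
// drops around the halfway point, and ends near epsMin.
double sigmoidEpsilon(int game, int totalGames, double epsMax = 0.95,
                      double epsMin = 0.01, double steepness = 10.0) {
    assert(totalGames > 0);
    double progress = static_cast<double>(game) / totalGames;  // 0 .. 1
    double s = 1.0 / (1.0 + std::exp(steepness * (progress - 0.5)));
    return epsMin + (epsMax - epsMin) * s;
}
```

With `1/k`, epsilon is already 0.01 after 100 games. Assuming epsilon changes once per game, the sum of `1/k` over 200,000 games is only about 12.8, so each move slot is explored randomly about 13 times in the whole run, which would be consistent with very few states being visited.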

I edited my post as asked. I haven't tried the sigmoid yet; I'll see if it changes anything, but I'm starting to believe my implementation is somewhat wrong. – Irindul – 2017-04-19T10:45:55.867

@Alireza Minimax should still work in this case. Because tic-tac-toe is so limited, a random player should still draw 2%-3% of games against an optimal minimax player when going second, and considerably more when going first. – user12889 – 2017-10-30T03:34:48.020

@user12889 As there are three outcomes (win, lose, tie), your algorithm should experience all three; only losing and drawing will not help. In most cases, letting RL algorithms play against themselves is a good strategy, since the two sides are then equally strong – Alireza – 2017-10-30T11:37:55.397

@Alireza Good point. Learning by playing against itself however has a bit of a reputation for being quite slow learning and often not working well at all (although in some cases, e.g. Google Alpha, it seems to work exceedingly well). Maybe select randomly between a range of possible opponents... – user12889 – 2017-10-31T22:10:09.997



I couldn't add a comment because of my low reputation but you can check this. It is about the state space. https://math.stackexchange.com/questions/485752/tictactoe-state-space-choose-calculation

Molnár István

Posted 2017-04-12T13:32:49.070

Reputation: 534

So there are actually 5000 states, but after a quick correction to my program, I only got 300 states visited. – Irindul – 2017-04-19T14:01:01.873

I didn't go through your code, but maybe you should try annealing the epsilon less in every iteration – Molnár István – 2017-04-19T16:00:49.887

I'd recommend checking out this article (http://brianshourd.com/posts/2012-11-06-tilt-number-of-tic-tac-toe-boards.html) on the number of states in tic-tac-toe. From my understanding of it, a reinforcement learning (Q-learning) AI program would need to know that there are 593 board states.

– Daniel G – 2017-12-28T00:43:10.820

@DanielG From what I know about Q learning, you can start with an empty Q matrix and just grow it as the bot learns. – Alexus – 2018-03-12T23:23:39.453
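
A minimal sketch of such a growing table, assuming boards are encoded as 9-character strings and each state has 9 action slots. The class name `GrowingQTable` and its methods are hypothetical and not from the original program.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Q-table that starts empty and adds a zero-filled row the first time
// a board is looked up, so the number of states need not be known upfront.
class GrowingQTable {
public:
    // Q-values for a board; a row of zeros is created on first access.
    std::array<double, 9>& row(const std::string& board) {
        assert(board.size() == 9);
        return table_[board];  // operator[] value-initializes a new row
    }

    // Number of distinct boards seen so far (handy for debugging coverage).
    std::size_t statesSeen() const { return table_.size(); }

private:
    std::unordered_map<std::string, std::array<double, 9>> table_;
};
```

Printing `statesSeen()` during training is a quick way to compare the states actually visited against the state counts discussed above.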