Does Q Learning learn from an opponent playing random moves?


I've created a Q Learning algorithm to play Connect Four against an opponent who just chooses a random free column. My Q Agent is currently only winning about 49% of its games on average over 30,000 episodes. Will my Q Agent actually learn from these episodes, seeing as its opponent isn't 'trying' to beat it, as there's no strategy behind its random choices? Or should this not matter – if the Q Agent is playing enough games, it doesn't matter how good/bad its opponent is?


Posted 2020-05-03T22:05:05.463



Is it by any chance for this competition: ? If not, then you may find quite a lot of advice on the forums and notebooks there.

– Neil Slater – 2020-05-04T07:19:18.880

It's not for that but I'll take a look there. Thanks. – mason7663 – 2020-05-04T07:21:09.903



It should be possible to train an agent using some variant of DQN to beat a random agent around 100% of the time within a few thousand games.

It may require one or two more advanced techniques to get the learning time down to a low number of thousands. However, if your agent is winning ~50% of games against a random agent, something has gone wrong, since that is the performance you would expect of another random agent. Even simple policies, such as always playing in the same column, will beat a random agent a significant fraction of the time.

The first thing to consider is that Connect 4 has too many states for tabular Q learning to be practical. You will need to use some variant of DQN. As this is a grid-based board game where winning patterns can repeat at different positions, some form of convolutional neural network (CNN) for the Q function is probably a good start.
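As a rough sketch of what the CNN would consume (the function name and channel layout here are my own illustrative choices, not from any particular library), a Connect 4 position can be encoded as a two-channel 6x7 array, one channel per player, always from the perspective of the player to move:

```python
import numpy as np

def encode_board(board, current_player):
    """Encode a 6x7 Connect Four board (0 = empty, 1/2 = player discs)
    as a 2-channel float array for a CNN: channel 0 holds the current
    player's discs, channel 1 the opponent's. Returns shape (2, 6, 7)."""
    board = np.asarray(board)
    opponent = 3 - current_player
    return np.stack([(board == current_player).astype(np.float32),
                     (board == opponent).astype(np.float32)])
```

Encoding relative to the player to move means the same network can evaluate positions for either side without learning two separate representations.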

I think for a first step, you should double-check that you have implemented DQN correctly. Check that the TD target formula is correct, and that you have implemented experience replay. Ideally you will also have a delayed-update target network for calculating the TD targets.
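For reference, the TD target calculation is a common place for bugs. A minimal batched version (sketched in NumPy; in practice `next_q_values` would come from a forward pass of the target network) is $r + \gamma \max_{a'} Q_{target}(s', a')$, with the bootstrap term zeroed for terminal transitions:

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute DQN TD targets for a batch of transitions.

    rewards:       shape (batch,)
    next_q_values: shape (batch, n_actions), from the *target* network
    dones:         shape (batch,), 1.0 where the episode ended
    """
    max_next_q = next_q_values.max(axis=1)
    # No bootstrapping from terminal states: the target is just the reward
    return rewards + gamma * max_next_q * (1.0 - dones)
```

Forgetting the `(1.0 - dones)` mask is a classic mistake: the agent then bootstraps value out of terminal states, and win/loss rewards never propagate cleanly.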

As a second step, try varying some hyper-parameters: the learning rate, exploration rate, size of the replay table, number of games to play before starting learning, etc. A discount factor $\gamma$ slightly below 1 can help (despite this being an episodic problem) - it makes the agent forget more of the initial bias for early time steps.
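The exploration rate in particular usually wants a decay schedule rather than a fixed value. A common pattern (the constants here are illustrative starting points, not recommendations) is epsilon-greedy selection over the valid columns with exponential decay per episode:

```python
import random

def epsilon_by_episode(episode, eps_start=1.0, eps_end=0.05, decay=0.999):
    """Exponentially decay epsilon from eps_start towards eps_end."""
    return eps_end + (eps_start - eps_end) * (decay ** episode)

def epsilon_greedy(q_values, valid_actions, epsilon, rng=random):
    """With probability epsilon pick a random valid column,
    otherwise pick the valid column with the highest Q value."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_values[a])
```

Restricting the argmax to `valid_actions` matters in Connect 4: full columns must never be selected, even if the network assigns them the highest value.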

Or should this not matter – if the Q Agent is playing enough games, it doesn't matter how good/bad its opponent is?

Up to a point this is true. It is hard to learn against a perfect opponent in Connect 4, because a perfect player always wins as player one - every policy you try loses, so all policies look equally good and there is nothing to learn. Other than that, if there is a way to win, eventually a Q learning agent with exploration should find it.
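For completeness, the underlying tabular update that makes this work (sketched here with a plain dict as the Q table; the state keys and parameter values are my own illustrative choices) is the standard Q learning rule:

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.95, done=False):
    """One tabular Q learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done or not next_actions else max(
        q_table[(next_state, a)] for a in next_actions)
    key = (state, action)
    q_table[key] += alpha * (reward + gamma * best_next - q_table[key])
```

With `q_table = defaultdict(float)`, unseen state-action pairs default to a value of 0, so the table only grows as states are actually visited - which, as you have found, still becomes very large in Connect 4.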

Against a random agent, you should be seeing some improvement if your agent is correctly set up for the problem, after a few thousand games. As it happens I am currently training Connect 4 agents using variants of DQN for a Kaggle competition, and they consistently beat random agents with 100% measured success rate after 10,000 training games. I have added a few extras to my agents in order to achieve this - there are some discussions of approaches in the forums at

Neil Slater



Thanks, Neil. I'm using 'vanilla' Q Learning and no NNs. After 100,000 episodes, my agent is winning 86% of its games on average – though my Q Table file is now over 150 MB! – mason7663 – 2020-05-04T09:51:06.353

@mason7663: That's not too bad a result for a tabular approach. Yes, the problem will be that it is impossible to store all possible states, or even visit them all in a single lifetime, using a simple approach. I don't think you will hit a 100% success rate against a random agent if you continue - although I guess you may beat 90% or even 95% before you start to run out of RAM for your table. – Neil Slater – 2020-05-04T11:20:09.420