Deep Q-Learning: poor convergence in a stochastic environment


I'm trying to implement a Deep Q-Network (DQN) in Keras/TF that learns to play Minesweeper (our stochastic environment). I have noticed that the agent learns to play the game pretty well with both small and large board sizes. However, it only converges/learns when the layout of the mines is the same for each game. That is, if I randomize the mine distribution from game to game, the agent learns nothing, or close to it. I have tried various network architectures and hyperparameters, but to no avail.

Here is my setup, along with the architectures I have tried:

  1. The input to the network is the entire board matrix, with individual cells having a value of -1 if unrevealed, or 0 to 8 (the adjacent-mine count) if revealed.
  2. The output of the network is also the entire board, representing the desirability of clicking each cell.
  3. Fully connected hidden layers (both wide and deep)
  4. Convolutional hidden layers (stacking them, varying kernel sizes, padding, etc.)
  5. Adding Dropout after the hidden layers
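For concreteness, here is a minimal sketch of the input encoding from points 1-2 (NumPy only; the function and variable names are illustrative placeholders, not my actual implementation):

```python
import numpy as np

def encode_board(revealed, counts):
    """Encode a Minesweeper board as a single-channel matrix:
    -1 for unrevealed cells, 0-8 (adjacent-mine count) for revealed ones.

    revealed: boolean array (H, W), True where the cell has been clicked
    counts:   int array (H, W), adjacent-mine count for every cell
    """
    state = np.where(revealed, counts, -1)
    return state.astype(np.float32)

# Toy 3x3 example: only the centre cell is revealed, showing a count of 2.
revealed = np.zeros((3, 3), dtype=bool)
revealed[1, 1] = True
counts = np.full((3, 3), 2)
state = encode_board(revealed, counts)
# state[1, 1] == 2.0; every other entry == -1.0
```

The network then outputs one Q-value per cell over this same H×W grid.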

Is DQN applicable to environments that change every episode, or have I approached this from the wrong angle?

It seems that no matter the network architecture, the agent won't learn. Any input is greatly appreciated. Please let me know if you need any code or further explanation. Thank you.


Posted 2018-11-17T11:39:38.267




The inputs you describe seem like they should be sufficient for a DQN-based agent to learn a good strategy for playing Minesweeper, regardless of whether the starting layout changes. They contain all the necessary information.

However, the problem certainly becomes much easier (probably too easy) if the initial problem is always the same. The DQN algorithm will be very likely to "pick up" on this trend and "exploit" it. The inputs may be sufficient to learn a more general Minesweeper strategy, but if it is consistently faced with the same "level" every single time, it will be much easier for the DQN algorithm to just memorize exactly where the mines are and play perfectly based on that memory, rather than any actual strategy. Due to the way learning is implemented (based on gradient descent), the algorithm will generally tend to converge to such an easily-reachable "memorization" strategy rather than something that is actually "smart".

For that reason, I do think training will be much slower when the layout is randomized. I'm not just thinking along the lines of, e.g., twice as slow; I would expect successful learning to require multiple orders of magnitude more experience. That's just an educated guess, though; I've never trained a DQN for Minesweeper specifically, so I can't say for sure. It might also require more elaborate hyperparameter tuning, and perhaps a different (probably larger) network architecture.

Dennis Soemers



Thanks for your input! I certainly agree that the agent is memorizing the game rather than actually learning strategies. I'll try running it with a relatively large network for a day and see how it does.

On a side-note: Do you recommend CNN to exploit spatial structure in the game or just flatten out the board and go with a fully connected network? – Sanavesa – 2018-11-17T19:01:01.817

@Sanavesa Hmm, my initial guess would be that... CNNs seem like they should be applicable here? I'm really not 100% confident though, might be missing something. – Dennis Soemers – 2018-11-17T19:32:20.693

After 24 hours of training (~500k games), it is fluctuating unstably around a 0-5% win rate. I don't think it is slow; I think the learning is non-existent lol. – Sanavesa – 2018-11-19T00:42:11.957


@Sanavesa Hmmm, there do seem to be others who have used DQN for Minesweeper. For example: . Maybe there are important differences between your implementation and theirs in terms of network architecture, reward structure, or the like? In theory I really can't think of any reason why it shouldn't ever be able to work, but it's not going to be an easy problem. Deep RL isn't exactly known for its stability/reliability either; if things like hyperparameters aren't just right, it can break down quite easily.

– Dennis Soemers – 2018-11-19T08:37:38.453

Thank you very much, I'll take a look. Off the bat, the biggest difference is the network structure: the author uses a multi-channel one-hot encoding of the cells instead of a single channel. That could be the reason. – Sanavesa – 2018-11-19T15:02:52.860

@Sanavesa Aaah yes, I think that may be important; I should've thought of that. Actually I did, but it was when I wasn't behind my computer, and I forgot about it again later :D If you just use numerical entries, neural networks will "think" that there is some level of... "continuity" in the inputs, and try to generalize across cells with "close" input numbers. I don't think that behaviour is desirable in Minesweeper, where the best strategy is going to involve more "binary" choices. – Dennis Soemers – 2018-11-19T16:30:06.037
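To make the one-hot idea from the comments concrete, here is a minimal sketch of converting the single-channel board into per-value planes (NumPy only; the function name and channel layout are illustrative, not taken from either implementation):

```python
import numpy as np

def one_hot_board(state, num_channels=10):
    """Convert a single-channel board (-1 = unrevealed, 0-8 = revealed
    adjacent-mine count) into a multi-channel one-hot tensor.

    Channel 0 marks unrevealed cells; channels 1-9 mark revealed
    counts 0-8. Output shape: (H, W, num_channels).
    """
    onehot = np.zeros(state.shape + (num_channels,), dtype=np.float32)
    onehot[..., 0] = (state == -1)          # unrevealed plane
    for count in range(9):                  # one plane per count value
        onehot[..., count + 1] = (state == count)
    return onehot

# Toy 2x2 board: unrevealed, count 0, count 3, count 8.
state = np.array([[-1, 0], [3, 8]])
planes = one_hot_board(state)
# planes[0, 0, 0] == 1 (unrevealed), planes[1, 0, 4] == 1 (count 3)
```

This removes the spurious numerical ordering between cell values, so the network no longer has to "unlearn" the idea that a 3 is somehow between a 2 and a 4 in terms of strategy.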