I'm trying to implement a Deep Q-Network in Keras/TF that learns to play Minesweeper (a stochastic environment). The agent learns to play the game quite well with both small and large board sizes, but it only converges when the mine layout is identical in every game. If I randomize the mine distribution from game to game, the agent learns nothing, or close to it. I have tried various network architectures and hyperparameters, but to no avail.
I tried a lot of network architectures, all with the same input/output encoding:
- The input to the network is the entire board matrix, with each cell valued -1 if unrevealed, or 0 to 8 (the count of adjacent mines) if revealed.
- The output of the network is also the entire board, representing the estimated desirability (Q-value) of clicking each cell.
- Fully connected hidden layers (both wide and deep)
- Convolutional hidden layers (stacking them, varying kernel sizes, padding, etc.); a minimal sketch of this variant follows the list
- Adding dropout after the hidden layers
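For concreteness, here is a minimal sketch of the convolutional variant with the board-in/board-out encoding described above. The board size, layer widths, kernel sizes, and learning rate are illustrative placeholders, not the exact values I used:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

BOARD_SIZE = 9  # placeholder board dimension

def build_dqn(board_size=BOARD_SIZE):
    # Input: the raw board matrix (-1 for unrevealed, 0-8 for revealed),
    # with a trailing channel axis added for the conv layers.
    inputs = layers.Input(shape=(board_size, board_size, 1))
    x = layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(x)
    # Output: one linear Q-value per cell, same spatial shape as the board,
    # flattened so each index corresponds to one clickable cell.
    q = layers.Conv2D(1, kernel_size=1, padding="same", activation="linear")(x)
    q = layers.Reshape((board_size * board_size,))(q)
    model = models.Model(inputs, q)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return model

# Usage: feed a single board state with batch and channel axes added.
# model = build_dqn()
# q_values = model.predict(state[None, ..., None])  # state: (board_size, board_size)
```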
Is DQN applicable to environments that change every episode, or have I approached this from the wrong angle?
It seems that no matter the network architecture, the agent won't learn. Any input is greatly appreciated; please let me know if you need any code or further explanation. Thank you.