## How to handle invalid moves in reinforcement learning?

I want to create an AI which can play five-in-a-row/gomoku. As I mentioned in the title, I want to use reinforcement learning for this.

I use a policy gradient method, namely REINFORCE with baseline. For the value and policy function approximation, I use a neural network with convolutional and fully connected layers. All of the layers, except for the outputs, are shared. The policy's output layer has $$8 \times 8 = 64$$ output units (the size of the board) with a softmax over them, so the policy is stochastic. But what if the network assigns a very high probability to an invalid move? An invalid move is when the agent tries to mark a square which already has an "X" or an "O" in it. I think the game could get stuck in that state.

Could you recommend any solution for this problem?

My guess is to use the actor-critic method. For an invalid move, we should give a negative reward and pass the turn to the opponent.

---

Just ignore the invalid moves.

For exploration you likely won't just execute the move with the highest probability, but instead sample moves randomly from the output distribution. If you only punish illegal moves, they will still retain some probability (however small) and will therefore be executed from time to time (however seldom). So you will always be left with an agent that occasionally makes illegal moves.

To me it makes more sense to just set the probabilities of all illegal moves to zero and renormalise the output vector before you choose your move.
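A minimal sketch of this zero-and-renormalise step in Python with NumPy (the function and variable names are my own, not from the answer):

```python
import numpy as np

def renormalise(probs, legal_mask):
    """Zero out illegal moves and rescale the rest to sum to 1.

    probs:      softmax output over all squares
    legal_mask: boolean array, True where the square is still empty
    """
    masked = np.where(legal_mask, probs, 0.0)
    return masked / masked.sum()

# Toy 4-action case: actions 0 and 2 are illegal
probs = np.array([0.3, 0.4, 0.2, 0.1])
legal = np.array([False, True, False, True])
renormalise(probs, legal)  # probabilities become [0, 0.8, 0, 0.2]
```

You can then sample the move from the renormalised distribution exactly as before.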

Thank you. Probably I wasn't clear, but I do choose the move randomly based on the output probabilities. I will try your advice to set the probability of the illegal moves to zero and see what will happen. Have a nice day. – Molnár István – 2017-03-14T17:03:11.347

Do you think we need to take the invalid move with negative rewards for sample to train? – shtse8 – 2020-08-16T19:04:51.383

Maybe if there is consistently a lot of probability mass on illegal moves. Otherwise probably not. – BlindKungFuMaster – 2020-08-18T08:12:53.837

---

Usually, softmax methods in policy gradient with linear function approximation use the following formula to calculate the probability of choosing action $$a$$. Here, the weights are $$\theta$$, and $$\phi(s, a)$$ is the feature vector for the current state $$s$$ and an action from the set of actions $$A$$.

$$\pi(\theta, a) = \frac{e^{\theta \phi(s, a)}}{\sum_{b \in A} e^{\theta \phi(s, b)}}$$

To eliminate illegal moves, one would limit the set of actions to only those that were legal, hence $$Legal(A)$$.

$$\pi(\theta, a) = \frac{e^{\theta \phi(s, a)}}{\sum_{b \in Legal(A)} e^{\theta \phi(s, b)}}, \, a \in Legal(A)$$

In pseudocode the formula may look like this:

```
action_probs = Agent.getActionProbs(state)
legal_actions = filterLegalActions(state, action_probs)
best_legal_action = softmax(legal_actions)
```


Whether you use linear or non-linear function approximation (your neural network), the idea is to use only the legal moves when computing the softmax. This means the agent will only ever produce valid moves, which is good if you want to change your game later on, and the difference in value between the limited choice of actions will be easier for the agent to discriminate. It will also be faster, as the number of possible actions decreases.
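One way to sketch this restricted softmax for a neural network is to mask the raw scores (logits) to $$-\infty$$ before normalising, so illegal moves get exactly zero probability. This is my own illustrative sketch, not code from the answer:

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax over legal actions only.

    Illegal logits are set to -inf, so exp() maps them to exactly
    zero probability and they drop out of the normalising sum.
    """
    z = np.where(legal_mask, logits, -np.inf)
    z = z - z[legal_mask].max()   # subtract max for numerical stability
    exp_z = np.exp(z)             # exp(-inf) == 0 for illegal moves
    return exp_z / exp_z.sum()

logits = np.array([1.0, 2.0, 0.5, -1.0])
legal = np.array([True, False, True, True])
masked_softmax(logits, legal)  # action 1 gets probability exactly 0
```

Because the mask is applied to the logits before the softmax rather than to the probabilities afterwards, the relative probabilities of the legal moves match the formula above, and automatic differentiation frameworks can backpropagate through it directly: the illegal entries contribute zero to both the output and the gradient.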

@brianberns Did you manage to find an answer? It seems like that would be the case to me, but somehow in my toy example I'm only getting the right answer when using the log probabilities of the unfiltered softmax... – tryingtolearn – 2019-03-11T13:04:39.267

Very useful. Thanks for posting both the equations and pseudocode! – DukeZhou – 2017-12-22T17:43:05.947

The maths and pseudocode do not match here. Softmax over the legal move probabilities will adjust the relative probabilities. E.g. (0.3, 0.4, 0.2, 0.1) filtered with first and third item removed would be (0.0, 0.8, 0.0, 0.2) with your formula, but would be (0.0, 0.57, 0.0, 0.42) using the pseudocode. The pseudocode needs to take the logits, prior to action probability calculations. – Neil Slater – 2018-03-14T16:09:24.477

How does one compute the gradient of the filtered version of softmax? Seems like this would be necessary for backpropagation to work successfully, yes? – brianberns – 2018-03-22T14:26:08.337

---

I faced a similar issue recently with Minesweeper.

The way I solved it was by ignoring the illegal/invalid moves entirely.

1. Use the Q-network to predict the Q-values for all of your actions (valid and invalid)
2. Pre-process the Q-values by setting all of the invalid moves to zero or a large negative value (depending on your scenario)
3. Use a policy of your choice to select an action from the refined Q-values (e.g. greedy or Boltzmann)
4. Execute the selected action and resume your DQN logic
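The steps above might be sketched as follows. This is a hedged sketch with illustrative names; I use a $$-\infty$$ mask rather than zero, so that greedy selection still works when every legal Q-value is negative:

```python
import numpy as np

def select_action(q_values, legal_mask, epsilon=0.1, rng=None):
    """Epsilon-greedy action selection restricted to legal moves.

    q_values:   network output, one Q-value per action
    legal_mask: boolean array, True where the move is valid
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: sample uniformly among the legal moves only
        return int(rng.choice(np.flatnonzero(legal_mask)))
    # Exploit: -inf ensures argmax can never pick an invalid move,
    # even when every legal Q-value is negative
    masked_q = np.where(legal_mask, q_values, -np.inf)
    return int(np.argmax(masked_q))
```

The rest of the DQN loop (replay buffer, target network, etc.) is unchanged; only the action-selection step sees the mask.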

Hope this helps.


---

IMHO the idea of invalid moves is itself invalid. Imagine placing an "X" at coordinates (9, 9) on your $$8 \times 8$$ board. You could consider it an invalid move and give it a negative reward. Absurd? Sure!

But in fact your invalid moves are just a relic of the representation (which itself is straightforward and fine). The best treatment of them is to exclude them completely from any computation.

This gets more apparent in chess:

• In a positional representation, you might consider the move a1-a8, which only belongs in the game if there's a Rook or a Queen at a1 (and some other conditions hold).

• In a different representation, you might consider the move Qb2. Again, this may or may not belong to the game: if the current player has no Queen, it surely does not.

As the invalid moves are related to the representation rather than to the game, they should not be considered at all.

Great point. In [M] games, which are played on Sudoku, the constraints make many positions (coordinates+value) illegal after the first placement. There is no value in considering these illegal positions from the standpoint of placement, but an important strategic layer is recognizing which placements minimize the value of remaining, unplayed positions. (i.e. if I place an 8 here, it blocks my opponent from placing an 8 in that row, column or region. Essentially, "how many strategic positions does this placement remove from the gameboard?") – DukeZhou – 2018-01-12T18:08:33.343