RL: What should be the output of the NN for an agent trying to learn how to play a game?


Say the game is tic tac toe. I found two possible output layers:

  1. Vector of length 9: each float represents one action (one of the 9 boxes in Tic Tac Toe). The agent plays the action with the highest value. The agent learns the rules through trial and error: when it tries to make an illegal move (i.e. placing a piece on a box that is already occupied), the reward is harshly negative (-1000 or so).
  2. A single float: the float represents who is winning (positive = "the agent is winning", negative = "the other player is winning"). The agent does not know the rules of the game. Each turn the agent is presented with all the possible next states (resulting from playing each legal action) and it chooses the state with the highest output value.
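For concreteness, here is a rough numpy sketch of the two output shapes I mean, with random weights standing in for a trained network (the board encoding +1 = my piece, -1 = opponent, 0 = empty is my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Board encoding: 9 cells, +1 = agent's piece, -1 = opponent's, 0 = empty.
state = np.array([1, -1, 0, 0, 1, 0, 0, 0, -1], dtype=float)

# --- Option 1: one output per action (vector of 9 Q-values) ---
# Random weights stand in for a trained network.
W1 = rng.normal(size=(9, 32))
W2 = rng.normal(size=(32, 9))
q_values = np.tanh(state @ W1) @ W2          # shape (9,)
action = int(np.argmax(q_values))            # play the highest-valued cell

# --- Option 2: a single scalar value per *state* ---
V1 = rng.normal(size=(9, 32))
V2 = rng.normal(size=(32, 1))

def value(s):
    """Scalar estimate of how good state s is for the agent."""
    return (np.tanh(s @ V1) @ V2).item()

# Enumerate the states resulting from each legal move, pick the best one.
legal = [i for i in range(9) if state[i] == 0]
afterstates = [(value(np.where(np.arange(9) == i, 1.0, state)), i)
               for i in legal]
best_value, best_move = max(afterstates)

print(q_values.shape, action, best_move)
```

Note the cost difference: option 1 is one forward pass for all 9 actions, while option 2 needs one forward pass per candidate next state.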

What other options are there?

I like the first option because it's cleaner, but it's not feasible with games that have thousands or millions of actions. Also, I am worried that the agent might not really learn the rules. E.g. say that in state S the action A is illegal, and that state R is extremely similar to state S but action A is legal in state R (and maybe in state R action A is actually the best move!). Isn't there the risk that by learning not to play action A in state S it will also learn not to play action A in state R? Probably not an issue in Tic Tac Toe, but likely one in any game with more complex rules. What are the disadvantages of option 2?

Does the choice depend on the game? What's your rule of thumb when choosing the output layer?


Posted 2020-04-22T17:04:23.080

Reputation: 243



It depends on whether the action is part of the input or the output of the neural network estimating the Q-value Q(state, action).

[Figure: alternative architectures for the Q-function neural network]

The network on the left takes the state as input and outputs one scalar value for each of the categorical actions. It has the advantage of being easy to set up and needs only one network evaluation to predict the Q-values of all actions. If the action space is categorical and single-dimensional, I would use it.

The network on the right takes both the state and a representation of the action as input and outputs one single scalar value. This architecture also makes it possible to compute the Q-value for multi-dimensional and continuous action spaces.
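A toy numpy sketch of this right-hand architecture, with random weights standing in for a trained network and a one-hot action encoding as an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Input = state (9 dims) concatenated with a one-hot action (9 dims);
# output = a single scalar Q-value. Random weights stand in for training.
W1 = rng.normal(size=(9 + 9, 16))
W2 = rng.normal(size=(16, 1))

def q(state, action):
    a = np.zeros(9)
    a[action] = 1.0
    x = np.concatenate([state, a])
    return (np.tanh(x @ W1) @ W2).item()

state = np.zeros(9)
# Unlike the left architecture, evaluating all actions needs one
# network evaluation *per action*.
q_all = [q(state, a) for a in range(9)]
print(len(q_all))
```

For a continuous action space, the one-hot vector would simply be replaced by the continuous action variables themselves.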

The action space of tic-tac-toe can be easily represented by a vector of length 9, thus I would recommend the left NN-architecture. However, if your game has continuous-valued variables in the action space (e.g. the position of your mouse pointer), you should use the NN-architecture on the right.
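With the left architecture, illegal moves can also simply be masked out at action-selection time, so the argmax never picks an occupied cell (a sketch with hypothetical Q-values):

```python
import numpy as np

# Hypothetical Q-values from the left architecture: one scalar per cell.
q_values = np.array([0.2, -0.5, 0.9, 0.1, 0.0, -0.3, 0.4, 0.8, -0.1])

# Board: 0 = empty, nonzero = occupied.
board = np.array([1, 0, -1, 0, 0, 0, 0, 1, 0])

# Set occupied cells to -inf so argmax can only select legal moves.
masked = np.where(board == 0, q_values, -np.inf)
action = int(np.argmax(masked))
print(action)  # -> 6: cells 2 and 7 have higher raw Q, but are occupied
```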

The approach to preventing illegal moves is only partially dependent on the choice of Q-function architecture and is covered by another question: How to forbid actions


Posted 2020-04-22T17:04:23.080

Reputation: 11

Can you elaborate your answer with the tic-tac-toe example? – D_Raja – 2020-04-30T04:20:06.040


In reinforcement learning, neural networks are often used to estimate the value function (the worth of a board state) rather than to choose the action directly. In most games the available actions are state-dependent anyway, so you cannot easily formulate them as ANN outputs.

So the idea is that at each state you consider the alternative actions and choose the one that leads to the most valuable successor state (no deeper lookahead needed). Your ANN thus approximates the board-state values.
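A sketch of that selection rule, assuming a hypothetical value table V (it could equally be an ANN) that maps a board tuple to an estimated value:

```python
# Value estimates learned elsewhere; unseen states default to 0.0.
V = {}

def legal_moves(board):
    return [i for i, c in enumerate(board) if c == 0]

def afterstate(board, move, player=1):
    nxt = list(board)
    nxt[move] = player
    return tuple(nxt)

def greedy_move(board):
    # Pick the move whose resulting state the value function rates highest.
    return max(legal_moves(board),
               key=lambda m: V.get(afterstate(board, m), 0.0))

board = (1, -1, 0, 0, 0, 0, 0, 0, 0)
V[afterstate(board, 4)] = 0.7   # pretend learning rated the centre highly
print(greedy_move(board))       # -> 4
```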

Strictly speaking, for tic-tac-toe you don't need a neural network; tabular Q-learning would suffice. Have you read the Sutton and Barto book on RL?
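A minimal tabular Q-learning sketch for tic-tac-toe, assuming (as a simplification) that states are board tuples and rewards arrive at the end of the game (+1 win, -1 loss, 0 otherwise):

```python
import random

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
Q = {}  # (state, action) -> estimated value

def legal_moves(state):
    return [i for i, c in enumerate(state) if c == 0]

def choose(state):
    # Epsilon-greedy over legal actions only.
    moves = legal_moves(state)
    if random.random() < EPS:
        return random.choice(moves)
    return max(moves, key=lambda a: Q.get((state, a), 0.0))

def update(state, action, reward, next_state, done):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a').
    target = reward
    if not done:
        target += GAMMA * max(
            (Q.get((next_state, a), 0.0) for a in legal_moves(next_state)),
            default=0.0,
        )
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (target - old)

# One illustrative step from the empty board.
s = (0,) * 9
a = choose(s)
s2 = tuple(1 if i == a else 0 for i in range(9))
update(s, a, 0.0, s2, done=False)
print(len(Q))  # -> 1: one entry after the first update
```

The full agent would wrap this in self-play episodes; the table stays small because tic-tac-toe has only a few thousand reachable states.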


Posted 2020-04-22T17:04:23.080

Reputation: 146