
I'm working on implementing a Q-learning algorithm for a 2-player board game.

I encountered what I think may be a problem. When it comes time to update the Q value with the Bellman equation (above), the last part states that for the maximum expected reward, one must find the highest Q value in the new state `s'` reached after making action `a`.

However, it seems like I never have Q values for state `s'`. I suspect `s'` can only be reached by P2 making a move; it may be impossible for this state to be reached as a result of an action from P1. Therefore, the board state `s'` is never evaluated by P2, and thus its Q values are never computed.

I will try to paint a picture of what I mean. Assume P1 is a random player, and P2 is the learning agent.

- P1 makes a random move, resulting in state `s`.
- P2 evaluates board `s`, finds the best action and takes it, resulting in state `s'`. In the process of updating the Q value for the pair `(s, a)`, it finds `max Q(s', a') = 0`, since the state hasn't been encountered yet.
- From `s'`, P1 again makes a random move.

As you can see, state `s'` is never encountered by P2, since it is a board state that appears only as a result of P2 making a move. Thus the last part of the equation will always evaluate to `0 - current Q value`.
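The situation above can be sketched in code (a toy illustration with made-up names, not the asker's actual implementation): if P2 updates its Q table immediately after its own move, the bootstrap term looks up a state that P2 never acts from, so that lookup always returns the default of 0.

```python
from collections import defaultdict

# Tabular Q values: Q[state][action], defaulting to 0.0 for unseen pairs.
Q = defaultdict(lambda: defaultdict(float))

alpha, gamma = 0.1, 0.9  # learning rate and discount factor, arbitrary here

def naive_update(s, a, reward, s_after_own_move, actions):
    """The problematic update: it bootstraps on the state reached right
    after P2's own move. P2 never acts *from* such a state (P1 moves
    next), so its Q values are never written and the max term is always
    the default 0."""
    max_future = max(Q[s_after_own_move][b] for b in actions)
    Q[s][a] += alpha * (reward + gamma * max_future - Q[s][a])

# Toy trace: P1's random move produced "s"; P2 plays "a", reaching "s_prime".
naive_update("s", "a", reward=1.0, s_after_own_move="s_prime", actions=["a", "b"])
# Q["s_prime"] was never written, so the bootstrap term contributed nothing.
```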

Am I seeing this correctly? Does this affect the learning process? Any input would be appreciated.

Thanks.

Thanks so much for the detailed answer. So if I understand what you are saying correctly, the q table should only be updated after P1 has had a chance to reply with a move? I think I get it. The state right after P2 takes an action is only used to find the immediate reward but plays no part whatsoever in finding the future expected reward. In order to get this future reward, one must wait until the p1 agent responds to the move, and consider this state as the one that yields the future reward. You will probably find this facetious, but I'd like to make sure this makes sense. – Pete – 2019-04-18T09:56:40.023

P1 is in state `s1` and makes move `a1`, resulting in `s2`. P2 evaluates `s2` and responds with an action that leads to state `s3` (but doesn't update the Q table yet). From here P1 makes another move, resulting in state `s4`. It is at this point that P2 updates the Q table using the Bellman equation. In this equation, the immediate reward is the reward from state `s3` (the one that resulted from P2 making a move) and the future reward is the reward that resulted from P1 responding to `s3`, i.e., the reward for `s4`. Is this correct? Many thanks! – Pete – 2019-04-18T09:56:54.710

Your first comment is mostly correct. However, the immediate reward is assessed at the same time as $s'$ and you should think of it the same way as $s'$ here - it is returned by the environment. There is a minor complication that the game may terminate on either P1 or P2's turn. – Neil Slater – 2019-04-18T10:48:07.200

From the second comment, you should be careful about your labelling. The time steps don't occur like that as far as the agent is concerned. However, in your terms, you update the estimate for $Q(s_1, a_1)$ when P1 receives control of the game back (when receiving the immediate reward and next state), i.e. once it knows that the next state is $s_3$. The agent should look ahead using its existing Q table to see what the best action is. That is not the same as "the reward for $s_4$" - it is the expected discounted return for $s_3, a_3$. – Neil Slater – 2019-04-18T10:55:54.980
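The delayed update described in this exchange might be sketched as follows (all names are hypothetical, and greedy action selection stands in for a proper ε-greedy policy): the learning agent remembers its last (state, action) pair and only performs the Q update once it regains control, i.e. after the opponent has replied.

```python
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.9  # arbitrary learning rate and discount

class LearningAgent:
    """P2: defers the Q update until it next receives control,
    so the bootstrap state is one it can actually act from."""
    def __init__(self, actions):
        self.actions = actions
        self.pending = None  # (state, action) awaiting its update

    def act(self, state, reward):
        # `state` is what P2 sees after P1's reply; `reward` is the
        # immediate reward the environment returned for the last move.
        if self.pending is not None:
            s, a = self.pending
            best_next = max(Q[state][b] for b in self.actions)
            Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
        # Greedy selection, purely for brevity in this sketch.
        action = max(self.actions, key=lambda b: Q[state][b])
        self.pending = (state, action)
        return action
```

Note that on P2's very first move `pending` is `None`, so no update happens; the first update occurs just before P2's second move, using the existing Q estimates rather than the actual results of that move.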

If P1 (the random AI) makes action `a1`, resulting in state `s1`, and P2 (the learning agent) makes action `a2` on state `s1`, resulting in state `s2`, wouldn't I be updating the estimate for `Q(s1, a2)`, since I am estimating how good P2's move (`a2`) is given the state reached by P1 (`s1`), using the reward gained in `s2` as the immediate reward? – Pete – 2019-04-18T11:04:55.640

@Pete: Yes, I missed that you started with the non-learning agent. The numbering scheme you are using for turns is not helping. – Neil Slater – 2019-04-18T11:10:29.477

Yes, sorry about that, it is definitely confusing. Okay, I just wanted to make sure. And just to clarify, when it is P2's first move, the Q Table cannot be updated, since it has not received P1's reply to the move and thus cannot calculate the expected discounted reward. I.e. the q table should only be updated starting from P2's second move. Correct? – Pete – 2019-04-18T11:18:26.537

@Pete: Correct, until P2 knows $s, a, r, s'$ then it cannot calculate $NewQ(s,a)$. Technically this update can be made *just before* P2 makes its second move (because in Q learning you don't care what $a'$ is or the *actual* results of making that second move, you use the existing estimates), but P2 must be ready to make such a move. – Neil Slater – 2019-04-18T11:40:39.223

Right. Thanks for being patient and expansive in your answers. You helped me a ton! – Pete – 2019-04-18T11:55:24.593