Using the opponent's mixed strategy to estimate the state value in minimax Q-learning


In the paper Markov games as a framework for multi-agent reinforcement learning (which introduces the minimax Q-learning algorithm), at the bottom left of page 3, my understanding is that the author suggests, for a simultaneous two-player zero-sum game, doing Bellman iterations with $$V(s)=\min_{o}\sum_{a}\pi_{a}Q(s,a,o)$$ where $\pi_{a}$ is the probability that the maximizing player assigns to action $a$ under their best mixed strategy in state $s$.

If my understanding is correct, why does the opponent in this equation play a pure strategy ($\min_{o}$) rather than their best mixed strategy in state $s$? That would instead give $$V(s)=\sum_{o}\sum_{a}\pi_{a}\pi_{o}Q(s,a,o)$$ with $\pi_{o}$ the opponent's best mixed strategy in state $s$. Which of the two formulations is correct, and why? Are they somehow equivalent?
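For what it's worth, the two expressions do coincide in one important case: by the minimax theorem, when $\pi_{a}$ is the maximizer's equilibrium strategy and $\pi_{o}$ is the opponent's equilibrium strategy of the zero-sum matrix game $Q(s,\cdot,\cdot)$, the worst-case pure response achieves exactly the equilibrium value. A minimal numerical sketch, using matching pennies as an illustrative game (my own example, not from the paper):

```python
# Illustrative check on matching pennies (an assumed example game,
# not taken from the paper): at the equilibrium mixed strategies,
# the worst-case backup  min_o sum_a pi_a * Q[a][o]  equals the full
# expectation  sum_o sum_a pi_a * pi_o * Q[a][o]  (minimax theorem).

Q = [[+1, -1],   # Q[a][o]: payoff to the maximizing player
     [-1, +1]]

pi_a = [0.5, 0.5]  # maximizer's equilibrium mixed strategy
pi_o = [0.5, 0.5]  # opponent's equilibrium mixed strategy

# Worst case over the opponent's pure actions (the paper's formulation).
v_minimax = min(sum(pi_a[a] * Q[a][o] for a in range(2)) for o in range(2))

# Expectation under both equilibrium mixed strategies.
v_nash = sum(pi_a[a] * pi_o[o] * Q[a][o]
             for a in range(2) for o in range(2))

print(v_minimax, v_nash)  # both 0.0: the game's value
```

Away from equilibrium the two backups can differ, which is presumably the point of the question.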

The context of this question is that I am trying to use minimax Q learning with a Neural Network outputting the matrix $Q(s,a,o)$ for a simultaneous zero-sum game. I have tried both methods and so far have seen seemingly equally bad results, quite possibly due to bugs or other errors in my method.

Agade

Posted 2019-01-10T10:01:16.663

Reputation: 121

Answers


My understanding is now that the author's formula is deliberate: it seeks to learn a worst-case maximizing policy. The formula I suggest would, I believe, instead correspond to Nash Q-learning, where the agent seeks to learn to play a Nash equilibrium.
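The worst-case nature of the $\min_{o}$ backup can be seen numerically: for any fixed maximizer strategy $\pi_{a}$, the $\min_{o}$ value is a lower bound on the expectation under any opponent mixture $\pi_{o}$, and the bound is strict whenever $\pi_{a}$ is not an equilibrium strategy. A small sketch (again matching pennies, with a deliberately non-equilibrium $\pi_{a}$ chosen for illustration):

```python
# Sketch (assumed example, not from the paper): away from equilibrium,
# the min_o backup is a pessimistic lower bound on the expectation
# under any opponent mixed strategy pi_o.

Q = [[+1, -1],   # Q[a][o]: payoff to the maximizing player
     [-1, +1]]

pi_a = [0.7, 0.3]   # a non-equilibrium strategy for the maximizer
pi_o = [0.5, 0.5]   # one possible opponent mixture

# Worst-case value over the opponent's pure actions.
v_worst = min(sum(pi_a[a] * Q[a][o] for a in range(2)) for o in range(2))

# Expected value under the fixed opponent mixture pi_o.
v_expected = sum(pi_a[a] * pi_o[o] * Q[a][o]
                 for a in range(2) for o in range(2))

print(v_worst, v_expected)  # approximately -0.4 vs 0.0: v_worst <= v_expected
```

So the two backups train toward different targets: minimax-Q evaluates the policy against the worst pure response, while the mixed-mixed expectation evaluates it against a specific opponent strategy.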

After debugging, I have obtained good results with the second formula, but cannot speak for the original minimax Q-learning one.

Agade
