
In the paper *Markov games as a framework for multi-agent reinforcement learning* (which introduces the minimax Q-learning algorithm), at the bottom left of page 3, my understanding is that the author suggests, for a simultaneous 1v1 zero-sum game, doing Bellman iterations with $$V(s)=\min_{o}\sum_{a}\pi_{a}Q(s,a,o)$$ where $\pi_{a}$ is the probability of playing action $a$ under the maximizing player's best mixed strategy in state $s$.

If my understanding is correct, why does the opponent in this equation play a pure strategy ($\min_{o}$) rather than his best mixed strategy in state $s$? Using the opponent's best mixed strategy would instead give $$V(s)=\sum_{o}\sum_{a}\pi_{a}\pi_{o}Q(s,a,o)$$ where $\pi_{o}$ is the opponent's best mixed strategy in state $s$. Which of these two formulations is correct, and why? Are they somehow equivalent?
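For concreteness, here is how I compare the two formulations numerically on a single small matrix game (this is my own sketch using `scipy.optimize.linprog`, not code from the paper; `maximin_strategy` is a helper I wrote for the check):

```python
# Sketch: compute both value formulations for a random zero-sum matrix game.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 3))  # Q[a, o]: payoff to the maximizing player

def maximin_strategy(M):
    """Best mixed strategy for the row (maximizing) player of payoff matrix M."""
    n_a, n_o = M.shape
    # Variables x = [pi_1 .. pi_{n_a}, v]; maximize v, i.e. minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For each opponent column o: v - sum_a pi_a * M[a, o] <= 0.
    A_ub = np.hstack([-M.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to 1; v is unbounded.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_a], res.x[-1]

pi, v = maximin_strategy(Q)        # maximizer's best mixed strategy and value
sigma, _ = maximin_strategy(-Q.T)  # opponent's best mixed strategy (plays -Q^T)

V_pure = np.min(pi @ Q)   # first formulation: opponent best-responds in pure actions
V_mixed = pi @ Q @ sigma  # second formulation: both players use mixed strategies
print(V_pure, V_mixed)
```

In my runs the two printed values agree up to LP tolerance, which is what pushed me to ask whether the equivalence holds in general or is an accident of my test matrices.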

The context of this question is that I am trying to use minimax Q learning with a Neural Network outputting the matrix $Q(s,a,o)$ for a simultaneous zero-sum game. I have tried both methods and so far have seen seemingly equally bad results, quite possibly due to bugs or other errors in my method.