
In Appendix B of MuZero, they say

In two-player zero-sum games the value functions are assumed to be bounded within the $[0, 1]$ interval.

I'm confused about the bounds: shouldn't the value/utility function be in the range $[-1, 1]$ for two-player zero-sum games?

Hi @nikos, if we use $[0, 1]$ as the bound for the value functions, then how can the game be zero-sum? – Maybe – 2020-05-06T01:07:45.010

The utility function is valued from the perspective of ONE of the 2 players, so it definitely shouldn't sum to zero! If you train it properly, it should give a larger percentage (probability) to the AI player winning. You should regard "zero sum" in the sense one wins and the other loses – nikos – 2020-05-07T10:50:20.297

Hi @nikos. The game will return either $+1$ or $-1$ to the agent to indicate whether it wins or not. I get that the agent is unlikely to take actions that cause it to lose, but I cannot see any guarantee that restricts the value function to be positive. For example, suppose the agent is in a state where there is no way to win if its opponent acts optimally. Shouldn't the value function of that state be negative in that case? – Maybe – 2020-05-08T11:48:05.380

yes, if a board state is close to the opponent winning, then its value will be negative. If using [0,1] valuation, the value will be under 0.5. So either way the message will be transmitted – nikos – 2020-05-08T18:06:00.960

If that's the case, shouldn't the game return 0 to the agent to indicate its loss? – Maybe – 2020-05-09T00:07:14.867

well at the final state, when the game is lost, the value function is 0. At previous states further back, the state value can be interpreted as the probability of winning from that point onwards – nikos – 2020-05-10T15:30:21.967

In Section 3 (MuZero Algorithm) of the paper, they say "Final outcomes $\{lose, draw, win\}$ in board games are treated as rewards $u_t \in \{-1, 0, +1\}$ occurring at the final step of the episode." If I understand it correctly, the value function should be $-1$ when the game is lost. Where did I misunderstand? – Maybe – 2020-05-12T08:05:52.787

I haven't read the particular paper that you refer to, but what I've been trying to tell you all along is that the actual range of rewards doesn't matter. If you want [-1,1], that's fine, and [0,1] is also fine, as well as [-1000,1000]. You just have to be consistent – nikos – 2020-05-12T11:04:54.590

I got that, but this range is not an arbitrary choice; it affects the value of pUCT, which is important to MCTS. – Maybe – 2020-05-13T01:00:54.740

@Maybe Yes, you're right that the range selected can be very important in terms of hyperparameters. But nikos is also correct that the selected range is a fairly arbitrary choice -- in a mathematical sense. Whether we train an agent to optimise values in $[0, 1]$ or in $[-1, 1]$ doesn't really matter mathematically; exactly the same states will be winning or losing, exactly the same policy will be optimal, etc. It's an implementation detail, which mathematically does not really matter. Implementation details can be important for empirical performance though, and indeed for hyperparameters! – Dennis Soemers – 2020-06-03T19:32:45.203

Hi @DennisSoemers. I know that this does not matter in a mathematical sense. But I'm afraid that I might be misunderstanding the idea, as I think the value function should be in $[-1, 1]$. Moreover, for Atari games, the Q value normalization also aims to restrict the Q value to this range ($[0, 1]$ in the paper). – Maybe – 2020-06-04T00:24:57.260

@Maybe Technically you're completely right that the true, formal definition of "zero sum" would imply that losses are $-1$, draws are $0$, and wins are $+1$. Or at least that it would have to be a range centered on $0$. But the authors may have found the $[0, 1]$ range more convenient to work with for other reasons. I know that range is quite common in the MCTS / bandit algorithms literature, and historically all kinds of theoretical analyses and proofs are based on that range. So many people also implemented it that way. And if they did implement it like that, it's important to report that in the paper – Dennis Soemers – 2020-06-04T08:28:55.187

Hi @DennisSoemers. Are you saying that MuZero actually uses rewards in $[0, 1]$ to indicate win/loss, even for Go and chess? – Maybe – 2020-06-06T08:50:41.083

@Maybe That's what they write in the paper, right? So yes. At least for the search. Maybe for learning they use a different range, not sure, would have to check. They also actually explain why they chose that $[0, 1]$ range in the sentence after the one you quoted; it ensures that their $Q(s, a)$ values are in the same range as the $P(s, a)$ values, and both of those are combined in the pUCT rule. I suppose that keeping them both in the same range can make hyperparameters slightly easier to tune (or interpret / understand). – Dennis Soemers – 2020-06-06T09:00:32.717
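
To make the pUCT point concrete, here is a minimal Python sketch. It assumes the selection rule from Appendix B of the MuZero paper, with the constants $c_1 = 1.25$ and $c_2 = 19652$ as I recall them from that appendix; the function names `puct_score` and `select`, and all the numbers in the example, are made up for illustration and are not taken from the paper or any implementation.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c1=1.25, c2=19652.0):
    """Q value plus a prior-weighted exploration bonus (assumed pUCT form)."""
    explore = prior * math.sqrt(parent_visits) / (1 + child_visits)
    explore *= c1 + math.log((parent_visits + c2 + 1) / c2)
    return q + explore

def select(q_values, priors, visits):
    """Return the index of the child with the highest pUCT score."""
    parent_visits = sum(visits)
    scores = [
        puct_score(q, p, parent_visits, n)
        for q, p, n in zip(q_values, priors, visits)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical node with two children.
priors = [0.1, 0.6]
visits = [5, 4]
q_signed = [0.3, -0.3]                    # Q values expressed in [-1, 1]
q_unit = [(q + 1) / 2 for q in q_signed]  # the same values mapped to [0, 1]

choice_signed = select(q_signed, priors, visits)
choice_unit = select(q_unit, priors, visits)
```

With these made-up numbers the search picks the first child when Q is in $[-1, 1]$ and the second child when Q is in $[0, 1]$. Rescaling halves the gap between the Q values while the exploration term stays the same, so the range effectively changes how much weight the prior gets.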

@DennisSoemers Nope, they said the rewards in board games are {-1, 0, +1} at the end of Section 3. But in the Appendix, they also said the value functions are assumed to be in the range of [0, 1]. That's why I'm confused – Maybe – 2020-06-06T11:50:57.747

@Maybe Ah right, I see. Yeah, so in Section 3 they describe their Neural Network Reinforcement Learning approach. For the purposes of this training algorithm, they use rewards of $-1$, $0$, and $1$ for losses, draws, and wins, respectively. In Appendix B they describe the tree search approach. That's a different component. And for the purposes of that part, they instead use (different) values in $[0, 1]$. – Dennis Soemers – 2020-06-06T12:13:42.297

@DennisSoemers Shouldn't the value estimate be in the same range as the reward in board games? I failed to see the difference. – Maybe – 2020-06-06T12:31:20.573

@Maybe It's about the same board games in both cases. It's just, one thing is the learning algorithm (learning values and policies in any board game), and the other thing is the tree search (again in all the same board games). They're mostly separate, and you can use different ranges of values in each of them. I suppose in the parts where these two separate components start "communicating" to each other, they'll make the translation between the two ranges. i.e. if the search uses a value in $[-1, 1]$ from the NN, it can just map that to the $[0, 1]$ range before making use of it. – Dennis Soemers – 2020-06-06T12:41:09.597
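
A small sketch of the translation described in the comment above, in Python. This only illustrates one possible linear mapping between the two ranges; the helper names `to_unit_interval` and `to_signed` are hypothetical, and nothing here is confirmed to be what MuZero actually does.

```python
def to_unit_interval(v):
    """Map a value from [-1, 1] (loss, draw, win = -1, 0, +1) to [0, 1]."""
    return (v + 1.0) / 2.0

def to_signed(u):
    """Inverse map from [0, 1] back to [-1, 1]."""
    return 2.0 * u - 1.0

# Terminal outcomes as seen by the learner and by the search.
outcomes_signed = {"lose": -1.0, "draw": 0.0, "win": 1.0}
outcomes_unit = {k: to_unit_interval(v) for k, v in outcomes_signed.items()}
```

Under this mapping a loss becomes 0, a draw 0.5, and a win 1. The map is increasing, so the ordering of states by value is unchanged.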

Let us continue this discussion in chat. – Maybe – 2020-06-06T13:39:42.733