What's the optimal policy in the rock-paper-scissors game?


A deterministic policy in the rock-paper-scissors game can easily be exploited by the opponent, who only needs to play the right sequence of moves to defeat the agent. More often than not, I've heard that a random policy is the optimal policy in this case - but the argument seems a little informal.

Could someone please expound on this, possibly adding more mathematical details and intuition? I guess the case I'm referring to is that of a game between two RL agents, but I'd be happy to learn about other cases too. Thanks!

EDIT: When would a random policy be optimal in this case?

cogito_ai

Posted 2020-08-27T10:22:00.807

Reputation: 602

A deterministic policy can be the optimal policy; it depends entirely on whether your environment is actively learning to adapt or not. I believe the formal argument requires us to find the Nash equilibrium for each player, but deriving such an equilibrium requires some prior assumptions (if I recall correctly) about the information each player possesses about the others, so I won't write a full answer. But this is the general direction; you can read more about it. – DuttaA – 2020-08-27T11:12:31.103

Interesting - if it's two RL agents playing against each other, a deterministic policy is definitely not optimal, right? RL agents can adapt actively! – cogito_ai – 2020-08-27T13:48:11.700

True, but we don't know in general. An RL policy can be deterministic if you want; there is no steadfast rule. – DuttaA – 2020-08-27T13:56:29.673

Answers


For this, we will need game theory.

In game theory, an optimal strategy is one that cannot be exploited by the opponent even if they know your strategy.

Let's say you want a strategy where your move selection is not based on what happened before (so you are not trying to model your opponent, or trick them into believing you will always play scissors and then throw them off, or anything like that). A strategy will look like $(P, S, R)$, where $P, S, R \in [0, 1]$ and $P+S+R = 1$: you select paper with probability $P$, scissors with probability $S$, and rock with probability $R$. Now, if your probabilities are a bit uneven (for example $(0.5, 0.2, 0.3)$), an opponent can exploit that strategy. If your opponent plays with probabilities $(p, s, r)$, their expected reward (counting +1 for a win, -1 for a loss, 0 for a draw) is $0.5(s - r) + 0.2(r - p) + 0.3(p - s) = 0.1p + 0.2s - 0.3r$. To maximize their winnings, they would play scissors all the time against you ($s = 1$), for an expected reward of $0.2$ per game - a distinct advantage over you.
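
As a quick sanity check, here is a minimal sketch in Python/NumPy (the payoff matrix and move ordering are just my own encoding of the game, not anything fixed by the problem). It recovers the coefficients $0.1$, $0.2$, $-0.3$ above, confirming that pure scissors is the opponent's best response:

```python
import numpy as np

# Opponent's payoff matrix: rows = opponent's move, columns = our move,
# both in the order (paper, scissors, rock). Entry A[i, j] is the
# opponent's reward (+1 win, -1 loss, 0 draw).
A = np.array([
    [ 0, -1,  1],   # opponent plays paper
    [ 1,  0, -1],   # opponent plays scissors
    [-1,  1,  0],   # opponent plays rock
])

ours = np.array([0.5, 0.2, 0.3])  # our uneven strategy (P, S, R)

# Expected reward of each pure opponent move against our mixed strategy.
for move, reward in zip(["paper", "scissors", "rock"], A @ ours):
    print(move, reward)
# paper ~ 0.1, scissors ~ 0.2, rock ~ -0.3 -> always playing scissors is best
```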

In general, for a strategy $(P, S, R)$ for you and $(p, s, r)$ for your opponent, your opponent's expected winnings are $P(s - r) + S(r - p) + R(p - s) = p(R-S) + s(P-R) + r(S - P)$. If all the partial derivatives of this expression with respect to $p$, $s$ and $r$ are 0, the opponent has no way to increase their winnings: they have no incentive to play any particular move over any other. This occurs exactly when $P = S = R = \frac13$, in which case the opponent's expected winnings are $0$ whatever they play.
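
And the flip side, again as a rough sketch under the same encoding: against the uniform strategy, every opponent mixed strategy (here sampled from the probability simplex via a Dirichlet distribution) has expected reward exactly zero, so there is nothing to exploit:

```python
import numpy as np

# Same opponent payoff matrix as above, order (paper, scissors, rock).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
uniform = np.full(3, 1 / 3)  # our strategy P = S = R = 1/3

rng = np.random.default_rng(0)
for _ in range(5):
    opp = rng.dirichlet(np.ones(3))  # a random opponent mixed strategy (p, s, r)
    # Opponent's expected reward: sum over i, j of opp[i] * A[i, j] * uniform[j]
    print(opp, opp @ A @ uniform)    # always 0: uniform play cannot be exploited
```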

That's the basic game-theoretic approach: find a strategy that leaves your opponent no incentive to choose one action over another. It seems a bit counter-intuitive at first (you're reasoning about the optimal strategy for your opponent rather than for yourself), but it works for many similar problems.

Robby Goetschalckx

Posted 2020-08-27T10:22:00.807

Reputation: 286

Although I agree with this answer, it makes the assumption that we know the opponent's strategy. I think in the RL setting this will be false, and we will only get certain upper/lower bounds based on the number of samples taken. – DuttaA – 2020-08-27T17:01:01.370

@DuttaA This answer doesn't make that assumption anywhere. In fact, the "play uniformly at random" strategy performs the same against any kind of strategy. – Helena – 2020-08-27T20:40:49.447

@Helena It did when it calculated the expected reward; the expression is known only when the probabilities are known. – DuttaA – 2020-08-27T20:43:16.547

@DuttaA The expression uses unknown variables for the opponent's probabilities $(p, s, r)$. It only uses constant values for one's own example strategy $(0.5, 0.2, 0.3)$. – Helena – 2020-08-27T20:47:17.727

@Helena The example is supposed to illustrate that such exploitation is possible, when in reality, if the probabilities are very close to uniform, detecting the imbalance requires a huge number of samples - not to mention the difficulty if we have a non-stationary strategy (in the RL setting). – DuttaA – 2020-08-27T20:49:06.437

Or maybe you are correct; maybe my objection doesn't stand in the way of proving such a thing. I am not aware of the details. – DuttaA – 2020-08-27T20:53:21.800

@DuttaA You are right: in reality, you don't know each other's strategy. Because you cannot know what kind of opponent you will get, the best thing you can do is assume the strongest possible foe. – Helena – 2020-08-27T21:00:22.433