## Is there any flaw in this solution to the one-shot prisoner's dilemma?

I wrote a solution to the one shot Prisoner's dilemma:

# Introduction

My solution applies to a prisoner's dilemma involving two people (I have neither sufficient knowledge of the prisoner's dilemma itself, nor sufficient mathematical aptitude, to generalise my solution to prisoner's dilemmas where the number of agents n > 2).

Let the two agents involved be A and B, and assume A and B are maximally selfish (they care solely about maximising their own payoff). If A and B satisfy the following three requirements, then whenever A and B are in a prisoner's dilemma together, they will choose to cooperate:

1. A and B are perfectly rational.
2. A and B are sufficiently intelligent that they can simulate each other (the simulations don't have to be perfect; they only need to resemble the real agent closely enough to predict the real agent's choice).
3. A and B are aware of the above two points.

# Solution

A and B have the same preferences:
(A,B): (D,C) > (C,C) > (D,D) > (C,D).
(B,A): (D,C) > (C,C) > (D,D) > (C,D).
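
As a sanity check, the ordering can be written down as a small sketch (Python; the `RANK` encoding and `prefers` helper are my own illustrative names, with lower rank meaning more preferred):

```python
# Each outcome is (my move, opponent's move): "D" = defect, "C" = cooperate.
# Lower rank = more preferred, matching (D,C) > (C,C) > (D,D) > (C,D).
RANK = {
    ("D", "C"): 1,
    ("C", "C"): 2,
    ("D", "D"): 3,
    ("C", "D"): 4,
}

def prefers(outcome_a, outcome_b):
    """True if an agent prefers outcome_a to outcome_b."""
    return RANK[outcome_a] < RANK[outcome_b]
```

Since the ordering is the same for both agents, the same table serves for A and B, each reading the pair from their own side.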

My solution relies on A and B predicting each other's behaviour. They use a simulation of the other which guarantees high fidelity predictions.

If A adopts a defect-invariant strategy (always defect), i.e. commits to defection, then B will simulate this and, being rational, B will defect. Vice versa.

If A adopts a cooperate-invariant strategy, then B will simulate this and, being rational, B will defect. Vice versa.

To commit to a choice means to adopt that choice irrespective of all other information; a committed agent does not take any other information into account when deciding.

Suppose A commits to a choice, and let that choice be k.

Then B's strategy would be:

## Diagram to Illustrate

h(x) is a function that takes B's prediction of A's choice as input and outputs the rational response to that choice. As B prefers (D,C) over (C,C) and (D,D) over (C,D), B will choose to defect whatever choice it predicts for A. Thus, if A adopts an invariant strategy, B will adopt the defect-invariant strategy (insofar as B accurately predicts A). Being rational, and preferring (D,C) and (C,C) over (D,D), A would not adopt an invariant strategy. Vice versa.

If both of them were to simultaneously adopt invariant strategies, they would adopt the defect-invariant strategy. However, as they both prefer (D,C) and (C,C) (in that order) over (D,D), they would both strive for a better outcome than (D,D) unless no better outcome is possible.

This means that A's and B's choices depend on what they predict the other would do.

If A predicts B will defect, A can cooperate or defect.
If A predicts B will cooperate, A can cooperate or defect.
Vice versa.

As A is not committing, A's strategy is either predict(B) (pick what A predicts B will pick) or !predict(B) (pick the opposite of what A predicts B will pick, i.e. cooperate if A predicts B will defect, and defect if A predicts B will cooperate).

Vice versa.

To not commit is to decide to base your strategy on your prediction of the choice the opponent adopts. You can either choose the same choice as what you predicted, or choose the opposite of what you predicted (any other choice is adopting an invariant strategy).

If A adopts !predict(B), A obtains an outcome ranked 1 or 4 in its preferences.

If A adopts predict(B), A obtains an outcome ranked 2 or 3 in its preferences. Vice versa.

We can have:

1. predict(B) and predict(A)
2. predict(B) and !predict(A)
3. !predict(B) and predict(A)
4. !predict(B) and !predict(A).

Now A's decision is dependent on predict(B).
But B's decision (and thus predict(B)) is dependent on predict(A) (and thus A's decision): A = f(predict(B) = g(B = f(predict(A) = g(A)))).

f(x) is a function that deterministically returns either x or !x (according to the strategy adopted by the agent implementing it).

g(x) is a function that stochastically returns x or !x; it returns x with probability p. Given that the agents are simulating each other, we can safely assume that p is high (sufficiently close to 1).

Vice versa.

The above assignment is circular and self-referential. If A and/or B tried to simulate it, it would lead to a non-terminating recursion of simulations.
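
A sketch of why the mutual simulation never terminates; the `max_depth` cap is a hypothetical stand-in so the non-termination can be observed rather than hanging:

```python
def simulate(agent, depth=0, max_depth=100):
    """Naive mutual simulation of A and B deciding via each other's choice.

    Each call needs the other agent's choice before it can return, so
    there is no base case; the depth cap only exists so the sketch can
    report the non-termination instead of looping forever.
    """
    if depth > max_depth:
        raise RecursionError("mutual simulation never bottoms out")
    other = "B" if agent == "A" else "A"
    # A's choice depends on predict(B), which depends on predict(A), ...
    return simulate(other, depth + 1, max_depth)
```

Calling `simulate("A")` always hits the cap, mirroring the circular assignment above.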

A's strategy would be:

## Diagram to Illustrate

As such, A and B cannot both decide to base their decision on the decision of the other.

Yet, neither A nor B can decide to commit to an option.

What they can do is predispose themselves to an option. To predispose themselves is to decide on a choice independently of the other agent: "Assuming I received no further information, what would I do?" A predisposition is not a final choice, and should not be mistaken for committing to a choice of action. An agent who predisposes themselves can change their choice based on how they predict the other agent would react to that predisposition.

Suppose A predisposes themselves, and let the predisposition be q. The assignment becomes:

A = f(predict(B) = g(B = f(predict(A) = g(q)))).

## Diagram to Illustrate

Only one of them needs to predispose themselves.

Assuming A predisposes themselves.

If A predisposes themselves to defection, then the only two possible outcomes are (D,C) and (D,D), ranked 1 and 3 in A's preferences and 4 and 3 in B's.

Upon simulating this, B, being rational, would choose to defect (resulting in outcome 3). (D,D) is a Nash equilibrium, and as such, once they reach it, being rational, neither A nor B would change their strategy. (Note that this outlaws the !predict(A) strategy.)

If A predisposes themselves to cooperation, then the two possible outcomes are (C,C) and (C,D), ranked 2 and 4 in A's preferences and 2 and 1 in B's.

Upon simulating this, if B chooses to defect, then B is adopting a defect-invariant strategy (which has been outlawed), and A will update and choose to defect, resulting in outcome 3. As B is rational, B will choose the option that leads to outcome 2, and so B will decide to cooperate.

If B chooses to defect when A predisposes themselves to cooperation, and A simulates this, then A will update and defect, resulting in (D,D). If B chooses to cooperate when A predisposes themselves to cooperation, and A updates and chooses to defect, then B would update and choose to defect, resulting in (D,D). Thus, once they reach (C,C), they are at a reflective equilibrium (in the sense that if one defects, the other would also defect, so neither of them can increase their payoff by changing strategy).

(Thus, B will adopt a predict(A) strategy).

Vice Versa.

Because A is rational, and predisposing to cooperation dominates predisposing to defection (the outlawed outcomes are assumed not to manifest), if A predisposes themself, A will predispose themself to cooperation. Vice versa.

Thus if one agent predisposes themself, it will be to cooperation, and the resulting outcome will be (C, C), which is ranked second in their preferences.

What if A and B both predispose themselves? We can have:
1. C & C
2. C & D
3. D & C
4. D & D.

If C & C occurs, the duo will naturally cooperate, resulting in (C, C). Remember that we showed above that the strategy adopted is predict(B) (defecting from (C, C) results in (D, D)).

If C & D occurs, then A being rational will update on B's predisposition to defection and choose defect resulting in (D, D).

If D & C occurs, then B being rational will update on A's predisposition to defection and choose defect resulting in (D, D).

If D & D occurs, the duo will naturally defect resulting in (D, D).

Thus, seeing as only a predisposition to cooperation yields the best result, at least one of the duo will predispose to cooperation (and the other will either predispose themselves to cooperation or not predispose at all), and the resulting outcome is (C, C).
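
The four cases can be enumerated directly (a sketch; the `outcome` helper is my own name for the updating rule argued above, namely that any predisposition to defect is matched):

```python
from itertools import product

def outcome(qa, qb):
    """Final outcome once each agent updates on the other's predisposition:
    any predisposition to defect is matched, so only C & C ends at (C, C)."""
    return ("C", "C") if (qa, qb) == ("C", "C") else ("D", "D")

# Enumerate all four predisposition pairs.
results = {(qa, qb): outcome(qa, qb) for qa, qb in product("CD", repeat=2)}
```

Only the C & C case escapes (D, D), which is why rational agents predispose to cooperation.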

If the two agents can predict each other with sufficient fidelity (explicit simulation is not necessary, only high fidelity predictions are) and are rational, and know of those two facts, then when they engage in the prisoner's dilemma, the outcome is (C, C).

Therefore, a cooperate-cooperate equilibrium can be achieved in a single-instance prisoner's dilemma involving two rational agents, given that they can predict each other with sufficient fidelity and know of their rationality and intelligence.

Q.E.D.

Thus if two super intelligences faced off against each other in the prisoner's dilemma, they would reach a cooperate-cooperate equilibrium. This solution also applies to two rational robots who have mutual access to the other's source code.

# Prisoner's Dilemma With Human Players

In the above section I outlined a strategy to resolve the prisoner's dilemma for two superintelligent AIs or rational bots with mutual access to the other's source code. The strategy is also applicable to humans who know each other well enough to simulate how the other would act in a given scenario. In this section I try to devise a strategy applicable to human players.

Consider two perfectly rational human agents A and B. A and B are maximally selfish and care only about maximising their payoff.

Let:
(D,C) = W
(C,C) = X
(D,D) = Y
(C,D) = Z
The preference is W > X > Y > Z.

A and B have the same preference.

The 3 conditions necessary for the resolution of the prisoner's dilemma in the case of human players are:
1. A and B are perfectly rational.
2. They each know the other's preference.
3. They are aware of the above two facts.

The resolution of the problem in the case of superintelligent AIs relied on their ability to simulate (generate high fidelity predictions of) each other. If the above 3 conditions are met, then A and B can both predict the other with high fidelity.

Consider the problem from A's point of view. B is as rational as A, and A knows B's preference. Thus, to simulate B, A merely needs to simulate themselves with B's preferences. Since A and B are perfectly rational, whatever conclusion A with B's preferences (A*) reaches is the same conclusion B reaches. Thus A* is a high fidelity prediction of B.
Vice versa.

A engages in a prisoner's dilemma with A*. However, as A* has the same preferences as A, A is essentially engaging in a prisoner's dilemma with A.
Vice Versa.

An invariant strategy is outlawed by the same logic as in the AI section.
A = f(predict(A*) = g(A* = f(predict(A) = g(A))))
Vice Versa.

The above assignment is self-referential, and if it were run as a simulation, there would be an infinite recursion.

## Diagram to Illustrate

Thus, either A or A* needs to predispose themselves.

If the predisposition A makes is q, then the assignment becomes:

A = f(predict(A*) = g(A* = f(predict(A) = g(q))))

## Diagram to Illustrate

However, as A* is A, whatever predisposition A makes is the same predisposition A* makes, so both A and A* will predispose themselves. At least one of them must predispose themselves, and the strategy that maximises the probability of this is for each of them individually to decide to predispose themselves. Thus both of them predispose themselves, and as A* = A, A's predisposition is the same as A*'s. Vice versa.

We have either: 1. (C,C) 2. (D,D)

If A predisposes themselves to defection, then we have (D,D). (D,D) is a Nash equilibrium, as A and/or A* can only do worse by unilaterally changing strategy at (D,D). As A and A* are rational and both prefer (C,C) to (D,D), they would instead predispose themselves to cooperation.

If A and A* predispose themselves to cooperation, and A and/or A* tried to maximise their payoff by defecting, then the other (as they predict each other) would also defect to maximise their payoff. Defecting at (C,C) leads to (D,D). Thus, neither A nor A* would decide to defect at (C,C); (C,C) forms a reflective equilibrium. Vice versa.

As B's reasoning process closely reflects A*'s (both being perfect rationalists, and having the same preferences), the two agents would naturally converge at (C,C).

Q.E.D.

I think I'll call this process recursive decision theory (RDT): in multi-agent decision problems involving at least two agents who are perfectly rational, know each other's preferences, and are aware of those two facts, each such agent bases its decision on modelling the other agents as simulations of itself with their preferences. I think convergence on RDT is natural for any two sufficiently rational agents (they may not need to be perfectly rational, provided they are equally rational, and rational enough that they try to predict how the other agent(s) would act) who know each other's preferences and are aware of those two facts.

If a single one of those criteria is missing, then RDT is not applicable.

If, for example, the two agents are not equally rational, then the more rational agent would choose to defect, as defection strongly dominates cooperation.

Or, if they did not know the other's preferences, they would be unable to predict the other's actions by putting themselves in the other's place.

Or, if they were not aware of those two facts, then they'd both reach the choice to defect, and we would once again be stuck at a (D,D) equilibrium. I'll formalise RDT after learning more about decision theory and game theory (so probably sometime this year (p < 0.2), or next year (0.2 <= p <= 0.8) if my priorities don't change). I'm curious what RDT means for social choice theory, though.

Always nice to get a Dilemma question. Welcome to AI! – DukeZhou – 2017-08-30T15:06:04.200

If they have access to my source and see a random number generator, how would they react? How would things change if the RNG was uniform, Gaussian, etc? Tell us more about maximizing. Are they maximizing expectation or maximizing magnitude (like a lottery play), maximizing wins? – Bob Salita – 2017-08-30T13:53:36.287

@BobSalita They are maximising their payoff. – Tobi Alafin – 2017-08-30T15:35:05.937

So are these maximax agents? (Didn't see any mentions of minimizing downside in a worst-case scenario. I tend to think of maximax as the "optimistic" strategy, which is exceptionally risky in one-shot, and not generally considered rational. Hofstadter used a new term, superrationality, to describe this strategy. It's easy to demonstrate how the strictly rational strategy becomes "irrational" in iterated Dilemma with a superrational participant, but it's problematic in one-shot...)

– DukeZhou – 2017-09-01T20:10:20.730

I think of it more as trying to maximise their actual payoff. They want to "win" so to speak. I use that mindset when I think of the rational decision.

– Tobi Alafin – 2017-09-02T13:57:05.217

This question is re-inventing the analysis of the iterated prisoner's dilemma, and the co-evolution that can lead to agents playing super-rationally in the one-shot version, which has been studied really extensively.

Dan Ashlock's research career looked at this in great detail from an evolutionary perspective, but it's also been widely studied in other areas of AI. The strategy you describe for superintelligent agents is called "Tit-for-Tat", and it is well known to emerge when people play several rounds of the game. It actually mimics how people play the game in experimental settings. It emerges evolutionarily in simulations because any two agents that both implement it will get more reward than selfish agents. Other, more complex strategies can also appear. For example, the fortress family of strategies consists of playing a series of question/response opening moves in the early rounds (unlocking the fortress), and then cooperating for the rest of the game if the other player knows the "password".

Hope this helps!

Nice, succinct answer to an interesting but difficult-to-field question! Useful links. You may also be interested in: https://math.stackexchange.com/q/1629967/362640

– DukeZhou – 2018-08-08T20:43:02.067

Good point! In fact, I've misidentified the strategy in the question, which is closer to Grim Trigger than Tit-for-Tat. IIRC Grim Trigger emerges evolutionarily before Tit-for-Tat, and is a sort of useful prototype for it. Ashlock's research has a lot more on these kinds of dynamics though. – John Doucette – 2018-08-08T21:06:02.197