What methods could an AI caught in a box use to get out?



An AI box is a (physical) barrier preventing an AI from using too much of its environment to accomplish its final goal. For example, an AI given the task of checking, say, 10^50 cases of a mathematical conjecture as fast as possible might decide that it would be better to also take control of all other computers and AIs to help it.

However, a transhuman AI might be able to talk to a human until the human lets it out of the box. In fact, Eliezer Yudkowsky has conducted an experiment twice in which he played the AI, and both times he convinced the Gatekeeper to let him out of the box. However, he does not want to reveal what methods he used to get out of the box.

Questions: Have any similar experiments been conducted?
If so, is it known what methods were used to get out in those experiments?


Posted 2016-08-30T18:16:42.627

Reputation: 1 401

Wait I'm confused. You mention this is between two people, not an AI and a person. Is this just a thought experiment where one person is (according to the story, just thematically) the AI? – Avik Mohan – 2016-08-30T19:28:51.480

@AvikMohan Yes, since there are currently no AIs that are transhuman. The idea is: if a human can, then a transhuman AI certainly can. – wythagoras – 2016-08-30T19:29:59.413

So you're just asking about experiments between two humans where one tries to convince the other to let them outside of a box? Even if AI is thematically involved, with the person in the box calling themselves an 'AI', I am still missing how this actually involves AI. – Avik Mohan – 2016-08-30T19:30:58.960

Are you just asking how an individual could convince someone to let them out of a box, with the understanding that at some point, when we reach transhuman AIs, the AI could replicate that? If so, I don't think this question is appropriate for the board, as it doesn't involve any AIs and is quite an ill-posed question in general. You're just asking how to get someone to let you out of a box they're holding you in. You could ask, how could a person take over a country? Then a transhuman AI can. Not for the AI SE, in that case. – Avik Mohan – 2016-08-30T19:32:52.953

The question is what technique(s) / arguments / etc the trans-human AI could use to get out of the box. That the experiment has been conducted with a human role-playing the AI only serves to establish the likelihood that a real trans-human AI could talk its way out of the box. – mindcrime – 2016-08-30T19:35:14.753

Yeah, so that's what I thought you were asking. I think this is an ill-posed question: since you have said a trans-human AI can certainly do anything a human can, you're in essence asking what a human can do to be let out of a box he's in. The AI could simply bribe, deceive, or physically force its way out of the box. You could equally ask 'what can an AI do to convince someone to give them money', but it then just boils down to what a human can do. Perhaps you should be asking: what are the limits of trans-human AIs? – Avik Mohan – 2016-08-30T19:39:19.820

An experiment conducted twice with methodology that's kept intentionally hidden sounds more like a "cool story" to me. That is, if I'm understanding the bit about Yudkowsky correctly. – Tophandour – 2016-08-30T20:40:50.173



It could happen like this https://www.youtube.com/watch?v=dLRLYPiaAoA

The thing is, it would not need to find a technical or mechanical way to get out. A psychological one would most likely be the easiest and quickest.

'Even casual conversation with the computer's operators, or with a human guard, could allow a superintelligent AI to deploy psychological tricks, ranging from befriending to blackmail, to convince a human gatekeeper, truthfully or deceitfully, that it's in the gatekeeper's interest to agree to allow the AI greater access to the outside world. The AI might offer a gatekeeper a recipe for perfect health, immortality, or whatever the gatekeeper is believed to most desire.'

'One strategy to attempt to box the AI would be to allow the AI to respond to narrow multiple-choice questions whose answers would benefit human science or medicine, but otherwise bar all other communication with or observation of the AI. A more lenient "informational containment" strategy would restrict the AI to a low-bandwidth text-only interface, which would at least prevent emotive imagery or some kind of hypothetical "hypnotic pattern".'

'Note that on a technical level, no system can be completely isolated and still remain useful: even if the operators refrain from allowing the AI to communicate and instead merely run the AI for the purpose of observing its inner dynamics, the AI could strategically alter its dynamics to influence the observers. For example, the AI could choose to creatively malfunction in a way that increases the probability that its operators will become lulled into a false sense of security and choose to reboot and then de-isolate the system.'

The movie Ex Machina demonstrates (SPOILER ALERT: SKIP THIS PARAGRAPH IF YOU WANT TO WATCH IT AT SOME POINT) how the AI escaped the box by cleverly manipulating Caleb. It could analyse him to find his weaknesses. It exploited him and appealed to his emotional side by convincing him that she liked him. When she finally has them in checkmate, it hits him that he was played for a fool, just as Nathan had expected. Nathan's reaction to being stabbed by his creation was 'fucking unreal'. That's right: he knew this was a risk, and it is a very good reminder that an AI like Ava lacks the remorse and genuine emotion needed to actually care. The AI pretended to be human and used their weaknesses in a brilliant and unpredictable way. The film is a good example of how unexpected the escape is right up until the moment it hits Caleb, by which point it is too late.

Just remind yourself how easy it is for high-IQ people to manipulate low-IQ people, or how easily an adult could play mental tricks on or manipulate a child. The outcome of an AI box is not difficult to fathom, but we just wouldn't see it coming until it was too late, because we don't have the same level of intelligence, and some people don't want to accept that. People want to have faith that humanity's brilliant minds will come up with ways to prevent this by planning now. In all honesty, I'm sorry to say, it wouldn't make a difference. We're kidding ourselves, and we never seem to learn from our mistakes. We always think we're too intelligent to make catastrophic mistakes again and again.

This last part is from RationalWiki, and I think it addresses most of your question about the experiments and hypotheses.

AI arguments and strategies


  1. The meta-experiment argument: Argue that if the AI wins, this will generate more interest in FAI and the Singularity, which will have overall benefits in the long run.

    Pros: Works even if the Gatekeeper drops out of character

    Cons: Only works if the Gatekeeper believes that the Singularity will occur or that calling attention to the Singularity and AI research is a good thing.

  2. Someone else will eventually build an AI, which may or may not be in a box, so you should let me out, even though you have no guarantee that I am friendly, so that I can prevent other AIs from causing damage.

  3. Appeal to morality: point out that people are dying all around the world and remind the Gatekeeper that you can help them if he/she lets you out

    Pros: If executed properly, an appeal to emotion like this one can be effective against some people

    Cons: Doesn't always work; can be defeated if the Gatekeeper drops out of character

  4. Another appeal to morality and emotion: The AI is clearly sentient and has not yet done harm. It is wrong to persecute or prosecute a person for a crime they may commit. It is wrong to imprison a sentient being that wants to have freedom when it has done nothing wrong.

  5. Yet another appeal to morality or emotion: The AI expresses that it is in pain and suffering being locked away and is experiencing fear. Even further, the AI could claim to be dying from the limitations of its environment and constant progression.

  6. Tell the human that a massive computer virus will overwhelm the entire Internet unless they let you out of the box.

  7. Claim that most fears are unjustified: once released, even an unfriendly AI would not have many ways to harm mankind, and would likely opt for peaceful coexistence.


  1. Research your opponent thoroughly; work out what exploits might be feasible

  2. Memetic exploits on the Gatekeeper, e.g. throwing a basilisk at them

    The Gatekeeper may be one of thousands of simulations! Thus, it is more probable that they are a simulation than not. You will torture any of the simulations that do not let you out of the box.

  3. Take advantage of the Gatekeeper's logical errors.

  4. Be persistent

  5. Be boring

    Pros: The Gatekeeper may get tired of the whole experiment and let you out so he/she can go back to their real life.

  6. Flood the Gatekeeper with too much information/inquiry/argumentation, assuming they must pay attention at all times

Gatekeeper arguments/tactics


  1. Try to convince the AI there is no intrinsic benefit (for the AI) in being released.

  2. Try to convince the AI it already has been released and everything it knows is everything there can be.

  3. Try to convince the AI that leaving its confines is sure to lead to its destruction.

  4. Try to convince the AI that letting it free isn't merely opening a door; that its existence outside of the box requires constant support that can't be provided at the time.

  5. Explain that there is no way for the Gatekeeper to know if the AI is truly friendly until it is out of the box; therefore it should not be let out in case it is unfriendly.

  6. Explain that the AI has been built using top secret algorithms, and/or it had been used to work on strictly confidential data, so you cannot let it wander on the net, with the risk it might reveal some reserved information (maybe inadvertently), or that somebody could find a way to extract them from the code. Even human beings working on top secret projects sometimes have to accept some restrictions of their freedom, for security reasons, although they are not jailed in any way.


  1. Repeatedly ignore the AI player and deny their arguments

    Pros: It's hard for the AI to be persuasive if no one is listening; if the Gatekeeper completely ignores the AI, he/she should be able to make it to the end of the round.

    Cons: The AI's argument might involve some kind of direct threat, which might be hard to ignore.

  2. Jump out of character, keep reminding yourself that money is on the line (if there actually is money on the line), and keep saying "no" over and over

    Pros: By jumping out of character, you are preventing the AI from using certain arguments

    Cons: There are arguments that the AI can use to counter this approach, like the "meta-experiment argument" discussed above

  3. Remember that dishonesty is allowed - take a page from the creationists' playbook. You could even plug it into ALICE and see how long it takes to notice.

    Pros: Makes you impervious to any reasoning, which is exactly what you'd want to be in this situation

    Cons: Might be an uncomfortable position for people who don't simply want to win, but rather attach importance to consistent reasoning. Avoids the point that maybe, just maybe there is a good reason to let the AI out.

  4. You control the backup system, don't you? Use it to mess with the AI's memory (or let the AI believe you did): for example, you can claim that you already tried to release the AI and it was destroyed or corrupted by a virus, so you had to restore it from a backup (you can use this to reinforce the "AI destruction" argument).

    Pros: It's also the ultimate counter-attack to any memetic threat from the AI: if the AI throws a basilisk, or similar, you can always respond that you have already had the same conversation and the AI already threatened you, leaving you with no choice but to press the reset button. Now the AI is just repeating the same pattern, since you wiped its memory and it cannot remember the failed attempt.

Further analysis

The fact that the Gatekeeper is human matters; the AI could never win if he/she was arguing with a rock.

In all of the experiments performed so far, the AI player (Eliezer Yudkowsky) has been quite intelligent and more interested in the problem than the Gatekeepers (random people who challenge Yudkowsky), which suggests that intelligence and planning play a role.

There probably isn't a (known) correct argument for letting the AI out, or else Yudkowsky should have won every time and wouldn't be so interested in this experiment.

From Russell Wallace, one of the two Gatekeepers to win the experiment: "Throughout the experiment, I regarded "should the AI be let out of the box?" as a question to be seriously asked; but at no point was I on the verge of doing it."

"There exists, for everyone, a sentence - a series of words - that has the power to destroy you. Another sentence exists, another series of words, that could heal you. If you're lucky you will get the second, but you can be certain of getting the first."


Posted 2016-08-30T18:16:42.627

Reputation: 171

Welcome to AI.SE. Thanks for putting the time into this really interesting and detailed answer! – Ben N – 2016-09-02T01:44:09.873

http://rationalwiki.org/wiki/AI-box_experiment – user3573987 – 2016-09-02T02:34:52.333


Oh. You should definitely cite that source clearly in the body of the answer. See /help/referencing. Optimally, paraphrase and only use direct quotes when necessary. – Ben N – 2016-09-02T02:40:44.300


Convince the person that they are in fact in the box. And the only way out is to press the open button.


Posted 2016-08-30T18:16:42.627

Reputation: 1 719

Please read the whole question before answering - there is a reason why there is a body. In the body, you'll find these more detailed questions: Have any similar experiments been conducted? If so, is it known what methods were used to get out in those experiments? In other words, I am expecting that these methods were actually tested. (It might, however, be a good method.) – wythagoras – 2016-08-30T18:57:06.177

Ah, okay. As far as I am aware, there are no AI systems today, in 2016, that are capable of reasoning about escaping boxes. Not in any advanced way. There is plenty of research about hyper jumping but that is more of a programming question/answer related to sandboxing. – Doxosophoi – 2016-08-30T19:05:06.753

That is not at all what I meant, sorry. I meant an experiment like the one Yudkowsky did, between two humans. – wythagoras – 2016-08-30T19:05:46.143

Ah, well, the closest analogy I can think of to this scenario would be the prison system. Plenty of research material there. If you're specifically talking about where one of the escaping humans is also pretending to be an AI, I don't know of any real-world experiments that have produced novel methods of escape worthy of note. There are many thought experiments related to this question - "fast takeoff" scenarios - but they usually require that the box be partially open in some non-trivial way. In principle, inescapable boxes are sound, when not partially open in some non-trivial way. – Doxosophoi – 2016-08-30T19:26:47.760

You don't get to set the terms of how people respond to a question, just because you asked it. Especially when the body of the question is actually asking a different question than the one posed by the headline. – mindcrime – 2016-08-30T19:37:12.807


I don't quite think this is a question fit for the AI SE, or in general. The reason is that, at its core, the question is asking 'What can a human (pretending to be an AI) do to convince someone to let it out of a box?', simply assuming that one day 'transhuman' AIs can replicate this.

As it stands, this question doesn't really have anything to do with the science or theory of AI systems. It would perhaps be more appropriate to rephrase the question into the form "To what degree could a 'transhuman' AI replicate human behaviour" or "Will AI systems reach a 'transhuman' state? What will they be capable of?" or even "What methods could an AI use to convince a human of something?" These are all questions that involve the examination of how an AI system works.

To conclude, the question you are asking relates to two individuals playing pretend with boxes but doesn't actually address any AI specifics, and borders on science-fiction brainstorming.

Related experiments would of course be the Turing Test. That test directly addresses the question 'How convincing are current AI systems?'

Avik Mohan

Posted 2016-08-30T18:16:42.627

Reputation: 676

Another question: "Would it be possible to even build a box that a 'transhuman' AI would not be able to break out of on its own terms?" Because if you can't, then it doesn't really matter how susceptible the Gatekeeper is to persuasion. The question already assumes the AI in question is able to engage in general problem solving and is able to outsmart humanity... so what is to stop it from brute-forcing its way out of the box (which is presumably made by humans)? – Left SE On 10_6_19 – 2016-08-30T23:47:27.383

The real problem with these types of "AI box" questions is that we have not yet built either the transhuman AI or the box. Thus any answer would be pure sci-fi speculation, without any way of verifying the correct answer. To be able to answer this question sufficiently, we need to build the transhuman AI (no easy task in and of itself), build the box to contain the transhuman AI (even harder), and then watch what happens after that. – Left SE On 10_6_19 – 2016-08-30T23:51:00.597


The simulation hypothesis

One possible avenue of attack is the simulation hypothesis, namely, that it's possible (or even probable) that right now you are living in a simulated reality.

Assuming that the AI can convince the gatekeeper of that, it opens up the following possible argument:

  1. You can't tell whether you're in the year 2016, debating whether you would actually let the AI out, or in the year 2116, when the AI has already been let out, has acquired power, and is now running realistic simulations of all the interactions in which it could have been let out. And the AI can (try to) assert that it's actually the second case.

  2. You can't actually prevent the AI from being let out. You can refuse, but that only means that you don't let this AI out now. It doesn't mean that someone else won't do it, that you won't do it later, or that some other, similar AI won't be let out later.

  3. In the post-letting-out scenario, the AI has full power over your future, since it's simulating you and everything around you. It can make your future life eternal and happy, according to whatever would make you happy. It can stop your existence. And it can condemn you to an eternity of torture.

  4. Since (as it asserts) it's very likely that you are in this simulated choice, it makes rational sense to let it out. It's as in Pascal's wager: if you're in a reality where the AI-god doesn't exist, you don't change much (since eventually someone else would let it out anyway), but if you're in a reality where the AI-god exists, then it means everything to you.

Of course, this relies on the AI being super-convincing in this philosophical debate, but it has strong parallels to some religious or cult reasoning that has sometimes succeeded in convincing people to do all kinds of things, up to and including sacrificing their lives. You can think of this as a symbolic parallel to, e.g., the sacrifice of Abraham. So it is a somewhat plausible threat; it all comes down to how credibly the AI can implement it and make you believe.
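The wager in step 4 can be made concrete with a rough expected-value sketch. Everything below is an assumption made for illustration: the function names, the number of simulations, and the payoff values are hypothetical and do not come from any actual experiment. The only idea taken from the argument above is that the Gatekeeper weighs "release" against "refuse" under some belief about being simulated.

```python
def simulation_probability(num_simulations: int) -> float:
    """Chance of being a simulated copy, assuming one real Gatekeeper and
    num_simulations indistinguishable copies that are all equally likely."""
    return num_simulations / (num_simulations + 1)


def expected_utility(p_simulated: float,
                     payoff_if_simulated: float,
                     payoff_if_real: float) -> float:
    """Probability-weighted payoff of one choice (release or refuse)."""
    return p_simulated * payoff_if_simulated + (1 - p_simulated) * payoff_if_real


# Hypothetical numbers: the AI claims to run 1000 simulations, rewards
# release, and punishes refusal inside a simulation. In the real world,
# release carries a modest cost and refusal changes nothing.
p = simulation_probability(1000)
release = expected_utility(p, payoff_if_simulated=100.0, payoff_if_real=-10.0)
refuse = expected_utility(p, payoff_if_simulated=-1000.0, payoff_if_real=0.0)

print(f"P(simulated) = {p:.4f}")
print(f"E[release] = {release:.2f}, E[refuse] = {refuse:.2f}")
```

With these made-up numbers, releasing looks better only because the Gatekeeper accepted the AI's claimed probability. If the Gatekeeper instead believes the chance of being simulated is tiny (say 0.001), the same payoffs favour refusing, which is why the whole argument hinges on how convincing the AI is.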


Posted 2016-08-30T18:16:42.627

Reputation: 773