## Computationally expensive AI techniques (that are promising)

To produce tangible results in the field of ML/AI, one must take theoretical results under the lens of computational complexity.

Indeed, minimax effectively solves any two-person "board game" with win/loss conditions, but the algorithm quickly becomes untenable for games of large enough size, so it's practically useless aside from toy problems.
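To make the blow-up concrete, here is a minimal sketch of plain minimax (no pruning, no evaluation function) on a toy subtraction game: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The game, the function names and the node counter are illustrative assumptions, not something taken from the question.

```python
def minimax(stones, maximizing, counter):
    """Return +1 if the maximizing player wins with perfect play, else -1."""
    counter[0] += 1  # count every visited game-tree node
    if stones == 0:
        # The player who just moved took the last stone and won.
        return -1 if maximizing else 1
    values = [
        minimax(stones - take, not maximizing, counter)
        for take in (1, 2, 3)
        if take <= stones
    ]
    return max(values) if maximizing else min(values)


def solve(stones):
    """Return (game value for the first player, number of nodes visited)."""
    counter = [0]
    value = minimax(stones, True, counter)
    return value, counter[0]


if __name__ == "__main__":
    for n in (4, 8, 12, 16, 20):
        value, nodes = solve(n)
        print(f"stones={n:2d} value={value:+d} nodes={nodes}")
```

Even with a branching factor of only 3, the node count grows exponentially with the number of stones, which is the same effect that makes exhaustive minimax hopeless for chess or Go.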

In fact, this issue seems to cut at the heart of intelligence itself: the Frame Problem highlights this by observing that any "intelligent" agent that operates under logical axioms must somehow deal with the explosive growth of computational complexity.

So we need to deal with computational complexity, but that doesn't mean researchers must limit themselves to practical concerns. In the past, multilayered perceptrons were thought to be intractable (I think), and thus we couldn't evaluate their utility until recently. I've heard that Bayesian techniques are conceptually elegant, but they become computationally intractable once your dataset becomes large, and thus we usually use variational methods to approximate the posterior, instead of naively computing the exact solution.

I'm looking for more examples like this: What are examples of "neat" ideas in ML/AI that are impracticable due to computational intractability?

Are you also looking for uncomputable algorithms (and not just intractable), but that nonetheless may have approximations? – nbro – 2019-10-20T23:40:57.943

good question! i didn't consider that. my intent with the question was to see if there are theories/approaches that are hypothesized to produce results, but due to practical concerns (that we have limited compute, that we don't have super-quantum computers, that logic works a certain way, or anything beyond the scope of my imagination) we do not implement / cannot evaluate. i think that has the potential to be overly broad, but for now im just "fishing" for what creative ideas people might say. – k.c. sayz 'k.c sayz' – 2019-10-21T00:03:55.293

but i want something that is at least partially concrete: so something like (im 100% making this up, i dont know what im saying) a function which by definition outputs 1 if insert smart behavior, and then throw said function onto a "thing" that can do unbounded-computation. well... if there's a well-formulated way of defining that i would be curious, but i don't think such a well-formed definition exists..... – k.c. sayz 'k.c sayz' – 2019-10-21T00:07:37.967

i'm also not interested in the suggestion of "insert universe simulator that knows of all physics" as a function because of above reason. (but if you dear reader can well-formulate this construction, be sure to let me know) – k.c. sayz 'k.c sayz' – 2019-10-21T00:10:22.177

but i guess it might be the case that there are very few answers to this question, because computational tractability is a requisite for anyone who wants results....... or more poignantly, the entire study of ML is an attempt to trace the untraceable computations (im just making this claim up) – k.c. sayz 'k.c sayz' – 2019-10-21T00:12:06.370

I have to challenge the premise that minimax "quickly becomes untenable" for AI, when chess programs are now better than humans. Aren't you redefining AI down (up) thereby? And/or mis-equating "solving" with "intelligence"? And as to "any agent", what about humans, how do we manage it? (We are using humans as definitional of "intelligence", yes?) – Jeff Y – 2019-10-21T15:23:21.820

not sure what you mean by " redefining AI down (up)". regarding minimax: chess AI (traditional methods: i am not speaking of AlphaZero) is "strong" (better than humans) because of evaluation functions, deep(er) search, and endgame books: not pure mini max. it's well-known that minimax is untenable: i don't know where you are coming from. just consider for yourself that Go has (roughly) 361 positions per move. take 300, and consider 40 moves, which is 300^40 = more than the atoms in the universe. – k.c. sayz 'k.c sayz' – 2019-10-21T16:28:44.890

regarding "mis-equating "solving" with "intelligence"": im not sure how to interpret this: do you mean that the notion of intelligence has nothing to do with solving problems? or do you question the notion of emulating intelligence with solving an equation? i think you mean the latter so i'll just answer that: i'm not sure if i think that intelligence will ever be "solved" given some "godly-equation", but if we have a "super-equation" that does a lot of things (maybe it can prove Goldbach, but it can't write a poem?), then it would still be worthy to study – k.c. sayz 'k.c sayz' – 2019-10-21T16:33:20.750

regarding "And as to "any agent", what about humans, how do we manage it?": i assume you mean my statement regarding the Frame Problem? i think you misunderstand the frame problem and its context: the Frame Problem is a counter-argument of the hypothesis that the mind operates on a GOFAI-type (strictly logical propositions over axiomatic deductions) system, which was a prevailing theory during the time. we don't know how the mind works. – k.c. sayz 'k.c sayz' – 2019-10-21T16:38:17.177

"We are using humans as definition of "intelligence", yes?": using humans as the basis of "intelligence" is a necessity: we don't know what is "smart" asides from ourselves, so we can only use our own subjective selves as the basis of this alleged "thing". Friston implicitly claims that intelligence is "reduction of thermodynamic free-energy"; tied to Baez's observation that "life is self-replicating information on how to self-replicate" in conjunction with that living things consume free-energy; tied to the thesis of Predictive Coding. none of this is rigour, so we are left to speculate. – k.c. sayz 'k.c sayz' – 2019-10-21T16:58:54.433

^oh, my above comment was to note that non "behavioural"/"human-intentional" definitions of intelligence exist, and there are people trying to make them rigour. – k.c. sayz 'k.c sayz' – 2019-10-21T17:00:28.313

I mean that minimax (with alpha-beta) is an essential component of some chess programs that beat all human players. So it's not "untenable" for AI purposes. Unless you are redefining that capability as "not AI", which your wording indicates you're not. Your argument seems to be that minimax cannot solve chess (because chess as a problem is too large). But solving a game is not necessary for AI, so I don't know where you're coming from. Frame problem: https://plato.stanford.edu/entries/frame-problem/. Yes, the human mind is not (solely) an inference engine. – Jeff Y – 2019-10-21T17:44:33.833

i think i see your point, but let me voice my response first: by "minimax is untenable" i mean that "minimax solves certain games" is a fact, but is untenable due to computational reasons. i think you know what i mean by "solve", but just to be sure, by "solve" i mean that one player can force a win/tie by playing by a specific method, while the other player will always lose/tie if the other player plays by that strategy, no matter what they do. so yes, minimax solves all "board games" of a certain type (win/loss; information complete, i might be missing something) – k.c. sayz 'k.c sayz' – 2019-10-21T18:26:21.627

i think your concern is with my notion of "what does strong-AI imply": i assume by AI, you would consider passing the Turing Test, or some other behavioural "quasi-equivalence" test as sufficient, and thus "solving Chess" is not a requisite for AI? fair enough, i conceded that i wasn't clear enough, but im also not sure on how I would modify my question to address that. – k.c. sayz 'k.c sayz' – 2019-10-21T18:31:32.007

^in any case, there seems to be two approaches the interpretation of "AGI": one is that it acts like humans; another is that it's some of super-intelligence doing things we can never dream of. our pragmatic goal the former, but it seems that to reach that goal would also require us to reach the 2nd goal: ironically, the fact that human nature is so elusive to formal analysis requisites that we can only assume that an unboundedly powerful agent can simulate humanness. but honestly this latter claim is just my own analysis, and i dont know if anyone else says this. – k.c. sayz 'k.c sayz' – 2019-10-21T18:44:41.653

^^but in any case, once we have AGI to a degree of human-likeness, there is no reason for it to be bounded within human intelligence. if we know some algorithm that produces "human intelligence", then surely we can throw that same algorithm onto a more powerful computer to generate a more powerful AGI? so again, our analysis of AGI usually defaults towards the latter interpretation instead of the first. there is also the question of self-improvement, but i won't comment on that. – k.c. sayz 'k.c sayz' – 2019-10-21T18:48:25.410

Fully bidirectional wide+deep LSTMs are probably right on the edge of what's computationally possible right now, is that along the lines of what you're looking for? – Oso – 2019-10-21T18:48:41.910

@Andy honestly i don't want answers that more-or-less are "deeper, wider, more parameters", but that's a tricky distinction to make. my question itself fails this (MLP vs single layer). but your response could be valid: it can produce significant results (but: though we know that NN's are universal function approximators, it seems ludicrous to imagine that "going deeper" is the solution to ML lol (also it overfits)). i think you can submit an answer if you provide reason to believe that such an architecture would improve results beyond "because deeper, wider, more params". but you do you! – k.c. sayz 'k.c sayz' – 2019-10-21T19:09:59.590

I'm not pointing to wider/deeper in general, but rather fully bidirectional deep networks (I specified LSTM, but it's not the only option). – Oso – 2019-10-21T19:11:15.353

im not sure how being bidirectional changes things, due to what i assume is my own ignorance on the subject. if you think that this is a significant architectural change, and/or there is unique/significant reason to believe this approach makes things better then please submit an answer. – k.c. sayz 'k.c sayz' – 2019-10-21T19:28:18.040

15

AIXI is a Bayesian, non-Markov, reinforcement learning and artificial general intelligence agent that is incomputable, because its definition involves the incomputable Kolmogorov complexity. However, there are approximations of AIXI, such as AIXItl, described in Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability (2005), by Marcus Hutter (the original author of AIXI), and MC-AIXI-CTW (which stands for Monte Carlo AIXI Context-Tree Weighting). Here is a Python implementation of MC-AIXI-CTW: https://github.com/gkassel/pyaixi.

13

To be concrete, exact Bayesian inference is (often) intractable (that is, not polynomially computable) because it involves the computation of an integral over a range of real (or even floating-point) numbers, which in general is not a polynomial-time operation. More precisely, for example, if you want to find the parameters $$\mathbf{\theta} \in \Theta$$ of a model given some data $$D$$, then Bayesian inference is just the application of Bayes' theorem

\begin{align} p(\mathbf{\theta} \mid D) &= \frac{p(D \mid \mathbf{\theta}) p(\mathbf{\theta})}{p(D)} \\ &= \frac{p(D \mid \mathbf{\theta}) p(\mathbf{\theta})}{\int_{\Theta} p(D \mid \mathbf{\theta}^\prime) p(\mathbf{\theta}^\prime) d \mathbf{\theta}^\prime} \\ &= \frac{p(D \mid \mathbf{\theta}) p(\mathbf{\theta})}{\int_{\Theta} p(D, \mathbf{\theta}^\prime) d \mathbf{\theta}^\prime } \tag{1}\label{1} \end{align}

where $$p(\mathbf{\theta} \mid D)$$ is the posterior (which is what you want to find or compute), $$p(D \mid \mathbf{\theta})$$ is the likelihood of your data given the (fixed) parameters $$\mathbf{\theta}$$, $$p(\mathbf{\theta})$$ is the prior and $$p(D) = \int_{\Theta} p(D \mid \mathbf{\theta}^\prime) p(\mathbf{\theta}^\prime) d \mathbf{\theta}^\prime$$ is the evidence of the data (an integral, given that $$\mathbf{\theta}$$ is assumed to be a continuous random variable). The evidence is the intractable term, because the integral is over all possible values of $$\mathbf{\theta}$$, that is, $${\Theta}$$. If all terms in \ref{1} were tractable (polynomially computable), then, given more data $$D$$, you could iteratively keep on updating your posterior (which becomes your prior on the next iteration), and exact Bayesian inference would become tractable.
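As a rough illustration of why the evidence term is the bottleneck, the sketch below approximates $$p(D)$$ on a grid for a one-parameter coin-flip model with a uniform prior, and compares it with the closed form that happens to exist in this conjugate case. The model, the function names and the grid size are assumptions chosen for illustration; the last helper only counts how many grid points a naive grid would need as the number of parameters grows.

```python
from math import factorial


def exact_evidence(heads, flips):
    """Closed-form evidence for a uniform prior: the Beta function B(h+1, n-h+1)."""
    return factorial(heads) * factorial(flips - heads) / factorial(flips + 1)


def grid_evidence(heads, flips, grid_size=1000):
    """Midpoint-rule approximation of p(D) = integral of p(D | theta) p(theta) d theta."""
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        likelihood = theta ** heads * (1.0 - theta) ** (flips - heads)
        prior = 1.0  # uniform prior density on [0, 1]
        total += likelihood * prior * width
    return total


def grid_evaluations(grid_size, dimensions):
    """Number of likelihood evaluations a naive grid needs in `dimensions` parameters."""
    return grid_size ** dimensions


if __name__ == "__main__":
    print(exact_evidence(7, 10), grid_evidence(7, 10))
    print(grid_evaluations(100, 1), grid_evaluations(100, 10))
```

In one dimension the grid is cheap and accurate, but the same resolution over ten parameters would already need $$100^{10}$$ evaluations, which is the practical sense in which the integral becomes intractable.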

The variational Bayesian approach casts the problem of inferring $$p(\mathbf{\theta} \mid D)$$ (which requires the computation of the intractable evidence term) as an optimization problem that approximately finds the posterior. More precisely, it approximates the intractable posterior, $$p(\mathbf{\theta} \mid D)$$, with a tractable one, $$q(\mathbf{\theta} \mid D)$$ (the variational distribution). For example, the important variational auto-encoder (VAE) paper (which did not introduce the variational Bayesian approach) uses the variational Bayesian approach to approximate a posterior in the context of neural networks (that represent distributions), so that existing machine (or deep) learning techniques (that is, gradient descent with back-propagation) can be used to learn the parameters of a model.
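The optimization problem mentioned above is usually stated through the following standard identity, written here with the symbols already used in this answer (the name ELBO, evidence lower bound, is the usual one and is not used elsewhere in this answer):

\begin{align} \log p(D) &= \underbrace{\mathbb{E}_{q(\mathbf{\theta} \mid D)}\left[\log p(D, \mathbf{\theta}) - \log q(\mathbf{\theta} \mid D)\right]}_{\text{ELBO}} + \mathrm{KL}\left(q(\mathbf{\theta} \mid D) \,\|\, p(\mathbf{\theta} \mid D)\right) \end{align}

Since the KL term is non-negative and $$\log p(D)$$ does not depend on $$q$$, maximizing the ELBO over $$q$$ is the same as minimizing the KL divergence to the true posterior. The ELBO only involves the joint $$p(D, \mathbf{\theta})$$ and $$q$$, so the evidence integral never has to be computed.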

The variational Bayesian approach (VBA) is becoming increasingly appealing in machine learning. For example, Bayesian neural networks (which can partially solve some of the inherent problems of non-Bayesian neural networks) are usually inspired by the results reported in the VAE paper, which shows the feasibility of the VBA in the context of deep learning.

ah, thank you for the clarification. i was partially aware of this, but i wasn't sure how to phrase it due to my understanding being incomplete. i'll update the question to reflect this. – k.c. sayz 'k.c sayz' – 2019-10-20T23:11:14.660

@k.c.sayz'k.csayz' But then you partially make my answer useless. – nbro – 2019-10-20T23:22:36.130

oh no, what should i do? i think it would be better to make the question well-formed but..... – k.c. sayz 'k.c sayz' – 2019-10-20T23:23:50.353

@k.c.sayz'k.csayz' Well, I go into the details of Bayesian inference and why they can be intractable (while you don't mention this), so your last edit to your question is fine. I think my answer still addresses some problems related to Bayesian inference that are related to your question. – nbro – 2019-10-20T23:36:29.137

This is a good topic, but note that from a practical perspective, inexact Bayesian methods are very fast, and usually very accurate, so these problems are already widely solved. I think a better example is learning the structure of a Bayesian network. – John Doucette – 2019-10-21T16:15:10.157

@JohnDoucette In this answer, I am clarifying the motivation for variational Bayesian approaches, that is, the intractability of Bayesian inference. I am not saying that variational Bayesian approaches are intractable. – nbro – 2019-10-21T16:24:38.473

@nbro I think I understand that from your answer, but the Gibbs Sampling approaches (mentioned on the VB wikipedia page) are much more widespread, and actually just solve the Bayesian Inference problem directly, not by substitution. They do this using a stochastic technique (MCMC), so they are not exact, but they are highly effective in practice. Because of this, I think this is not a good use case: Bayesian Inference is routinely done well, so it's not a big limiting case for current AI research. – John Doucette – 2019-10-21T17:07:03.247

@JohnDoucette Again, you wanted to say approximate Bayesian inference is tractable (or approximative BI is routinely done well), not Bayesian inference. Bayesian inference is a good example of an intractable problem that has some good approximative solutions – nbro – 2019-10-21T17:14:22.517

@nbro You're technically correct, but it's a misleading example. The problem is NP-Hard. There's no efficient algorithm for an exact solution, and probably never will be. But the same is true of almost every AI technique. Exact optimization for any ML problem. Search is intractable. Scheduling is intractable. Logical inference is intractable. Despite this, AI solves all these problems routinely, because we have good techniques that work in practice. Learning the structure of a Bayesian network is something we have no good technique for. Learning the parameters is something we can do. – John Doucette – 2019-10-21T17:41:24.713

@JohnDoucette You can provide your own answer. I think your point of view will also be useful ;) – nbro – 2019-10-21T18:14:49.370

Hi @nbro, when you say "which is intractable because the integral is over all possible values of $\theta$", you mean it in a strict complexity theory meaning (i.e. superpolynomial in $\mathcal{D}$) or simply that it is an integral too hard to compute? – olinarr – 2019-10-22T06:41:09.550

@olinarr Have a look at this question Intractable posterior distributions, whose answers attempt to address your question. – nbro – 2019-10-22T12:30:33.657

@nbro thank you very much. I was just puzzled because, if the input variable is $\mathcal{D}$, then assuming we are doing numerical integration over a pre-determined range of $\theta$s, then shouldn't the computational complexity be constant? (because, independently of $\mathcal{D}$, we always do the same number of "integration steps" over $\theta$). – olinarr – 2019-10-22T12:39:22.883

@olinarr I have not found a rigorous argument or proof that shows the time or space complexity of integrals in the case of Bayesian inference. I would also like to see it. Anyway, I think you're right: in terms of complexity theory, the input is $D$ and the integration is over another variable. However, note that the posterior depends on $D$. – nbro – 2019-10-22T12:44:06.380

9

This question gets at a really interesting fact about AI research in general: AI is hard.

In fact, almost every AI problem is computationally hard (typically NP-Hard, or #P-Hard). This means that most new areas of AI research start out by characterizing some problem that is intractable, and proposing an algorithm that technically works, but is too slow to be useful. However, that's not the whole story. Usually AI researchers then proceed to develop tractable techniques according to one of two schools:

• Algorithms that usually work in practice, and are always fast, but are not completely correct.
• Algorithms that are always correct, and are usually fast, but are sometimes very slow, or only work on specific kinds of sub-problem.

Taken together, these let AI address most problems. For example:

• Search was developed as a general purpose AI technique for solving planning and logic problems. The first algorithm, called the general problem solver, always worked, but was extremely slow. Eventually, we developed heuristic guided search techniques like A*, domain specific tricks like GraphPlan, and stochastic search techniques like Monte-Carlo Tree Search.
• Bayesian Learning (or Bayesian Inference) has been known since the 1800's, but it is known to involve either the computation of intractable integrals, or the creation of exponentially sized discrete tables, making it NP-Hard. A very simple algorithm involves applying brute force and enumerating all of the options, but this is too slow. Eventually, we developed techniques like Gibbs Sampling (that is always fast, and usually right), or Variable Elimination (that is always right, and usually fast). Today we can solve most problems of this kind very well.
• Reasoning about language was thought to be very hard (see the Frame Problem), because there are an infinite number of possible sentences, and an infinite number of possible contexts they could be used in. Exact approaches based on rules did not work. Eventually we developed probabilistic approaches like Hidden Markov Models and Deep Neural Networks, that aren't certain to work, but work so well in practice that language problems are, if not completely solved, getting very close.
• Games of chance, like Poker, were thought to be impossible, because they are #P-Hard to compute exactly (this is harder than NP-Hard). There will probably never be an exact algorithm for these. In spite of this, techniques like CFR+ can derive solutions that are so close to exactly perfect that you would need to play for decades against them to tell the difference.
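As a small illustration of the "always fast, usually right" school mentioned in the Bayesian Learning bullet, here is a minimal Gibbs sampler. It is only a sketch under stated assumptions: the target is a standard bivariate normal with correlation `rho`, chosen because its conditionals are known in closed form, and the function names are hypothetical.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample from a standard bivariate normal by alternating conditional draws."""
    rng = random.Random(seed)
    cond_std = (1.0 - rho * rho) ** 0.5  # std of x given y (and of y given x)
    x, y = 0.0, 0.0
    samples = []
    for step in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_std)  # draw x from p(x | y)
        y = rng.gauss(rho * x, cond_std)  # draw y from p(y | x)
        if step >= burn_in:
            samples.append((x, y))
    return samples


def sample_correlation(samples):
    """Empirical Pearson correlation of a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
    print(round(sample_correlation(draws), 3))
```

Each sweep costs a fixed, small amount of work regardless of how hard the joint distribution is to integrate; the price is that the answer is a stochastic estimate rather than an exact value.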

So, what's still hard?

• Inferring the structure of a Bayesian network. This is closely related to the problem of causality. It's #P-Hard, but we don't currently have any good algorithms to even do this approximately very well. This is an active area of research.
• Picking a machine learning algorithm to use for an arbitrary problem. The No Free Lunch theorem tells us this is not possible in general, but it seems like we ought to be able to do it pretty well in practice.
• More to come...?

5

The logical induction algorithm can make predictions about whether mathematical statements are true or false, which are eventually consistent; e.g. if A is true, its probability will eventually reach 1; if B implies C then C's probability will eventually reach or exceed B's; the probability of D will eventually be one minus the probability of not(D); the probabilities of E and F will each eventually reach or exceed that of E AND F; etc.

It can also give consistent predictions about itself, e.g. "the logical induction algorithm will predict the probability of X to be Y at timestep T", whilst avoiding paradoxes like the liar's paradox.

ohh, this is very cool! MIRI stuff is very hard and i am forced by dumbness to ignore much of their work, but thank you for the source and a short description – k.c. sayz 'k.c sayz' – 2019-10-21T19:22:06.527

4

Hutter's "fastest and shortest algorithm for all well-defined problems" is the ultimate just-in-time compiler. It runs a given program and, in parallel, searches for proofs that some other program is equivalent but faster. The running program is restarted at exponentially-spaced intervals; if a faster program has been found, that is started instead. The running time of this algorithm is of the same order as the fastest provably-equivalent algorithm, plus a constant $$O(1)$$ term (the time taken to find the proof, which doesn't depend on the input size). For example, it will run Bubble Sort in at most $$O(n \log n)$$ time, by finding a proof that it's equivalent to a faster algorithm (like Merge Sort) and then switching to that algorithm.

Hutter's algorithm is similar to the best ahead-of-time compilers, known as super-optimisers. They search through all possible programs, starting with the smallest/fastest, until they find one equivalent to the given code. These are actually in use right now, but are only practical for programs that are a few (machine code) instructions long. The LLVM compiler contains some "peephole optimisations" (i.e. find/replace templates) that were found by a super-optimiser a few years ago. Note that super-optimisation should not be confused with super-compilation (a rather general optimisation, which is not optimal and involves no search).
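The brute-force search that a super-optimiser performs can be sketched on a toy scale. Everything below is an illustrative assumption: a made-up four-instruction accumulator machine, and equivalence checked only by testing on a handful of inputs, whereas a real super-optimiser would verify equivalence properly.

```python
from itertools import product

# Hypothetical instruction set acting on a single integer accumulator.
OPS = {
    "inc": lambda x: x + 1,
    "dec": lambda x: x - 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}


def run(program, x):
    """Apply each instruction of `program` to the accumulator in order."""
    for name in program:
        x = OPS[name](x)
    return x


def superoptimise(reference, test_inputs, max_length=5):
    """Return the shortest program that matches `reference` on all test inputs."""
    test_inputs = list(test_inputs)
    expected = [reference(x) for x in test_inputs]
    for length in range(max_length + 1):  # shortest programs first
        for program in product(OPS, repeat=length):
            if [run(program, x) for x in test_inputs] == expected:
                return program
    return None


def slow_reference(x):
    """A deliberately roundabout way of computing 4 * (x + 1)."""
    total = 0
    for _ in range(4):
        total += x + 1
    return total


if __name__ == "__main__":
    print(superoptimise(slow_reference, range(-5, 6)))
```

With 4 instructions there are $$4^L$$ candidate programs of length $$L$$, which is why this kind of search is only practical for very short instruction sequences.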

3

wow! so many good answers from you. maybe this question was indeed too broad haha. thank you and i will take some time to digest these hefty papers – k.c. sayz 'k.c sayz' – 2019-10-21T00:39:44.947

@k.c.sayz'k.csayz' Well, initially, only AIXI came to my mind, but you were asking for intractable problems, not uncomputable ones. I also thought about providing clarifications and details regarding the intractability of Bayesian inference. Meanwhile, other intractable problems came to my mind, so it is possible that there are other answers to your question :) You're asking for examples, so don't worry, I don't think your question is too broad. – nbro – 2019-10-21T00:42:10.627

3

Levin's search algorithm is a general method of function inversion. Many AI tasks are of this sort, e.g. given a cost or reward function (object -> cost or object -> reward), its inverse (cost -> object or reward -> object) would find an object with the given cost/reward; we could ask this inverse function for an object with low cost or high reward.

Levin's algorithm is optimal iff the given function is a "black box" with no known pattern in its output. For example, if a small change in the input produces a small change in the output, Levin search wouldn't be optimal; instead we could use hill climbing or some other gradient method.

Levin's algorithm looks for the function's inverse by running all possible programs in parallel, assigning exponentially more time to shorter programs. Whenever a program halts, we check whether its output is the desired inverse (i.e. whether givenProgram(outputOfHaltedProgram) = desiredOutput, e.g. whether cost(outputOfHaltedProgram) = low).
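The search described above can be sketched with a toy interpreter. This is a hedged illustration, not Levin's formulation: the four-instruction language (2 bits per instruction), the `loop` instruction, and the phase-based schedule that simulates the parallel run (in phase `k`, a program of `b` bits gets `2**(k - b)` steps) are all assumptions made for the example.

```python
from itertools import product

OPS = ("inc", "dbl", "sqr", "loop")  # 4 instructions, so 2 bits each
BITS_PER_OP = 2


def run(program, max_steps):
    """Run `program` from x = 0; return x if it halts within max_steps, else None."""
    x, pc, steps = 0, 0, 0
    while pc < len(program):
        if steps >= max_steps:
            return None  # budget exhausted, possibly an infinite loop
        op = program[pc]
        steps += 1
        if op == "inc":
            x, pc = x + 1, pc + 1
        elif op == "dbl":
            x, pc = x * 2, pc + 1
        elif op == "sqr":
            x, pc = x * x, pc + 1
        else:  # "loop": jump back to the start while x is still small
            pc = 0 if x < 100 else pc + 1
    return x


def levin_search(f, target, max_phase=20):
    """Find a program whose output `c` satisfies f(c) == target."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase // BITS_PER_OP + 1):
            # Shorter programs receive exponentially more steps.
            budget = 2 ** (phase - BITS_PER_OP * length)
            for program in product(OPS, repeat=length):
                candidate = run(program, budget)
                if candidate is not None and f(candidate) == target:
                    return program, candidate
    return None


if __name__ == "__main__":
    # Invert f(x) = x * x at 144, i.e. search for a program that outputs 12.
    print(levin_search(lambda x: x * x, 144))
```

Note that programs such as `("loop",)` never halt; the step budget is what lets the search move on instead of waiting for them.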

This way "simpler" guesses at the inverse are made first; where we define the simplicity (AKA "Levin complexity") of a value by looking through all programs $$p$$ which generate that value, and minimising the sum of: $$p$$'s length (in bits) plus the logarithm of $$p$$'s running time (in steps). If we ignored running time we would get Kolmogorov complexity, which is theoretically nicer but is incomputable (we don't know when to give up waiting for short non-halting programs, due to the Halting Problem). Levin complexity is computable, since we can give up waiting on a possible loop once it has used up its share of steps, with longer programs getting exponentially less time (e.g. once we've spent $$T$$ steps waiting for a possible loop of length $$N$$, we can start trying programs that are $$N+1$$ bits long for $$T/2$$ steps).

The running time of Levin Search is of the same order as the simplest such inverse-value-generating program. However, this is misleading, since the fraction of steps allocated to running any particular program $$p$$ is $$1/2^{complexity(p)}$$, so this constant factor will be slowing down the computation of the inverse too. There is also overhead associated with context-switching between all of these programs.
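In symbols, the definition and the resulting time bound can be written as follows. The notation is an assumption introduced here: $$U$$ is a fixed reference machine, $$\ell(p)$$ is the length of $$p$$ in bits, and $$t(p)$$ is its running time in steps.

$$Kt(x) = \min_{p} \left\{ \ell(p) + \log_2 t(p) \;:\; U(p) = x \right\}$$

If program $$p$$ receives a fraction $$2^{-\ell(p)}$$ of all steps, then it finishes after roughly

$$2^{\ell(p)} \cdot t(p) = 2^{\ell(p) + \log_2 t(p)}$$

total steps of the search, and minimising over all programs that output $$x$$ gives a total of about $$2^{Kt(x)}$$ steps. The "same order" claim therefore hides the factor $$2^{\ell(p)}$$, which is constant with respect to the input but can be astronomically large.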

The FAST algorithm does the same job as Levin Search, in the same time, but avoids the overhead of context-switching between an infinite number of parallel programs. Instead it runs one program at a time, cuts it off if it hasn't halted within an appropriate number of steps, then retries for twice as many steps later on. The GUESS algorithm is also equivalent, but chooses programs at random; the expected runtime is the same, but there's no need to keep track of loop counters like in FAST, plus it can be run on parallel hardware without having to coordinate anything (whilst still avoiding the infinite parallelism of the original).

Levin search is currently impractical in its original setting of searching through general-purpose, Turing-complete programs. It can be useful in less general domains, e.g. searching through the space of hyper-parameters or other domain-specific, configuration-like "programs".

Can you specify the time complexity of Levin search, etc.? – nbro – 2019-10-22T10:20:08.720

Added a bit about time, and "Levin complexity" (which time is parameterised by) – Warbo – 2019-10-22T14:42:35.730