204

234

Maybe the concept, why it's used, and an example.


187

First, we need to understand what a Markov chain is. Consider the following weather example from Wikipedia. Suppose that the weather on any given day can be classified into only two states: sunny and rainy. Based on past experience, we know the following:

$P(\text{Next day is Sunny}\,\vert \,\text{Given today is Rainy})=0.50$

Since the next day's weather is either sunny or rainy, it follows that:

$P(\text{Next day is Rainy}\,\vert \,\text{Given today is Rainy})=0.50$

Similarly, let:

$P(\text{Next day is Rainy}\,\vert \,\text{Given today is Sunny})=0.10$

Therefore, it follows that:

$P(\text{Next day is Sunny}\,\vert \,\text{Given today is Sunny})=0.90$

The above four numbers can be compactly represented as a transition matrix, whose entries give the probabilities of the weather moving from one state (rows) to another (columns):

$P = \begin{array}{c|cc} & S & R \\ \hline S & 0.9 & 0.1 \\ R & 0.5 & 0.5 \end{array}$

We might ask several questions whose answers follow:

**Q1:** If the weather is sunny today then what is the weather likely to be tomorrow?

**A1:** Since we do not know for sure what is going to happen, the best we can say is that there is a $90\%$ chance that it will be sunny and a $10\%$ chance that it will be rainy.

**Q2:** What about two days from today?

**A2:** The one-day prediction is $90\%$ sunny, $10\%$ rainy. Therefore, two days from now there are two ways the weather can be sunny:

Tomorrow it can be sunny and the day after also sunny. The chance of this happening is $0.9 \times 0.9$.

Or

Tomorrow it can be rainy and the day after sunny. The chance of this happening is $0.1 \times 0.5$.

Therefore, the probability that the weather will be sunny in two days is:

$P(\text{Sunny 2 days from now}) = 0.9 \times 0.9 + 0.1 \times 0.5 = 0.81 + 0.05 = 0.86$

Similarly, the probability that it will be rainy is:

$P(\text{Rainy 2 days from now}) = 0.1 \times 0.5 + 0.9 \times 0.1 = 0.05 + 0.09 = 0.14$

In linear algebra terms, these calculations enumerate all the possible transition paths from one step to the next (sunny-to-sunny, sunny-to-rainy, rainy-to-sunny, rainy-to-rainy) together with their probabilities. The probability of a future state ($t+1$ or $t+2$), given the probabilities (the probability mass function, PMF) of every state (sunny or rainy) at time zero (now, or $t_0$), is then a simple matrix multiplication: the one-step-ahead distribution is the current distribution times $P$, and the two-step probabilities are the entries of $P^2$.

If you keep forecasting weather like this you will notice that eventually the $n$-th day forecast, where $n$ is very large (say $30$), settles to the following 'equilibrium' probabilities:

$P(\text{Sunny}) = 0.833$

and

$P(\text{Rainy}) = 0.167$

In other words, your forecast for the $n$-th day and the $(n+1)$-th day remain the same. In addition, you can check that the 'equilibrium' probabilities do not depend on the weather today: you would get the same forecast whether you start off by assuming that today's weather is sunny or rainy.
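These calculations and the equilibrium claim are easy to check numerically. A quick sketch (my addition, using NumPy) that raises the transition matrix to successive powers:

```python
import numpy as np

# Transition matrix from the example: rows = today's state (S, R),
# columns = tomorrow's state (S, R).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Two-step transition probabilities are the entries of P squared.
P2 = np.linalg.matrix_power(P, 2)
print(P2[0])  # starting from Sunny: [0.86, 0.14]

# After many steps, every row converges to the same equilibrium
# distribution, regardless of the starting state.
P30 = np.linalg.matrix_power(P, 30)
print(P30)  # both rows are approximately [0.833, 0.167]
```

Both rows of $P^{30}$ being (nearly) identical is exactly the statement that the forecast no longer depends on today's weather.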

The above example will only work if the state transition probabilities satisfy several conditions which I will not discuss here. But notice the following feature of this 'nice' Markov chain (nice = transition probabilities satisfy the conditions):

Irrespective of the initial starting state we will eventually reach an equilibrium probability distribution of states.

Markov Chain Monte Carlo exploits the above feature as follows:

We want to generate random draws from a target distribution. We then identify a way to construct a 'nice' Markov chain such that its equilibrium probability distribution is our target distribution.

If we can construct such a chain, then we arbitrarily start from some point and iterate the Markov chain many times (like how we forecast the weather $n$ times). Eventually, the draws we generate will appear as if they are coming from our target distribution.

We then approximate the quantities of interest (e.g., the mean) by taking the sample average of the draws, after discarding a few initial draws. This averaging is the Monte Carlo component.

There are several ways to construct 'nice' Markov chains (e.g., Gibbs sampler, Metropolis-Hastings algorithm).
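For instance, a Gibbs sampler alternates draws from each unknown's conditional distribution given the others. A minimal sketch (my own illustration, not part of the original answer) targeting a bivariate normal with correlation $\rho = 0.8$, where each full conditional is itself a normal distribution:

```python
import random
import math

# Gibbs sampler for a bivariate normal with correlation rho = 0.8.
# The full conditionals are exact:  x | y ~ N(rho * y, 1 - rho^2),
# and symmetrically for y | x.
random.seed(0)
rho = 0.8
sd = math.sqrt(1 - rho ** 2)

x, y = 0.0, 0.0          # arbitrary starting point
xs, ys = [], []
for _ in range(50_000):
    x = random.gauss(rho * y, sd)   # draw x from its full conditional
    y = random.gauss(rho * x, sd)   # draw y from its full conditional
    xs.append(x)
    ys.append(y)

# Discard a short burn-in; the remaining draws behave like samples
# from the target, so their sample correlation is close to rho.
xs, ys = xs[1000:], ys[1000:]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx) ** 2 for a in xs) / n
vy = sum((b - my) ** 2 for b in ys) / n
print(round(cov / math.sqrt(vx * vy), 2))  # close to 0.8
```

Note that we never draw from the joint distribution directly; the chain of conditional draws converges to it.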

79

I'd probably say something like this:

"Anytime we want to talk about probabilities, we're really integrating a density. In Bayesian analysis, a lot of the densities we come up with aren't analytically tractable: you can only integrate them -- if you can integrate them at all -- with a great deal of suffering. So what we do instead is simulate the random variable a lot, and then figure out probabilities from our simulated random numbers. If we want to know the probability that X is less than 10, we count the proportion of simulated random variable results less than 10 and use that as our estimate. That's the "Monte Carlo" part, it's an estimate of probability based off of random numbers. With enough simulated random numbers, the estimate is very good, but it's still inherently random.

"So why "Markov Chain"? Because under certain technical conditions, you can generate a memoryless process (aka a Markovian one) that has the same limiting distribution as the random variable that you're trying to simulate. You can iterate any of a number of different kinds of simulation processes that generate correlated random numbers (based only on the current value of those numbers), and you're guaranteed that once you pool enough of the results, you will end up with a pile of numbers that looks "as if" you had somehow managed to take independent samples from the complicated distribution you wanted to know about.

"So for example, if I want to estimate the probability that a standard normal random variable was less than 0.5, I could generate ten thousand independent realizations from a standard normal distribution and count up the number less than 0.5; say I got 6905 that were less than 0.5 out of 10000 total samples; my estimate for P(Z<0.5) would be 0.6905, which isn't that far off from the actual value. That'd be a Monte Carlo estimate.

"Now imagine I couldn't draw independent normal random variables, instead I'd start at 0, and then with every step add some uniform random number between -0.5 and 0.5 to my current value, and then decide based on a particular test whether I liked that new value or not; if I liked it, I'd use the new value as my current one, and if not, I'd reject it and stick with my old value. Because I only look at the new and current values, this is a Markov chain. If I set up the test to decide whether or not I keep the new value correctly (it'd be a random walk Metropolis-Hastings, and the details get a bit complex), then even though I never generate a single normal random variable, if I do this procedure for long enough, the list of numbers I get from the procedure will be distributed like a large number of draws from something that generates normal random variables. This would give me a Markov Chain Monte Carlo simulation for a standard normal random variable. If I used this to estimate probabilities, that would be a MCMC estimate."
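The procedure described in the last paragraph can be sketched in a few lines of Python (my own illustration, with the "particular test" spelled out as the Metropolis acceptance rule; a production sampler would differ in its details):

```python
import random
import math

# Random-walk Metropolis for a standard normal target:
# start at 0, propose current value + Uniform(-0.5, 0.5),
# and accept or reject based on the ratio of target densities.
random.seed(0)

def target(x):
    # Standard normal density up to a constant; the constant cancels
    # in the acceptance ratio, which is the whole point of MCMC.
    return math.exp(-x * x / 2)

x = 0.0
draws = []
for _ in range(200_000):
    proposal = x + random.uniform(-0.5, 0.5)
    # Accept with probability min(1, target(proposal) / target(x));
    # otherwise keep the current value.
    if random.random() < target(proposal) / target(x):
        x = proposal
    draws.append(x)

# MCMC estimate of P(Z < 0.5); the true value is about 0.6915.
estimate = sum(d < 0.5 for d in draws) / len(draws)
print(round(estimate, 2))
```

Even though no normal random variable is ever generated directly, the pooled list of (correlated) values estimates normal probabilities well.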

7That's a good explanation, but not for a nontechnical layperson. I suspect the OP wanted to know how to explain it to, say, the MBA who hired you to do some statistical analysis! How would you describe MCMC to someone who, at best, sorta understands the concept of a standard deviation (variance, though, may be too abstract)? – Harlan – 2010-07-20T02:18:02.333

7@Harlan: It's a hard line to straddle; if someone doesn't at least know what a random variable is, why we might want to estimate probabilities, and have some hazy idea of a density function, then I don't think it *is* possible to meaningfully explain the how or why of MCMC to them, only the "what", which in this case would boil down to "it's a way of numerically solving an otherwise impossible problem by simulation, like flipping a coin a lot to estimate the probability that it lands on heads". – Rich – 2010-07-20T02:39:24.543

Cool explanation. I think this is great for a nontechnical person. – SmallChess – 2015-06-11T03:54:01.127

Doubt - In the last para, why would the list of numbers seem like coming from a normal distribution? Is it because of the Central Limit Theorem? Further, what if we wanted to simulate some other distribution? – Manoj Kumar – 2015-11-25T02:27:43.210

1+1 for the last paragraph. With a minimum of technicalities it conveys the idea well. – whuber – 2010-11-03T00:52:31.300

77

I think there's a nice and simple intuition to be gained from the (independence-chain) Metropolis-Hastings algorithm.

First, what's the goal? The goal of MCMC is to draw samples from some probability distribution without having to know its exact height at any point. The way MCMC achieves this is to "wander around" on that distribution in such a way that the amount of time spent in each location is proportional to the height of the distribution. If the "wandering around" process is set up correctly, you can make sure that this proportionality (between time spent and height of the distribution) is achieved.

Intuitively, what we want to do is to walk around on some (lumpy) surface in such a way that the amount of time we spend (or the number of samples drawn) in each location is proportional to the height of the surface at that location. So, e.g., we'd like to spend twice as much time on a hilltop that's at an altitude of 100m as we do on a nearby hill that's at an altitude of 50m. The nice thing is that we can do this even if we don't know the absolute heights of points on the surface: all we have to know are the *relative* heights. E.g., if one hilltop A is twice as high as hilltop B, then we'd like to spend twice as much time at A as we spend at B.

The simplest variant of the Metropolis-Hastings algorithm (independence chain sampling) achieves this as follows: assume that in every (discrete) time-step, we pick a random new "proposed" location (selected uniformly across the entire surface). If the proposed location is higher than where we're standing now, move to it. If the proposed location is lower, then move to the new location with probability p, where p is the ratio of the height of that point to the height of the current location. (I.e., flip a coin with a probability p of getting heads; if it comes up heads, move to the new location; if it comes up tails, stay where we are.) Keep a list of the locations you've been at on every time step, and that list will (asymptotically) have the right proportion of time spent in each part of the surface. (And for the A and B hills described above, you'll end up with twice the probability of moving from B to A as you have of moving from A to B.)

There are more complicated schemes for proposing new locations and the rules for accepting them, but the basic idea is still: (1) pick a new "proposed" location; (2) figure out how much higher or lower that location is compared to your current location; (3) probabilistically stay put or move to that location in a way that respects the overall goal of spending time proportional to height of the location.
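The simplest variant described above can be made concrete in a few lines. In this toy illustration (my own, not from the original answer), the "surface" is just three locations whose relative heights are 1, 2 and 4:

```python
import random

# Independence-chain Metropolis-Hastings on a three-location surface.
# Only the *relative* heights matter.
random.seed(0)
heights = [1.0, 2.0, 4.0]

current = 0
visits = [0, 0, 0]
for _ in range(100_000):
    proposed = random.randrange(3)   # propose uniformly across the surface
    # Move if higher (ratio > 1 always accepts); if lower, move with
    # probability equal to the ratio of heights.
    if random.random() < heights[proposed] / heights[current]:
        current = proposed
    visits[current] += 1

# Time spent at each location is proportional to its height, i.e.
# the proportions approach 1/7, 2/7 and 4/7.
total = sum(visits)
print([round(v / total, 2) for v in visits])
```

The list of visited locations is the output: its empirical frequencies reproduce the (normalized) heights without our ever normalizing them ourselves.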

What is this useful for? Suppose we have a probabilistic model of the weather that allows us to evaluate A*P(weather), where A is an unknown constant. (This often happens--many models are convenient to formulate in a way such that you can't determine what A is). So we can't exactly evaluate P("rain tomorrow"). However, we can run the MCMC sampler for a while and then ask: what fraction of the samples (or "locations") ended up in the "rain tomorrow" state. That fraction will be the (model-based) probabilistic weather forecast.

4+1. In my opinion, the 'wandering about' is the most intuitive analogy among the ones listed on this page. – Berkmeister – 2013-10-02T13:04:12.910

+1. Nice layman's explanation of Metropolis-Hastings! – raegtin – 2011-07-16T05:04:30.000

33

Imagine you want to find a better strategy to beat your friends at the board game Monopoly. Simplify the stuff that matters in the game to the question: which properties do people land on most? The answer depends on the structure of the board, the rules of the game and the throws of two dice.

One way to answer the question is this. Just follow a single piece around the board as you throw the dice and follow the rules. Count how many times you land on each property (or program a computer to do the job for you). Eventually, if you have enough patience or you have programmed the rules well enough in your computer, you will build up a good picture of which properties get the most business. This should help you win more often.

What you have done is a Markov Chain Monte Carlo (MCMC) analysis. The board defines the rules. Where you land next only depends on where you are now, not where you have been before and the specific probabilities are determined by the distribution of throws of two dice. MCMC is the application of this idea to mathematical or physical systems like what tomorrow's weather will be or where a pollen grain being randomly buffeted by gas molecules will end up.
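The piece-following experiment can be sketched in a few lines (a deliberately stripped-down board with no jail, cards or other rules, which is an assumption of this sketch):

```python
import random

# Follow one piece around a 40-space circular board, moving by the
# throw of two dice, and count landings on each space.
random.seed(0)
counts = [0] * 40

position = 0
for _ in range(200_000):
    position = (position + random.randint(1, 6) + random.randint(1, 6)) % 40
    counts[position] += 1

# Where you land next depends only on where you are now: a Markov chain.
# On this rule-free board every space ends up roughly equally popular
# (about 1/40 each); it is the real game's rules (Go To Jail, the cards)
# that skew the distribution toward certain properties.
print(max(counts) / sum(counts))
```

Adding the actual Monopoly rules to the loop changes the transition probabilities, and the same counting procedure then reveals which properties get the most business.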

1The explanation sounds like simple Monte Carlo Simulation, but what about Markov Chain ? How Markov Chain is related to this Monte Carlo Simulation ? – Emran Hussain – 2015-11-15T04:50:29.317

@Emran Graham Cookson's answer seems to explain a connection between Monopoly and Markov chains already. – Glen_b – 2016-02-10T21:37:25.780

Monopoly can be modelled as a Markov chain where each property/space is a node/state. When you're on any particular space, you've got various probabilities of moving to the next 12 spaces (if using 2 dice) - these are the edges/connections in the Markov chain. It's easy to work out the probability of each edge/connection: http://gwydir.demon.co.uk/jo/probability/calcdice.htm#sum

– JoeRocc – 2016-11-01T06:21:03.263

29

OK here's my best attempt at an informal and crude explanation.

A Markov Chain is a random process that has the property that the future depends only on the current state of the process and not the past i.e. it is memoryless. An example of a random process could be the stock exchange. An example of a Markov Chain would be a board game like Monopoly or Snakes and Ladders where your future position (after rolling the die) would depend only on where you started from before the roll, not any of your previous positions. A textbook example of a Markov Chain is the "drunkard's walk". Imagine somebody who is drunk and can move only left or right by one pace. The drunk moves left or right with equal probability. This is a Markov Chain where the drunk's future/next position depends only upon where he is at present.
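The drunkard's walk can be simulated in a couple of lines (my own sketch), and the memoryless property is visible in the code: the next position is computed from the current position alone:

```python
import random

# The drunkard's walk: one pace left or right with equal probability.
random.seed(0)

position = 0
path = [position]
for _ in range(20):
    position += random.choice([-1, 1])  # depends only on `position`, not the past
    path.append(position)

print(path)
```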

Monte Carlo methods are computational algorithms (simply sets of instructions) which randomly sample from some process under study. They are a way of estimating something which is too difficult or time-consuming to find deterministically. They're basically a form of computer simulation of some mathematical or physical process. The Monte Carlo moniker comes from the analogy between a casino and random number generation. Returning to our board game example earlier, perhaps we want to know whether some properties on the Monopoly board are visited more often than others. A Monte Carlo experiment would involve rolling the dice repeatedly and counting the number of times you land on each property. Monte Carlo methods can also be used for calculating numerical integrals. (Very informally, we can think of an integral as the area under the graph of some function.) Monte Carlo integration works well on high-dimensional functions by taking a random sample of points of the function and calculating some type of average at these various points. By increasing the sample size, the law of large numbers tells us we can increase the accuracy of our approximation by covering more and more of the function.
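The integration idea can be illustrated with a one-dimensional toy (my own example): estimating $\int_0^1 x^2\,dx = 1/3$ by averaging the function at uniformly random points.

```python
import random

# Monte Carlo integration: the average of f(x) = x^2 at uniform random
# points in [0, 1] estimates the integral (the area under the curve).
random.seed(0)

n = 100_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
print(round(estimate, 2))  # close to 1/3
```

The same recipe works unchanged in many dimensions, which is where it beats grid-based numerical integration.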

These two concepts can be put together to solve some difficult problems in areas such as Bayesian inference and computational biology, where multi-dimensional integrals need to be calculated to solve common problems. The idea is to construct a Markov Chain which converges to the desired probability distribution after a number of steps. The state of the chain after a large number of steps is then used as a sample from the desired distribution, and the process is repeated. There are many different MCMC algorithms which use different techniques for generating the Markov Chain. Common ones include Metropolis-Hastings and the Gibbs Sampler.

1A good explanation indeed. Only one confusion is not cleared. As you said, "the idea is to construct a Markov Chain which converges to the desired Probability Distribution." Sounds like we already know the steady-state probability distribution for the states, so why would we need to construct a Markov chain? The Markov chain's purpose is to provide us with the steady-state distribution, which we already have in the first place, isn't it? Unless you meant that getting a Markov chain is still necessary to compute the $(n+1)$-th state's probability based on the current state. – Emran Hussain – 2015-11-15T06:17:55.617

14

Excerpt from Bayesian Methods for Hackers

When we set up a Bayesian inference problem with $N$ unknowns, we are implicitly creating an $N$-dimensional space for the prior distributions to exist in. Associated with the space is an additional dimension, which we can describe as the *surface*, or *curve*, of the space, that reflects the *prior probability* of a particular point. The surface of the space is defined by our prior distributions. For example, if we have two unknowns $p_1$ and $p_2$, and both are uniform on $[0,5]$, the space created is a square of side length 5 and the surface is a flat plane that sits on top of the square (representing that every point is equally likely).

Alternatively, if the two priors are $\text{Exp}(3)$ and $\text{Exp}(10)$, then the space is all positive numbers on the 2-D plane, and the surface induced by the priors looks like a waterfall that starts at the point (0,0) and flows over the positive numbers.

The visualization below demonstrates this. The more dark red the color, the more prior probability that the unknowns are at that location. Conversely, areas with darker blue represent that our priors assign very low probability to the unknowns being there.

These are simple examples in 2D space, where our brains can understand surfaces well. In practice, spaces and surfaces generated by our priors can be much higher dimensional.

If these surfaces describe our *prior distributions* on the unknowns, what happens to our space after we have observed data $X$? The data $X$ does not change the space, but it changes the surface of the space by *pulling and stretching the fabric of the surface* to reflect where the true parameters likely live. More data means more pulling and stretching, and our original shape becomes mangled or insignificant compared to the newly formed shape. Less data, and our original shape is more present. Regardless, the resulting surface describes the *posterior distribution*. Again I must stress that it is, unfortunately, impossible to visualize this in larger dimensions. For two dimensions, the data essentially *pushes up* the original surface to make *tall mountains*. The amount of *pushing up* is resisted by the prior probability, so that less prior probability means more resistance. Thus in the double exponential-prior case above, a mountain (or multiple mountains) that might erupt near the (0,0) corner would be much higher than mountains that erupt closer to (5,5), since there is more resistance near (5,5). The mountain, or perhaps more generally, the mountain ranges, reflect the posterior probability of where the true parameters are likely to be found.

Suppose the priors mentioned above represent different parameters $\lambda$ of two Poisson distributions. We observe a few data points and visualize the new landscape.

The plot on the left is the deformed landscape with the $\text{Uniform}(0,5)$ priors, and the plot on the right is the deformed landscape with the exponential priors. The posterior landscapes look different from one another. The exponential-prior landscape puts very little posterior weight on values in the upper right corner: this is because *the prior does not put much weight there*, whereas the uniform-prior landscape is happy to put posterior weight there. Also, the highest point, corresponding to the darkest red, is biased towards (0,0) in the exponential case, which results from the exponential prior putting more prior weight in the (0,0) corner.

The black dot represents the true parameters. Even with 1 sample point, as was simulated above, the mountains attempt to contain the true parameter. Of course, inference with a sample size of 1 is incredibly naive, and choosing such a small sample size was only illustrative.

We should explore the deformed posterior space generated by our prior surface and observed data to find the posterior mountain ranges. However, we cannot naively search the space: any computer scientist will tell you that traversing $N$-dimensional space is exponentially difficult in $N$: the size of the space quickly blows up as we increase $N$ (see the curse of dimensionality). What hope do we have to find these hidden mountains? The idea behind MCMC is to perform an intelligent search of the space. To say "search" implies we are looking for a particular object, which is perhaps not an accurate description of what MCMC is doing. Recall: MCMC returns *samples* from the posterior distribution, not the distribution itself. Stretching our mountainous analogy to its limit, MCMC performs a task similar to repeatedly asking "How likely is this pebble I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted pebbles in hopes of reconstructing the original mountain. In MCMC and PyMC lingo, the returned sequence of "pebbles" are the samples, more often called the *traces*.

When I say MCMC intelligently searches, I mean MCMC will *hopefully* converge towards the areas of high posterior probability. MCMC does this by exploring nearby positions and moving into areas with higher probability. Again, perhaps "converge" is not an accurate term to describe MCMC's progression. Converging usually implies moving towards a point in space, but MCMC moves towards a *broader area* in the space and randomly walks in that area, picking up samples from that area.

At first, returning thousands of samples to the user might sound like an inefficient way to describe the posterior distributions. I would argue that this is extremely efficient. Consider the alternative possibilities:

- Returning a mathematical formula for the "mountain ranges" would involve describing a N-dimensional surface with arbitrary peaks and valleys.
- Returning the "peak" of the landscape, while mathematically possible and a sensible thing to do as the highest point corresponds to the most probable estimate of the unknowns, ignores the shape of the landscape, which we have previously argued is very important in determining posterior confidence in unknowns.

Besides computational reasons, likely the strongest reason for returning samples is that we can easily use *The Law of Large Numbers* to solve otherwise intractable problems. I postpone this discussion for the next chapter.

There is a large family of algorithms that perform MCMC. At their simplest, most algorithms can be expressed at a high level as follows:

```
1. Start at the current position.
2. Propose moving to a new position (investigate a pebble near you).
3. Accept or reject the new position, based on the position's adherence
   to the data and prior distributions (ask if the pebble likely came
   from the mountain).
4. If you accept: move to the new position. If you reject: stay where
   you are. Return to Step 2.
5. After a large number of iterations, return all the positions.
```

This way we move in the general direction towards the regions where the posterior distributions exist, and collect samples sparingly on the journey. Once we reach the posterior distribution, we can easily collect samples as they likely all belong to the posterior distribution.

If the current position of the MCMC algorithm is in an area of extremely low probability, which is often the case when the algorithm begins (typically at a random location in the space), the algorithm will move to positions *that are likely not from the posterior* but better than everything else nearby. Thus the first moves of the algorithm are not reflective of the posterior.

2I understand that the problem was related to specifically MCMC, and not Bayesian inference, but in the context of Bayesian landscapes I find MCMC to be very understandable. – Cam.Davidson.Pilon – 2013-03-06T13:10:14.250

10

So there are plenty of answers here paraphrased from statistics/probability textbooks, Wikipedia, etc. I believe we have "laypersons" where I work; I think they are in the marketing department. If I ever have to explain anything technical to them, I apply the rule "show don't tell." With that rule in mind, I would probably show them something like this.

The idea here is to try to code an algorithm that I can teach to spell--not by learning all of the hundreds (thousands?) of rules like *When adding an ending to a word that ends with a silent e, drop the final e if the ending begins with a vowel*. One reason that won't work is that I don't know those rules (I'm not even sure the one I just recited is correct). Instead I am going to teach it to spell by showing it a bunch of correctly spelled words and letting it extract the rules from those words, which is more or less the essence of machine learning, regardless of the algorithm--pattern extraction and pattern recognition.

The success criterion is correctly spelling a word the algorithm has never seen before (I realize that can happen by pure chance, but that won't occur to the marketing guys, so I'll ignore it--plus I am going to have the algorithm attempt to spell not one word but many, so it's not likely we'll be deceived by a few lucky guesses).

An hour or so ago, I downloaded (as a plain text file) from the excellent Project Gutenberg site the Herman Hesse novel *Siddhartha*. I'll use the words in this novel to teach the algorithm how to spell.

So I coded the algorithm below that scanned this novel, three letters at a time (each word has one additional character at the end, which is 'whitespace', or the end of the word). Three-letter sequences can tell you a lot--for instance, the letter 'q' is nearly always followed by 'u'; the sequence 'ty' usually occurs at the end of a word; z rarely does, and so forth. (Note: I could just as easily have fed it entire words in order to train it to speak in complete sentences--exactly the same idea, just a few tweaks to the code.)

None of this involves MCMC, though; that happens after training, when we give the algorithm a few random letters (as a seed) and it begins forming 'words'. How does the algorithm build words? Imagine that it has the block 'qua'; what letter does it add next? During training, the algorithm constructed a massive *letter-sequence frequency matrix* from all of the thousands of words in the novel. Somewhere in that matrix is the three-letter block 'qua' and the frequencies of the characters that could follow that sequence. The algorithm selects a letter based on those frequencies. So the letter that the algorithm selects next depends on--and solely on--the last three letters in its word-construction queue.

So that's a Markov Chain Monte Carlo algorithm.

I think perhaps the best way to illustrate how it works is to show the results based on different levels of training. Training level is varied by changing the number of passes the algorithm makes through the novel--the more passes through, the greater the fidelity of its letter-sequence frequency matrices. Below are the results--in the form of 100-character strings output by the algorithm--after training on the novel *Siddhartha*.

**A single pass through the novel, Siddhartha:**

then whoicks ger wiff all mothany stand ar you livid theartim mudded sullintionexpraid his sible his

(Straight away, it's learned to speak almost perfect Welsh; I hadn't expected that.)

**After two passes through the novel:**

the ack wor prenskinith show wass an twor seened th notheady theatin land rhatingle was the ov there

**After 10 passes:**

despite but the should pray with ack now have water her dog lever pain feet each not the weak memory

And here's the code (in Python; I'm nearly certain that this could be done in R using an MCMC package, of which there are several, in just 3-4 lines):

```
import re
from random import sample

# raw_str should hold the full training text (the plain-text novel
# downloaded from Project Gutenberg), loaded before this script runs.

def create_words_string(raw_string):
    """ in case I wanted to use training data in sentence/paragraph form;
        this function will parse a raw text string into a nice list of words;
        filtering: keep only words having more than 3 letters and remove
        punctuation, etc.
    """
    pattern = r'\b[A-Za-z]{3,}\b'
    pat_obj = re.compile(pattern)
    words = [word.lower() for word in pat_obj.findall(raw_string)]
    # drop Roman-numeral chapter headings like 'xii'
    pattern = r'\b[vixlm]+\b'
    pat_obj = re.compile(pattern)
    return " ".join([word for word in words if not pat_obj.search(word)])

def create_markov_dict(words_string):
    # map each three-character window to the list of characters observed
    # immediately after it (duplicates in the list encode frequency)
    l1, l2, l3 = " ", " ", " "
    dx = {}
    for ch in words_string:
        dx.setdefault((l1, l2, l3), []).append(ch)
        l1, l2, l3 = l2, l3, ch
    return dx

def generate_newtext(markov_dict):
    # start from the all-whitespace state and repeatedly sample the next
    # character given only the last three characters generated
    simulated_text = ""
    l1, l2, l3 = " ", " ", " "
    for c in range(100):
        next_letter = sample(markov_dict[(l1, l2, l3)], 1)[0]
        simulated_text += next_letter
        l1, l2, l3 = l2, l3, next_letter
    return simulated_text

if __name__ == "__main__":
    # n = number of passes through the training text
    n = 1
    q1 = create_words_string(n * raw_str)
    q2 = create_markov_dict(q1)
    q3 = generate_newtext(q2)
    print(q3)
```

12You've built a Markov model for spelling in English and fit it to data. But sampling from the fitted model is *not* MCMC. (What's the "desired distribution" it samples from? Clearly, not the distribution over "properly spelled words in English", since the model still makes mistakes after training). I don't mean to criticize the exercise; it's a nice demonstration of a Markov chain model for language. But the key idea of MCMC is to *design* a Markov Chain so that its equilibrium distribution corresponds to some distribution you have in mind, and it's not obvious that this achieves that. – jpillow – 2011-07-09T05:36:34.640

2

MCMC is typically used as an alternative to crude Monte Carlo simulation techniques. Both MCMC and other Monte Carlo techniques are used to evaluate difficult integrals but MCMC can be used more generally.

For example, a common problem in statistics is to calculate the mean outcome relating to some probabilistic/stochastic model. Both MCMC and Monte Carlo techniques would solve this problem by generating a sequence of simulated outcomes that we could use to estimate the true mean.

Both MCMC and crude Monte Carlo techniques work because the long-run proportion of simulations that are equal to a given outcome will be equal* to the modelled probability of that outcome. Therefore, by generating enough simulations, the results produced by both methods will be accurate.

***I say equal although in general I should talk about measurable sets. A layperson, however, probably wouldn't be interested in this***

However, while crude Monte Carlo involves producing many independent simulations, each of which is distributed according to the modelled distribution, MCMC involves generating a random walk that in the long-run "visits" each outcome with the desired frequency.

The trick to MCMC, therefore, is picking a random walk that will "visit" each outcome with the desired long-run frequencies.

A simple example might be to simulate from a model that says the probability of outcome "A" is 0.5 and of outcome "B" is 0.5. In this case, if I started the random walk at position "A" and prescribed that at each step it switched to the other position with probability 0.2 (or any other probability greater than 0), I could be sure that after a large number of steps the random walk would have visited each of "A" and "B" in roughly 50% of steps--consistent with the probabilities prescribed by our model.
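This two-outcome walk is easy to simulate (a sketch of my own, matching the description above):

```python
import random

# Two-state random walk: start at "A", and at each step switch to the
# other outcome with probability 0.2.
random.seed(0)

state = "A"
visits = {"A": 0, "B": 0}
for _ in range(100_000):
    if random.random() < 0.2:
        state = "B" if state == "A" else "A"
    visits[state] += 1

# In the long run the walk spends about 50% of its steps in each state,
# matching the modelled probabilities.
print(visits["A"] / 100_000)
```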

This is obviously a very boring example. However, it turns out that MCMC is often applicable in situations in which it is difficult to apply standard Monte Carlo or other techniques.

You can find an article that covers the basics of what it is and why it works here:

We are trying to build a permanent repository of high-quality statistical information in the form of questions & answers. We try to avoid link-only answers which are subject to link-rot; as such this is more of a comment than an answer in its own right. If you're able, could you expand it, perhaps by giving a summary of the information at the link (or we could convert it into a comment for you).

– Glen_b – 2016-08-17T08:48:08.693

1

I'm a DNA analyst who uses fully continuous probabilistic genotyping software to interpret DNA evidence, and I have to explain how this works to a jury. Admittedly, we oversimplify, and I realize some of this oversimplification sacrifices the accuracy of specific details in the name of improving overall understanding. But, within the context of a jury understanding how this process is used in DNA interpretation without academic degrees and years of professional experience, they get the gist :)

Background: The software uses Metropolis-Hastings MCMC and a biological model that mimics the known behavior of DNA profiles (the model is built from validation data generated by the laboratory analyzing many DNA profiles produced under known conditions, representing the range encountered in unknown casework). There are 8 independent chains, and we evaluate the convergence to determine whether to re-run with an increased burn-in and post accepts (default burn-in 100k accepts and post 400k accepts).

When asked by prosecution/defense about MCMC: we explain it stands for Markov chain Monte Carlo and represents a special class/kind of algorithm used for complex problem-solving, and that an algorithm is just a fancy word referring to a series of procedures or routines carried out by a computer... MCMC algorithms operate by proposing a solution, simulating that solution, then evaluating how well that simulation mirrors the actual evidence data being observed... a simulation that fits the evidence observation well has a higher probability than a simulation that does not fit the observation well... over many repeated samplings/guesses of proposed solutions, the Markov chains move away from the low-probability solutions toward the high-probability solutions that better fit/explain the observed evidence profile, until eventually equilibrium is achieved, meaning the algorithm has limited ability to sample new proposals yielding significantly increased probabilities

When asked about Metropolis-Hastings: we explain it's a refinement to the MCMC algorithm describing its decision-making process for accepting or rejecting a proposal... usually this is explained with an analogy to the "hot/cold" children's game, but I may have considered using "swipe right or left" when the jury is especially young!! :p But using our hot/cold analogy, we always accept a hot guess and will occasionally accept a cold guess a fraction of the time, and we explain that the purpose of sometimes accepting the cold guess is to ensure the chains sample a wider range of possibilities as opposed to getting stuck around one particular proposal before actual equilibrium

Edited to add/clarify: with the hot/cold analogy we explain that in the children's game, the leader picks a target object/area within the room and the players take turns guessing which direction to move relative to their current position. The leader tells them to make the move if it's a hot guess, and they lose their turn/stay in position if it's a cold guess. Similarly, within our software, the decision to move/accept depends only on the probability of the proposal compared to the probability of the currently held position... HOWEVER, the target is pre-defined/known by the leader in the children's game, whereas the target within our software isn't pre-defined--it's completely unknown (which is also why it's more important for our application to adequately sample the space and occasionally accept cold guesses)
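The propose/accept logic described in this answer can be sketched in a few lines. This toy stands in a hypothetical one-dimensional "fit score" (a simple bell curve) for the biological model, which is of course far more complex; only the Metropolis accept/reject rule ("always accept hot, sometimes accept cold") is the point here:

```python
import math
import random

random.seed(1)

# Hypothetical unnormalized "fit score": how well a proposed solution x
# explains the evidence. A standard bell curve stands in for the real
# (much more complex) biological model.
def target(x):
    return math.exp(-0.5 * x * x)

x = 5.0            # deliberately poor starting guess
burn_in = 2_000    # discard these early steps
keep = 10_000      # post-burn-in samples
samples = []

for step in range(burn_in + keep):
    proposal = x + random.gauss(0.0, 1.0)  # propose a nearby solution
    # "Hot" guess (better fit): ratio >= 1, so always accepted.
    # "Cold" guess (worse fit): accepted only a fraction of the time,
    # proportional to how much worse it fits.
    if random.random() < target(proposal) / target(x):
        x = proposal
    if step >= burn_in:
        samples.append(x)

mean = sum(samples) / len(samples)
print(mean)  # near 0, the center of the target
```

Note how the chain, started far from the high-probability region at x = 5, wanders toward the solutions that fit best and then stays there, which is the "equilibrium" behavior described above.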

Like I said, super super basic and absolutely lacking technical detail for sake of improving comprehension--we strive for explaining at about a middle-school level of education. Feel free to make suggestions. I'll incorporate them.

0

This question is broad, yet the answers are often quite casual. Alternatively, you can see this paper, which gives a concise mathematical description of a broad class of MCMC algorithms, including Metropolis-Hastings algorithms, Gibbs sampling, Metropolis-within-Gibbs and auxiliary variable methods, slice sampling, recursive proposals, directional sampling, Langevin and Hamiltonian Monte Carlo, NUTS sampling, pseudo-marginal Metropolis-Hastings algorithms, and pseudo-marginal Hamiltonian Monte Carlo.

A credible review is given here.

I'll find more time to elaborate on its content in Stack Exchange format.

-1

First, we should explain Monte Carlo sampling to the layperson. Imagine you don't have the exact form of a function (for example, f(x,y) = z = x^2 + 2*x*y), but there is a machine in Europe (and Los Alamos) that replicates this function numerically; we can feed it as many (x, y) pairs as we like, and it will give us the value z. This numerical repetition is sampling, and this process is a Monte Carlo simulation of f(x,y). After 10,000 evaluations we almost know what f(x,y) is.
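A minimal sketch of this "machine" idea in code (sampling inputs from the unit square is my own choice, not part of the original answer):

```python
import random

random.seed(42)

# The "machine": we can evaluate f at any (x, y), even if we pretend
# not to know its closed form.
def f(x, y):
    return x**2 + 2 * x * y

# Monte Carlo sampling: feed the machine many random inputs and
# average the outputs, here over the unit square [0, 1] x [0, 1].
n = 10_000
total = 0.0
for _ in range(n):
    total += f(random.random(), random.random())

print(total / n)  # estimates the average of f over the square: 5/6
```

The exact average over the unit square is 1/3 + 1/2 = 5/6, and the estimate approaches it as n grows, even though the sampler never "looks inside" f.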

Assuming the layperson knows Monte Carlo: in MCMC you don't want to waste your CPU effort/time when sampling from a multi-dimensional space (f(x,y,z,t,s,...,zzz)), whereas normal Monte Carlo does. The key difference is that in MCMC you need a Markov chain as a map to guide your efforts.

This video has a very good statement of intuition. Starts at 5:50 here: https://youtu.be/12eZWG0Z5gY?t=265

Imagine you want to sample points that lie on the green (multi-dimensional) branches in this picture. If you throw points all over the black super-space and check their values, you are WASTING a lot of sampling (searching) energy. So it makes more sense to steer your sampling strategy (which can be automated) to pick points closer to the green branches (where it matters). The green branches can be found by being hit once, accidentally (or deliberately), and the rest of the sampling effort (the red points) will be generated afterward. The reason the red points get attracted to the green line is the Markov chain transition matrix, which works as your sampling engine.

So in layman's terms, MCMC is an energy-saving (low-cost) sampling method, especially when working in a massive and dark (multi-dimensional) space.

1

I think we have a different definition of "layman" – Neil McGuigan – 2017-04-30T00:13:55.597

hahaha. I can add the Monte-Carlo for a "layperson" too, but sampling/Monte-Carlo was not a question. – Amir – 2017-04-30T01:52:24.637

2

The link doesn't work: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.7133&rep=rep1&type=pdf

– Thylacoleo – 2010-08-07T09:47:53.430

Did some Google-hunting.. I think @mbq was linking to this: http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf

– Jon Gauthier – 2014-12-24T00:29:09.193

11

Here is my favorite paper about the topic: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.7133&rep=rep1&type=pdf

– mbq – 2010-07-21T13:28:05.280

3

To understand the MCMC algorithm you have to understand what it is really used for and how it converges to a given distribution. I blogged about the whole intuition and applications for it. You can visit them here: http://mlwhiz.com/blog/2015/08/19/MCMC_Algorithms_Beta_Distribution/ http://mlwhiz.com/blog/2015/08/21/MCMC_Algorithms_Cryptography/

– Rahul Agarwal – 2015-08-22T02:39:49.877

On this site you'll find a nice graphical representation of MCMC and some useful links. – Sergey – 2011-07-04T14:40:15.880

Hi @Neil. As you know, this thread has community wiki status, which I find inappropriate given that we have several other very successful threads about explaining smth to a layperson that are not CW (so stripping everybody who participated here from reputation points seems unjustified). I suggested to the mods to remove the CW status from this thread but they replied that it does not seem necessary since the OP (you) never objected to this thread being CW. So I thought I would ask your opinion; what do you think? Were you trying to collect a list of examples, or looking for one best answer? – amoeba – 2016-08-16T23:42:59.993

@amoeba I don't seem to have the ability to change it back from a cw – Neil McGuigan – 2016-08-17T00:17:05.223

@Neil Certainly not, only mods can do it. I was asking your opinion on what you would prefer. Because the lack of your objections to the CW status in 2010 is now used as an argument against removing the CW status. So if you do prefer to have CW removed, please say so. – amoeba – 2016-08-18T18:08:25.217

@amoeba seems like we have enough answers, so feel free to tell the mods I support turning off cw – Neil McGuigan – 2016-08-18T20:03:17.670

It's done. CW status removed from the question and from all the answers. You should get a lot of rep points :-) – amoeba – 2016-08-19T13:28:23.053

Should be community wiki? – Shane – 2010-07-26T20:22:54.280