## What does the argmax of the expectation of the log likelihood mean?


What does the following equation mean? What does each part of the formula represent or mean?

$$\theta^* = \underset {\theta}{\arg \max} \Bbb E_{x \sim p_{data}} \log {p_{model}(x|\theta) }$$

Hi Arash. This is a legitimate question. However, next time, please provide some context, that is, describe where you saw this formula. – nbro – 2019-06-14T22:48:20.953

2

This equation, along with more information about it, can be found on the Expectation–Maximization Wikipedia page, where the formula appears in two parts. The explanation from that page is as follows:

> In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
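For reference, the "two parts" of the formula that the quote refers to are the E step and the M step, which can be written as follows, with $X$ the observed data, $Z$ the latent variables, and $\theta^{(t)}$ the current parameter estimate:

$$Q(\theta \mid \theta^{(t)}) = \Bbb E_{Z \sim p(\cdot \mid X, \theta^{(t)})} \left[ \log L(\theta; X, Z) \right]$$

$$\theta^{(t+1)} = \underset{\theta}{\arg \max}\; Q(\theta \mid \theta^{(t)})$$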

Mathematically, $\Bbb E$ in your equation stands for the expected value, and $x \sim p_{data}$ underneath it says the expectation is taken over points $x$ drawn from the data distribution; the subscripts $data$ and $model$ simply name the source of each probability distribution. $p_{model}(x|\theta)$ is the conditional probability the model assigns to $x$ given the parameters $\theta$, and taking its log gives the log-likelihood. Finally, $\underset{\theta}{\arg \max}$ denotes the value of $\theta$ that maximizes the expression, so $\theta^*$ is the parameter setting under which the model assigns, on average, the highest log-likelihood to data drawn from the true data distribution.
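To make that concrete, here is a minimal sketch in Python, assuming (purely for illustration) that $p_{model}$ is a unit-variance Gaussian parameterized by its mean $\theta$, and that the expectation over $p_{data}$ is approximated by an average over a finite sample; the arg max is then found by a simple grid search:

```python
import numpy as np

# Stand-in for p_data: a finite sample from a Gaussian whose true mean
# is 2.0 (made up for illustration); the expectation over p_data then
# becomes an average over this sample.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)

def avg_log_likelihood(theta, x):
    # log p_model(x | theta) for a unit-variance Gaussian with mean theta,
    # averaged over the sample: a Monte Carlo estimate of the expectation.
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

# arg max over theta via a simple grid search.
thetas = np.linspace(-5.0, 5.0, 1001)
scores = [avg_log_likelihood(t, x) for t in thetas]
theta_star = thetas[int(np.argmax(scores))]
print(theta_star)  # ~2.0: the theta under which the sample is most probable
```

The grid search recovers a $\theta^*$ close to the true mean, which for a Gaussian is exactly the maximum-likelihood estimate.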

-1

Am I correct in thinking that you would like an intuitive feel or explanation?

After thinking about this for a little bit, I thought I would approach this problem through the K-means algorithm. As you may know, K-means uses some straightforward calculations: the computer calculates the distance from each point to each center. If you can measure the distance from a point $x_1$ to center 1 and from $x_1$ to center 2, then you are golden, because all you do next is choose the minimum distance, and the center with that distance becomes the group the point belongs to.

So, in our K-means example, let us say a human picks the number of groupings, $k = 2$, meaning the human thinks there are only two types of 'things'. All the items that are closer to center 1 are labeled '1', and the items that are closer to center 2 are labeled '2'.

In the K-means case, the algorithm chooses whichever distance is shortest: it takes the minimum distance and assigns each point to '1' or '2' accordingly, as in the sketch below.
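Here is a minimal sketch of that assignment step, with made-up 1-D points and centers purely for illustration:

```python
import numpy as np

# Hypothetical 1-D points and two cluster centers (numbers made up).
points = np.array([0.5, 1.9, 2.2, 7.8, 8.1])
centers = np.array([2.0, 8.0])

# Distance from every point to every center, shape (n_points, n_centers).
distances = np.abs(points[:, None] - centers[None, :])

# K-means assignment step: each point joins the center at MINIMUM distance.
labels = np.argmin(distances, axis=1)  # 0 means center '1', 1 means center '2'
print(labels)  # [0 0 0 1 1]
```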

Well, what if you don't want to use distance as your measure, but instead you just learned about Gaussian curves and want to label your items as coming from Gaussian curve '1' or '2'? We could do that, right? If I used Gaussian curves as my measuring criterion, I could use probability instead of distance (as we did in K-means).

Now let's say we started to look at our points in terms of the probability of point $x_1$ belonging to center '1', and that we knew the mean and standard deviation of center '1'. We could then use probability as our new measuring stick. So if point $x_1$ is 1 standard deviation away from center '1', its likelihood (the Gaussian density at that point) would be about 0.24 for a standard Gaussian.

Now, what if we also knew the mean and standard deviation of the second center, '2'? We could then calculate the likelihood of our point $x_1$ under center '2'. Let's say that point $x_1$ is 4 standard deviations from center '2'; its likelihood is tiny, roughly 0.0001.

Therefore, I would choose the MAXIMUM likelihoods to assign my points to my centers, NOT the minimum that I used in K-means. A sketch of this likelihood-based assignment follows.
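Here is a minimal sketch of that likelihood-based assignment, again with made-up numbers; each center is now a Gaussian with its own mean and standard deviation:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Density of a Gaussian with the given mean and standard deviation.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Same hypothetical points; each center is now a Gaussian (numbers made up).
points = np.array([0.5, 1.9, 2.2, 7.8, 8.1])
means = np.array([2.0, 8.0])
stds = np.array([1.0, 0.5])

# Likelihood of every point under every Gaussian, shape (n_points, n_gaussians).
likelihoods = gaussian_pdf(points[:, None], means[None, :], stds[None, :])

# Unlike K-means, assign each point to the Gaussian with MAXIMUM likelihood.
labels = np.argmax(likelihoods, axis=1)
print(labels)  # [0 0 0 1 1]
```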

Really, K-means is similar to EM, except that it uses distances instead of likelihoods. Does that help? P.S. Forgive the grammar mistakes, etc.; I am doing this late at night, lol.