## Why do cost functions use the square error?


I'm just getting started with some machine learning, and until now I have been dealing with linear regression over one variable.

I have learnt that there is a hypothesis, which is:

$h_\theta(x)=\theta_0+\theta_1x$

To find good values for the parameters $\theta_0$ and $\theta_1$ we want to minimize the difference between the calculated result and the actual result on our training data. So we subtract

$h_\theta(x^{(i)})-y^{(i)}$

for all $i$ from $1$ to $m$. We then sum these differences and take the average by multiplying the sum by $\frac{1}{m}$. So far, so good. This would result in:

$\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)$

But this is not what has been suggested. Instead the course suggests squaring the difference and multiplying by $\frac{1}{2m}$. So the formula is:

$\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

Why is that? Why do we use the square function here, and why do we multiply by $\frac{1}{2m}$ instead of $\frac{1}{m}$?

1

Related question at stats.stackexchange.com

user1205197 2017-01-01T04:17:31.710

Also take a look at Chris McCormick's explanation on https://goo.gl/VNiUR5

vimdude 2017-08-15T23:12:28.393

22

Your loss function here would not work because it is unbounded below: it incentivizes driving the parameters toward $-\infty$ (for example $\theta_0$, and also $\theta_1$ when the inputs are positive).

Let's call $r(x,y)=h_\theta(x) -y$ the residual (as is often done). You're asking to make it as small as possible, i.e. to make $h_\theta$ as negative as possible, since $y$, the ground truth, is just a constant. The squared error sidesteps this issue because it forces $h_\theta(x)$ and $y$ to match: $(u-v)^2$ is minimized when $u=v$, if possible, and is always $>0$ otherwise, because it's a square.
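A minimal numerical sketch of this point, assuming made-up toy data and hypothetical helper names (`mean_signed_error`, `half_mean_squared_error`) that are not part of the course material:

```python
import numpy as np

# Toy data, made up for illustration: y = 1 + 2x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

def mean_signed_error(theta0, theta1):
    """The cost proposed in the question: average of h(x) - y, no square."""
    return np.mean(theta0 + theta1 * x - y)

def half_mean_squared_error(theta0, theta1):
    """The course's cost: 1/(2m) times the sum of squared residuals."""
    return np.mean((theta0 + theta1 * x - y) ** 2) / 2.0

# Keep theta1 at its true value and make theta0 more and more negative.
theta0_values = (1.0, -10.0, -1000.0)
signed_costs = [mean_signed_error(t0, 2.0) for t0 in theta0_values]
squared_costs = [half_mean_squared_error(t0, 2.0) for t0 in theta0_values]

print(signed_costs)   # keeps decreasing: there is no minimum to find
print(squared_costs)  # zero at the true parameters, larger elsewhere
```

The signed cost rewards ever more negative predictions, while the squared cost is smallest exactly where the predictions match the data.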

One justification for using the squared error is that it relates to Gaussian noise. Suppose the residual, which measures the error, is the sum of many small independent noise terms. By the Central Limit Theorem, it should be approximately normally distributed. So, we want to pick the $\theta$ for which this noise distribution (the things your model cannot explain) has the smallest variance. This corresponds to the least squares loss.

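Let $R$ be a random variable that follows a zero-mean Gaussian distribution $\mathcal{N}(0,\sigma^2)$. Then the variance of $R$ is $E[R^2] = \sigma^2$, so minimizing the average squared residual amounts to minimizing an estimate of the noise variance.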
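The link to least squares can be made a little more explicit with a short derivation. This is a sketch under an added assumption that is not spelled out above: each observation is $y^{(i)} = h_\theta(x^{(i)}) + \varepsilon^{(i)}$, with the $\varepsilon^{(i)}$ independent and distributed as $\mathcal{N}(0,\sigma^2)$ for a fixed $\sigma$. The log-likelihood of the data is then

$\log L(\theta) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

The first term does not depend on $\theta$, and the second is a negative constant times the sum of squared residuals, so

$\arg\max_\theta \log L(\theta) = \arg\min_\theta \sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

Under that noise assumption, the maximum likelihood estimate and the least squares estimate coincide.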

Regarding your second question, the 1/2 does not matter and, actually, the $m$ doesn't matter either :). The optimal value of $\theta$ remains the same in both cases. The 1/2 is put in so that when you take the derivative, the expression is prettier, because it cancels the 2 that comes from the square term. The $m$ is useful if you solve this problem with gradient descent: your gradient is then the sum of $m$ terms divided by $m$, so it is like an average over your points. This keeps the gradient on the same scale for all dataset sizes, so its magnitude won't blow up (or numerically overflow) as you scale up the dataset.
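Here is a small gradient descent sketch for the $\frac{1}{2m}$ cost. The data, the learning rate, and the function names are illustrative assumptions, not something taken from the course. Note how the gradient has no stray factor of 2 and is simply an average over the data points:

```python
import numpy as np

# Toy data, made up for illustration: y = 1 + 2x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
m = len(x)

def cost(theta0, theta1):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    residual = theta0 + theta1 * x - y
    return np.sum(residual ** 2) / (2 * m)

def gradient(theta0, theta1):
    """Partial derivatives of J: the 1/2 has cancelled the 2 from the square."""
    residual = theta0 + theta1 * x - y
    return np.mean(residual), np.mean(residual * x)

theta0, theta1 = 0.0, 0.0
learning_rate = 0.1  # assumed value, small enough for this toy problem
for _ in range(5000):
    g0, g1 = gradient(theta0, theta1)
    theta0 -= learning_rate * g0
    theta1 -= learning_rate * g1

print(theta0, theta1)  # close to the true values 1 and 2
```

Multiplying `cost` by any positive constant would only rescale the gradient (which can be absorbed into the learning rate); the minimizing parameters stay the same.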

It's also used to maintain consistency with future equations where you'll add regularization terms. The slides I linked here don't do it, but you can see that if you do, the regularization parameter $\lambda$ will not depend on the dataset size $m$ so it'll be more interpretable.

You said, "when you take the derivative, the expression is prettier, because the 2 cancels out the 2 from the square term". But why do we want to take its derivative?

DrGeneral 2017-05-25T11:20:53.403

We typically optimize the loss using gradient descent, which requires taking the derivative. I didn't mention this because it should be clear from the context of this question.

Harsh 2017-05-25T15:31:01.483

Harsh, forgive my naivete, but why not use the absolute value instead of the square?

Alexander Suraphel 2017-09-05T16:42:34.170

Absolute error can also work, but in that case you will regress to the median instead of the mean. Take a small list of numbers and see how the loss differs as you shift your estimate (for both squared and absolute error).

Jan van der Vegt 2017-10-26T10:58:11.043
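The experiment suggested in the last comment can be sketched as follows. The list of numbers and the grid of candidate estimates are made up for illustration; a simple grid search stands in for a proper optimizer:

```python
import numpy as np

# A small list with one outlier, made up for illustration.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant estimates c on a grid with step 0.01.
candidates = np.linspace(0.0, 100.0, 10001)

# Total loss of each candidate against all values.
diffs = candidates[:, None] - values[None, :]
squared_loss = np.sum(diffs ** 2, axis=1)
absolute_loss = np.sum(np.abs(diffs), axis=1)

best_squared = candidates[np.argmin(squared_loss)]
best_absolute = candidates[np.argmin(absolute_loss)]

print(best_squared, np.mean(values))     # squared error picks the mean
print(best_absolute, np.median(values))  # absolute error picks the median
```

The outlier drags the squared-error estimate far from most of the points, while the absolute-error estimate stays at the middle value.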

17

The 1/2 coefficient is merely for convenience; it makes the derivative, which is what gradient-based optimizers actually work with, look nicer. The 1/m is more fundamental; it indicates that we are interested in the mean squared error. This allows you to make fair comparisons when changing the sample size, and keeps the value from growing with the sample size (which helps avoid overflow). So-called "stochastic" optimizers use a subset of the data set ($m' < m$) at each step. When you introduce a regularizer (an additive term to the objective function), using the 1/m factor allows you to use the same coefficient for the regularizer regardless of the sample size.
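A quick sketch of the sample-size point, using made-up residuals and hypothetical function names. The same error pattern is repeated to build a small and a large data set; the raw sum grows with $m$, while the $\frac{1}{2m}$ version stays put:

```python
import numpy as np

def sum_squared_error(residuals):
    """Raw sum of squared residuals: grows with the number of samples."""
    return np.sum(residuals ** 2)

def half_mean_squared_error(residuals):
    """1/(2m) * sum of squared residuals: comparable across sample sizes."""
    return np.sum(residuals ** 2) / (2 * len(residuals))

# The same residual pattern repeated, so both sets have the same "quality of fit".
pattern = np.array([1.0, -1.0, 2.0, -2.0])
small = np.tile(pattern, 10)    # m = 40
large = np.tile(pattern, 1000)  # m = 4000

print(sum_squared_error(small), sum_squared_error(large))
print(half_mean_squared_error(small), half_mean_squared_error(large))
```

Because the averaged cost does not scale with $m$, a regularization coefficient tuned against it keeps roughly the same meaning when the data set grows.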

As for the question of why the square and not simply the difference: don't you want underestimates to be penalized similarly to overestimates? Squaring eliminates the effect of the sign of the error. Taking the absolute value (L1 norm) does too, but its derivative is undefined at the origin, so it requires more sophistication to use. The L1 norm has its uses, so keep it in mind, and perhaps ask the teacher if (s)he's going to cover it.

3

In addition to differentiability, $L^2$ is unique among the $L^p$ spaces in that it is a Hilbert space. The fact that the norm arises from an inner product makes a huge amount of machinery available for $L^2$ which is not available for other norms.

Steven Gubkin 2016-02-11T05:59:41.023

4

The error measure in the loss function is a 'statistical distance', in contrast to the popular, introductory understanding of distance between two vectors in Euclidean space. With a 'statistical distance' we are attempting to map the 'dis-similarity' between the estimated model and the optimal model to Euclidean space.

There is no fixed rule regarding the formulation of this 'statistical distance', but if the choice is appropriate then a progressive reduction in this 'distance' during optimization translates to a progressively improving model estimate. Consequently, the choice of 'statistical distance' or error measure is related to the underlying data distribution.

In fact, there are several well-defined distance/error measures for different classes of statistical distributions. It is advisable to select the error measure based on the distribution of the data at hand. It just so happens that the Gaussian distribution is ubiquitous, and consequently its associated distance measure, the L2-norm, is the most popular error measure. However, this is not a rule, and there exist real-world data for which an 'efficient'* optimization implementation would adopt a different error measure than the L2-norm.

Consider the set of Bregman divergences. The canonical example of this family is the squared Euclidean distance (squared error). The family also includes relative entropy (Kullback-Leibler divergence), the generalized squared Euclidean distance (Mahalanobis distance), and the Itakura-Saito distance. You can read more about it in this paper on Functional Bregman Divergence and Bayesian Estimation of Distributions.
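For readers who have not met Bregman divergences, here is a minimal sketch using the standard definition $D_F(p,q) = F(p) - F(q) - \langle\nabla F(q),\, p-q\rangle$ for a convex function $F$. The function names and the example vectors are assumptions made for illustration; they do not come from the linked paper:

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> for a convex F."""
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# F(v) = ||v||^2 generates the squared Euclidean distance.
def squared_norm(v):
    return np.dot(v, v)

def squared_norm_grad(v):
    return 2.0 * v

# F(v) = sum v_i log v_i (negative entropy) generates the KL divergence
# when p and q are probability vectors with strictly positive entries.
def neg_entropy(v):
    return np.sum(v * np.log(v))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])

d_squared = bregman_divergence(squared_norm, squared_norm_grad, p, q)
d_kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)

print(d_squared, np.sum((p - q) ** 2))   # same value: squared error
print(d_kl, np.sum(p * np.log(p / q)))   # same value: KL divergence
```

Swapping the generating function $F$ is all it takes to move from squared error to another member of the family.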

Take-away: The L2-norm has an interesting set of properties which makes it a popular choice for error measure (other answers here have mentioned some of these, sufficient to the scope of this question), and the squared error will be the appropriate choice most of the time. Nevertheless, when the data distribution requires it, there are alternate error measures to choose from, and the choice depends in large part on the formulation of the optimization routine.

*The 'appropriate' error measure would make the loss function convex for the optimization, which is very helpful, as opposed to some other error measure for which the loss function is non-convex and thereby notoriously difficult to optimize.

2

In addition to the key points made by others, using squared error puts a greater emphasis on larger error (what happens to 1/2 when you square it vs 3/2?).
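To answer the parenthetical question with numbers, here is a tiny sketch (variable names are illustrative) comparing how much each of the two errors contributes to the total loss under squared and absolute error:

```python
errors = [0.5, 1.5]

squared = [e ** 2 for e in errors]   # 0.25 and 2.25
absolute = [abs(e) for e in errors]  # 0.5 and 1.5

# Share of the total loss attributed to each error.
squared_share = [s / sum(squared) for s in squared]
absolute_share = [a / sum(absolute) for a in absolute]

print(squared)
print(squared_share)   # the larger error dominates the squared loss
print(absolute_share)  # under absolute error its share is smaller
```

The error that is three times larger accounts for nine times as much of the squared loss.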

Consider an algorithm that pushes the small, fractional errors closer to zero (errors that, if left alone, would likely still give a correct classification or a very small difference between estimate and ground truth) while leaving the large errors as large errors or misclassifications. That is not a desirable characteristic of an algorithm.

Using squared error uses the error as an implied importance weight for adjusting prediction.

1

In your formulation, you try to obtain the mean deviation of your approximation from the observed data. If the mean value of your approximation is close or equal to the mean value of the observed data (something which is desirable and often happens with many approximation schemes), then the result of your formulation would be zero or negligible, because positive errors cancel out negative errors. This might lead to the conclusion that your approximation is wonderful at each observed sample, while that might not be the case. That's why you use the square of the error at each sample and add them up (you turn each error positive). Of course this is only one possible solution, as you could have used the L1-norm (absolute value of the error at each sample) or many others, instead of the L2-norm.
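A short sketch of this cancellation effect, with made-up observations and predictions whose means agree even though two of the three predictions are clearly off:

```python
import numpy as np

# Made-up example: the predictions have the same mean as the observations.
observed = np.array([1.0, 2.0, 3.0])
predicted = np.array([3.0, 2.0, 1.0])

errors = predicted - observed  # [2, 0, -2]

mean_signed_error = np.mean(errors)             # positive and negative cancel
mean_squared_error = np.mean(errors ** 2)       # every error counts
mean_absolute_error = np.mean(np.abs(errors))   # L1 alternative

print(mean_signed_error, mean_squared_error, mean_absolute_error)
```

The signed average reports a perfect fit, while both the squared and the absolute error reveal that the approximation is poor.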