Your loss function would not work because it incentivizes setting $\theta_1$ to any finite value and $\theta_0$ to $-\infty$.

Let's call $r(x,y)=\frac{1}{m}\sum_{i=1}^m \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$ the *mean residual* for $h$.

Your goal is to make $r$ **as close to zero** as possible, **not just minimize it**. A large negative value is just as bad as a large positive value.

**EDIT:** You can counter this by artificially limiting the parameter space $\mathbf{\Theta}$ (e.g. you want $|\theta_0| < 10$). In this case, the optimal parameters would lie on certain points on the boundary of the parameter space. See https://math.stackexchange.com/q/896388/12467. This is not what you want.
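To see the divergence concretely, here is a small sketch (the toy data and the helper `mean_residual` are made up for illustration): with $\theta_1$ held fixed, the mean residual has no minimum and just tracks $\theta_0$ downward.

```python
import numpy as np

# Toy data (made up for illustration): y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

def mean_residual(theta0, theta1):
    """r = (1/m) * sum_i (h(x_i) - y_i) for h(x) = theta0 + theta1 * x."""
    return np.mean(theta0 + theta1 * x - y)

# With theta1 fixed at the "true" slope, the mean residual decreases
# without bound as theta0 decreases: here r = theta0 - 1.
for theta0 in [0.0, -10.0, -1000.0]:
    print(theta0, mean_residual(theta0, 2.0))
```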

## Why do we use the square loss

The squared error forces $h(x)$ and $y$ to match. Writing $u = h(x)$ and $v = y$, the loss $(u-v)^2$ is minimized at $u=v$, if possible, and is always $\ge 0$, because it's the square of the real number $u-v$.

$|u-v|$ would also work for the above purpose, as would $(u-v)^{2n}$, with $n$ some positive integer. The first of these is actually used (it's called the $\ell_1$ loss; you might also come across the $\ell_2$ loss, which is another name for squared error).
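A quick sanity check of these three candidates (illustrative helper names): each is nonnegative and vanishes exactly when $u = v$.

```python
# Illustrative helper names; each candidate loss is >= 0 and
# is zero exactly when u == v.
def squared(u, v):  return (u - v) ** 2   # the l2 / squared error
def absolute(u, v): return abs(u - v)     # the l1 loss
def quartic(u, v):  return (u - v) ** 4   # (u - v)^(2n) with n = 2

for loss in (squared, absolute, quartic):
    assert loss(3.0, 3.0) == 0
    assert loss(3.0, 5.0) > 0
    assert loss(-1.0, 2.0) > 0
```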

So, why is the squared loss better than these? This is a *deep* question related to the link between *Frequentist* and *Bayesian* inference. In short, the squared error relates to **Gaussian Noise**.

If your data does not fit all points exactly, i.e. $h(x)-y$ is not zero for some point no matter which $\theta$ you choose (as will always happen in practice), that might be because of **noise**. In any complex system there will be many small **independent** causes for the difference between your *model* $h$ and *reality* $y$: measurement error, environmental factors, etc. By the **Central Limit Theorem** (CLT), the total noise will be distributed approximately *Normally*, i.e. according to the **Gaussian distribution**. We want to pick the best fit $\theta$ taking this noise distribution into account. Assume $R = h(X)-Y$, the part of $\mathbf{y}$ that your model cannot explain, follows the Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$. We're using capitals because we're talking about random variables now.
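This link can be made precise by the standard maximum-likelihood calculation, sketched here under the assumption that the systematic error has already been removed, so $R \sim \mathcal{N}(0,\sigma^2)$:

```latex
\log \prod_{i=1}^m p\left(y^{(i)} \mid x^{(i)}; \theta\right)
  = \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\left( -\frac{\left( y^{(i)} - h_\theta(x^{(i)}) \right)^2}{2\sigma^2} \right)
  = -\frac{m}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^m \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2.
```

For a fixed $\sigma$, maximizing this over $\theta$ is exactly minimizing the sum of squared errors.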

The Gaussian distribution has two parameters: the mean $\mu = \mathbb{E}[R] = \frac{1}{m} \sum_i \left( h_\theta(X^{(i)}) - Y^{(i)} \right)$ and the variance $\sigma^2 = \mathbb{E}\left[(R-\mu)^2\right] = \frac{1}{m} \sum_i \left( h_\theta(X^{(i)}) - Y^{(i)} - \mu \right)^2$.

Consider $\mu$: it is the *systematic error* of our measurements. Use $h'(x) = h(x) - \mu$ to correct for the systematic error, so that $\mu' = \mathbb{E}[R'] = 0$ (exercise for the reader). Nothing else to do here.

$\sigma$ represents the *random error*, also called *noise*. Once we've taken care of the systematic component as in the previous point, $\mu = 0$ and the best predictor is obtained when $\sigma^2 = \frac{1}{m} \sum_i \left( h_\theta(X^{(i)}) - Y^{(i)} \right)^2$ is minimized. Put another way, the best predictor is the one with the tightest distribution around the predicted value, i.e. the smallest variance. **Minimizing the squared loss is the same thing as minimizing the variance!** That explains why the squared loss works for a wide range of problems. The underlying noise is very often Gaussian, because of the CLT, and minimizing the squared error turns out to be the *right* thing to do!

To simultaneously take both the mean and variance into account, we include a *bias* term in our classifier (to handle systematic error $\mu$), then minimize the square loss.
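For linear regression this is just adding an intercept column. A minimal NumPy sketch (the data values are hypothetical):

```python
import numpy as np

# Hypothetical noisy data, roughly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# The column of ones is the bias term: it absorbs the systematic error mu,
# and lstsq then minimizes the squared loss over [theta0, theta1].
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [1.09, 1.94]
```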

Followup questions:

**Least squares loss = Gaussian error. Does every other loss function also correspond to some noise distribution?** Yes. For example, the $\ell_1$ loss (minimizing absolute value instead of squared error) corresponds to the **Laplace distribution**: its PDF is just the Gaussian's with $|x-\mu|$ in place of $(x-\mu)^2$. A popular loss for probability distributions is the KL-divergence.
- The Gaussian distribution is very well motivated by the **Central Limit Theorem**, which we discussed earlier. When is the Laplace distribution the right noise model? It does arise naturally in some circumstances, but it's more commonly used as a regularizer to enforce **sparsity**: the $\ell_1$ loss is the *least convex* among all convex losses.

- As Jan mentions in the comments, the minimizer of **squared** deviations is the **mean**, and the minimizer of **absolute** deviations is the **median**. Why would we want to find the median of the residuals instead of the mean? Unlike the mean, the median isn't thrown off by a single very large outlier, so the $\ell_1$ loss is used for increased robustness. Sometimes a combination of the two is used.
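The robustness difference is easy to demonstrate (made-up residual values):

```python
import numpy as np

# Residuals with one large outlier (made-up numbers).
r = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

# The mean (minimizer of squared deviations) is dragged toward the outlier;
# the median (minimizer of absolute deviations) barely notices it.
print(np.mean(r))    # about 20.8, pulled up by the outlier
print(np.median(r))  # 1.0, unaffected
```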

**Are there situations where we minimize both the Mean and Variance?** Yes. Look up Bias-Variance Trade-off. Here, we are looking at a set of classifiers $h_\theta \in H$ and asking which among them is best. If we ask which *set* of classifiers is the best for a problem, minimizing both the bias and variance becomes important. It turns out that there is always a trade-off between them and we use **regularization** to achieve a compromise.

## Regarding the $\frac{1}{2}$ term

The $\frac{1}{2}$ does not matter, and actually neither does the $m$: they're both constants. The optimal value of $\theta$ is the same in both cases.

The expression for the gradient becomes prettier with the $\frac{1}{2}$, because the 2 from the square term cancels out.
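Concretely, with the common $\frac{1}{2m}$ scaling, $J(\theta) = \frac{1}{2m}\sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ and a linear $h_\theta$, the chain rule gives

```latex
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{2m} \sum_{i=1}^m 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}.
```

The 2 produced by differentiating the square cancels the $\frac{1}{2}$, leaving a clean average.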

- When writing code or algorithms, we're usually concerned more with the gradient, so it helps to keep it concise. You can check progress just by checking the norm of the gradient. The loss function itself is sometimes omitted from code because it is used only for validation of the final answer.

The $m$ is useful if you solve this problem with gradient descent. Then your gradient becomes the average of $m$ terms instead of a sum, so its scale does not change when you add more data points.

- I've run into this problem before: I test code with a small number of points and it works fine, but when I test it with the entire dataset there is loss of precision and sometimes over/under-flows, i.e. the gradient becomes `nan` or `inf`. To avoid that, just normalize w.r.t. the number of data points.
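The scale effect is easy to see numerically (a sketch with synthetic data; `grads` is a made-up helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def grads(m):
    """Summed vs averaged gradient of (1/2)(theta*x - y)^2 at theta = 0,
    for synthetic data y = 3x + noise."""
    x = rng.normal(size=m)
    y = 3 * x + rng.normal(scale=0.1, size=m)
    theta = 0.0
    g = (theta * x - y) * x   # per-point gradient terms
    return g.sum(), g.mean()

# The sum grows roughly linearly in m; the average stays near -3.
for m in [10, 1000, 100000]:
    total, avg = grads(m)
    print(m, total, avg)
```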

These aesthetic decisions maintain consistency with future equations where you'll add **regularization** terms. If you include the $m$, the regularization parameter $\lambda$ will not depend on the dataset size $m$, and it will be more interpretable across problems.
