## Why square the difference instead of taking the absolute value in standard deviation?

In the definition of standard deviation, why do we have to square the difference from the mean before taking the expectation (E) and then take the square root back at the end? Can't we simply take the absolute value of the difference instead and get the expected value (mean) of those, and wouldn't that also show the variation of the data? The number is going to be different from the squared method (the absolute-value method will give a smaller number), but it should still show the spread of the data. Does anybody know why we take this square approach as a standard?

The definition of standard deviation:

$\sigma = \sqrt{E\left[\left(X - \mu\right)^2\right]}.$

Can't we just take the absolute value instead and still be a good measurement?

$\sigma = E\left[|X - \mu|\right]$
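A quick numerical sketch of the two formulas (hypothetical sample data; E is taken as a simple average over the sample):

```python
import math

# Hypothetical sample; mu is its mean, and E[.] is an average over the sample.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = sum(data) / len(data)  # 5.0

# Squared version: sigma = sqrt(E[(X - mu)^2])
sd = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))   # 2.0

# Absolute-value version: E[|X - mu|], the mean absolute deviation
mad = sum(abs(x - mu) for x in data) / len(data)               # 1.5
```

Both numbers summarise spread, but they differ, and the squared version is never smaller (a consequence of Jensen's inequality).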

5In accepting an answer it seems important to me that we pay attention to whether the answer is circular. The normal distribution is based on these measurements of variance from squared error terms, but that isn't in and of itself a justification for using (X-M)^2 over |X-M|. – russellpierce – 2010-07-20T07:59:54.683

The following article has a pictorial, easy-to-understand explanation: http://www.mathsisfun.com/data/standard-deviation.html Thanks, Rajesh.

– Rajesh – 2013-06-13T13:56:59.363

See http://www.graphpad.com/curvefit/linear_regression.htm See Minimizing sum-of-squares section

– None – 2011-06-12T05:39:06.267

41

Every answer offered so far is circular. They focus on ease of mathematical calculations (which is nice but by no means fundamental) or on properties of the Gaussian (Normal) distribution and OLS. Around 1800 Gauss started with least squares and variance and from those derived the Normal distribution--there's the circularity. A truly fundamental reason that has not been invoked in any answer yet is the unique role played by the variance in the Central Limit Theorem. Another is the importance in decision theory of minimizing quadratic loss.

– whuber – 2013-09-13T15:28:32.647

1+1 @whuber: Thanks for pointing this out, which was bothering me as well. Now, though, have to go and read up on the Central Limit Theorem! Oh well. ;-) – Sabuncu – 2014-02-11T21:55:47.667

1

Taleb makes the case at Edge.org for retiring standard deviation and using mean absolute deviation.

– Alex Holcombe – 2015-06-02T10:38:32.660

@c4il will you please cite the source for the formula of S.D. quoted by you. I do think that it is incorrect. – subhash c. davar – 2015-11-20T17:11:47.500

@rpierce would you please check the correctness of the formula of s.d. under the definition in the question. – subhash c. davar – 2015-11-21T01:28:32.163

@subhash c. davar, the notation isn't in a form I'm familiar with. However, OP defines E as the process of getting the mean, so IMO the equations check out. – russellpierce – 2015-11-21T16:53:27.540

1

@subhashc.davar: The missing definition is for the expectation of the random variable $X$, $\mu=\operatorname{E}[X]$. (It's so commonly used that it's no more than a venial sin to let us guess what it means from the context.) Wikipedia will serve as a reference for the definition of standard deviation: https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values. Note the distinction between the standard deviation of a distribution/population & an estimate of it that may be calculated from a sample.

– Scortchi – 2015-11-23T16:20:55.817

@whuber Could you clarify relation to CLT specifically? Is variance the only non-zero functional $f$ s.t. $f(\sqrt n (\bar X_n-EX))=f(X)$? – A.S. – 2016-01-27T19:58:47.097

@A.S. Sure--I have answered this question in some detail at http://stats.stackexchange.com/a/3904. Briefly, there are infinitely many such functionals--but they must all asymptotically converge to the variance.

– whuber – 2016-01-27T20:03:49.250

@whuber What do you mean by "asymptotically converge"? Are you considering convergence of separate $f_n$ (defined for each $n$) rather than a single $f$ that satisfies the above for all $n$? // I'll read the post. – A.S. – 2016-01-27T20:29:56.877

1Do you think the term standard means this is THE standard today ? Isn't it like asking why principal component are "principal" and not secondary ? – robin girard – 2010-07-23T21:44:37.093

My understanding of this question is that it could be shorter just be something like: what is the difference between the MAE and the RMSE ? otherwise it is difficult to deal with. – robin girard – 2010-07-24T06:08:14.627

"the absolute-value method will be smaller", actually, it'll be bigger for small variances - it'll always be closer to 1 though (unless it is 1 or 0) – naught101 – 2012-03-29T05:22:51.173

Finding out that the variance uses squared by definition satisfied me. The moments of distribution are measurements which are defined by the powers of the differences: mean (^1) , variance (^2), skewness (^3), and kurtosis (^4). The variance can be particularly useful (many of the reasons are mentioned in this post; numbers further away have more weight, etc). – Federico – 2017-02-01T02:03:26.073

21In a way, the measurement you proposed is widely used in case of error (model quality) analysis -- then it is called MAE, "mean absolute error". – mbq – 2010-07-19T21:30:23.000

1Despite the antiquity of this question, I've posted a new answer, which says something that I think is worth knowing about. – Michael Hardy – 2012-09-18T01:42:06.330

Related question: http://stats.stackexchange.com/q/354/919 ("Bias towards natural numbers in the case of least squares.")

– whuber – 2010-11-27T21:53:19.177

154

If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.

The benefits of squaring include:

• Squaring always gives a positive value, so the sum will not be zero.
• Squaring emphasizes larger differences—a feature that turns out to be both good and bad (think of the effect outliers have).

Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence the square root allows us to return to the original units.

I suppose you could say that absolute difference assigns equal weight to the spread of data, whereas squaring emphasises the extremes. Technically, though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).

It is important to note however that there's no reason you couldn't take the absolute difference if that is your preference on how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are in fact several competing methods for measuring spread.

My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: $c = \sqrt{a^2 + b^2}$ …this also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference which I mostly only use as a memory aid, feel free to ignore this paragraph.
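That "variances add, standard deviations don't" property can be checked exactly with a tiny sketch that enumerates two independent fair coins (the 0/1 coding is just an illustrative choice):

```python
import math
import itertools

def var(xs):
    """Population variance of an equally weighted list of outcomes."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

coin = [0, 1]                                       # one fair coin, var = 0.25
sums = [a + b for a, b in itertools.product(coin, coin)]

var_sum = var(sums)          # 0.5 = 0.25 + 0.25: variances add
sd_sum = math.sqrt(var_sum)  # ~0.707, NOT 0.5 + 0.5: standard deviations do not add
```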

A much more in-depth analysis can be read here.

12

Much of the field of robust statistics is an attempt to deal with the excessive sensitivity to outliers that is a consequence of choosing the variance as a measure of data spread (technically, scale or dispersion). http://en.wikipedia.org/wiki/Robust_statistics

– Thylacoleo – 2010-08-13T05:15:29.947

3The article linked to in the answer is a god send. – traggatmot – 2015-03-19T07:27:00.683

57"Squaring always gives a positive value, so the sum will not be zero." and so does absolute values. – robin girard – 2010-07-22T09:54:23.443

24@robin girard: That is correct, hence why I preceded that point with "The benefits of squaring include". I wasn't implying anything about absolute values in that statement. I take your point though, I'll consider removing/rephrasing it if others feel it is unclear. – Tony Breyal – 2010-07-22T13:19:20.007

I think the paragraph about Pythagoras is spot on. You can think of the error as a vector in $n$ dimensions, with $n$ being the number of samples. The size in each dimension is the difference from the mean for that sample. $[(x_1-\mu), (x_2-\mu), (x_3-\mu), ...]$ The length of that vector (Pythagoras) is the root of summed squares, i.e. the standard deviation. – Arne Brasseur – 2017-09-30T09:39:48.153

121

The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.

The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.

3@Rich: Both the variance and the median can be found in linear time, and of course no faster. Median does not require sorting. – Neil G – 2014-05-19T20:02:11.773

@NeilG how do you propose to find the sample median in linear time? – Jerome Baum – 2014-12-27T02:38:47.563

2

@JeromeBaum: http://en.wikipedia.org/wiki/Median_of_medians

– Neil G – 2014-12-27T04:11:00.280

2said "it's continuously differentiable (nice when you want to minimize it)" do you mean that the absolute value is difficult to optimize ? – robin girard – 2010-07-23T21:40:12.833

27@robin: while the absolute value function is continuous everywhere, its first derivative is not (at x=0). This makes analytical optimization more difficult. – Vince – 2010-07-23T23:59:23.210

1Yeah, finding quantiles in general (which includes optimizing absolute values) tends to churn up linear programming type problems, which -- while they're certainly tractable numerically -- can get fiddly. They typically don't have an analytical closed-form solution, and are a bit slower and a bit more difficult to implement than least-square-type solutions. – Rich – 2010-07-24T02:55:02.763

2I do not agree with this. First, theoretically, the problem may be of different nature (because of the discontinuity) but not necessarily harder (for example the median is easely shown to be arginf_m E[|Y-m|]). Second, practically, using a L1 norm (absolute value) rather than a L2 norm makes it piecewise linear and hence at least not more difficult. Quantile regression and its multiple variante is an example of that. – robin girard – 2010-07-24T06:01:42.113

12Yes, but finding the actual number you want, rather than just a descriptor of it, is easier under squared error loss. Consider the 1 dimension case; you can express the minimizer of the squared error by the mean: O(n) operations and closed form.

You can express the value of the absolute error minimizer by the median, but there's not a closed-form solution that tells you what the median value is; it requires a sort to find, which is something like O(n log n).

Least squares solutions tend to be a simple plug-and-chug type operation, absolute value solutions usually require more work to find. – Rich – 2010-07-24T09:10:00.387
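The mean/median point above can be checked numerically; this sketch (hypothetical data) does a brute-force grid search and recovers the closed-form minimizers:

```python
data = [1.0, 2.0, 2.0, 3.0, 10.0]
mean = sum(data) / len(data)            # minimizes the sum of squared errors
median = sorted(data)[len(data) // 2]   # minimizes the sum of absolute errors

def sse(c):
    """Sum of squared errors around a candidate center c."""
    return sum((x - c) ** 2 for x in data)

def sae(c):
    """Sum of absolute errors around a candidate center c."""
    return sum(abs(x - c) for x in data)

# Brute-force search over candidate centers
grid = [i / 100 for i in range(0, 1101)]
best_sq = min(grid, key=sse)    # lands on the mean, 3.6
best_abs = min(grid, key=sae)   # lands on the median, 2.0
```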

77

One way you can think of this is that standard deviation is similar to a "distance from the mean".

Compare this to distances in Euclidean space - this gives you the true distance, where what you suggested (which, by the way, is the absolute deviation) is more like a Manhattan distance calculation.

1This should be modified as minimum distance from the mean. It's essentially a Pythagorean equation. – John – 2014-11-21T16:40:34.797

2Except that in one dimension the $l_1$ and $l_2$ norm are the same thing, aren't they? – naught101 – 2012-03-29T05:20:37.400

16Nice analogy of euclidean space! – c4il – 2010-07-19T21:38:48.373

4@naught101: It's not one dimension, but rather $n$ dimensions where $n$ is the number of samples. The standard deviation and the absolute deviation are (scaled) $l_2$ and $l_1$ distances respectively, between the two points $(x_1, x_2, \dots, x_n)$ and $(\mu, \mu, \dots, \mu)$ where $\mu$ is the mean. – ShreevatsaR – 2012-11-16T07:21:30.730

46

The reason that we calculate standard deviation instead of absolute error is that we are assuming error to be normally distributed. It's a part of the model.

Suppose you are measuring very small lengths with a ruler; then the standard deviation is a bad metric for error because you know you will never accidentally measure a negative length. A better metric would be one that helps fit a Gamma distribution to your measurements:

$\log(E(x)) - E(\log(x))$

Like the standard deviation, this is also non-negative and differentiable, but it is a better error statistic for this problem.
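A sketch of this statistic (the helper name is my own; the data are hypothetical): it is non-negative by Jensen's inequality, zero only when all values are equal, and unchanged by rescaling, which suits positive, ruler-type measurements:

```python
import math

def gamma_spread(xs):
    """log(E[x]) - E[log x]: >= 0 for positive data, 0 iff all values are equal."""
    mean = sum(xs) / len(xs)
    mean_log = sum(math.log(x) for x in xs) / len(xs)
    return math.log(mean) - mean_log

no_spread = gamma_spread([3.0, 3.0, 3.0])    # ~0: no variation
spread = gamma_spread([0.1, 1.0, 10.0])      # > 0: wide variation
scaled = gamma_spread([0.5, 5.0, 50.0])      # same as `spread`: scale-free
```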

2Great counter-example as to when the standard deviation is not the best way to think of fluctuation sizes. – Hbar – 2014-05-13T02:49:09.420

Shouldn't you have an opposite sign on the quantity to yield a positive measure - using a convex $-log x$ instead of concave $\log x$? – A.S. – 2016-01-27T20:58:00.613

@A.S. No, it is already always positive. It is zero when all the samples $x$ are equal, and otherwise its magnitude measures variation. – Neil G – 2016-01-27T22:21:03.357

You are mistaken. $E(g(X))\le g(E(X))$ for concave $g$. – A.S. – 2016-01-27T22:25:46.880

@A.S.: Oh, I thought you wanted me to change the sign of one of the terms — not both. Okay, I'll flip it around then. Good catch. – Neil G – 2016-01-27T22:26:57.137

3I like your answer. The sd is not always the best statistic. – RockScience – 2010-11-25T03:03:17.077

20

The answer that best satisfied me is that it falls out naturally from the generalization of a sample to $n$-dimensional Euclidean space. It's certainly debatable whether that's something that should be done, but in any case:

Assume your $n$ measurements $X_i$ are each an axis in $\mathbb R^n$. Then your data $x_i$ define a point $\bf x$ in that space. Now you might notice that the data are all very similar to each other, so you can represent them with a single location parameter $\mu$ that is constrained to lie on the line defined by $X_i=\mu$. Projecting your datapoint onto this line gets you $\hat\mu=\bar x$, and the distance from the projected point $\hat\mu\bf 1$ to the actual datapoint is $\sqrt{n-1}\,\hat\sigma=\|\bf x-\hat\mu\bf 1\|$, where $\hat\sigma$ is the sample standard deviation.

This approach also gets you a geometric interpretation for correlation, $\hat\rho=\cos \angle(\vec{\bf\tilde x},\vec{\bf\tilde y})$.
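The cosine identity is easy to verify numerically; a sketch with made-up data, computing the correlation as the cosine of the angle between the centered vectors:

```python
import math

x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 1.0, 5.0, 6.0]

def centered(v):
    """Subtract the mean from every component."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

cx, cy = centered(x), centered(y)
dot = sum(a * b for a, b in zip(cx, cy))
cos_angle = dot / (norm(cx) * norm(cy))   # equals the Pearson correlation of x and y
```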

2This answer was thought-provoking and I think my preferred way of viewing it. In 1-D it's hard to understand why squaring the difference is seen as better. But in multiple dimensions (or even just 2) one can easily see that Euclidean distance (squaring) is preferable to Manhattan distance (sum of absolute value of differences). – thecity2 – 2016-06-07T21:25:00.007

6This is correct and appealing. However, in the end it appears only to rephrase the question without actually answering it: namely, why should we use the Euclidean (L2) distance? – whuber – 2010-11-24T21:07:08.960

That is indeed an excellent question, left unanswered. I used to feel strongly that the use of L2 is unfounded. After having studied a little statistics, I saw the analytic niceties, and since then have revised my viewpoint into "if it really matters, you're probably in deep water already, and if not, easy is nice". I don't know measure theory yet, and worry that analysis rules there too - but I've noticed some new interest in combinatorics, so perhaps new niceties have been/will be found. – sesqu – 2010-11-24T21:39:54.330

18@sesqu Standard deviations did not become commonplace until Gauss in 1809 derived his eponymous deviation using squared error, rather than absolute error, as a starting point. However, what pushed them over the top (I believe) was Galton's regression theory (at which you hint) and the ability of ANOVA to decompose sums of squares--which amounts to a restatement of the Pythagorean Theorem, a relationship enjoyed only by the L2 norm. Thus the SD became a natural omnibus measure of spread advocated in Fisher's 1925 "Statistical Methods for Research Workers" and here we are, 85 years later. – whuber – 2010-11-24T21:56:31.867

12(+1) Continuing in @whuber's vein, I would bet that had Student published a paper in 1908 entitled, "Probable Error of the Mean - Hey, Guys, Check Out That MAE in the Denominator!" then statistics would have an entirely different face by now. Of course, he didn't publish a paper like that, and of course he couldn't have, because the MAE doesn't boast all the nice properties that S^2 has. One of them (related to Student) is its independence of the mean (in the normal case), which of course is a restatement of orthogonality, which gets us right back to L2 and the inner product. – None – 2010-11-25T03:38:57.890

17

Squaring the difference from the mean has a couple of reasons.

• Variance is defined as the 2nd moment of the deviation (the R.V. here is $(x-\mu)$) and thus the square, since moments are simply the expectations of higher powers of the random variable.

• Having a square as opposed to the absolute value function gives a nice continuous and differentiable function (absolute value is not differentiable at 0) - which makes it the natural choice, especially in the context of estimation and regression analysis.

• The squared formulation also naturally falls out of parameters of the Normal Distribution.

14

Yet another reason (in addition to the excellent ones above) comes from Fisher himself, who showed that the standard deviation is more "efficient" than the absolute deviation. Here, efficient has to do with how much a statistic will fluctuate in value on different samplings from a population. If your population is normally distributed, the standard deviation of various samples from that population will, on average, tend to give you values that are pretty similar to each other, whereas the absolute deviation will give you numbers that spread out a bit more. Now, obviously this is in ideal circumstances, but this reason convinced a lot of people (along with the math being cleaner), so most people worked with standard deviations.

4Your argument depends on the data being normally distributed. If we assume the population to have a "double exponential" distribution, then the absolute deviation is more efficient (in fact it is a sufficient statistic for the scale) – probabilityislogic – 2011-07-16T05:08:39.190

6Yes, as I stated, "if your population is normally distributed." – Eric Suh – 2011-09-08T19:49:55.497

Besides assuming normal distribution, Fisher's proof assumes error-free measurements. With small errors like 1% the situation inverts, and the average absolute deviation is more efficient than the standard deviation – juanrga – 2017-08-06T10:51:49.837

12

Just so people know, there is a Math Overflow question on the same topic.

Why-is-it-so-cool-to-square-numbers-in-terms-of-finding-the-standard-deviation

The take away message is that using the square root of the variance leads to easier maths. A similar response is given by Rich and Reed above.

3'Easier math' isn't an essential requirement when we want our formulas and values to more truly reflect a given set of data. Computers do all the hard work anyway. – Dan W – 2015-07-31T05:26:58.147

Defining pi as 3.14 makes math easier, but that doesn't make it right. – James – 2015-11-28T03:29:23.887

10

There are many reasons; probably the main one is that it works well as a parameter of the normal distribution.

4I agree. Standard deviation is the right way to measure dispersion if you assume normal distribution. And a lot of distributions and real data are an approximately normal. – Łukasz Lew – 2010-07-20T14:40:02.050

2

I don't think you should say "natural parameter": the natural parameters of the normal distribution are mean and mean times precision. (http://en.wikipedia.org/wiki/Natural_parameter)

– Neil G – 2012-03-12T07:40:43.097

1@NeilG Good point; I was thinking about "casual" meaning here. I'll think about some better word. – mbq – 2012-03-12T10:41:54.607

9

I think the contrast between using absolute deviations and squared deviations becomes clearer once you move beyond a single variable and think about linear regression. There's a nice discussion at http://en.wikipedia.org/wiki/Least_absolute_deviations, particularly the section "Contrasting Least Squares with Least Absolute Deviations" , which links to some student exercises with a neat set of applets at http://www.math.wpi.edu/Course_Materials/SAS/lablets/7.3/73_choices.html .

To summarise, least absolute deviations is more robust to outliers than ordinary least squares, but it can be unstable (a small change in even a single datum can give a big change in the fitted line) and doesn't always have a unique solution - there can be a whole range of fitted lines. Also, least absolute deviations requires iterative methods, while ordinary least squares has a simple closed-form solution, though that's not such a big deal now as it was in the days of Gauss and Legendre, of course.

the "unique solution" argument is quite weak, it really means there is more than one value well supported by the data. Additionally, penalisation of the coefficients, such as L2, will resolve the uniqueness problem, and the stability problem to a degree as well. – probabilityislogic – 2014-07-04T11:13:05.683

9

$\newcommand{\var}{\operatorname{var}}$ Variances are additive: for independent random variables $X_1,\ldots,X_n$, $$\var(X_1+\cdots+X_n)=\var(X_1)+\cdots+\var(X_n).$$

Notice what this makes possible: Say I toss a fair coin 900 times. What's the probability that the number of heads I get is between 440 and 455 inclusive? Just find the expected number of heads ($450$), and the variance of the number of heads ($225=15^2$), then find the probability with a normal (or Gaussian) distribution with expectation $450$ and standard deviation $15$ is between $439.5$ and $455.5$. Abraham de Moivre did this with coin tosses in the 18th century, thereby first showing that the bell-shaped curve is worth something.
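De Moivre's calculation can be reproduced directly; a sketch comparing the continuity-corrected normal approximation with the exact binomial sum:

```python
import math

n, p = 900, 0.5
mean = n * p                      # 450 expected heads
sd = math.sqrt(n * p * (1 - p))   # standard deviation 15

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Normal approximation (with continuity correction) for P(440 <= heads <= 455)
approx = phi((455.5 - mean) / sd) - phi((439.5 - mean) / sd)

# Exact binomial probability for comparison
exact = sum(math.comb(n, k) for k in range(440, 456)) / 2 ** n
```

The two answers agree to about three decimal places, which is the point of the bell-curve approximation.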

Are mean absolute deviations not additive in the same way as variances? – russellpierce – 2013-02-09T23:30:10.380

4No, they're not. – Michael Hardy – 2013-02-10T18:14:55.087

8

In many ways, the use of standard deviation to summarize dispersion is jumping to a conclusion. You could say that SD implicitly assumes a symmetric distribution because of its equal treatment of distance below the mean as of distance above the mean. The SD is surprisingly difficult to interpret to non-statisticians. One could argue that Gini's mean difference has broader application and is significantly more interpretable. It does not require one to declare their choice of a measure of central tendency as the use of SD does for the mean. Gini's mean difference is the average absolute difference between any two different observations. Besides being robust and easy to interpret it happens to be 0.98 as efficient as SD if the distribution were actually Gaussian.

2

Just to add to @Frank's suggestion on Gini, there's a nice paper here: http://projecteuclid.org/download/pdf_1/euclid.ss/1028905831 It goes over various measures of dispersion and also give an informative historical perspective.

– Thomas Speidel – 2014-05-14T17:06:06.053

1I like these ideas too, but there's a less well known parallel definition of the variance (and thus the SD) that makes no reference to means as location parameters. The variance is half the mean square over all the pairwise differences between values, just as the Gini mean difference is based on the absolute values of all the pairwise difference. – Nick Cox – 2014-10-21T23:46:46.453
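Both of these pairwise-difference definitions are easy to check numerically; a sketch on hypothetical data verifying that the variance is half the mean square of all pairwise differences, computed alongside Gini's mean difference:

```python
import itertools

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n
pop_var = sum((x - mean) ** 2 for x in data) / n

# Gini's mean difference: average |x_i - x_j| over distinct pairs
pairs = list(itertools.combinations(data, 2))
gini = sum(abs(a - b) for a, b in pairs) / len(pairs)

# Variance as half the mean of squared differences over ALL ordered pairs (i, j)
pairwise_var = sum((a - b) ** 2 for a in data for b in data) / (2 * n * n)
```

Note that `pairwise_var` makes no reference to the mean at all, yet matches the usual definition exactly.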

5

"Why square the difference" instead of "taking the absolute value"? To answer very exactly, there is literature that gives the reasons it was adopted and the case for why most of those reasons do not hold. "Can't we simply take the absolute value...?" I am aware of literature in which the answer is yes; it is being done, and doing so is argued to be advantageous.

Author Gorard states, first, using squares was previously adopted for reasons of simplicity of calculation but that those original reasons no longer hold. Gorard states, second, that OLS was adopted because Fisher found that results in samples of analyses that used OLS had smaller deviations than those that used absolute differences (roughly stated). Thus, it would seem that OLS may have benefits in some ideal circumstances; however, Gorard proceeds to note that there is some consensus (and he claims Fisher agreed) that under real world conditions (imperfect measurement of observations, non-uniform distributions, studies of a population without inference from a sample), using squares is worse than absolute differences.

Gorard's response to your question "Can't we simply take the absolute value of the difference instead and get the expected value (mean) of those?" is yes. Another advantage is that using absolute differences produces measures (of errors and of variation) that are related to the ways we experience those ideas in life. Gorard says to imagine people who split a restaurant bill evenly: some might intuitively notice that that method is unfair. Nobody there will square the errors; the differences are the point.

Finally, using absolute differences, he notes, treats each observation equally, whereas by contrast squaring the differences gives observations predicted poorly greater weight than observations predicted well, which is like allowing certain observations to be included in the study multiple times. In summary, his general thrust is that there are today not many winning reasons to use squares and that by contrast using absolute differences has advantages.


Thanks @Jen, this reminds me of the QWERTY keyboard history. Hey, how come it takes so long to type QWERTY? – toto_tico – 2016-02-25T00:01:07.610

5

Estimating the standard deviation of a distribution requires choosing a distance.
Any of the following distances can be used:

$$d_n\big((X_i)_{i=1,\ldots,I},\mu\big)=\left(\sum_{i=1}^I | X_i-\mu|^n\right)^{1/n}$$

We usually use the natural Euclidean distance ($n=2$), which is the one everybody uses in daily life. The distance that you propose is the one with $n=1$.
Both are good candidates, but they are different.

One could decide to use $n=3$ as well.

I am not sure that you will like my answer; my point, contrary to others, is not to demonstrate that $n=2$ is better. I think that if you want to estimate the standard deviation of a distribution, you can absolutely use a different distance.
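A sketch of this family of distances on hypothetical data; for $n=2$, dividing by the square root of the number of observations recovers the (population) standard deviation:

```python
def d_n(data, mu, n):
    """L^n distance between the data vector and the point (mu, ..., mu)."""
    return sum(abs(x - mu) ** n for x in data) ** (1.0 / n)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = sum(data) / len(data)

d1 = d_n(data, mu, 1)   # sum of absolute deviations
d2 = d_n(data, mu, 2)   # Euclidean distance from the mean point
d3 = d_n(data, mu, 3)   # a perfectly usable alternative

sd = d2 / len(data) ** 0.5   # population standard deviation
```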

4

It depends on what you are talking about when you say "spread of the data". To me this could mean two things:

1. The width of a sampling distribution
2. The accuracy of a given estimate

For point 1) there is no particular reason to use the standard deviation as a measure of spread, except for when you have a normal sampling distribution. The measure $E(|X-\mu|)$ is more appropriate in the case of a Laplace sampling distribution. My guess is that the standard deviation gets used here because of intuition carried over from point 2). Probably also due to the success of least squares modelling in general, for which the standard deviation is the appropriate measure. Probably also because calculating $E(X^2)$ is generally easier than calculating $E(|X|)$ for most distributions.

Now, for point 2) there is a very good reason for using the variance/standard deviation as the measure of spread, in one particular, but very common case. You can see it in the Laplace approximation to a posterior. With Data $D$ and prior information $I$, write the posterior for a parameter $\theta$ as:

$$p(\theta\mid DI)=\frac{\exp\left(h(\theta)\right)}{\int \exp\left(h(t)\right)\,dt}\;\;\;\;\;\;h(\theta)\equiv\log[p(\theta\mid I)p(D\mid\theta I)]$$

I have used $t$ as a dummy variable to indicate that the denominator does not depend on $\theta$. If the posterior has a single well-rounded maximum (i.e. not too close to a "boundary"), we can Taylor expand the log probability about its maximum $\theta_\max$. If we take the first two terms of the Taylor expansion we get (using prime for differentiation):

$$h(\theta)\approx h(\theta_\max)+(\theta-\theta_\max)h'(\theta_\max)+\frac{1}{2}(\theta-\theta_\max)^{2}h''(\theta_\max)$$

But we have here that because $\theta_\max$ is a "well rounded" maximum, $h'(\theta_\max)=0$, so we have:

$$h(\theta)\approx h(\theta_\max)+\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)$$

If we plug in this approximation we get:

$$p(\theta\mid DI)\approx\frac{\exp\left(h(\theta_\max)+\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)\right)}{\int \exp\left(h(\theta_\max)+\frac{1}{2}(\theta_\max-t)^{2}h''(\theta_\max)\right)\,dt}$$

$$=\frac{\exp\left(\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)\right)}{\int \exp\left(\frac{1}{2}(\theta_\max-t)^{2}h''(\theta_\max)\right)\,dt}$$

Which, but for notation is a normal distribution, with mean equal to $E(\theta\mid DI)\approx\theta_\max$, and variance equal to

$$V(\theta\mid DI)\approx \left[-h''(\theta_\max)\right]^{-1}$$

($-h''(\theta_\max)$ is always positive because we have a well rounded maximum). So this means that in "regular problems" (which is most of them), the variance is the fundamental quantity which determines the accuracy of estimates for $\theta$. So for estimates based on a large amount of data, the standard deviation makes a lot of sense theoretically - it tells you basically everything you need to know. Essentially the same argument applies (with the same conditions required) in the multi-dimensional case with $h''(\theta)_{jk}=\frac{\partial^2 h(\theta)}{\partial \theta_j \, \partial \theta_k}$ being a Hessian matrix. The diagonal entries are also essentially variances here too.

The frequentist using the method of maximum likelihood will come to essentially the same conclusion because the MLE tends to be a weighted combination of the data, and for large samples the Central Limit Theorem applies and you basically get the same result if we take $p(\theta\mid I)=1$ but with $\theta$ and $\theta_\max$ interchanged: $$p(\theta_\max\mid\theta)\approx N\left(\theta,\left[-h''(\theta_\max)\right]^{-1}\right)$$ (see if you can guess which paradigm I prefer :P ). So either way, in parameter estimation the standard deviation is an important theoretical measure of spread.
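The Laplace-approximation variance $[-h''(\theta_\max)]^{-1}$ can be checked against a posterior whose variance is known exactly; a sketch using a Beta posterior (an assumed example, not one from the answer above):

```python
# Assumed example: a Beta(a, b) posterior, with h(theta) the log density up to
# a constant, i.e. h(theta) = (a - 1) log(theta) + (b - 1) log(1 - theta).
a, b = 40.0, 60.0

theta_max = (a - 1) / (a + b - 2)   # mode of the Beta(a, b) density

# Second derivative of h at the mode
h2 = -(a - 1) / theta_max ** 2 - (b - 1) / (1 - theta_max) ** 2

laplace_var = -1.0 / h2             # Laplace approximation to the posterior variance

# Exact Beta(a, b) variance for comparison
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))
```

For this well-rounded posterior the approximation lands within a few percent of the exact variance.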

4

Because squares can allow use of many other mathematical operations or functions more easily than absolute values.

Example: squares can be integrated, differentiated, can be used in trigonometric, logarithmic and other functions, with ease.

2I wonder if there is a self-fulfilling prophecy here. We get – probabilityislogic – 2012-03-13T12:04:02.063

3

When adding random variables, their variances add, for all distributions. Variance (and therefore standard deviation) is a useful measure for almost all distributions, and is in no way limited to gaussian (aka "normal") distributions. That favors using it as our error measure. Lack of uniqueness is a serious problem with absolute differences, as there are often an infinite number of equal-measure "fits", and yet clearly the "one in the middle" is most realistically favored. Also, even with today's computers, computational efficiency matters. I work with large data sets, and CPU time is important. However, there is no single absolute "best" measure of residuals, as pointed out by some previous answers. Different circumstances sometimes call for different measures.

1I remain unconvinced that variances are very useful for asymmetric distributions. – Frank Harrell – 2014-10-22T12:58:45.370

3

Naturally you can describe dispersion of a distribution in any way meaningful (absolute deviation, quantiles, etc.).

One nice fact is that the variance is the second central moment, and many distributions are uniquely determined by their moments when they exist (though not all: the log-normal is a well-known exception). Another nice fact is that the variance is much more tractable mathematically than any comparable metric. A further fact is that the variance is one of the two parameters of the normal distribution in the usual parametrization, and the normal distribution has only two non-zero cumulants, which are precisely those two parameters. Even for non-normal distributions it can be helpful to think in a normal framework.

As I see it, the reason the standard deviation exists as such is that in applications the square root of the variance regularly appears (such as when standardizing a random variable), which necessitated a name for it.
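As a quick illustration of that last point (a minimal numpy sketch of my own): standardizing divides by the square root of the variance, which rescales any sample, however skewed, to mean 0 and standard deviation 1.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=100_000)  # a skewed sample

# Standardizing divides by the standard deviation: the square root of the variance.
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # approximately 0 and exactly 1
```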

1If I recall correctly, isn't the log-normal distribution not uniquely defined by its moments? – probabilityislogic – 2014-04-10T13:38:31.843

@probabilityislogic, indeed, that is true, see https://en.wikipedia.org/wiki/Log-normal_distribution in the section "Characteristic function and moment generating function".

– kjetil b halvorsen – 2015-08-02T18:45:05.660

0

Squaring amplifies larger deviations.

If your sample has values spread all over the chart, then to bring 68.2% of them within the first standard deviation, the standard deviation needs to be a little wider. If your data tend to fall close to the mean, then σ can be tighter.
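For what it's worth, the 68.2% figure invoked above is specific to the normal distribution (other distributions put a different fraction within one standard deviation); a quick numpy check of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=1_000_000)

# Fraction of normal samples within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(x - x.mean()) <= x.std())
print(within_one_sd)  # close to 0.6827 for normal data
```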

Some say that it is to simplify calculations. Using the positive square root of the square would have solved that, so that argument doesn't hold water.

$|x| = \sqrt{x^{2}}$

So if algebraic simplicity was the goal then it would have looked like this:

$\sigma = \text{E}\left[\sqrt{(x-\mu)^{2}}\right]$ which yields the same results as $\text{E}\left[|x-\mu|\right]$.

Obviously squaring this also has the effect of amplifying outlying errors (doh!).
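A quick numerical check of the claim above (my own sketch, using normal data, for which $\text{E}[|x-\mu|] = \sigma\sqrt{2/\pi} \approx 0.8\,\sigma$): putting the square root inside the expectation gives a genuinely different quantity from putting it outside.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 3.0
x = rng.normal(0.0, sigma, size=1_000_000)

mean_abs_dev = np.abs(x - x.mean()).mean()  # E[|x - mu|] = E[sqrt((x - mu)^2)]
sd = x.std()                                # sqrt(E[(x - mu)^2])

print(mean_abs_dev, sd)  # roughly 2.39 vs 3.0: not the same quantity
```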

Based on a flag I just processed, I suspect the downvoter did not completely understand how this answer responds to the question. I believe I see the connection (but you might nevertheless consider making some edits to help other readers appreciate your points better). Your first paragraph, though, strikes me as being somewhat of a circular argument: the 68.2% value is derived from properties of the standard deviation, so how does invoking that number help justify using the SD instead of some other $L^p$ norm of deviations from the mean as a way to quantify the spread of a distribution? – whuber – 2014-07-28T21:20:49.153

The first paragraph was the reason for my downvote. – Alexis – 2014-07-28T22:45:40.260

2@Preston Thayne: Since the standard deviation is not the expected value of sqrt((x-mu)^2), your formula is misleading. In addition, just because squaring has the effect of amplifying larger deviations does not mean that this is the reason for preferring the variance over the MAD. If anything, that is a neutral property since oftentimes we want something more robust like the MAD. Lastly, the fact that the variance is more mathematically tractable than the MAD is a much deeper issue mathematically then you've conveyed in this post. – Steve S – 2014-07-29T02:18:51.680

0

A different and perhaps more intuitive approach is when you think about linear regression vs. median regression.

Suppose our model is that $\mathbb{E}(y|x) = x\beta$. Then we find $\beta$ by minimizing the expected squared residual, $\beta = \arg \min_b \mathbb{E} (y - x b)^2$.

If instead our model is that Median$(y|x) = x\beta$, then we find our parameter estimates by minimizing the absolute residuals, $\beta = \arg \min_b \mathbb{E} |y - x b|$.

In other words, whether to use absolute or squared error depends on whether you want to model the expected value or the median value.
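A tiny numerical illustration of this (my own sketch, minimizing each loss over a constant by brute-force grid search): the squared-error minimizer is the sample mean, while the absolute-error minimizer is the sample median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier
grid = np.linspace(0.0, 100.0, 100_001)    # candidate constants c

# Evaluate both losses at every candidate via broadcasting.
sq_loss = ((y - grid[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(y - grid[:, None]).sum(axis=1)

print(grid[sq_loss.argmin()], y.mean())       # squared error -> the mean (22.0)
print(grid[abs_loss.argmin()], np.median(y))  # absolute error -> the median (3.0)
```

The outlier drags the squared-error solution far from the bulk of the data, which is exactly the robustness trade-off between the two losses.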

If the distribution, for example, displays skewed heteroscedasticity, then the slope of the expected value of $y$ over $x$ can differ greatly from the slope of the median value of $y$.

Koenker and Hallock have a nice piece on quantile regression, where median regression is a special case: http://master272.com/finance/QR/QRJEP.pdf.

0

My guess is this: most populations (distributions) tend to congregate around the mean. The farther a value is from the mean, the rarer it is. To adequately express how "out of line" a value is, it is necessary to take into account both its distance from the mean and its (normally speaking) rareness of occurrence. Squaring the difference from the mean does this, weighting large deviations far more heavily than small ones. Once all the squared deviations are averaged, it is OK to take the square root, which returns the units to their original dimensions.

2This doesn't explain why you couldn't just take the absolute value of the difference. That seems conceptually simpler to most stats 101 students, & it would "take into account both its distance from the mean and its (normally speaking) rareness of occurrence". – gung – 2013-09-13T02:35:02.840

I think the absolute value of the difference would only express the difference from the mean and would not take into account the fact that large differences are doubly disruptive to a normal distribution. – Samuel Berry – 2013-09-13T02:44:24.830

2Why is "doubly disruptive" important and not, say, "triply disruptive" or "quadruply disruptive"? It looks like this answer merely replaces the original question with an equivalent question. – whuber – 2013-09-13T15:19:21.483