It depends on what you are talking about when you say "spread of the data". To me this could mean two things:

- The width of a sampling distribution
- The accuracy of a given estimate

For point 1) there is no particular reason to use the standard deviation as a measure of spread, except when you have a normal sampling distribution. The measure $E(|X-\mu|)$ is more appropriate in the case of a Laplace sampling distribution. My guess is that the standard deviation gets used here because of intuition carried over from point 2), because of the success of least-squares modelling in general (for which the standard deviation is the appropriate measure), and because calculating $E(X^2)$ is generally easier than calculating $E(|X|)$ for most distributions.
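To make the distinction concrete, here is a minimal simulation sketch (my own illustration, not from the original) comparing the standard deviation with the mean absolute deviation $E(|X-\mu|)$ for normal and Laplace samples. For a standard normal, $E(|X-\mu|)=\sqrt{2/\pi}\approx 0.798$; for a Laplace distribution with scale $b=1$, the standard deviation is $b\sqrt{2}\approx 1.414$ while $E(|X-\mu|)=b=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Normal(0, 1): sd = 1, E|X - mu| = sqrt(2/pi) ~= 0.798
x = rng.normal(0.0, 1.0, n)
print(x.std(), np.abs(x - x.mean()).mean())

# Laplace(0, b=1): sd = b*sqrt(2) ~= 1.414, E|X - mu| = b = 1
y = rng.laplace(0.0, 1.0, n)
print(y.std(), np.abs(y - y.mean()).mean())
```

Neither measure is "wrong"; they simply summarize the spread of different sampling distributions more or less naturally.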

Now, for point 2) there is a very good reason for using the variance/standard deviation as the measure of spread, in one particular but very common case. You can see it in the Laplace approximation to a posterior. With data $D$ and prior information $I$, write the posterior for a parameter $\theta$ as:

$$p(\theta\mid DI)=\frac{\exp\left(h(\theta)\right)}{\int \exp\left(h(t)\right)\,dt}\;\;\;\;\;\;h(\theta)\equiv\log[p(\theta\mid I)p(D\mid\theta I)]$$

I have used $t$ as a dummy variable to indicate that the denominator does not depend on $\theta$. If the posterior has a single well-rounded maximum (i.e. not too close to a "boundary"), we can Taylor expand the log probability about its maximum $\theta_\max$. Taking the Taylor expansion up to second order gives (using primes for differentiation):

$$h(\theta)\approx h(\theta_\max)+(\theta-\theta_\max)h'(\theta_\max)+\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)$$

But because $\theta_\max$ is a "well-rounded" (interior) maximum, $h'(\theta_\max)=0$, so we have:

$$h(\theta)\approx h(\theta_\max)+\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)$$

If we plug in this approximation we get:

$$p(\theta\mid DI)\approx\frac{\exp\left(h(\theta_\max)+\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)\right)}{\int \exp\left(h(\theta_\max)+\frac{1}{2}(\theta_\max-t)^{2}h''(\theta_\max)\right)\,dt}$$

$$=\frac{\exp\left(\frac{1}{2}(\theta_\max-\theta)^{2}h''(\theta_\max)\right)}{\int \exp\left(\frac{1}{2}(\theta_\max-t)^{2}h''(\theta_\max)\right)\,dt}$$

which, up to notation, is a normal distribution with mean $E(\theta\mid DI)\approx\theta_\max$ and variance

$$V(\theta\mid DI)\approx \left[-h''(\theta_\max)\right]^{-1}$$

($-h''(\theta_\max)$ is always positive because we have a well-rounded maximum.) So this means that in "regular problems" (which is most of them), the variance is the fundamental quantity which determines the accuracy of estimates for $\theta$. So for estimates based on a large amount of data, the standard deviation makes a lot of sense theoretically: it tells you basically everything you need to know. Essentially the same argument applies (with the same conditions required) in the multi-dimensional case, with the Hessian matrix $h''(\theta)_{jk}=\frac{\partial^2 h(\theta)}{\partial \theta_j \, \partial \theta_k}$; the diagonal entries of $\left[-h''(\theta_\max)\right]^{-1}$ are then essentially the variances of the individual parameters.
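To see the approximation at work, here is a minimal numeric sketch (the Bernoulli/Beta setup is my own illustrative choice, not from the original). With a uniform prior and $k$ successes in $n$ trials, $h(\theta)=k\log\theta+(n-k)\log(1-\theta)$ up to a constant, the mode is $k/n$, and $[-h''(\theta_\max)]^{-1}$ can be compared with the exact Beta posterior variance:

```python
# Laplace approximation to a Beta posterior (uniform prior, Bernoulli data).
# h(theta) = k*log(theta) + (n-k)*log(1-theta) up to an additive constant.
k, n = 30, 100
theta_max = k / n                                    # posterior mode
h2 = -k / theta_max**2 - (n - k) / (1 - theta_max)**2  # h''(theta_max)
laplace_var = -1.0 / h2                              # [-h''(theta_max)]^{-1}

# Exact posterior is Beta(k+1, n-k+1); its variance is ab/((a+b)^2 (a+b+1))
a, b = k + 1, n - k + 1
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(laplace_var, exact_var)
```

For these (hypothetical) numbers the two variances agree to within a few percent, which is the sense in which $[-h''(\theta_\max)]^{-1}$ "is" the posterior variance in regular problems.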

A frequentist using the method of maximum likelihood will come to essentially the same conclusion, because the MLE tends to be a weighted combination of the data, and for large samples the Central Limit Theorem applies. Taking $p(\theta\mid I)=1$, you get basically the same result with $\theta$ and $\theta_\max$ interchanged:
$$p(\theta_\max\mid\theta)\approx N\left(\theta,\left[-h''(\theta_\max)\right]^{-1}\right)$$
(see if you can guess which paradigm I prefer :P ). So either way, in parameter estimation the standard deviation is an important theoretical measure of spread.
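As a sanity check on the frequentist side of this claim, one can simulate the sampling distribution of an MLE; the Bernoulli example below is my own choice, picked because the MLE (the sample proportion) and the information $-h''$ are available in closed form, giving a sampling standard deviation of $\sqrt{\theta(1-\theta)/n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 200, 20_000

# MLE of a Bernoulli probability is the sample proportion; its sampling
# standard deviation should approach sqrt(theta*(1-theta)/n), which is
# [-h'']^{-1/2} evaluated at the true theta.
mles = rng.binomial(n, theta, reps) / n
print(mles.std(), np.sqrt(theta * (1 - theta) / n))
```

The empirical spread of the MLEs matches the theoretical standard deviation closely, as the CLT argument predicts.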

In accepting an answer it seems important to me that we pay attention to whether the answer is circular. The normal distribution is based on these measurements of variance from squared error terms, but that isn't in and of itself a justification for using $(X-M)^2$ over $|X-M|$. – russellpierce – 2010-07-20T07:59:54.683

The following article has a pictorial and easy-to-understand explanation: http://www.mathsisfun.com/data/standard-deviation.html Thanks, Rajesh. – Rajesh – 2013-06-13T13:56:59.363

See http://www.graphpad.com/curvefit/linear_regression.htm, in particular the "Minimizing sum-of-squares" section. – None – 2011-06-12T05:39:06.267

Every answer offered so far is circular. They focus on ease of mathematical calculations (which is nice but by no means fundamental) or on properties of the Gaussian (Normal) distribution and OLS. Around 1800 Gauss *started* with least squares and variance and from those *derived* the Normal distribution: there's the circularity. A truly fundamental reason that has not been invoked in any answer yet is the *unique* role played by the variance in the Central Limit Theorem. Another is the importance in decision theory of minimizing quadratic loss. – whuber – 2013-09-13T15:28:32.647

+1 @whuber: Thanks for pointing this out, which was bothering me as well. Now, though, I have to go and read up on the Central Limit Theorem! Oh well. ;-) – Sabuncu – 2014-02-11T21:55:47.667

Taleb makes the case at Edge.org for retiring standard deviation and using mean absolute deviation. – Alex Holcombe – 2015-06-02T10:38:32.660

@c4il Will you please cite the source for the formula of S.D. quoted by you? I do think that it is incorrect. – subhash c. davar – 2015-11-20T17:11:47.500

@rpierce Would you please check the correctness of the formula for s.d. given under "definition" in the question? – subhash c. davar – 2015-11-21T01:28:32.163

@subhash c. davar, the notation isn't in a form I'm familiar with. However, OP defines E as the process of getting the mean, so IMO the equations check out. – russellpierce – 2015-11-21T16:53:27.540

@subhashc.davar: The missing definition is for the expectation of the random variable $X$, $\mu=\operatorname{E}[X]$. (It's so commonly used that it's no more than a venial sin to let us guess what it means from the context.) Wikipedia will serve as a reference for the definition of standard deviation: https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values. Note the distinction between the standard deviation of a distribution/population & an estimate of it that may be calculated from a sample.

– Scortchi – 2015-11-23T16:20:55.817

@whuber Could you clarify the relation to the CLT specifically? Is variance the only non-zero functional $f$ s.t. $f(\sqrt{n}(\bar X_n-EX))=f(X)$? – A.S. – 2016-01-27T19:58:47.097

@A.S. Sure: I have answered this question in some detail at http://stats.stackexchange.com/a/3904. Briefly, there are infinitely many such functionals, but they must all asymptotically converge to the variance. – whuber – 2016-01-27T20:03:49.250

@whuber What do you mean by "asymptotically converge"? Are you considering convergence of separate $f_n$ (defined for each $n$) rather than a single $f$ that satisfies the above for all $n$? // I'll read the post. – A.S. – 2016-01-27T20:29:56.877

Do you think the term "standard" means this is THE standard today? Isn't it like asking why principal components are "principal" and not secondary? – robin girard – 2010-07-23T21:44:37.093

My understanding of this question is that it could be shorter, just something like: what is the difference between the MAE and the RMSE? Otherwise it is difficult to deal with. – robin girard – 2010-07-24T06:08:14.627

"the absolute-value method will be smaller", actually, it'll be bigger for small variances - it'll always be closer to 1 though (unless it is 1 or 0) – naught101 – 2012-03-29T05:22:51.173Finding out that the variance uses squared by definition satisfied me. The moments of distribution are measurements which are defined by the powers of the differences: mean (^1) , variance (^2), skewness (^3), and kurtosis (^4). The variance can be particularly useful (many of the reasons are mentioned in this post; numbers further away have more weight, etc). – Federico – 2017-02-01T02:03:26.073

In a way, the measurement you proposed is widely used in the case of error (model-quality) analysis; then it is called MAE, "mean absolute error". – mbq – 2010-07-19T21:30:23.000

Despite the antiquity of this question, I've posted a new answer, which says something that I think is worth knowing about. – Michael Hardy – 2012-09-18T01:42:06.330

Related question: http://stats.stackexchange.com/q/354/919 ("Bias towards natural numbers in the case of least squares.") – whuber – 2010-11-27T21:53:19.177