Today I taught an introductory statistics class and a student came up to me with a question, which I rephrase here as: “Why is the standard deviation defined as the square root of the variance and not as the square root of the sum of squares, divided by N?”
We define population variance: $\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2$
And standard deviation: $\sigma = \sqrt{\sigma^2} = \frac{1}{\sqrt{N}}\sqrt{\sum (x_i - \mu)^2}$.
The interpretation we may give to σ is that it gives the average deviation of units in the population from the population mean of X.
However, in the definition of the s.d. we divide the square root of the sum of squares by √N. The question the student raises is why we do not divide the square root of the sum of squares by N instead. Thus we come to a competing formula: $\sigma_{\text{new}} = \frac{1}{N}\sqrt{\sum (x_i - \mu)^2}$. The student argued that this formula looks more like an “average” deviation from the mean than the one that divides by √N, as in σ.
I did not think this question was a stupid one. I would like to give the student an answer that goes further than saying that the s.d. is defined as the square root of the variance, which is the average squared deviation. Put differently, why should the student use the correct formula and not follow her idea?
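To make the two candidate formulas concrete, here is a minimal sketch in Python that evaluates both on a small made-up data set. The function names `usual_sd` and `new_sd` and the data values are illustrative assumptions, not part of the original question.

```python
import math

def usual_sd(xs):
    """Population SD: square root of the mean squared deviation."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)

def new_sd(xs):
    """Student's proposal: square root of the sum of squares, divided by N."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs)) / n

# Hypothetical data, chosen so that the usual SD comes out as a round number.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma = usual_sd(data)
sigma_new = new_sd(data)

# The two quantities differ exactly by the factor sqrt(N).
print(sigma, sigma_new, sigma / math.sqrt(len(data)))
```

On this data the usual SD is 2, and the proposed quantity equals the usual SD divided by √N, which is the relationship the rest of the discussion relies on.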
This question relates to an older thread and answers provided here. Answers there go in three directions:
1. σ is the root-mean-squared (RMS) deviation, not the “typical” deviation from the mean (i.e., $\sigma_{\text{new}}$). Thus, it is defined differently.
2. It has nice mathematical properties.
3. Furthermore, the square root would bring back “units” to their original scale. However, this would also be the case for $\sigma_{\text{new}}$, which divides by N instead.
Points 1 and 2 are both arguments in favour of the s.d. as RMS, but I do not see an argument against the use of $\sigma_{\text{new}}$. What would be good arguments to convince introductory-level students to use the RMS distance σ from the mean?
Answer
There are at least three basic problems which can readily be explained to beginners:

1. The “new” SD is not even defined for infinite populations. (One could declare it always to equal zero in such cases, but that would not make it any more useful.)

2. The new SD does not behave the way an average should do under random sampling.

3. Although the new SD can be used with all mathematical rigor to assess deviations from a mean (in samples and finite populations), its interpretation is unnecessarily complicated.
1. The applicability of the new SD is limited
Point (1) could be brought home, even to those not versed in integration, by pointing out that because the variance clearly is an arithmetic mean (of squared deviations), it has a useful extension to models of “infinite” populations for which the intuition of the existence of an arithmetic mean still holds. Therefore its square root–the usual SD–is perfectly well defined in such cases, too, and just as useful in its role as a (nonlinear reexpression of) a variance. However, the new SD divides that average by the arbitrarily large √N, rendering problematic its generalization beyond finite populations and finite samples: what should 1/√N be taken to equal in such cases?
2. The new SD is not an average
Any statistic worthy of the name “average” should have the property that it converges to the population value as the size of a random sample from the population increases. Any fixed multiple of the SD would have this property, because the multiplier would apply both to computing the sample SD and the population SD. (Although not directly contradicting the argument offered by Alecos Papadopoulos, this observation suggests that argument is only tangential to the real issues.) However, the “new” SD, being equal to 1/√N times the usual one, obviously converges to 0 in all circumstances as the sample size N grows large. Therefore, although for any fixed sample size N the new SD (suitably interpreted) is a perfectly adequate measure of variation around the mean, it cannot justifiably be considered a universal measure applicable, with the same interpretation, for all sample sizes, nor can it correctly be called an “average” in any useful sense.
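The convergence claim can be checked with a small simulation. The following is only a sketch under stated assumptions: the population is taken to be standard Normal (so the population SD is 1), the sample SD uses the divisor N to match the formulas in the question, and the seed and sample sizes are arbitrary choices.

```python
import math
import random

def usual_sd(xs):
    """SD with divisor N, matching the definition used in the question."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)

def new_sd(xs):
    """The proposed quantity: the usual SD divided by sqrt(N)."""
    return usual_sd(xs) / math.sqrt(len(xs))

rng = random.Random(0)  # fixed seed so the run is reproducible
results = {}
for n in (10, 100, 10_000, 100_000):
    # Random sample from an assumed Normal(0, 1) population.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    results[n] = (usual_sd(sample), new_sd(sample))
    print(n, results[n])
```

As N grows, the first number in each pair should settle near 1 while the second shrinks toward 0, so the new SD does not estimate any fixed property of the population.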
3. The new SD is complicated to interpret and use
Consider taking samples of (say) size N=4. The new SD in these cases is 1/√N=1/2 times the usual SD. It therefore enjoys comparable interpretations: an analog of the 68-95-99 rule holds (about 68% of the data should lie within two new SDs of the mean, 95% of them within four new SDs of the mean, etc.); versions of classical inequalities such as Chebychev’s hold (no more than $1/k^2$ of the data can lie more than 2k new SDs away from their mean); and the Central Limit Theorem can be analogously restated in terms of the new SD (one divides by √N times the new SD in order to standardize the variable). Thus, in this specific and clearly constrained sense, there is nothing wrong with the student’s proposal. The difficulty, though, is that these statements all contain, quite explicitly, factors of √N=2. Although there is no inherent mathematical problem with this, it certainly complicates the statements and interpretation of the most fundamental laws of statistics.
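The extra factor can be seen in a short sketch that restates Chebychev’s inequality in units of the new SD for a data set of general size N, where the factor 2 above becomes √N. The helper name `fraction_beyond`, the exponential test data, and the seed are assumptions made purely for illustration.

```python
import math
import random

def new_sd(xs):
    """The proposed quantity: square root of the sum of squares, divided by N."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs)) / n

def fraction_beyond(xs, k):
    """Fraction of the data lying more than k * sqrt(N) new SDs from the mean.

    The explicit sqrt(N) is needed because k * sqrt(N) new SDs
    equal k usual SDs.
    """
    n = len(xs)
    mu = sum(xs) / n
    threshold = k * math.sqrt(n) * new_sd(xs)
    return sum(abs(x - mu) > threshold for x in xs) / n

rng = random.Random(1)  # fixed seed so the run is reproducible
data = [rng.expovariate(1.0) for _ in range(1000)]  # skewed, made-up data

for k in (1.5, 2, 3):
    # Chebychev: the observed fraction can never exceed 1 / k^2.
    print(k, fraction_beyond(data, k), 1 / k ** 2)
```

The inequality still holds, but only once the threshold is multiplied by √N, which is the complication the paragraph above describes.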
It is of note that Gauss and others originally parameterized the Gaussian distribution by $\sqrt{2}\,\sigma$, effectively using √2 times the SD to quantify the spread of a Normal random variable. This historical use demonstrates the propriety and effectiveness of using other fixed multiples of the SD in its stead.
Attribution
Source : Link , Question Author : tomka , Answer Author : Community