Measures of Dispersion
Chapter 4

Besides the mean, which characterizes the location of a population, we can also measure how the observations are distributed around the mean; that is, do most observations lie close to the mean, or are they spread far from it? This characteristic is called the dispersion of the population, and there are several ways to calculate it. Of course, in most cases we will have only a sample from a population, so we will calculate a sample statistic that estimates the population parameter measuring dispersion.


Range (chapter 4.1)

The range is simply the lowest value subtracted from the highest value (equation 4.1). Because it depends only on the two most extreme observations, the range is greatly affected by outliers and gives very little information about how the observations cluster around the mean. The sample range is a poor estimator of the population range and as such is rarely used. If it is reported, other measures of dispersion should also be reported.
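
As a small illustration of how sensitive the range is to extreme values, here is a minimal Python sketch. It uses the five bacterial colony counts from the variance section of these notes; the helper name sample_range and the outlier value 250 are hypothetical and are introduced only for this example.

```python
# Colony counts from the running example in these notes (100-ml water samples).
counts = [90, 73, 68, 87, 84]


def sample_range(values):
    """Return the highest value minus the lowest value."""
    return max(values) - min(values)


print(sample_range(counts))  # 22

# One extreme observation (250 is a made-up value, not part of the data)
# changes the range drastically, even though the other counts are unchanged.
print(sample_range(counts + [250]))  # 182
```

Only the minimum and maximum enter the calculation, which is why the range says so little about how the remaining observations cluster.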


Variance (chapter 4.4)

To calculate the variance, we must first calculate the squared deviations of each observation from the mean. We will again use our data of bacterial colonies from the 100-ml samples of water. Recall that we had already determined that the sample mean (Xbar) = 80.4.

Xi      Xi - Xbar     (Xi - Xbar)^2
90         9.6            92.16
73        -7.4            54.76
68       -12.4           153.76
87         6.6            43.56
84         3.6            12.96

The sum of all of the Xi - Xbar is 0.0 (section 4.3); however, the sum of all of the (Xi - Xbar)^2 is a positive number, equal to 357.2 in this case (equation 4.5). Because this value is based on a sample, it is called the sample sum of squares. This concept of a sum of squares (abbreviated SS) is very important, and we will use this term a few thousand times before this semester is over. Of course, if we had every observation that existed in a population, and therefore knew the population mean, then the sum of squared deviations from that mean would be called a population sum of squares (equation 4.4).
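
The arithmetic above can be checked with a short Python sketch. The variable names are chosen here for illustration; because of floating-point rounding, the sum of the deviations comes out as a value extremely close to zero rather than exactly zero, so the check uses a small tolerance.

```python
counts = [90, 73, 68, 87, 84]
n = len(counts)

mean = sum(counts) / n                    # Xbar
deviations = [x - mean for x in counts]   # Xi - Xbar
squared = [d ** 2 for d in deviations]    # (Xi - Xbar)^2
ss = sum(squared)                         # sample sum of squares (SS)

print(round(mean, 1))                     # 80.4
print(abs(sum(deviations)) < 1e-9)        # True: deviations sum to zero
print(round(ss, 1))                       # 357.2
```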

To calculate the population variance, we divide the population sum of squares by the population size (equation 4.7). This gives us the mean squared deviation (sometimes referred to as the mean square), or the variance. The population variance is denoted by sigma^2, as it is a population parameter.

It turns out that the best estimator of the population variance is obtained by dividing the sample sum of squares by (sample size - 1) (equation 4.8). This yields the sample variance, denoted s^2. Dividing by n - 1 rather than n yields an unbiased estimator of the population variance. The term (n - 1) is called the degrees of freedom. For our data, the sample variance is 357.2/4 = 89.3.
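
To make the difference between the two divisors concrete, the sketch below computes the sample variance by hand and cross-checks it against Python's standard statistics module. Treating the five counts as if they were an entire population, in the last step, is purely an assumption for illustration; in the notes they are a sample.

```python
import statistics

counts = [90, 73, 68, 87, 84]
n = len(counts)
mean = sum(counts) / n
ss = sum((x - mean) ** 2 for x in counts)   # sample sum of squares

# Sample variance: divide SS by the degrees of freedom, n - 1.
sample_var = ss / (n - 1)
print(round(sample_var, 2))                 # 89.3

# statistics.variance also divides by n - 1, so it should agree.
print(abs(statistics.variance(counts) - sample_var) < 1e-9)   # True

# Dividing by n instead (what statistics.pvariance does) gives a smaller
# value; this is only appropriate if the data are the whole population.
divided_by_n = ss / n
print(round(divided_by_n, 2))               # 71.44
```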

As the sum of squares can vary from zero (all observations are identical) to infinity, the variance itself can vary from zero to infinity. You can never have a variance with a negative value.


Standard Deviation (chapter 4.5)

The sum of squares has the units of the original observations squared. For example, in our case the observations and Xbar have the units counts, but the sum of squares and the resulting sample variance have the units counts^2. If we take the positive square root of the variance, we will have a measure that has the same units as the original observations. This term is called the sample standard deviation (equation 4.13) and it is denoted by s. For our data, s is the square root of 89.3, which is 9.45. Of course, the positive square root of the population variance yields the population standard deviation (equation 4.12).
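
A minimal Python sketch of this step, using the standard math and statistics modules, is given here. It takes the square root of the sample variance directly and then confirms that statistics.stdev, which also uses the n - 1 divisor, gives the same answer.

```python
import math
import statistics

counts = [90, 73, 68, 87, 84]

sample_var = statistics.variance(counts)   # s^2, in counts^2
s = math.sqrt(sample_var)                  # s, back in the original units

print(round(s, 2))                                    # 9.45
print(abs(statistics.stdev(counts) - s) < 1e-9)       # True
```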


Coefficient of Variation (chapter 4.6)

The coefficient of variation (CV) is the sample standard deviation divided by the sample mean, multiplied by 100 to convert it to a percentage (equation 4.16). This statistic is very useful when you wish to compare variability between populations whose means differ greatly. For instance, is there more or less variability (dispersion around the mean) in leg length in deer mice than in horses? Horses have much longer legs, so by their very nature they will have a larger standard deviation in leg length than mice. The CV, however, scales the standard deviation relative to the magnitude of the data. Thus it might be that deer mice have a higher CV (indicating more relative variability) in leg length than do horses. For our data on bacterial counts, the CV is (9.45/80.4)*100 = 11.75%.
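
The sketch below computes the CV for the bacterial counts and shows why it suits comparisons across very different magnitudes. The function name coefficient_of_variation is a hypothetical helper, and multiplying every count by 10 is an artificial rescaling used only to show that the standard deviation grows with the scale of the data while the CV does not.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


counts = [90, 73, 68, 87, 84]
cv = coefficient_of_variation(counts)
print(round(cv, 2))                          # 11.75

# Artificially rescale the data: every value is 10 times larger.
scaled = [x * 10 for x in counts]
print(round(statistics.stdev(scaled), 1))    # 94.5 (ten times larger)
print(round(coefficient_of_variation(scaled), 2))   # 11.75 (unchanged)
```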

Do problems 4.1 and 4.2 at the end of chapter 4 (page 47).

Last updated on 28 August 2000.
Provide comments to Dwight Moore at mooredwi@emporia.edu.