Besides the mean, which characterizes the location of a population, we also need a way to measure
how the observations are distributed around the mean; that is, do most observations
lie close to the mean, or are they spread far from it? This characteristic is called the
dispersion of the population, and there are several ways in which it can be calculated. Of course,
in most cases we will only have a sample from a population, so we will calculate a sample statistic
that estimates the population parameter that is itself the measure of dispersion.
Range (chapter 4.1)
The range is simply the lowest value subtracted from the highest value (equation 4.1). The range
is greatly affected by outliers and gives very little information about how the observations
cluster around the mean. Because a sample is unlikely to contain both the smallest and the largest
values in the population, the sample range tends to underestimate the population range. It is
therefore a poor estimator and is rarely used. If it is reported, other measures of dispersion
should also be reported.
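As a minimal sketch, the range can be computed in a couple of lines. Python is our choice here for illustration (these notes do not prescribe a language), and the variable names are hypothetical. The data are the five bacterial colony counts from our 100-ml water samples.

```python
# Colony counts per 100-ml water sample (the five observations used in these notes).
counts = [90, 73, 68, 87, 84]

# Range = highest value minus lowest value.
sample_range = max(counts) - min(counts)

print(sample_range)  # 22
```

Note that only two of the five observations enter the calculation, which is why the range says so little about how the rest of the data cluster around the mean.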
Variance (chapter 4.4)
To calculate the variance, we must first calculate the squared deviations of each observation from
the mean. We will again use our data of bacterial colonies from the 100-ml samples of water.
Recall that we had already determined that the sample mean (Xbar) = 80.4.

Xi      Xi - Xbar    (Xi - Xbar)^2
90         9.6           92.16
73        -7.4           54.76
68       -12.4          153.76
87         6.6           43.56
84         3.6           12.96

The sum of all of the Xi - Xbar is 0.0 (section 4.3); however, the sum of
all of the (Xi - Xbar)^2 is a positive number, equal to 357.2 in this case (equation 4.5).
Because this value is based on a sample, it is called the sample sum of squares. This
concept of a sum of squares (abbreviated SS) is very important, and we will use this term a few
thousand times before this semester is over. Of course, if we had every observation that existed
in a population, so that we knew the population mean, then this sum of squares would be called
a population sum of squares (equation 4.4).
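The deviations and the sample sum of squares for our colony counts can be reproduced with a short Python sketch. The language and the variable names are our own choices for illustration, not part of the textbook.

```python
# Colony counts per 100-ml water sample.
counts = [90, 73, 68, 87, 84]
n = len(counts)

# Sample mean (Xbar).
mean = sum(counts) / n

# Deviation of each observation from the mean, and its square.
deviations = [x - mean for x in counts]
squared_deviations = [d ** 2 for d in deviations]

# The deviations sum to zero (apart from floating-point rounding error);
# the squared deviations sum to the sample sum of squares (SS).
sum_of_deviations = sum(deviations)
sample_ss = sum(squared_deviations)

print(round(sample_ss, 2))  # 357.2
```

Because computers store decimal fractions only approximately, the sum of the deviations may come out as a tiny number such as 1e-14 rather than exactly zero.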
To calculate the population variance, we divide the population sum of squares by the population
size (equation 4.7). This gives us a term called the mean squared deviation (sometimes referred
to as the mean square) or the variance. The population variance is denoted by sigma^2, as
this is a population parameter.
It turns out that the best estimator of the population variance is obtained by dividing the sample
sum of squares by (sample size - 1), that is, by n - 1 (equation 4.8). This yields the sample
variance, denoted s^2. Dividing by n - 1 rather than by n yields an unbiased estimator of the
population variance. The term (n - 1) is called the degrees of freedom. For our data, n = 5, so the
sample variance is
s^2 = 357.2/(5 - 1) = 357.2/4 = 89.3
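The difference between dividing by n - 1 and dividing by n can be seen in a short Python sketch (our choice of language; the variable names are hypothetical). The functions variance and pvariance come from Python's standard statistics module. Treating our five counts as an entire population is done here only for contrast, since they are really a sample.

```python
import statistics

# Colony counts per 100-ml water sample.
counts = [90, 73, 68, 87, 84]
n = len(counts)
mean = sum(counts) / n

# Sum of squares: the sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in counts)

# Sample variance: SS divided by the degrees of freedom (n - 1).
sample_variance = ss / (n - 1)

# For contrast only: SS divided by n, as if these five counts
# were the whole population.
variance_dividing_by_n = ss / n

print(round(sample_variance, 1))         # 89.3
print(round(variance_dividing_by_n, 2))  # 71.44

# The standard library gives the same two quantities.
library_sample_variance = statistics.variance(counts)       # divides by n - 1
library_population_variance = statistics.pvariance(counts)  # divides by n
```

Dividing by n gives the smaller value, which is why it tends to underestimate the population variance when applied to a sample.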
As the sum of squares can vary from zero (all observations are identical) to infinity, the variance
itself can vary from zero to infinity. You can never have a variance with a negative value.
Standard Deviation (chapter 4.5)
The sum of squares has the units of the original observations squared. For example, in our case
the observations and Xbar are in units of counts, but the sum of squares and the resulting sample
variance are in units of counts^2. If we take the positive square root of the variance,
we will have a measure that has the same units as the original observations. This term is called
the sample standard deviation (equation 4.13) and it is denoted by s. For our data, s is the
square root of 89.3, which is 9.45. Of course, the positive square root of the population variance
yields the population standard deviation, denoted by sigma (equation 4.12).
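As a quick check of this arithmetic, here is a minimal Python sketch (language and variable names are our own choices). It takes the square root of the sample variance directly and also calls statistics.stdev, the standard-library function that performs both steps at once.

```python
import math
import statistics

# Colony counts per 100-ml water sample.
counts = [90, 73, 68, 87, 84]

# Sample variance (n - 1 in the denominator), in units of counts squared.
sample_variance = statistics.variance(counts)

# Sample standard deviation: the positive square root, back in units of counts.
sample_sd = math.sqrt(sample_variance)

print(round(sample_sd, 2))  # 9.45

# statistics.stdev computes the same quantity in one call.
library_sd = statistics.stdev(counts)
```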
Coefficient of Variation (chapter 4.6)
The coefficient of variation (CV) is the sample standard deviation divided by the sample mean,
then multiplied by 100 to convert it to a percentage (equation 4.16). This statistic is very useful
when you wish to compare variability between very different populations. For instance, is
there more or less variability (dispersion around the mean) in leg length in deer mice as
compared to horses? Horses have much longer legs, so by their very nature they will have a
larger standard deviation in leg length than mice.
However, the CV allows you to scale the standard deviation relative to the magnitude of the data.
Thus it might be that deer mice have a higher CV (indicating more variability) in leg length than
do horses. For our data on bacterial counts, the CV is equal to (9.45/80.4)*100 = 11.75%.
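The CV calculation for our bacterial counts can be sketched in Python as follows. The helper name coefficient_of_variation is hypothetical (it is not a standard-library function); it simply wraps statistics.stdev and statistics.mean.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


# Colony counts per 100-ml water sample.
counts = [90, 73, 68, 87, 84]

cv = coefficient_of_variation(counts)
print(round(cv, 2))  # 11.75
```

Because the standard deviation and the mean carry the same units, the CV is unitless, which is what makes it comparable between mice and horses.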
Do problems 4.1 and 4.2 at the end of chapter 4 (page 47).