Measures of Location
One of the most important lessons to be learned from this course is when certain statistics are
appropriate and when they are not. Reporting every number that a computer program spits out is a
clear sign that someone does not understand statistics. I will stress the why and when of each
statistic on tests.
Statistics of Location (also called measures of central tendency)
When a set of data has been collected, one of the first steps is to summarize that data. This can
be done with a frequency distribution as shown previously; however, a numerical summary of the
data is often needed or desired. These summaries are referred to as descriptive statistics and they
are divided into two categories, statistics of location and statistics of dispersion. Statistics of
location summarize the "central" point of the data along a number line, while statistics of
dispersion summarize how the observations are distributed about that "central" point.
Arithmetic Mean (often simply called the Mean or Average) (chapter 3.1)
The calculation of the arithmetic mean is very simple. You simply sum all of the observations
and then divide by the number of observations. For example, let us suppose that I collected
100-ml samples of water from 5 different sources. I then grew bacterial colonies from these water
samples and counted the number of colonies. I obtained observations of 90, 78, 63, 84, and 87.
To calculate the arithmetic mean I would sum across all observations (402) and then divide by
the number of observations (5), to yield an arithmetic mean of 402/5 = 80.4 (equation 3.2).
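As a quick check of the arithmetic, the calculation above can be sketched in Python. The variable names are illustrative only and are not part of the course text.

```python
# Bacterial colony counts from the five water samples in the text.
colonies = [90, 78, 63, 84, 87]

total = sum(colonies)   # sum across all observations
n = len(colonies)       # number of observations
mean = total / n        # arithmetic mean

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")  # sum = 402, n = 5, mean = 80.4
```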
If X is the letter designated to represent our variable, then X with a bar over it (X bar)
represents the sample mean of our observations. For example, if H represents heterozygosity at one
locus, then H bar would represent the sample mean heterozygosity across all loci in our study. Some
books, journals, and other scientific publications prefer this bar notation as the designation
for the mean of the observations.
The mean represents the center of the observations in a sample; a frequency distribution will
exactly balance at the mean. The mean is also very easy to calculate and has several other
properties that lend it to inferential statistics, which we will cover later. For these reasons, the
mean is the most commonly reported statistic of location. One problem with the mean, though, is
that extreme values will greatly influence its value. Thus, if there is one value that is much
smaller or larger than the other values in a sample, the mean may not be the most appropriate
statistic.
Sometimes we wish to compute a mean value and what we have is a series of means, where each mean
in the series is based upon a different number of observations (sample size). In these cases, we
calculate what is called a weighted mean, in that the individual means are weighted in the
calculation by their sample size. For example, suppose that we wanted to calculate the mean
body weight of raccoons living in the Great Plains and we want to do this from means that have
been reported in the literature.
23 raccoons from North Dakota had a mean weight of 11.2 kg
7 raccoons from Nebraska had a mean weight of 9.2 kg
19 raccoons from Kansas had a mean weight of 7.7 kg
14 raccoons from Oklahoma had a mean weight of 6.5 kg
Using a variation of equation 3.3, each mean is multiplied by its corresponding sample size and
then those products are summed. The result is then divided by the sum of the sample sizes.
[(23*11.2) + (7*9.2) + (19*7.7) + (14*6.5)] / (23 + 7 + 19 + 14) = 559.3 / 63 = 8.88
Note that the sample from North Dakota has the largest effect on the weighted mean because it
has the largest sample size. If you had simply added the 4 means together and divided by 4, the
answer would have been 8.65, which would have been wrong.
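The raccoon calculation can be sketched in a few lines of Python. The helper name weighted_mean and the dictionary layout are assumptions made for illustration; the numbers are the ones listed above.

```python
# (sample size, mean weight in kg) for each state, as listed in the text.
samples = {
    "North Dakota": (23, 11.2),
    "Nebraska": (7, 9.2),
    "Kansas": (19, 7.7),
    "Oklahoma": (14, 6.5),
}

def weighted_mean(pairs):
    """Return sum(n_i * mean_i) / sum(n_i) for (n_i, mean_i) pairs."""
    pairs = list(pairs)
    total_n = sum(n for n, _ in pairs)
    weighted_sum = sum(n * m for n, m in pairs)
    return weighted_sum / total_n

wm = weighted_mean(samples.values())

# The naive (and here inappropriate) mean of the four means, for comparison.
naive = sum(m for _, m in samples.values()) / len(samples)

print(f"weighted mean   = {wm:.2f} kg")     # 8.88
print(f"unweighted mean = {naive:.2f} kg")  # 8.65
```

The two results differ because the unweighted version treats the 7 Nebraska raccoons as if they carried as much information as the 23 from North Dakota.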
Equation 3.3 is also used to calculate the mean when the data are presented in the form of a frequency distribution, such as we saw in the previous chapter.
Geometric Mean (chapter 3.5)
A geometric mean might be an appropriate measure of location if the data are arranged along a
logarithmic scale, for example pH or growth. Though you may rarely see reference to the
geometric mean, you should know that it exists and what it means.
To calculate the geometric mean, you multiply all of the observations together and then take the
nth root of the result (equation 3.12). As you may notice, this becomes computationally quite
difficult, as you will soon be dealing with very large numbers, and taking the nth root of a number
can also be quite troublesome. In practice, values are first converted to natural logarithms
(base e, although logarithms to any base will work), added together, then the sum is divided by n,
and then the antilogarithm using the same base is taken of that result (equation 3.13).
| Xi|| ln(Xi)|
| 90|| 4.50|
| 78|| 4.36|
| 63|| 4.14|
| 84|| 4.43|
| 87|| 4.47|
sum of ln(Xi) = 21.9
21.9/n = 21.9/5 = 4.38
geometric mean = antiln 4.38 = 79.8
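The logarithm route and the direct nth-root definition can be compared with a short Python sketch. It uses only the standard math module (math.prod requires Python 3.8 or later); the variable names are illustrative.

```python
import math

colonies = [90, 78, 63, 84, 87]
n = len(colonies)

# Log route: mean of the natural logs, then the antilog.
logs = [math.log(x) for x in colonies]
mean_log = sum(logs) / n
geometric_mean = math.exp(mean_log)

# Direct definition: nth root of the product of the observations.
direct = math.prod(colonies) ** (1 / n)

print(f"sum of ln(Xi)  = {sum(logs):.1f}")        # 21.9
print(f"geometric mean = {geometric_mean:.1f}")   # 79.8
print(f"direct nth root = {direct:.1f}")          # 79.8
```

With only five observations both routes are easy; the log route matters when the product would become too large to handle comfortably.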
Harmonic Mean (chapter 3.5)
A harmonic mean often shows up in the study of population size in relation to genetic drift or
when doing dilutions.
To calculate the harmonic mean, convert all observations to reciprocals of themselves. Add the
reciprocals together and divide that sum by the sample size. Then take the reciprocal of that
result (use equation 3.14).
| Xi|| 1/Xi|
| 90|| 0.0111|
| 78|| 0.0128|
| 63|| 0.0159|
| 84|| 0.0119|
| 87|| 0.0115|
sum of 1/Xi = 0.0632
0.0632/n = 0.0632/5 = 0.0126
harmonic mean = 1/0.0126 = 79.1
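The same three steps (reciprocals, their mean, then the reciprocal of that mean) are sketched below in Python. The cross-check uses harmonic_mean from the standard statistics module; the other names are illustrative.

```python
from statistics import harmonic_mean

colonies = [90, 78, 63, 84, 87]

reciprocals = [1 / x for x in colonies]
mean_reciprocal = sum(reciprocals) / len(colonies)
hm = 1 / mean_reciprocal

print(f"sum of 1/Xi   = {sum(reciprocals):.4f}")  # 0.0632
print(f"harmonic mean = {hm:.1f}")                # 79.1

# The standard library gives the same value.
print(f"library value = {harmonic_mean(colonies):.1f}")  # 79.1
```

Note that carrying full precision gives 79.1, whereas dividing 1 by the rounded 0.0126 on a calculator gives about 79.4, so avoid rounding until the last step.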
Unless all of the observations in a data set have the same value, the harmonic mean is less than
the geometric mean, which is less than the arithmetic mean. For the vast majority of cases that
you will run across in biology, you will use the arithmetic mean. The other two means are only
applied in very specific situations. Henceforth, unless specified differently, mean refers to the
arithmetic mean.
Median (chapter 3.2)
The median is defined as the value that has an equal number of observations on either side of it.
It divides a frequency distribution in half relative to the number of observations. For example,
in our sample of bacterial colonies (90, 87, 84, 78, 63) the median is 84. It is the observation
that has exactly the same number of observations above it as below. In this case, there is an odd
number of observations, so the median will always be one of the observations. The formula would be:
((n+1)/2)th observation (equation 3.4; species A in example 3.3).
If there were an even number of observations, there is no one observation that fits the criterion of
having an equal number of observations larger as there are smaller. In this case the value must
be calculated by averaging the middle two observations (species B in example 3.3).
The formula is:
median = [(n/2)th observation + ((n/2)+1)th observation] / 2
Thus, if we had the observations (90, 87, 84, 78, 63, and 61), then n = 6 and
median = (3rd + 4th) / 2 = (84 + 78) / 2 = 81
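Both rules (odd n and even n) can be sketched as a single small Python function. The function name median is an illustrative choice; it sorts the data first, since the observations must be in order.

```python
def median(values):
    """Middle observation for odd n; average of the two middle ones for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # ((n+1)/2)th observation, counting from 1
        return ordered[mid]
    # average of the (n/2)th and ((n/2)+1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([90, 87, 84, 78, 63]))      # 84
print(median([90, 87, 84, 78, 63, 61]))  # 81.0
```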
You might also note that the data must be put in order from highest to lowest, or vice versa, to
determine the median. This can be quite time consuming for a large data set and thus the median
is computationally more troublesome than the mean. You might also note that the median is
unaffected by a single (or even several) very large or small numbers. For example, in the above example with 5
observations, if I changed the 90 to 1090, the median stays the same. However, the mean, which
was 80.4 is now 280.4. Thus the median is often a more appropriate value when a few extreme
values are greatly influencing the mean. For example, salaries at large corporations are often
reported as medians because the salary of the CEO would often greatly inflate the mean and
give an unrealistic impression of the salaries of the employees in a corporation.
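The effect of a single extreme value is easy to see with the standard statistics module. This is a minimal sketch using the colony counts from the example, with 90 replaced by 1090.

```python
from statistics import mean, median

original = [90, 78, 63, 84, 87]
with_outlier = [1090, 78, 63, 84, 87]  # the 90 changed to 1090

print(mean(original), median(original))          # 80.4 84
print(mean(with_outlier), median(with_outlier))  # 280.4 84
```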
Quantiles (chapter 3.3)
The median is one example of a quantile. Other common quantiles are the three quartiles. These
values divide a frequency distribution into fourths.
1st quartile is the value with 25% below and 75% above.
2nd quartile is the value with 50% below and 50% above (same as the median).
3rd quartile is the value with 75% below and 25% above.
These values are determined in a manner similar to that of the median.
Other common quantiles are percentiles. SATs, ACTs, and GREs often give percentile rankings of a
person's score. For example, 80th percentile means that 80% of the people taking
the test scored below that value and 20% of the people scored above that value.
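A minimal Python sketch of quartiles and a percentile ranking follows. It uses quantiles from the standard statistics module (Python 3.8 or later). Be aware that several conventions exist for interpolating quartiles in small samples, so the 1st and 3rd quartiles from software may differ slightly from a hand calculation based on the textbook; the 2nd quartile should always match the median. The helper percent_below and the list of scores are hypothetical, made up for illustration.

```python
from statistics import median, quantiles

colonies = [90, 78, 63, 84, 87]

# n=4 asks for the three cut points that split the data into fourths.
q1, q2, q3 = quantiles(colonies, n=4)
print(q1, q2, q3)
print(q2 == median(colonies))  # True: the 2nd quartile is the median

def percent_below(scores, value):
    """Percentage of scores strictly below the given value (hypothetical helper)."""
    return 100 * sum(s < value for s in scores) / len(scores)

# Made-up test scores, for illustration only.
scores = [410, 450, 480, 500, 520, 540, 560, 600, 640, 700]
print(percent_below(scores, 640))  # 80.0, i.e. 640 sits at the 80th percentile
```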
Mean or Median from frequency data
Often you may have access to a frequency distribution that portrays the data and you would
like to calculate the mean or median from the data. This is a straightforward procedure. Use the
data below to calculate the mean and the median.
|class marks||frequency||cumulative frequency|
| 7 || 2 || 2 |
| 8 || 4 || 6 |
| 9 || 10 || 16 |
| 10 || 12 || 28 |
| 11 || 9 || 37 |
| 12 || 3 || 40 |
| 13 || 1 || 41 |
The mean is calculated using equation 3.3, essentially as you did for the weighted mean. The
class marks represent the measurements and the heights of the bars represent the frequencies.
Applying equation 3.3 would yield
[(7*2) + (8*4) + (9*10) + (10*12) + (11*9) + (12*3) + (13*1)] / (2 + 4 + 10 + 12 + 9 + 3 + 1) = 404 / 41 = 9.85
The median is calculated using equation 3.5. The idea is this. There are 41 observations and the
(n/2)th (20.5th) observation lies somewhere in the class with the class
mark of 10. Note that the cumulative frequency is 16 through the class with mark 9 and is 28
through the class with mark 10, thus the 20.5th must lie in the class with mark 10.
The lower implied limit of the class with mark 10 is 9.50. What we will do now is determine
how far through the class 10 interval we must go to get to the 20.5th value, and we
will add this distance to the lower implied class limit for class 10. Again using equation 3.5,
this gives
9.5 + ((20.5 - 16)/12)*1 = 9.5 + 0.375*1 = 9.875.
The median is 9.875.
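Both frequency-table calculations can be sketched in Python. The function name grouped_median and the list-of-pairs layout are assumptions for illustration; the sketch also assumes equal class widths of 1, so that the lower implied limit of a class is its mark minus 0.5.

```python
# (class mark, frequency) pairs from the table above.
table = [(7, 2), (8, 4), (9, 10), (10, 12), (11, 9), (12, 3), (13, 1)]
class_width = 1  # assumed: every class spans mark - 0.5 to mark + 0.5

n = sum(freq for _, freq in table)
grouped_mean = sum(mark * freq for mark, freq in table) / n

def grouped_median(table, width):
    """Interpolate the (n/2)th value within the class that contains it."""
    n = sum(freq for _, freq in table)
    target = n / 2
    cumulative = 0  # cumulative frequency below the current class
    for mark, freq in table:
        if cumulative + freq >= target:
            lower_limit = mark - width / 2
            return lower_limit + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequency table is empty")

print(f"n = {n}")                                          # 41
print(f"mean = {grouped_mean:.2f}")                        # 9.85
print(f"median = {grouped_median(table, class_width)}")    # 9.875
```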
Mode (chapter 3.4)
The mode is simply the most common observation in the data. If there are two most common
values then the distribution is said to be bimodal and it has two separate peaks. The mode is not
often used as it contains very little usable information and thus you rarely see it reported in the
literature.
Some additional thoughts
The mean is by far the most commonly reported statistic of location. It has a smaller standard
error, a distribution of sample means tends to be normally distributed, and it is easy to work with
mathematically. The disadvantages are that it is greatly affected by extreme values and that, if
there are any missing values, you cannot calculate a mean.
The median (as well as quantiles in general) is less commonly reported. Its advantage is that
it is not greatly affected by outliers, nor is it as sensitive to the shape of the distribution.
The median can often be calculated when there are missing or incomplete values.
The mode is rarely used because of the paucity of information that it conveys.
If a distribution is symmetrical and unimodal, then the mean, median and mode will have the
same value. If the distribution is skewed to the right (positively skewed), that is, the right
tail of the distribution extends farther than the left tail, then typically the mode is less than
the median and the median is less than the mean (mode < median < mean). If the distribution is
skewed in the other direction (left or negatively skewed), then the mode > median > mean. To
help you remember this, you might note that in a left-skewed distribution the three statistics run
in alphabetical order from smallest to largest (mean, median, mode). Also, if you
remember that the mean is affected by extreme values, then you will realize that it is pulled
toward the longer of the two tails of the distribution. We will discuss the concepts of symmetry
and skewness much more when we get to chapter 6.
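The ordering of the three statistics under skew can be illustrated with two small made-up data sets, one with a long right tail and its mirror image with a long left tail. The data are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

# Mirror image (15 - x) gives the matching left-skewed data set.
left_skewed = [15 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mode =", mode(data), "median =", median(data), "mean =", mean(data))

# right: mode = 2  < median = 3.0  < mean = 4.5
# left:  mean = 10.5 < median = 12.0 < mode = 13
```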
Sample Statistics versus Population Parameters (chapters 2.4 and 3.1)
For a population there is one true (but usually unknown) mean. This value is calculated using
equation 3.1. Note that the only difference between this equation and the one that we used for
the arithmetic mean is that N is in the denominator in equation 3.1. N is equal to the entire
number of observations in the population, whereas n is the size of a sample. In the vast majority
of cases, we cannot measure every member of a population and thus we must take only a sample.
Equation 3.2 gives the best estimate of the population mean given our sample. Thus from the
sample, we calculate a sample mean, which is an unbiased, efficient, and consistent estimator of
the population mean. The population mean is a parameter of the population distribution and thus
is one of many population parameters that we might try to estimate from our sample. It is
important when we are presenting statistics that we keep our designation for the population mean
separate from our designation for the sample mean as these numbers are not likely to be the
same (though we hope that they are close). The convention in statistics is to use Greek letters as
designators of population parameters and to use Latin (Roman) letters as designators of sample
statistics. Thus the population mean is designated with the lower case mu (µ) and the sample mean
is designated with an X with a bar over it (X bar).
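What "unbiased" means can be illustrated with a tiny made-up population: individual sample means differ from µ, but the average of the sample means over every possible sample equals µ. The six population values below are hypothetical, and the names mu and sample_means are illustrative.

```python
from itertools import combinations
from math import isclose

# Hypothetical population of N = 6 values (in practice N is far too large to list).
population = [90, 78, 63, 84, 87, 72]
mu = sum(population) / len(population)  # population mean, N in the denominator

# Every possible sample of size n = 3, and the sample mean (X bar) of each.
n = 3
sample_means = [sum(s) / n for s in combinations(population, n)]

print(f"mu = {mu}")
print(f"smallest X bar = {min(sample_means):.1f}, largest X bar = {max(sample_means):.1f}")

# Any one X bar may miss mu, but on average the sample mean hits it.
average_of_sample_means = sum(sample_means) / len(sample_means)
print(isclose(average_of_sample_means, mu))  # True
```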
Do problems 3.1, 3.2, 3.3, and 3.4 at the end of chapter 3 (page 31).
Last updated on 25 August 2000.
Provide comments to Dwight Moore at email@example.com.
Return to the RDA Home Page at Emporia State University.