Measures of Location
Chapter 3
One of the most important lessons to be learned from this course is when certain statistics are appropriate and when they are not. Reporting every number that a computer program spits out is a clear sign that someone does not understand statistics. I will stress the why and the when on tests.


Statistics of Location (also called measures of central tendency)

When a set of data has been collected, one of the first steps is to summarize that data. This can be done with a frequency distribution as shown previously; however, a numerical summary of the data is often needed or desired. These summaries are referred to as descriptive statistics, and they are divided into two categories: statistics of location and statistics of dispersion. Statistics of location summarize the "central" point of the data along a number line, while statistics of dispersion summarize how the observations are distributed about that "central" point.


Arithmetic Mean (often simply called the Mean or Average) (chapter 3.1)

The calculation of the arithmetic mean is very simple. You sum all of the observations and then divide by the number of observations (equation 3.2). For example, let us suppose that I collected 100-ml samples of water from 5 different sources. I then grew bacterial colonies from these water samples and counted the number of colonies. I obtained observations of 90, 78, 63, 84, and 87. To calculate the arithmetic mean I would sum across all observations (402) and then divide by the number of observations (5), to yield an arithmetic mean of 402/5 = 80.4.
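
As a minimal sketch of the calculation above, the same arithmetic can be written in a few lines of Python. The variable names are only illustrative and do not come from the textbook.

```python
# Colony counts from the five 100-ml water samples in the text.
colonies = [90, 78, 63, 84, 87]

total = sum(colonies)   # sum of all observations
n = len(colonies)       # number of observations
mean = total / n        # arithmetic mean (equation 3.2)

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")
```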

If X is the letter designated to represent our variable, then X with a bar over it (X bar) represents the sample mean of our observations. For example, if H represents heterozygosity at one locus, then H bar represents the sample mean heterozygosity across all loci in our study. Some books, journals, and other scientific publications prefer X bar as the designation for the mean of the observations regardless of the letter used for the variable.

The mean represents the center of the observations in a sample; a frequency distribution will exactly balance at the mean. The mean is also very easy to calculate and has several other properties that lend it to inferential statistics, which we will cover later. For these reasons, the mean is the most commonly reported statistic of location. One problem with the mean, though, is that extreme values greatly influence it. Thus, if there is one value that is much smaller or larger than the other values in a sample, the mean may not be the most appropriate statistic.

Sometimes we wish to compute a mean value and what we have is a series of means, where each mean in the series is based upon a different number of observations (sample size). In these cases, we calculate what is called a weighted mean, in that the individual means are weighted in the calculation by their sample sizes. For example, suppose that we wanted to calculate the mean body weight of raccoons living in the Great Plains and we want to do this from means that have been reported in the literature.

23 raccoons from North Dakota had a mean weight of 11.2 kg
7 raccoons from Nebraska had a mean weight of 9.2 kg
19 raccoons from Kansas had a mean weight of 7.7 kg
14 raccoons from Oklahoma had a mean weight of 6.5 kg

Using a variation of equation 3.3, each mean is multiplied by its corresponding sample size and those products are summed. The result is then divided by the sum of the sample sizes.
(23*11.2) + (7*9.2) + (19*7.7) + (14*6.5)                 559.3
----------------------------------------------------  =  -------  =  8.88
                    (23 + 7 + 19 + 14)                      63
Note that the sample from North Dakota has the largest effect on the weighted mean because it has the largest sample size. If you had simply added the 4 means together and divided by 4, the answer would have been 8.65, which would have been wrong.
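
The weighted mean above can be checked with a short Python sketch. The dictionary layout and variable names are assumptions made for illustration; only the sample sizes and means come from the text.

```python
# (sample size, reported mean weight in kg) for each state in the text.
samples = {
    "North Dakota": (23, 11.2),
    "Nebraska": (7, 9.2),
    "Kansas": (19, 7.7),
    "Oklahoma": (14, 6.5),
}

# Weight each mean by its sample size, then divide by the total sample size.
total_n = sum(n for n, _ in samples.values())
weighted_sum = sum(n * m for n, m in samples.values())
weighted_mean = weighted_sum / total_n

# For comparison only: the unweighted mean of the four means.
unweighted_mean = sum(m for _, m in samples.values()) / len(samples)

print(f"weighted mean   = {weighted_mean:.2f} kg")
print(f"unweighted mean = {unweighted_mean:.2f} kg")
```

The two results differ because the unweighted version treats the 7 Nebraska raccoons as if they carried as much information as the 23 from North Dakota.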

Equation 3.3 is also used to calculate the mean when the data are presented in the form of a frequency distribution, such as we saw in the previous chapter.


Geometric Mean (chapter 3.5)

A geometric mean might be an appropriate measure of location if the data are arranged along a logarithmic scale, for example pH or growth. Though you may rarely see reference to the geometric mean, you should know that it exists and what it means.

To calculate the geometric mean, you multiply all of the observations together and then take the nth root of the result (equation 3.12). As you may notice, this becomes computationally quite difficult, because you will soon be dealing with very large numbers, and taking the nth root of a number can also be quite troublesome. In practice, the values are first converted to natural logarithms (base e, although logarithms to any base will work), the logarithms are added together, the sum is divided by n, and then the antilogarithm of that result is taken using the same base (equation 3.13).
Xi      ln(Xi)
90      4.50
78      4.36
63      4.14
84      4.43
87      4.47

sum of ln(Xi) = 21.9
21.9/n = 21.9/5 = 4.38
geometric mean = antiln 4.38 = 79.8
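
A minimal Python sketch of the logarithm method follows, with the direct nth-root calculation alongside it as a cross-check. The variable names are illustrative. Because the code carries full precision rather than the rounded values in the table, its result can differ slightly in the last digit from a hand calculation.

```python
import math

# Colony counts used throughout this chapter's examples.
colonies = [90, 78, 63, 84, 87]
n = len(colonies)

# Log method (equation 3.13): mean of the natural logs, then the antilog.
logs = [math.log(x) for x in colonies]
mean_log = sum(logs) / n
geometric_mean = math.exp(mean_log)

# Direct method (equation 3.12): nth root of the product of the observations.
direct = math.prod(colonies) ** (1 / n)

print(f"mean of ln(Xi)  = {mean_log:.2f}")
print(f"geometric mean  = {geometric_mean:.1f}")
print(f"nth-root method = {direct:.1f}")
```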


Harmonic Mean (chapter 3.5)

A harmonic mean often shows up in the study of population size in relation to genetic drift or when doing dilutions.

To calculate the harmonic mean, convert all observations to reciprocals of themselves. Add the reciprocals together and divide that sum by the sample size. Then take the reciprocal of that result (use equation 3.14).
Xi      1/Xi
90      0.0111
78      0.0128
63      0.0159
84      0.0119
87      0.0115

sum of 1/Xi = 0.0632
0.0632/n = 0.0632/5 = 0.0126
harmonic mean = 1/0.0126 = 79.1
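
The reciprocal steps can be sketched in Python as follows. The hand-rolled version mirrors the table, and the standard library's statistics.harmonic_mean is used only as a cross-check. Variable names are illustrative.

```python
import statistics

# Colony counts used throughout this chapter's examples.
colonies = [90, 78, 63, 84, 87]

# Equation 3.14: reciprocal of the mean of the reciprocals.
reciprocals = [1 / x for x in colonies]
mean_reciprocal = sum(reciprocals) / len(reciprocals)
harmonic_mean = 1 / mean_reciprocal

# Cross-check against the standard library implementation.
library_value = statistics.harmonic_mean(colonies)

print(f"sum of 1/Xi   = {sum(reciprocals):.4f}")
print(f"harmonic mean = {harmonic_mean:.1f}")
```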

Unless all of the observations in a data set have the same value, the harmonic mean is less than the geometric mean, which is less than the arithmetic mean. In our example, 79.1 < 79.8 < 80.4. For the vast majority of cases that you will run across in biology, you will use the arithmetic mean. The other two means are only applied in very specific situations. Henceforth, unless specified differently, mean refers to the arithmetic mean.


Median (chapter 3.2)

The median is defined as the value that has an equal number of observations on either side of it. It divides a frequency distribution in half relative to the number of observations. For example, in our sample of bacterial colonies (90, 87, 84, 78, 63) the median is 84. It is the observation that has exactly the same number of observations above it as below. In this case, there is an odd number of observations, so the median will always be one of the observations. The formula would be:
((n+1)/2)th observation (equation 3.4; species A in example 3.3).

If there is an even number of observations, no single observation fits the criterion of having as many observations larger than it as smaller. In this case the value must be calculated by averaging the middle two observations (species B in example 3.3). The formula is:
 ((n/2)+1)th + (n/2)th
 --------------------
          2

Thus, if we had the observations (90, 87, 84, 78, 63, and 61), the median would be:

   4th + 3rd
  -----------
       2    

    84 + 78
   ---------  =  81
       2
You might also note that the data must be put in order from highest to lowest, or vice versa, to determine the median. This can be quite time consuming for a large data set, and thus the median is computationally more troublesome than the mean. You might also note that the median is unaffected by a single (or even several) very large or small numbers. For example, in the above example with 5 observations, if I changed the 90 to 1090, the median stays the same. However, the mean, which was 80.4, is now 280.4. Thus the median is often a more appropriate value when a few extreme values are greatly influencing the mean. For example, salaries at large corporations are often reported as medians because the salary of the CEO would often greatly inflate the mean and give an unrealistic impression of the salaries of the employees in a corporation.
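
The two median formulas and the effect of an extreme value can be sketched in Python. The function name median is a hypothetical helper written for this illustration. Note that Python lists are indexed from zero, so the ((n+1)/2)th observation sits at index n // 2 when n is odd.

```python
def median(values):
    """Return the median: the middle value, or the mean of the middle two."""
    ordered = sorted(values)      # the data must be put in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd n: the ((n+1)/2)th observation
        return ordered[mid]
    # even n: average the (n/2)th and ((n/2)+1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [90, 87, 84, 78, 63]
even_sample = [90, 87, 84, 78, 63, 61]
with_outlier = [1090, 87, 84, 78, 63]   # the 90 changed to 1090

print("median, 5 observations:", median(odd_sample))
print("median, 6 observations:", median(even_sample))
print("median with outlier:   ", median(with_outlier))
print("mean with outlier:     ", sum(with_outlier) / len(with_outlier))
```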


Quantiles (chapter 3.3)

The median is one example of a quantile. Other common quantiles are the three quartiles, which divide a frequency distribution into fourths.
1st quartile is the value with 25% below and 75% above.
2nd quartile is the value with 50% below and 50% above (same as the median).
3rd quartile is the value with 75% below and 25% above.
These values are determined in a manner similar to that of the median.
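
As a rough sketch, Python's standard library can compute quartiles with statistics.quantiles. Be aware that software packages use several different interpolation conventions for quartiles, so the 1st and 3rd quartiles from this function may not match the textbook's method exactly, especially for a sample this small. The 2nd quartile always agrees with the median.

```python
import statistics

# Colony counts used throughout this chapter's examples.
colonies = [90, 78, 63, 84, 87]

# n=4 asks for the three cut points that split the data into fourths.
# The default interpolation method is an assumption here; other
# conventions give slightly different 1st and 3rd quartiles.
q1, q2, q3 = statistics.quantiles(colonies, n=4)

print("1st quartile:", q1)
print("2nd quartile (median):", q2)
print("3rd quartile:", q3)
```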

Other common quantiles are percentiles. SATs, ACTs, and GREs often give percentile rankings of a person's score. For example, 80th percentile means that 80% of the people taking the test scored below that value and 20% of the people scored above that value.


Mean or Median from frequency data

Oftentimes you may have access to a frequency distribution that portrays the data and you would like to calculate the mean or median from it. This is a straightforward procedure. Use the data below to calculate the mean and the median.
class mark    frequency    cumulative frequency
 7             2            2
 8             4            6
 9            10           16
10            12           28
11             9           37
12             3           40
13             1           41


The mean is calculated by using equation 3.3 and is essentially the same calculation as that which you used for the weighted mean. The class marks represent the measurements and the frequencies (the heights of the bars in a histogram) serve as the weights. Applying equation 3.3 would yield
(7*2) + (8*4) + (9*10) + (10*12) + (11*9) + (12*3) + (13*1)    404
----------------------------------------------------------- = ----- = 9.85
         (2 + 4 + 10 + 12 + 9 + 3 +1)                           41
The median is calculated using equation 3.5. The idea is this: there are 41 observations, and the (n/2)th (20.5th) observation lies somewhere in the class with the class mark of 10. Note that the cumulative frequency is 16 through the class with mark 9 and 28 through the class with mark 10; thus the 20.5th observation must lie in the class with mark 10. The lower implied limit of the class with mark 10 is 9.50. What we will do now is determine how far through that class interval we must go to reach the 20.5th value, and add this distance to the lower implied class limit. Applying equation 3.5 yields

9.5 + ((20.5 - 16)/12)*1 = 9.5 + 0.375*1 = 9.875.

The median is 9.875.
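
Both calculations from the frequency table can be sketched in Python. This is an illustrative reading of the procedure described above, not a reproduction of the textbook's equations. It assumes equal class widths of 1 and takes the lower implied limit of a class to be its class mark minus half the class width.

```python
# Frequency table from the text.
class_marks = [7, 8, 9, 10, 11, 12, 13]
frequencies = [2, 4, 10, 12, 9, 3, 1]
class_width = 1   # assumed: adjacent class marks are 1 unit apart

n = sum(frequencies)

# Mean: class marks weighted by their frequencies.
grouped_mean = sum(m * f for m, f in zip(class_marks, frequencies)) / n

# Median: find the class holding the (n/2)th observation, then
# interpolate within that class from its lower implied limit.
target = n / 2
cumulative = 0    # cumulative frequency below the current class
for mark, freq in zip(class_marks, frequencies):
    if cumulative + freq >= target:
        lower_limit = mark - class_width / 2
        grouped_median = lower_limit + (target - cumulative) / freq * class_width
        break
    cumulative += freq

print(f"n = {n}")
print(f"mean   = {grouped_mean:.2f}")
print(f"median = {grouped_median}")
```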


Mode (chapter 3.4)

The mode is simply the most common observation in the data. If there are two most common values then the distribution is said to be bimodal and it has two separate peaks. The mode is not often used as it contains very little useable information and thus you rarely see it reported in the scientific literature.
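
A short Python sketch shows how the mode might be found. The frequency table is the one from the previous section. The small bimodal list is hypothetical data made up purely for illustration.

```python
from statistics import multimode

# Frequency table from the previous section: class mark -> frequency.
frequency_table = {7: 2, 8: 4, 9: 10, 10: 12, 11: 9, 12: 3, 13: 1}

# The mode is the class mark with the highest frequency.
modal_class = max(frequency_table, key=frequency_table.get)

# Hypothetical raw data with two equally common values (bimodal).
bimodal_sample = [1, 1, 2, 3, 3]
modes = multimode(bimodal_sample)   # returns every most-common value

print("mode of frequency table:", modal_class)
print("modes of bimodal sample:", modes)
```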


Some additional thoughts

The mean is by far the most commonly reported statistic of location. It generally has a smaller standard error than the median, a distribution of sample means tends to be normally distributed, and it is easy to work with mathematically. The disadvantages are that it is greatly affected by extreme values and that it cannot be calculated if any values are missing.

The median (as well as quantiles in general) is less commonly reported. Its advantage is that it is not greatly affected by outliers, nor is it as sensitive to the shape of the distribution. The median can often be calculated when there are missing or incomplete values.

The mode is rarely used because of the paucity of information that it conveys.

If a distribution is symmetrical and unimodal, then the mean, median, and mode will have the same value. If the distribution is skewed to the right (positively skewed), that is, the right tail of the distribution extends farther than the left tail, then the mode is less than the median and the median is less than the mean (mode < median < mean). If the distribution is skewed in the other direction (left or negatively skewed), then mode > median > mean. To help you remember this, you might note that, reading from the long tail toward the peak, the three are in alphabetical order (mean, median, mode). Also, if you remember that the mean is affected by extreme values, then you will realize that it is pulled toward the longer of the two tails of the distribution. We will discuss the concepts of symmetry and skewness much more when we get to chapter 6.
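
The ordering of the three statistics under skew can be illustrated with a small Python sketch. The data set here is hypothetical, invented only to show a long right tail. The left-skewed sample is simply its mirror image.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(right_skewed)
median = statistics.median(right_skewed)
mean = statistics.mean(right_skewed)
print(f"right skew: mode={mode} < median={median} < mean={mean}")

# Mirror the data to get a left-skewed sample; the ordering reverses.
left_skewed = [15 - x for x in right_skewed]
left_mode = statistics.mode(left_skewed)
left_median = statistics.median(left_skewed)
left_mean = statistics.mean(left_skewed)
print(f"left skew:  mode={left_mode} > median={left_median} > mean={left_mean}")
```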


Sample Statistics versus Population Parameters (chapters 2.4 and 3.1)

For a population there is one true (but usually unknown) mean. This value is calculated using equation 3.1. Note that the only difference between this equation and the one that we used for the arithmetic mean is that N is in the denominator in equation 3.1. N is equal to the entire number of observations in the population, whereas n is the size of a sample. In the vast majority of cases, we cannot measure every member of a population and thus we must take only a sample. Equation 3.2 gives the best estimate of the population mean given our sample. Thus from the sample, we calculate a sample mean, which is an unbiased, efficient, and consistent estimator of the population mean. The population mean is a parameter of the population distribution and thus is one of many population parameters that we might try to estimate from our sample. It is important when we are presenting statistics that we keep our designation for the population mean separate from our designation for the sample mean, as these numbers are not likely to be the same (though we hope that they are close). The convention in statistics is to use Greek letters as designators of population parameters and to use Latin (Roman) letters as designators of sample statistics. Thus the population mean is designated with the lower case mu (µ) and the sample mean is designated with an X with a bar over it (X bar).

Do problems 3.1, 3.2, 3.3, and 3.4, at the end of chapter 3 (page 31).


Last updated on 25 August 2000.
Provide comments to Dwight Moore at mooredwi@emporia.edu.