The normal distribution is a curve defined by equation 6.1, where Y is the height of the curve for a given value of X. In
the equation, pi is a constant (3.14159) and e is a constant (2.71828). There are also two
parameters which define the curve: the standard deviation (sigma) and the
mean (mu). The area under this curve from -infinity to +infinity is defined as 1. Probabilities of
occurrence are defined relative to a set of class limits or intervals of X. The probability of X
being exactly some value is undefined. For example, the probability that a woman will give birth
to her child on her due date is about 0.05; that is, 5% of mothers give birth on their due date. The
due date is actually an interval of time equal to 24 hours. But if you wanted to know the
probability of giving birth at exactly noon on a given date, this number becomes infinitely small
and is actually undefined.
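Equation 6.1 is not reproduced in these notes, but it is the standard normal density formula. A minimal Python sketch, assuming that formula, computes the height Y of the curve for any X, mu, and sigma:

```python
import math

def normal_height(x, mu, sigma):
    """Height (Y) of the normal curve at X, using the standard density
    formula that equation 6.1 in the text presumably gives:
    Y = 1/(sigma*sqrt(2*pi)) * e^(-(X - mu)^2 / (2*sigma^2))."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Height at the mean of a standard normal curve (mu = 0, sigma = 1):
peak = normal_height(0, 0, 1)   # about 0.3989
```

Note that `peak` is a height, not a probability; probabilities come only from areas under the curve, as the paragraph above explains.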
Because the shape and position along the number line are a function of sigma and
mu, respectively, there are an infinite number of normal curves (figures 6.2a and 6.2b). The
range of possible values that X can have is actually from -infinity to +infinity, but values more
than 3 standard deviations larger or smaller than the mean are very rare. For example,
68.2% of all values of X fall in the range from the mean - 1 standard deviation to the mean + 1
standard deviation; 95.46% of all values of X fall in the range from the mean - 2 standard
deviations to the mean + 2 standard deviations; and 99.72% of all values of X fall in the range
from the mean - 3 standard deviations to the mean + 3 standard deviations. The normal curve is also
symmetrical around the mean (mean = median = mode).
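These 1-, 2-, and 3-standard-deviation proportions can be checked with Python's standard library, since the central area of a normal curve is erf(k / sqrt(2)) for k standard deviations on each side of the mean:

```python
import math

def prob_within(k):
    """Proportion of a normal population within k standard deviations of
    the mean: P(mu - k*sigma < X < mu + k*sigma) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

p1, p2, p3 = prob_within(1), prob_within(2), prob_within(3)
# roughly 0.683, 0.954, and 0.997, matching the percentages quoted above
```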
Skewness and Kurtosis (chapter 6.1)
For a population to have a normal distribution, the distribution must be symmetrical. If the
distribution is not symmetrical, then it is said to be skewed: it has
more observations in one tail than in the other (figures 3.2c and 3.2d). We can measure
symmetry by calculating a statistic called gamma1.
Recall that we had introduced the quantity sum of (Xi - Xbar)/N across all
observations and the quantity sum of (Xi - Xbar)2/N across all observations. These
are called the first and second moments about the mean, respectively. The second moment is also known as the population
variance. The third moment is the quantity sum of (Xi - Xbar)3/N across all
observations, and this moment can be used to measure the symmetry of a distribution. The
sample statistic (k3) that estimates the third moment is given by equation 6.3. As
this value has units cubed, it is divided by s3 to yield a statistic called
g1, which is an estimator of gamma1. g1 is equal to zero
if the distribution is symmetrical. If g1 is less than zero, the distribution is skewed
to the left (negatively skewed) as in figure 3.2d (page 24). If g1 is greater than zero, then the
distribution is skewed to the right (positively skewed) as in figure 3.2c (page 24).
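A short Python sketch of g1 follows. The text's equation 6.3 is not reproduced here, so the k3 formula below, n * sum((Xi - Xbar)^3) / ((n-1)(n-2)), is an assumption: it is the usual small-sample estimator of the third moment found in texts of this style, and the exact formula in equation 6.3 may differ slightly.

```python
import math

def g1(data):
    """Sample skewness g1 = k3 / s^3. The k3 formula used here is the
    common unbiased estimator; the text's equation 6.3 (not shown)
    may apply a slightly different correction."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    k3 = n * sum((x - xbar) ** 3 for x in data) / ((n - 1) * (n - 2))
    return k3 / s ** 3

right_skewed = [1, 2, 2, 3, 3, 3, 10]    # long right tail: g1 > 0
left_skewed = [-x for x in right_skewed]  # mirror image: g1 < 0
```

For a perfectly symmetrical sample such as [1, 2, 3, 4, 5], the cubed deviations cancel and g1 is exactly zero, as the paragraph above states.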
Kurtosis is a measure of how the observations in a distribution are distributed in the peak relative
to the tails. A normal curve is defined as mesokurtic (figure 6.3). If there are more values near
the peak, then the curve is said to be leptokurtic (figure 6.3c). If there are more values in the
tails, then the curve is said to be platykurtic (figure 6.2b). We can measure kurtosis by
calculating a statistic called gamma2, which uses the fourth moment about the
mean, sum of (Xi - Xbar)4/N. The sample statistic, k4, is given by equation 6.6. However, k4 has units to the fourth power,
so it is divided by s4 to yield a unitless number. This number equals
3 if the distribution is mesokurtic. In practice, when calculating the sample statistic, 3 is
subtracted from the result to yield a sample statistic called g2 which is an
estimator of gamma2. If g2 is less than zero, then the distribution is
platykurtic, and if g2 is greater than zero then the distribution is leptokurtic. If
g2 is equal to zero, then the distribution is mesokurtic (normal).
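A Python sketch of this idea follows. Equation 6.6's k4 applies small-sample corrections that are not reproduced in these notes, so the code below uses the plain moment version (fourth moment over squared variance, minus 3) as a simple stand-in; it illustrates the sign convention for g2 rather than the exact k4 formula.

```python
def g2(data):
    """Simple moment-based sample kurtosis: the fourth moment about the
    mean divided by the squared second moment, minus 3. The text's k4
    (equation 6.6, not shown) adds small-sample corrections that this
    plain version omits."""
    n = len(data)
    xbar = sum(data) / n
    m2 = sum((x - xbar) ** 2 for x in data) / n
    m4 = sum((x - xbar) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

flat = list(range(1, 101))             # uniform-like: short tails, g2 < 0 (platykurtic)
peaky = [0] * 96 + [-10, -10, 10, 10]  # sharp peak with heavy tails, g2 > 0 (leptokurtic)
```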
After we introduce hypothesis testing we will discuss this and other methods for testing if
g1 and g2 are significantly different from zero.
Proportions under the curve (chapter 6.2)
Probabilities for the normal curve are defined relative to some range of values of X. These
proportions are actually integrals (areas under the curve) over some range of X values. For example,
the IQ of humans is distributed as a normal curve with a mean of 100 and a standard deviation of
15. We might ask what portion of the population has an IQ that is greater than 120, or we might
ask what portion of the population has an IQ that is between 85 and 95. In North America, the
mean height of males is 69 inches with a standard deviation of 2 inches. What is the portion of
the population that has a height less than 66 inches? To determine the answers to these questions
we would need to compute probabilities, which would be quite complicated, except that we can
convert these questions to ones involving a standard normal curve.
If we subtract the mean from each value of X, we will have a new distribution whose mean is
zero and if we divide each value of X by the standard deviation then we have a new distribution
whose standard deviation is 1 (chapters 3.6 and 4.8). A normal curve whose mean is zero and
whose standard deviation is 1 is called a standard normal curve. Given a value of X from which
we subtract the mean and divide by the standard deviation, we have a normal deviate (Z)
(equation 6.13). We can then use table B2 in the appendix to answer questions about proportions (or probabilities of occurrence) for a
normally distributed variable. For example, what portion of the population has an IQ greater than 120? Put another way, what is the
probability of finding a person whose IQ is greater than 120 when sampling at random from the population?
P(Xi > 120)
convert 120 to a standard deviate
Remember that the mean (mu) is 100 and the standard deviation (sigma) is 15.
Zi = (Xi - 100)/15 = (120 - 100)/15 = 1.333
P(Xi > 120) = P (Zi > 1.333)
To use table B2, look down the left column and find the value 1.3, then go across that row to the
column headed by 3; the number where the row and column cross is 0.0918. Thus, P (Zi > 1.33) = 0.0918.
Be sure to carefully study the table heading so that you understand how the table works. Tables with
similar information in other books are often set up differently, so if you use a different table it is
important to understand which proportions under the normal curve that table is giving you.
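The same upper-tail proportion can be computed with Python's standard library instead of table B2; this sketch reproduces the IQ example above:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal deviate, via the error function."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2)))

# IQ example: mu = 100, sigma = 15
z = (120 - 100) / 15    # 1.333...
p = upper_tail(z)       # about 0.091, matching table B2's 0.0918 at Z = 1.33
```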
What is the probability of obtaining a person whose IQ is between 85 and 95?
P(85 < Xi < 95)
convert 85 and 95 to standard deviates
Zi = (Xi - 100)/15 = (85 - 100)/15 = -1.00
Zi = (Xi - 100)/15 = (95 - 100)/15 = -0.33
P (85 < Xi <95) = P (-1.00 < Zi < -0.33)
Again we must use table B2. You should notice that all Z values in the table are positive, but as the normal
curve is symmetrical it is quite easy to find the probability values for negative Z scores.
P (-1.00 < Zi < -0.33) = P (0.33 < Zi < 1.00)
P (0.33 < Zi < 1.00) = P(Zi > 0.33) - P(Zi > 1.00)
We must subtract the smaller portion of the curve (P(Zi > 1.00)) from the larger portion (P(Zi > 0.33)) to get the portion that
remains between 0.33 and 1.00.
Thus, P (-1.00 < Zi < -0.33) = P (85 < Xi <95) = 0.2120.
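This between-two-values calculation, larger tail minus smaller tail, can be sketched in Python the same way:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal deviate."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15
z_low = (85 - mu) / sigma    # -1.00
z_high = (95 - mu) / sigma   # -0.333...

# By symmetry, P(-1.00 < Z < -0.33) = P(0.33 < Z < 1.00),
# which is the larger tail minus the smaller tail:
p = upper_tail(-z_high) - upper_tail(-z_low)   # about 0.211
```

The small difference from the table's 0.2120 comes only from rounding Z to two decimal places when reading table B2.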
It is often very important to draw a normal curve and sketch the areas of the curve that are under
consideration. This will greatly help you in determining what values you need to find and the
mathematical manipulations that you need to do. Be sure to carefully study example 6.3. This
concept of portions of the normal curve is very important and will be used to derive more
advanced concepts involving confidence limits and hypothesis testing.
Distribution of means (chapter 6.3)
Given a population whose mean is mu, I could draw a sample of size n from this population and
calculate a sample mean (Xbar). I could then repeat this process calculating a number of
different Xbars. These Xbars would likely be different from each other, however the Xbars
would themselves have a distribution. This distribution is called the distribution of means. This
distribution will approximate a normal distribution, even if the distribution of the original
population was not normal. This property of the distribution of means is known as the Central
Limit Theorem. The variance of the population of means is given by equation 6.14. You can see
from this equation that the variance of the means decreases as n (the sample size used to
determine each Xbar) increases. The term sigma2Xbar is called the
variance of the means. The positive square root of the variance of the means is called the standard
deviation of the means (sigmaXbar). This term is also called the standard error of
the mean (SEM) or more simply the standard error.
You can calculate the standard error of other
sample statistics, however the standard error of the mean is assumed if the value is simply called
the standard error.
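The repeated-sampling process described above can be simulated directly. This sketch draws many samples from a decidedly non-normal (uniform) population and checks that the standard deviation of the resulting Xbars is close to sigma divided by the square root of n:

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

n = 25                      # size of each sample
sigma = 1 / math.sqrt(12)   # standard deviation of a uniform(0, 1) population
expected_sem = sigma / math.sqrt(n)

# Draw many samples and record each sample mean (Xbar)
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(2000)]

observed_sem = statistics.stdev(means)  # close to expected_sem
```

Even though the population here is flat rather than bell-shaped, the distribution of the means clusters tightly around the population mean of 0.5, which is the Central Limit Theorem at work.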
You can also calculate standard deviates of Xbars. We used equation 6.13 to convert
Xi to Zi. Using the same logic, we can use equation 6.16 to convert an
Xbar to a standard deviate. We subtract the mean (mu) from Xbar and divide by the standard
error of the mean.
For example, what is the probability of drawing a sample of 20 people whose mean IQ is greater
than 110? Remember that mu = 100 and sigma = 15. First, using equation 6.15, we determine that
the standard error of the mean is 15/square root of 20, which equals 3.354. Then, using equation 6.16,
Z = (110 - 100)/3.354 = 2.98, and from table B2, P(Xbar > 110) = P(Z > 2.98) = 0.0014.
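The same calculation in Python, using the error function in place of table B2:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal deviate."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2)))

mu, sigma, n = 100, 15, 20
sem = sigma / math.sqrt(n)   # standard error of the mean, about 3.354
z = (110 - mu) / sem         # about 2.98
p = upper_tail(z)            # about 0.0014
```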
In general, we will not know the true population standard deviation (sigma) and thus we
will have to estimate sigma from the data using s, the sample standard
deviation. For this reason, we will usually calculate a sample standard error following equation 6.18.
Confidence limits (chapter 7.3)
Xbar is said to be a point estimate of mu. Even though Xbar is the best estimator of mu, we
do not know what mu actually is, nor do we know how close Xbar comes to mu. It might be very close, or it
might be quite some distance away. We can increase our confidence in our estimation of mu
if we calculate an interval within which mu will likely be found, and we can set the width of the interval
so that mu will be found within it with a certain probability (usually 95%).
Confidence interval from a population whose variance is known
If we were sampling from a population whose standard deviation was known we could establish
a 95% confidence interval around a sample mean by converting our sample mean to a standard
deviate (subtracting mu and dividing by the standard error of the mean). We would like to
establish upper and lower limits for this standard deviate, such that our standard deviate will fall
within the central 0.95 portion of the standard normal curve. As the normal curve is
symmetrical, this would mean that we have 0.025 portion of the curve in each one of the tails. Now
if we examine the body of table B2, we will find a proportion value equal to 0.025. This
value corresponds to a Z score equal to 1.96, which means that our standard deviate must lie
between -1.96 and 1.96. We would write this as below.
P(-1.96 < (Xbar - mu)/SEM < 1.96) = 0.95
The next step is to rearrange this equation so that only mu is in the middle. This yields
P(Xbar - 1.96(SEM) < mu < Xbar + 1.96(SEM)) = 0.95
The quantity Xbar - 1.96(SEM) is the lower confidence limit (L1) of our confidence interval,
while the quantity Xbar + 1.96(SEM) is the upper confidence limit (L2) of our confidence interval.
For example, let us suppose that we took a sample of 10 measurements from a
population whose standard deviation was 5. The Xbar of this sample was 12 and we would like
to calculate the upper and lower limits of the 95% confidence interval.
The first step is to calculate the standard error of the mean, which would be the sigma divided by the square root of the sample size (equation 6.15).
SEM = 5/square root of 10 = 1.581139
L1 = Xbar - 1.96(SEM) = 12 - 1.96 * 1.581139 = 8.90
L2 = Xbar + 1.96(SEM) = 12 + 1.96 * 1.581139 = 15.10
Thus there is a 95% probability that the true mean lies between 8.90 and 15.10.
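The steps of this known-sigma example can be sketched in a few lines of Python:

```python
import math

# 95% CI around Xbar when the population sigma is known
xbar, sigma, n = 12, 5, 10
sem = sigma / math.sqrt(n)   # standard error of the mean, 1.581139
z = 1.96                     # bounds the central 0.95 of the standard normal curve
lower = xbar - z * sem       # L1, about 8.90
upper = xbar + z * sem       # L2, about 15.10
```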
Confidence interval from a population whose variance is unknown (chapter 7.3)
In the above example, we knew the variance of the population from which we were sampling. In
the vast majority of cases, we will not know the variance of the population, but will instead
have to estimate the variance from the sample. We will then calculate the standard deviate using
equation 7.1. The difference between this equation and equation 6.15 is that the population
standard deviation is replaced by the sample standard deviation when we calculate the standard
error of the mean (which is now a sample standard error, equation 6.18).
This quantity is no longer distributed as a normal distribution but as a t-distribution with n-1
degrees of freedom (figure 7.1). There are an infinite number of t-distributions, depending upon
the degrees of freedom, however a t-distribution with infinite degrees of freedom is the same as a
standard normal curve.
To determine a confidence interval, we need to determine upper and lower limits as we did above
for our t value (standard deviate when the population variance is unknown). Again we will
determine these values so that we will include the central 0.95 portion of the t-distribution. As the
t-distribution is symmetrical, this would mean that we have 0.025 portion of the curve in each
one of the tails. The upper and lower t-values will vary with the sample size (degrees of
freedom), so our probability statement will look like this.
P(Xbar - t0.05(2),v(SEM) < mu < Xbar + t0.05(2),v(SEM)) = 0.95
The quantity Xbar - t0.05(2),v(SEM) is called the lower confidence limit
(L1), while the quantity Xbar + t0.05(2),v(SEM) is called the upper
confidence limit (L2).
Suppose we sampled from a population and found a sample mean of 4.004 and a sample standard
deviation of 0.366 from a sample of 25 individuals. To calculate the 95% confidence interval,
we will need to calculate L1 and L2.
We first need to determine the sample standard error of the mean, which is the sample standard
deviation divided by the square root of the sample size (s/square root of n)
s / square root of n = 0.366 / square root of 25 = 0.0732.
We next need to determine the correct value of t. We will use table B3 to do this. Note that
across the top of the table there are two rows of alpha values. We are going to use the alpha(2) values, as these correspond to two-tailed values. For example, at alpha(2) = 0.05, then 0.05/2
is in each tail. Down the left side of the table are the values for the degrees of freedom. As our
sample size is 25 (n), our degrees of freedom are 24 (n - 1). Looking in the column headed alpha(2) = 0.05 and the row for 24 degrees of freedom, our t-value is 2.064.
The limits are then L1 = 4.004 - 2.064(0.0732) = 3.853 and L2 = 4.004 + 2.064(0.0732) = 4.155.
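In Python this t-based interval looks like the known-sigma version, except that the critical value must come from table B3 (the standard library has no t quantile function), so 2.064 is hardcoded below:

```python
import math

# 95% CI when sigma is unknown; the critical value t(0.05(2), 24) = 2.064
# is taken from table B3, since Python's standard library has no t quantile.
xbar, s, n = 4.004, 0.366, 25
sem = s / math.sqrt(n)       # sample standard error, 0.0732
t_crit = 2.064
lower = xbar - t_crit * sem  # L1, about 3.853
upper = xbar + t_crit * sem  # L2, about 4.155
```

With SciPy available, `t_crit` could instead be computed for any n, which avoids the table lookup entirely.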
You should notice that the width of the confidence interval is a function of the sample size. As you increase the sample size (n), the standard error will get smaller and the t-value will also get smaller. Thus a 95% confidence interval based upon a sample of 10 observations will be wider than a confidence interval based upon 25 observations. An investigator therefore has some control over the width of the C.I. through sampling effort.
You can also calculate a 99% C.I. (or any % that you want), though statisticians have agreed that 95% is the most commonly used. A 99% C.I. will use a different t-value (alpha(2) = 0.01) and will be wider than a 95% C.I. The 95% C.I. is a trade-off between having a C.I. that is so wide as to be useless and having such a narrow C.I. that you have too little confidence that mu is within the interval.