The Normal Distribution
Chapter 6

The normal distribution is a curve defined by equation 6.1, where Y is the height of the curve for a given value of X. In the equation, pi is a constant (3.14159) and e is a constant (2.71828). There are also two parameters which define the curve: the mean (mu) and the standard deviation (sigma). The area under this curve from -infinity to +infinity is defined as 1. Probabilities of occurrence are defined relative to a set of class limits or intervals of X; the probability of X being exactly some single value is zero. For example, the probability that a woman will give birth to her child on her due date is about 0.05; that is, 5% of mothers give birth on their due date. The due date is actually an interval of time equal to 24 hours. But if you wanted to know the probability of giving birth at exactly noon on a given date, the interval shrinks to a single point and the probability shrinks to zero.

Because the shape and the position along the number line are functions of sigma and mu, respectively, there are an infinite number of normal curves (figures 6.2a and 6.2b). The range of possible values of X runs from -infinity to +infinity, but values more than 3 standard deviations above or below the mean are very rare. Specifically, 68.27% of all values of X fall in the range from the mean - 1 standard deviation to the mean + 1 standard deviation; 95.45% of all values of X fall in the range from the mean - 2 standard deviations to the mean + 2 standard deviations; and 99.73% of all values of X fall in the range from the mean - 3 standard deviations to the mean + 3 standard deviations. The normal curve is also symmetrical around the mean (mean = median = mode).
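
The notes contain no code, but the three percentages above can be checked with a short Python sketch using the standard library's normal distribution:

```python
from statistics import NormalDist

z = NormalDist()                  # standard normal curve: mu = 0, sigma = 1
within1 = z.cdf(1) - z.cdf(-1)    # area within 1 standard deviation of the mean
within2 = z.cdf(2) - z.cdf(-2)    # area within 2 standard deviations
within3 = z.cdf(3) - z.cdf(-3)    # area within 3 standard deviations
print(within1, within2, within3)  # about 0.6827, 0.9545, 0.9973
```

Because any normal curve is just a shifted and rescaled standard normal curve, these proportions hold for every choice of mu and sigma.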


Skewness and Kurtosis (chapter 6.1)

For a population to have a normal distribution, the distribution must be symmetrical. If the distribution is not symmetrical, it is said to be skewed; it has more observations in one tail than in the other (figures 3.2c and 3.2d). We can measure symmetry by calculating a statistic called gamma1.

Recall that we introduced the quantity sum of (Xi - Xbar)/N across all observations and the quantity sum of (Xi - Xbar)^2/N across all observations. These are called the first and second moments about the mean, respectively. The second moment is also known as the population variance. The third moment is the quantity sum of (Xi - Xbar)^3/N across all observations, and this moment can be used to measure the symmetry of a distribution. The sample statistic (k3) that estimates the third moment is given by equation 6.3. As this value has units cubed, it is divided by s^3 to yield a statistic called g1, which is an estimator of gamma1. g1 is equal to zero if the distribution is symmetrical. If g1 is less than zero, the distribution is skewed to the left (negatively skewed) as in figure 3.2d (page 24). If g1 is greater than zero, the distribution is skewed to the right (positively skewed) as in figure 3.2c (page 24).

Kurtosis is a measure of how the observations in a distribution are distributed in the peak relative to the tails. A normal curve is defined as mesokurtic (figure 6.3). If there are more values near the peak, the curve is said to be leptokurtic (figure 6.3c). If there are more values in the tails, the curve is said to be platykurtic (figure 6.3b). We can measure kurtosis by calculating a statistic called gamma2, which uses the fourth moment about the mean, sum of (Xi - Xbar)^4/N. The sample statistic, k4, is given by equation 6.6. However, k4 has units to the fourth power, so k4 is divided by s^4 to yield a unitless number. This number equals 3 if the distribution is mesokurtic. In practice, when calculating the sample statistic, 3 is subtracted from the result to yield a sample statistic called g2, which is an estimator of gamma2. If g2 is less than zero, the distribution is platykurtic; if g2 is greater than zero, the distribution is leptokurtic. If g2 is equal to zero, the distribution is mesokurtic (normal).
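
The moment definitions above can be sketched in Python. Note this is a simplified illustration that divides every moment by n; it does not apply the bias corrections built into the textbook's equations 6.3 and 6.6, and the helper name g1_g2 is invented for this sketch:

```python
from statistics import mean, pstdev

def g1_g2(xs):
    # Moment-based skewness (g1) and excess kurtosis (g2); a rough
    # sketch, not the bias-corrected k-statistics of equations 6.3/6.6.
    n = len(xs)
    m = mean(xs)
    s = pstdev(xs)                            # divides by n, matching the moments
    m3 = sum((x - m) ** 3 for x in xs) / n    # third moment about the mean
    m4 = sum((x - m) ** 4 for x in xs) / n    # fourth moment about the mean
    return m3 / s ** 3, m4 / s ** 4 - 3       # subtract 3 so a normal curve gives g2 = 0

g1, g2 = g1_g2([2, 3, 5, 8, 13, 21])          # right-skewed data, so g1 > 0
```

Dividing m3 by s^3 and m4 by s^4 strips the units, which is why g1 and g2 can be compared across distributions measured on different scales.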

After we introduce hypothesis testing, we will discuss this and other methods for testing whether g1 and g2 are significantly different from zero.


Proportions under the curve (chapter 6.2)

Probabilities for the normal curve are defined relative to some range of values of X. These proportions are actually integrals (areas under the curve) over some range of X values. For example, the IQ of humans is distributed as a normal curve with a mean of 100 and a standard deviation of 15. We might ask what portion of the population has an IQ that is greater than 120, or we might ask what portion of the population has an IQ that is between 85 and 95. In North America, the mean height of males is 69 inches with a standard deviation of 2 inches. What portion of the population has a height of less than 66 inches? Answering these questions would require some complicated integration, except that we can convert each question into one about a standard normal curve.

If we subtract the mean from each value of X, we will have a new distribution whose mean is zero, and if we then divide each value by the standard deviation, we will have a new distribution whose standard deviation is 1 (chapters 3.6 and 4.8). A normal curve whose mean is zero and whose standard deviation is 1 is called a standard normal curve. Given a value of X, if we subtract the mean and divide by the standard deviation, we have a normal deviate (Z) (equation 6.13). We can then use table B2 in the appendix to answer questions about proportions (or probabilities of occurrence) in a population.

For example, what portion of the population has an IQ greater than 120? Put another way, what is the probability of finding a person whose IQ is greater than 120 when sampling at random from the population?

P(Xi > 120)
convert 120 to a standard deviate
Remember that the mean (mu) is 100 and the standard deviation (sigma) is 15.
Zi = (Xi - 100)/15 = (120 - 100)/15 = 1.333
P(Xi > 120) = P (Zi > 1.333)

To use table B2, look down the left column and find the value 1.3, then go across that row to the column headed by 3; the number where the row and column cross is 0.0918. Thus, P (Zi > 1.33) = 0.0918.
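
The table lookup can be checked in Python. The exact answer differs slightly from 0.0918 because the table rounds z to two decimal places (1.33 rather than 1.333...):

```python
from statistics import NormalDist

mu, sigma = 100, 15
z = (120 - mu) / sigma             # equation 6.13: about 1.333
p = 1 - NormalDist().cdf(z)        # upper-tail area beyond z
print(p)                           # about 0.091, versus the table's 0.0918 for z = 1.33
```
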

Be sure to carefully study the table heading so that you understand how the table works. Tables with similar information in other books are often set up differently, so if you use a different table, it is important to carefully study its heading so that you understand which proportions under the normal curve the table is giving you.

What is the probability of obtaining a person whose IQ is between 85 and 95?

P(85 < Xi < 95)
convert 85 and 95 to standard deviates
Zi = (Xi - 100)/15 = (85 - 100)/15 = -1.00
Zi = (Xi - 100)/15 = (95 - 100)/15 = -0.33
P (85 < Xi <95) = P (-1.00 < Zi < -0.33)

Again we must use table B2. You should notice that all Z values in the table are positive, but as the normal curve is symmetrical, it is quite easy to find the probability values for negative Z scores.

P (-1.00 < Zi < -0.33) = P (0.33 < Zi < 1.00)
P (0.33 < Zi < 1.00) = P(Zi > 0.33) - P(Zi > 1.00)
We must subtract the smaller portion of the curve (P(Zi > 1.00)) from the larger portion (P(Zi > 0.33)) to get the portion that remains between 0.33 and 1.00.

P(Zi > 0.33) - P(Zi > 1.00) = 0.3707 - 0.1587 = 0.2120

Thus, P (-1.00 < Zi < -0.33) = P (85 < Xi <95) = 0.2120.
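
This two-tail subtraction is exactly what a direct computation does. A Python check (the small difference from 0.2120 comes from the table rounding -0.333... to -0.33):

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ distribution from the example
p = iq.cdf(95) - iq.cdf(85)         # area between X = 85 and X = 95
print(p)                            # about 0.211, versus 0.2120 from the table
```

Working directly on the X scale here is equivalent to the Z-score manipulation above, since `NormalDist` performs the standardization internally.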

It is often very important to draw a normal curve and sketch the areas of the curve that are under consideration. This will greatly help you in determining what values you need to find and the mathematical manipulations that you need to do. Be sure to carefully study example 6.3. This concept of portions of the normal curve is very important and will be used to derive more advanced concepts involving confidence limits and hypothesis testing.


Distribution of means (chapter 6.3)

Given a population whose mean is mu, I could draw a sample of size n from this population and calculate a sample mean (Xbar). I could then repeat this process, calculating a number of different Xbars. These Xbars would likely differ from each other; however, the Xbars would themselves have a distribution. This distribution is called the distribution of means. This distribution will approximate a normal distribution, even if the distribution of the original population was not normal. This property of the distribution of means is known as the Central Limit Theorem. The variance of the population of means is given by equation 6.14. You can see from this equation that the variance of the means decreases as n (the sample size used to determine each Xbar) increases. The term sigma^2 Xbar is called the variance of the means. The positive square root of the variance of the means is called the standard deviation of the means (sigmaXbar). This term is also called the standard error of the mean (SEM) or, more simply, the standard error.
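
The repeated-sampling process described above is easy to simulate. This Python sketch draws many samples of size n from a normal population and compares the observed spread of the Xbars against sigma divided by the square root of n (equation 6.15):

```python
import random
from statistics import mean, stdev

random.seed(1)                     # fixed seed so the simulation is repeatable
mu, sigma, n = 100, 15, 20         # the IQ population, samples of size 20

# Draw 5000 samples of size n and record each sample mean (Xbar).
xbars = [mean(random.gauss(mu, sigma) for _ in range(n)) for _ in range(5000)]

sem_observed = stdev(xbars)        # spread of the distribution of means
sem_theory = sigma / n ** 0.5      # equation 6.15: about 3.354
print(sem_observed, sem_theory)
```

The two numbers agree closely, and the mean of the Xbars lands near mu, illustrating the Central Limit Theorem numerically.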

You can calculate the standard error of other sample statistics; however, the standard error of the mean is assumed if the value is simply called the standard error.

You can also calculate standard deviates of Xbars. We used equation 6.13 to convert Xi to Zi. Using the same logic, we can use equation 6.16 to convert an Xbar to a standard deviate. We subtract the mean (mu) from Xbar and divide by the standard error of the mean.

For example, what is the probability of drawing a sample of 20 people whose mean IQ is greater than 110? Remember that mu = 100 and sigma = 15. First, using equation 6.15, we determine that the standard error of the mean is 15 divided by the square root of 20, which equals 3.354.

The standard deviate follows from equation 6.16.

Z = (Xbar - mu)/sigmaXbar = (110 - 100)/3.354 = 2.98

P(Xbar > 110) = P(Z > 2.98) = 0.0014
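
The whole calculation, from standard error to tail probability, fits in a few lines of Python:

```python
from statistics import NormalDist

mu, sigma, n = 100, 15, 20
sem = sigma / n ** 0.5             # equation 6.15: about 3.354
z = (110 - mu) / sem               # equation 6.16: about 2.98
p = 1 - NormalDist().cdf(z)        # P(Xbar > 110), about 0.0014
print(sem, z, p)
```

Note how much smaller this probability is than the single-person probability P(Xi > 120) computed earlier: averaging over 20 people shrinks the spread of the distribution of means, so extreme sample means are rarer than extreme individuals.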

In general, we will not know the true population standard deviation (sigma), and thus we will have to estimate sigma from the data using s, the sample standard deviation. For this reason, we will usually calculate a sample standard error following equation 6.18.


Confidence limits (chapter 7.3)

Xbar is said to be a point estimate of mu, and even though Xbar is the best estimator of mu, we do not know what mu actually is, nor do we know how close Xbar comes to mu. It might be very close, or it might be quite some distance from mu. We can increase our confidence in our estimate of mu if we calculate an interval within which mu will likely be found, and we can set the width of the interval so that mu will be found within it with a certain probability (usually 95%).


Confidence interval from a population whose variance is known
If we were sampling from a population whose standard deviation was known, we could establish a 95% confidence interval around a sample mean by converting our sample mean to a standard deviate (subtracting mu and dividing by the standard error of the mean). We would like to establish upper and lower limits for this standard deviate, such that it will fall within the central 0.95 portion of the standard normal curve. As the normal curve is symmetrical, this would mean that we have a 0.025 portion of the curve in each of the tails. If we examine the body of table B2, we will find a proportion value equal to 0.025. This value corresponds to a Z score of 1.96, which means that our standard deviate must lie between -1.96 and 1.96. We would write this as below.

P(-1.96 < (Xbar - mu)/SEM < 1.96) = 0.95

The next step is to rearrange this equation so that only mu is in the middle. This yields

P(Xbar - 1.96(SEM) < mu < Xbar + 1.96(SEM)) = 0.95

The quantity Xbar - 1.96(SEM) is the lower confidence limit (L1) of our confidence interval, while the quantity Xbar + 1.96(SEM) is the upper confidence limit (L2) of our confidence interval.

For example, let us suppose that we took a sample of 10 measurements from a population whose standard deviation was 5. The Xbar of this sample was 12 and we would like to calculate the upper and lower limits of the 95% confidence interval.

The first step is to calculate the standard error of the mean, which is sigma divided by the square root of the sample size (equation 6.15). SEM = 5/square root of 10 = 1.581139

L1 = Xbar - 1.96(SEM) = 12 - 1.96 * 1.581139 = 8.90
L2 = Xbar + 1.96(SEM) = 12 + 1.96 * 1.581139 = 15.10
Thus, there is a 95% probability that the true mean lies between 8.90 and 15.10.
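
The two limits are a one-line calculation each in Python:

```python
xbar, sigma, n = 12, 5, 10         # sample mean, known sigma, sample size
sem = sigma / n ** 0.5             # equation 6.15: about 1.5811
l1 = xbar - 1.96 * sem             # lower 95% confidence limit, about 8.90
l2 = xbar + 1.96 * sem             # upper 95% confidence limit, about 15.10
print(l1, l2)
```

The multiplier 1.96 comes straight from table B2 and is fixed for any 95% interval when sigma is known; only SEM changes with the data.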


Confidence interval from a population whose variance is unknown (chapter 7.3)

In the above example, we knew the variance of the population from which we were sampling. In the vast majority of cases, we will not know the variance of the population, but will instead have to estimate the variance from the sample. We will then calculate the standard deviate using equation 7.1. The difference between this equation and equation 6.15 is that the population standard deviation is replaced by the sample standard deviation when we calculate the standard error of the mean (which is now a sample standard error, equation 6.18).

This quantity is no longer distributed as a normal distribution, but as a t-distribution with n-1 degrees of freedom (figure 7.1). There are an infinite number of t-distributions, depending upon the degrees of freedom; however, a t-distribution with infinite degrees of freedom is the same as a standard normal curve. To determine a confidence interval, we need to determine upper and lower limits, as we did above, for our t value (the standard deviate when the population variance is unknown). Again, we will determine these values so that we include the central 0.95 portion of the t-distribution. As the t-distribution is symmetrical, this means that we have a 0.025 portion of the curve in each of the tails. The upper and lower t-values will vary with the sample size (degrees of freedom), so our probability statement will look like this.

P(-t0.05(2),v < (Xbar - mu)/SEM < +t0.05(2),v) = 0.95 (equation 7.3)

Rearranging this equation so that only mu is in the middle yields

P(Xbar - t0.05(2),v(SEM) < mu < Xbar + t0.05(2),v(SEM)) = 0.95 (equation 7.4)

The quantity Xbar - t0.05(2),v(SEM) is called the lower confidence limit (L1), while the quantity Xbar + t0.05(2),v(SEM) is called the upper confidence limit (L2).

Suppose we sampled from a population and found a sample mean of 4.004 and a sample standard deviation of 0.366 from a sample of 25 individuals. To calculate the 95% confidence interval, we will need to calculate L1 and L2. We first need to determine the sample standard error of the mean, which is the sample standard deviation divided by the square root of the sample size (s/square root of n).

SEM = s / square root of n = 0.366 / square root of 25 = 0.0732

We next need to determine the correct value of t. We will use table B3 to do this. Note that across the top of the table there are two rows of alpha values. We are going to use the alpha(2) values, as these correspond to two-tailed values. For example, at alpha(2) = 0.05, a portion of 0.05/2 = 0.025 is in each tail. Down the left side of the table are the values for the degrees of freedom. As our sample size is 25 (n), our degrees of freedom are 24 (n - 1). Looking in the column headed alpha(2) = 0.05 and the row for 24 degrees of freedom, our t-value is 2.064.

Next, use the formulas below for L1 and L2.

L1 = Xbar - t0.05(2),v(SEM) = 4.004 - 2.064 * 0.0732 = 3.853
L2 = Xbar + t0.05(2),v(SEM) = 4.004 + 2.064 * 0.0732 = 4.155
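
In Python, using the t-value 2.064 read from table B3 (the standard library has no t-distribution, so the table value is supplied by hand):

```python
xbar, s, n = 4.004, 0.366, 25
sem = s / n ** 0.5                 # sample standard error (equation 6.18): 0.0732
t = 2.064                          # table B3: alpha(2) = 0.05, v = n - 1 = 24
l1 = xbar - t * sem                # lower 95% confidence limit, about 3.853
l2 = xbar + t * sem                # upper 95% confidence limit, about 4.155
print(l1, l2)
```

The structure is identical to the known-sigma case; only the multiplier changes, from the fixed 1.96 to a t-value that depends on the degrees of freedom.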

You should notice that the width of the confidence interval is a function of the sample size. As you increase the sample size (n), the standard error will get smaller and the t-value will also get smaller. Thus, a 95% confidence interval based upon a sample of 10 observations will be wider than a confidence interval based upon 25 observations. Thus, an investigator has some control over the width of the C.I. through sampling effort.
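
Both effects can be seen numerically. This sketch reuses s = 0.366 from the example above for illustration (in practice s would itself change with the sample) and takes the t-values from table B3:

```python
s = 0.366                          # sample standard deviation, reused from the example
t9, t24 = 2.262, 2.064             # table B3 at alpha(2) = 0.05, for v = 9 and v = 24
width10 = 2 * t9 * s / 10 ** 0.5   # full width of a 95% C.I. from n = 10
width25 = 2 * t24 * s / 25 ** 0.5  # full width of a 95% C.I. from n = 25
print(width10, width25)            # the n = 10 interval is noticeably wider
```
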

You can also calculate a 99% C.I. (or any percentage that you want), although statisticians have agreed that 95% is the most commonly used. A 99% C.I. will use a different t-value (alpha(2) = 0.01) and will be wider than a 95% C.I. A 95% C.I. is a trade-off between having a C.I. that is so wide as to be useless and having such a narrow C.I. that you have too little confidence that mu is within the interval.

Use SigmaStat to run basic statistics for problem 6.1 (page 89).
Answers:

Do problems 6.2, 6.3, and 6.4 at the end of chapter 6 (page 90).
Answers:

Last updated on 28 August 2000.
Provide comments to Dwight Moore at mooredwi@emporia.edu.