Hypothesis Testing
Chapter 6

Besides descriptive statistics there are inferential statistics. These are statistics that allow us to make statements about one or more populations based upon one or more samples that we have taken from those populations. There are many such scenarios and we will only be able to cover a few of them during the rest of the course. Hypothesis testing, errors associated with testing, 1-tailed and 2-tailed tests, and power are concepts that will be covered in this section. You MUST understand these concepts if the rest of the semester is to make any sense at all.

Introduction (chapter 6.4)

Inferential statistics are based upon the idea of a null hypothesis and an alternate hypothesis. The null hypothesis (which is written Ho:) is a statement that is written in such a way that there is no difference between the two items being tested. When we test the null hypothesis we will determine a P value, which is the probability of getting a result at least as deviant as the one we observed if the null hypothesis were true. If that probability is low, it is unlikely that the null hypothesis is true, and we reject our null hypothesis and accept an alternate hypothesis (which is written HA:), which states that the two items are not equal. We will look at variations on this theme later.

For example, suppose you were arrested on some charge (say murder). In the US criminal justice system, the null hypothesis is that you are innocent of the crime. The state has the burden to show that this null hypothesis is not likely to be true (guilty beyond a reasonable doubt). If the state does that then the jury rejects the null hypothesis and accepts the alternate hypothesis, that is that you are guilty. Thus the state has to show that you are not innocent, in order to reject the null hypothesis. It is similar in statistics. Your statistical test must show that the two items are different.

In the trial, if the null hypothesis is not rejected, your innocence has not been proven. It is just that the state failed to support your guilt. You are never proven innocent in terms of the trial; the state simply failed to show your guilt. The same is true in statistics: if you fail to show that the two items are different, then you fail to reject the null hypothesis, but you have not proven that the null hypothesis is true. It is correct to say that you failed to reject your null hypothesis, but it is incorrect to say that you accepted your null hypothesis. The last statement implies that you have shown the null hypothesis to be correct, and that is not the case. If you reject your null hypothesis, then you must accept your alternate hypothesis that the two items are different. In the trial, if the jury rejected the null hypothesis of innocence, then they must accept the alternate hypothesis that you are guilty and you must be sent to jail.

Philosophically, it is important to understand these concepts associated with the null and alternate hypothesis. After looking at a statistical test, we will revisit these concepts and examine the types of errors that one can make with statistics.

One sample hypothesis (chapter 6.4)

This is the case when you have a sample and you wish to determine if this sample could have come from a population with a known mean.

Suppose that it is known that the mean life span of horses is 22 years and the population standard deviation (sigma) is 3.8 years. Your family has for years been developing a new breed of horse, and based upon a sample of 25 horses, the mean life span (sample mean) is 24.23 years. Does this new breed of horse have a life span that is different from that of horses in general? The null and alternate hypotheses are these.

Ho: mu of the new breed = 22 years
HA: mu of the new breed =/ 22 years

Some symbols are impossible to put into HTML and thus I will use the following for these:
=/      for not equal to
<=      for less than or equal to
>=      for greater than or equal to
x^2     for x to the power of 2, x squared
mu     for the population mean
sigma     for the population standard deviation

To test this null hypothesis, we will state this a different way. What is the probability of drawing at random a sample of 25 horses whose mean life span deviates from 22 years by at least as much as 24.23 does? If this probability is low (<= 0.05), then we will reject our null hypothesis. The reason that I said "deviates" rather than "is larger" is that we can reject if Xbar is much smaller than 22 as well as if Xbar is much larger than 22.

The first step is to convert Xbar to a Z score (example 6.6, page 81)
Z = (Xbar - mu)/(sigma/square root of n)
Z = (24.23-22)/(3.8/square root of 25) = 2.23/0.76 = 2.93

Thus, we want the quantity P(Z > 2.93) + P(Z < -2.93). This includes Xbar being more deviant by being larger (Z > 2.93) and by being smaller (Z < -2.93).

P(Z > 2.93) + P(Z < -2.93) = 0.0017 + 0.0017 = 0.0034.

We used table B2, because we knew the population standard deviation and thus our standard deviate was distributed as a normal curve. The chance of getting an Xbar this deviant from a population whose mean is 22 and whose standard deviation is 3.8 is less than 0.05, thus we reject our null hypothesis that mu of the new breed is equal to 22. It is unlikely (P = 0.0034) that our sample of 25 horses came from a population whose mean is 22.
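The same calculation can be sketched in a few lines of Python. This is only an illustration, not part of the text: the function name is my own, and it uses the standard normal curve from Python's statistics module in place of table B2.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(xbar, mu, sigma, n):
    """Return (Z, two-tailed P) for a sample mean when sigma is known."""
    z = (xbar - mu) / (sigma / sqrt(n))
    # Area beyond |Z| in one tail of the standard normal curve, doubled
    # because the curve is symmetrical.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# The horse example: Xbar = 24.23, mu = 22, sigma = 3.8, n = 25.
z, p = z_test_two_tailed(xbar=24.23, mu=22, sigma=3.8, n=25)
print(f"Z = {z:.2f}, P = {p:.4f}")
```

Because the code does not round Z to two decimals before finding the tail area, the P value it prints may differ from the table-based 0.0034 in the last digit.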

We are working with the standard normal curve. You might realize that in general, we would reject any null hypothesis for which the area in the two tails beyond our calculated Z score combined is less than or equal to 0.05. Because P(Z > 1.96) + P(Z < -1.96) = 0.025 + 0.025 = 0.05, 1.96 becomes a critical value: when we calculate our Z score, if |Z| >= 1.96, then we reject our null hypothesis. For example, if we had drawn a sample of 16 horses and the Xbar was 24, then our Z score would have been 2.11 and again we would have rejected our null hypothesis.
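As a rough check on the critical value, the short Python sketch below finds the Z that leaves alpha/2 in each tail and applies it to the 16-horse example. The variable names are mine, and the inverse of the normal curve from the statistics module stands in for reading table B2 backwards.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
# Two-tailed test: put alpha/2 in each tail of the standard normal curve.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# The 16-horse example: Xbar = 24, mu = 22, sigma = 3.8, n = 16.
z = (24 - 22) / (3.8 / sqrt(16))
reject = abs(z) >= z_crit
print(f"critical Z = {z_crit:.2f}, calculated Z = {z:.2f}, reject Ho: {reject}")
```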

One-tailed tests (chapter 6.4)

What we did above is called a two-tailed test, that is we will reject our null hypothesis if Xbar is too large OR too small. We can also choose to write what are called one-tailed hypotheses, that is we reject the null hypothesis in one direction only: either only if Xbar is too large, or only if Xbar is too small. For example, in developing our new breed of horse, we may have wanted a horse that had a longer life span. So we are only interested in rejecting the null hypothesis if Xbar is larger than 22 years. We write our null and alternate hypotheses like this.

Ho: mu of the new breed <= 22 years
HA: mu of the new breed > 22 years.

Note that to show that the new breed of horse is longer lived, we must reject a null hypothesis that says that it is not longer lived. To do this, we need to know what value of Z would satisfy the equation P(Z > ??) = 0.05 (the area to reject is totally within the right tail). If we look at table B2, we can see that this value is 1.65. (The true value for Z is between 1.64 and 1.65.) Thus, we will reject our null hypothesis if our calculated Z score is >= 1.65. As our calculated Z score for Xbar = 24.23 years is 2.93, we again reject our null hypothesis.
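To see where the 1.65 comes from, here is a small Python sketch (my own illustration, with the statistics module standing in for table B2) that puts all of alpha in the right tail and also reports the one-tailed P value for the horse example.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
std = NormalDist()

# One-tailed (right tail) test: all of alpha sits in the right tail.
z_crit_right = std.inv_cdf(1 - alpha)

# Horse example: Xbar = 24.23, mu = 22, sigma = 3.8, n = 25.
z = (24.23 - 22) / (3.8 / sqrt(25))
p_right = 1 - std.cdf(z)          # area to the right of the calculated Z
reject = z >= z_crit_right
print(f"critical Z = {z_crit_right:.3f}, Z = {z:.2f}, "
      f"P = {p_right:.4f}, reject Ho: {reject}")
```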

What if we turned our null hypothesis around? Suppose we want to show that our sample comes from a population that has a shorter life span.

Ho: mu of the new breed >= 22 years
HA: mu of the new breed < 22 years

In this case, to show that the new breed of horse has a shorter life span, we attempt to reject a null hypothesis that says that the new breed comes from a population whose mean is equal to or greater than 22. We want to reject the null hypothesis if the Z score falls in the left tail. As the normal curve is symmetrical, we will reject our null hypothesis if our calculated Z score is less than or equal to -1.65. As our calculated Z score is +2.93, we fail to reject our null hypothesis.

When doing two-tailed tests, you can use the absolute value of the Z score to test relative to a critical value. On the other hand, when doing one-tailed tests, it is important to understand in what tail the area to reject lies (left or right tail) and that the sign of the calculated Z score is also important. You cannot use absolute values. It is often helpful to sketch the normal curve so that you can picture the relationship between the area to reject and the calculated Z score (figure 6.5, page 81).
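The three decision rules can be put side by side in a short Python sketch. The function name and the "two", "right", and "left" labels are my own shorthand, not notation from the text. Notice that only the two-tailed rule uses the absolute value.

```python
from statistics import NormalDist

def reject_null(z, alpha=0.05, tail="two"):
    """Decide whether a calculated Z score falls in the area to reject."""
    std = NormalDist()
    if tail == "two":
        # Area to reject is split between both tails; the sign of Z is ignored.
        return abs(z) >= std.inv_cdf(1 - alpha / 2)
    if tail == "right":
        # HA: mu > mu0. Only a large positive Z rejects.
        return z >= std.inv_cdf(1 - alpha)
    if tail == "left":
        # HA: mu < mu0. Only a large negative Z rejects.
        return z <= std.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'right', or 'left'")

z = 2.93  # calculated Z score from the horse example
for tail in ("two", "right", "left"):
    print(f"{tail:>5}-tailed: reject Ho = {reject_null(z, tail=tail)}")
```

With Z = +2.93 the left-tailed rule does not reject, even though |Z| is large, which is why the sign must be kept for one-tailed tests.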

It is also important that you choose to do a one-tailed or two-tailed test before collecting the data and calculating the statistic. You probably noticed that one-tailed tests have lower critical values, making it easier to reject a null hypothesis. You cannot choose a one-tailed test just because it is easier to reject your null hypothesis.

Errors in statistical testing (chapter 6.4)

                                 Conclusion about null hypothesis from statistical test
Truth about null hypothesis      Fail to reject        Reject
True                             correct               type I error
False                            type II error         correct

From the table above (table 6.1, page 83), if the null hypothesis is true and we fail to reject it then we have made a correct decision and if the null hypothesis is false and we rejected it, then we have also made a correct decision.

If the null hypothesis is correct and we reject it, then we have made a type I error (falsely rejecting a true null hypothesis). In effect we have said that the two things are different when they are not. How likely are we to do this? Remember in the above problem we rejected our null hypothesis if we got Z scores that fell into that 0.05 part of the curve in the tails. This value, 0.05, is called our level of significance or alpha. If the new breed of horses did have the same longevity as horses in general, we would still reject our null hypothesis 5% of the time. Thus you can see that the chance of making a type I error is set by our level of significance or alpha. If we lower alpha to say 0.01, we would decrease our chance of making a type I error. If alpha is set at 0.05, then 5 out of every 100 times that we test a null hypothesis that is actually true, we reject it incorrectly, on average. Alpha is our chance of making a type I error.
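A small simulation can make this concrete. The Python sketch below is only an illustration under assumptions of my own: it draws many random samples of 25 horses from a population where the null hypothesis really is true (mean 22, sigma 3.8), runs the two-tailed Z test on each, and counts how often we reject. The number of trials and the random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(mu=22, sigma=3.8, n=25, alpha=0.05,
                      trials=20000, seed=1):
    """Fraction of samples from a TRUE null population that get rejected."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where Ho is true.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (fmean(sample) - mu) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1            # a type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"proportion of true null hypotheses rejected: {rate:.3f}")
```

The proportion should come out close to alpha, and it should drop if you rerun with alpha=0.01.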

If the null hypothesis is false, but we fail to reject the null hypothesis, then we make a type II error. In this case we conclude that there is no difference when there really is one. The probability of making this error is beta. The problem is that for most cases there is no simple way to determine beta. Alpha and beta are also related in that if you decrease alpha (less chance of a type I error) then you increase the chance of making a type II error. Thus there is a trade-off in terms of the probabilities of these errors. For this reason, statisticians have by convention chosen alpha equal to 0.05 as a good trade-off between type I and type II errors. As you gain more understanding of these types of errors, you may have good reasons for choosing a different alpha value to minimize one type of error relative to the other.
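The trade-off can be illustrated with a Python sketch, but only by making an assumption that we normally cannot make: that we know the true population mean. Here I assume, purely for illustration, that the new breed really has a mean life span of 23.5 years. That value and the function name are hypothetical and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(true_mu, mu0=22, sigma=3.8, n=25, alpha=0.05):
    """Beta for a two-tailed Z test, given an ASSUMED true population mean."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # How far the assumed true mean sits from mu0, in standard errors.
    shift = (true_mu - mu0) / (sigma / sqrt(n))
    # We fail to reject whenever Z lands between -z_crit and +z_crit.
    return std.cdf(z_crit - shift) - std.cdf(-z_crit - shift)

# Hypothetical true mean of 23.5 years; lowering alpha raises beta.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(23.5, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}")
```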

If you increase the sample size, you can decrease beta while holding alpha constant. Thus increasing your sampling effort is an excellent (and often only) way to reduce your chance of making an error. This is the reason that your major advisor or project leader may suggest that you need more data, especially if you have been unable to reject a null hypothesis. When you fail to reject a null hypothesis, you do not know if it is because the two populations are really equal or if you have been unable to reject because of type II error.

The quantity (1 - beta) is called power. Power is the probability of rejecting a false null hypothesis. As beta decreases with increasing sample size, it follows that 1 - beta (power) increases with increasing sample size. Thus failing to reject a null hypothesis may be due to low power (low sample size).
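The link between power and sample size can be sketched in Python for the two-tailed Z test. As before, power can only be calculated against a specific assumed true mean; the 23.5 years used here is a hypothetical value chosen for illustration, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def power(n, true_mu=23.5, mu0=22, sigma=3.8, alpha=0.05):
    """Power (1 - beta) of a two-tailed Z test for an ASSUMED true mean."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Larger n shrinks the standard error, which pushes Z further from 0.
    shift = (true_mu - mu0) / (sigma / sqrt(n))
    beta = std.cdf(z_crit - shift) - std.cdf(-z_crit - shift)
    return 1 - beta

# Alpha is held at 0.05 while the sample size grows.
for n in (5, 10, 25, 50, 100):
    print(f"n = {n:3d}  power = {power(n):.3f}")
```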

Assessing normality (chapters 6.5 and 7.14)

Previously, we had discussed two quantities (g1 and g2) that can be calculated that tell us something about how our data are distributed relative to a normal curve. It is important that we be able to assess if our data come from a population whose distribution is normal because many of the statistical tests that we will study later have an underlying assumption of normality. Deviations from normality will affect the power of our tests.

For skewness, we calculated a value called g1. The simplest way to evaluate this number is to compare your calculated value of g1 to the table of critical values in table B22 in the appendix. In chapter 7.14, Zar presents the calculation so that you can use a table of the standard normal curve if table B22 were not available. We will not dwell on these calculations, but you should be aware that there is an alternative way to test the null hypothesis that the sample comes from a population whose distribution is normal.

For kurtosis, we calculated a value called g2. Again the simplest way to evaluate this number is to compare your calculated value of g2 to the tabled values in table B23 in the appendix. In chapter 7.14, Zar presents the calculation so that you can use a table for the standard normal curve if table B23 were not available. Again we will not dwell on these calculations.
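As a reminder of how g1 and g2 might be computed, here is a Python sketch. It uses the usual sample-size-adjusted skewness and kurtosis formulas (k3/s^3 and k4/s^4); I am assuming these match the g1 and g2 we calculated earlier, so check them against the text. The data values are made up for illustration.

```python
def g1_g2(data):
    """Adjusted sample skewness (g1) and kurtosis (g2); needs n >= 4."""
    n = len(data)
    mean = sum(data) / n
    dev = [x - mean for x in data]
    ss2 = sum(d ** 2 for d in dev)   # sum of squared deviations
    ss3 = sum(d ** 3 for d in dev)   # sum of cubed deviations
    ss4 = sum(d ** 4 for d in dev)   # sum of deviations to the 4th power
    s2 = ss2 / (n - 1)               # sample variance
    k3 = n * ss3 / ((n - 1) * (n - 2))
    k4 = (n * (n + 1) * ss4 / (n - 1) - 3 * ss2 ** 2) / ((n - 2) * (n - 3))
    return k3 / s2 ** 1.5, k4 / s2 ** 2

# Hypothetical life spans (years), for illustration only.
life_spans = [19.5, 21.0, 22.4, 23.1, 24.8, 25.2, 26.0, 30.3]
g1, g2 = g1_g2(life_spans)
print(f"g1 = {g1:.3f}, g2 = {g2:.3f}")
```

The calculated values would then be compared with the critical values in tables B22 and B23 for the appropriate n.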

Other methods include the Kolmogorov-Smirnov goodness-of-fit procedure. Essentially the observed distribution is compared to a theoretical normal curve with the same mean and standard deviation as the sample. If the two curves were to match exactly, then the sample would be perfectly consistent with a population whose distribution is normal. To assess how close the fit is between the two curves, you can calculate a D statistic, which is a measure of the distance between the two curves. In general, this test has relatively low power. Even if the sample comes from a population whose distribution is not normal, the K-S D statistic may not allow you to reject the null hypothesis.
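The D statistic itself is easy to sketch in Python. The version below is my own minimal illustration: it measures the largest vertical gap between the cumulative curve of the sample and a normal curve with the sample's mean and standard deviation. It does not compute a P value, which statistical packages handle with their own tables or corrections, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_distance(data):
    """Largest gap between the sample's cumulative curve and a fitted normal."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The sample curve is a step function that jumps from (i-1)/n to i/n
        # at x, so check the gap on both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical data with one extreme value, for illustration only.
sample = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 50.0]
print(f"D = {ks_distance(sample):.3f}")
```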

Recall that when we calculated our basic statistics using SigmaStat, the output included values for g1 and g2.

Skewness (g1) was -0.740. We use table B22 to look up a critical value for alpha(2) = 0.05, n = 37. Note that there is no entry for n = 37, thus we will use the entry for n = 36. This value is 0.780. As this critical value is greater than the absolute value of our calculated value of -0.740, we fail to reject our null hypothesis (Ho: the population distribution is symmetrical or Ho: gamma1 = 0) with a P value greater than 0.05 (0.05 < P < 0.10).

Kurtosis (g2) was 0.896. We use table B23 to look up a critical value for alpha(2) = 0.05, n = 37. Note that there is no entry for n = 37, thus we will use the entry for n = 36. This value is 1.919. Again, as this critical value is greater than the absolute value of our calculated value of 0.896, we fail to reject our null hypothesis (Ho: the population distribution is mesokurtic or Ho: gamma2 = 0) with a P value greater than 0.05 (P > 0.20).

Read through chapter 7.14 to understand one-tailed testing of these hypotheses concerning normality.

In the output from SigmaStat, you also found a value for the K-S Distance (Kolmogorov-Smirnov D statistic). This value was 0.149. Notice also that the P value associated with this number is 0.037. As P is less than 0.05 (remember this is our level of significance), we reject our null hypothesis (Ho: the sample comes from a population whose distribution is normal). In this case, we failed to reject normality based upon the g1 and g2 statistics but we did reject based upon the K-S Distance.

One-sample hypotheses with the variance unknown (chapter 7.1)

In the vast majority of cases, we will not know what the population variance is but will have to estimate it from a sample. For example, in the above example with the horses, we assumed that the population standard deviation was 3.8; however, it is unlikely that we would have known this. In that case 3.8 would have been the sample standard deviation. Now to test the null hypothesis that mu of the new breed of horses equals 22, we would still calculate the quantity (Xbar - 22)/SEM (equation 7.1). However this quantity, as we noted earlier, is not distributed as a normal distribution but has a t-distribution when sigma must be estimated from the sample. Thus we cannot use the standard normal curve to determine a critical value to reject the null hypothesis. Instead we must use table B3, which gives values for the t-distribution, and the quantity given by equation 7.1 is a t value.

To look up a critical value for the t-distribution, we need to know the degrees of freedom, which is n - 1 (25 - 1 = 24). Table B3 lists alpha (the level of significance) across the top, and as we are doing a two-tailed test, we will use the column with alpha(2) values. Along the left side are the degrees of freedom. The critical value is at the intersection of 24 degrees of freedom and alpha(2) = 0.05, which is 2.064.

Now when we calculate our t-value, if the absolute value (we are doing a two-tailed test) is greater than or equal to 2.064 we will reject our null hypothesis and if the absolute value of the calculated t-value is less than 2.064, then we fail to reject our null hypothesis.

t = (Xbar - mu)/(s/square root of n) = (24.23-22)/(3.8/square root of 25) = 2.23/0.76 = 2.93

Here s is the sample standard deviation, which takes the place of sigma.

As our calculated t-value is greater than 2.064, we reject the null hypothesis that the mean longevity of our new breed of horses equals 22 years.
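The t-test steps can be sketched in Python as follows. This is only an illustration with a function name of my own. Python's standard library has no t-distribution, so the critical value 2.064 is simply typed in from table B3 as quoted above.

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """t value for a sample mean when sigma is estimated by s."""
    sem = s / sqrt(n)            # standard error of the mean
    return (xbar - mu0) / sem

n = 25
t = one_sample_t(xbar=24.23, mu0=22, s=3.8, n=n)
df = n - 1                       # degrees of freedom
t_crit = 2.064                   # table B3: alpha(2) = 0.05, 24 df (from the text)
reject = abs(t) >= t_crit        # two-tailed, so use the absolute value
print(f"t = {t:.2f}, df = {df}, critical t = {t_crit}, reject Ho: {reject}")
```

Note that the critical t (2.064) is larger than the critical Z (1.96) for the same alpha, which reflects the extra uncertainty from estimating sigma.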

This is our first introduction to the use of the t-distribution for the testing of hypotheses. This is generally called a t-test and is one of the most common statistical tests used. It should be mentioned that one of the underlying assumptions of the test is that the observations come from a population whose distribution is normal. However, a t-test is robust, in that it still performs quite well, with its power staying high, even when there are moderate deviations from normality. Still, it is important to assess if your data come from a population whose distribution is normal (see above). The robustness of a t-test varies with how the data deviate from normality, and for this reason the g-statistics are very helpful, for they tell you how the population deviates from normality, while the D statistic does not. Be sure to read chapter 7.1 very carefully.

Considerations about a one-tailed test for the mean are essentially the same as they were for the normal distribution (chapter 7.2).

Chapters 7.5 and 7.6 are important in that these chapters give you some insights into how the power of a t-test varies with sample size. I will not cover this nor expect you to know any of these formulas, but you should read through this section paying attention to the general concepts. The rest of the chapter will not be covered in this course.

 Do problem 6.6 at the end of chapter 6 (page 90) and problems 7.1, 7.2, and 7.4 at the end of chapter 7 (pages 120 - 121). Answers:

Last updated on 15 September 2009.
Provide comments to Dwight Moore at mooredwi@emporia.edu.