Chapter 17

In linear regression we are attempting to determine whether a functional relationship exists between two variables (an X and a Y variable). This relationship, at least for now, will be assumed to be a linear cause-and-effect relationship, with the causal variable being X and the affected variable being Y. We are attempting to determine whether a change in X causes a change in Y and, if so, to estimate the mathematical formula that describes that relationship.

In general the formula for the relationship between X and Y will take the form of

Y = alpha + beta * X

where alpha is the population Y-intercept and beta is the population slope.

To estimate the population parameters that define the relationship between X and Y, we need the "

We will now follow through a problem.

This problem comes from Sokal and Rohlf's textbook, and you should compare it to the worked problem in your textbook.

X_{i} (percent humidity) | Y_{i} (weight loss in flour beetles) | (X_{i} - Xbar)^2 | (Y_{i} - Ybar)^2 | (X_{i} - Xbar)(Y_{i} - Ybar) |
---|---|---|---|---|
0 | 8.98 | 2539.1521 | 8.7498 | -149.0536 |
12 | 8.14 | 1473.7921 | 4.4859 | -81.3100 |
29.5 | 6.67 | | | |
43 | 6.08 | | | |
53 | 5.90 | | | |
62.5 | 5.83 | | | |
75.5 | 4.68 | | | |
85 | 4.20 | | | |
93 | 3.72 | | | |

To calculate this line, we will need the sum of squares of X, the sum of squares of Y, and a new term called the sum of the crossproducts, which is the sum of (X_{i} - Xbar)(Y_{i} - Ybar) over all points.

The sum of the crossproducts is now used to calculate the slope (b): divide the sum of the crossproducts by the sum of squares of X.
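These quantities can be computed directly from the nine data points. The following is a minimal sketch in Python, not part of the original worked problem; the variable names (`ss_x`, `ss_y`, `sp_xy`, `b`) are chosen here purely for illustration.

```python
# Data from the table above: percent humidity (X) and weight loss (Y).
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

n = len(X)
x_bar = sum(X) / n  # Xbar
y_bar = sum(Y) / n  # Ybar

# Sum of squares of X, sum of squares of Y, and sum of the crossproducts.
ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

# Slope: sum of the crossproducts divided by the sum of squares of X.
b = sp_xy / ss_x

print(f"Xbar = {x_bar:.2f}, Ybar = {y_bar:.3f}")
print(f"SS X = {ss_x:.4f}, SS Y = {ss_y:.4f}, crossproducts = {sp_xy:.4f}")
print(f"slope b = {b:.5f}")
```

Because the means are carried at full precision here, individual terms may differ in the last digit from the table, which uses the rounded means 50.39 and 6.022.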

We can now use the slope to determine the equation of the best fit line through our observed set of points. The point (Xbar, Ybar) lies on the best fit line. As we now know the slope and a point on the line, we can determine the equation for the line. We take the general equation for a line Y = a + bX and substitute our value of Ybar for Y, Xbar for X, and our calculated slope for b. Then using algebra, we solve for a, which is the Y-intercept (study example 17.2 and equations 17.5 through 17.7).

Y = a + b * X

6.022 = a + -0.05322 * 50.39

a = 6.022 - (-0.05322 * 50.39)

a = 8.7038

Thus the equation of the best fit line through our set of 9 points is

Yhat = 8.7038 - 0.05322 * X
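The algebra for the intercept can be checked with a few lines of Python. This is only a sketch that reuses the rounded values from the worked problem; the helper name `predict` is hypothetical.

```python
# Rounded values taken from the worked problem above.
x_bar = 50.39   # mean percent humidity (Xbar)
y_bar = 6.022   # mean weight loss (Ybar)
b = -0.05322    # slope

# The point (Xbar, Ybar) lies on the line, so Ybar = a + b * Xbar.
a = y_bar - b * x_bar


def predict(x):
    """Return Yhat for a given percent humidity x (hypothetical helper)."""
    return a + b * x


print(f"a = {a:.4f}")
print(f"Yhat at X = 12: {predict(12):.4f}")
```

Substituting Xbar back into `predict` returns Ybar, which is a quick check that the intercept was solved correctly.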

Once we have the regression equation, we can use the equation to predict values of Y given a value of X. We do this by simply substituting values of X into our equation and solving for Y.

X_{i} (percent humidity) | Y_{i} (weight loss in flour beetles) | Yhat_{i} | (Y_{i} - Yhat_{i})^2 |
---|---|---|---|
0 | 8.98 | 8.7038 | 0.0763 |
12 | 8.14 | 8.0652 | 0.0056 |
29.5 | 6.67 | 7.1338 | |
43 | 6.08 | 6.4153 | |
53 | 5.90 | 5.8831 | |
62.5 | 5.83 | 5.3776 | |
75.5 | 4.68 | 4.6857 | |
85 | 4.20 | 4.1801 | |
93 | 3.72 | 3.7543 | |

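The sum of (Y_{i} - Yhat_{i})^2 over all nine points is the residual sum of squares, which is 0.6161 for this example.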
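The predicted values and squared deviations in the table can be reproduced as follows. This is a sketch under the assumption that the line is fitted by the crossproducts method described earlier; the names `y_hat`, `sq_resid`, and `ss_resid` are illustrative only.

```python
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Fit the line: slope from crossproducts, intercept from (Xbar, Ybar).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar

# Predicted value and squared deviation for each observation.
y_hat = [a + b * x for x in X]
sq_resid = [(y - yh) ** 2 for y, yh in zip(Y, y_hat)]
ss_resid = sum(sq_resid)

for x, y, yh, r in zip(X, Y, y_hat, sq_resid):
    print(f"{x:5.1f} | {y:.2f} | {yh:.4f} | {r:.4f}")
print(f"sum of squared deviations = {ss_resid:.4f}")
```

Small differences from the table in the fourth decimal place are expected, since the table was built from the rounded slope and intercept.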

There are three assumptions that are made about linear regression:

1) The Xs are fixed and measured without error.

2) For any given X, the Ys are independent and their population is normally distributed.

3) The variances for the populations of Ys that are associated with the different Xs are equal.

To determine whether there is a significant relationship between X and Y, we must test the null hypothesis that the population slope (beta) is equal to zero. If we fail to reject this null hypothesis, then we conclude that we do not have a significant relationship between X and Y. However, if we reject this null hypothesis and show that beta is not equal to zero, then we conclude that there is a significant relationship between X and Y. There are two ways to test this null hypothesis: with an analysis of variance (ANOVA) procedure or with a t-test. For the null hypothesis that beta equals zero, the two tests yield exactly the same conclusion with the same power. However, t-tests can be used to test one-tailed hypotheses, whereas ANOVA cannot.

To use ANOVA testing we must calculate three sums of squares that define the amount and source of variation among the Ys: the total sum of squares, the regression sum of squares, and the residual sum of squares.

source of variation | sum of squares | degrees of freedom | mean square | F-value | P-value |
---|---|---|---|---|---|
total | 24.1306 | n - 1 = 8 | 3.0163 | | |
regression | 23.5145 | 1 | 23.5145 | 267.18 | P < 0.001 |
residual | 0.6161 | n - 2 = 7 | 0.08801 | | |

To calculate the F-value, you divide the regression mean square by the residual mean square. This calculated F-value is then compared to the one-tailed critical value of F with 1 and n - 2 degrees of freedom. If the calculated F-value is greater than or equal to the tabled value, then you reject your null hypothesis.
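The ANOVA table can be rebuilt from the raw data with the sketch below. It assumes the standard identities that the regression sum of squares equals the squared sum of crossproducts divided by the sum of squares of X, and that the residual sum of squares is the total minus the regression sum of squares. The critical value 5.59 is the tabled one-tailed F for alpha = 0.05 with 1 and 7 degrees of freedom, entered by hand from a standard F table rather than computed.

```python
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

# The three sums of squares.
ss_total = sum((y - y_bar) ** 2 for y in Y)  # n - 1 degrees of freedom
ss_regression = sp_xy ** 2 / ss_x            # 1 degree of freedom
ss_residual = ss_total - ss_regression       # n - 2 degrees of freedom

# Mean squares and the F-value.
ms_regression = ss_regression / 1
ms_residual = ss_residual / (n - 2)
f_value = ms_regression / ms_residual

# Assumption: tabled critical F for alpha = 0.05 with 1 and 7 df.
F_CRITICAL = 5.59
reject_null = f_value >= F_CRITICAL

print(f"total      SS = {ss_total:.4f}")
print(f"regression SS = {ss_regression:.4f}, MS = {ms_regression:.4f}")
print(f"residual   SS = {ss_residual:.4f}, MS = {ms_residual:.5f}")
print(f"F = {f_value:.2f}, reject null hypothesis: {reject_null}")
```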

Two other quantities need to be mentioned as both are useful values.

The square root of the residual mean square is called the standard error of estimate.
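As a quick numerical illustration, the square root of the residual mean square from the ANOVA table can be taken directly. The variable name `s_yx` is an illustrative choice, not notation from the text.

```python
import math

# Residual mean square from the ANOVA table above.
ms_residual = 0.08801

# Square root of the residual mean square, in the same units as Y.
s_yx = math.sqrt(ms_residual)

print(f"square root of residual mean square = {s_yx:.4f}")
```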

The general formula for a t-test is given by equation 17.18. The first quantity that we need to determine is the standard error of the slope.

The advantage of using a t-test is that you can test one-tailed hypotheses and you can test a given slope against nonzero values. For example, you could test the null hypothesis that beta is greater than or equal to 4.
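A sketch of the t-test for the slope follows. It assumes the usual forms, which are not spelled out in these notes: the standard error of the slope is the square root of the residual mean square divided by the sum of squares of X, and t is the difference between the sample slope and the hypothesized slope divided by that standard error. The function name `t_statistic` and the parameter `beta_0` are hypothetical, and the critical value 2.365 is the tabled two-tailed t for alpha = 0.05 with 7 degrees of freedom.

```python
import math

X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
n = len(X)
x_bar = sum(X) / n
ss_x = sum((x - x_bar) ** 2 for x in X)

b = -0.05322           # slope from the worked problem
ms_residual = 0.08801  # residual mean square from the ANOVA table

# Assumed form of the standard error of the slope.
s_b = math.sqrt(ms_residual / ss_x)


def t_statistic(b, s_b, beta_0=0.0):
    """t for the null hypothesis that the population slope equals beta_0."""
    return (b - beta_0) / s_b


t = t_statistic(b, s_b)

# Assumption: tabled two-tailed critical t for alpha = 0.05 with n - 2 = 7 df.
T_CRITICAL = 2.365
reject_null = abs(t) >= T_CRITICAL

print(f"s_b = {s_b:.6f}, t = {t:.3f}, reject null hypothesis: {reject_null}")
```

Squaring `t` gives a value close to the F-value in the ANOVA table, which reflects the equivalence of the two tests when the hypothesized slope is zero. Passing a nonzero `beta_0` tests the slope against that value instead.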

We worked with confidence intervals earlier and the basic formula is the same. To begin, we need an estimate of the standard error of the quantity that we want to estimate. To calculate a 95% confidence interval for the slope we simply use the standard error of the slope found when we calculated a t-value for significance testing. This quantity is then used in equations 17.24 and 17.25 to calculate L1 and L2, the lower and upper confidence limits.
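The interval for the slope can be sketched as the slope plus or minus the critical t times the standard error of the slope. The code below assumes that form, and assumes the standard error of the slope is the square root of the residual mean square divided by the sum of squares of X. The names `lower` and `upper` are illustrative, and 2.365 is the tabled two-tailed t for alpha = 0.05 with 7 degrees of freedom.

```python
import math

X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
n = len(X)
x_bar = sum(X) / n
ss_x = sum((x - x_bar) ** 2 for x in X)

b = -0.05322           # slope from the worked problem
ms_residual = 0.08801  # residual mean square from the ANOVA table
s_b = math.sqrt(ms_residual / ss_x)  # assumed form of the standard error

# Assumption: tabled two-tailed critical t for alpha = 0.05 with n - 2 = 7 df.
T_CRITICAL = 2.365

lower = b - T_CRITICAL * s_b  # lower 95% confidence limit
upper = b + T_CRITICAL * s_b  # upper 95% confidence limit

print(f"95% confidence interval for the slope: {lower:.5f} to {upper:.5f}")
```

Both limits are negative, so the interval excludes zero, which agrees with rejecting the null hypothesis that beta equals zero.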

We can also calculate a 95% confidence interval for each value of Yhat given X. In this case the
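One common way to compute such an interval is sketched below. It assumes the standard formula for the standard error of the mean predicted Y at a given X, namely the square root of the residual mean square times (1/n + (X - Xbar)^2 / sum of squares of X). That formula is an assumption brought in here, not taken from these notes, and it describes the mean of Y at X rather than a single new observation. The function name `mean_ci` is hypothetical.

```python
import math

X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = sp_xy / ss_x
a = y_bar - b * x_bar
ms_residual = (ss_y - sp_xy ** 2 / ss_x) / (n - 2)

# Assumption: tabled two-tailed critical t for alpha = 0.05 with n - 2 = 7 df.
T_CRITICAL = 2.365


def mean_ci(x):
    """95% confidence limits for the mean Y at humidity x (assumed formula)."""
    y_hat = a + b * x
    s_yhat = math.sqrt(ms_residual * (1 / n + (x - x_bar) ** 2 / ss_x))
    return y_hat - T_CRITICAL * s_yhat, y_hat + T_CRITICAL * s_yhat


for x in (0, x_bar, 93):
    lo, hi = mean_ci(x)
    print(f"X = {x:6.2f}: {lo:.4f} to {hi:.4f}")
```

Under this formula the interval is narrowest at Xbar and widens as X moves toward either end of the observed range.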

Use SigmaStat to run Simple Linear Regression

Last updated on 18 November 1999.