Linear Regression
Chapter 17

In linear regression we attempt to determine whether a functional relationship exists between two variables (an X and a Y variable). This relationship, at least for now, will be assumed to be a linear cause-and-effect relationship, with the causal variable being X and the affected variable being Y. We are attempting to determine if a change in X causes a change in Y and, if so, to estimate the mathematical formula that describes that relationship.

Regression equation (chapter 17.1)

In general the formula for the relationship between X and Y will take the form Y = alpha + beta * X, where alpha is the Y-intercept and beta is the slope of the line (see figure 17.1). Alpha and beta are population parameters that will likely never be known. In our analysis, we will estimate these with a (the estimate of the population Y-intercept) and b (the estimate of the population slope) (see equations 17.1 and 17.1a).

To estimate the population parameters that define the relationship between X and Y, we need the "best-fit line" through the set of X-Y pairs. You may have done this before by "eyeballing" the line with a ruler, but we want a more precise way of determining it. This "best-fit line" is the line that minimizes the distance of each X-Y pair from the line itself. It is very unlikely that all of the points will fall on a single line, and thus some points will not be on the line. The vertical distance of a point from the "best-fit line" represents the error or unexplained variation in Y. This is the distance from a point on the line (Xi, Yhati) to an observed point with the same value of X (Xi, Yi), that is, Yi - Yhati. If we square each vertical distance to obtain a positive number and then sum these squared deviations, we have a quantity called the residual sum of squares. The best-fit line is the line that minimizes this quantity.

We will now follow through a problem. First, we will calculate the best-fit line through a series of points. Second, we will test if the population slope is equal to zero, using both an F-ratio and a t-test. Third, we will calculate the 95% confidence interval for the slope and then for the line itself. Fourth, we will examine the plot of residuals.

This problem comes from Sokal and Rohlf's textbook and you should compare this to the worked problem in your textbook.
 Xi                  Yi                        (Xi - Xbar)^2   (Yi - Ybar)^2   (Xi - Xbar)(Yi - Ybar)
 (percent humidity)  (weight loss in flour beetles)

  0                  8.98                      2539.1521       8.7498          -149.0536
 12                  8.14                      1473.7921       4.4859           -81.3100
 29.5                6.67
 43                  6.08
 53                  5.90
 62.5                5.83
 75.5                4.68
 85                  4.20
 93                  3.72

Xbar = 50.39
Ybar = 6.022
sum of squares of X = sum (Xi - Xbar)^2 = 8301.3889
sum of squares of Y = sum (Yi - Ybar)^2 = 24.1306
sum of the crossproducts = sum (Xi - Xbar) * (Yi - Ybar) = -441.8176

To calculate this line, we will need the sum of squares of X, the sum of squares of Y, and a new term called the sum of the crossproducts. As the sum of squares of X and the sum of squares of Y are well known to us, I will not cover them here, but will discuss the sum of the crossproducts. To calculate the sum of the crossproducts, you sum across all XY pairs the product (Xi - Xbar) * (Yi - Ybar) (equation 17.2). Note that the sum of the crossproducts can vary from minus infinity to plus infinity. If the relationship is such that as X increases, Y increases, then the sum of the crossproducts will be positive. However, if the relationship is such that as X increases, Y decreases, then the sum of the crossproducts is negative. Just like the sum of squares, there is a machine formula that is better to use than equation 17.2 (see equation 17.3).
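These quantities can be checked with a short Python sketch (not from the text; the variable names and rounding are mine). It uses the machine formulas, which work from raw sums rather than deviations about the means:

```python
# Data from the Sokal and Rohlf example: percent humidity (X) and
# weight loss in flour beetles (Y).
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)

# Machine formulas (cf. equation 17.3): raw sums instead of deviations.
ss_x = sum(x * x for x in X) - sum(X) ** 2 / n
ss_y = sum(y * y for y in Y) - sum(Y) ** 2 / n
crossproducts = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

print(round(ss_x, 4))           # 8301.3889
print(round(ss_y, 4))           # 24.1306
print(round(crossproducts, 4))  # -441.8178 (the text, carrying rounded
                                # means, reports -441.8176)
```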

The sum of the crossproducts is now used to calculate the slope (regression coefficient) of the best-fit line. This is calculated by dividing the sum of the crossproducts by the sum of squares of X (equation 17.4). If the slope (b) is negative then as X increases, Y decreases. If the slope is positive then Y increases as X increases. If the slope is zero, there is no effect of X on Y and thus no relationship between X and Y (see figure 17.3). To calculate b, we divide the sum of the crossproducts by the sum of squares of X (-441.8176/8301.3889), which in this case equals -0.05322.

We can now use the slope to determine the equation of the best fit line through our observed set of points. The point (Xbar, Ybar) lies on the best fit line. As we now know the slope and a point on the line, we can determine the equation for the line. We take the general equation for a line Y = a + bX and substitute our value of Ybar for Y, Xbar for X, and our calculated slope for b. Then using algebra, we solve for a, which is the Y-intercept (study example 17.2 and equations 17.5 through 17.7).

Y = a + b * X
6.022 = a + (-0.05322 * 50.39)
a = 6.022 - (-0.05322 * 50.39)
a = 8.7038
thus the equation of the best-fit line through our set of 9 points is Y = 8.7038 - 0.05322 * X
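The slope and intercept calculations above can be sketched in a few lines of Python (not from the text; the variable names are mine). Carrying full precision gives an intercept of about 8.7040; the text's 8.7038 comes from using the rounded values 6.022, 50.39, and -0.05322:

```python
# Slope and intercept of the best-fit line (cf. equations 17.4-17.7).
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
crossproducts = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = crossproducts / ss_x  # slope: about -0.05322
a = y_bar - b * x_bar     # intercept: the line passes through (Xbar, Ybar)

print(round(b, 5), round(a, 4))
```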

Once we have the regression equation, we can use the equation to predict values of Y given a value of X. We do this by simply substituting values of X into our equation and solving for Y.

 Xi                  Yi                               Yhati     (Yi - Yhati)^2
 (percent humidity)  (weight loss in flour beetles)

  0                  8.98                             8.7038    0.0763
 12                  8.14                             8.0652    0.0056
 29.5                6.67                             7.1338
 43                  6.08                             6.4153
 53                  5.90                             5.8831
 62.5                5.83                             5.3776
 75.5                4.68                             4.6857
 85                  4.20                             4.1801
 93                  3.72                             3.7543

The sum of (Yi - Yhati)^2 is called the residual sum of squares, which equals 0.6160 in this case. When calculating the residual sum of squares it is important to use only the original values of X. However, there is no reason why in the calculation of points on the line you must be restricted to the original Xs. But you should restrict your choice of Xs to the range of the original set of Xs. To use values of X outside the original range is to extrapolate, which is often dangerous because you have no data on the behavior of the relationship of Y with X outside the original range of X. Even though you have assumed that the relationship is linear, it may not be, and even though it appears to be linear within the original range of Xs, it may not be outside that range.
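The Yhat column and the residual sum of squares can be reproduced with a short Python sketch (not from the text; names are mine). Full-precision arithmetic gives 0.6161; the 0.6160 above reflects rounding of the individual Yhat values:

```python
# Predicted values and the residual (unexplained) sum of squares.
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_x
a = y_bar - b * x_bar

y_hat = [a + b * x for x in X]  # points on the fitted line
residual_ss = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))

print(round(residual_ss, 4))  # about 0.6161
```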

Assumptions (chapter 17.2)

There are three assumptions that are made about linear regression:

1) The Xs are fixed and measured without error.
2) For any given X, the Ys are independent and their population is normally distributed.
3) The variances for the populations of Ys that are associated with the different Xs are equal.

Significance (chapter 17.3)

To determine if there is a significant relationship between X and Y, we must test the null hypothesis that the population slope is equal to zero. If we fail to reject this null hypothesis, then we conclude that we do not have a significant relationship between X and Y. However, if we reject this null hypothesis and show that beta is not equal to zero, then we conclude that there is a significant relationship between X and Y. There are two ways to test this null hypothesis: with an analysis of variance procedure or with a t-test. For the null hypothesis that beta equals zero, the two tests yield exactly the same conclusion with the same power. However, t-tests can be used to test one-tailed hypotheses, whereas ANOVA cannot.

To use ANOVA we must calculate three sums of squares that define the amount and source of variation among the Ys. The first is the total sum of squares, the sum of (Yi - Ybar)^2. This is the sum of squares of Y and measures the total variation among the Ys (you already know how to calculate this quantity). The second is the regression sum of squares, the sum of (Yhati - Ybar)^2. This measures the variation in the predicted values of Y given the Xs. This is really the sum of squares for the values of Y that are on the line (Yhat). This is the variation in Y that is due to X, as points lie on the line as a function of X. The third is the residual sum of squares, which we discussed earlier, the sum of (Yi - Yhati)^2. This represents the variation in Y that is not explained by X. It is derived from the vertical deviations of the observed points from the line. These sums of squares can now be arranged in an ANOVA table along with their degrees of freedom.

 source of     sum of     degrees of    mean      F-value   P-value
 variation     squares    freedom       square

 total         24.1306    n - 1 = 8
 regression    23.5145    1             23.5145   267.18    P < 0.001
 residual       0.6161    n - 2 = 7     0.08801

To calculate the F-value, you divide the regression mean square by the residual mean square. This calculated F-value is then compared to a one-tailed value with 1 and n-2 degrees of freedom. If the calculated F-value is greater than or equal to the tabled value, then you reject your null hypothesis.
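The ANOVA table can be built with a short Python sketch (not from the text; names are mine). It uses the identity that the regression sum of squares equals the squared sum of crossproducts divided by the sum of squares of X:

```python
# Partition the total sum of squares into regression (explained) and
# residual (unexplained) components, then form the F-ratio.
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
crossproducts = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

total_ss = sum((y - y_bar) ** 2 for y in Y)
regression_ss = crossproducts ** 2 / ss_x  # variation explained by X
residual_ss = total_ss - regression_ss     # variation left unexplained

regression_ms = regression_ss / 1        # regression has 1 df
residual_ms = residual_ss / (n - 2)      # residual has n - 2 df
F = regression_ms / residual_ms

print(round(F, 2))  # about 267.18
```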

Two other quantities need to be mentioned as both are useful values.

r2 (r squared) is the proportion of the variation in Y that is explained by X. Recall that the total sum of squares is the total variation in Y and the regression sum of squares is the variation in Y that is explained by X. If we divide the regression sum of squares by the total sum of squares, we have a quantity that gives the proportion of the total variation in Y that is explained by X. This quantity is often reported as part of the statistics for regression and is useful in making statements about the usefulness of X in explaining Y. However, you cannot use r squared values for testing the significance of the regression between X and Y. There are cases when a low value for r squared is associated with a significant regression and vice versa. See the discussion on sample size and power for why this might be true.
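For this example the calculation is a one-liner once the sums of squares are in hand (a sketch, not from the text; names are mine):

```python
# r squared = regression SS / total SS, the proportion of variation
# in Y explained by X.
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
crossproducts = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
total_ss = sum((y - y_bar) ** 2 for y in Y)
regression_ss = crossproducts ** 2 / ss_x

r_squared = regression_ss / total_ss
print(round(r_squared, 4))  # about 0.9745
```

So in this example roughly 97% of the variation in weight loss is explained by humidity.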

The square root of the residual mean square is called the standard error of the estimate (sY.X). This quantity is used to determine the significance of a regression when using a t-test. It is also useful as an indication of the accuracy of a prediction of Y as a function of X.

Significance with a t-test.

The general formula for a t-test is given by equation 17.18. The first quantity that we need is the standard error of the slope, found by taking the square root of the residual mean square divided by the sum of squares of X. The t-value is then calculated with equation 17.21.
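The t-test for the slope can be sketched as follows (not from the text; names are mine). For the null hypothesis that beta equals zero, the t-value squared equals the ANOVA F-value, which is why the two tests agree:

```python
# t-test of H0: beta = 0, using the standard error of the slope.
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_x

residual_ms = (ss_y - b ** 2 * ss_x) / (n - 2)
se_slope = (residual_ms / ss_x) ** 0.5  # standard error of the slope

beta_0 = 0                   # hypothesized slope under the null
t = (b - beta_0) / se_slope  # compare to tabled t with n - 2 df

print(round(t, 2))  # about -16.35; t**2 is about 267, the ANOVA F
```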

The advantage of using a t-test is that you can perform one-tailed tests and you can test the slope against nonzero values, for example the null hypothesis that beta is greater than or equal to 4.

Confidence Intervals (chapter 17.4)

We worked with confidence intervals earlier and the basic formula is the same. To begin we need an estimate of the standard error of the quantity that we want to estimate. To calculate a 95% confidence interval for the slope we simply use the standard error of the slope found when we calculated a t-value for significance testing. This quantity is then used in equations 17.24 and 17.25 to calculate L1 and L2, the lower and upper confidence limits.
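A sketch of the slope's confidence interval in Python (not from the text; names are mine, and the critical t-value of 2.365 for a two-tailed test with 7 degrees of freedom is taken from a t-table, so check it against your own table):

```python
# 95% confidence interval for the slope: b plus or minus t * SE(b).
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_x

residual_ms = (ss_y - b ** 2 * ss_x) / (n - 2)
se_slope = (residual_ms / ss_x) ** 0.5

t_crit = 2.365  # two-tailed t, df = n - 2 = 7 (from a t-table)
L1 = b - t_crit * se_slope  # lower confidence limit
L2 = b + t_crit * se_slope  # upper confidence limit

print(round(L1, 4), round(L2, 4))  # about -0.0609 and -0.0455
```

Note that the interval excludes zero, consistent with the significant F- and t-tests above.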

We can also calculate a 95% confidence interval for each value of Yhat given X. In this case the standard error of Yhat varies with the value of X. The standard error for any given Yhat decreases as you approach the center of the range of X values and increases as you move away from the center. If you calculate the 95% confidence interval of Yhat for each value of X and connect the upper limits with one line and the lower limits with another, the two lines encompass the 95% confidence interval for the line (study example 17.5 for the calculations). In figure 17.8, you should notice that the lines connecting the upper and lower limits curve, because the 95% confidence intervals are wider near the ends of the range of X values.
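The widening of the band away from the center can be seen in a short sketch (not from the text; names are mine, and the critical t-value of 2.365 again comes from a t-table). The standard error of Yhat at a given X uses the term (X - Xbar)^2, which is zero at the mean of X and grows toward either end of the range:

```python
# 95% confidence band for the regression line: SE(Yhat) depends on X.
X = [0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_x
a = y_bar - b * x_bar
residual_ms = (ss_y - b ** 2 * ss_x) / (n - 2)

def se_yhat(x):
    # standard error of the predicted mean of Y at a given X
    return (residual_ms * (1 / n + (x - x_bar) ** 2 / ss_x)) ** 0.5

t_crit = 2.365  # two-tailed t, df = n - 2 = 7 (from a t-table)
for x in (0, 50.39, 93):  # an endpoint, near Xbar, the other endpoint
    y_hat = a + b * x
    half_width = t_crit * se_yhat(x)
    print(x, round(y_hat - half_width, 3), round(y_hat + half_width, 3))
```

The interval printed for X = 50.39 (near Xbar) is the narrowest of the three, which is why the band in figure 17.8 curves outward at the ends.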

Use SigmaStat to run Simple Linear Regression

Last updated on 18 November 1999.
Provide comments to Dwight Moore at mooredwi@emporia.edu.
Return to the RDA Home Page at Emporia State University.