CHAPTER 2

Simple Regression

In the last chapter, we looked at correlation analysis where we measured the linear relationship between pairs of variables. In this chapter we will use simple regression to develop a model describing the relationship between a single dependent variable and a single independent variable. This type of model is not frequently used in business, but understanding simple regression is a good way to begin our discussion of multiple regression.

Recall from high school algebra that the equation for a straight line can be written as follows:

Straight Line

Y = mX + b

In this equation, m is the slope of the line and b is the Y-intercept, or just intercept for short. Simple regression is a process for fitting a line to a set of points, so the result of that process is a value for the slope and a value for the intercept of that line. Regression uses different symbols for the slope and intercept and writes the intercept first, giving the following equation:

Simple Regression Equation

Y = β0 + β1X

where β0 represents the intercept and β1 represents the slope. These are population symbols; the sample symbols are b0 and b1, respectively.

The assumptions of regression are discussed in full in the next chapter, but one of the assumptions of regression is that the data set consists of a random sample of pairs of X and Y variables from a population of all possible pairs of values. Because any sampling involves error, an error term is often added to the equation:

Simple Regression Equation With Error Term

Y = β0 + β1X + ε

Although ε simply stands for the error, it is a population quantity that cannot be observed directly. Also note that it is common to refer to the estimated error term as the residual.

Naturally, we wish to estimate the regression equation using sample data, which is written in the following equation:

Sample Simple Regression Equation With Error Term

Y = b0 + b1X + e

where b0 estimates β0, b1 estimates β1, and e represents the observed error—the leftover, or residual—from fitting the regression line to a specific set of data. This equation can be written with the subscript “i” to represent the specific data points:

Sample Simple Regression Equation for Specific Data Points

Yi = b0 + b1Xi + ei

where i goes from 1 to n and e1 is the distance between the line and the first observed point, e2 is the distance between the line and the second observed point, and so on. When used to estimate values, the equation is written as follows:

Sample Simple Regression Equation for Estimates

Ŷi = b0 + b1Xi

Here, Ŷ (pronounced “y-hat”) is the value of the dependent variable Y that lies on the fitted regression line at point X. That is, Ŷ1 is the result of the equation for X1, Ŷ2 is the value for X2, and so on.1

Figure 2.1 shows a scatterplot for two variables. It also shows three different lines that might be drawn to fit the data. Two have a positive slope and appear to be almost parallel. The other has a negative slope. It is the job of regression to examine all possible lines and to choose the single line that best fits the data.

Figure 2.1 More than one line can be used to fit this data

The single line that best fits the data is chosen according to a specific criterion. To illustrate this, Figure 2.2 shows a simplified set of data points with a single line under consideration. For each point, the vertical distance between the point and the line under consideration is measured. Distances to points above the line are positive, whereas distances to points below the line are negative. To keep the positive and negative distances from canceling out, the distances are squared, which also makes them all positive. The squared distances are then added, and the total is called a sum of squares. The line with the smallest sum of squares is the line selected. This procedure is called least squares regression.

Figure 2.2 Measuring the vertical distance between the point and the line under consideration

Naturally, the techniques we will be using do not require you to actually test every possible line. After all, there are an infinite number of potential lines to consider. The mathematical development of the formulas used to calculate the regression coefficients guarantees that the sum of squares will be minimized. However, it is nice to know the background of the formulas.

We begin by noting that the sum of squares we just mentioned is the sum of squared errors (or SSE) because it represents the mistake, or error, in the estimate of the line. The following is the formula for SSE:

Sum of Squared Errors
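
Written out in the notation above, with Ŷi = b0 + b1Xi as the fitted value for the ith observation, the sum is

SSE = Σ(Yi – Ŷi)² = Σ(Yi – b0 – b1Xi)²

where the summation runs from i = 1 to n.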

Using calculus, we can then take the partial derivatives of SSE with respect to b0 and b1 and, because we wish to minimize SSE, we set them equal to zero. This yields the normal equations

Normal Regression Equations

ΣYi = nb0 + b1ΣXi

and

ΣXiYi = b0ΣXi + b1ΣXi²

Solving for b0 and b1, and rewriting the equations, we obtain the following equations:

Regression Coefficients2
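
In terms of the sample totals, the solutions to the normal equations work out to

b1 = [nΣXiYi – (ΣXi)(ΣYi)] / [nΣXi² – (ΣXi)²]

and

b0 = (ΣYi – b1ΣXi) / n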

A few notes are in order regarding these formulas:

    •  As with the correlation coefficient, these formulas only compute sample estimates of population parameters.

    •  Unlike the correlation coefficient, b0 and b1 can take on any value between negative infinity (–∞) and positive infinity (+∞).

    •  It is important not to read too much into the relative magnitudes of these coefficients. The magnitude is a function of the units used to measure the data. Measure sales in dollars and the coefficients will have one magnitude, measure those same sales in millions of dollars and the coefficients will have a very different magnitude.

    •  As with the correlation coefficient, except for n, only totals are used in this formula. As a result, the calculations will be very similar to the calculations of the correlation coefficient.

    •  These are point estimates of β0 and β1, and these estimates have variances. This, along with the assumption of normality, allows us to develop confidence interval estimates and to perform hypothesis testing on them.

    •  These formulas always result in the line going through the point (X̄, Ȳ), the means of the X and Y values.

Normally, you would never perform regression on a data set with an insignificant correlation coefficient. Correlation tests to see if a linear relationship exists and then regression quantifies that relationship. If the hypothesis test of the correlation coefficient indicates that there is no correlation, then there is no correlation for regression to quantify. Nevertheless, we will continue with one of the examples described in chapter 1 because the calculations are fairly straightforward and the sample size was small. Additionally, we will show how to perform the calculations by hand, although these are easily performed using Excel or a statistics package, so there would rarely be an occasion to perform these hand calculations.

Age and Tag Numbers

In Table 1.2, we showed the ages of seven people and the last two digits of their tag numbers. A chart of this data was shown in Figure 1.2. Table 1.6 (repeated here) gave us the data needed to compute the correlation coefficient of 0.4157. Hypothesis testing would then show this correlation coefficient to be insignificant. Table 1.6 also gave us the data we need to compute the slope and intercept of the regression equation. Those regression calculations are shown here as well. Unlike correlation, in regression it does matter which variable is treated as the dependent variable and which as the independent variable. In this case, it does not make intuitive sense to say that age influences tag numbers or that tag numbers influence age, so we will use Age as the independent (X) variable and Tag Number as the dependent (Y) variable.

Table 1.6 Correlation Coefficient Calculations

Computing the Intercept

Computing the Slope

So the regression equation is given by the line Ŷ = 8.1149 + 0.7350X.
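
These hand calculations are easy to verify with a short script. The sketch below is a minimal example in Python with NumPy; the function and variable names are our own, and the seven ages and tag numbers from Table 1.2 are not reproduced here, so you would substitute the actual values. Applied to that data, it should return an intercept of roughly 8.1149 and a slope of roughly 0.7350.

```python
import numpy as np

def simple_regression(x, y):
    """Return (b0, b1) for a least squares simple regression, using only n and the totals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy = (x * y).sum()
    sum_x2 = (x ** 2).sum()
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

# Example usage (substitute the actual values from Table 1.2):
# ages = [...]   # independent variable X
# tags = [...]   # dependent variable Y
# b0, b1 = simple_regression(ages, tags)   # expect roughly 8.1149 and 0.7350
```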

Using Excel

For the dynamic approach, Excel offers matrix functions that can be used to calculate the regression coefficients and a few other pieces of information, but the information reported by Excel using the static approach is so much more extensive that it makes little sense to calculate regression any other way in Excel. We will use the previous example to illustrate how to use Excel to perform regression calculations.

Regression is performed using the Analysis ToolPak. You may need to install this, as described in chapter 1. After doing this, the first step is to load the data file containing the data on which you wish to perform regression, TagNumber.xls in this case. The data will have to be entered in column format. Whereas the data in TagNumber.xls are side by side, it is not necessary to have the dependent variable in a column next to the independent variable, although in practice it is a good idea. In the next chapter, we will be working with multiple independent variables, and Excel does require that all the independent variables be located side by side.

Once the worksheet is loaded, you click on the Data tab and then Data Analysis. This brings up the Data Analysis dialog box shown back in Figure 1.15. This time, select Regression from this list. This brings up the Regression dialog box shown in Figure 2.3. You use this dialog box to feed the data into Excel, set options, and control what output you get and where the output is placed.

Figure 2.3 The Regression dialog box

The Input Y Range is the range of cells containing the dependent variable, or B1 to B8 in this case. This range includes the column label in row 1, so we will have to include the label for the other variable as well, and we will have to check the Labels box. Including labels is a good idea as it makes the printout easier to read. There is no need to remember the range for the data; you can click on the arrow to the right side of the input box and highlight the range manually. The Input X Range is the range of cells containing the independent variable, or A1 to A8 in this case.

You must tell Excel where to put the output. Regression generates a lot of output, so it is always best to use a separate worksheet page for the results. In this example, we have given that new page the name “Regression,” although you can give it any valid name.

Once everything is entered correctly in the Regression dialog box, you click on OK to run the regression. Excel performs the calculations and places the results in the new worksheet page, as specified. Those results are shown in Figure 2.4. As you can see, the results are not all that readable. Some of the labels and numbers are large and Excel does not automatically change the column width to accommodate this wider information. In addition, Excel does not format the numbers in a readable format. While the data are still highlighted, we can adjust the column widths by clicking on the Home tab, in the Cells group clicking on Format, and under Cell Size, clicking on AutoFit Column Width. You can also format the numbers to a reasonable number of decimal points. Those results are shown in Figure 2.5. Of course, adjusting the column widths would affect any other data that might be included on this worksheet page. This is yet another reason for placing the regression output on a new page.

Figure 2.4 Initially, the results of an Excel regression run are jumbled together

 

Reading an Excel Simple Regression Printout

Figure 2.6 shows Figure 2.5 with reference numbers added. We will be referring to these reference numbers in this discussion. The reference numbers do not, of course, show up in actual Excel results.

Figure 2.5 The results of an Excel regression run after some formatting

The following list explains each of the numbered captions shown in Figure 2.6.

Figure 2.6 Reading an Excel printout

   1.  Excel shows a 95 percent confidence interval for each coefficient (b0 and b1). We will see how to compute these later in this chapter. For now, notice that each interval is given twice. This is somewhat of a bug in Excel. The beginning dialog box allows you to select any confidence level you like, and Excel will display that level along with the 95 percent level. When you leave the confidence level at the default value of 95 percent, Excel does not compensate and simply shows the same 95 percent interval twice. For the remainder of this book, we will not show this duplicate set of values, as we usually delete these two extra columns from our worksheets.

   2.  In simple regression, the multiple r is the same as the correlation coefficient. This will not be the case with multiple regression.

   3.  R squared is the r value squared. This is true in both simple and multiple regression. R squared has a very specific meaning. It is the percentage of the variation in the dependent variable that is explained by variations in the independent variables. So in this case, variations in the ages of the respondents in the data set explained only 17.3 percent of the variation in the tag numbers. As expected, that is not a very good showing. Because there is really no relationship between these two variables, even this small value is only due to the small sample size and sampling error.

   4.  Significance F is the p-value for testing the overall significance of the model. In simple regression, this will always yield the same results as a two-tailed significance test on the correlation coefficient, so it can be ignored in simple regression. (If the correlation coefficient is significant, then the overall model is significant. If the correlation coefficient is not significant, then the overall model is not significant.) Whereas this can be ignored in simple regression, it will become a very important measure in multiple regression.

   5.  This is the intercept coefficient.

   6.  This is the slope coefficient.

More of the values shown on an Excel printout will be discussed later in this chapter and in chapter 3.
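
Excel is not the only way to produce this printout. As a rough sketch, assuming Python with the statsmodels package is available (the names age and tag stand in for the two columns of TagNumber.xls and are our own), the same quantities discussed above can be pulled from an ordinary least squares fit:

```python
import numpy as np
import statsmodels.api as sm

def excel_style_summary(x, y):
    """Fit Y on X by ordinary least squares and print the main values an Excel printout reports."""
    X = sm.add_constant(np.asarray(x, dtype=float))    # adds the intercept column
    results = sm.OLS(np.asarray(y, dtype=float), X).fit()
    print("Multiple R:     ", np.sqrt(results.rsquared))
    print("R squared:      ", results.rsquared)
    print("Significance F: ", results.f_pvalue)         # overall model p-value
    print("Coefficients:   ", results.params)           # intercept first, then slope
    print("95% intervals:\n", results.conf_int(alpha=0.05))
    return results

# excel_style_summary(age, tag)   # age and tag hold the two columns of TagNumber.xls
```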

Using SPSS

We saw the car tag data in SPSS back in Figure 1.11. To perform simple regression, you click on Analyze, Regression, and then Linear. That brings up the dialog box shown in Figure 2.7. From here, you click on the age variable and the arrow to move it into the Independent(s) box and you click on the tag number variable and the arrow to move it into the Dependent box. Once you have done this, the OK button will change from gray to black and you click on it to run the simple regression. None of the other options needs to be set for this scenario.

Figure 2.7 The Simple Regression dialog box in SPSS

Figure 2.8 shows the results of running simple regression on the car tag data in SPSS. The left side of the screen is used to navigate between sections of the results. While it is not useful here, it can be very useful when working with large models or multiple scenarios. To move to a new section of the results, just click on that section.

Figure 2.8 The SPSS output of running a simple regression

Currently, the Title is “Regression,” but you can double-click on it and change it to something meaningful like “Car Tag Simple Regression.” This simple regression run has no Notes. The Variables Entered/Removed is not meaningful here but is useful in complex model building. The Model Summary area gives us the measures of model quality, r, r2, adjusted r2, and the standard error of the estimate. Both r and r2 have already been discussed. Adjusted r2 and the standard error of the estimate will be discussed later in this chapter.

The ANOVA3 section is next and measures the statistical significance of the model. The Coefficients section gives the slope and intercept for the model along with measures of their statistical significance.

While Excel and SPSS present the results very differently, they both present the same results, at least within rounding differences. That is to be expected. Both tools do an excellent job of simplifying the calculation of regression, both simple regression as we are calculating here and multiple regression as we will calculate in the next chapter. However, what we will find is that when the models get to be more complex, a statistical package like SPSS has some very real advantages over a general-purpose tool like Excel.

More on the Regression Equation

Look back at the regression equation for the car tag example:

Ŷ = 8.1149 + 0.7350X.

What exactly does this mean? We can see this visually represented in Figure 2.9. In this chart, the original data points are shown as dots with the regression overlaid as a line. Notice that the line slopes up. This is expected from the positive slope of 0.7350. Notice that the regression line crosses the Y-axis just above the X-axis. This, too, is to be expected from the Y-intercept value of 8.1149. Finally, notice how the points are spread out widely and are not close to the line at all. This behavior indicates a low r2, 0.1729 in this case. For a higher r2, the points would be less scattered.

Figure 2.9 A chart of the car tag data and the resulting line

It is important not to read too much into the magnitude of the slope or the Y-intercept, for that matter. Their values are a function of the units used to measure the various variables. Had we measured the ages in months or used the last three digits of the car tags, the magnitude of the slope and Y-intercept would be very different. The r and r2, however, would remain the same. Because multiplying one or both variables by a fixed number is a linear transformation, the linear relationship between the variables remains the same. Therefore, the measures of that relationship do not change. Finally, the standard error and ANOVA numbers change because they are measured in the units of the variables. However, the F-value and p-value of ANOVA do not change as they, too, are unaffected by linear transformations.

Federal Civilian Workforce Statistics

We need to explore the simple regression results in more detail, but the car tag example is a poor choice because the results are insignificant. We have only used it so far because the limited number of observations makes it easy to understand and even calculate some of the numbers by hand.

Table 1.4 showed a state-by-state breakdown of the number of federal employees and their average salary for 2007. In chapter 1, we computed the r-value as 0.5350. Whereas that value is fairly low, in testing we found that the correlation was statistically significant. That is important. When the correlation is significant, the simple regression will also be significant. Likewise, when the correlation is insignificant, the simple regression will be insignificant. Note, however, that this does not hold in multiple regression.

We immediately have a problem with this data. For correlation, we did not have to worry about which variable was the independent variable and which was the dependent variable, but it does matter for simple regression. Does salary depend on the number of workers or does the number of workers depend on the salary? Both relationships make sense. A higher salary would attract more workers, whereas having a large number of workers might drive down salaries. For this analysis, we will assume the number of workers is the independent variable.

In your own work, you are unlikely to have this problem. You will define the dependent variable you are looking to explain and then search for one or more independent variables that are likely to help explain that already-defined dependent variable.

The Results

Figure 2.10 shows the results of running a simple regression on the federal civilian workforce data with the number of workers as the independent variable. We will look at what many of the numbers in this printout mean:

Figure 2.10 The Excel simple regression results on the federal civilian workforce data with the number of workers as the independent variable

    •  Multiple R. In simple regression, this is the same as the correlation coefficient. It goes from –1 to +1 and measures the strength of the relationship. The closer the value is to –1 or +1, the stronger the relationship, and the closer it is to 0, the weaker the relationship. This relationship is weak.

    •  R squared. In one respect, this is simply the multiple r2. It goes from 0 to 1. The closer it is to 1, the stronger the relationship, and the closer it is to 0, the weaker the relationship. However, it is also the percentage of the variation in the dependent variable explained by the independent variable. We will see the reason for this later.

    •  Adjusted R squared. This is explained in more detail in chapter 3. Adjusted r2 is not really an issue for simple regression. With multiple regression, r2 goes up when you add new variables even if those variables do not help explain the dependent variable. Adjusted r2 adjusts for this issue so models with different numbers of variables can be compared.

    •  Standard error. This is short for the standard error of Y given X. It measures the variability of the predictions made with the resulting regression model.

    •  Observations. This is simply a count of the number of pairs of observations used in the simple regression calculations.

    •  ANOVA. Most of these values are beyond the scope of this chapter and will not be discussed. Some of these values will be briefly discussed in chapter 3.

    •  Significance F. This is the one critical piece of information in the ANOVA table that we need to discuss. The Significance F value expresses the significance of the overall model as a p-value. Stated very simply, when this value is below 0.05, the overall model is significant.4 Likewise, when this value is above 0.05, the overall model is insignificant. This is not important for simple regression because the significance of the model mirrors the significance of the correlation coefficient, but that relationship will not hold in multiple regression.

    •  Coefficient. This gives the values for the intercept and slope, or 57,909.4559 and 0.1108, respectively, in this model.

    •  Standard error. This standard error is the standard error associated with either the intercept or the slope.

    •  t-stat. This is the calculated t-statistic used to test to see if the intercept and slope are significant.

    •  p-value. When this value is less than 0.05, the corresponding slope or intercept is significant, and when it is greater than 0.05, it is insignificant. As a general rule, we do not test the intercept for significance as it is just an extension of the regression line to the Y-intercept. In simple regression, if the model is significant, the slope will be significant, and if the model is insignificant, the slope will be insignificant.

    •  Lower 95 percent and upper 95 percent. This is a 95 percent confidence interval on the intercept and slope. It is calculated as the coefficient value plus or minus the appropriate t-value (with n – 2 degrees of freedom) times the standard error of that value.

 

Interpretation

So what do these values tell us? The Significance F value of 0.0001 tells us the overall model is significant at α = 0.05. The r2 of 0.2862 tells us that variation in the number of employees explains less than 29 percent of the variation in salary, a very poor showing. The slope of 0.1108 tells us that for every one unit increase in the number of employees, the average salary goes up by 11 cents.

You might be tempted to say that the model is insignificant simply because the slope is only 11 cents; that would be a mistake. When there is little variation in the dependent variable, even a very strong model will have a relatively small slope. Likewise, when there is a large amount of variation in the dependent variable, even a poor model can have a relatively large slope. As a result, you can never judge the strength of the model based on the magnitude of the slope. Additionally, the units used to measure the data will directly affect the magnitude of the slope.

Why does this model do such a poor job? One possible explanation is simply that the number of employees has little or even no impact on salaries, and what correlation we are seeing is being driven by something else. In this case, it is very likely that the cost of living in the individual states is what is driving the salaries, and the size of states is driving the number of employees.

Number of Broilers

Now on to a realistic business example. Whereas realistic simple regression examples in business are few, this next example is actual business data where simple regression works extremely well. Figure 1.1 showed the top 25 broiler-chicken producing states for 2001 by both numbers and pounds, according to the National Chicken Council. The underlying data was shown in Table 1.1. When we explored the correlation in chapter 1, it was very strong at 0.9970. That makes sense: The more broilers a state produces, the greater the total weight of those broilers should be. Additionally, broilers likely weigh about the same from state to state, so this relationship should be very strong.

Figure 2.11 shows the resulting simple regression. Number of broilers (in millions) is the independent variable and pounds liveweight (in millions) is the dependent variable. The Significance F value of 0.0000 tells us the model is significant. The r2 = 0.9940 tells us that variation in the independent variable explains over 99 percent of the variation in the dependent variable.

Figure 2.11 Simple regression on the data for number of broiler chickens

The intercept is –0.2345 or almost 0. The intercept does not always make sense because many times it is nothing more than an extension of the regression line to a Y-axis that may be far away from the actual data. However, in this case you would expect 0 broilers to weigh 0 pounds, so an intercept very near 0 makes perfect sense. The slope of 5.0603 means that an increase of 1 million broilers will increase the pounds liveweight by 5.0603 million. In other words, 1 broiler chicken weighs, on average, about 5 pounds—again exactly what you would expect.

Exploring the Broiler Model Further

This is a great model. The model is significant, it explains most of the variation in the dependent variable, and all the coefficients make perfect sense theoretically. You could not ask anything more of the model. That makes this model a good foundation for some additional discussions. Some of the material in this section is based on confidence intervals. You may want to review material on confidence intervals from a prior statistics course before continuing.

The slope in this model is 5.0603. This is b1, a sample statistic and an estimate of the population parameter β1. That is, we estimate the population slope to be 5.0603 based on this sample data. Had a different sample been used—say, a selection of different states or the same data from another year—then our estimate of the population slope would be different. But how different would it have been? A confidence interval can give us an indication. Recall that you calculate a 95 percent confidence interval using the following formula:

Formula for a Confidence Interval on the Slope
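
In its usual form, the interval is

b1 ± t × s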

The b1 is, of course, the sample value for the slope, or 5.0603 in this example. The t is the Student t-value with α = 0.05 and n – 2 degrees of freedom. Because n = 26 in this example, the degrees of freedom are 26 – 2 = 24, giving us a t-value of 2.0639. The s-value is the standard error, which the printout tells us is 0.0805. The confidence interval is calculated as follows:

Confidence Interval Calculations
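
5.0603 ± (2.0639)(0.0805) = 5.0603 ± 0.1661

or approximately 4.894 to 5.226.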

This is, of course, the same interval shown in the Excel printout. When this interval contains zero, the slope is insignificant. Because the interval above does not contain zero, the slope is significant. In simple regression, this test of model significance will always match the other two tests (i.e., the hypothesis test on the correlation coefficient and the F-test on the regression model) we have already discussed.

Recall our regression equation.

Regression Equation

Ŷ = –0.2345 + 5.0603X.

This is the equation we use to produce a forecast. If you look back at the data, you will see that Georgia produced 1,247.3 million broilers in 2001 with pounds liveweight of 6,236.5. Suppose we wished to estimate how many pounds liveweight Georgia would have produced had the state produced 1,300 million broilers. We simply plug 1,300 in for X in the previous equation, and we see that its pounds liveweight would have increased to 6,578.2.

Forecast for 1,300 Million Broilers

Ŷ = –0.2345 + 5.0603X

= –0.2345 + 5.0603(1,300)

= 6,578.2

But how good a forecast is that? If it ranged from 3,000 to 10,000, then it would not be very useful. On the other hand, if it ranged from only 6,500 to 6,656, then it would be a very useful estimate. The 6,578.2 is a point estimate of a population value. Once again, we can compute a confidence interval to find the 95 percent range.

The formula used for the confidence interval depends on the type of interval you are constructing. You use the first formula shown when you are computing the confidence interval for the average fitted value. That is, the resulting interval would be the interval for the average of all states that produce 1,300 million broilers.

You use the second formula when computing the confidence interval for a single prediction for a new value. Because this confidence interval for the second formula is for a single observation, there is no opportunity for observations to average out, so it results in a wider interval.

Confidence Interval for Average Fitted Value
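
In standard form, this interval is

Ŷ ± t √( MSE [ 1/n + (Xi – X̄)² / Σ(Xi – X̄)² ] )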

Confidence Interval for Predicted Value
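
For a single new observation, the interval is

Ŷ ± t √( MSE [ 1 + 1/n + (Xi – X̄)² / Σ(Xi – X̄)² ] )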

Either equation uses several values Excel (or SPSS) does not give us, plus one value (the mean squared error [MSE]) that we have not yet discussed. The value (Xi – X̄)² is different for each confidence interval because the X-value being forecast is included in the formula. The value Σ(Xi – X̄)² is not provided by Excel or SPSS either, but it is easy to compute in Excel. For this problem, X̄ = 322.5 and Σ(Xi – X̄)² = 3,273,902.4. The latter was computed by subtracting the 322.5 from each observation, squaring the result, and computing the total.

If you look at the ANOVA table in an Excel or SPSS printout, there is a column labeled “MS.” MSE is the bottom number, the one on the residual row. For this example, it is 21,198.57.

This gives us the information needed to compute both intervals. Because the only difference for the second interval is the additional “1+,” we will take a few shortcuts in its calculation.

Confidence Interval for Average Fitted Value
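
6,578.2 ± 2.0639 √( 21,198.57 [ 1/26 + (1,300 – 322.5)² / 3,273,902.4 ] )

= 6,578.2 ± 2.0639 √7,002.3

= 6,578.2 ± 172.7, or roughly 6,405.5 to 6,750.9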

Confidence Interval for Predicted Value
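
6,578.2 ± 2.0639 √( 21,198.57 [ 1 + 1/26 + (1,300 – 322.5)² / 3,273,902.4 ] )

= 6,578.2 ± 2.0639 √28,200.9

= 6,578.2 ± 346.6, or roughly 6,231.6 to 6,924.8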

Whereas SPSS does not give you the values needed to compute these confidence intervals, it will compute them for you. To do this, begin by entering the value of the independent variable you wish to forecast at the bottom of the data set without entering a dependent variable. You can see the 1,300 entered in Figure 2.12. While we are getting only a single prediction here, SPSS can handle as many predictions as you like.

Figure 2.12 Data set up for having SPSS calculate a confidence interval for a predicted value and an average fitted value

Now, begin as before and click on Analyze, Regression, and then Linear, which brings up the dialog box shown in Figure 2.7. From there, click on the Save button. That brings up the dialog box shown in Figure 2.13. As shown in Figure 2.13, we wish to save the Mean and Individual Confidence Intervals, and as always, we will use a 95 percent confidence interval, although SPSS allows you to specify any value you like. You click on Continue to return to the Linear Regression dialog box and continue your regression as before.

Figure 2.13 Telling it to save the confidence intervals in SPSS

In addition to producing the regression results, and adding more data to the output display, SPSS adds four variables to the data file, as shown in Figure 2.14. The variables LMCL_1 and UMCL_1 are the lower and upper limits of the confidence interval for the mean prediction, and LICI_1 and UICI_1 are the lower and upper limits of the confidence interval for a single predicted value.

Figure 2.14 The data file after running regression and saving the intervals

The widths of the confidence intervals are not constant because the value of Xi is included in the equation. In this example, the widths for the average value confidence interval range from a low of 59.2 to a high of 164.5. The confidence interval is narrowest near X̄ and widest at the extreme values. Note that the last line of Figure 2.14 shows the two confidence intervals for 1,300 and the values are the same as we computed earlier.

Some Final Thoughts on Simple Regression

The widespread use of spreadsheets and inexpensive statistical software has made the use of regression analysis both easy and common. This has both positive and negative consequences. On the good side, more people now have access to a very powerful tool for analyzing relationships and performing forecasts. However, it has also caused some people to use regression analysis without understanding it and in situations for which it is not appropriate. To help the reader avoid the pitfalls of using regression inappropriately, we now offer a few suggestions:

   1.  Never use regression, or for that matter any statistical tool, without understanding the underlying data. As we saw in the discussion of causality in chapter 1, it takes a deep understanding of the data and the theory behind it to establish causality. Without either a cause-and-effect relationship or some other theoretical reason for the two variables to move in common, it makes no sense to try to model the variables statistically.

   2.  Never assume a cause-and-effect relationship exists simply because correlation—even a strong correlation—exists between the variables.

   3.  Start off with a scatterplot and then correlation analysis before performing simple regression. There is no point trying to model a weak or nonexistent relationship.

   4.  When using regression to forecast, remember that the further away you go from the range of data used to construct the model, the less reliable the forecast will be.

We will return to these issues in multiple regression. As we will see, all these items will not only still be issues but also be even more complex in multiple regression.
