4.8 Binary or Dummy Variables

All of the variables we have used in regression examples have been quantitative variables such as sales figures, payroll numbers, square footage, and age. These have all been easily measurable and have had numbers associated with them. There are many times when we believe a qualitative variable rather than a quantitative variable would be helpful in predicting the dependent variable Y. For example, regression may be used to find a relationship between annual income and certain characteristics of the employees. Years of experience at a particular job would be a quantitative variable. However, information regarding whether or not a person has a college degree might also be important. This would not be a measurable value or quantity, so a special variable called a dummy variable (or a binary variable or an indicator variable) would be used. A dummy variable is assigned a value of 1 if a particular condition is met (e.g., a person has a college degree) and a value of 0 otherwise.
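As a minimal sketch of this coding, a yes/no attribute can be turned into a 0/1 dummy variable in Python. The employee records and the name `degree_dummy` below are hypothetical and purely illustrative; they are not data from the text.

```python
# Hypothetical employee records (illustrative values only, not from the text).
employees = [
    {"name": "A", "years_experience": 5, "has_degree": True},
    {"name": "B", "years_experience": 12, "has_degree": False},
    {"name": "C", "years_experience": 3, "has_degree": True},
]


def degree_dummy(has_degree):
    """Return 1 if the person has a college degree, 0 otherwise."""
    return 1 if has_degree else 0


# The dummy column sits alongside the quantitative variable (years of experience).
dummy_column = [degree_dummy(e["has_degree"]) for e in employees]
print(dummy_column)
```

The resulting column of 0s and 1s can then be used as an independent variable in the regression in the same way as a quantitative variable.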

Return to the Jenny Wilson Realty example. Jenny believes that a better model can be developed if the condition of the property is included. To incorporate the condition of the house into the model, Jenny looks at the information available (see Table 4.5) and sees that the three categories are good condition, excellent condition, and mint condition. Since these are not quantitative variables, she must use dummy variables. These are defined as

X3 = 1 if house is in excellent condition
   = 0 otherwise

X4 = 1 if house is in mint condition
   = 0 otherwise

Notice that there is no separate variable for “good” condition. If X3 and X4 are both 0, then the house is in neither excellent nor mint condition, so it must be in good condition. When using dummy variables, the number of variables must be 1 less than the number of categories. In this problem, there are three categories (good, excellent, and mint condition), so we must have two dummy variables. If we mistakenly used as many dummy variables as there are categories, the dummy variables would sum to 1 for every observation and so duplicate the constant term in the model; the mathematical computations then either could not be performed or would not give reliable values.
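The coding rule can be sketched in Python as follows. The list of house conditions is hypothetical (it is not Table 4.5), and the function name `encode_condition` is an assumption made for illustration. The last few lines show why a third dummy variable for “good” would be redundant: the three dummy variables would add up to 1 for every house, which is exactly the column of 1s that already represents the constant term.

```python
CATEGORIES = ("good", "excellent", "mint")  # "good" is the baseline category


def encode_condition(condition):
    """Return (X3, X4) for a house condition; 'good' maps to (0, 0)."""
    if condition not in CATEGORIES:
        raise ValueError(f"unknown condition: {condition!r}")
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4


# Hypothetical conditions for five houses (illustrative only).
conditions = ["good", "mint", "excellent", "good", "mint"]
encoded = [encode_condition(c) for c in conditions]
print(encoded)

# Suppose a third dummy for "good" were (mistakenly) added.
full_dummies = [
    (1 if c == "good" else 0, x3, x4)
    for c, (x3, x4) in zip(conditions, encoded)
]

# Each row sums to 1, so the three dummies together reproduce the
# constant term's column of 1s and carry no additional information.
row_sums = [sum(row) for row in full_dummies]
print(row_sums)
```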

These dummy variables will be used with the two previous variables (X1 = square footage and X2 = age) to try to predict the selling prices of houses for Jenny Wilson. Programs 4.5A and 4.5B provide the Excel input and output for these new data and show how the dummy variables were coded. The significance level for the F test is 0.00017, so this model is statistically significant. The coefficient of determination (r²) is 0.898, so this is a much better model than the previous one. The regression equation is

Ŷ = 121,658 + 56.43X1 − 3,962X2 + 33,162X3 + 47,369X4

This indicates that a house in excellent condition (X3=1, X4=0) would sell for about $33,162 more than a house in good condition (X3=0, X4=0). A house in mint condition (X3=0, X4=1) would sell for about $47,369 more than a house in good condition.
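To make this interpretation concrete, the short Python sketch below evaluates the regression equation for one house under each of the three conditions. It assumes the age coefficient is −3,962 (older houses sell for less), and the house itself (2,000 square feet, 10 years old) is a hypothetical example rather than a row from the data. The function name `predict_price` is also an assumption made for illustration.

```python
def predict_price(square_feet, age_years, x3_excellent, x4_mint):
    """Predicted selling price from the fitted equation.

    Assumes the age coefficient is negative (-3,962).
    """
    return (
        121_658
        + 56.43 * square_feet
        - 3_962 * age_years
        + 33_162 * x3_excellent
        + 47_369 * x4_mint
    )


# Hypothetical house: 2,000 square feet, 10 years old.
sf, age = 2000, 10

good = predict_price(sf, age, 0, 0)       # X3 = 0, X4 = 0
excellent = predict_price(sf, age, 1, 0)  # X3 = 1, X4 = 0
mint = predict_price(sf, age, 0, 1)       # X3 = 0, X4 = 1

# With square footage and age held fixed, the gaps between the
# predictions are the dummy-variable coefficients.
print(round(excellent - good))
print(round(mint - good))
```

Because square footage and age are the same in all three calls, the differences do not depend on the particular house chosen; they are determined entirely by the coefficients of X3 and X4.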

[Screenshot: the Jenny Wilson Realty data in Excel, with columns A: Sell Price, B: SF, and C: Age, followed by the dummy variable columns beginning with X3 in column D.]

Program 4.5A Input Screen for Jenny Wilson Realty Example with Dummy Variables in Excel 2016

[Screenshot: the Excel Summary Output table for the regression with dummy variables.]

Program 4.5B Output Screen for Jenny Wilson Realty Example with Dummy Variables in Excel 2016
