Chapter 6

Regression Algorithms in Data Mining

Regression is a basic statistical tool. In data mining, it is one of the fundamental tools for analysis, used in classification applications through logistic regression and discriminant analysis, as well as in the prediction of continuous data through ordinary least squares (OLS) and other forms. As such, regression is often taught in one (or more) three-hour courses, and we cannot hope to cover all of its basics. Here, we present ways in which regression is used within the context of data mining.

Regression is used on a variety of data types. If the data form a time series, the output of regression models is often used for forecasting; for other types of data, regression can be used to build predictive models. Regression can be applied in a number of different forms, and the class of regression models is a major class of tools available to support the Modeling phase of the data mining process.

Probably the most widely used data mining algorithms are data-fitting methods, in the sense of regression. Regression is a fundamental tool of statistical analysis used to characterize relationships between a dependent variable and one or more independent variables. Regression models can be used for many purposes, including explanation and prediction. Linear and logistic regression models are primary tools in most general-purpose data mining software. Nonlinear data can sometimes be transformed into useful linear data and analyzed with linear regression, and some special forms of nonlinear regression also exist. Neural network models are widely used for the same classes of problems. Both regression and neural network models require that data be expressed numerically (or at least as 0–1 dummy variables). The primary operational difference between regression and neural networks is that regression provides a formula with a strong body of theory behind it for application and interpretation. Neural networks generally do not provide models for interpretation and are usually applied internally within the software that built the model. In this sense, neural networks appear to users as “black boxes” that classify or predict without explanation. There are, of course, models behind these classifications and predictions, but they tend to be so complex that they are neither printed out nor analyzed.

Regression Models

OLS regression is a model of the form:

Y = b0 + b1X1 + b2X2 + … + bnXn + e

where Y is the dependent variable (the one being forecast)

  Xn are the n independent (explanatory) variables

  b0 is the intercept term

  bn are the n coefficients for the independent variables

  e is the error term

OLS regression fits a straight line (with intercept b0 and slope coefficients bn) that minimizes the sum of squared error terms ei over all i observations. The idea is to use past data to determine the b coefficients that fit best. The model then gives the most likely value of the dependent variable, given knowledge of the Xn for future observations. This approach assumes a linear relationship and error terms that are normally distributed around zero without patterns. While these assumptions are often unrealistic, regression is highly attractive because of the existence of widely available computer packages as well as highly developed statistical theory. The statistical packages provide the probability that estimated parameters differ from zero.
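To make this concrete, the following is a minimal sketch of fitting such a model in R (which we use later in this chapter); the data frame here is synthetic and purely illustrative, not one of the chapter datasets.

set.seed(1)
dat <- data.frame(X1 = rnorm(50), X2 = rnorm(50))
dat$Y <- 3 + 2 * dat$X1 - 1 * dat$X2 + rnorm(50)     # a known linear relationship plus noise
fit <- lm(Y ~ X1 + X2, data = dat)                   # estimate b0, b1, b2 by least squares
coef(fit)                                            # fitted intercept and slope coefficients
summary(fit)                                         # standard errors and p-values for each coefficient
predict(fit, newdata = data.frame(X1 = 1, X2 = 0))   # most likely Y for a new observation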

Classical Tests of the Regression Model

The universal test for classification in data mining is the coincidence matrix, which focuses on the ability of the model to categorize the data. For continuous regression output, this requires identification of cutoffs between the classes. Data mining software does not do that automatically, but we will demonstrate how it can be done. There are many other aspects to accuracy, just as there are many applications of different models, especially regression. The classical tests of regression models are based on the assumption that errors are normally distributed around a mean of zero, with no patterns. The basis of regression accuracy is the residuals, the differences between predicted and observed values. Residuals are then extended to a general measure of regression fit, R-squared.
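Continuing the illustrative fit object from the sketch above, the residuals and R-squared are available directly from the fitted model:

resid(fit)                      # residuals: observed Y minus predicted Y
summary(fit)$r.squared          # R-squared: share of variance in Y explained by the model
plot(fitted(fit), resid(fit))   # residual plot; patterns here would violate the assumptions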

Linear Discriminant Analysis

Discriminant analysis groups objects defined by a set of variables into a predetermined set of outcome classes. One example of this type of analysis is the classification of employees by their rated performance within an organization. The bank loan example could be divided into past cases sorted by the two distinct categories of repayment or default. The technical analysis is, thus, determining the combination of variables that best predict membership in one of the given output categories.

A number of methods can be used for discriminant analysis, and regression is one of them. For the two-group case, this requires a cutoff between the groups: if a new observation yields a functional value below the cutoff, the prediction is one group; if the value is above the cutoff, the prediction is the other group. Other techniques can also be used for discriminant analysis.1 A discriminant function can be applied to binary data to separate observations into two groups, with a cutoff limit used to divide the observations.
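A minimal sketch of this cutoff idea, again with synthetic data and an arbitrary cutoff of 0.5, is:

set.seed(2)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$group <- ifelse(train$x1 + train$x2 + rnorm(100) > 0, 1, 0)   # two groups coded 0 and 1
dfit <- lm(group ~ x1 + x2, data = train)      # discriminant function obtained by OLS
cutoff <- 0.5                                  # a cutoff chosen between the two groups
scores <- predict(dfit, newdata = train)
predicted <- ifelse(scores > cutoff, 1, 0)     # above the cutoff -> group 1, below -> group 0
table(Actual = train$group, Predicted = predicted)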

Logistic Regression

Some data of interest in a regression study may be ordinal or nominal. For instance, in our example job application model, sex and college degree are nominal. In the loan application data, the outcome is nominal, while credit rating is ordinal. Since regression analysis requires numerical data, we included them by coding the variables. Here, each of these variables is dichotomous; therefore, we can code them as either 0 or 1 (as we did in the regression model for loan applicants). For example, a male is assigned a code of 0, while a female is assigned a code of 1. The employees with a college degree can be assigned a code of 1, and those without a degree a code of 0.

The purpose of logistic regression is to classify cases into the most likely category. Logistic regression provides a set of b parameters for the intercept (or intercepts, in the case of ordinal data with more than two categories) and the independent variables, which can be applied to a logistic function to estimate the probability of belonging to a specified output class. The formula for the probability that case i belongs to a stated class j is:

P(case i in class j) = 1 / (1 + e^−(b0 + b1X1 + b2X2 + … + bnXn))

where the b coefficients are obtained from logistic regression.

The regression model provides a continuous formula. A cutoff needs to be determined for the value obtained from this formula, given the independent variable values, so that the data are divided into output categories in proportion to the population of cases.
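A sketch of how such a model is applied follows; the coefficients and cases are illustrative only (not from this chapter), and the cutoff is chosen here by a quantile so that the proportion flagged matches an assumed population proportion.

b <- c(b0 = -1.0, b1 = 0.8, b2 = -0.5)                          # illustrative coefficients only
x <- data.frame(X1 = c(0.2, 1.5, -0.3), X2 = c(1.0, 0.4, 2.1))  # illustrative cases
score <- b["b0"] + b["b1"] * x$X1 + b["b2"] * x$X2              # linear part of the model
prob <- 1 / (1 + exp(-score))                                   # logistic probability of the stated class
cutoff <- quantile(prob, 0.8)                  # e.g., flag the top 20 percent of cases
ifelse(prob >= cutoff, "class j", "other")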

Chapter 4 included a set of data for insurance claims, some of which were fraudulent. The independent variables included claimant age, gender, claim amount, number of traffic tickets on record, number of prior claims, and attorney. Here, we simplify the attorney data to focus on attorney Smith, making it a 0–1 variable. Table 6.1 provides 10 observations, reflecting the data given in Table 4.10.


Table 6.1 Insurance claim training data

Age   Gender   Claim   Tickets   Prior   Attorney   Outcome
52    0        2,000   0         1       0          OK
38    0        1,800   0         0       0          OK
21    1        5,600   1         2       1          Fraud
36    1        3,800   0         1       0          OK
19    0        600     2         2       0          OK
41    0        4,200   1         2       1          Fraud
38    0        2,700   0         0       0          OK
33    1        2,500   0         1       0          Fraud
18    1        1,300   0         0       0          OK
26    0        2,600   2         0       0          OK



A logistic regression model was developed for this data using SAS, yielding the model given in Table 6.2. This report gives the model in terms of coefficients for the intercept and each variable (the Estimate column), along with the standard error of each estimate. Since the logistic regression is based on discrete values for some variables, a chi-squared test is conventional for evaluating each model coefficient (given in the Chi-Square column). The evaluation of these coefficients is easiest to understand by viewing the last column, which gives the probability of a random measure being greater than the chi-squared value. If this probability is high, the implication is that the coefficient is not very significant; if it is very low (near zero), the coefficient is significant.


Table 6.2 Logistic regression model for insurance claim data

Parameter   Estimate   Std. error   Chi-square   Pr>ChiSq
Intercept   81.624     309.3        0.0697       0.7918
Age         −2.778     10.4         0.0713       0.7894
Gender      −75.893    246.7        0.0946       0.7584
Claim       0.017      0.055        0.0959       0.7569
Tickets     −36.648    164.5        0.0496       0.8237
Prior       6.914      84.8         0.0067       0.9350
Smith?      −29.361    103.3        0.0809       0.7761



The Estimate column gives the model b coefficients. This model can be applied to the test data set given in Table 4.11 by using the probability formula given earlier. The calculations are shown in Table 6.3.


Table 6.3 Logistic regression model applied to insurance fraud test cases

Age   Gender   Claim   Tickets   Prior   Attorney   Model     Prob    Predict   Actual
23    0        2,800   1         0       0          28.958    1.0     OK        OK
31    1        1,400   0         0       0          −56.453   0.0     Fraud     OK
28    0        4,200   2         3       1          −6.261    0.002   Fraud     Fraud
19    0        2,800   0         1       0          83.632    1.0     OK        OK
41    0        1,600   0         0       0          −4.922    0.007   Fraud     OK
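As an arithmetic check on Table 6.3, the rounded coefficients of Table 6.2 can be applied directly, as sketched below in R; because the published coefficients are rounded, the resulting scores differ slightly from the Model column, which was computed from full-precision estimates.

b <- c(Intercept = 81.624, Age = -2.778, Gender = -75.893, Claim = 0.017,
       Tickets = -36.648, Prior = 6.914, Smith = -29.361)
test <- data.frame(Age     = c(23, 31, 28, 19, 41),
                   Gender  = c(0, 1, 0, 0, 0),
                   Claim   = c(2800, 1400, 4200, 2800, 1600),
                   Tickets = c(1, 0, 2, 0, 0),
                   Prior   = c(0, 0, 3, 1, 0),
                   Smith   = c(0, 0, 1, 0, 0))
score <- b["Intercept"] + as.matrix(test) %*% b[names(test)]   # linear part for each case
prob  <- 1 / (1 + exp(-score))                                 # near 1 -> OK, near 0 -> Fraud
data.frame(score = round(as.vector(score), 3), prob = round(as.vector(prob), 3))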



The coincidence matrix for this set of data is given in Table 6.4.


Table 6.4 Coincidence matrix for insurance fraud data using logistic regression

Actual   Model fraud   Model OK   Total
Fraud    1             0          1
OK       2             2          4
Totals   3             2          5 (0.60 correct)



In this case, the model identified the one actually fraudulent case, but at the expense of overpredicting fraud.

Software Demonstrations

For OLS regression, both SAS and Excel were demonstrated earlier; both obviously provide identical models. The only limitation we perceive in using Excel is that its regression is limited to 16 independent variables. Base SAS has the ability to do OLS regression (as does Excel) as well as logistic regression.

The first 4,000 observations of the insurance fraud dataset were used for training. Standardized scores (between 0 and 1) were used, although continuous data, or even categorical data, could have been used. Standardizing the data transforms it so that scale does not matter; there are reasons to do that if different variables have radically different scales. Regression results should be identical between standardized and original data (standardized data is continuous, just like the original; it is just transformed). As a check, you could run the regression on the original data and see that you get the same R-squared and t statistics, although the coefficients will be different. Categorical data, by contrast, is transformed into a form where detail is lost, so regressions over original continuous data and over categorical data will give different results.
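A minimal sketch of this 0–1 standardization (using a few values from Table 6.1 purely as an illustration) is:

scale01 <- function(x) (x - min(x)) / (max(x) - min(x))       # map a column onto the 0-1 range
raw <- data.frame(Age = c(52, 38, 21, 36), Claim = c(2000, 1800, 5600, 3800))  # values from Table 6.1
standardized <- as.data.frame(lapply(raw, scale01))
standardized      # each column now runs from 0 to 1; only the coefficients change, not the fit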

Regression Discriminant Model: For the regression model over standardized data, the result was as shown in Table 6.5:


Table 6.5 OLS regression output—Insurance fraud data

Summary output

Regression statistics
Multiple R          0.203298
R square            0.04133
Adjusted R square   0.03989
Standard error      0.119118
Observations        4,000

ANOVA
             df      SS         MS         F          Significance F
Regression   6       2.442607   0.407101   28.69096   8.81E-34
Residual     3,993   56.65739   0.014189
Total        3,999   59.1

            Coefficients   Standard error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   0.0081         0.012594         0.643158   0.520158   −0.01659    0.032792
Age         0.001804       0.005147         0.350421   0.726041   −0.00829    0.011894
Gender      −0.00207       0.003772         −0.54928   0.582843   −0.00947    0.005323
Claim       0.007607       0.0191           0.398289   0.690438   −0.02984    0.045054
Tickets     −0.0076        0.004451         −1.70738   0.08783    −0.01633    0.001127
Prior       0.000148       0.004174         0.035408   0.971756   −0.00804    0.008332
Attorney    0.201174       0.018329         10.97554   1.24E-27   0.165238    0.237109



Of course, the same model was obtained with SAS. Only the presence of an attorney (highly significant) and the number of tickets on record (marginally significant) showed any significance. The b coefficients give the discriminant function, for which a cutoff value is needed. We applied the model to the training set and sorted the results. There were 60 fraudulent cases in the training set of 4,000 observations, so a logical cutoff is the 60th largest functional value, which was 0.196197; a cutoff of 0.19615 was used for prediction. This yielded the coincidence matrix shown in Table 6.6.


Table 6.6 Coincidence matrix—OLS regression of insurance fraud test data

Actual   Model fraud   Model OK   Total
Fraud    5             17         22
OK       17            961        978
Totals   22            978        1,000



This model had a correct classification rate of 0.966, which is very good. The model applied to the test data predicted 22 fraudulent cases, and 978 not fraudulent. Of the 22 test cases that the model predicted to be fraudulent, 5 actually were. Therefore, the model would have triggered investigation of 17 cases in the test set, which were not actually fraudulent. Of the 978 test cases that the model predicted to be OK, 17 were actually fraudulent and would not have been investigated. If the cost of an investigation were $500, and the cost of loss were $2,500, this would have an expected cost of $500 × 17 + $2,500 × 17, or $51,000.
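The cutoff logic just described can be sketched as follows; the score vectors here are random stand-ins for the fitted OLS values, since the full dataset is not reproduced in this chapter, so the resulting matrix and cost are illustrative only.

set.seed(3)
train_scores <- runif(4000)            # stand-in for fitted OLS values on the training set
cutoff <- sort(train_scores, decreasing = TRUE)[60]        # 60th largest training value
test_scores <- runif(1000)             # stand-in for fitted values on the 1,000 test observations
test_actual <- sample(c("Fraud", "OK"), 1000, replace = TRUE, prob = c(0.022, 0.978))
predicted <- ifelse(test_scores >= cutoff, "Fraud", "OK")
table(Actual = test_actual, Predicted = predicted)         # coincidence matrix
cost <- 500 * sum(predicted == "Fraud" & test_actual == "OK") +    # investigations of OK claims
        2500 * sum(predicted == "OK" & test_actual == "Fraud")     # missed fraudulent claims
cost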

Centroid Discriminant Model: We can compare this model with a centroid discriminant model. The training set is used to identify the mean variable values for the two outcomes (Fraud and OK) shown in Table 6.7.


Table 6.7 Centroid discriminant function means

Cluster   Age     Gender   Claim   Tickets   Prior   Attorney
OK        0.671   0.497    0.606   0.068     0.090   0.012
Fraud     0.654   0.467    0.540   0.025     0.275   0.217



The squared distance to each of these clusters was applied on the 1,000 test observations, yielding the coincidence matrix shown in Table 6.8.


Table 6.8 Coincidence matrix—centroid discriminant model of insurance fraud test data

Actual   Model fraud   Model OK   Total
Fraud    7             15         22
OK       133           845        978
Totals   140           860        1,000



Here, the correct classification rate was 0.852, quite a bit lower than with the regression model. This model made many more errors in which claims that turned out to be OK were flagged as fraudulent, although it missed two fewer of the actually fraudulent cases. The cost of error here is $500 × 133 + $2,500 × 15, or $104,000.
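A sketch of the nearest-centroid classification, using the group means of Table 6.7 and one hypothetical standardized observation, is:

centroids <- rbind(OK    = c(0.671, 0.497, 0.606, 0.068, 0.090, 0.012),
                   Fraud = c(0.654, 0.467, 0.540, 0.025, 0.275, 0.217))
colnames(centroids) <- c("Age", "Gender", "Claim", "Tickets", "Prior", "Attorney")
obs <- c(Age = 0.60, Gender = 1, Claim = 0.55, Tickets = 0, Prior = 0.25, Attorney = 1)  # hypothetical case
sqdist <- apply(centroids, 1, function(m) sum((obs - m)^2))    # squared distance to each group mean
sqdist
names(which.min(sqdist))                                       # predicted class: the nearer centroid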

Logistic Regression Model: A logistic regression model was run in SAS. The variables Gender and Attorney were 0–1 variables, and thus categorical. The model, based on maximum likelihood estimates, is given in Table 6.9.


Table 6.9 Logistic regression model—Insurance fraud data

Parameter    DF   Estimate   Std. error   Chi-square   Pr>ChiSq
Intercept    1    −2.9821    0.7155       17.3702      <0.0001
Age          1    0.1081     0.3597       0.0903       0.7637
Claim        1    0.3219     1.2468       0.0667       0.7962
Tickets      1    −0.8535    0.5291       2.6028       0.1067
Prior        1    0.0033     0.3290       0.0001       0.9920
Gender 0     1    0.0764     0.1338       0.3260       0.5680
Attorney 0   1    −1.6429    0.4107       15.9989      <0.0001



The outputs obtained were all between 0 and 1, but the maximum was 0.060848. The division between the 60th and 61st largest training values was 0.028. Using this cutoff, the coincidence matrix shown in Table 6.10 was obtained.


Table 6.10 Coincidence matrix—Logistic regression of insurance fraud data

Actual   Model fraud   Model OK   Total
Fraud    5             17         22
OK       16            962        978
Totals   21            979        1,000



The correct classification rate is up very slightly, to 0.967. The cost of error here is $500 × 16 + $2,500 × 17, or $50,500.

The Job Application dataset involves 500 cases. Since there are four distinct outcomes, discriminant analysis is appropriate. (Cluster analysis might have been used to identify these four outcomes in the first place.) We will use 250 cases for training and test on the remaining 250. Excel turns out to be very easy to use for the distance calculations. The first step is to convert the data to a 0–1 scale (see Table 6.11).


Table 6.11 Data conversion to 0–1 scale

Variable     Range              Value
Age          <20                0
             20–50              (Age−20)/30
             >50                1.0
State        CA                 1.0
             Rest               0
Degree       Cert               0
             UG                 0.5
             Rest               1.0
Major        IS                 1.0
             Csci, Engr, Sci    0.9
             BusAd              0.7
             Other              0.5
             None               0
Experience   Years/5, max 1



In Excel, we place the outcome variable to the left of the five columns of converted data, so that we can sort the 250 training observations on outcome. This simplifies the calculation of averages for each of the five variables by each of the four outcomes. Table 6.12 provides that information.


Table 6.12 Average transformed variable values for job applicant data

Outcome        Age        State      Degree     Major      Experience
Unacceptable   0.156322   0.137931   0.241379   0.186207   0.475862
Minimal        0.232068   0.303797   0.594937   0.517722   0.772152
Adequate       0.292346   0.237037   0.707407   0.833333   0.903704
Excellent      0.338095   0.285714   0.571429   0.985714   0.942857



The distance of each of the 250 test observations to these averages was measured using the squared distance metric. Observation 251 was a 28-year-old applicant from Utah with a professional certification (no major) and 6 years of experience (actual outcome Minimal). First, the data needs to be transformed: age 28 is 8 years above the minimum of 20, yielding a transformed value of 0.267; the transformed state, degree, and major values are all 0; and 6 years of experience transforms to a value of 1.0. The distance calculation is shown in Table 6.13.


Table 6.13 Calculation of squared distances

Average        Age              State        Degree       Major        Experience   Total
Unacceptable   (0.267−0.156)²   (0−0.138)²   (0−0.241)²   (0−0.186)²   (1−0.476)²
               0.012176         0.019025     0.058264     0.034673     0.274721     0.398859
Minimal        (0.267−0.232)²   (0−0.304)²   (0−0.594)²   (0−0.518)²   (1−0.772)²
               0.001197         0.092293     0.35395      0.268036     0.051915     0.76739
Adequate       (0.267−0.292)²   (0−0.237)²   (0−0.707)²   (0−0.833)²   (1−0.904)²
               0.000659         0.056187     0.500425     0.694444     0.009273     1.260989
Excellent      (0.267−0.338)²   (0−0.286)²   (0−0.571)²   (0−0.986)²   (1−0.943)²
               0.005102         0.081633     0.326531     0.971633     0.003265     1.388163



The minimum sum of squared distances is to the Unacceptable group average, so this observation is classified as Unacceptable, even though its actual outcome was Minimal. Table 6.14 shows the coincidence matrix for all 250 test cases.


Table 6.14 Coincidence matrix for job applicant matching model using the squared error distance

Actual         Unacceptable   Minimal   Adequate   Excellent   Total
Unacceptable   19             5         6          0           30
Minimal        28             14        33         1           76
Adequate       2              16        73         37          128
Excellent      0              0         3          13          16
Total          49             35        115        51          250



This metric correctly classified 119 of the 250 test cases, for a correct classification rate of 0.476. It was quite good at predicting the extreme cases.
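The transformation and distance calculation of Tables 6.11 through 6.13 can be reproduced with a few lines of R, shown below for the example applicant (observation 251):

obs <- c(Age = (28 - 20) / 30, State = 0, Degree = 0, Major = 0, Experience = min(1, 6 / 5))
means <- rbind(Unacceptable = c(0.156322, 0.137931, 0.241379, 0.186207, 0.475862),
               Minimal      = c(0.232068, 0.303797, 0.594937, 0.517722, 0.772152),
               Adequate     = c(0.292346, 0.237037, 0.707407, 0.833333, 0.903704),
               Excellent    = c(0.338095, 0.285714, 0.571429, 0.985714, 0.942857))
colnames(means) <- c("Age", "State", "Degree", "Major", "Experience")
sqdist <- apply(means, 1, function(m) sum((obs - m)^2))   # squared distance to each group average
round(sqdist, 6)             # totals match Table 6.13 to rounding
names(which.min(sqdist))     # "Unacceptable" -- the nearest group average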

R—logistic

We select the file LoanRaw400.csv, as shown in Figure 6.1. We deselect the intermediate variables Assets, Debts, and Want, which were used to generate the risk rating, and we unclick the Partition box because we are supplying our own test set. Then, click on the Execute tab.


Figure 6.1 Loading data for loan regression

R (through the Rattle interface) has the full slate of data mining algorithms, including regression models. They are accessed under the Model tab, as seen in Figure 6.2, where you need to select the Linear button. Given that the data has a categorical outcome, the Logistic button is automatically selected.


Figure 6.2 Model tab in R

Selecting Execute yields Figure 6.3.


Figure 6.3 Logistic regression output for loan data from R

The model for a logistic regression has b coefficients for continuous variables, which are multiplied by the variable values. For categorical variables, the intercept contains the contribution for the case of amber credit and high risk, which is adjusted if other categories are present. Significance is interpreted as in linear regression. The calculation for the dependent variable is on a logistic scale, which the software adjusts.
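Behind the Rattle interface this is an ordinary R glm call. A minimal command-line sketch is given below; it assumes the file has the column names shown in Table 6.16 and a categorical Outcome column, which may differ from the actual file.

loans <- read.csv("LoanRaw400.csv")
loans <- loans[, !(names(loans) %in% c("Assets", "Debts", "Want"))]   # drop intermediate variables
logit <- glm(factor(Outcome) ~ ., data = loans, family = binomial)    # logistic regression
summary(logit)   # b coefficients; categorical variables enter as adjustments to the intercept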

Evaluation of the model can be conducted by applying it to a test set (in our case, 250 observations held out from the total dataset of 650, the other 400 having been used for training). Figure 6.4 shows linking the test dataset. An alternative is to use Rattle's default by leaving the Partition box selected on the Data page (70% training, 15% validation, 15% testing).

Click on the Evaluate tab and select CSV File, which makes the Document Link tab available. Click on that to search for the test file on your hard drive. Execute yields Table 6.15.


Table 6.15 Coincidence matrix for R logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        226            4                   230
Actual problem   18             2                   20
Totals           244            6                   250



This coincidence matrix indicates an overall correct classification rate of 0.912 over the independent test set. Of the cases predicted OK, 0.926 (226 of 244) were actually OK, while of the cases predicted Problem, only 0.333 (2 of 6) actually were Problem cases. Thus, this model must be considered quite poor at identifying Problem cases. The model can be applied to new cases if there is a file containing them. Figure 6.5 shows the Rattle screen. Note that you can select Class (for a categorical prediction) or Probability (for a continuous one).


Figure 6.5 Rattle screen to apply model to new cases

The Outcome column in this file can contain “?”. Table 6.16 contains this data, with model predictions in probability form (0 indicating OK, 1 indicating a problem loan).


Table 6.16 LoanRawNew.csv

Age   Income   Assets    Debts     Want    Credit   Risk     Outcome
55    75,000   80,605    90,507    3,000   Amber    High     0.200
30    23,506   22,300    18,506    2,200   Amber    Low      0.099
48    48,912   72,507    123,541   7,600   Red      High     0.454
22    8,106    0         1,205     800     Red      High     0.627
31    28,571   20,136    30,625    2,500   Amber    High     0.351
36    61,322   108,610   80,542    6,654   Green    Low      0.012
41    70,486   150,375   120,523   5,863   Green    Low      0.011
22    22,400   32,512    12,521    3,652   Green    Low      0.022
25    27,908   12,582    8,654     4,003   Amber    Medium   0.094
28    41,602   18,366    12,587    2,875   Green    Low      0.017



Applying the logistic regression model from R to this data classified the third and fourth rows as problem loans and the rest as OK.
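Continuing the command-line sketch from earlier in this section, the test-set evaluation and the scoring of the new cases could be done roughly as follows; the test file name is hypothetical, and the OK/Problem labels and the factor level modeled by the probability are assumptions to be checked against the actual data.

test <- read.csv("LoanTest250.csv")        # hypothetical name for the 250 held-out observations
p <- predict(logit, newdata = test, type = "response")   # probability of the second Outcome level
pred <- ifelse(p >= 0.5, "Problem", "OK")  # 0.5 cutoff; Rattle's internal choice may differ
table(Actual = test$Outcome, Predicted = pred)           # coincidence matrix as in Table 6.15
newcases <- read.csv("LoanRawNew.csv")                   # the 10 new cases of Table 6.16
round(predict(logit, newdata = newcases, type = "response"), 3)   # probabilities; higher = problem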

KNIME

A File Reader is used to link LoanRaw400.csv from the hard drive. This is Configured and Executed. A Logistic Regression Learner is dragged to the workflow and connected to the File Reader. In the Configure operation, the intermediate variables Assets, Debts, and Want are removed, as indicated in Figure 6.6.


Figure 6.6 Regression learner for loan logistic regression

Executing the Regression Learner yields Figure 6.7, the logistic regression model, providing b coefficients as we obtained with R (although the values are notably different).


Figure 6.7 Logistic regression model from KNIME

We show the entire workflow for KNIME in Figure 6.8 to better explain the subsequent links and functions.


Figure 6.8 KNIME workflow for loan data

A File Reader in node 4 is used to link the test data (250 observations) from the hard drive. A Regression Predictor (node 3) is used to apply the model from Logistic Regression Learner to the test data. After the nodes are linked, and all Executed, the output is linked to a Scorer (node 5) to get a coincidence matrix, in this case, shown in Table 6.17.


Table 6.17 Coincidence matrix from KNIME for the loan logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        224            6                   230
Actual problem   17             3                   20
Totals           241            9                   250



Here, the relative accuracy is 227/250, or 0.908, essentially the same as the 0.912 obtained from the (different) logistic regression model built with R. We add another File Reader to apply the model to the 10 new cases (from the file LoanRawNew.csv on the hard drive), which is linked to another Regression Predictor (node 7), which in turn is linked to an Interactive Table showing the predictions for the 10 new cases (shown in Figure 6.9).


Figure 6.9 KNIME logistic regression loan new cases

In Figure 6.9, you can see that the KNIME logistic regression predicted a problem only for case 4 (row 3), as opposed to cases 3 and 4 from the R logistic regression.

WEKA

The same operation can be conducted with WEKA, beginning with opening the dataset (Figure 6.10). We use all 650 observations, as it is difficult to get WEKA to read test sets.


Figure 6.10 WEKA data loading for the loan dataset

Here, we remove the intermediate variables Assets, Debts, and Want, as we did with the other software. Modeling can be accomplished by selecting Classify, followed by Functions, and then Logistic Regression (which only works with a categorical output variable). This yields Figure 6.11.


Figure 6.11 WEKA logistic regression screen

We can use Cross-validation (10-fold is a good option) or a Percentage split. (In theory, you can use Supplied test set, but that is what WEKA has trouble reading.) We choose 62 percent for the split to get close to the 400-observation training set we used with R and KNIME. The model output is shown in Table 6.18.


Table 6.18 WEKA logistic regression model

Variable        Logistic coefficient
Age             −0.1407
Income          0
Credit=red      −1.1046
Credit=green    0.9042
Credit=amber    −0.4965
Risk=high       −0.7171
Risk=medium     0.3647
Risk=low        0.6593
Intercept       5.0063



Table 6.19 shows the resulting coincidence matrix.


Table 6.19 Coincidence matrix from WEKA for loan logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        227            1                   228
Actual problem   19             0                   19
Totals           246            1                   247



This model actually has a slightly (very slightly) better fit than the R and KNIME models, with a correct classification rate of 0.919.
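As a sketch of how the Table 6.18 coefficients are applied, the score for a single case is the intercept plus the age and income terms plus the coefficients for that case's credit and risk categories, passed through the logistic function. Which class the resulting probability refers to is an assumption here (we take it to be OK, as the coefficient signs suggest) and should be checked against WEKA's output.

age <- 31                                  # an illustrative case (Table 6.16, row 5)
income <- 28571
credit_coef <- c(red = -1.1046, green = 0.9042, amber = -0.4965)
risk_coef   <- c(high = -0.7171, medium = 0.3647, low = 0.6593)
score <- 5.0063 - 0.1407 * age + 0 * income + credit_coef["amber"] + risk_coef["high"]
prob <- 1 / (1 + exp(-score))              # assumed here to be the probability of the OK class
prob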

To apply the model to new cases, you can add the new cases to the input data file. In this case, we added the 10 cases given in Table 6.16 to the LoanRaw.csv file, saving under the name LoanRawNew.csv. We read this into WEKA, as shown in Figure 6.12.


Figure 6.12 Logistic regression with WEKA—predicting new cases

Note that we selected Use training set. Clicking on More options …, we obtain the screen shown in Figure 6.13, and select Output predictions.


Figure 6.13 More options

Then, the logistic regression model is run, as in Figure 6.14.


Figure 6.14 WEKA logistic regression for prediction

Figure 6.14 includes the full model and a confusion matrix, but these are distorted by the inclusion of the 10 new cases and are not the purpose of this output. The purpose is to read the predictions, displayed (after the 650 data points of the original full dataset) as rows 651 through 660. The dataset had “?” values entered for these outcomes. The Prediction column shows that the fourth of these cases was categorized as a Problem and the other nine as OK, which matches the KNIME predictions; the R logistic regression additionally flagged the third case.

Summary

Regression models have been widely used in classical modeling. They continue to be very useful in data mining environments, which differ primarily in the scale of observations and the number of variables used. Classical regression (usually OLS) can be applied to continuous data. If the output variables (or input variables) are categorical, logistic regression can be applied. Regression can also be applied to identify a discriminant function, separating observations into groups. If this is done, the cutoff limits to separate the observations based on the discriminant function score need to be identified. While discriminant analysis can be applied to multiple groups, it is much more complicated if there are more than two groups. Thus, other discriminant methods, such as the centroid method demonstrated in this chapter, are often used.

Regression can be applied with conventional software such as SAS, SPSS, or Excel. Additionally, there are many refinements to regression that can be accessed, such as stepwise linear regression. Stepwise regression uses partial correlations to select entering independent variables iteratively, providing some degree of automatic machine development of a regression model. Stepwise regression has its proponents and opponents, but it is a form of machine learning.
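As an illustration, R's built-in step function performs this kind of iterative selection (using the AIC criterion rather than partial correlations) on a synthetic example:

set.seed(4)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1 + 2 * d$x1 + rnorm(80)                  # only x1 truly matters
full <- lm(y ~ x1 + x2 + x3, data = d)
step(full, direction = "both")                   # iteratively adds and drops terms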
