Chapter 6

Regression Algorithms in Data Mining

Regression is a basic statistical tool. In data mining, it is one of the fundamental tools for analysis, used in classification applications through logistic regression and discriminant analysis, as well as in the prediction of continuous data through ordinary least squares (OLS) and other forms. As such, regression is often taught in one (or more) three-hour courses, and we cannot hope to cover all of its basics. Here, we present ways in which regression is used within the context of data mining.

Regression is used on a variety of data types. If the data form a time series, the output of regression models is often used for forecasting; for other types of data, regression can be used to build predictive models. Regression can be applied in a number of different forms, and the class of regression models is a major class of tools available to support the Modeling phase of the data mining process.

Probably the most widely used data mining algorithms are data-fitting methods, in the sense of regression. Regression is a fundamental tool of statistical analysis used to characterize relationships between a dependent variable and one or more independent variables. Regression models can be used for many purposes, including explanation and prediction. Linear and logistic regression models are primary tools in most general-purpose data mining software. Nonlinear data can sometimes be transformed into useful linear data and analyzed with linear regression, and some special forms of nonlinear regression also exist. Neural network models are widely used for the same classes of problems. Both regression and neural network models require that data be expressed numerically (or at least as 0–1 dummy variables). The primary operational difference between regression and neural networks is that regression provides a formula with a strong body of theory behind it for application and interpretation. Neural networks generally do not provide models for interpretation and are usually applied internally within the software that built the model. In this sense, neural networks appear to users as “black boxes” that classify or predict without explanation. There are, of course, models behind these classifications and predictions, but they tend to be so complex that they are neither printed out nor analyzed.

Regression Models

OLS regression is a model of the form:

Y = b0 + b1X1 + b2X2 + … + bnXn + e

where Y is the dependent variable (the one being forecast)

  Xn are the n independent (explanatory) variables

  b0 is the intercept term

  bn are the n coefficients for the independent variables

  e is the error term

OLS regression fits a straight line (with intercept b0 and slope coefficients bn) that minimizes the sum of squared error terms ei over all i observations. The idea is to use past data to determine the b coefficients that fit best. The model then gives the most likely value of the dependent variable, given knowledge of the Xn for future observations. This approach assumes a linear relationship and error terms that are normally distributed around zero without patterns. While these assumptions are often unrealistic, regression is highly attractive because of the existence of widely available computer packages as well as highly developed statistical theory. The statistical packages provide the probability that estimated parameters differ from zero.
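To make this concrete, the following is a minimal sketch of fitting such a model in R (which we use later in this chapter); the data frame here is synthetic and purely illustrative, not one of the chapter datasets.

set.seed(1)
dat <- data.frame(X1 = rnorm(50), X2 = rnorm(50))
dat$Y <- 3 + 2 * dat$X1 - 1 * dat$X2 + rnorm(50)     # a known linear relationship plus noise
fit <- lm(Y ~ X1 + X2, data = dat)                   # estimate b0, b1, b2 by least squares
coef(fit)                                            # fitted intercept and slope coefficients
summary(fit)                                         # standard errors and p-values for each coefficient
predict(fit, newdata = data.frame(X1 = 1, X2 = 0))   # most likely Y for a new observation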

Classical Tests of the Regression Model

The universal test for classification in data mining is the coincidence matrix, which focuses on the ability of the model to categorize the data. For continuous regression output, this requires identification of cutoffs between the classes. Data mining software does not do that automatically, but we will demonstrate how it can be done. There are many other aspects to accuracy, just as there are many applications of different models, especially regression. The classical tests of regression models are based on the assumption that errors are normally distributed around a mean of zero, with no patterns. The basis of regression accuracy is the residuals, the differences between predicted and observed values. Residuals are then extended to a general measure of regression fit, R-squared.
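Continuing the illustrative fit object from the sketch above, the residuals and R-squared are available directly from the fitted model:

resid(fit)                      # residuals: observed Y minus predicted Y
summary(fit)$r.squared          # R-squared: share of variance in Y explained by the model
plot(fitted(fit), resid(fit))   # residual plot; patterns here would violate the assumptions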

Linear Discriminant Analysis

Discriminant analysis groups objects defined by a set of variables into a predetermined set of outcome classes. One example of this type of analysis is the classification of employees by their rated performance within an organization. The bank loan example could be divided into past cases sorted by the two distinct categories of repayment or default. The technical analysis is, thus, determining the combination of variables that best predict membership in one of the given output categories.

A number of methods can be used for discriminant analysis, and regression is one of them. For the two-group case, this requires a cutoff between the groups: if a new observation yields a functional value below the cutoff, the prediction is one group; if the value is above the cutoff, the prediction is the other group. Other techniques can also be used for discriminant analysis.1 A discriminant function can be applied to binary data to separate observations into two groups, with a cutoff limit used to divide the observations.
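A minimal sketch of this cutoff idea, again with synthetic data and an arbitrary cutoff of 0.5, is:

set.seed(2)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$group <- ifelse(train$x1 + train$x2 + rnorm(100) > 0, 1, 0)   # two groups coded 0 and 1
dfit <- lm(group ~ x1 + x2, data = train)      # discriminant function obtained by OLS
cutoff <- 0.5                                  # a cutoff chosen between the two groups
scores <- predict(dfit, newdata = train)
predicted <- ifelse(scores > cutoff, 1, 0)     # above the cutoff -> group 1, below -> group 0
table(Actual = train$group, Predicted = predicted)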

Logistic Regression

Some data of interest in a regression study may be ordinal or nominal. For instance, in our example job application model, sex and college degree are nominal. In the loan application data, the outcome is nominal, while credit rating is ordinal. Since regression analysis requires numerical data, we included them by coding the variables. Here, each of these variables is dichotomous; therefore, we can code them as either 0 or 1 (as we did in the regression model for loan applicants). For example, a male is assigned a code of 0, while a female is assigned a code of 1. The employees with a college degree can be assigned a code of 1, and those without a degree a code of 0.

The purpose of logistic regression is to classify cases into the most likely category. Logistic regression provides a set of b parameters for the intercept (or intercepts, in the case of ordinal data with more than two categories) and the independent variables, which can be applied to a logistic function to estimate the probability of belonging to a specified output class. The formula for the probability that case i belongs to a stated class j is:

P(case i in class j) = 1 / (1 + e^−(b0 + b1X1 + b2X2 + … + bnXn))

where the b coefficients are obtained from logistic regression.

The regression model provides a continuous formula. A cutoff needs to be determined for the value obtained from this formula, given the independent variable values, so that the data are divided into output categories in proportion to the population of cases.
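A sketch of how such a model is applied follows; the coefficients and cases are illustrative only (not from this chapter), and the cutoff is chosen here by a quantile so that the proportion flagged matches an assumed population proportion.

b <- c(b0 = -1.0, b1 = 0.8, b2 = -0.5)                          # illustrative coefficients only
x <- data.frame(X1 = c(0.2, 1.5, -0.3), X2 = c(1.0, 0.4, 2.1))  # illustrative cases
score <- b["b0"] + b["b1"] * x$X1 + b["b2"] * x$X2              # linear part of the model
prob <- 1 / (1 + exp(-score))                                   # logistic probability of the stated class
cutoff <- quantile(prob, 0.8)                  # e.g., flag the top 20 percent of cases
ifelse(prob >= cutoff, "class j", "other")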

Chapter 4 included a set of data for insurance claims, some of which were fraudulent. The independent variables included claimant age, gender, claim amount, number of traffic tickets on record, number of prior claims, and attorney. Here, we simplify the attorney data to focus on attorney Smith, making it a 0–1 variable. Table 6.1 provides 10 observations, reflecting the data given in Table 4.10.


Table 6.1 Insurance claim training data

Age   Gender   Claim   Tickets   Prior   Attorney   Outcome
52    0        2,000   0         1       0          OK
38    0        1,800   0         0       0          OK
21    1        5,600   1         2       1          Fraud
36    1        3,800   0         1       0          OK
19    0        600     2         2       0          OK
41    0        4,200   1         2       1          Fraud
38    0        2,700   0         0       0          OK
33    1        2,500   0         1       0          Fraud
18    1        1,300   0         0       0          OK
26    0        2,600   2         0       0          OK



A logistic regression model was developed for this data using SAS, yielding the model given in Table 6.2. This report gives the model in terms of coefficients for the intercept and each variable (the Estimate column), along with the standard error of each estimate. Since the logistic regression is based on discrete values for some variables, a chi-squared test is conventional for evaluating each model coefficient (given in the Chi-Square column). The evaluation of these coefficients is easiest to understand by viewing the last column, which gives the probability of a random measure being greater than the chi-squared value. If this probability is high, the implication is that the coefficient is not very significant; if it is very low (near zero), the coefficient is significant.


Table 6.2 Logistic regression model for insurance claim data

Parameter   Estimate   Std. error   Chi-square   Pr>ChiSq
Intercept   81.624     309.3        0.0697       0.7918
Age         −2.778     10.4         0.0713       0.7894
Gender      −75.893    246.7        0.0946       0.7584
Claim       0.017      0.055        0.0959       0.7569
Tickets     −36.648    164.5        0.0496       0.8237
Prior       6.914      84.8         0.0067       0.9350
Smith?      −29.361    103.3        0.0809       0.7761



The Estimate column gives the model b coefficients. This model can be applied to the test data set given in Table 4.11 by using the probability formula given earlier. The calculations are shown in Table 6.3.


Table 6.3 Logistic regression model applied to insurance fraud test cases

Age   Gender   Claim   Tickets   Prior   Attorney   Model     Prob    Predict   Actual
23    0        2,800   1         0       0          28.958    1.0     OK        OK
31    1        1,400   0         0       0          −56.453   0.0     Fraud     OK
28    0        4,200   2         3       1          −6.261    0.002   Fraud     Fraud
19    0        2,800   0         1       0          83.632    1.0     OK        OK
41    0        1,600   0         0       0          −4.922    0.007   Fraud     OK
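As an arithmetic check on Table 6.3, the rounded coefficients of Table 6.2 can be applied directly, as sketched below in R; because the published coefficients are rounded, the resulting scores differ slightly from the Model column, which was computed from full-precision estimates.

b <- c(Intercept = 81.624, Age = -2.778, Gender = -75.893, Claim = 0.017,
       Tickets = -36.648, Prior = 6.914, Smith = -29.361)
test <- data.frame(Age     = c(23, 31, 28, 19, 41),
                   Gender  = c(0, 1, 0, 0, 0),
                   Claim   = c(2800, 1400, 4200, 2800, 1600),
                   Tickets = c(1, 0, 2, 0, 0),
                   Prior   = c(0, 0, 3, 1, 0),
                   Smith   = c(0, 0, 1, 0, 0))
score <- b["Intercept"] + as.matrix(test) %*% b[names(test)]   # linear part for each case
prob  <- 1 / (1 + exp(-score))                                 # near 1 -> OK, near 0 -> Fraud
data.frame(score = round(as.vector(score), 3), prob = round(as.vector(prob), 3))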



The coincidence matrix for this set of data is given in Table 6.4.


Table 6.4 Coincidence matrix for insurance fraud data using logistic regression

Actual   Model fraud   Model OK   Total
Fraud    1             0          1
OK       2             2          4
Totals   3             2          5 (0.60 correct)



In this case, the model identified the one actually fraudulent case, but at the expense of overpredicting fraud.

Software Demonstrations

For OLS regression, both SAS and Excel were demonstrated earlier; both obviously provide identical models. The only limitation we perceive in using Excel is that its regression is limited to 16 independent variables. Base SAS has the ability to do OLS regression (as does Excel) as well as logistic regression.

The first 4,000 observations of the insurance fraud dataset were used for training. Standardized scores (between 0 and 1) were used, although continuous data, or even categorical data, could have been used. Standardizing the data transforms it so that scale does not matter; there are reasons to do that if different variables have radically different scales. Regression results should be identical between standardized and original data (standardized data is continuous, just like the original; it is just transformed). As a check, you could run the regression on the original data and see that you get the same R-squared and t statistics, although the coefficients will be different. Categorical data, by contrast, is transformed into a form where detail is lost, so regressions over original continuous data and over categorical data will give different results.
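A minimal sketch of this 0–1 standardization (using a few values from Table 6.1 purely as an illustration) is:

scale01 <- function(x) (x - min(x)) / (max(x) - min(x))       # map a column onto the 0-1 range
raw <- data.frame(Age = c(52, 38, 21, 36), Claim = c(2000, 1800, 5600, 3800))  # values from Table 6.1
standardized <- as.data.frame(lapply(raw, scale01))
standardized      # each column now runs from 0 to 1; only the coefficients change, not the fit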

Regression Discriminant Model: For the regression model over standardized data, the result was as shown in Table 6.5:


Table 6.5 OLS regression output—Insurance fraud data

Summary output

Regression statistics
Multiple R          0.203298
R square            0.04133
Adjusted R square   0.03989
Standard error      0.119118
Observations        4,000

ANOVA
             df      SS         MS         F          Significance F
Regression   6       2.442607   0.407101   28.69096   8.81E-34
Residual     3,993   56.65739   0.014189
Total        3,999   59.1

            Coefficients   Standard error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   0.0081         0.012594         0.643158   0.520158   −0.01659    0.032792
Age         0.001804       0.005147         0.350421   0.726041   −0.00829    0.011894
Gender      −0.00207       0.003772         −0.54928   0.582843   −0.00947    0.005323
Claim       0.007607       0.0191           0.398289   0.690438   −0.02984    0.045054
Tickets     −0.0076        0.004451         −1.70738   0.08783    −0.01633    0.001127
Prior       0.000148       0.004174         0.035408   0.971756   −0.00804    0.008332
Attorney    0.201174       0.018329         10.97554   1.24E-27   0.165238    0.237109



Of course, the same model was obtained with SAS. Only the presence of an attorney (highly significant) and the number of tickets on record (marginally significant) showed any significance. The b coefficients give the discriminant function, for which a cutoff value is needed. We applied the model to the training set and sorted the results. There were 60 fraudulent cases in the training set of 4,000 observations, so a logical cutoff is the 60th largest functional value, which was 0.196197; a cutoff of 0.19615 was used for prediction. This yielded the coincidence matrix shown in Table 6.6.


Table 6.6 Coincidence matrix—OLS regression of insurance fraud test data

Actual   Model fraud   Model OK   Total
Fraud    5             17         22
OK       17            961        978
Totals   22            978        1,000



This model had a correct classification rate of 0.966, which is very good. The model applied to the test data predicted 22 fraudulent cases, and 978 not fraudulent. Of the 22 test cases that the model predicted to be fraudulent, 5 actually were. Therefore, the model would have triggered investigation of 17 cases in the test set, which were not actually fraudulent. Of the 978 test cases that the model predicted to be OK, 17 were actually fraudulent and would not have been investigated. If the cost of an investigation were $500, and the cost of loss were $2,500, this would have an expected cost of $500 × 17 + $2,500 × 17, or $51,000.
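The cutoff logic just described can be sketched as follows; the score vectors here are random stand-ins for the fitted OLS values, since the full dataset is not reproduced in this chapter, so the resulting matrix and cost are illustrative only.

set.seed(3)
train_scores <- runif(4000)            # stand-in for fitted OLS values on the training set
cutoff <- sort(train_scores, decreasing = TRUE)[60]        # 60th largest training value
test_scores <- runif(1000)             # stand-in for fitted values on the 1,000 test observations
test_actual <- sample(c("Fraud", "OK"), 1000, replace = TRUE, prob = c(0.022, 0.978))
predicted <- ifelse(test_scores >= cutoff, "Fraud", "OK")
table(Actual = test_actual, Predicted = predicted)         # coincidence matrix
cost <- 500 * sum(predicted == "Fraud" & test_actual == "OK") +    # investigations of OK claims
        2500 * sum(predicted == "OK" & test_actual == "Fraud")     # missed fraudulent claims
cost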

Centroid Discriminant Model: We can compare this model with a centroid discriminant model. The training set is used to identify the mean variable values for the two outcomes (Fraud and OK) shown in Table 6.7.


Table 6.7 Centroid discriminant function means

Cluster   Age     Gender   Claim   Tickets   Prior   Attorney
OK        0.671   0.497    0.606   0.068     0.090   0.012
Fraud     0.654   0.467    0.540   0.025     0.275   0.217



The squared distance to each of these clusters was applied on the 1,000 test observations, yielding the coincidence matrix shown in Table 6.8.


Table 6.8 Coincidence matrix—centroid discriminant model of insurance fraud test data

Actual   Model fraud   Model OK   Total
Fraud    7             15         22
OK       133           845        978
Totals   140           860        1,000



Here, the correct classification rate was 0.852, quite a bit lower than with the regression model. This model made many more errors in which claims that turned out to be OK were flagged as fraudulent, although it missed two fewer of the actually fraudulent cases. The cost of error here is $500 × 133 + $2,500 × 15, or $104,000.
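A sketch of the nearest-centroid classification, using the group means of Table 6.7 and one hypothetical standardized observation, is:

centroids <- rbind(OK    = c(0.671, 0.497, 0.606, 0.068, 0.090, 0.012),
                   Fraud = c(0.654, 0.467, 0.540, 0.025, 0.275, 0.217))
colnames(centroids) <- c("Age", "Gender", "Claim", "Tickets", "Prior", "Attorney")
obs <- c(Age = 0.60, Gender = 1, Claim = 0.55, Tickets = 0, Prior = 0.25, Attorney = 1)  # hypothetical case
sqdist <- apply(centroids, 1, function(m) sum((obs - m)^2))    # squared distance to each group mean
sqdist
names(which.min(sqdist))                                       # predicted class: the nearer centroid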

Logistic Regression Model: A logistic regression model was run in SAS. The variables Gender and Attorney were 0–1 variables, and thus categorical. The model, based on maximum likelihood estimates, is given in Table 6.9.


Table 6.9 Logistic regression model—Insurance fraud data

Parameter    DF   Estimate   Std. error   Chi-square   Pr>ChiSq
Intercept    1    −2.9821    0.7155       17.3702      <0.0001
Age          1    0.1081     0.3597       0.0903       0.7637
Claim        1    0.3219     1.2468       0.0667       0.7962
Tickets      1    −0.8535    0.5291       2.6028       0.1067
Prior        1    0.0033     0.3290       0.0001       0.9920
Gender 0     1    0.0764     0.1338       0.3260       0.5680
Attorney 0   1    −1.6429    0.4107       15.9989      <0.0001



The outputs obtained were all between 0 and 1, but the maximum was 0.060848. The division between the 60th and 61st largest training values was 0.028. Using this cutoff, the coincidence matrix shown in Table 6.10 was obtained.


Table 6.10 Coincidence matrix—Logistic regression of insurance fraud data

Actual   Model fraud   Model OK   Total
Fraud    5             17         22
OK       16            962        978
Totals   21            979        1,000



The correct classification rate is up very slightly, to 0.967. The cost of error here is $500 × 16 + $2,500 × 17, or $50,500.

The Job Application dataset involves 500 cases. Since there are four distinct outcomes, discriminant analysis is appropriate. (Cluster analysis might have been used to identify these four outcomes in the first place.) We will use 250 cases for training and test on the remaining 250. Excel turns out to be very easy to use for the distance calculations. The first step is to convert the data to a 0–1 scale (see Table 6.11).


Table 6.11 Data conversion to 0–1 scale

Variable     Range              Value
Age          <20                0
             20–50              (Age−20)/30
             >50                1.0
State        CA                 1.0
             Rest               0
Degree       Cert               0
             UG                 0.5
             Rest               1.0
Major        IS                 1.0
             Csci, Engr, Sci    0.9
             BusAd              0.7
             Other              0.5
             None               0
Experience   Years/5, max 1



In Excel, we place the outcome variable to the left of the five columns of converted data, so that we can sort the 250 training observations on outcome. This simplifies the calculation of averages for each of the five variables by each of the four outcomes. Table 6.12 provides that information.


Table 6.12 Average transformed variable values for job applicant data

Outcome        Age        State      Degree     Major      Experience
Unacceptable   0.156322   0.137931   0.241379   0.186207   0.475862
Minimal        0.232068   0.303797   0.594937   0.517722   0.772152
Adequate       0.292346   0.237037   0.707407   0.833333   0.903704
Excellent      0.338095   0.285714   0.571429   0.985714   0.942857



The distance of each of the 250 test observations to these averages was measured using the squared distance metric. Observation 251 was a 28-year-old applicant from Utah with a professional certification (no major) and 6 years of experience (actual outcome Minimal). First, the data needs to be transformed: age 28 is 8 years above the minimum of 20, yielding a transformed value of 0.267; the transformed state, degree, and major values are all 0; and 6 years of experience transforms to a value of 1.0. The distance calculation is shown in Table 6.13.


Table 6.13 Calculation of squared distances

Average        Age              State        Degree       Major        Experience   Total
Unacceptable   (0.267−0.156)²   (0−0.138)²   (0−0.241)²   (0−0.186)²   (1−0.476)²
               0.012176         0.019025     0.058264     0.034673     0.274721     0.398859
Minimal        (0.267−0.232)²   (0−0.304)²   (0−0.594)²   (0−0.518)²   (1−0.772)²
               0.001197         0.092293     0.35395      0.268036     0.051915     0.76739
Adequate       (0.267−0.292)²   (0−0.237)²   (0−0.707)²   (0−0.833)²   (1−0.904)²
               0.000659         0.056187     0.500425     0.694444     0.009273     1.260989
Excellent      (0.267−0.338)²   (0−0.286)²   (0−0.571)²   (0−0.986)²   (1−0.943)²
               0.005102         0.081633     0.326531     0.971633     0.003265     1.388163



The minimum sum of squared distances is to the Unacceptable group average, so this observation is classified as Unacceptable, even though its actual outcome was Minimal. Table 6.14 shows the coincidence matrix for all 250 test cases.


Table 6.14 Coincidence matrix for job applicant matching model using the squared error distance

Actual         Unacceptable   Minimal   Adequate   Excellent   Total
Unacceptable   19             5         6          0           30
Minimal        28             14        33         1           76
Adequate       2              16        73         37          128
Excellent      0              0         3          13          16
Total          49             35        115        51          250



This metric correctly classified 119 of the 250 test cases, for a correct classification rate of 0.476. It was quite good at predicting the extreme cases.
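The transformation and distance calculation of Tables 6.11 through 6.13 can be reproduced with a few lines of R, shown below for the example applicant (observation 251):

obs <- c(Age = (28 - 20) / 30, State = 0, Degree = 0, Major = 0, Experience = min(1, 6 / 5))
means <- rbind(Unacceptable = c(0.156322, 0.137931, 0.241379, 0.186207, 0.475862),
               Minimal      = c(0.232068, 0.303797, 0.594937, 0.517722, 0.772152),
               Adequate     = c(0.292346, 0.237037, 0.707407, 0.833333, 0.903704),
               Excellent    = c(0.338095, 0.285714, 0.571429, 0.985714, 0.942857))
colnames(means) <- c("Age", "State", "Degree", "Major", "Experience")
sqdist <- apply(means, 1, function(m) sum((obs - m)^2))   # squared distance to each group average
round(sqdist, 6)             # totals match Table 6.13 to rounding
names(which.min(sqdist))     # "Unacceptable" -- the nearest group average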

R—logistic

We select the file LoanRaw400.csv, as shown in Figure 6.1. We deselect the intermediate variables Assets, Debts, and Want, which were used to generate the risk rating, and we unclick the Partition box because we are supplying our own test set. Then, click on the Execute tab.


Figure 6.1 Loading data for loan regression

R (through the Rattle interface) has the full slate of data mining algorithms, including regression models. They are accessed under the Model tab, as seen in Figure 6.2, where you need to select the Linear button. Given that the data has a categorical outcome, the Logistic button is automatically selected.


Figure 6.2 Model tab in R

Selecting Execute yields Figure 6.3.


Figure 6.3 Logistic regression output for loan data from R

The model for a logistic regression has b coefficients for continuous variables, which are multiplied by the variable values. For categorical variables, the intercept contains the contribution for the case of amber credit and high risk, which is adjusted if other categories are present. Significance is interpreted as in linear regression. The calculation for the dependent variable is on a logistic scale, which the software adjusts.
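Behind the Rattle interface this is an ordinary R glm call. A minimal command-line sketch is given below; it assumes the file has the column names shown in Table 6.16 and a categorical Outcome column, which may differ from the actual file.

loans <- read.csv("LoanRaw400.csv")
loans <- loans[, !(names(loans) %in% c("Assets", "Debts", "Want"))]   # drop intermediate variables
logit <- glm(factor(Outcome) ~ ., data = loans, family = binomial)    # logistic regression
summary(logit)   # b coefficients; categorical variables enter as adjustments to the intercept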

Evaluation of the model can be conducted by applying it to a test set (in our case, 250 observations held out from the total dataset of 650, the other 400 having been used for training). Figure 6.4 shows linking the test dataset. An alternative is to use Rattle's default by leaving the Partition box selected on the Data page (70% training, 15% validation, 15% testing).

Click on the Evaluate tab and select CSV File, which makes the Document Link tab available. Click on that to search for the test file on your hard drive. Execute yields Table 6.15.


Table 6.15 Coincidence matrix for R logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        226            4                   230
Actual problem   18             2                   20
Totals           244            6                   250



This coincidence matrix indicates an overall correct classification rate of 0.912 over the independent test set. Of the cases predicted OK, 0.926 (226 of 244) were actually OK, while of the cases predicted Problem, only 0.333 (2 of 6) actually were Problem cases. Thus, this model must be considered quite poor at identifying Problem cases. The model can be applied to new cases if there is a file containing them. Figure 6.5 shows the Rattle screen. Note that you can select Class (for a categorical prediction) or Probability (for a continuous one).


Figure 6.5 Rattle screen to apply model to new cases

The Outcome column in this file can contain “?”. Table 6.16 contains this data, with model predictions in probability form (0 indicating OK, 1 indicating a problem loan).


Table 6.16 LoanRawNew.csv

Age   Income   Assets    Debts     Want    Credit   Risk     Outcome
55    75,000   80,605    90,507    3,000   Amber    High     0.200
30    23,506   22,300    18,506    2,200   Amber    Low      0.099
48    48,912   72,507    123,541   7,600   Red      High     0.454
22    8,106    0         1,205     800     Red      High     0.627
31    28,571   20,136    30,625    2,500   Amber    High     0.351
36    61,322   108,610   80,542    6,654   Green    Low      0.012
41    70,486   150,375   120,523   5,863   Green    Low      0.011
22    22,400   32,512    12,521    3,652   Green    Low      0.022
25    27,908   12,582    8,654     4,003   Amber    Medium   0.094
28    41,602   18,366    12,587    2,875   Green    Low      0.017



Applying the logistic regression model from R to this data classified the third and fourth rows as problem loans and the rest as OK.
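Continuing the command-line sketch from earlier in this section, the test-set evaluation and the scoring of the new cases could be done roughly as follows; the test file name is hypothetical, and the OK/Problem labels and the factor level modeled by the probability are assumptions to be checked against the actual data.

test <- read.csv("LoanTest250.csv")        # hypothetical name for the 250 held-out observations
p <- predict(logit, newdata = test, type = "response")   # probability of the second Outcome level
pred <- ifelse(p >= 0.5, "Problem", "OK")  # 0.5 cutoff; Rattle's internal choice may differ
table(Actual = test$Outcome, Predicted = pred)           # coincidence matrix as in Table 6.15
newcases <- read.csv("LoanRawNew.csv")                   # the 10 new cases of Table 6.16
round(predict(logit, newdata = newcases, type = "response"), 3)   # probabilities; higher = problem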

KNIME

A File Reader is used to link LoanRaw400.csv from the hard drive. This is Configured and Executed. A Logistic Regression Learner is dragged to the workflow and connected to the File Reader. In the Configure operation, the intermediate variables Assets, Debts, and Want are removed, as indicated in Figure 6.6.


Figure 6.6 Regression learner for loan logistic regression

Executing the Regression Learner yields Figure 6.7, the logistic regression model, providing b coefficients as we obtained with R (although the values are notably different).


Figure 6.7 Logistic regression model from KNIME

We show the entire workflow for KNIME in Figure 6.8 to better explain the subsequent links and functions.


Figure 6.8 KNIME workflow for loan data

A File Reader in node 4 is used to link the test data (250 observations) from the hard drive. A Regression Predictor (node 3) is used to apply the model from Logistic Regression Learner to the test data. After the nodes are linked, and all Executed, the output is linked to a Scorer (node 5) to get a coincidence matrix, in this case, shown in Table 6.17.


Table 6.17 Coincidence matrix from KNIME for the loan logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        224            6                   230
Actual problem   17             3                   20
Totals           241            9                   250



Here, the relative accuracy is 227/250, or 0.908, essentially the same as the 0.912 obtained from the (different) logistic regression model built with R. We add another File Reader to apply the model to the 10 new cases (from the file LoanRawNew.csv on the hard drive), which is linked to another Regression Predictor (node 7), which in turn is linked to an Interactive Table showing the predictions for the 10 new cases (shown in Figure 6.9).


Figure 6.9 KNIME logistic regression loan new cases

In Figure 6.9, you can see that the KNIME logistic regression predicted a problem only for case 4 (row 3), as opposed to cases 3 and 4 from the R logistic regression.

WEKA

The same operation can be conducted with WEKA, beginning with opening the dataset (Figure 6.10). We use all 650 observations, as it is difficult to get WEKA to read test sets.


Figure 6.10 WEKA data loading for the loan dataset

Here, we remove the intermediate variables Assets, Debts, and Want, as we did with the other software. Modeling can be accomplished by selecting Classify, followed by Functions, and then Logistic Regression (which only works with a categorical output variable). This yields Figure 6.11.


Figure 6.11 WEKA logistic regression screen

We can use Cross-validation (10-fold is a good option) or a Percentage split. (In theory, you can use Supplied test set, but that is what WEKA has trouble reading.) We choose 62 percent for the split to get close to the 400-observation training set we used with R and KNIME. The model output is shown in Table 6.18.


Table 6.18 WEKA logistic regression model

Variable        Logistic coefficient
Age             −0.1407
Income          0
Credit=red      −1.1046
Credit=green    0.9042
Credit=amber    −0.4965
Risk=high       −0.7171
Risk=medium     0.3647
Risk=low        0.6593
Intercept       5.0063



Table 6.19 shows the resulting coincidence matrix.


Table 6.19 Coincidence matrix from WEKA for loan logistic regression model

                 Predicted OK   Predicted problem   Total
Actual OK        227            1                   228
Actual problem   19             0                   19
Totals           246            1                   247



This model actually has a slightly (very slightly) better fit than the R and KNIME models, with a correct classification rate of 0.919.
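As a sketch of how the Table 6.18 coefficients are applied, the score for a single case is the intercept plus the age and income terms plus the coefficients for that case's credit and risk categories, passed through the logistic function. Which class the resulting probability refers to is an assumption here (we take it to be OK, as the coefficient signs suggest) and should be checked against WEKA's output.

age <- 31                                  # an illustrative case (Table 6.16, row 5)
income <- 28571
credit_coef <- c(red = -1.1046, green = 0.9042, amber = -0.4965)
risk_coef   <- c(high = -0.7171, medium = 0.3647, low = 0.6593)
score <- 5.0063 - 0.1407 * age + 0 * income + credit_coef["amber"] + risk_coef["high"]
prob <- 1 / (1 + exp(-score))              # assumed here to be the probability of the OK class
prob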

To apply the model to new cases, you can add the new cases to the input data file. In this case, we added the 10 cases given in Table 6.16 to the LoanRaw.csv file, saving under the name LoanRawNew.csv. We read this into WEKA, as shown in Figure 6.12.


Figure 6.12 Logistic regression with WEKA—predicting new cases

Note that we selected Use training set. Clicking on More options …, we obtain the screen shown in Figure 6.13, and select Output predictions.


Figure 6.13 More options

Then, the logistic regression model is run, as in Figure 6.14.


Figure 6.14 WEKA logistic regression for prediction

Figure 6.14 includes the full model and a confusion matrix, but these are distorted by the inclusion of the 10 new cases and are not the purpose of this output. The purpose is to read the predictions, displayed (after the 650 data points of the original full dataset) as rows 651 through 660. The dataset had “?” values entered for these outcomes. The Prediction column shows that the fourth of these cases was categorized as a Problem and the other nine as OK, which matches the KNIME predictions; the R logistic regression additionally flagged the third case.

Summary

Regression models have been widely used in classical modeling. They continue to be very useful in data mining environments, which differ primarily in the scale of observations and the number of variables used. Classical regression (usually OLS) can be applied to continuous data. If the output variables (or input variables) are categorical, logistic regression can be applied. Regression can also be applied to identify a discriminant function, separating observations into groups. If this is done, the cutoff limits to separate the observations based on the discriminant function score need to be identified. While discriminant analysis can be applied to multiple groups, it is much more complicated if there are more than two groups. Thus, other discriminant methods, such as the centroid method demonstrated in this chapter, are often used.

Regression can be applied with conventional software such as SAS, SPSS, or Excel. Additionally, there are many refinements to regression that can be accessed, such as stepwise linear regression. Stepwise regression uses partial correlations to select entering independent variables iteratively, providing some degree of automatic machine development of a regression model. Stepwise regression has its proponents and opponents, but it is a form of machine learning.
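As an illustration, R's built-in step function performs this kind of iterative selection (using the AIC criterion rather than partial correlations) on a synthetic example:

set.seed(4)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1 + 2 * d$x1 + rnorm(80)                  # only x1 truly matters
full <- lm(y ~ x1 + x2 + x3, data = d)
step(full, direction = "both")                   # iteratively adds and drops terms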
