Chapter 4

Overview of Data Mining Techniques

Data useful to business comes in many forms. For instance, an automobile insurance company, faced with millions of accident claims, realizes that not all claims are legitimate. If they are extremely tough and investigate each claim thoroughly, they will spend more money on investigation than they would pay in claims. They also will find that they are unable to sell new policies. If they are as understanding and trusting as their television ads imply, they will reduce their investigation costs to zero, but will leave themselves vulnerable to fraudulent claims. Insurance firms have developed ways to profile claims, considering many variables, to provide an early indication of cases that probably merit expending funds for investigation. This has the effect of reducing the overall policy expenses, because it discourages fraud, while minimizing the imposition on valid claims. The same approach is used by the Internal Revenue Service in processing individual tax returns. Fraud detection has become a viable data mining industry, with a large number of software vendors. This is typical of many applications of data mining.

Data mining can be conducted in many business contexts. This chapter presents four datasets that will be utilized to demonstrate the techniques covered in Part II of the book. In addition to insurance fraud, files have been generated reflecting other common business applications, such as loan evaluation and customer segmentation. The same concepts can be applied to other applications, such as employee evaluation.

We have described data mining, its process, and the data storage systems that make it possible. The next section of the book describes the data mining methods. Data mining tools have been classified by the tasks of classification, estimation, clustering, and summarization. Classification and estimation are predictive; clustering and summarization are descriptive. Not all methods will be presented, but the most commonly used ones will be. We demonstrate each of these methods with small example datasets intended to show how the methods work. We do not intend to give the impression that these datasets are anywhere near the scale of real data mining applications, but they do represent micro versions of real applications and are much more convenient for demonstrating concepts.

Data Mining Models

Data mining uses a variety of modeling tools for a variety of purposes. Various authors have surveyed these purposes and the tools available for them (see Table 4.1). The methods come from both classical statistics and artificial intelligence. Statistical techniques offer strong diagnostic tools that can be used to develop confidence intervals on parameter estimates, conduct hypothesis tests, and more. Artificial intelligence techniques require fewer assumptions about the data and are generally more automatic.


Table 4.1 Data mining modeling tools

| Algorithms | Functions | Basis | Task |
|---|---|---|---|
| Cluster detection | Cluster analysis | Statistics | Classification |
| Regression | Linear regression | Statistics | Prediction |
| | Logistic regression | Statistics | Classification |
| | Discriminant analysis | Statistics | Classification |
| Neural networks | Neural networks | AI | Classification |
| | Kohonen nets | AI | Cluster |
| Decision trees | Association rules | AI | Classification |
| Rule induction | Association rules | AI | Description |
| | Link analysis | | Description |
| Query tools | | | Description |
| Descriptive statistics | | Statistics | Description |
| Visualization tools | | Statistics | Description |



Regression comes in a variety of forms, including ordinary least squares regression, logistic regression (widely used in data mining when outcomes are binary), and discriminant analysis (used when outcomes are categorical and predetermined).
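As a rough sketch of how logistic regression handles a binary outcome, the following fits a one-variable model by gradient ascent on the log-likelihood. The data, learning rate, and iteration count are invented for illustration only; real data mining software automates these choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented (x, y) pairs: y = 1 roughly when x is large
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

b0, b1 = 0.0, 0.0            # intercept and slope
rate = 0.1
for _ in range(5000):        # gradient ascent on the log-likelihood
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data)
    b0 += rate * g0
    b1 += rate * g1

# Predicted probability that a new case with x = 3 belongs to class 1
print(sigmoid(b0 + b1 * 3.0) > 0.5)   # True
```

The fitted model classifies a new case by whether its predicted probability exceeds 0.5, which is the sense in which logistic regression performs classification rather than numeric prediction.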

The point of data mining is to have a variety of tools available to assist the analyst and user in better understanding what the data consist of. Each method does something different, and usually this implies that a specific problem is best treated with a particular algorithm type. However, sometimes different algorithm types can be used for the same problem. Most methods involve setting parameters, which can be important to their effectiveness. Further, the output needs to be interpreted.

There are a number of overlaps. Cluster analysis helps data miners visualize relationships among customer purchases and is supported by visualization techniques that provide a different perspective. Link analysis helps identify connections between entities, often displayed through graphs as a means of visualization. An example of a link analysis application is in telephony, where each call is represented by a link between the caller and the receiver. Another example of linkage is physician referral patterns. A patient may visit a regular doctor, who detects something outside his or her expertise and turns to a network of acquaintances to identify a reliable specialist who knows more. Clinics are collections of physician specialists and might be referred to for especially difficult cases.
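The referral-linkage idea can be illustrated with a toy graph. The names and the adjacency-list representation are invented for demonstration; link analysis tools use far richer structures, but the underlying idea of tracing chains of connections is the same.

```python
from collections import deque

# Hypothetical referral graph: generalists refer patients onward
referrals = {
    "Dr. Adams": ["Dr. Baker", "Dr. Chen"],
    "Dr. Baker": ["City Clinic"],
    "Dr. Chen":  ["City Clinic"],
    "City Clinic": [],
}

def reachable(graph, start):
    """All nodes connected to start by a chain of referrals (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(reachable(referrals, "Dr. Adams")))
# ['City Clinic', 'Dr. Baker', 'Dr. Chen']
```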

Data Mining Perspectives

Methods can be viewed from different perspectives. From the perspective of statistics and operations research, data mining methods include:

  • Cluster analysis
  • Regression of various forms
  • Discriminant analysis (use of linear regression for classification)
  • Line fitting through the operations research tool of multiple objective linear programming

From the perspective of artificial intelligence, these methods include:

  • Neural networks (best fit methods)
  • Rule induction (decision trees)
  • Genetic algorithms (often used to supplement other methods)

Regression and neural network approaches are both best-fit methods and are often applied to the same problems. Regression tends to have advantages with linear data, while neural network models do very well with irregular data. Software usually allows the user to apply variants of each and lets the analyst select the model that fits best. Cluster analysis, discriminant analysis, and case-based reasoning seek to assign new cases to the closest cluster of past observations. Rule induction is the basis of decision tree methods of data mining. Genetic algorithms apply to special forms of data and are often used to boost or improve the operation of other techniques.

The ability of some of these techniques to deal with the common data mining characteristics is compared in Table 4.2.


Table 4.2 General ability of data mining techniques to deal with data features

| Data characteristic | Rule induction | Neural networks | Case-based reasoning | Genetic algorithms |
|---|---|---|---|---|
| Handle noisy data | Good | Very good | Good | Very good |
| Handle missing data | Good | Good | Very good | Good |
| Process large datasets | Very good | Poor | Good | Good |
| Process different data types | Good | Transform to numerical | Very good | Transformation needed |
| Predictive accuracy | High | Very high | High | High |
| Explanation capability | Very good | Poor | Very good | Good |
| Ease of integration | Good | Good | Good | Very good |
| Ease of operation | Easy | Difficult | Easy | Difficult |



Table 4.2 demonstrates that there are different tools for different types of problems. If the data are especially noisy, this can lead to difficulties for classical statistical methods such as regression, cluster analysis, and discriminant analysis. Rule induction and case-based reasoning can deal with such problems, but if the noise consists of false information, the resulting rules can reach wrong conclusions. Neural networks and genetic algorithms have proven useful relative to the classical methods in environments where the data are complex, including nonlinear interactions among variables.

Neural networks have relative disadvantages in dealing with very large numbers of variables, as their computational complexity increases dramatically. Genetic algorithms require a specific data structure in order to operate, and it is not always easy to transform data to meet this requirement.

Another negative feature of neural networks is their hidden nature. Due to the large number of node connections, it is impractical to print out and analyze a large neural network model. This makes it difficult to transport a model built on one system to another system. Therefore, new data must be entered in the system where the neural network model was built in order to apply it to the new cases. This makes it nearly impossible to apply neural network models outside of the system upon which they are built.

Data Mining Functions

Problem types can be described in four categories:

  • Association identifies the rules that determine the relationships among entities, such as in market basket analysis, or the association of symptoms with diseases.
  • Prediction identifies the key attributes from data to develop a formula for prediction of future cases, as in regression models.
  • Classification uses a training dataset to identify classes or clusters, which then are used to categorize data. Typical applications include categorizing risk and return characteristics of investments and credit risk of loan applicants.
  • Detection determines the anomalies and irregularities, valuable in fraud detection.

Table 4.3 compares the common techniques and applications by business area.


Table 4.3 Data mining applications by method

| Area | Technique | Application | Problem type |
|---|---|---|---|
| Finance | Neural network | Forecast stock price | Prediction |
| | Neural network, rule induction | Forecast bankruptcy | Prediction |
| | | Forecast price index futures | Prediction |
| | | Fraud detection | Detection |
| | Neural network, case-based reasoning | Forecast interest rates | Prediction |
| | Neural network, visualization | Delinquent bank loan detection | Detection |
| | Rule induction | Forecast defaulting loans | Prediction |
| | | Credit assessment | Prediction |
| | | Portfolio management | Prediction |
| | | Risk classification | Classification |
| | | Financial customer classification | Classification |
| | Rule induction, case-based reasoning | Corporate bond rating | Prediction |
| | Rule induction, visualization | Loan approval | Prediction |
| Telecom | Neural network, rule induction | Forecast network behavior | Prediction |
| | Rule induction | Churn management | Classification |
| | | Fraud detection | Detection |
| | Case-based reasoning | Call tracking | Classification |
| Marketing | Rule induction | Market segmentation | Classification |
| | | Cross-selling improvement | Association |
| | Rule induction, visualization | Lifestyle behavior analysis | Classification |
| | | Product performance analysis | Association |
| | Rule induction, genetic algorithm, visualization | Customer reaction to promotion | Prediction |
| | Case-based reasoning | Online sales support | Classification |
| Web | Rule induction, visualization | User browsing similarity analysis | Classification, association |
| | Rule-based heuristics | Web page content similarity | Association |
| Others | Neural network | Software cost estimation | Detection |
| | Neural network, rule induction | Litigation assessment | Prediction |
| | Rule induction | Insurance fraud detection | Detection |
| | | Healthcare exception reporting | Detection |
| | Case-based reasoning | Insurance claim estimation | Prediction |
| | | Software quality control | Classification |
| | Genetic algorithms | Budget expenditure | Classification |



Many of these applications combine techniques, including visualization and statistical analysis. The point is that many data mining tools are available for a variety of functional purposes, spanning almost every area of human endeavor (including business). This section of the book seeks to demonstrate how these primary data mining tools work.

Demonstration Datasets

We will use some simple models to demonstrate the concepts. These datasets were generated by the authors to reflect important business applications. The first model includes loan applicants, with 20 observations for building the model and 10 applicants serving as a test dataset. The second dataset represents job applicants: 10 observations with known outcomes serve as the training set, with 5 additional cases in the test set. A third dataset of insurance claims has 10 known outcomes for training and 5 observations in the test set. Models built on all three datasets will then be applied to new cases.

Larger datasets for each of these three cases will be provided as well as a dataset on expenditure data. These larger datasets will be used in various chapters to demonstrate methods.

Loan Analysis Data

This dataset (Table 4.4) consists of information on applicants for appliance loans. The full dataset involves 650 past observations. Applicant information on age, income, assets, debts, and credit rating (from a credit bureau, with red for bad credit, yellow for some credit problems, and green for a clean credit record) is assumed available from loan applications. Variable Want is the amount requested in the appliance loan application. For past observations, variable On-time is 1 if all payments were received on time and 0 if not (Late or Default). The majority of past loans were paid on time. Data were transformed to obtain categorical data for some of the techniques. Age was grouped as less than 30 (young), 60 or over (old), and in between (middle-aged). Income was grouped as $30,000 per year or less (low income), $80,000 per year or more (high income), and average in between. Assets, debts, and loan amount (variable Want) are used by rule to generate the categorical variable Risk: High if debts exceed assets, Low if assets exceed the sum of debts plus the amount requested, and Average in between.
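The grouping rules above can be sketched directly in code. The function names and record layout are illustrative assumptions; the thresholds are those stated in the text.

```python
def age_group(age):
    """Age categories from the text: under 30 young, 60 or over old."""
    if age < 30:
        return "young"
    if age >= 60:
        return "old"
    return "middle"

def income_group(income):
    """Income categories: $30,000 or less low, $80,000 or more high."""
    if income <= 30000:
        return "low"
    if income >= 80000:
        return "high"
    return "average"

def risk_group(assets, debts, want):
    """Risk rule: High if debts exceed assets, Low if assets cover debts plus the request."""
    if debts > assets:
        return "High"
    if assets > debts + want:
        return "Low"
    return "Average"

# First training record from Table 4.4: age 20, income 17,152,
# assets 11,090, debts 20,455, want 400
print(age_group(20), income_group(17152), risk_group(11090, 20455, 400))
# young low High
```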


Table 4.4 Loan analysis training dataset

| Age | Income | Assets | Debts | Want | Risk | Credit | Result |
|---|---|---|---|---|---|---|---|
| 20 (young) | 17,152 (low) | 11,090 | 20,455 | 400 | High | Green | On-time |
| 23 (young) | 25,862 (low) | 24,756 | 30,083 | 2,300 | High | Green | On-time |
| 28 (young) | 26,169 (low) | 47,355 | 49,341 | 3,100 | High | Yellow | Late |
| 23 (young) | 21,117 (low) | 21,242 | 30,278 | 300 | High | Red | Default |
| 22 (young) | 7,127 (low) | 23,903 | 17,231 | 900 | Low | Yellow | On-time |
| 26 (young) | 42,083 (average) | 35,726 | 41,421 | 300 | High | Red | Late |
| 24 (young) | 55,557 (average) | 27,040 | 48,191 | 1,500 | High | Green | On-time |
| 27 (young) | 34,843 (average) | 0 | 21,031 | 2,100 | High | Red | On-time |
| 29 (young) | 74,295 (average) | 88,827 | 100,599 | 100 | High | Yellow | On-time |
| 23 (young) | 38,887 (average) | 6,260 | 33,635 | 9,400 | Low | Green | On-time |
| 28 (young) | 31,758 (average) | 58,492 | 49,268 | 1,000 | Low | Green | On-time |
| 25 (young) | 80,180 (high) | 31,696 | 69,529 | 1,000 | High | Green | Late |
| 33 (middle) | 40,921 (average) | 91,111 | 90,076 | 2,900 | Average | Yellow | Late |
| 36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300 | Low | Green | On-time |
| 39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600 | Low | Green | On-time |
| 39 (middle) | 125,713 (high) | 382,180 | 315,396 | 5,200 | Low | Yellow | On-time |
| 55 (middle) | 80,149 (high) | 511,937 | 21,923 | 1,000 | Low | Green | On-time |
| 62 (old) | 101,291 (high) | 783,164 | 23,052 | 1,800 | Low | Green | On-time |
| 71 (old) | 81,723 (high) | 776,344 | 20,277 | 900 | Low | Green | On-time |
| 63 (old) | 99,522 (high) | 783,491 | 24,643 | 200 | Low | Green | On-time |



Table 4.5 gives a test set of data.


Table 4.5 Loan analysis test data

| Age | Income | Assets | Debts | Want | Risk | Credit | Result |
|---|---|---|---|---|---|---|---|
| 37 (middle) | 37,214 (average) | 123,420 | 106,241 | 4,100 | Low | Green | On-time |
| 45 (middle) | 57,391 (average) | 250,410 | 191,879 | 5,800 | Low | Green | On-time |
| 45 (middle) | 36,692 (average) | 175,037 | 137,800 | 3,400 | Low | Green | On-time |
| 25 (young) | 67,808 (average) | 25,174 | 61,271 | 3,100 | High | Yellow | On-time |
| 36 (middle) | 102,143 (high) | 246,148 | 231,334 | 600 | Low | Green | On-time |
| 29 (young) | 34,579 (average) | 49,387 | 59,412 | 4,600 | High | Red | On-time |
| 26 (young) | 22,958 (low) | 29,878 | 36,508 | 400 | High | Yellow | Late |
| 34 (middle) | 42,526 (average) | 109,934 | 92,494 | 3,700 | Low | Green | On-time |
| 28 (young) | 80,019 (high) | 78,632 | 100,957 | 12,800 | High | Green | On-time |
| 32 (middle) | 57,407 (average) | 117,062 | 101,967 | 100 | Low | Green | On-time |



The model can be applied to the new applicants given in Table 4.6.


Table 4.6 New appliance loan analysis

| Age | Income | Assets | Debts | Want | Credit |
|---|---|---|---|---|---|
| 25 | 28,650 | 9,824 | 2,000 | 10,000 | Green |
| 30 | 35,760 | 12,974 | 32,634 | 4,000 | Yellow |
| 32 | 41,862 | 625,321 | 428,643 | 3,000 | Red |
| 36 | 36,843 | 80,431 | 120,643 | 12,006 | Green |
| 37 | 62,743 | 421,753 | 321,845 | 5,000 | Yellow |
| 37 | 53,869 | 286,375 | 302,958 | 4,380 | Green |
| 37 | 70,120 | 484,264 | 303,958 | 6,000 | Green |
| 38 | 60,429 | 296,843 | 185,769 | 5,250 | Green |
| 39 | 65,826 | 321,959 | 392,817 | 12,070 | Green |
| 40 | 90,426 | 142,098 | 25,426 | 1,280 | Yellow |
| 40 | 70,256 | 528,493 | 283,745 | 3,280 | Green |
| 42 | 58,326 | 328,457 | 120,849 | 4,870 | Green |
| 42 | 61,242 | 525,673 | 184,762 | 3,300 | Green |
| 42 | 39,676 | 326,346 | 421,094 | 1,290 | Red |
| 43 | 102,496 | 823,532 | 175,932 | 3,370 | Green |
| 43 | 80,376 | 753,256 | 239,845 | 5,150 | Yellow |
| 44 | 74,623 | 584,234 | 398,456 | 1,525 | Green |
| 45 | 91,672 | 436,854 | 275,632 | 5,800 | Green |
| 52 | 120,721 | 921,482 | 128,573 | 2,500 | Yellow |
| 63 | 86,521 | 241,689 | 5,326 | 30,000 | Green |



Job Application Data

The second dataset involves 500 past job applicants. Variables are:

  • Age: integer, 20 to 65
  • State: state of origin
  • Degree: Cert (professional certification), UG (undergraduate degree), MBA (Masters in Business Administration), MS (Masters of Science), PhD (doctorate)
  • Major: none, Engr (engineering), Sci (science or math), Csci (computer science), BusAd (business administration), IS (information systems)
  • Experience: integer years of experience in this field
  • Outcome: ordinal (Unacceptable, Minimal, Adequate, Excellent)

Table 4.7 gives the 10 observations in the learning set.


Table 4.7 Job applicant training dataset

| Record | Age | State | Degree | Major | Experience (in years) | Outcome |
|---|---|---|---|---|---|---|
| 1 | 27 | CA | BS | Engineering | 2 | Excellent |
| 2 | 33 | NV | MBA | Business Administration | 5 | Adequate |
| 3 | 30 | CA | MS | Computer Science | 0 | Adequate |
| 4 | 22 | CA | BS | Information Systems | 0 | Unacceptable |
| 5 | 28 | CA | BS | Information Systems | 2 | Minimal |
| 6 | 26 | CA | MS | Business Administration | 0 | Excellent |
| 7 | 25 | CA | BS | Engineering | 3 | Adequate |
| 8 | 28 | OR | MS | Computer Science | 2 | Adequate |
| 9 | 25 | CA | BS | Information Systems | 2 | Minimal |
| 10 | 24 | CA | BS | Information Systems | 1 | Adequate |



Notice that some of these variables are quantitative and others are nominal. State, degree, and major are nominal. There is no information content intended by state or major. State is not expected to have a specific order prior to analysis, nor is major. (The analysis may conclude that there is a relationship between state, major, and outcome, however.) Degree is ordinal, in that MS and MBA are higher degrees than BS. However, as with state and major, the analysis may find a reverse relationship with the outcome.
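Nominal variables such as state and major are commonly converted to 0/1 indicator (one-hot) columns before being fed to numeric methods. A minimal sketch follows; the helper name and the choice of category lists are illustrative assumptions, not from the text.

```python
def one_hot(value, categories):
    """Return a 0/1 indicator list, one position per category."""
    return [1 if value == c else 0 for c in categories]

states = ["CA", "NV", "OR"]
degrees = ["BS", "MS", "MBA"]

# Record 2 from Table 4.7: 33, NV, MBA, Business Administration, 5 years
row = one_hot("NV", states) + one_hot("MBA", degrees)
print(row)  # [0, 1, 0, 0, 0, 1]
```

Encoding this way avoids imposing a false ordering on nominal values; an ordinal variable such as degree could instead be mapped to integers if the analyst wants to preserve its order.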

Table 4.8 gives the test dataset for this case.


Table 4.8 Job applicant test dataset

| Record | Age | State | Degree | Major | Experience (in years) | Outcome |
|---|---|---|---|---|---|---|
| 11 | 36 | CA | MS | Information Systems | 0 | Minimal |
| 12 | 28 | OR | BS | Computer Science | 5 | Unacceptable |
| 13 | 24 | NV | BS | Information Systems | 0 | Excellent |
| 14 | 33 | CA | BS | Engineering | 2 | Adequate |
| 15 | 26 | CA | BS | Business Administration | 3 | Minimal |



Table 4.9 provides a set of new job applicants to be classified by predicted job performance.


Table 4.9 New job applicant set

| Age | State | Degree | Major | Experience (in years) |
|---|---|---|---|---|
| 28 | CA | MBA | Engr | 0 |
| 26 | NM | UG | Sci | 3 |
| 33 | TX | MS | Engr | 6 |
| 21 | CA | Cert | none | 0 |
| 26 | OR | Cert | none | 5 |
| 25 | CA | UG | BusAd | 0 |
| 32 | AR | UG | Engr | 8 |
| 41 | PA | MBA | BusAd | 2 |
| 29 | CA | UG | Sci | 6 |
| 28 | WA | UG | Csci | 3 |



Insurance Fraud Data

The third dataset involves insurance claims. The full dataset includes 5,000 past claims with known outcomes. Variables include the claimant age, gender, amount of insurance claim, number of traffic tickets currently on record (less than 3 years old), number of prior accident claims of the type insured, and attorney (if any). Table 4.10 gives the training dataset.


Table 4.10 Training dataset—Insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome |
|---|---|---|---|---|---|---|
| 52 | Male | 2,000 | 0 | 1 | Jones | OK |
| 38 | Male | 1,800 | 0 | 0 | None | OK |
| 21 | Female | 5,600 | 1 | 2 | Smith | Fraudulent |
| 36 | Female | 3,800 | 0 | 1 | None | OK |
| 19 | Male | 600 | 2 | 2 | Adams | OK |
| 41 | Male | 4,200 | 1 | 2 | Smith | Fraudulent |
| 38 | Male | 2,700 | 0 | 0 | None | OK |
| 33 | Female | 2,500 | 0 | 1 | None | Fraudulent |
| 18 | Female | 1,300 | 0 | 0 | None | OK |
| 26 | Male | 2,600 | 2 | 0 | None | OK |
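As a quick illustration of the kind of pattern such data can reveal, the Table 4.10 training claims can be tabulated by attorney involvement. The tuple layout is an assumption for this sketch; this is exploratory counting, not a fraud-detection model.

```python
from collections import Counter

# (attorney, outcome) pairs taken from Table 4.10
claims = [
    ("Jones", "OK"), ("None", "OK"), ("Smith", "Fraudulent"),
    ("None", "OK"), ("Adams", "OK"), ("Smith", "Fraudulent"),
    ("None", "OK"), ("None", "Fraudulent"), ("None", "OK"),
    ("None", "OK"),
]

# Key: (has_attorney, is_fraudulent)
counts = Counter((att != "None", out == "Fraudulent") for att, out in claims)

# Fraudulent claims with and without an attorney, respectively
print(counts[(True, True)], counts[(False, True)])
# 2 1
```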



The test set is given in Table 4.11.


Table 4.11 Test dataset—Insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome |
|---|---|---|---|---|---|---|
| 23 | Male | 2,800 | 1 | 0 | None | OK |
| 31 | Female | 1,400 | 0 | 0 | None | OK |
| 28 | Male | 4,200 | 2 | 3 | Smith | Fraudulent |
| 19 | Male | 2,800 | 0 | 1 | None | OK |
| 41 | Male | 1,600 | 0 | 0 | Henry | OK |



A set of new claims is given in Table 4.12.


Table 4.12 New insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney |
|---|---|---|---|---|---|
| 23 | Male | 1,800 | 1 | 1 | None |
| 32 | Female | 2,100 | 0 | 0 | None |
| 20 | Female | 1,600 | 0 | 0 | None |
| 18 | Female | 3,300 | 2 | 0 | None |
| 55 | Male | 4,000 | 0 | 0 | Smith |
| 41 | Male | 2,600 | 1 | 1 | None |
| 38 | Female | 3,100 | 0 | 0 | None |
| 21 | Male | 2,500 | 1 | 0 | None |
| 16 | Female | 4,500 | 1 | 2 | Gold |
| 24 | Male | 2,600 | 1 | 1 | None |



Expenditure Data

This dataset represents consumer data for a community, gathered by a hypothetical market research company in a moderately sized city. Ten thousand observations have been gathered over the following variables:

Demographic:

  • Age: integer, 16 and up
  • Gender: 0 female, 1 male
  • Marital status: 0 single, 0.5 divorced, 1 married
  • Dependents: number of dependents
  • Income: annual income in dollars
  • Job yrs: years in the current job (integer)
  • Town yrs: years in this community
  • Yrs Ed: years of education completed
  • Dri Lic: driver's license (0 no, 1 yes)
  • Own Home: 0 no, 1 yes
  • #Cred C: number of credit cards

Consumer:

  • Churn: number of credit card balances canceled last year
  • ProGroc: proportion of income spent at grocery stores
  • ProRest: proportion of income spent at restaurants
  • ProHous: proportion of income spent on housing
  • ProUtil: proportion of income spent on utilities
  • ProAuto: proportion of income spent on automobiles (owned and operated)
  • ProCloth: proportion of income spent on clothing
  • ProEnt: proportion of income spent on entertainment

This dataset can be used for a number of studies, addressing questions such as what types of customers are most likely to seek restaurants, what the market for home furnishings might be, which customers are most likely to be interested in clothing or entertainment, and how spending relates to demographic variables.
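One such relationship, income versus the proportion spent at restaurants, can be examined with a simple Pearson correlation. The sample records below are invented for illustration; the real dataset has 10,000 observations.

```python
import math

# Hypothetical sample values for Income and ProRest
income   = [24000, 31000, 45000, 52000, 67000, 80000]
pro_rest = [0.02, 0.03, 0.04, 0.05, 0.06, 0.08]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(income, pro_rest), 3))
```

A coefficient near +1 would suggest that restaurant spending share rises with income in this (invented) sample; the same computation applies to any pair of the variables above.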

Bankruptcy Data

This data concerns 100 U.S. firms that underwent bankruptcy.1 All of the sample data are from U.S. companies. About 400 bankrupt company names were obtained using google.com, and the ticker symbol of each company was then found using the Compustat database. Only companies that went bankrupt between January 2006 and December 2009 were retained, in the hope that the economic crisis of that period would yield distinctive results; 99 companies remained after this step. The ticker list was then submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009. These financial data and ratios are the factors from which company bankruptcy can be predicted. The factors collected are based on the literature and include total assets, book value per share, inventories, liabilities, receivables, cost of goods sold, total dividends, earnings before interest and taxes, gross profit (loss), net income (loss), operating income after depreciation, total revenue, sales, dividends per share, and total market value. A match for scale and size was made at a 1:2 ratio, meaning the same financial ratios were collected for 200 nonfailed companies over the same period. First, the LexisNexis database was used to find companies with Securities and Exchange Commission filings after June 2010, indicating that the companies were still active; 200 companies were selected from the results, yielding a list of CIK codes. Finally, the CIK code list was submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009, the same period used for the failed companies.

The dataset consists of 1,321 records with full data over 19 attributes, as shown in Table 4.13. The outcome attribute in bankruptcy has a value of 1 if the firm went bankrupt by 2011 (697 cases) and a value of 0 if it did not (624 cases).


Table 4.13 Attributes in bankruptcy data

| No. | Short name | Long name |
|---|---|---|
| 1 | fyear | Data year—fiscal |
| 2 | cik | CIK number |
| 3 | at | Assets—total |
| 4 | bkvlps | Book value per share |
| 5 | invt | Inventories—total |
| 6 | Lt | Liabilities—total |
| 7 | rectr | Receivables—trade |
| 8 | cogs | Cost of goods sold |
| 9 | dvt | Dividends—total |
| 10 | ebit | Earnings before interest and taxes |
| 11 | gp | Gross profit (loss) |
| 12 | ni | Net income (loss) |
| 13 | oiadp | Operating income after depreciation |
| 14 | revt | Revenue—total |
| 15 | sale | Sales-turnover (net) |
| 16 | dvpsx_f | Dividends per share—ex-date—fiscal |
| 17 | mkvalt | Market value—total—fiscal |
| 18 | prch_f | Price high—annual—fiscal |
| 19 | bankruptcy | Bankruptcy (output variable) |



This is real data concerning firm bankruptcy, which could be updated by going to the web sources.

Summary

There are a number of tools available for data mining, which can accomplish a number of functions. The tools come from statistics, operations research, and artificial intelligence, providing analytical techniques that can be used to accomplish a variety of analytic functions, such as cluster identification, discriminant analysis, and development of association rules. Data mining software provides powerful means to apply these tools to large sets of data, giving organizational management the means to cope with an overwhelming glut of data and the ability to convert some of this glut into useful knowledge.

This chapter began with an overview of tools and functions. It also previewed four datasets that are used in subsequent chapters, plus a fifth dataset of real firm bankruptcies available for use. These datasets are small, but provide readers with a view of the type of data typically encountered in data mining studies.
