Chapter 4

Overview of Data Mining Techniques

Data useful to business comes in many forms. For instance, an automobile insurance company, faced with millions of accident claims, realizes that not all claims are legitimate. If they are extremely tough and investigate each claim thoroughly, they will spend more money on investigation than they would pay in claims. They also will find that they are unable to sell new policies. If they are as understanding and trusting as their television ads imply, they will reduce their investigation costs to zero, but will leave themselves vulnerable to fraudulent claims. Insurance firms have developed ways to profile claims, considering many variables, to provide an early indication of cases that probably merit expending funds for investigation. This has the effect of reducing the overall policy expenses, because it discourages fraud, while minimizing the imposition on valid claims. The same approach is used by the Internal Revenue Service in processing individual tax returns. Fraud detection has become a viable data mining industry, with a large number of software vendors. This is typical of many applications of data mining.

Data mining can be conducted in many business contexts. This chapter presents four datasets that will be utilized to demonstrate the techniques covered in Part II of the book. In addition to insurance fraud, files have been generated reflecting other common business applications, such as loan evaluation and customer segmentation. The same concepts can be applied to other applications, such as employee evaluation.

We have described data mining, its process, and the data storage systems that make it possible. The next section of the book describes the data mining methods. Data mining tools have been classified by the tasks of classification, estimation, clustering, and summarization. Classification and estimation are predictive; clustering and summarization are descriptive. Not all methods will be presented, but the most commonly used ones will be. We demonstrate each of these methods with small example datasets intended to show how the methods work. We do not intend to give the impression that these datasets are anywhere near the scale of real data mining applications, but they do represent micro versions of real applications and are much more convenient for demonstrating concepts.

Data Mining Models

Data mining uses a variety of modeling tools for a variety of purposes. Various authors have surveyed these purposes and the tools available for them (see Table 4.1). The methods come from both classical statistics and artificial intelligence. Statistical techniques offer strong diagnostic tools that can be used to develop confidence intervals on parameter estimates, conduct hypothesis tests, and more. Artificial intelligence techniques require fewer assumptions about the data and are generally more automatic.


Table 4.1 Data mining modeling tools

| Algorithms | Functions | Basis | Task |
|---|---|---|---|
| Cluster detection | Cluster analysis | Statistics | Classification |
| Regression | Linear regression | Statistics | Prediction |
| | Logistic regression | Statistics | Classification |
| | Discriminant analysis | Statistics | Classification |
| Neural networks | Neural networks | AI | Classification |
| | Kohonen nets | AI | Cluster |
| Decision trees | Association rules | AI | Classification |
| Rule induction | Association rules | AI | Description |
| | Link analysis | | Description |
| Query tools | | | Description |
| Descriptive statistics | | Statistics | Description |
| Visualization tools | | Statistics | Description |



Regression comes in a variety of forms, including ordinary least squares regression, logistic regression (widely used in data mining when outcomes are binary), and discriminant analysis (used when outcomes are categorical and predetermined).
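As a rough sketch of how logistic regression handles a binary outcome, the following fits a one-variable model by gradient ascent on the log-likelihood. The data, learning rate, and iteration count are invented for illustration only; real data mining software automates these choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented (x, y) pairs: y = 1 roughly when x is large
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

b0, b1 = 0.0, 0.0            # intercept and slope
rate = 0.1
for _ in range(5000):        # gradient ascent on the log-likelihood
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data)
    b0 += rate * g0
    b1 += rate * g1

# Predicted probability that a new case with x = 3 belongs to class 1
print(sigmoid(b0 + b1 * 3.0) > 0.5)   # True
```

The fitted model classifies a new case by whether its predicted probability exceeds 0.5, which is the sense in which logistic regression performs classification rather than numeric prediction.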

The point of data mining is to have a variety of tools available to assist the analyst and user in better understanding what the data consist of. Each method does something different, and usually this implies that a specific problem is best treated with a particular algorithm type. However, sometimes different algorithm types can be used for the same problem. Most methods involve setting parameters, which can be important to their effectiveness. Further, the output needs to be interpreted.

There are a number of overlaps. Cluster analysis helps data miners visualize relationships among customer purchases and is supported by visualization techniques that provide a different perspective. Link analysis helps identify connections between entities, often displayed through graphs as a means of visualization. An example of a link analysis application is in telephony, where each call is represented by a link between the caller and the receiver. Another example of linkage is physician referral patterns. A patient may visit a regular doctor, who detects something outside his or her expertise and turns to a network of acquaintances to identify a reliable specialist who knows more. Clinics are collections of physician specialists and might be referred to for especially difficult cases.
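The referral-linkage idea can be illustrated with a toy graph. The names and the adjacency-list representation are invented for demonstration; link analysis tools use far richer structures, but the underlying idea of tracing chains of connections is the same.

```python
from collections import deque

# Hypothetical referral graph: generalists refer patients onward
referrals = {
    "Dr. Adams": ["Dr. Baker", "Dr. Chen"],
    "Dr. Baker": ["City Clinic"],
    "Dr. Chen":  ["City Clinic"],
    "City Clinic": [],
}

def reachable(graph, start):
    """All nodes connected to start by a chain of referrals (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(reachable(referrals, "Dr. Adams")))
# ['City Clinic', 'Dr. Baker', 'Dr. Chen']
```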

Data Mining Perspectives

Methods can be viewed from different perspectives. From the perspective of statistics and operations research, data mining methods include:

  • Cluster analysis
  • Regression of various forms
  • Discriminant analysis (use of linear regression for classification)
  • Line fitting through the operations research tool of multiple objective linear programming

From the perspective of artificial intelligence, these methods include:

  • Neural networks (best fit methods)
  • Rule induction (decision trees)
  • Genetic algorithms (often used to supplement other methods)

Regression and neural network approaches are both best-fit methods and are often applied to the same problems. Regression tends to have advantages with linear data, while neural network models do very well with irregular data. Software usually allows the user to apply variants of each and lets the analyst select the model that fits best. Cluster analysis, discriminant analysis, and case-based reasoning seek to assign new cases to the closest cluster of past observations. Rule induction is the basis of decision tree methods of data mining. Genetic algorithms apply to special forms of data and are often used to boost or improve the operation of other techniques.

The ability of some of these techniques to deal with the common data mining characteristics is compared in Table 4.2.


Table 4.2 General ability of data mining techniques to deal with data features

| Data characteristic | Rule induction | Neural networks | Case-based reasoning | Genetic algorithms |
|---|---|---|---|---|
| Handle noisy data | Good | Very good | Good | Very good |
| Handle missing data | Good | Good | Very good | Good |
| Process large datasets | Very good | Poor | Good | Good |
| Process different data types | Good | Transform to numerical | Very good | Transformation needed |
| Predictive accuracy | High | Very high | High | High |
| Explanation capability | Very good | Poor | Very good | Good |
| Ease of integration | Good | Good | Good | Very good |
| Ease of operation | Easy | Difficult | Easy | Difficult |



Table 4.2 demonstrates that there are different tools for different types of problems. If the data are especially noisy, this can lead to difficulties for classical statistical methods such as regression, cluster analysis, and discriminant analysis. Rule induction and case-based reasoning can deal with such problems, but if the noise consists of false information, the resulting rules can reach wrong conclusions. Neural networks and genetic algorithms have proven useful relative to the classical methods in environments where the data are complex, including nonlinear interactions among variables.

Neural networks have relative disadvantages in dealing with very large numbers of variables, as their computational complexity increases dramatically. Genetic algorithms require a specific data structure in order to operate, and it is not always easy to transform data to meet this requirement.

Another negative feature of neural networks is their hidden nature. Due to the large number of node connections, it is impractical to print out and analyze a large neural network model. This makes it difficult to transport a model built on one system to another system. Therefore, new data must be entered in the system where the neural network model was built in order to apply it to the new cases. This makes it nearly impossible to apply neural network models outside of the system upon which they are built.

Data Mining Functions

Problem types can be described in four categories:

  • Association identifies the rules that determine the relationships among entities, such as in market basket analysis, or the association of symptoms with diseases.
  • Prediction identifies the key attributes from data to develop a formula for prediction of future cases, as in regression models.
  • Classification uses a training dataset to identify classes or clusters, which then are used to categorize data. Typical applications include categorizing risk and return characteristics of investments and credit risk of loan applicants.
  • Detection determines the anomalies and irregularities, valuable in fraud detection.

Table 4.3 compares the common techniques and applications by business area.


Table 4.3 Data mining applications by method

| Area | Technique | Application | Problem type |
|---|---|---|---|
| Finance | Neural network | Forecast stock price | Prediction |
| | Neural network, rule induction | Forecast bankruptcy | Prediction |
| | | Forecast price index futures | Prediction |
| | | Fraud detection | Detection |
| | Neural network, case-based reasoning | Forecast interest rates | Prediction |
| | Neural network, visualization | Delinquent bank loan detection | Detection |
| | Rule induction | Forecast defaulting loans | Prediction |
| | | Credit assessment | Prediction |
| | | Portfolio management | Prediction |
| | | Risk classification | Classification |
| | | Financial customer classification | Classification |
| | Rule induction, case-based reasoning | Corporate bond rating | Prediction |
| | Rule induction, visualization | Loan approval | Prediction |
| Telecom | Neural network, rule induction | Forecast network behavior | Prediction |
| | Rule induction | Churn management | Classification |
| | | Fraud detection | Detection |
| | Case-based reasoning | Call tracking | Classification |
| Marketing | Rule induction | Market segmentation | Classification |
| | | Cross-selling improvement | Association |
| | Rule induction, visualization | Lifestyle behavior analysis | Classification |
| | | Product performance analysis | Association |
| | Rule induction, genetic algorithm, visualization | Customer reaction to promotion | Prediction |
| | Case-based reasoning | Online sales support | Classification |
| Web | Rule induction, visualization | User browsing similarity analysis | Classification, association |
| | Rule-based heuristics | Web page content similarity | Association |
| Others | Neural network | Software cost estimation | Detection |
| | Neural network, rule induction | Litigation assessment | Prediction |
| | Rule induction | Insurance fraud detection | Detection |
| | | Healthcare exception reporting | Detection |
| | Case-based reasoning | Insurance claim estimation | Prediction |
| | | Software quality control | Classification |
| | Genetic algorithms | Budget expenditure | Classification |



Many of these applications combine techniques, including visualization and statistical analysis. The point is that many data mining tools are available for a variety of functional purposes, spanning almost every area of human endeavor (including business). This section of the book seeks to demonstrate how these primary data mining tools work.

Demonstration Datasets

We will use some simple models to demonstrate the concepts. These datasets were generated by the authors to reflect important business applications. The first model includes loan applicants, with 20 observations for building the model and 10 applicants serving as a test dataset. The second dataset represents job applicants: 10 observations with known outcomes serve as the training set, with 5 additional cases in the test set. A third dataset of insurance claims has 10 known outcomes for training and 5 observations in the test set. Models built on all three datasets will then be applied to new cases.

Larger datasets for each of these three cases will be provided as well as a dataset on expenditure data. These larger datasets will be used in various chapters to demonstrate methods.

Loan Analysis Data

This dataset (Table 4.4) consists of information on applicants for appliance loans. The full dataset involves 650 past observations. Applicant information on age, income, assets, debts, and credit rating (from a credit bureau, with red for bad credit, yellow for some credit problems, and green for a clean credit record) is assumed available from loan applications. Variable Want is the amount requested in the appliance loan application. For past observations, variable On-time is 1 if all payments were received on time and 0 if not (Late or Default). The majority of past loans were paid on time. Data were transformed to obtain categorical data for some of the techniques. Age was grouped as less than 30 (young), 60 or over (old), and in between (middle-aged). Income was grouped as $30,000 per year or less (low income), $80,000 per year or more (high income), and average in between. Assets, debts, and loan amount (variable Want) are used by rule to generate the categorical variable Risk: High if debts exceed assets, Low if assets exceed the sum of debts plus the amount requested, and Average in between.
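The grouping rules above can be sketched directly in code. The function names and record layout are illustrative assumptions; the thresholds are those stated in the text.

```python
def age_group(age):
    """Age categories from the text: under 30 young, 60 or over old."""
    if age < 30:
        return "young"
    if age >= 60:
        return "old"
    return "middle"

def income_group(income):
    """Income categories: $30,000 or less low, $80,000 or more high."""
    if income <= 30000:
        return "low"
    if income >= 80000:
        return "high"
    return "average"

def risk_group(assets, debts, want):
    """Risk rule: High if debts exceed assets, Low if assets cover debts plus the request."""
    if debts > assets:
        return "High"
    if assets > debts + want:
        return "Low"
    return "Average"

# First training record from Table 4.4: age 20, income 17,152,
# assets 11,090, debts 20,455, want 400
print(age_group(20), income_group(17152), risk_group(11090, 20455, 400))
# young low High
```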


Table 4.4 Loan analysis training dataset

| Age | Income | Assets | Debts | Want | Risk | Credit | Result |
|---|---|---|---|---|---|---|---|
| 20 (young) | 17,152 (low) | 11,090 | 20,455 | 400 | High | Green | On-time |
| 23 (young) | 25,862 (low) | 24,756 | 30,083 | 2,300 | High | Green | On-time |
| 28 (young) | 26,169 (low) | 47,355 | 49,341 | 3,100 | High | Yellow | Late |
| 23 (young) | 21,117 (low) | 21,242 | 30,278 | 300 | High | Red | Default |
| 22 (young) | 7,127 (low) | 23,903 | 17,231 | 900 | Low | Yellow | On-time |
| 26 (young) | 42,083 (average) | 35,726 | 41,421 | 300 | High | Red | Late |
| 24 (young) | 55,557 (average) | 27,040 | 48,191 | 1,500 | High | Green | On-time |
| 27 (young) | 34,843 (average) | 0 | 21,031 | 2,100 | High | Red | On-time |
| 29 (young) | 74,295 (average) | 88,827 | 100,599 | 100 | High | Yellow | On-time |
| 23 (young) | 38,887 (average) | 6,260 | 33,635 | 9,400 | Low | Green | On-time |
| 28 (young) | 31,758 (average) | 58,492 | 49,268 | 1,000 | Low | Green | On-time |
| 25 (young) | 80,180 (high) | 31,696 | 69,529 | 1,000 | High | Green | Late |
| 33 (middle) | 40,921 (average) | 91,111 | 90,076 | 2,900 | Average | Yellow | Late |
| 36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300 | Low | Green | On-time |
| 39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600 | Low | Green | On-time |
| 39 (middle) | 125,713 (high) | 382,180 | 315,396 | 5,200 | Low | Yellow | On-time |
| 55 (middle) | 80,149 (high) | 511,937 | 21,923 | 1,000 | Low | Green | On-time |
| 62 (old) | 101,291 (high) | 783,164 | 23,052 | 1,800 | Low | Green | On-time |
| 71 (old) | 81,723 (high) | 776,344 | 20,277 | 900 | Low | Green | On-time |
| 63 (old) | 99,522 (high) | 783,491 | 24,643 | 200 | Low | Green | On-time |



Table 4.5 gives a test set of data.


Table 4.5 Loan analysis test data

| Age | Income | Assets | Debts | Want | Risk | Credit | Result |
|---|---|---|---|---|---|---|---|
| 37 (middle) | 37,214 (average) | 123,420 | 106,241 | 4,100 | Low | Green | On-time |
| 45 (middle) | 57,391 (average) | 250,410 | 191,879 | 5,800 | Low | Green | On-time |
| 45 (middle) | 36,692 (average) | 175,037 | 137,800 | 3,400 | Low | Green | On-time |
| 25 (young) | 67,808 (average) | 25,174 | 61,271 | 3,100 | High | Yellow | On-time |
| 36 (middle) | 102,143 (high) | 246,148 | 231,334 | 600 | Low | Green | On-time |
| 29 (young) | 34,579 (average) | 49,387 | 59,412 | 4,600 | High | Red | On-time |
| 26 (young) | 22,958 (low) | 29,878 | 36,508 | 400 | High | Yellow | Late |
| 34 (middle) | 42,526 (average) | 109,934 | 92,494 | 3,700 | Low | Green | On-time |
| 28 (young) | 80,019 (high) | 78,632 | 100,957 | 12,800 | High | Green | On-time |
| 32 (middle) | 57,407 (average) | 117,062 | 101,967 | 100 | Low | Green | On-time |



The model can be applied to the new applicants given in Table 4.6.


Table 4.6 New appliance loan analysis

| Age | Income | Assets | Debts | Want | Credit |
|---|---|---|---|---|---|
| 25 | 28,650 | 9,824 | 2,000 | 10,000 | Green |
| 30 | 35,760 | 12,974 | 32,634 | 4,000 | Yellow |
| 32 | 41,862 | 625,321 | 428,643 | 3,000 | Red |
| 36 | 36,843 | 80,431 | 120,643 | 12,006 | Green |
| 37 | 62,743 | 421,753 | 321,845 | 5,000 | Yellow |
| 37 | 53,869 | 286,375 | 302,958 | 4,380 | Green |
| 37 | 70,120 | 484,264 | 303,958 | 6,000 | Green |
| 38 | 60,429 | 296,843 | 185,769 | 5,250 | Green |
| 39 | 65,826 | 321,959 | 392,817 | 12,070 | Green |
| 40 | 90,426 | 142,098 | 25,426 | 1,280 | Yellow |
| 40 | 70,256 | 528,493 | 283,745 | 3,280 | Green |
| 42 | 58,326 | 328,457 | 120,849 | 4,870 | Green |
| 42 | 61,242 | 525,673 | 184,762 | 3,300 | Green |
| 42 | 39,676 | 326,346 | 421,094 | 1,290 | Red |
| 43 | 102,496 | 823,532 | 175,932 | 3,370 | Green |
| 43 | 80,376 | 753,256 | 239,845 | 5,150 | Yellow |
| 44 | 74,623 | 584,234 | 398,456 | 1,525 | Green |
| 45 | 91,672 | 436,854 | 275,632 | 5,800 | Green |
| 52 | 120,721 | 921,482 | 128,573 | 2,500 | Yellow |
| 63 | 86,521 | 241,689 | 5,326 | 30,000 | Green |



Job Application Data

The second dataset involves 500 past job applicants. Variables are:

  • Age: integer, 20 to 65
  • State: state of origin
  • Degree: Cert (professional certification), UG (undergraduate degree), MBA (Masters in Business Administration), MS (Masters of Science), PhD (doctorate)
  • Major: none, Engr (engineering), Sci (science or math), Csci (computer science), BusAd (business administration), IS (information systems)
  • Experience: integer years of experience in this field
  • Outcome: ordinal (Unacceptable, Minimal, Adequate, Excellent)

Table 4.7 gives the 10 observations in the learning set.


Table 4.7 Job applicant training dataset

| Record | Age | State | Degree | Major | Experience (in years) | Outcome |
|---|---|---|---|---|---|---|
| 1 | 27 | CA | BS | Engineering | 2 | Excellent |
| 2 | 33 | NV | MBA | Business Administration | 5 | Adequate |
| 3 | 30 | CA | MS | Computer Science | 0 | Adequate |
| 4 | 22 | CA | BS | Information Systems | 0 | Unacceptable |
| 5 | 28 | CA | BS | Information Systems | 2 | Minimal |
| 6 | 26 | CA | MS | Business Administration | 0 | Excellent |
| 7 | 25 | CA | BS | Engineering | 3 | Adequate |
| 8 | 28 | OR | MS | Computer Science | 2 | Adequate |
| 9 | 25 | CA | BS | Information Systems | 2 | Minimal |
| 10 | 24 | CA | BS | Information Systems | 1 | Adequate |



Notice that some of these variables are quantitative and others are nominal. State, degree, and major are nominal. There is no information content intended by state or major. State is not expected to have a specific order prior to analysis, nor is major. (The analysis may conclude that there is a relationship between state, major, and outcome, however.) Degree is ordinal, in that MS and MBA are higher degrees than BS. However, as with state and major, the analysis may find a reverse relationship with the outcome.
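Nominal variables such as state and major are commonly converted to 0/1 indicator (one-hot) columns before being fed to numeric methods. A minimal sketch follows; the helper name and the choice of category lists are illustrative assumptions, not from the text.

```python
def one_hot(value, categories):
    """Return a 0/1 indicator list, one position per category."""
    return [1 if value == c else 0 for c in categories]

states = ["CA", "NV", "OR"]
degrees = ["BS", "MS", "MBA"]

# Record 2 from Table 4.7: 33, NV, MBA, Business Administration, 5 years
row = one_hot("NV", states) + one_hot("MBA", degrees)
print(row)  # [0, 1, 0, 0, 0, 1]
```

Encoding this way avoids imposing a false ordering on nominal values; an ordinal variable such as degree could instead be mapped to integers if the analyst wants to preserve its order.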

Table 4.8 gives the test dataset for this case.


Table 4.8 Job applicant test dataset

| Record | Age | State | Degree | Major | Experience (in years) | Outcome |
|---|---|---|---|---|---|---|
| 11 | 36 | CA | MS | Information Systems | 0 | Minimal |
| 12 | 28 | OR | BS | Computer Science | 5 | Unacceptable |
| 13 | 24 | NV | BS | Information Systems | 0 | Excellent |
| 14 | 33 | CA | BS | Engineering | 2 | Adequate |
| 15 | 26 | CA | BS | Business Administration | 3 | Minimal |



Table 4.9 provides a set of new job applicants to be classified by predicted job performance.


Table 4.9 New job applicant set

| Age | State | Degree | Major | Experience (in years) |
|---|---|---|---|---|
| 28 | CA | MBA | Engr | 0 |
| 26 | NM | UG | Sci | 3 |
| 33 | TX | MS | Engr | 6 |
| 21 | CA | Cert | none | 0 |
| 26 | OR | Cert | none | 5 |
| 25 | CA | UG | BusAd | 0 |
| 32 | AR | UG | Engr | 8 |
| 41 | PA | MBA | BusAd | 2 |
| 29 | CA | UG | Sci | 6 |
| 28 | WA | UG | Csci | 3 |



Insurance Fraud Data

The third dataset involves insurance claims. The full dataset includes 5,000 past claims with known outcomes. Variables include the claimant age, gender, amount of insurance claim, number of traffic tickets currently on record (less than 3 years old), number of prior accident claims of the type insured, and attorney (if any). Table 4.10 gives the training dataset.


Table 4.10 Training dataset—Insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome |
|---|---|---|---|---|---|---|
| 52 | Male | 2,000 | 0 | 1 | Jones | OK |
| 38 | Male | 1,800 | 0 | 0 | None | OK |
| 21 | Female | 5,600 | 1 | 2 | Smith | Fraudulent |
| 36 | Female | 3,800 | 0 | 1 | None | OK |
| 19 | Male | 600 | 2 | 2 | Adams | OK |
| 41 | Male | 4,200 | 1 | 2 | Smith | Fraudulent |
| 38 | Male | 2,700 | 0 | 0 | None | OK |
| 33 | Female | 2,500 | 0 | 1 | None | Fraudulent |
| 18 | Female | 1,300 | 0 | 0 | None | OK |
| 26 | Male | 2,600 | 2 | 0 | None | OK |
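As a quick illustration of the kind of pattern such data can reveal, the Table 4.10 training claims can be tabulated by attorney involvement. The tuple layout is an assumption for this sketch; this is exploratory counting, not a fraud-detection model.

```python
from collections import Counter

# (attorney, outcome) pairs taken from Table 4.10
claims = [
    ("Jones", "OK"), ("None", "OK"), ("Smith", "Fraudulent"),
    ("None", "OK"), ("Adams", "OK"), ("Smith", "Fraudulent"),
    ("None", "OK"), ("None", "Fraudulent"), ("None", "OK"),
    ("None", "OK"),
]

# Key: (has_attorney, is_fraudulent)
counts = Counter((att != "None", out == "Fraudulent") for att, out in claims)

# Fraudulent claims with and without an attorney, respectively
print(counts[(True, True)], counts[(False, True)])
# 2 1
```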



The test set is given in Table 4.11.


Table 4.11 Test dataset—Insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome |
|---|---|---|---|---|---|---|
| 23 | Male | 2,800 | 1 | 0 | None | OK |
| 31 | Female | 1,400 | 0 | 0 | None | OK |
| 28 | Male | 4,200 | 2 | 3 | Smith | Fraudulent |
| 19 | Male | 2,800 | 0 | 1 | None | OK |
| 41 | Male | 1,600 | 0 | 0 | Henry | OK |



A set of new claims is given in Table 4.12.


Table 4.12 New insurance claims

| Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney |
|---|---|---|---|---|---|
| 23 | Male | 1,800 | 1 | 1 | None |
| 32 | Female | 2,100 | 0 | 0 | None |
| 20 | Female | 1,600 | 0 | 0 | None |
| 18 | Female | 3,300 | 2 | 0 | None |
| 55 | Male | 4,000 | 0 | 0 | Smith |
| 41 | Male | 2,600 | 1 | 1 | None |
| 38 | Female | 3,100 | 0 | 0 | None |
| 21 | Male | 2,500 | 1 | 0 | None |
| 16 | Female | 4,500 | 1 | 2 | Gold |
| 24 | Male | 2,600 | 1 | 1 | None |



Expenditure Data

This dataset represents consumer data for a community, gathered by a hypothetical market research company in a moderately sized city. Ten thousand observations have been gathered over the following variables:

Demographic:

  • Age: integer, 16 and up
  • Gender: 0 female, 1 male
  • Marital status: 0 single, 0.5 divorced, 1 married
  • Dependents: number of dependents
  • Income: annual income in dollars
  • Job yrs: years in the current job (integer)
  • Town yrs: years in this community
  • Yrs Ed: years of education completed
  • Dri Lic: driver's license (0 no, 1 yes)
  • Own Home: 0 no, 1 yes
  • #Cred C: number of credit cards

Consumer:

  • Churn: number of credit card balances canceled last year
  • ProGroc: proportion of income spent at grocery stores
  • ProRest: proportion of income spent at restaurants
  • ProHous: proportion of income spent on housing
  • ProUtil: proportion of income spent on utilities
  • ProAuto: proportion of income spent on automobiles (owned and operated)
  • ProCloth: proportion of income spent on clothing
  • ProEnt: proportion of income spent on entertainment

This dataset can be used for a number of studies, addressing questions such as what types of customers are most likely to seek restaurants, what the market for home furnishings might be, which customers are most likely to be interested in clothing or entertainment, and how spending relates to demographic variables.
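One such relationship, income versus the proportion spent at restaurants, can be examined with a simple Pearson correlation. The sample records below are invented for illustration; the real dataset has 10,000 observations.

```python
import math

# Hypothetical sample values for Income and ProRest
income   = [24000, 31000, 45000, 52000, 67000, 80000]
pro_rest = [0.02, 0.03, 0.04, 0.05, 0.06, 0.08]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(income, pro_rest), 3))
```

A coefficient near +1 would suggest that restaurant spending share rises with income in this (invented) sample; the same computation applies to any pair of the variables above.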

Bankruptcy Data

This data concerns 100 U.S. firms that underwent bankruptcy.1 All of the sample data are from U.S. companies. About 400 bankrupt company names were obtained using google.com, and the ticker symbol of each company was then found using the Compustat database. Only companies that went bankrupt between January 2006 and December 2009 were retained, in the hope that the economic crisis of that period would yield distinctive results; 99 companies remained after this step. The ticker list was then submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009. These financial data and ratios are the factors from which company bankruptcy can be predicted. The factors collected are based on the literature and include total assets, book value per share, inventories, liabilities, receivables, cost of goods sold, total dividends, earnings before interest and taxes, gross profit (loss), net income (loss), operating income after depreciation, total revenue, sales, dividends per share, and total market value. A match for scale and size was made at a 1:2 ratio, meaning the same financial ratios were collected for 200 nonfailed companies over the same period. First, the LexisNexis database was used to find companies with Securities and Exchange Commission filings after June 2010, indicating that the companies were still active; 200 companies were selected from the results, yielding a list of CIK codes. Finally, the CIK code list was submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009, the same period used for the failed companies.

The dataset consists of 1,321 records with full data over 19 attributes, as shown in Table 4.13. The outcome attribute in bankruptcy has a value of 1 if the firm went bankrupt by 2011 (697 cases) and a value of 0 if it did not (624 cases).


Table 4.13 Attributes in bankruptcy data

| No. | Short name | Long name |
|---|---|---|
| 1 | fyear | Data year—fiscal |
| 2 | cik | CIK number |
| 3 | at | Assets—total |
| 4 | bkvlps | Book value per share |
| 5 | invt | Inventories—total |
| 6 | Lt | Liabilities—total |
| 7 | rectr | Receivables—trade |
| 8 | cogs | Cost of goods sold |
| 9 | dvt | Dividends—total |
| 10 | ebit | Earnings before interest and taxes |
| 11 | gp | Gross profit (loss) |
| 12 | ni | Net income (loss) |
| 13 | oiadp | Operating income after depreciation |
| 14 | revt | Revenue—total |
| 15 | sale | Sales-turnover (net) |
| 16 | dvpsx_f | Dividends per share—ex-date—fiscal |
| 17 | mkvalt | Market value—total—fiscal |
| 18 | prch_f | Price high—annual—fiscal |
| 19 | bankruptcy | Bankruptcy (output variable) |



This is real data concerning firm bankruptcy, which could be updated by going to the web sources.

Summary

There are a number of tools available for data mining, which can accomplish a number of functions. The tools come from statistics, operations research, and artificial intelligence, providing analytical techniques that can be used to accomplish a variety of analytic functions, such as cluster identification, discriminant analysis, and development of association rules. Data mining software provides powerful means to apply these tools to large sets of data, giving organizational management the means to cope with an overwhelming glut of data and the ability to convert some of this glut into useful knowledge.

This chapter began with an overview of tools and functions. It also previewed four datasets that are used in subsequent chapters, plus a fifth dataset of real firm bankruptcies available for use. These datasets are small, but provide readers with a view of the type of data typically encountered in data mining studies.
