Overview of Data Mining Techniques
Data useful to business comes in many forms. For instance, an automobile insurance company faced with millions of accident claims realizes that not all claims are legitimate. If it is extremely tough and investigates each claim thoroughly, it will spend more money on investigation than it would pay in claims, and it will find itself unable to sell new policies. If it is as understanding and trusting as its television ads imply, it will reduce its investigation costs to zero but leave itself vulnerable to fraudulent claims. Insurance firms have therefore developed ways to profile claims, considering many variables, to provide an early indication of cases that probably merit expending funds for investigation. This reduces overall policy expenses, because it discourages fraud while minimizing the imposition on valid claims. The Internal Revenue Service uses the same approach in processing individual tax returns. Fraud detection has become a viable data mining industry, with a large number of software vendors, and it is typical of many applications of data mining.
Data mining can be conducted in many business contexts. This chapter presents four datasets that will be used to demonstrate the techniques covered in Part II of the book. In addition to insurance fraud, files have been generated reflecting other common business applications, such as loan evaluation and customer segmentation. The same concepts can be applied to other applications, such as employee evaluation.
We have described data mining, its process, and the data storage systems that make it possible. The next section of the book describes data mining methods. Data mining tools have been classified by the tasks of classification, estimation, clustering, and summarization. Classification and estimation are predictive; clustering and summarization are descriptive. Not all methods will be presented, but those most commonly used will be. We demonstrate each method with small example datasets intended to show how the methods work. We do not intend to give the impression that these datasets approach the scale of real data mining applications, but they are micro versions of real applications and are much more convenient for demonstrating concepts.
Data Mining Models
Data mining uses a variety of modeling tools for a variety of purposes. Various authors have surveyed these purposes and the tools available for each (see Table 4.1). The methods come from classical statistics as well as from artificial intelligence. Statistical techniques offer strong diagnostic tools, such as confidence intervals on parameter estimates and hypothesis tests. Artificial intelligence techniques require fewer assumptions about the data and are generally more automatic.
Table 4.1 Data mining modeling tools
Algorithms | Functions | Basis | Task
Cluster detection | Cluster analysis | Statistics | Classification
Regression | Linear regression | Statistics | Prediction
 | Logistic regression | Statistics | Classification
 | Discriminant analysis | Statistics | Classification
Neural networks | Neural networks | AI | Classification
 | Kohonen nets | AI | Cluster
Decision trees | Association rules | AI | Classification
Rule induction | Association rules | AI | Description
 | Link analysis | | Description
 | Query tools | | Description
 | Descriptive statistics | Statistics | Description
 | Visualization tools | Statistics | Description
Regression comes in a variety of forms, including ordinary least squares regression, logistic regression (widely used in data mining when outcomes are binary), and discriminant analysis (used when outcomes are categorical and predetermined).
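To illustrate the logistic form, the sketch below fits a one-variable logistic model to a toy binary outcome by simple stochastic gradient ascent. The data, learning rate, and epoch count are invented for the example and are not taken from the chapter.

```python
import math

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math range errors
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by stochastic gradient ascent
    on the log-likelihood (an illustration, not production code)."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            b0 += lr * (y - p)
            b1 += lr * (y - p) * x
    return b0, b1

# Invented toy data: larger x values go with outcome 1
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

The fitted model maps any x to a probability between 0 and 1, which is why the logistic form suits binary outcomes such as fraudulent/OK or on-time/late.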
The point of data mining is to have a variety of tools available to help the analyst and user better understand what the data contains. Each method does something different, and usually this implies that a specific problem is best treated with a particular algorithm type. Sometimes, however, different algorithm types can be applied to the same problem. Most methods require setting parameters, which can be important to their effectiveness. Further, the output needs to be interpreted.
There are a number of overlaps. Cluster analysis helps data miners visualize relationships among customer purchases and is supported by visualization techniques that provide a different perspective. Link analysis helps identify connections between variables, often displayed through graphs as a means of visualization. One application of link analysis is in telephony, where each call is represented by a link between caller and receiver. Another example is physician referral patterns: a patient may visit a regular doctor, who detects something outside his or her expertise and draws on a network of acquaintances to identify a reliable specialist. Clinics, as collections of physician specialists, might in turn be referred to for especially difficult cases.
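The telephony example can be sketched as a graph of call records. The names and call pairs below are invented for illustration; the point is only that counting distinct contacts per node highlights hubs in the linkage.

```python
from collections import defaultdict

# Hypothetical call records: (caller, receiver) pairs
calls = [("Ann", "Bob"), ("Ann", "Carl"), ("Bob", "Carl"),
         ("Dave", "Ann"), ("Eve", "Dave"), ("Ann", "Eve")]

# Build an undirected link graph: each call links two parties
links = defaultdict(set)
for caller, receiver in calls:
    links[caller].add(receiver)
    links[receiver].add(caller)

# Degree (number of distinct contacts) highlights hub nodes
hub = max(links, key=lambda n: len(links[n]))
print(hub, len(links[hub]))  # → Ann 4
```

In a real link analysis, such hubs might correspond to a switchboard, a fraud ring coordinator, or simply a popular contact; the graph display itself is the analytic product.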
Data Mining Perspectives
Methods can be viewed from different perspectives. From the perspective of statistics and operations research, data mining methods include cluster analysis, linear and logistic regression, and discriminant analysis.
From the perspective of artificial intelligence, these methods include neural networks, rule induction (the basis of decision trees), case-based reasoning, and genetic algorithms.
Regression and neural network approaches are both best-fit methods and are often applied to the same problems. Regression tends to have advantages with linear data, while neural network models do very well with irregular data. Software usually allows the user to apply variants of each and lets the analyst select the model that fits best. Cluster analysis, discriminant analysis, and case-based reasoning seek to assign new cases to the closest cluster of past observations. Rule induction is the basis of decision tree methods of data mining. Genetic algorithms apply to special forms of data and are often used to boost or improve the operation of other techniques.
The ability of some of these techniques to deal with the common data mining characteristics is compared in Table 4.2.
Table 4.2 General ability of data mining techniques to deal with data features
Data characteristic | Rule induction | Neural networks | Case-based reasoning | Genetic algorithms
Handle noisy data | Good | Very good | Good | Very good
Handle missing data | Good | Good | Very good | Good
Process large datasets | Very good | Poor | Good | Good
Process different data types | Good | Transform to numerical | Very good | Transformation needed
Predictive accuracy | High | Very high | High | High
Explanation capability | Very good | Poor | Very good | Good
Ease of integration | Good | Good | Good | Very good
Ease of operation | Easy | Difficult | Easy | Difficult
Table 4.2 demonstrates that there are different tools for different types of problems. Especially noisy data can cause difficulties for classical statistical methods such as regression, cluster analysis, and discriminant analysis. Rule induction and case-based reasoning can deal with such problems, but if the noise is false information, the resulting rules may conclude the wrong things. Neural networks and genetic algorithms have proven useful relative to the classical methods in environments where the data is complex, including nonlinear interactions among variables.
Neural networks have relative disadvantages in dealing with very large numbers of variables, as computational complexity increases dramatically. Genetic algorithms require a specific data structure to operate, and it is not always easy to transform data to accommodate this requirement.
Another negative feature of neural networks is their hidden nature. Because of the large number of node connections, it is impractical to print out and analyze a large neural network model, which makes it difficult to transport a model built on one system to another. New data must therefore be entered on the system where the neural network model was built, making it nearly impossible to apply such models outside the system on which they were built.
Data Mining Functions
Problem types can be described in four categories: prediction, classification, detection, and association.
Table 4.3 compares the common techniques and applications by business area.
Table 4.3 Data mining applications by method
Area | Technique | Application | Problem type
Finance | Neural network | Forecast stock price | Prediction
 | Neural network, Rule induction | Forecast bankruptcy | Prediction
 | | Forecast price index futures | Prediction
 | | Fraud detection | Detection
 | Neural network, Case-based reasoning | Forecast interest rates | Prediction
 | Neural network, Visualization | Delinquent bank loan detection | Detection
 | Rule induction | Forecast defaulting loans | Prediction
 | | Credit assessment | Prediction
 | | Portfolio management | Prediction
 | | Risk classification | Classification
 | | Financial customer classification | Classification
 | Rule induction, Case-based reasoning | Corporate bond rating | Prediction
 | Rule induction, Visualization | Loan approval | Prediction
Telecom | Neural network, Rule induction | Forecast network behavior | Prediction
 | Rule induction | Churn management | Classification
 | | Fraud detection | Detection
 | Case-based reasoning | Call tracking | Classification
Marketing | Rule induction | Market segmentation | Classification
 | | Cross-selling improvement | Association
 | Rule induction, Visualization | Lifestyle behavior analysis | Classification
 | | Product performance analysis | Association
 | Rule induction, Genetic algorithm, Visualization | Customer reaction to promotion | Prediction
 | Case-based reasoning | Online sales support | Classification
Web | Rule induction, Visualization | User browsing similarity analysis | Classification, Association
 | Rule-based heuristics | Web page content similarity | Association
Others | Neural network | Software cost estimation | Detection
 | Neural network, Rule induction | Litigation assessment | Prediction
 | Rule induction | Insurance fraud detection | Detection
 | | Healthcare exception reporting | Detection
 | Case-based reasoning | Insurance claim estimation | Prediction
 | | Software quality control | Classification
 | Genetic algorithms | Budget expenditure | Classification
Many of these applications combine techniques, including visualization and statistical analysis. The point is that many data mining tools are available for a variety of functional purposes, spanning almost every area of human endeavor (including business). This section of the book seeks to demonstrate how these primary data mining tools work.
Demonstration Datasets
We will use some simple models to demonstrate the concepts. These datasets were generated by the authors to reflect important business applications. The first model involves loan applicants, with 20 observations for building the model and 10 applicants serving as a test dataset. The second dataset represents job applicants: 10 observations with known outcomes serve as the training set, with 5 additional cases in the test set. A third dataset, of insurance claims, has 10 known outcomes for training and 5 observations in the test set. All three models will then be applied to new cases.
Larger datasets for each of these three cases will be provided as well as a dataset on expenditure data. These larger datasets will be used in various chapters to demonstrate methods.
Loan Analysis Data
This dataset (Table 4.4) consists of information on applicants for appliance loans. The full dataset involves 650 past observations. Applicant information on age, income, assets, debts, and credit rating (from a credit bureau, with red for bad credit, yellow for some credit problems, and green for a clean credit record) is assumed available from loan applications. Variable Want is the amount requested in the appliance loan application. For past observations, variable On-time is 1 if all payments were received on time and 0 if not (Late or Default). The majority of past loans were paid on time. Data was transformed to obtain categorical data for some of the techniques. Age was grouped as under 30 (young), 60 or over (old), and in between (middle-aged). Income was grouped as $30,000 per year or less (low income), $80,000 per year or more (high income), and in between (average). Assets, debts, and loan amount (variable Want) were combined by rule to generate the categorical variable Risk: High if debts exceeded assets, Low if assets exceeded the sum of debts plus the amount requested, and Average in between.
Table 4.4 Loan analysis training dataset
Age | Income | Assets | Debts | Want | Risk | Credit | Result
20 (young) | 17,152 (low) | 11,090 | 20,455 | 400 | High | Green | On-time
23 (young) | 25,862 (low) | 24,756 | 30,083 | 2,300 | High | Green | On-time
28 (young) | 26,169 (low) | 47,355 | 49,341 | 3,100 | High | Yellow | Late
23 (young) | 21,117 (low) | 21,242 | 30,278 | 300 | High | Red | Default
22 (young) | 7,127 (low) | 23,903 | 17,231 | 900 | Low | Yellow | On-time
26 (young) | 42,083 (average) | 35,726 | 41,421 | 300 | High | Red | Late
24 (young) | 55,557 (average) | 27,040 | 48,191 | 1,500 | High | Green | On-time
27 (young) | 34,843 (average) | 0 | 21,031 | 2,100 | High | Red | On-time
29 (young) | 74,295 (average) | 88,827 | 100,599 | 100 | High | Yellow | On-time
23 (young) | 38,887 (average) | 6,260 | 33,635 | 9,400 | Low | Green | On-time
28 (young) | 31,758 (average) | 58,492 | 49,268 | 1,000 | Low | Green | On-time
25 (young) | 80,180 (high) | 31,696 | 69,529 | 1,000 | High | Green | Late
33 (middle) | 40,921 (average) | 91,111 | 90,076 | 2,900 | Average | Yellow | Late
36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300 | Low | Green | On-time
39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600 | Low | Green | On-time
39 (middle) | 125,713 (high) | 382,180 | 315,396 | 5,200 | Low | Yellow | On-time
55 (middle) | 80,149 (high) | 511,937 | 21,923 | 1,000 | Low | Green | On-time
62 (old) | 101,291 (high) | 783,164 | 23,052 | 1,800 | Low | Green | On-time
71 (old) | 81,723 (high) | 776,344 | 20,277 | 900 | Low | Green | On-time
63 (old) | 99,522 (high) | 783,491 | 24,643 | 200 | Low | Green | On-time
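The grouping rules for age, income, and risk can be written directly as code. This sketch simply restates the chapter's categorization rules; it introduces no assumptions beyond them.

```python
def categorize(age, income, assets, debts, want):
    """Apply the chapter's grouping rules to one loan application."""
    # Age: under 30 young, 60 or over old, otherwise middle-aged
    age_cat = "young" if age < 30 else ("old" if age >= 60 else "middle")
    # Income: $30,000 or less low, $80,000 or more high, otherwise average
    if income <= 30000:
        inc_cat = "low"
    elif income >= 80000:
        inc_cat = "high"
    else:
        inc_cat = "average"
    # Risk: High if debts exceed assets, Low if assets exceed debts + want
    if debts > assets:
        risk = "High"
    elif assets > debts + want:
        risk = "Low"
    else:
        risk = "Average"
    return age_cat, inc_cat, risk

# First training record from Table 4.4
print(categorize(20, 17152, 11090, 20455, 400))  # → ('young', 'low', 'High')
```

Running this over the training records reproduces the categorical columns of Table 4.4.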
Table 4.5 gives a test set of data.
Table 4.5 Loan analysis test data
Age | Income | Assets | Debts | Want | Risk | Credit | Result
37 (middle) | 37,214 (average) | 123,420 | 106,241 | 4,100 | Low | Green | On-time
45 (middle) | 57,391 (average) | 250,410 | 191,879 | 5,800 | Low | Green | On-time
45 (middle) | 36,692 (average) | 175,037 | 137,800 | 3,400 | Low | Green | On-time
25 (young) | 67,808 (average) | 25,174 | 61,271 | 3,100 | High | Yellow | On-time
36 (middle) | 102,143 (high) | 246,148 | 231,334 | 600 | Low | Green | On-time
29 (young) | 34,579 (average) | 49,387 | 59,412 | 4,600 | High | Red | On-time
26 (young) | 22,958 (low) | 29,878 | 36,508 | 400 | High | Yellow | Late
34 (middle) | 42,526 (average) | 109,934 | 92,494 | 3,700 | Low | Green | On-time
28 (young) | 80,019 (high) | 78,632 | 100,957 | 12,800 | High | Green | On-time
32 (middle) | 57,407 (average) | 117,062 | 101,967 | 100 | Low | Green | On-time
The model can be applied to the new applicants given in Table 4.6.
Table 4.6 New appliance loan analysis
Age | Income | Assets | Debts | Want | Credit
25 | 28,650 | 9,824 | 2,000 | 10,000 | Green
30 | 35,760 | 12,974 | 32,634 | 4,000 | Yellow
32 | 41,862 | 625,321 | 428,643 | 3,000 | Red
36 | 36,843 | 80,431 | 120,643 | 12,006 | Green
37 | 62,743 | 421,753 | 321,845 | 5,000 | Yellow
37 | 53,869 | 286,375 | 302,958 | 4,380 | Green
37 | 70,120 | 484,264 | 303,958 | 6,000 | Green
38 | 60,429 | 296,843 | 185,769 | 5,250 | Green
39 | 65,826 | 321,959 | 392,817 | 12,070 | Green
40 | 90,426 | 142,098 | 25,426 | 1,280 | Yellow
40 | 70,256 | 528,493 | 283,745 | 3,280 | Green
42 | 58,326 | 328,457 | 120,849 | 4,870 | Green
42 | 61,242 | 525,673 | 184,762 | 3,300 | Green
42 | 39,676 | 326,346 | 421,094 | 1,290 | Red
43 | 102,496 | 823,532 | 175,932 | 3,370 | Green
43 | 80,376 | 753,256 | 239,845 | 5,150 | Yellow
44 | 74,623 | 584,234 | 398,456 | 1,525 | Green
45 | 91,672 | 436,854 | 275,632 | 5,800 | Green
52 | 120,721 | 921,482 | 128,573 | 2,500 | Yellow
63 | 86,521 | 241,689 | 5,326 | 30,000 | Green
Job Application Data
The second dataset involves 500 past job applicants. Variables are:
Age | integer, 20 to 65 |
State | state of origin |
Degree | Cert | Professional certification
 | UG | Undergraduate degree
 | MBA | Masters in Business Administration
 | MS | Masters of Science
 | PhD | Doctorate
Major | none |
 | Engr | Engineering
 | Sci | Science or Math
 | Csci | Computer Science
 | BusAd | Business Administration
 | IS | Information Systems
Experience | integer | Years of experience in this field
Outcome | ordinal | Unacceptable
 | | Minimal
 | | Adequate
 | | Excellent
Table 4.7 gives the 10 observations in the learning set.
Table 4.7 Job applicant training dataset
Record | Age | State | Degree | Major | Experience (in years) | Outcome
1 | 27 | CA | BS | Engineering | 2 | Excellent
2 | 33 | NV | MBA | Business Administration | 5 | Adequate
3 | 30 | CA | MS | Computer Science | 0 | Adequate
4 | 22 | CA | BS | Information Systems | 0 | Unacceptable
5 | 28 | CA | BS | Information Systems | 2 | Minimal
6 | 26 | CA | MS | Business Administration | 0 | Excellent
7 | 25 | CA | BS | Engineering | 3 | Adequate
8 | 28 | OR | MS | Computer Science | 2 | Adequate
9 | 25 | CA | BS | Information Systems | 2 | Minimal
10 | 24 | CA | BS | Information Systems | 1 | Adequate
Notice that some of these variables are quantitative and others are nominal. State, degree, and major are nominal. There is no information content intended by state or major. State is not expected to have a specific order prior to analysis, nor is major. (The analysis may conclude that there is a relationship between state, major, and outcome, however.) Degree is ordinal, in that MS and MBA are higher degrees than BS. However, as with state and major, the analysis may find a reverse relationship with the outcome.
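In preprocessing, this distinction is commonly handled by integer-coding the ordinal variable and one-hot encoding the nominal ones. The particular ranks below (for example, treating MBA and MS as equal) are an assumption made for illustration, not a prescription from the chapter.

```python
# Ordinal variable: degree levels have a meaningful order, so map to ranks.
# Treating MBA and MS as the same rank is an assumption of this sketch.
degree_rank = {"Cert": 0, "UG": 1, "BS": 1, "MBA": 2, "MS": 2, "PhD": 3}

# Nominal variable: state has no inherent order, so use one-hot indicators.
# Only the states appearing in the training set are listed here.
states = ["CA", "NV", "OR"]

def one_hot(state):
    return [1 if state == s else 0 for s in states]

print(degree_rank["MS"], one_hot("NV"))  # → 2 [0, 1, 0]
```

The analysis itself may still find relationships that run counter to the assumed order, as the text notes.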
Table 4.8 gives the test dataset for this case.
Table 4.8 Job applicant test dataset
Record | Age | State | Degree | Major | Experience (in years) | Outcome
11 | 36 | CA | MS | Information Systems | 0 | Minimal
12 | 28 | OR | BS | Computer Science | 5 | Unacceptable
13 | 24 | NV | BS | Information Systems | 0 | Excellent
14 | 33 | CA | BS | Engineering | 2 | Adequate
15 | 26 | CA | BS | Business Administration | 3 | Minimal
Table 4.9 provides a set of new job applicants to be classified by predicted job performance.
Table 4.9 New job applicant set
Age | State | Degree | Major | Experience (in years)
28 | CA | MBA | Engr | 0
26 | NM | UG | Sci | 3
33 | TX | MS | Engr | 6
21 | CA | Cert | none | 0
26 | OR | Cert | none | 5
25 | CA | UG | BusAd | 0
32 | AR | UG | Engr | 8
41 | PA | MBA | BusAd | 2
29 | CA | UG | Sci | 6
28 | WA | UG | Csci | 3
Insurance Fraud Data
The third dataset involves insurance claims. The full dataset includes 5,000 past claims with known outcomes. Variables include the claimant age, gender, amount of insurance claim, number of traffic tickets currently on record (less than 3 years old), number of prior accident claims of the type insured, and attorney (if any). Table 4.10 gives the training dataset.
Table 4.10 Training dataset—Insurance claims
Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome
52 | Male | 2,000 | 0 | 1 | Jones | OK
38 | Male | 1,800 | 0 | 0 | None | OK
21 | Female | 5,600 | 1 | 2 | Smith | Fraudulent
36 | Female | 3,800 | 0 | 1 | None | OK
19 | Male | 600 | 2 | 2 | Adams | OK
41 | Male | 4,200 | 1 | 2 | Smith | Fraudulent
38 | Male | 2,700 | 0 | 0 | None | OK
33 | Female | 2,500 | 0 | 1 | None | Fraudulent
18 | Female | 1,300 | 0 | 0 | None | OK
26 | Male | 2,600 | 2 | 0 | None | OK
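Before any of the methods in later chapters can be applied, the categorical fields of this dataset need numeric encoding. A minimal sketch, using the training rows of Table 4.10; collapsing the attorney name to a has-attorney flag is one plausible encoding choice, not the book's prescription.

```python
# Training rows from Table 4.10:
# (age, gender, amount, tickets, prior claims, attorney, outcome)
claims = [
    (52, "Male", 2000, 0, 1, "Jones", "OK"),
    (38, "Male", 1800, 0, 0, "None", "OK"),
    (21, "Female", 5600, 1, 2, "Smith", "Fraudulent"),
    (36, "Female", 3800, 0, 1, "None", "OK"),
    (19, "Male", 600, 2, 2, "Adams", "OK"),
    (41, "Male", 4200, 1, 2, "Smith", "Fraudulent"),
    (38, "Male", 2700, 0, 0, "None", "OK"),
    (33, "Female", 2500, 0, 1, "None", "Fraudulent"),
    (18, "Female", 1300, 0, 0, "None", "OK"),
    (26, "Male", 2600, 2, 0, "None", "OK"),
]

def encode(row):
    """Map one claim to a numeric feature vector (outcome dropped)."""
    age, gender, amount, tickets, prior, attorney, _ = row
    return [age, 1 if gender == "Male" else 0, amount,
            tickets, prior, 0 if attorney == "None" else 1]

encoded = [encode(r) for r in claims]
print(encoded[0])  # → [52, 1, 2000, 0, 1, 1]
```

Any of the classification techniques of later chapters can then be trained on these vectors against the OK/Fraudulent outcome.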
The test set is given in Table 4.11.
Table 4.11 Test dataset—Insurance claims
Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney | Outcome
23 | Male | 2,800 | 1 | 0 | None | OK
31 | Female | 1,400 | 0 | 0 | None | OK
28 | Male | 4,200 | 2 | 3 | Smith | Fraudulent
19 | Male | 2,800 | 0 | 1 | None | OK
41 | Male | 1,600 | 0 | 0 | Henry | OK
A set of new claims is given in Table 4.12.
Table 4.12 New insurance claims
Claimant age | Gender | Claim amount | Tickets | Prior claims | Attorney
23 | Male | 1,800 | 1 | 1 | None
32 | Female | 2,100 | 0 | 0 | None
20 | Female | 1,600 | 0 | 0 | None
18 | Female | 3,300 | 2 | 0 | None
55 | Male | 4,000 | 0 | 0 | Smith
41 | Male | 2,600 | 1 | 1 | None
38 | Female | 3,100 | 0 | 0 | None
21 | Male | 2,500 | 1 | 0 | None
16 | Female | 4,500 | 1 | 2 | Gold
24 | Male | 2,600 | 1 | 1 | None
Expenditure Data
This dataset represents consumer data for a community, gathered by a hypothetical market research company in a moderate-sized city. Ten thousand observations have been gathered over the following variables:
DEMOGRAPHIC | Age | integer, 16 and up
 | Gender | 0-female, 1-male
 | Marital status | 0-single, 0.5-divorced, 1-married
 | Dependents | Number of dependents
 | Income | Annual income in dollars
 | Job yrs | Years in the current job (integer)
 | Town yrs | Years in this community
 | Yrs Ed | Years of education completed
 | Dri Lic | Driver's license (0-no, 1-yes)
 | Own Home | 0-no, 1-yes
 | #Cred C | Number of credit cards
CONSUMER | Churn | Number of credit card balances canceled last year
 | ProGroc | Proportion of income spent at grocery stores
 | ProRest | Proportion of income spent at restaurants
 | ProHous | Proportion of income spent on housing
 | ProUtil | Proportion of income spent on utilities
 | ProAuto | Proportion of income spent on automobiles (owned and operated)
 | ProCloth | Proportion of income spent on clothing
 | ProEnt | Proportion of income spent on entertainment
This dataset can be used for a number of studies, addressing questions such as which types of customers are most likely to seek restaurants, what the market for home furnishings might be, which types of customers are most likely to be interested in clothing or entertainment, and how spending relates to demographic variables.
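Segmentation questions like these are often approached with cluster analysis. The sketch below runs a bare-bones k-means on invented stand-ins for two of the proportion variables (ProRest and ProEnt); the numbers are synthetic, purely to show the mechanics of grouping customers by spending pattern.

```python
# Synthetic stand-ins for (ProRest, ProEnt) proportions — illustrative only
points = [(0.02, 0.01), (0.03, 0.02), (0.02, 0.02),   # light spenders
          (0.12, 0.10), (0.11, 0.09), (0.13, 0.11)]   # frequent diners

def kmeans(points, k, iters=20):
    """Bare-bones k-means with a deterministic start (for the demo)."""
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            groups[i].append(p)
        # Recompute each center as the mean of its group
        centers = [tuple(sum(vals) / len(g) for vals in zip(*g)) if g
                   else centers[j] for j, g in enumerate(groups)]
    return centers, groups

centers, groups = kmeans(points, 2)
```

On this toy input the two groups separate the light spenders from the frequent diners; on the real 10,000-observation dataset the resulting clusters would serve as market segments.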
Bankruptcy Data
This data concerns 100 U.S. firms that underwent bankruptcy.1 All of the sample data come from U.S. companies. About 400 bankrupt company names were obtained using google.com, and the ticker symbol of each company was then identified through the Compustat database. Only companies that went bankrupt between January 2006 and December 2009 were retained, in the hope that the economic crisis of that period would yield distinctive results; 99 companies remained after this step. The ticker list was then submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009. These financial data and ratios are the factors from which company bankruptcy can be predicted. The factors collected are based on the literature and include total assets, book value per share, inventories, liabilities, receivables, cost of goods sold, total dividends, earnings before interest and taxes, gross profit (loss), net income (loss), operating income after depreciation, total revenue, sales, dividends per share, and total market value. A match for scale and size was made at a 1:2 ratio, meaning the same financial ratios were collected for 200 nonfailed companies over the same period. The LexisNexis database was first used to find companies with Securities and Exchange Commission filings after June 2010, indicating that they were still active; 200 companies were selected from the results and their CIK codes compiled. Finally, the CIK code list was submitted to the Compustat database to obtain financial data and ratios for January 2005 through December 2009, the same period used for the failed companies.
The dataset consists of 1,321 records with full data over 19 attributes, as shown in Table 4.13. The outcome attribute in bankruptcy has a value of 1 if the firm went bankrupt by 2011 (697 cases) and a value of 0 if it did not (624 cases).
Table 4.13 Attributes in bankruptcy data
No. | Short name | Long name
1 | fyear | Data year—fiscal
2 | cik | CIK number
3 | at | Assets—total
4 | bkvlps | Book value per share
5 | invt | Inventories—total
6 | Lt | Liabilities—total
7 | rectr | Receivables—trade
8 | cogs | Cost of goods sold
9 | dvt | Dividends—total
10 | ebit | Earnings before interest and taxes
11 | gp | Gross profit (loss)
12 | ni | Net income (loss)
13 | oiadp | Operating income after depreciation
14 | revt | Revenue—total
15 | sale | Sales-turnover (net)
16 | dvpsx_f | Dividends per share—ex-date—fiscal
17 | mkvalt | Market value—total—fiscal
18 | prch_f | Price high—annual—fiscal
19 | bankruptcy | Bankruptcy (output variable)
This is real data concerning firm bankruptcy, which could be updated by going to the web sources.
Summary
A number of tools are available for data mining, accomplishing a variety of functions. These tools come from statistics, operations research, and artificial intelligence, providing analytical techniques for tasks such as cluster identification, discriminant analysis, and development of association rules. Data mining software provides powerful means to apply these tools to large sets of data, giving organizational management a way to cope with an overwhelming glut of data and to convert some of that glut into useful knowledge.
This chapter began with an overview of tools and functions. It also previewed four datasets that are used in subsequent chapters, plus a fifth dataset of real firm bankruptcies available for use. These datasets are small, but they give readers a view of the type of data typically encountered in data mining studies.