Chapter 3

Data Mining Processes and Knowledge Discovery

A general process is useful for conducting data mining analysis. This chapter describes an industry-standard process that is widely used, as well as a shorter process promoted by a software vendor. While not every step is needed in every analysis, these processes give good coverage of the steps required, from data collection and exploration through data processing, analysis, drawing of inferences, and implementation.

Two standard processes for data mining have been presented. CRISP-DM (cross-industry standard process for data mining) is an industry standard, and SEMMA (sample, explore, modify, model, and assess) was developed by the SAS Institute Inc., a leading vendor of data mining software (and a premier statistical software vendor). Table 3.1 gives a brief description of the phases of each process. You can see that they are basically similar, differing mainly in emphasis.


Table 3.1 CRISP-DM and SEMMA

CRISP-DM                  SEMMA
Business understanding    Assumes well-defined questions
Data understanding        Sample
Data preparation          Explore
Modeling                  Modify data
Evaluation                Model
Deployment                Assess



Industry surveys indicate that CRISP-DM is used by over 70 percent of industry professionals, while about half also report using their own methodologies. SEMMA has a lower reported usage, according to the KDnuggets.com survey.

CRISP-DM

CRISP-DM is widely used across industry. The model consists of six phases, intended as a cyclical process, as shown in Figure 3.1.

Figure 3.1 CRISP-DM process

This six-phase process is not a rigid, by-the-numbers procedure. There is usually a great deal of backtracking. Additionally, experienced analysts may not need to apply each phase for every study. But, CRISP-DM provides a useful framework for data mining.

Business Understanding

The key element of a data mining study is understanding the purpose of the study. This begins with a managerial need for new knowledge and an expression of the business objective of the study to be undertaken. Specific goals are needed, such as which types of customers are interested in each of our products, what the typical profiles of our customers are, and how much value each of them provides to us. Then, a plan for finding such knowledge needs to be developed, specifying who is responsible for collecting data, analyzing data, and reporting. At this stage, a budget to support the study should be established, at least in preliminary terms.

Data Understanding

Once the business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and verification of data quality. Data exploration, such as viewing summary statistics (including visual displays of categorical variables), can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data.

Data sources for data selection can vary. Normally, the types of data sources for business applications include demographic data (such as income, education, number of households, and age), sociographic data (such as hobbies, club memberships, and entertainment), transactional data (sales records, credit card spending, and issued checks), and so on. Data can be categorized as quantitative or qualitative. Quantitative data is measurable by numerical values; it can be either discrete (such as integers) or continuous (such as real numbers). Qualitative data, also known as categorical data, includes both nominal and ordinal data. Nominal data has finite, nonordered values, such as gender data with two values: male and female. Ordinal data has finite, ordered values. For example, customer credit ratings are ordinal data, since ratings can be excellent, fair, or bad.
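As a minimal sketch of how these distinctions can be made explicit in software (using the Python pandas library; the column names and values here are hypothetical):

    # Sketch of encoding quantitative, nominal, and ordinal data with pandas;
    # the customer table and its values are hypothetical.
    import pandas as pd

    customers = pd.DataFrame({
        "income": [42000.0, 57500.0, 61000.0],          # quantitative (continuous)
        "households": [1, 2, 4],                         # quantitative (discrete)
        "gender": ["male", "female", "female"],          # qualitative, nominal
        "credit_rating": ["fair", "excellent", "bad"],   # qualitative, ordinal
    })

    # Nominal: finite, nonordered values.
    customers["gender"] = customers["gender"].astype("category")

    # Ordinal: finite, ordered values (bad < fair < excellent).
    customers["credit_rating"] = pd.Categorical(
        customers["credit_rating"],
        categories=["bad", "fair", "excellent"],
        ordered=True,
    )

    print(customers.dtypes)
    print(customers["credit_rating"].min())  # ordered data supports comparisons: 'bad'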

Data Preparation

The purpose of data preprocessing is to clean the selected data for better quality. Some selected data may have different formats because they are drawn from different data sources. If the selected data come from flat files, voice messages, and web texts, they should be converted to a consistent electronic format. In general, data cleaning means filtering the data, aggregating the data, and filling in missing values (imputation). In filtering, the selected data are examined for outliers and redundancies. Outliers differ greatly from the majority of the data, or are clearly out of the range of the selected data groups. For example, if the recorded income of a customer assigned to the middle class is $250,000, it is out of range and should be removed from a data mining project examining aspects of the middle class. Outliers may be caused by many reasons, such as human or technical errors, or may occur naturally in a dataset due to extreme events. Suppose the age of a credit card holder is recorded as 12. This is likely a human error. However, there may be such an independently wealthy preteenager with important purchasing habits. Arbitrarily deleting this outlier could lose valuable information.

Redundant data are the same information recorded in several different ways. The daily sales of a particular product are redundant with the seasonal sales of the same product, because the seasonal figures can be derived from the daily data. By aggregating data, the data dimensions are reduced; although an aggregated dataset has a smaller volume, the essential information remains. If a marketing promotion for furniture sales is being considered over the next three or four years, the available daily sales data can be aggregated into annual sales data, dramatically reducing the size of the data. By filling in missing values (imputation), the missing values in the selected data are identified and reasonable replacement values are added, such as the mean or the mode of the variable. A missing value often prevents a solution when a data mining algorithm is applied to discover knowledge patterns.
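As a brief sketch of these two operations (aggregation and imputation), assuming a pandas DataFrame of daily sales with hypothetical columns:

    # Sketch of aggregation and missing-value imputation with pandas;
    # the daily_sales table and its values are hypothetical.
    import pandas as pd
    import numpy as np

    daily_sales = pd.DataFrame({
        "date": pd.to_datetime(["2023-01-05", "2023-06-17", "2024-02-03", "2024-09-21"]),
        "sales": [1200.0, np.nan, 950.0, 1100.0],
    })

    # Aggregation: reduce daily records to annual totals, shrinking data volume
    # while keeping the information needed for a multiyear promotion decision.
    annual_sales = (
        daily_sales
        .assign(year=daily_sales["date"].dt.year)
        .groupby("year")["sales"]
        .sum()
    )

    # Imputation: fill missing daily values with the mean (the mode could be
    # used instead for categorical variables).
    daily_sales["sales"] = daily_sales["sales"].fillna(daily_sales["sales"].mean())

    print(annual_sales)
    print(daily_sales)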

Data can be expressed in a number of different forms. For instance, in CLEMENTINE, the following data types can be used:

  • RANGE: numeric values (integer, real, or date and time).
  • FLAG: binary data with two outcomes, such as yes or no, or 0 or 1 (text, integer, real number, or date and time).
  • SET: data with distinct multiple values (numeric, string, or date and time).
  • TYPELESS: for other types of data.

Usually, we think of data as real numbers, such as age in years or annual income in dollars (we would use RANGE in those cases). Sometimes, variables occur as either-or types, such as having a driving license or not, or an insurance claim being fraudulent or not. This case could be dealt with using real numeric values (such as 0 or 1), but it is more efficient to treat them as FLAG variables. Often it is more appropriate to deal with categorical data, such as age in terms of the set {young, middle-aged, elderly} or income in the set {low, middle, high}. In that case, we could group the data and assign the appropriate category as a string, using a SET. The most complete form is RANGE, but sometimes data does not come in that form, and analysts are forced to use SET or FLAG types. Sometimes, it may actually be more accurate to deal with SET data types than RANGE data types.
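For instance, grouping a RANGE variable such as age into a SET of categories amounts to binning; a minimal sketch in pandas, with arbitrary illustrative cut points, follows:

    # Sketch of converting a RANGE-type variable (age in years) into a SET-type
    # variable; the cut points 35 and 60 are arbitrary illustrative choices.
    import pandas as pd

    ages = pd.Series([23, 41, 58, 67, 30])
    age_group = pd.cut(
        ages,
        bins=[0, 35, 60, 120],
        labels=["young", "middle-aged", "elderly"],
    )
    print(age_group)

    # A FLAG-type variable with two outcomes (e.g., holds a driving license).
    has_license = pd.Series([1, 0, 1, 1, 0]).astype(bool)
    print(has_license)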

As another example, PolyAnalyst (a typical treatment) has the following data types available:

  • numerical: continuous values
  • integer: integer values
  • yes or no: binary data
  • category: a finite set of possible values
  • date
  • string
  • text

Each software tool will have a different data scheme, but the primary types of data dealt with are represented in these two lists.

There are many statistical methods and visualization tools that can be used to preprocess the selected data. Common statistics, such as max, min, mean, and mode, can be readily used to aggregate or smooth the data, while scatter plots and box plots are usually used to filter outliers. More advanced techniques, including regression analysis, cluster analysis, decision tree, or hierarchical analysis, may be applied in data preprocessing depending on the requirements for the quality of the selected data. Because data preprocessing is detailed and tedious, it demands a great deal of time. In some cases, data preprocessing could take over 50 percent of the time of the entire data mining process. Shortening data processing time can reduce much of the total computation time in data mining. The simple and standard data format resulting from data preprocessing can provide an environment of information sharing across different computer systems, which creates the flexibility to implement various data mining algorithms or tools.
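As a small sketch of such preprocessing checks, the summary statistics and the 1.5 × IQR rule that underlies box plots can be computed directly (the income values and the 1.5 × IQR threshold are illustrative assumptions):

    # Sketch of screening a numeric variable for outliers using summary
    # statistics and the 1.5 * IQR rule used by box plots; values are hypothetical.
    import pandas as pd

    income = pd.Series([38000, 42000, 45000, 47000, 51000, 250000])

    print(income.describe())  # min, max, mean, quartiles

    q1, q3 = income.quantile(0.25), income.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = income[(income < lower) | (income > upper)]
    print(outliers)  # flags 250000 for review rather than automatic deletion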

Modeling

Data modeling is where the data mining software is used to generate results for various situations. Cluster analysis and/or visual exploration of the data is usually applied first. Depending on the type of data, various models might then be applied. If the task is to group data and the groups are given, discriminant analysis might be appropriate. If the purpose is estimation, regression is appropriate if the data is continuous (and logistic regression, if not). Neural networks could be applied for both tasks. Decision trees are yet another tool to classify data. Other modeling tools are available as well. The point of data mining software is to allow the user to work with the data to gain understanding. This is often fostered by the iterative use of multiple models.

Data treatment: Data mining is essentially statistical analysis, usually applied to very large datasets. The standard process of data mining is to take this large set of data and divide it, using a portion of the data (the training set) to develop the model (no matter which modeling technique is used) and reserving a portion of the data (the test set) for testing the model that is built. The principle is that if you build a model on a particular set of data and evaluate it on that same data, it will of course appear to test quite well. By dividing the data, using part of it for model development, and testing the model on a separate set of data, a more convincing test of model accuracy is obtained.
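A minimal sketch of this division, using scikit-learn and hypothetical feature and target arrays:

    # Sketch of dividing data into training and test sets with scikit-learn;
    # X and y stand in for the features and target of an actual study.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))          # hypothetical predictor variables
    y = rng.integers(0, 2, size=1000)       # hypothetical binary outcome

    # Roughly two-thirds for model development, one-third reserved for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    print(X_train.shape, X_test.shape)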

This idea of splitting the data into components is often carried to additional levels in the practice of data mining. Further portions of the data can be used for refinement of the model.

Data mining techniques: Data mining can be achieved by association, classification, clustering, predictions, sequential patterns, and similar time sequences.

In association, the relationship of some item in a data transaction with other items in the same transaction is used to predict patterns. For example, if a customer purchases a laptop PC (X), then he or she also buys a mouse (Y) in 60 percent of the cases. This pattern occurs in 5.6 percent of laptop PC purchases. An association rule in this situation can be “X implies Y, where 60 percent is the confidence factor and 5.6 percent is the support factor.” When the confidence factor and support factor are represented by linguistic variables “high” and “low,” respectively, the association rule can be written in the fuzzy logic form, such as “when the support factor is low, X implies Y is high.” In the case of many qualitative variables, fuzzy association is a necessary and promising technique in data mining.
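Support and confidence can be computed directly from transaction counts; the following sketch uses a handful of made-up transactions rather than the figures quoted above:

    # Sketch of computing support and confidence for the rule "X implies Y"
    # from a list of transactions; the transactions themselves are made up.
    transactions = [
        {"laptop", "mouse"},
        {"laptop", "mouse", "bag"},
        {"laptop"},
        {"phone", "charger"},
        {"laptop", "mouse"},
    ]

    n = len(transactions)
    count_x = sum(1 for t in transactions if "laptop" in t)
    count_xy = sum(1 for t in transactions if {"laptop", "mouse"} <= t)

    support = count_xy / n           # share of all transactions containing X and Y
    confidence = count_xy / count_x  # share of X transactions that also contain Y

    print(f"support = {support:.2f}, confidence = {confidence:.2f}")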

Sequential pattern analysis seeks to find similar patterns in data transactions over a business period. These patterns can be used by business analysts to identify relationships among data. The mathematical models behind sequential patterns include logic rules, fuzzy logic, and so on. As an extension of sequential patterns, similar time sequences are applied to discover sequences similar to a known sequence over past and current business periods. In the data mining stage, several similar sequences can be studied to identify future trends in transaction development. This approach is useful in dealing with databases that have time-series characteristics.

We have already discussed the important tools of clustering, prediction, and classification in Chapter 2.

Evaluation

The data interpretation stage is critical: it assimilates knowledge from the mined data. There are two essential issues. One is how to recognize the business value of the knowledge patterns discovered in the data mining stage. The other is which visualization tool should be used to show the data mining results. Determining the business value of discovered knowledge patterns is similar to assembling a puzzle: the mined data must be put together in light of a business purpose. This operation depends on the interaction among data analysts, business analysts, and decision makers (such as managers or CEOs). Because data analysts may not be fully aware of the business purpose of the data mining effort, while business analysts may not understand the results of sophisticated mathematical solutions, interaction between them is necessary. In order to properly interpret knowledge patterns, it is also necessary to choose an appropriate visualization tool. Many visualization packages and tools are available, including pie charts, histograms, box plots, scatter plots, and distributions. A good interpretation will lead to productive business decisions, while a poor interpretation may miss useful information. Normally, the simpler the graphical interpretation, the easier it is for end users to understand.

Deployment

Deployment is the act of using data mining analyses. New knowledge generated by the study needs to be related to the original project goals. Once developed, models need to be monitored for performance. The patterns and relationships developed based on the training set need to be changed if the underlying conditions generating the data change. For instance, if the customer profile performance changes due to the changes in the economic conditions, the predicted rates of response cannot be expected to remain the same. Thus, it is important to check the relative accuracy of data mining models and adjust them to new conditions, if necessary.

SEMMA

In order to be applied successfully, the data mining solution must be viewed as a process rather than as a set of tools or techniques. In addition to CRISP-DM, there is another well-known methodology developed by the SAS Institute Inc., called SEMMA. The acronym SEMMA stands for sample, explore, modify, model, and assess. Beginning with a statistically representative sample of your data, SEMMA is intended to make it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and finally, confirm a model's accuracy.

By assessing the outcome of each stage in the SEMMA process, one can determine how to model new questions raised by the previous results, and thus, proceed back to the exploration phase for additional refinement of the data. That is, as is the case with CRISP-DM, SEMMA is also driven by a highly iterative experimentation cycle.

Step 1 (Sample)

This is where a portion of a large dataset (big enough to contain the significant information, yet small enough to manipulate quickly) is extracted. For optimal cost and computational performance, some (including the SAS Institute Inc.) advocate a sampling strategy that applies a reliable, statistically representative sample of the full-detail data. In the case of very large datasets, mining a representative sample instead of the whole volume may drastically reduce the processing time required to get crucial business information. If general patterns appear in the data as a whole, these will be traceable in a representative sample. If a niche (a rare pattern) is so tiny that it is not represented in a sample and yet so important that it influences the big picture, then it should be discovered using exploratory data description methods. It is also advised to create partitioned datasets for better accuracy assessment (a partitioning sketch follows the list below):

  • Training—used for model fitting.
  • Validation—used for assessment and for preventing overfitting.
  • Test—used to obtain an honest assessment of how well a model generalizes.
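A minimal partitioning sketch along these lines, assuming the sample is held in arrays X and y and using illustrative 60/20/20 proportions:

    # Sketch of partitioning a sample into training, validation, and test sets;
    # the 60/20/20 proportions are an illustrative assumption.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 8))        # hypothetical sampled predictors
    y = rng.integers(0, 2, size=5000)     # hypothetical target

    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    print(len(X_train), len(X_valid), len(X_test))  # 3000, 1000, 1000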

A more detailed discussion and relevant techniques for the assessment and validation of data mining models are given in Chapter 5 of this book.

Step 2 (Explore)

This is where the user searches for unanticipated trends and anomalies in order to gain a better understanding of the dataset. After sampling the data, the next step is to explore them visually or numerically for inherent trends or groupings. Exploration helps refine and redirect the discovery process. If visual exploration does not reveal clear trends, one can explore the data through statistical techniques, including factor analysis, correspondence analysis, and clustering. For example, in data mining for a direct mail campaign, clustering might reveal groups of customers with distinct ordering patterns. Limiting the discovery process to each of these distinct groups individually may increase the likelihood of finding richer patterns that would not be strong enough to be detected if the whole dataset were processed together.
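A brief sketch of such an exploratory clustering step, assuming hypothetical customer ordering variables:

    # Sketch of exploratory clustering for a direct mail campaign; the customer
    # features (order frequency, average order value) are hypothetical.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(2)
    orders_per_year = rng.poisson(6, size=300)
    avg_order_value = rng.gamma(shape=2.0, scale=40.0, size=300)
    X = np.column_stack([orders_per_year, avg_order_value])

    # Standardize so both variables contribute comparably, then look for
    # a handful of distinct ordering-pattern groups.
    X_scaled = StandardScaler().fit_transform(X)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

    print(np.bincount(kmeans.labels_))  # size of each discovered customer group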

Step 3 (Modify)

This is where the user creates, selects, and transforms the variables upon which to focus the model-construction process. Based on the discoveries in the exploration phase, one may need to manipulate data to include information, such as the grouping of customers and significant subgroups, or to introduce new variables. It may also be necessary to look for outliers and reduce the number of variables, to narrow them down to the most significant ones. One may also need to modify data when the “mined” data change. Because data mining is a dynamic, iterative process, you can update the data mining methods or models when new information is available.

Step 4 (Model)

This is where the user searches for a variable combination that reliably predicts a desired outcome. Once you prepare your data, you are ready to construct models that explain patterns in the data. Modeling techniques in data mining include artificial neural networks, decision trees, rough set analysis, support vector machines, logistic models, and other statistical models, such as time-series analysis, memory-based reasoning, and principal component analysis. Each type of model has particular strengths and is appropriate within specific data mining situations, depending on the data. For example, artificial neural networks are very good at fitting highly complex nonlinear relationships, while rough set analysis is known to produce reliable results in uncertain and imprecise problem situations.
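A compact sketch of fitting two of these model families on a prepared training set (synthetic data and scikit-learn implementations, used here purely for illustration):

    # Sketch of constructing two alternative models on the same training data;
    # the data are synthetic stand-ins for a prepared mining dataset.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    models = {
        "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "test accuracy:", round(model.score(X_test, y_test), 3))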

Step 5 (Assess)

This is where the user evaluates the usefulness and reliability of the findings from the data mining process. In this final step, the user assesses the model to estimate how well it performs. A common means of assessing a model is to apply it to a portion of the dataset set aside during the sampling stage (and not used during model building). If the model is valid, it should work for this reserved sample as well as for the sample used to construct it. Similarly, you can test the model against known data. For example, if you know which customers in a file had high retention rates and your model predicts retention, you can check whether the model selects these customers accurately. In addition, practical applications of the model, such as partial mailings in a direct mail campaign, help prove its validity.

The SEMMA approach is completely compatible with the CRISP-DM approach. Both aid the knowledge discovery process. Once the models are obtained and tested, they can then be deployed to gain value with respect to a business or research application.

Evaluation of Model Results

We demonstrate with a dataset divided into a training set (about two-thirds of 2,066 cases) and a test set (the remaining cases). Datasets are sometimes divided into three (or maybe more) groups if a lot of model development is conducted. The basic idea is to develop models on the training set and then test the resulting models on the test set. It is typical to develop multiple models (such as various decision trees, logistic regression, and neural network models) from the same training set and to evaluate their errors on the test set.

Classification errors are commonly displayed in coincidence matrixes (called confusion matrixes by some). A coincidence matrix shows the count of cases correctly classified, as well as the count of cases classified in each incorrect category. But, in many data mining studies, the model may be very good at classifying one category while very poor at classifying another. The primary value of the coincidence matrix is that it identifies what kinds of errors are made. It may be much more important to avoid one kind of error than another. Assume a loan vice president suffers a great deal more from giving a loan to someone who was expected to repay but did not than from denying a loan to an applicant who actually would have repaid. Both instances are classification errors, but in data mining, one category of error is often much more important than the other. Coincidence matrixes provide a means of focusing on what kinds of errors particular models tend to make.

When classifying data, in the simplest binary case, there are two opportunities for the model to be wrong. If the model is seeking to predict true or false, correctly classifying true is true positive (TP), and correctly classifying false is true negative (TN). One type of error is to incorrectly classify an actual false as true (false positive (FP), type I error). A second type of error is to incorrectly classify an actual true case as false (false negative (FN), type II error).

A way to reflect the relative importance of errors is through cost. This is a relatively simple idea, allowing the user to assign relative costs by type of error. For instance, if our model predicted that an account was insolvent, that might involve an average write-off of $500. On the other hand, waiting for an account that ultimately was repaid might involve a cost of $20. Thus, there would be a major difference in the cost of errors in this case. Treating a case that turned out to be repaid as a dead account would risk a loss of $480, in addition to alienating the customer (which may or may not have future profitability implications). Conversely, treating an account that was never going to be repaid as sound may involve carrying the account on the books longer than needed, at an additional cost of $20. Here, a cost function for the coincidence matrix could be:

$500 × (closing good account) + $20 × (keeping bad account open)

(Note that we used our own dollar costs for purposes of demonstration; they are not based on a real case.) This measure (like the correct classification rate) can be used to compare alternative models. We assume a model built on a training set is applied to a test set of 1,000 cases, 200 of which defaulted and 800 of which paid back (were OK); the model predicts 250 defaults in this test set. The coincidence matrix for this model is displayed in Table 3.2.


Table 3.2 The coincidence matrix—equal misclassification costs

Loans             Model default    Model OK    Total
Actual default    150              50          200
Actual OK         100              700         800
Total             250              750         1,000



The overall classification accuracy is obtained by dividing the correct number of classifications (150 + 700 = 850) by the total number of cases (1,000). Thus, the test data was correctly classified in 0.85 of the cases. The cost function value here was:

$500 × 50 + $20 × 100 = $27,000
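The accuracy and cost figures above can be reproduced directly from the counts in Table 3.2, as the following short sketch shows:

    # Reproducing the accuracy and cost calculation from Table 3.2.
    # Keys: (actual outcome, model prediction).
    matrix = {
        ("default", "default"): 150,
        ("default", "ok"): 50,
        ("ok", "default"): 100,
        ("ok", "ok"): 700,
    }
    total = sum(matrix.values())                                    # 1,000 cases
    correct = matrix[("default", "default")] + matrix[("ok", "ok")]
    accuracy = correct / total                                      # 0.85

    # Cost coefficients applied to the off-diagonal cells as in the text:
    # $500 x 50 + $20 x 100 = $27,000.
    cost = 500 * matrix[("default", "ok")] + 20 * matrix[("ok", "default")]
    print(accuracy, cost)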

There are a number of other measures obtainable from the confusion matrix. Most are self-defining, such as the following (a computational sketch follows the list):

True positive rate (TPR), equal to TP/(TP + FN) (also called sensitivity)

True negative rate (TNR), equal to TN/(FP + TN) (also called specificity)

Positive predictive value (PPV), equal to TP/(TP + FP) (also called precision)

Negative predictive value (NPV), equal to TN/(TN + FN)

False positive rate (FPR), equal to FP/(FP + TN) (also called fall-out)

False discovery rate (FDR), equal to FP/(FP + TP)

False negative rate (FNR), equal to FN/(FN + TP) (also called miss rate)

Accuracy, equal to (TP + TN)/(TP + TN + FP + FN)
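Using the counts in Table 3.2 and treating 'default' as the positive class (an assumption made here for illustration: TP = 150, FN = 50, FP = 100, TN = 700), these measures can be computed as follows:

    # Sketch of computing the listed measures from confusion matrix counts,
    # taking "default" as the positive class in Table 3.2 (an assumption).
    TP, FN, FP, TN = 150, 50, 100, 700

    tpr = TP / (TP + FN)                        # sensitivity, 0.75
    tnr = TN / (FP + TN)                        # specificity, 0.875
    ppv = TP / (TP + FP)                        # precision, 0.6
    npv = TN / (TN + FN)                        # about 0.933
    fpr = FP / (FP + TN)                        # fall-out, 0.125
    fdr = FP / (FP + TP)                        # 0.4
    fnr = FN / (FN + TP)                        # miss rate, 0.25
    accuracy = (TP + TN) / (TP + TN + FP + FN)  # 0.85

    print(tpr, tnr, ppv, npv, fpr, fdr, fnr, accuracy)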

A receiver operating characteristic (ROC) curve is obtained by plotting TPR versus FPR for various threshold settings. This is equivalent to plotting the cumulative distribution function of the detection probability on the y axis versus the cumulative distribution of the false-alarm probability on the x axis.
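A minimal sketch of producing an ROC curve from a model's predicted scores, using synthetic labels and scores and scikit-learn's roc_curve:

    # Sketch of building an ROC curve by sweeping a classification threshold;
    # the true labels and predicted scores here are synthetic.
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(3)
    y_true = rng.integers(0, 2, size=500)
    # Scores loosely correlated with the true label, standing in for model output.
    y_score = 0.6 * y_true + 0.4 * rng.random(500)

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("area under the ROC curve:", round(auc(fpr, tpr), 3))
    # Plotting tpr against fpr (e.g., with matplotlib) yields the ROC curve itself.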

Summary

The industry standard CRISP-DM process has six stages: (1) business understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment. SEMMA is another process outline with a very similar structure. Using the CRISP-DM framework, data selection and understanding, preparation, and model interpretation require teamwork between data mining analysts and business analysts, while data transformation and data mining are conducted by data mining analysts. Each stage is a preparation for the next stage. In the remaining chapters of this book, we will discuss the details of this process from different perspectives, such as data mining tools and applications. This will provide the reader with a better understanding of why the correct process is sometimes even more important than the correct performance of the methodology.
