Have you ever wondered why your spouse gets all of these strange catalogs for obscure products in the mail? Have you also wondered at his or her strong interest in these things, and suspected that your spouse is overly responsive to advertising of this sort? For that matter, have you ever wondered why 90 percent of your telephone calls, especially during meals, are opportunities to purchase products? (Or why callers who assume you are a certain type of customer keep calling, over and over, even though you continue to tell them that their database is wrong?)
One of the earliest and most effective business applications of data mining is in support of customer segmentation. This insidious application uses massive databases (obtained from a variety of sources) to segment the market into categories, which are studied with data mining tools to predict the response to particular advertising campaigns. It has proven highly effective. It also illustrates the probabilistic nature of data mining: the predictions are not perfect. The idea is to send catalogs to (or call) a group of target customers with a 5 percent probability of purchase rather than waste these expensive marketing resources on customers with a 0.05 percent probability of purchase. The same principle has been used in election campaigns by party organizations—give free rides to the voting booth to those in your party, and minimize free rides for those likely to vote for your opponents. Some call this bias. Others call it sound business.
Data mining offers the opportunity to apply technology to improve many aspects of business. Some standard applications are presented in this chapter. The value of education is to present you with past applications, so that you can use your imagination to extend these application ideas to new environments.
Data mining has proven valuable in almost every academic discipline. Understanding the business applications of data mining is necessary to expose business college students to current analytic information technology. Data mining has been instrumental in customer relationship management,1 credit card management,2 banking,3 insurance,4 telecommunications,5 and many other areas of statistical support to business. Business data mining is made possible by the masses of data generated by computer information systems. Understanding how this information is generated, and the analytic tools available to make sense of it, is fundamental for business students in the 21st century.
This chapter will describe some of the major applications of data mining. By doing so, there will also be opportunities to demonstrate some of the different techniques that have proven useful. Table 2.1 compares the aspects of these applications.
Table 2.1 Common business data mining applications
| Application | Function | Statistical technique | AI tool |
|---|---|---|---|
| Catalog sales | Customer segmentation; mail stream optimization | Cluster analysis | K-means; neural network |
| CRM (telecom) | Customer scoring; churn analysis | Cluster analysis | Neural network |
| Credit scoring | Loan applications | Cluster analysis; pattern search | K-means |
| Banking (loans) | Bankruptcy prediction | Prediction; discriminant analysis | Decision tree |
| Investment risk | Risk prediction | Prediction | Neural network |
| Insurance | Customer retention (churn); pricing | Prediction; logistic regression | Decision tree; neural network |
A wide variety of business functions are supported by data mining. The applications listed in Table 2.1 are only a sample. The underlying statistical techniques are relatively simple: predict an outcome, identify the cases closest to past instances, or identify some pattern.
Customer Profiling
We begin with probably the most spectacular example of business data mining. Fingerhut, Inc. was a pioneer in developing data mining methods to improve catalog sales, seeking to identify the small subset of customers most likely to purchase from its specialty catalogs. Fingerhut was so successful that it was purchased by Federated Stores. Ultimately, its operations fell victim to the general malaise in the IT business in 2001 and 2002, but they still represent a pioneering application of data mining in business.
Lift
This section demonstrates the concept of lift used in customer segmentation models. We can divide the data into groups as fine as we want (here, we divide them into 10 equal portions of the population, or groups of 10 percent each). These groups have some identifiable features, such as zip code, income level, and so on (a profile). We can then sample and identify the portion of sales for each group. The idea behind lift is to send promotional material (which has a unit cost) to those groups that have the greatest probability of positive response first. We can visualize lift by plotting the responses against the proportion of the total population of potential customers, as shown in Table 2.2. Note that the segments are listed in Table 2.2 sorted by expected customer response.
Table 2.2 Lift calculation
| Ordered segment | Expected customer response | Proportion (expected responses) | Cumulative response proportion | Random average proportion | Lift |
|---|---|---|---|---|---|
| Origin | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.20 | 0.172 | 0.172 | 0.10 | 0.072 |
| 2 | 0.17 | 0.147 | 0.319 | 0.20 | 0.119 |
| 3 | 0.15 | 0.129 | 0.448 | 0.30 | 0.148 |
| 4 | 0.13 | 0.112 | 0.560 | 0.40 | 0.160 |
| 5 | 0.12 | 0.103 | 0.664 | 0.50 | 0.164 |
| 6 | 0.10 | 0.086 | 0.750 | 0.60 | 0.150 |
| 7 | 0.09 | 0.078 | 0.828 | 0.70 | 0.128 |
| 8 | 0.08 | 0.069 | 0.897 | 0.80 | 0.097 |
| 9 | 0.07 | 0.060 | 0.957 | 0.90 | 0.057 |
| 10 | 0.05 | 0.043 | 1.000 | 1.00 | 0.000 |
Both the cumulative responses and cumulative proportion of the population are graphed to identify the lift. Lift is the difference between the two lines in Figure 2.1.
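The calculation behind Table 2.2 is simple enough to sketch in a few lines. The response rates below are the illustrative segment values from the table, not real data:

```python
# Reproduce the lift figures of Table 2.2. The segment response rates are
# the illustrative values from the table, not real data.
responses = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]

total = sum(responses)                   # 1.16 expected responses overall
cum_prop = 0.0
rows = []
for i, r in enumerate(responses, start=1):
    cum_prop += r / total                # cumulative response proportion
    random_prop = i / len(responses)     # proportion contacted under random targeting
    lift = cum_prop - random_prop        # vertical gap between the two curves
    rows.append((i, round(cum_prop, 3), random_prop, round(lift, 3)))

for segment, cum, rand, lft in rows:
    print(segment, cum, rand, lft)
```

Sorting segments by response rate before accumulating is what produces the concave cumulative-response curve; under random targeting the curve is simply the diagonal.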
The purpose of lift analysis is to identify the most responsive segments. Here, the greatest lift is obtained from the first five segments. We are usually more interested in profit, however, so we need to identify which portion of the population to send promotional materials to. For instance, if an average profit of $200 is expected for each positive response and a cost of $25 is expected for each set of promotional material sent out, it is obviously profitable to send to the first segment, with its expected 0.2 positive-response rate ($200 times 0.2 yields expected revenue of $40, covering the $25 cost plus $15 profit). But it still might be possible to improve overall profit by sending to other segments as well, always selecting the segment with the next-largest response rate. The second most responsive segment would also be profitable, collecting $200 times 0.17, or $34, per $25 mailing, for a net profit of $9. The fourth most responsive segment collects $200 times 0.13 ($26) for a net profit of $1, while the fifth collects $200 times 0.12 ($24) for a net loss of $1. The plot of cumulative profit for this set of data is shown in Figure 2.2, and Table 2.3 shows the calculation of the expected payoff.
Table 2.3 Calculation of the expected payoff
| Segment | Expected segment revenue ($200 × P) | Cumulative expected revenue | Random cumulative cost ($25 × i) | Expected payoff |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 40 | 40 | 25 | 15 |
| 2 | 34 | 74 | 50 | 24 |
| 3 | 30 | 104 | 75 | 29 |
| 4 | 26 | 130 | 100 | 30 |
| 5 | 24 | 154 | 125 | 29 |
| 6 | 20 | 174 | 150 | 24 |
| 7 | 18 | 192 | 175 | 17 |
| 8 | 16 | 208 | 200 | 8 |
| 9 | 14 | 222 | 225 | –3 |
| 10 | 10 | 232 | 250 | –18 |
The profit function in Figure 2.2 reaches its maximum with the fourth segment.
It is clear that the maximum profit is obtained by mailing to the four most responsive of the ten segments in the population. The implication is that, in this case, the promotional materials should be sent to the four segments expected to have the largest response rates. If there were a promotional budget, it would be applied to segments in order of expected response rate, up to either the budget limit or the fourth segment.
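The payoff column of Table 2.3 can be reproduced directly. Using the same $200 profit and $25 cost figures, a short sketch confirms that cumulative profit peaks at the fourth segment:

```python
# Reproduce the expected payoff column of Table 2.3: $200 expected profit
# per positive response and a $25 mailing cost per (equal-sized) segment.
responses = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]
PROFIT, COST = 200, 25

cum_revenue = 0.0
payoffs = []
for i, r in enumerate(responses, start=1):
    cum_revenue += PROFIT * r               # cumulative expected revenue
    payoffs.append(cum_revenue - COST * i)  # less cumulative mailing cost

best = max(range(len(payoffs)), key=payoffs.__getitem__) + 1
print(best, [round(p) for p in payoffs])
# → 4 [15, 24, 29, 30, 29, 24, 17, 8, -3, -18]
```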
It is possible to focus on the wrong measure. The basic objective of lift analysis in marketing is to identify those customers whose decisions will be influenced by marketing in a positive way. In short, the methodology described earlier identifies those segments of the customer base that would be expected to purchase. This may or may not have been due to the marketing campaign effort. The same methodology can be applied, but more detailed data is needed to identify those whose decisions would have been changed by the marketing campaign, rather than simply those who would purchase.
Another method, one that considers multiple factors, is Recency, Frequency, and Monetary (RFM) analysis. As with lift analysis, the purpose of RFM analysis is to identify customers who are more likely to respond to new offers. While lift looks at a static measure of response to a particular campaign, RFM tracks customer transactions by time, frequency, and amount. Time matters because some customers may not have responded to the last campaign but might now be ready to purchase the product being marketed. Customers can also be sorted by the frequency of their responses and by the dollar amount of their purchases. Subjects are coded on each of the three dimensions (one approach uses five cells for each of the three measures, yielding 125 combinations, each of which can be associated with a rate of positive response to the marketing campaign). RFM still has limitations, in that there are usually more than three attributes important to a successful marketing program, such as product variation, customer age, customer income, customer lifestyle, and so on.6 The approach is the basis for a continuing stream of techniques to improve customer segmentation marketing.
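As a concrete illustration of the 5 × 5 × 5 coding just described, the sketch below scores a customer on each RFM dimension. The cut points, dates, and sample customer are invented for illustration; real quintile boundaries would be derived from the firm's own customer base.

```python
# Score customers into one of 5 x 5 x 5 = 125 RFM cells. All cut points and
# customer values here are hypothetical.
from datetime import date

TODAY = date(2023, 12, 1)              # hypothetical "as of" date

R_EDGES = [30, 60, 120, 240]           # days since last purchase (fewer is better)
F_EDGES = [1, 3, 6, 10]                # number of orders
M_EDGES = [50, 150, 400, 800]          # total spend in dollars

def band(value, edges):
    """Return a 1-5 band: 1 plus the number of cut points the value exceeds."""
    return 1 + sum(value > e for e in edges)

def rfm_cell(last_purchase, orders, spend):
    r = 6 - band((TODAY - last_purchase).days, R_EDGES)  # invert so recent = 5
    f = band(orders, F_EDGES)
    m = band(spend, M_EDGES)
    return (r, f, m)

print(rfm_cell(date(2023, 11, 5), 12, 940.0))   # → (5, 5, 5)
```

Each cell can then be tracked for its historical response rate, and the campaign mailed only to cells whose rate clears the profitability threshold derived in the lift discussion above.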
Understanding lift enables understanding the value of specific types of customers. This enables more intelligent customer management, which is discussed in the next section.
Comparisons of Data Mining Methods
Initial analyses focus on discovering patterns in the data. Classical statistical methods such as correlation analysis are a good start, often supplemented with visual tools to see the distributions of and relationships among variables. Clustering and pattern search are typically the first activities in data analysis, and both are good examples of knowledge discovery. Appropriate models are then built. Data mining can thus involve model building (the extension of conventional statistical model building to very large datasets) and pattern recognition, which aims to identify groups of interesting observations. Experts are often used to assist in pattern recognition.
There are two broad categories of models used for data mining. Continuous, especially time series, data often calls for forecasting. Linear regression provides one tool, but there are many others. Business data mining has widely been used for classification or developing models to predict which category a new case will most likely belong to (such as a customer profile relative to the expected purchases, whether or not loans will be problematic, or whether insurance claims will turn out to be fraudulent). The classification modeling tools include statistically based logistic regression as well as artificial intelligence-based neural networks and decision trees.
Sung et al. compared a number of these methods with respect to their advantages and disadvantages. Table 2.4 draws upon their analysis and expands it to include the other techniques covered.
Table 2.4 Comparison of data mining method features7
| Method | Advantages | Disadvantages | Assumptions |
|---|---|---|---|
| Cluster analysis | Can generate understandable formula; can be applied automatically | Computation time increases with dataset size; requires identification of parameters, with results sensitive to choices | Data must be made numerical |
| Discriminant analysis | Can incorporate multiple financial ratios simultaneously; provides coefficients for combining the independent variables; can be applied to new data | Violates normality and independence assumptions; reduction-of-dimensionality issues; varied interpretation of the relative importance of variables; difficulty specifying the classification algorithm; difficulty interpreting time-series prediction tests | Multivariate normality within groups; equal group covariances across all groups; groups are discrete, nonoverlapping, and identifiable |
| Regression | Can generate understandable formula; widely understood; strong body of theory | Computation time increases with dataset size; not very good with nonlinear data | Normality of errors; no autocorrelation, heteroskedasticity, or multicollinearity |
| Neural network models | Can deal with a wide range of problems; produce good results in complicated (nonlinear) domains; can deal with both continuous and categorical variables; many software packages available | Require inputs in the range of 0 to 1; do not explain results; may prematurely converge to an inferior solution | Groups are discrete, nonoverlapping, and identifiable |
| Decision trees | Can generate understandable rules; can classify with minimal computation; use easy calculations; can deal with continuous and categorical variables; provide a clear indication of variable importance | Some algorithms can only deal with binary-valued target classes; most algorithms examine a single field at a time; can be computationally expensive | Groups are discrete, nonoverlapping, and identifiable |
Knowledge Discovery
Clustering: One unsupervised clustering technique is partitioning, the process of examining a set of data to define a new categorical variable partitioning the space into a fixed number of regions. This amounts to dividing the data into clusters. The most widely known partitioning algorithm is k-means, where k center points are defined, and each observation is classified to the closest of these center points. The k-means algorithm attempts to position the centers to minimize the sum of distances. Centroids are used as centers, and the most commonly used distance metric is Euclidean. Instead of k-means, k-median can be used, providing a partitioning method expected to be more stable.
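The k-means loop just described (assign each observation to its nearest center, then move each center to its cluster centroid) can be sketched in a few lines. The data points and the naive first-k initialization are illustrative only; library implementations such as scikit-learn's KMeans use better seeding (e.g., k-means++).

```python
# A bare-bones k-means sketch with Euclidean distance and centroid centers.
import math

def kmeans(points, k, iterations=20):
    centers = points[:k]                 # naive initialization: first k points
    for _ in range(iterations):
        # Assignment step: each observation goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the centroid of its cluster.
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
print([len(c) for c in clusters])        # → [3, 3]
```

Swapping the centroid update for a per-coordinate median gives the k-median variant mentioned above, which is less sensitive to outliers.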
Pattern search: Objects are often grouped to seek patterns. Clusters of customers might be identified with particularly interesting average outcomes. On the positive side, you might look for patterns in highly profitable customers. On the negative side, you might seek patterns unique to those who fail to pay their bills to the firm.
Both clustering and pattern search seek to group objects. Cluster analysis is attractive in that it can be applied automatically (although ample computational time needs to be available), can be applied to all types of data, as demonstrated in our example, and is easy to apply. However, its use requires selecting from among alternative distance measures, and weights may be needed to reflect variable importance; the results are sensitive to these choices. Cluster analysis is appropriate when dealing with large, complex datasets with many variables and specifically identifiable outcomes, and it is often used as an initial form of analysis. Once different clusters are identified, pattern search methods are often used to discover the rules and patterns that characterize them. (Discriminant analysis, by contrast, has been the most widely used data mining technique in bankruptcy prediction.) Clustering partitions the entire data sample, assigning each observation to exactly one group. Pattern search instead seeks local clusterings, groups containing more objects with similar characteristics than one would expect; it does not partition the entire dataset, but identifies a few groups exhibiting unusual behavior. In applications on real data, clustering is useful for describing broad behavioral classes of customers, while pattern search is useful for identifying groups of people behaving in an anomalous way.
Predictive Models
Regression is probably the most widely used analytical tool historically. A main benefit of regression is the broad understanding people have of regression models and tests of their output. Logistic regression is highly appropriate in data mining because the outcome variables of interest are usually categorical. While regression is an excellent tool for statistical analysis, it does require assumptions about parameters. Errors are assumed to be normally distributed, without autocorrelation (errors are not related to prior errors), without heteroskedasticity (errors don't grow with time, for instance), and without multicollinearity (independent variables don't contain high degrees of overlapping information content). Regression can deal with nonlinear data, but only if the modeler understands the underlying nonlinearity and develops appropriate variable transformations. There is usually a tradeoff: if the data are fit well by a linear model, regression tends to outperform neural network models, but if there is nonlinearity or complexity in the data, neural networks (and often genetic algorithms) tend to do better than regression. A major advantage of regression over neural networks is that regression provides an easily understood formula, while a neural network yields a very complex model.
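To make the logistic regression idea concrete, here is a minimal fit by batch gradient descent on an invented, linearly separable dataset (an income band predicting loan repayment, 0/1). The data and variable names are assumptions for illustration; in practice a statistical package would estimate the model.

```python
# Minimal logistic regression fit by batch gradient descent on toy data.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=20000):
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y     # gradient of the log-loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # income band (hypothetical)
ys = [0, 0, 0, 1, 1, 1]                      # 1 = loan repaid

w, b = fit_logistic(xs, ys)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print([predict(x) for x in xs])              # → [0, 0, 0, 1, 1, 1]
```

The fitted coefficients give an interpretable formula (the log-odds of repayment are linear in income), which is exactly the advantage over neural networks noted above.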
Neural network algorithms can prove highly accurate, but involve difficulty in the application to new data or interpretation of the model. Neural networks work well unless there are many input features. The presence of many features makes it difficult for the network to find patterns, resulting in long training phases, with lower probabilities of convergence. Genetic algorithms have also been applied to data mining, usually to bolster operations of other algorithms.
Decision tree analysis requires only the last assumption: that groups are discrete, nonoverlapping, and identifiable. Decision trees generate understandable rules, perform classification with minimal and easy computation, can deal with both continuous and categorical variables, and provide a clear indication of variable importance in prediction and classification. Despite the disadvantages noted in Table 2.4, the decision tree method is a good choice when the data mining task is classification of records or prediction of outcomes.
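The decision tree logic can be illustrated with its simplest case, a single-split "stump" that scans candidate thresholds on one variable and keeps the one with the fewest misclassifications. The credit-score data below are invented for illustration; a full tree algorithm would repeat this split search recursively on each branch.

```python
# A single-split decision "stump": the simplest possible decision tree.
def best_stump(xs, ys):
    """Find threshold t so that predicting 1 for x > t misclassifies least."""
    best = (None, len(ys) + 1)
    candidates = sorted(set(xs))
    # Candidate thresholds: midpoints between adjacent observed values.
    thresholds = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    for t in thresholds:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best[1]:
            best = (t, errors)
    return best

xs = [620, 650, 700, 720, 750, 780]   # credit score (hypothetical)
ys = [0,   0,   1,   1,   1,   1]     # 1 = loan approved

t, errors = best_stump(xs, ys)
print(t, errors)   # → 675.0 0
```

The resulting rule ("approve if credit score > 675") is exactly the kind of understandable output that makes trees attractive to business users.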
Summary
Data mining applications are widespread. This chapter sought to give concrete examples of some of the major business applications of data mining. We began with a review of Fingerhut data mining to support catalog sales. That application was an excellent demonstration of the concept of lift applied to retail business. We also reviewed five other major business applications, intentionally trying to demonstrate a variety of different functions, statistical techniques, and data mining methods. Most of those studies applied multiple algorithms (data mining methods). Software such as Enterprise Miner has a variety of algorithms available, encouraging data miners to find the method that works best for a specific set of data.
The second portion of the book demonstrates these methods with small examples, which can be run in Excel or other simple spreadsheet packages with statistical support. Businesses can often conduct data mining without purchasing large-scale data mining software. Our philosophy, therefore, is that it is useful to understand what the methods are doing, which also gives users a better understanding of what they are doing when applying data mining.