Chapter 2

Business Data Mining Tools

Have you ever wondered why your spouse gets all of these strange catalogs for obscure products in the mail? Have you also wondered at his or her strong interest in these things, and thought that the spouse was overly responsive to advertising of this sort? For that matter, have you ever wondered why 90 percent of your telephone calls, especially during meals, are opportunities to purchase products? (Or for that matter, why calls assuming you are a certain type of customer occur over and over, even though you continue to tell them that their database is wrong?)

One of the earliest and most effective business applications of data mining is in support of customer segmentation. This insidious application uses massive databases (obtained from a variety of sources) to segment the market into categories, which are studied with data mining tools to predict the response to particular advertising campaigns. It has proven highly effective. It also illustrates the probabilistic nature of data mining: it is not perfect. The idea is to send catalogs to (or call) a group of target customers with a 5 percent probability of purchase rather than waste these expensive marketing resources on customers with a 0.05 percent probability of purchase. The same principle has been used in election campaigns by party organizations: give free rides to the voting booth to those in your party, and minimize free rides to the voting booth for those likely to vote for your opponents. Some call this bias. Others call it sound business.

Data mining offers the opportunity to apply technology to improve many aspects of business. Some standard applications are presented in this chapter. The value of education is to present you with past applications, so that you can use your imagination to extend these application ideas to new environments.

Data mining has proven valuable in almost every academic discipline. Understanding the business applications of data mining is necessary to expose business college students to current analytic information technology. Data mining has been instrumental in customer relationship management,1 credit card management,2 banking,3 insurance,4 telecommunications,5 and many other areas of statistical support to business. Business data mining is made possible by the masses of data generated by computer information systems. Understanding how this data is generated, and the tools available to analyze it, is fundamental for business students in the 21st century. There are highly useful applications in practically every field of scientific study, and data mining support is required to make sense of the masses of business data generated by computer technology.

This chapter will describe some of the major applications of data mining. In doing so, it will also demonstrate some of the different techniques that have proven useful. Table 2.1 compares aspects of these applications.


Table 2.1 Common business data mining applications

| Application | Function | Statistical technique | AI tool |
|---|---|---|---|
| Catalog sales | Customer segmentation; mail stream optimization | Cluster analysis | K-means; neural network |
| CRM (telecom) | Customer scoring; churn analysis | Cluster analysis | Neural network |
| Credit scoring | Loan applications | Cluster analysis; pattern search | K-means |
| Banking (loans) | Bankruptcy prediction | Prediction; discriminant analysis | Decision tree |
| Investment risk | Risk prediction | Prediction | Neural network |
| Insurance | Customer retention (churn); pricing | Prediction; logistic regression | Decision tree; neural network |



A wide variety of business functions are supported by data mining. The applications listed in Table 2.1 are only a sample. The underlying statistical techniques are relatively simple: to predict, to identify the case closest to past instances, or to identify some pattern.

Customer Profiling

We begin with probably the most spectacular example of business data mining. Fingerhut, Inc. was a pioneer in developing data mining methods to improve catalog sales. In this case, they sought to identify the small subset of customers most likely to purchase from each specialty catalog. They were so successful that they were purchased by Federated Department Stores. Ultimately, Fingerhut operations fell victim to the general malaise in the IT business in 2001 and 2002. But they still represent a pioneering development of data mining application in business.

Lift

This section demonstrates the concept of lift used in customer segmentation models. We can divide the data into groups as fine as we want (here, we divide the population into 10 equal portions, or groups of 10 percent each). These groups have some identifiable features, such as zip code, income level, and so on (a profile). We can then sample and identify the portion of sales for each group. The idea behind lift is to send promotional material (which has a unit cost) first to those groups that have the greatest probability of positive response. We can visualize lift by plotting the responses against the proportion of the total population of potential customers, as shown in Table 2.2, where the segments are sorted by expected customer response.


Table 2.2 Lift calculation

| Ordered segment | Expected customer response | Proportion (expected responses) | Cumulative response proportion | Random average proportion | Lift |
|---|---|---|---|---|---|
| Origin | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.20 | 0.172 | 0.172 | 0.10 | 0.072 |
| 2 | 0.17 | 0.147 | 0.319 | 0.20 | 0.119 |
| 3 | 0.15 | 0.129 | 0.448 | 0.30 | 0.148 |
| 4 | 0.13 | 0.112 | 0.560 | 0.40 | 0.160 |
| 5 | 0.12 | 0.103 | 0.664 | 0.50 | 0.164 |
| 6 | 0.10 | 0.086 | 0.750 | 0.60 | 0.150 |
| 7 | 0.09 | 0.078 | 0.828 | 0.70 | 0.128 |
| 8 | 0.08 | 0.069 | 0.897 | 0.80 | 0.097 |
| 9 | 0.07 | 0.060 | 0.957 | 0.90 | 0.057 |
| 10 | 0.05 | 0.043 | 1.000 | 1.00 | 0.000 |

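The calculations behind Table 2.2 are easy to verify directly. Below is a minimal Python sketch, using the expected response rates assumed in the table, that reproduces the proportion, cumulative, and lift columns:

```python
# Expected response rates for the 10 segments, as assumed in Table 2.2,
# already sorted from most to least responsive.
response = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]

total = sum(response)  # 1.16 expected responses across all segments
cumulative = 0.0
print(f"{'Segment':>7} {'Prop':>6} {'Cum':>6} {'Random':>7} {'Lift':>6}")
for i, r in enumerate(response, start=1):
    proportion = r / total           # share of all expected responses
    cumulative += proportion         # cumulative response proportion
    random = i / len(response)       # proportion mailed under random selection
    lift = cumulative - random       # vertical gap between the two curves
    print(f"{i:>7} {proportion:>6.3f} {cumulative:>6.3f} {random:>7.2f} {lift:>6.3f}")
```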


Both the cumulative response proportion and the cumulative proportion of the population are graphed to identify the lift. Lift is the difference between the two lines in Figure 2.1.


Figure 2.1 Lift identified by the mail optimization system

The purpose of lift analysis is to identify the most responsive segments. Here, the greatest lift is obtained from the first five segments. We are probably more interested in profit, however, and can identify the most profitable policy: determine the portion of the population to which promotional materials should be sent. For instance, if an average profit of $200 is expected for each positive response and a cost of $25 is expected for each set of promotional material sent out, it obviously would be profitable to send to the first segment, with an expected 0.2 positive responses ($200 times 0.2 yields an expected revenue of $40, covering the $25 cost plus an extra $15 profit). But it still might be possible to improve the overall profit by sending to other segments as well (always taking segments in order of decreasing response rate). The plot of cumulative profit for this set of data is shown in Figure 2.2. The second most responsive segment is also profitable, collecting $200 times 0.17, or $34, per $25 mailing for a net profit of $9. The fourth most responsive segment collects 0.13 times $200 ($26) for a net profit of $1, while the fifth most responsive segment collects $200 times 0.12 ($24) for a net loss of $1. Table 2.3 shows the calculation of the expected payoff.


Figure 2.2 Profit impact of lift


Table 2.3 Calculation of the expected payoff

| Segment | Expected segment revenue ($200 × P) | Cumulative expected revenue | Random cumulative cost ($25 × i) | Expected payoff |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 40 | 40 | 25 | 15 |
| 2 | 34 | 74 | 50 | 24 |
| 3 | 30 | 104 | 75 | 29 |
| 4 | 26 | 130 | 100 | 30 |
| 5 | 24 | 154 | 125 | 29 |
| 6 | 20 | 174 | 150 | 24 |
| 7 | 18 | 192 | 175 | 17 |
| 8 | 16 | 208 | 200 | 8 |
| 9 | 14 | 222 | 225 | -3 |
| 10 | 10 | 232 | 250 | -18 |

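The same arithmetic identifies the profit-maximizing mailing policy. A short sketch using the $200 revenue and $25 cost figures from the example:

```python
# Expected response rates per segment (from Table 2.2), most responsive first.
response = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]
revenue_per_sale, cost_per_mailing = 200, 25

cumulative_revenue = 0.0
best_segment, best_payoff = 0, 0.0
for i, r in enumerate(response, start=1):
    cumulative_revenue += revenue_per_sale * r          # expected revenue so far
    payoff = cumulative_revenue - cost_per_mailing * i  # less cumulative mailing cost
    print(f"segment {i:>2}: expected payoff = {payoff:6.0f}")
    if payoff > best_payoff:
        best_segment, best_payoff = i, payoff

print(f"mail to the {best_segment} most responsive segments "
      f"(expected payoff ${best_payoff:.0f})")
```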


The profit function in Figure 2.2 reaches its maximum with the fourth segment.

It is clear that the maximum profit is obtained by sending to the four most responsive segments of the ten in the population. The implication is that, in this case, the promotional materials should be sent to the four segments with the largest expected response rates. If there were a limited promotional budget, it would be applied to as many segments as the budget could support, in order of expected response rate, up to the fourth segment.

It is possible to focus on the wrong measure. The basic objective of lift analysis in marketing is to identify those customers whose decisions will be influenced by marketing in a positive way. In short, the methodology described earlier identifies those segments of the customer base that would be expected to purchase, whether or not the purchase was due to the marketing campaign. The same methodology can be applied, but more detailed data is needed, to identify those whose decisions would actually be changed by the campaign, rather than simply those who would purchase anyway.

Another method, one that considers multiple factors, is Recency, Frequency, and Monetary (RFM) analysis. As with lift analysis, the purpose of RFM is to identify customers who are more likely to respond to new offers. While lift looks at the static response to a particular campaign, RFM tracks customer transactions by time, by frequency, and by amount. Time is important because some customers may not have responded to the last campaign but might now be ready to purchase the product being marketed. Customers can also be sorted by the frequency of responses and by the dollar amount of sales. Subjects are coded on each of the three dimensions (one approach is to have five cells for each of the three measures, yielding a total of 125 combinations, each of which can be associated with a positive response rate to the marketing campaign). RFM still has limitations, in that there are usually more than three attributes important to a successful marketing program, such as product variation, customer age, customer income, customer lifestyle, and so on.6 The approach is nonetheless the basis for a continuing stream of techniques to improve customer segmentation marketing.
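One common implementation codes each of the three RFM dimensions into quintiles. The sketch below uses pandas on hypothetical customer data; the column names, the synthetic values, and the 1-to-5 scoring convention are illustrative assumptions rather than a standard.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer transaction summary.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "days_since_last": rng.integers(1, 365, 1000),  # recency
    "n_purchases": rng.integers(1, 50, 1000),       # frequency
    "total_spent": rng.uniform(10, 5000, 1000),     # monetary
})

# Score each dimension into five quintile cells (5 = best).
# Fewer days since the last purchase means a better recency score.
customers["R"] = pd.qcut(customers["days_since_last"], 5, labels=[5, 4, 3, 2, 1])
customers["F"] = pd.qcut(customers["n_purchases"], 5, labels=[1, 2, 3, 4, 5])
customers["M"] = pd.qcut(customers["total_spent"], 5, labels=[1, 2, 3, 4, 5])

# 5 x 5 x 5 = 125 possible cells; each cell can be associated with the
# observed response rate to past campaigns.
customers["RFM"] = (customers["R"].astype(str) + customers["F"].astype(str)
                    + customers["M"].astype(str))
print(customers["RFM"].value_counts().head())
```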

Understanding lift enables understanding the value of specific types of customers. This enables more intelligent customer management, which is discussed in the next section.

Comparisons of Data Mining Methods

Initial analyses focus on discovering patterns in the data. Classical statistical methods, such as correlation analysis, are a good start, often supplemented with visual tools to see the distributions of, and relationships among, variables. Clustering and pattern search are typically the first activities in data analysis, and are good examples of knowledge discovery. Then, appropriate models are built. Data mining can then involve model building (an extension of conventional statistical model building to very large datasets) and pattern recognition. Pattern recognition aims to identify groups of interesting observations, and experts are often used to assist in it.

There are two broad categories of models used for data mining. Continuous data, especially time series, often calls for forecasting; linear regression provides one tool, but there are many others. Business data mining has been widely used for classification, that is, developing models to predict the category to which a new case will most likely belong (a customer profile relative to expected purchases, whether or not a loan will be problematic, or whether an insurance claim will turn out to be fraudulent). The classification modeling tools include statistically based logistic regression as well as artificial intelligence-based neural networks and decision trees.

Sung et al. compared a number of these methods with respect to their advantages and disadvantages. Table 2.4 draws upon their analysis and expands it to include the other techniques covered.


Table 2.4 Comparison of data mining method features7

| Method | Advantages | Disadvantages | Assumptions |
|---|---|---|---|
| Cluster analysis | Can generate understandable formula; can be applied automatically | Computation time increases with dataset size; requires identification of parameters, with results sensitive to choices | Need to make data numerical |
| Discriminant analysis | Ability to incorporate multiple financial ratios simultaneously; coefficients for combining the independent variables; ability to apply to new data | Violates normality and independence assumptions; reduction of dimensionality issues; varied interpretation of the relative importance of variables; difficulty in specifying the classification algorithm; difficulty in interpreting the time-series prediction tests | Multivariate normality within groups; equal group covariances across all groups; groups are discrete, nonoverlapping, and identifiable |
| Regression | Can generate understandable formula; widely understood; strong body of theory | Computation time increases with dataset size; not very good with nonlinear data | Normality of errors; no error autocorrelation, heteroskedasticity, or multicollinearity |
| Neural network models | Can deal with a wide range of problems; produce good results in complicated (nonlinear) domains; can deal with both continuous and categorical variables; have many software packages available | Require inputs in the range of 0 to 1; do not explain results; may prematurely converge to an inferior solution | Groups are discrete, nonoverlapping, and identifiable |
| Decision trees | Can generate understandable rules; can classify with minimal computation; use easy calculations; can deal with continuous and categorical variables; provide a clear indication of variable importance | Some algorithms can only deal with binary-valued target classes; most algorithms only examine a single field at a time; can be computationally expensive | Groups are discrete, nonoverlapping, and identifiable |



Knowledge Discovery

Clustering: One unsupervised clustering technique is partitioning, the process of examining a set of data to define a new categorical variable that partitions the space into a fixed number of regions. This amounts to dividing the data into clusters. The most widely known partitioning algorithm is k-means, where k center points are defined and each observation is assigned to the closest center point. The k-means algorithm attempts to position the centers to minimize the sum of distances; centroids are used as centers, and the most commonly used distance metric is Euclidean. Instead of k-means, k-median can be used, providing a partitioning method expected to be more stable.
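As a concrete illustration, here is a minimal k-means sketch using scikit-learn on invented customer measures (annual spend and order count are assumptions for the example); in practice the features would normally be standardized first, since Euclidean distance is scale sensitive.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two invented customer groups: light buyers and heavy buyers.
rng = np.random.default_rng(1)
light = rng.normal([200, 2], [50, 1], (100, 2))     # annual spend, orders
heavy = rng.normal([1500, 12], [300, 3], (100, 2))
X = np.vstack([light, heavy])

# Partition into k = 2 clusters: each observation is assigned to the
# nearest centroid under Euclidean distance, and centroids are moved
# to minimize the total within-cluster distance.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("first five assignments:", km.labels_[:5])
```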

Pattern search: Objects are often grouped to seek patterns. Clusters of customers with particularly interesting average outcomes might be identified. On the positive side, you might look for patterns among highly profitable customers; on the negative side, you might seek patterns unique to those who fail to pay their bills to the firm.

Both clustering and pattern search seek to group objects. Cluster analysis is attractive in that it can be applied automatically (although ample computational time needs to be available). It can be applied to all types of data, as demonstrated in our example, and it is easy to apply. However, its use requires selection among alternative distance measures, and weights may be needed to reflect variable importance; the results are sensitive to these choices. Cluster analysis is appropriate when dealing with large, complex datasets with many variables and specifically identifiable outcomes, and it is often used as an initial form of analysis. Once different clusters are identified, pattern search methods are often used to discover the rules and patterns within them. (For prediction tasks, by contrast, discriminant analysis has been the most widely used data mining technique in bankruptcy prediction.) Clustering partitions the entire data sample, assigning each observation to exactly one group. Pattern search instead seeks local clusterings, groups containing more objects with similar characteristics than one would expect; it does not partition the entire dataset, but identifies a few groups exhibiting unusual behavior. Applied to real data, clustering is useful for describing broad behavioral classes of customers, while pattern search is useful for identifying groups of people behaving in an anomalous way.

Predictive Models

Regression is probably the most widely used analytical tool historically. A main benefit of regression is the broad understanding people have of regression models and the tests of their output. Logistic regression is highly appropriate in data mining because the outcome variables are usually categorical. While regression is an excellent tool for statistical analysis, it does carry assumptions: errors are assumed to be normally distributed, without autocorrelation (errors are not related to prior errors), without heteroskedasticity (errors don't grow with time, for instance), and without multicollinearity (independent variables don't contain high degrees of overlapping information content). Regression can deal with nonlinear data, but only if the modeler understands the underlying nonlinearity and develops appropriate variable transformations. There usually is a tradeoff: if the data are fit well by a linear model, regression tends to be better than neural network models; if there is nonlinearity or complexity in the data, neural networks (and often genetic algorithms) tend to do better than regression. A major advantage of regression relative to neural networks is that regression provides an easily understood formula, while a neural network yields a far more complex, opaque model.
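A brief logistic regression sketch on invented loan data; the income and debt-ratio variables, and the relationship generating the outcomes, are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented loan applications: income (in $1,000s) and debt ratio,
# with a binary repayment outcome generated for the example.
rng = np.random.default_rng(2)
income = rng.uniform(20, 150, 500)
debt = rng.uniform(0.0, 1.0, 500)
p_repay = 1 / (1 + np.exp(-(0.03 * income - 4 * debt)))  # assumed relationship
repaid = rng.random(500) < p_repay

X = np.column_stack([income, debt])
model = LogisticRegression().fit(X, repaid)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

# Estimated repayment probability for a new applicant:
# income $60,000, debt ratio 0.4.
print("P(repaid):", model.predict_proba([[60, 0.4]])[0, 1])
```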

Neural network algorithms can prove highly accurate, but the resulting models are difficult to interpret and to apply to new data. Neural networks work well unless there are many input features: the presence of many features makes it difficult for the network to find patterns, resulting in long training phases with lower probabilities of convergence. Genetic algorithms have also been applied to data mining, usually to bolster the operation of other algorithms.
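To make the nonlinearity point concrete, the sketch below fits a small neural network to a synthetic ring-shaped class boundary, which a linear model cannot capture; the dataset, the scaling step, and the network size are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic nonlinear problem: class 1 lies inside a circle of radius 1.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, (500, 2))
y = (np.hypot(X[:, 0], X[:, 1]) < 1).astype(int)

# Scaling inputs to the 0-to-1 range addresses the input-range issue
# noted in Table 2.4; the hidden layer captures the nonlinear boundary.
nn = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=3000, random_state=0),
)
nn.fit(X, y)
print("training accuracy:", nn.score(X, y))
```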

Decision tree analysis requires only the last assumption: that groups are discrete, nonoverlapping, and identifiable. Decision trees generate understandable rules, can classify with minimal computation, use easy calculations, can deal with both continuous and categorical variables, and provide a clear indication of variable importance in prediction and classification. Despite the disadvantages noted in Table 2.4, the decision tree is a good choice when the data mining task is classification of records or prediction of outcomes.
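The "understandable rules" advantage is easy to see by printing a fitted tree. A sketch on synthetic data (the dataset and the depth limit are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic classification data with four candidate variables.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# A shallow tree keeps the rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Prints readable if-then splits, one variable per node.
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
print("variable importance:", tree.feature_importances_)
```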

Summary

Data mining applications are widespread. This chapter sought to give concrete examples of some of the major business applications of data mining. We began with a review of Fingerhut data mining to support catalog sales. That application was an excellent demonstration of the concept of lift applied to retail business. We also reviewed five other major business applications, intentionally trying to demonstrate a variety of different functions, statistical techniques, and data mining methods. Most of those studies applied multiple algorithms (data mining methods). Software such as Enterprise Miner has a variety of algorithms available, encouraging data miners to find the method that works best for a specific set of data.

The second portion of the book seeks to demonstrate these methods with small examples that can be run in Excel or other simple spreadsheet packages with statistical support. Businesses can often conduct data mining without purchasing large-scale data mining software. Our philosophy, therefore, is that it is useful to understand what the methods are doing; this also gives users a better understanding of what they are doing when applying data mining.
