Chapter 23
Association Rules

23.1 Affinity Analysis and Market Basket Analysis

Affinity analysis is the study of attributes or characteristics that “go together.” Methods for affinity analysis, also known as market basket analysis, seek to uncover associations among these attributes; that is, they seek to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form “If antecedent, then consequent,” along with a measure of the support and confidence associated with the rule. For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night, 200 bought diapers, and of the 200 who bought diapers, 50 bought beer. Thus, the association rule would be: “If buy diapers, then buy beer,” with a support of 50/1000 = 5% and a confidence of 50/200 = 25%.

Examples of association tasks in business and research include

  • investigating the proportion of subscribers to your company's cell phone plan that respond positively to an offer of a service upgrade;
  • examining the proportion of children whose parents read to them who are themselves good readers;
  • predicting degradation in telecommunications networks;
  • finding out which items in a supermarket are purchased together, and which items are never purchased together;
  • determining the proportion of cases in which a new drug will exhibit dangerous side effects.

What types of algorithms can we apply to mine association rules from a particular data set? The daunting problem that awaits any such algorithm is the curse of dimensionality: The number of possible association rules grows exponentially in the number of attributes. Specifically, if there are k attributes, we limit ourselves to binary attributes, and we account only for the positive cases (e.g., buy diapers = yes), then there are on the order of k · 2^(k−1) possible association rules.1 Considering that a typical application for association rules is market basket analysis, and that there may be thousands of binary attributes (buy beer? buy popcorn? buy milk? buy bread? etc.), the search problem appears at first glance to be utterly hopeless. For example, suppose that a tiny convenience store has only 100 different items, and a customer could either buy or not buy any combination of those 100 items. Then there are about 100 · 2^99 ≈ 6.3 × 10^31 possible association rules that await your intrepid search algorithm.
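To get a feel for how quickly this count explodes, the short sketch below (plain Python, offered only as an illustration of the arithmetic above) evaluates k · 2^(k−1) for a few values of k, including the 100-item convenience store.

```python
# Order-of-magnitude count of possible association rules for k binary attributes,
# restricted to positive cases, using the estimate k * 2^(k-1) discussed above.
for k in (5, 10, 20, 100):
    print(f"k = {k:>3}: roughly {k * 2 ** (k - 1):.1e} possible rules")
```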

The a priori algorithm for mining association rules, however, takes advantage of structure within the rules themselves to reduce the search problem to a more manageable size. Before we examine the a priori algorithm, let us consider some basic concepts and notation for association rule mining. We begin with a simple example.

Suppose that a local farmer has set up a roadside vegetable stand and is offering the following items for sale: {asparagus, beans, broccoli, corn, green peppers, squash, tomatoes}. Denote this set of items as I. One by one, customers pull over, pick up a basket, and purchase various combinations of these items, subsets of I. (For our purposes, we do not keep track of how much of each item is purchased, just whether or not that particular item is purchased.) Suppose Table 23.1 lists the transactions made during one fine fall afternoon at this roadside vegetable stand.

Table 23.1 Transactions made at the roadside vegetable stand

Transaction Items Purchased
1 Broccoli, green peppers, corn
2 Asparagus, squash, corn
3 Corn, tomatoes, beans, squash
4 Green peppers, corn, tomatoes, beans
5 Beans, asparagus, broccoli
6 Squash, asparagus, beans, tomatoes
7 Tomatoes, corn
8 Broccoli, tomatoes, green peppers
9 Squash, asparagus, beans
10 Beans, corn
11 Green peppers, broccoli, beans, squash
12 Asparagus, beans, squash
13 Squash, corn, asparagus, beans
14 Corn, green peppers, tomatoes, beans, broccoli

23.1.1 Data Representation for Market Basket Analysis

There are two principal methods of representing this type of market basket data: the transactional data format and the tabular data format. The transactional data format requires only two fields, an ID field and a content field, with each record representing a single item only. For example, the data in Table 23.1 could be represented using the transactional data format as shown in Table 23.2.

Table 23.2 Transactional data format for the roadside vegetable stand data

Transaction ID Items
1 Broccoli
1 Green peppers
1 Corn
2 Asparagus
2 Squash
2 Corn
3 Corn
3 Tomatoes
⋮ ⋮

In the tabular data format, each record represents a separate transaction, with as many 0/1 flag fields as there are items. The data from Table 23.1 could be represented using the tabular data format, as shown in Table 23.3.

Table 23.3 Tabular data format for the roadside vegetable stand data

Transaction Asparagus Beans Broccoli Corn Green Peppers Squash Tomatoes
1 0 0 1 1 1 0 0
2 1 0 0 1 0 1 0
3 0 1 0 1 0 1 1
4 0 1 0 1 1 0 1
5 1 1 1 0 0 0 0
6 1 1 0 0 0 1 1
7 0 0 0 1 0 0 1
8 0 0 1 0 1 0 1
9 1 1 0 0 0 1 0
10 0 1 0 1 0 0 0
11 0 1 1 0 1 1 0
12 1 1 0 0 0 1 0
13 1 1 0 1 0 1 0
14 0 1 1 1 1 0 1
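If the data arrive in transactional format, converting them to the tabular 0/1 layout of Table 23.3 is a short pivot operation. The sketch below, in Python with pandas, is one possible way to do this; the DataFrame and its column names (transaction_id, item) are illustrative assumptions, not part of the original data files.

```python
import pandas as pd

# Transactional format: one row per (transaction, item) pair, as in Table 23.2.
transactional = pd.DataFrame({
    "transaction_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "item": ["Broccoli", "Green peppers", "Corn",
             "Asparagus", "Squash", "Corn",
             "Corn", "Tomatoes", "Beans", "Squash"],
})

# Tabular format: one row per transaction, one 0/1 flag per item, as in Table 23.3.
tabular = (pd.crosstab(transactional["transaction_id"], transactional["item"])
             .clip(upper=1))
print(tabular)
```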

23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property

Let D be the set of transactions represented in Table 23.1, where each transaction T in D represents a set of items contained in I. Suppose that we have a particular set of items A (e.g., beans and squash), and another set of items B (e.g., asparagus). Then an association rule takes the form if A, then B (i.e., A ⇒ B), where the antecedent A and the consequent B are proper subsets of I, and A and B are mutually exclusive. This definition would exclude, for example, trivial rules such as if beans and squash, then beans.

The support s for a particular association rule A ⇒ B is the proportion of transactions in D that contain both A and B. That is,

$$s = \text{support} = P(A \cap B) = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{total number of transactions in } D}$$

The confidence c of the association rule A ⇒ B is a measure of the accuracy of the rule, as determined by the percentage of transactions in D containing A that also contain B. In other words,

$$c = \text{confidence} = P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{number of transactions containing } A}$$

Analysts may prefer rules that have either high support or high confidence, and usually both. Strong rules are those that meet or surpass certain minimum support and confidence criteria. For example, an analyst interested in finding which supermarket items are purchased together may set a minimum support level of 20% and a minimum confidence level of 70%. However, a fraud detection analyst or a terrorism detection analyst would need to reduce the minimum support level to 1% or less, because comparatively few transactions are either fraudulent or terror-related.

An itemset is a set of items contained in I, and a k-itemset is an itemset containing k items. For example, {beans, squash} is a 2-itemset, and {broccoli, green peppers, corn} is a 3-itemset, each from the vegetable stand set I. The itemset frequency is simply the number of transactions that contain the particular itemset. A frequent itemset is an itemset that occurs at least a certain minimum number of times, having itemset frequency ≥ φ. For example, suppose that we set φ = 4. Then itemsets that occur four or more times are said to be frequent. We denote the set of frequent k-itemsets as Fk.
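These two definitions are easy to compute directly. The following sketch (Python, written only to illustrate the definitions above) encodes the transactions of Table 23.1 as sets and evaluates the support and confidence of a sample rule; the helper names support and confidence are our own.

```python
# Transactions from Table 23.1, each represented as a set of items.
transactions = [
    {"broccoli", "green peppers", "corn"},
    {"asparagus", "squash", "corn"},
    {"corn", "tomatoes", "beans", "squash"},
    {"green peppers", "corn", "tomatoes", "beans"},
    {"beans", "asparagus", "broccoli"},
    {"squash", "asparagus", "beans", "tomatoes"},
    {"tomatoes", "corn"},
    {"broccoli", "tomatoes", "green peppers"},
    {"squash", "asparagus", "beans"},
    {"beans", "corn"},
    {"green peppers", "broccoli", "beans", "squash"},
    {"asparagus", "beans", "squash"},
    {"squash", "corn", "asparagus", "beans"},
    {"corn", "green peppers", "tomatoes", "beans", "broccoli"},
]

def support(antecedent, consequent, transactions):
    """Proportion of transactions containing every item in A union B."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Proportion of transactions containing A that also contain B."""
    n_a = sum(antecedent <= t for t in transactions)
    n_ab = sum((antecedent | consequent) <= t for t in transactions)
    return n_ab / n_a

print(support({"beans"}, {"squash"}, transactions))      # 6/14 ≈ 0.429
print(confidence({"beans"}, {"squash"}, transactions))   # 6/10 = 0.600
```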

23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets

Consider the set of transactions D represented in Table 23.1. How would the a priori algorithm mine association rules from this data set?

Let φ = 4, so that an itemset is frequent if it occurs four or more times in D. We first find F1, the frequent 1-itemsets, which represent simply the individual vegetable items themselves. To do so, we may turn to Table 23.3 and take the column sums, which give us the number of transactions containing each particular vegetable. As each sum meets or exceeds φ = 4, we conclude that each 1-itemset is frequent. Thus, F1 = {asparagus, beans, broccoli, corn, green peppers, squash, tomatoes}.

Next, we turn to finding the frequent 2-itemsets. In general, to find Fk, the a priori algorithm first constructs a set Ck of candidate k-itemsets by joining Fk−1 with itself. Then it prunes Ck using the a priori property. The itemsets in Ck that survive the pruning step then form Fk. Here, C2 consists of all the combinations of vegetables in Table 23.4.

Table 23.4 Candidate 2-itemsets

Combination Count Combination Count
Asparagus, beans 5 Broccoli, corn 2
Asparagus, broccoli 1 Broccoli, green peppers 4
Asparagus, corn 2 Broccoli, squash 1
Asparagus, green peppers 0 Broccoli, tomatoes 2
Asparagus, squash 5 Corn, green peppers 3
Asparagus, tomatoes 1 Corn, squash 3
Beans, broccoli 3 Corn, tomatoes 4
Beans, corn 5 Green peppers, squash 1
Beans, green peppers 3 Green peppers, tomatoes 3
Beans, squash 6 Squash, tomatoes 2
Beans, tomatoes 4

As φ = 4, we have F2 = { {asparagus, beans}, {asparagus, squash}, {beans, corn}, {beans, squash}, {beans, tomatoes}, {broccoli, green peppers}, {corn, tomatoes} }. Next, we use the frequent itemsets in F2 to generate C3, the candidate 3-itemsets. To do so, we join F2 with itself, where itemsets are joined if they have the first k − 1 items in common (in alphabetical order). For example, {asparagus, beans} and {asparagus, squash} have their first k − 1 = 1 item in common, asparagus. Thus, they are joined into the new candidate itemset {asparagus, beans, squash}. Similarly, {beans, corn} and {beans, squash} have the first item, beans, in common, generating the candidate 3-itemset {beans, corn, squash}. Finally, candidate 3-itemsets {beans, corn, tomatoes} and {beans, squash, tomatoes} are generated in like manner. Thus, C3 = { {asparagus, beans, squash}, {beans, corn, squash}, {beans, corn, tomatoes}, {beans, squash, tomatoes} }.

C3 is then pruned, using the a priori property. For each itemset s in C3, its subsets of size k − 1 are generated and examined. If any of these subsets is not frequent, s cannot be frequent and is therefore pruned. For example, consider s = {asparagus, beans, squash}. The subsets of size k − 1 = 2 are generated, as follows: {asparagus, beans}, {asparagus, squash}, and {beans, squash}. From Table 23.4, we see that each of these subsets is frequent, and therefore s = {asparagus, beans, squash} is not pruned. The reader may verify that s = {beans, corn, tomatoes} will also not be pruned.

However, consider s = {beans, corn, squash}. The subset {corn, squash} has frequency 3 < 4 = φ, so that {corn, squash} is not frequent. By the a priori property, therefore, {beans, corn, squash} cannot be frequent, is therefore pruned, and does not appear in F3. Also consider s = {beans, squash, tomatoes}. The subset {squash, tomatoes} has frequency 2 < 4 = φ, and hence is not frequent. Again, by the a priori property, its superset {beans, squash, tomatoes} cannot be frequent and is also pruned, not appearing in F3.

We still need to check the count for these candidate frequent itemsets. The itemset {asparagus, beans, squash} occurs four times in the transaction list, but {beans, corn, tomatoes} occurs only three times. Therefore, the latter candidate itemset is also pruned, leaving us with a singleton frequent itemset in F3: {asparagus, beans, squash}. This completes the task of finding the frequent itemsets for the vegetable stand data D.
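The join–prune–count cycle just carried out by hand can be sketched compactly in code. The snippet below continues the previous sketch (it assumes the transactions list defined there) and is a plain illustration of the idea rather than an optimized a priori implementation; its join step simply unions pairs of frequent (k − 1)-itemsets, relying on the prune step to discard invalid candidates.

```python
from itertools import combinations

def frequent_itemsets(transactions, phi):
    """Return {k: set of frozensets} of all frequent k-itemsets (count >= phi)."""
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted(set().union(*transactions))
    frequent = {1: {frozenset([item]) for item in items if count({item}) >= phi}}
    k = 2
    while frequent[k - 1]:
        # Join step: union pairs of frequent (k-1)-itemsets that overlap in k-2 items.
        candidates = {a | b for a in frequent[k - 1] for b in frequent[k - 1]
                      if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent[k - 1]
                             for sub in combinations(c, k - 1))}
        # Count step: keep only candidates whose frequency meets the threshold phi.
        frequent[k] = {c for c in candidates if count(c) >= phi}
        k += 1
    del frequent[k - 1]          # drop the final, empty level
    return frequent

for k, itemsets in frequent_itemsets(transactions, phi=4).items():
    print(f"F{k}:", sorted(sorted(s) for s in itemsets))
```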

23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules

Next, we turn to the task of generating association rules using the frequent itemsets. This is accomplished using the following two-step process, for each frequent itemset s:

  1. Generate all proper, nonempty subsets of s.
  2. For each such subset ss, form the candidate rule R: ss ⇒ (s − ss), where (s − ss) denotes the itemset s with the items in ss removed, and report R if it meets the minimum confidence requirement.

For example, consider s = {asparagus, beans, squash} from F3. The proper subsets of s are {asparagus}, {beans}, {squash}, {asparagus, beans}, {asparagus, squash}, and {beans, squash}. For the first association rule shown in Table 23.5, we consider ss = {asparagus, beans}, so that (s − ss) = {squash}. We consider the rule R: {asparagus, beans} ⇒ {squash}. The support is the proportion of transactions in which both {asparagus, beans} and {squash} occur, which is 4 (or 28.6%) of the 14 total transactions in D. To find the confidence, we note that {asparagus, beans} occurs in 5 of the 14 transactions, four of which also contain {squash}, giving us our confidence of 4/5 = 80%. The statistics for the second rule in Table 23.5 arise similarly. For the third rule in Table 23.5, the support is still 4/14 = 28.6%, but the confidence falls to 66.7%. This is because {beans, squash} occurs in six transactions, four of which also contain {asparagus}. Assuming that our minimum confidence criterion is set at 60% and that we desire a single consequent, we therefore have the candidate rules shown in Table 23.5. If our minimum confidence were set at 80%, the third rule would not be reported.

Table 23.5 Candidate association rules for vegetable stand data: two antecedents

If Antecedent, then Consequent Support Confidence
If buy asparagus and beans, then buy squash 4/14 = 28.6% 4/5 = 80%
If buy asparagus and squash, then buy beans 4/14 = 28.6% 4/5 = 80%
If buy beans and squash, then buy asparagus 4/14 = 28.6% 4/6 = 66.7%

Finally, we turn to single-antecedent/single-consequent rules. Applying the two-step rule generation process outlined above, and using the itemsets in F2, we may generate the candidate association rules shown in Table 23.6.

Table 23.6 Candidate association rules for vegetable stand data: one antecedent

If Antecedent, then Consequent Support Confidence
If buy asparagus, then buy beans 5/14 = 35.7% 5/6 = 83.3%
If buy beans, then buy asparagus 5/14 = 35.7% 5/10 = 50.0%
If buy asparagus, then buy squash 5/14 = 35.7% 5/6 = 83.3%
If buy squash, then buy asparagus 5/14 = 35.7% 5/7 = 71.4%
If buy beans, then buy corn 5/14 = 35.7% 5/10 = 50.0%
If buy corn, then buy beans 5/14 = 35.7% 5/8 = 62.5%
If buy beans, then buy squash 6/14 = 42.9% 6/10 = 60.0%
If buy squash, then buy beans 6/14 = 42.9% 6/7 = 85.7%
If buy beans, then buy tomatoes 4/14 = 28.6% 4/10 = 40.0%
If buy tomatoes, then buy beans 4/14 = 28.6% 4/6 = 66.7%
If buy broccoli, then buy green peppers 4/14 = 28.6% 4/5 = 80.0%
If buy green peppers, then buy broccoli 4/14 = 28.6% 4/5 = 80.0%
If buy corn, then buy tomatoes 4/14 = 28.6% 4/8 = 50.0%
If buy tomatoes, then buy corn 4/14 = 28.6% 4/6 = 66.7%

To provide an overall measure of usefulness for an association rule, analysts sometimes multiply the support times the confidence. This allows the analyst to rank the rules according to a combination of prevalence and accuracy. Table 23.7 provides such a list for our present data set, after first filtering the rules through a minimum confidence level of 80%.

Table 23.7 Final list of association rules for vegetable stand data: ranked by support × confidence, minimum confidence 80%

If Antecedent, then Consequent Support Confidence Support × Confidence
If buy squash, then buy beans 6/14 = 42.9% 6/7 = 85.7% 0.3677
If buy asparagus, then buy beans 5/14 = 35.7% 5/6 = 83.3% 0.2974
If buy asparagus, then buy squash 5/14 = 35.7% 5/6 = 83.3% 0.2974
If buy broccoli, then buy green peppers 4/14 = 28.6% 4/5 = 80.0% 0.2288
If buy green peppers, then buy broccoli 4/14 = 28.6% 4/5 = 80.0% 0.2288
If buy asparagus and beans, then buy squash 4/14 = 28.6% 4/5 = 80.0% 0.2288
If buy asparagus and squash, then buy beans 4/14 = 28.6% 4/5 = 80.0% 0.2288
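The two-step rule-generation process, the confidence filter, and the support × confidence ranking can likewise be sketched in a few lines. The code below continues the earlier sketches (it assumes the transactions list, the support and confidence helpers, and the frequent_itemsets function defined there); with φ = 4 and 80% minimum confidence it should reproduce the seven rules of Table 23.7.

```python
from itertools import chain, combinations

def generate_rules(transactions, phi, min_conf):
    """Return (antecedent, consequent, support, confidence) for all strong rules."""
    rules = []
    for k, itemsets in frequent_itemsets(transactions, phi).items():
        if k < 2:
            continue
        for s in itemsets:
            # Every proper, nonempty subset ss of s is a candidate antecedent.
            subsets = chain.from_iterable(combinations(s, r) for r in range(1, k))
            for ss in map(frozenset, subsets):
                antecedent, consequent = ss, s - ss
                conf = confidence(antecedent, consequent, transactions)
                if conf >= min_conf:
                    rules.append((sorted(antecedent), sorted(consequent),
                                  support(antecedent, consequent, transactions), conf))
    # Rank the surviving rules by support x confidence, as in Table 23.7.
    return sorted(rules, key=lambda r: r[2] * r[3], reverse=True)

for ante, cons, supp, conf in generate_rules(transactions, phi=4, min_conf=0.80):
    print(f"If {ante} then {cons}: support {supp:.3f}, confidence {conf:.3f}")
```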

Compare Table 23.7 with Figure 23.1, the association rules reported by Modeler's version of the a priori algorithm, with minimum 80% confidence, and sorted by support × confidence. The third column, which Modeler calls “Support %,” is actually not what we defined support to be in this chapter (following Han and Kamber,2 Hand et al.,3 and other texts). Instead, what Modeler calls “support” is the proportion of occurrences of the antecedent alone, rather than of the antecedent and the consequent together. To find the actual support for an association rule from the Modeler results, multiply the reported “support” by the reported confidence. For example, Modeler reports 50% support and 85.714% confidence for the first association rule, but this really means 0.50 × 0.85714 ≈ 42.9% support, according to the generally accepted definition of support. Be careful when reading Figure 23.1, because it reports the consequent before the antecedent. Apart from the “support” anomaly, the software's association rules shown in Figure 23.1 represent the same rules as those we found step by step, and by hand, for the vegetable stand data.


Figure 23.1 Association rules for vegetable stand data, generated by Modeler.

Armed with this knowledge, the vegetable stand entrepreneur can deploy marketing strategies that take advantage of the patterns uncovered above. Why do these particular products co-occur in customers' market baskets? Should the product layout be altered to make it easier for customers to purchase these products together? Should personnel be alerted to remind customers not to forget item B when purchasing associated item A?

23.5 Extension From Flag Data to General Categorical Data

Thus far, we have examined association rules using flag data types only. That is, all of the vegetable stand attributes took the form of Boolean 0/1 flags, resulting in the tabular data format found in Table 23.3, reflecting a straightforward market basket analysis problem. However, association rules are not restricted to flag data types. In particular, the a priori algorithm can be applied to categorical data in general. Let us look at an example.

Recall the normalized adult data set analyzed in Chapters 8 and 9. Here, we apply the a priori algorithm to that same data set using Modeler, with the predictor variables marital status, sex, and work class, and the target variable income. Minimum support of 15%, minimum confidence of 80%, and a maximum of two antecedents are specified, with the resulting association rules shown in Figure 23.2.


Figure 23.2 Association rules for categorical attributes found by the a priori algorithm.

Some of these rules contain the nominal variables Marital status and Work class, each of which contains several values, so that these attributes are truly non-flag categorical attributes. The a priori algorithm simply finds the frequent itemsets just as before, this time counting the occurrences of the values of the categorical variables rather than simply the occurrence of the flag.

For example, consider the second rule reported in Figure 23.2: “If Marital status = Never-married, then income <=50K,” with confidence 95.319%. There were 8225 instances in the data set where the attribute Marital status took the value Never-married, which represents 32.9% of the number of records in the data set. (Again, Modeler refers to this as the “support,” which is not how most researchers define that term.) The support for this rule is (0.329)(0.95319) = 0.3136. That is, 31.362% of the records contained the value Never-married for Marital status and the value “<=50K” for income, thus making this pairing a frequent 2-itemset of categorical attributes.

23.6 Information-Theoretic Approach: Generalized Rule Induction Method

The structure of association rules, where the antecedent and consequent are both Boolean statements, makes them particularly well suited for handling categorical data, as we have seen. However, what happens when we try to extend our association rule mining to a broader range of data, specifically numerical attributes?

Of course, it is always possible to discretize the numerical attributes, for example, by arbitrarily defining income under $30,000 as low, income over $70,000 as high, and other income as medium. Also, we have seen how both C4.5 and CART handle numerical attributes by discretizing the numerical variables at favorable locations. Unfortunately, the a priori algorithm is not well equipped to handle numeric attributes unless they are discretized during preprocessing. Discretization, however, can lead to a loss of information, so if the analyst has numerical inputs and prefers not to discretize them, he or she may choose to apply an alternative method for mining association rules: generalized rule induction (GRI). The GRI methodology can handle either categorical or numerical variables as inputs, but it still requires categorical variables as outputs.

GRI was introduced by Smyth and Goodman in 1992.4 Rather than using frequent itemsets, GRI applies an information-theoretic approach (as did the C4.5 decision tree algorithm) to determining the “interestingness” of a candidate association rule.

23.6.1 J-Measure

Specifically, GRI applies the J-measure:

$$J = p(x)\left[\,p(y \mid x)\,\ln\!\left(\frac{p(y \mid x)}{p(y)}\right) + \bigl(1 - p(y \mid x)\bigr)\,\ln\!\left(\frac{1 - p(y \mid x)}{1 - p(y)}\right)\right]$$

where

  • p(x) represents the probability or confidence of the observed value of x. This is a measure of the coverage of the antecedent. How prevalent is this value of the antecedent attribute? You can calculate p(x) using a frequency distribution for the variable in the antecedent.
  • p(y) represents the prior probability or confidence of the value of y. This is a measure of the prevalence of the observed value of y in the consequent. You can calculate p(y) using a frequency distribution for the variable in the consequent.
  • p(y|x) represents the conditional probability, or posterior confidence, of y given that x has occurred. This is a measure of the probability of the observed value of y given that this value of x has occurred. That is, p(y|x) represents an updated probability of observing this value of y after taking into account the additional knowledge of the value of x. In association rule terminology, p(y|x) is measured directly by the confidence of the rule.
  • ln represents the natural log function (log to the base e).


For rules with more than one antecedent, p(x) is considered to be the probability of the conjunction of the variable values in the antecedent.
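A small sketch of the J-measure as written above; the three probabilities would be estimated from the data as described in the bulleted list, and the example values passed at the end are purely illustrative.

```python
from math import log

def j_measure(p_x, p_y, p_y_given_x):
    """J-measure of a rule x => y, using the natural logarithm."""
    term1 = p_y_given_x * log(p_y_given_x / p_y) if p_y_given_x > 0 else 0.0
    term2 = ((1 - p_y_given_x) * log((1 - p_y_given_x) / (1 - p_y))
             if p_y_given_x < 1 else 0.0)
    return p_x * (term1 + term2)

# Illustrative values only: an antecedent covering 30% of records, a consequent
# with prior probability 33%, and a rule confidence of 60%.
print(j_measure(p_x=0.30, p_y=0.33, p_y_given_x=0.60))
```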

As usual, the user specifies desired minimum support and confidence criteria. For GRI, however, the user also specifies how many association rules he or she would like to be reported, thereby defining the size of an association rule table referenced by the algorithm. The GRI algorithm then generates single-antecedent association rules, and calculates J, the value of the J-measure for the rule. If the “interestingness” of the new rule, as quantified by the J-measure, is higher than the current minimum J in the rule table, the new rule is inserted into the rule table, which keeps a constant size by eliminating the rule with minimum J. More specialized rules with more antecedents are then considered.

How can the behavior of the J-measure be described? Clearly (as p(x) sits outside the brackets), higher values of J will be associated with higher values of p(x). That is, the J-measure will tend to favor those rules whose antecedent value is more prevalent, reflecting higher coverage in the data set. Also, the J-measure tends toward higher values when p(y) and p(y|x) are more extreme (near 0 or 1). Hence, the J-measure will also tend to favor those rules whose consequent probability, p(y), is more extreme, or whose rule confidence, p(y|x), is more extreme.

The J-measure favors rules with either very high or very low confidence. Why would we be interested in an association rule with extremely low confidence? For example, suppose that we have a rule R: If buy beer, then buy fingernail polish, with confidence p(y|x) = 0.01%, which would presumably be favored by the J-measure, because the confidence is so low. The analyst could then consider the negative form of R: If buy beer, then NOT buy fingernail polish, with confidence 99.99%. Although such negative rules are often interesting (“I guess we better move that fingernail polish out of the beer section…”), they are often not directly actionable.

23.7 Association Rules Are Easy to Do Badly

Association rules need to be applied with care, because their results are sometimes deceptive. Let us look at an example. Turning back to the a priori algorithm, we asked Modeler to mine association rules from the adult database using 10% minimum support, 60% minimum confidence, and a maximum of two antecedents. One association rule from the results is shown in Figure 23.3.


Figure 23.3 An association rule that is worse than useless.

The results (not shown) include the following association rule: If Work_Class = Private, then Sex = Male, with 65.63% confidence. Marketing analysts interested in small business owners might be tempted to use this association rule in support of a new marketing strategy aimed at males. However, seen in its proper light, this rule may in fact be worse than useless.

One needs to take into account the raw (prior) proportion of males in the data set, which in this case is 66.84%. In other words, applying this association rule actually reduces the probability of randomly selecting a male from 0.6684 to 0.6563. You would have been better advised to pull a name out of a hat from the entire data set than apply this rule.

Why, then, if the rule is so useless, did the software report it? The quick answer is that the default ranking mechanism for Modeler's a priori algorithm is confidence. However, it needs to be emphasized here that data miners should never simply believe the computer output without making the effort to understand the models and mechanisms underlying the results. With the onset of sophisticated point-and-click data mining software, poor analysis costing millions of dollars is more prevalent than ever. In a word, data mining is easy to do badly. Insightful human expertise and constant human vigilance are required to translate the nuggets hidden in the database into actionable and profitable results.

With association rules, one needs to keep in mind the prior probabilities involved. To illustrate, we now ask Modeler to provide us with a priori association rules, but this time using the confidence difference as the evaluative measure. Here, rules are favored that provide the greatest increase in confidence from the prior to the posterior. One such association rule is shown in Figure 23.4: If Marital status = Divorced, then Sex = Female. The data set contains 33.16% females, so an association rule that can identify females with 60.029% confidence is useful. The confidence difference for this association rule is 0.60029 − 0.3316 = 0.26869 between the prior and posterior confidences.


Figure 23.4 This association rule is useful, because the posterior probability (0.60029) is much greater than the prior probability (0.3316).

Alternatively, analysts may prefer to use the confidence ratio to evaluate potential rules. This is defined as

$$\text{confidence ratio} = 1 - \min\!\left(\frac{p(y \mid x)}{p(y)},\ \frac{p(y)}{p(y \mid x)}\right)$$

For example, for the rule: If Marital status = Divorced, then Sex = Female, we have p(y) = 0.3316 and p(y|x) = 0.60029, so that

$$\min\!\left(\frac{p(y \mid x)}{p(y)},\ \frac{p(y)}{p(y \mid x)}\right) = \frac{0.3316}{0.60029} = 0.5524$$

and the confidence ratio equals 1 − 0.5524 = 0.4476. In the exercises, we explore further the differences among these rule selection criteria.
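For concreteness, here is a minimal sketch of the two evaluation measures discussed in this section, applied to the divorced/female figures above. The formulas follow the definitions as presented here; treating them as exact matches for Modeler's internal computations is an assumption.

```python
def confidence_difference(prior, posterior):
    """Increase (or decrease) from prior confidence p(y) to posterior p(y|x)."""
    return posterior - prior

def confidence_ratio(prior, posterior):
    """1 minus the smaller of p(y|x)/p(y) and p(y)/p(y|x), as defined above."""
    return 1 - min(posterior / prior, prior / posterior)

prior, posterior = 0.3316, 0.60029   # If Marital status = Divorced, then Sex = Female
print(confidence_difference(prior, posterior))   # ≈ 0.2687
print(confidence_ratio(prior, posterior))        # ≈ 0.4476
```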

23.8 How Can We Measure the Usefulness of Association Rules?

As we have seen, not all association rules are equally useful. Here we are introduced to a measure that can quantify the usefulness of an association rule: lift. We define lift as follows:

$$\text{lift}(A \Rightarrow B) = \frac{\text{confidence of the rule } A \Rightarrow B}{\text{prior proportion of the consequent } B} = \frac{p(B \mid A)}{p(B)}$$

Recall the supermarket example where, of 1000 customers, 200 bought diapers, and of these 200 customers who bought diapers, 50 also bought beer. The prior proportion of those who bought beer is 50/1000 = 5%, while the rule confidence is 50/200 = 25%. Therefore, the lift for the association rule, “If buy diapers, then buy beer,” is

$$\text{lift} = \frac{0.25}{0.05} = 5$$

This may be interpreted as “Customers who buy diapers are five times as likely to buy beer as customers from the entire data set.” Clearly, this association rule would be useful to a store manager wishing to sell more diapers. Next, suppose that 40 of the 1000 customers bought expensive makeup, whereas, of the 200 customers who bought diapers, only 5 bought expensive makeup. In this case, the lift for the association rule “If buy diapers, then buy expensive makeup” is

$$\text{lift} = \frac{5/200}{40/1000} = \frac{0.025}{0.04} = 0.625$$

So, customers who buy diapers are only 62.5% as likely to buy expensive makeup as customers in the entire data set.

In general, association rules with lift values different from 1 will be more interesting and useful than rules with lift values close to 1. Why are rules with lift values close to 1 not useful? Recall the definition of confidence for the association rule “If A then B”:

$$\text{confidence}(A \Rightarrow B) = p(B \mid A) = \frac{p(A \cap B)}{p(A)}$$

Then, to obtain lift, we divide by the prior probability of the consequent B, giving us:

$$\text{lift}(A \Rightarrow B) = \frac{p(B \mid A)}{p(B)} = \frac{p(A \cap B)}{p(A)\,p(B)}$$

Now, events A and B are independent when p(A ∩ B) = p(A) p(B). Thus, the ratio p(A ∩ B)/[p(A) p(B)] being close to 1 implies that A and B are approximately independent events, meaning that knowledge of the occurrence of A does not alter the probability of the occurrence of B. Such relationships are not useful from a data mining perspective, and thus it makes sense that we prefer our association rules to have a lift value different from 1.
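A brief sketch of the lift calculation from raw counts, reproducing the diapers/beer and diapers/makeup arithmetic of this section (it assumes, as the numbers above do, that the 50 beer purchases and 40 makeup purchases are the data-set-wide totals).

```python
def lift(n_total, n_antecedent, n_consequent, n_both):
    """Lift = rule confidence p(B|A) divided by the prior proportion p(B)."""
    confidence = n_both / n_antecedent
    prior = n_consequent / n_total
    return confidence / prior

# 1000 customers; 200 bought diapers; 50 bought beer, all of them diaper buyers.
print(lift(n_total=1000, n_antecedent=200, n_consequent=50, n_both=50))   # 5.0
# 40 bought expensive makeup; 5 of the 200 diaper buyers bought it.
print(lift(n_total=1000, n_antecedent=200, n_consequent=40, n_both=5))    # 0.625
```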

23.9 Do Association Rules Represent Supervised or Unsupervised Learning?

Before we leave the subject of association rules, let us touch on a few topics of interest. First, we may ask whether association rules represent supervised or unsupervised learning. Recall that most data mining methods represent supervised learning, because (i) a target variable is prespecified, and (ii) the algorithm is provided with a rich collection of examples where possible association between the target variable and the predictor variables may be uncovered. Conversely, in unsupervised learning, no target variable is identified explicitly. Rather, the data mining algorithm searches for patterns and structure among all the variables. Clustering is perhaps the most common unsupervised data mining method.

Association rule mining, however, can be applied in either a supervised or an unsupervised manner. In market basket analysis, for example, one may simply be interested in “which items are purchased together,” in which case no target variable would be identified. However, some data sets are naturally structured so that a particular variable fulfills the role of a consequent, and not an antecedent (see the play example in the exercises). For example, suppose that political pollsters have collected demographic data in their exit polling, along with the subject's voting preference. In this case, association rules could be mined from this data set, where the demographic information could represent possible antecedents, and the voting preference could represent the single consequent of interest. In this way, association rules could be used to help classify the voting preferences of citizens with certain demographic characteristics, in a supervised learning process.

Thus, the answer to the question is that association rules, while generally used for unsupervised learning, may also be applied for supervised learning for a classification task.

23.10 Local Patterns Versus Global Models

Finally, data analysts need to consider the difference between models and patterns. A model is a global description or explanation of a data set, taking a high-level perspective. Models may be descriptive or inferential. Descriptive models seek to summarize the entire data set in a succinct manner. Inferential models aim to provide a mechanism that enables the analyst to generalize from samples to populations. Either way, the perspective is global, encompassing the entire data set. However, patterns are essentially local features of the data. Recognizable patterns may in fact hold true for only a few variables or a fraction of the records in the data.

Most of the modeling methods we have covered have dealt with global model building. Association rules, however, are particularly well suited to uncovering local patterns in the data. As soon as one applies the if clause in an association rule, one is partitioning the data set so that, usually, most of the records do not apply. Applying the if clause “drills down” deeper into the data set, with the aim of uncovering a hidden local pattern that may or may not be relevant to the bulk of the data.

For example, consider the following association rule from Figure 23.4: If Marital status = Divorced, then Sex = Female, with confidence 60.029%. We see that this association rule applies to only 13.74% of the records and ignores the remaining 86.26% of the data set. Even among this minority of records, the association rule ignores most of the variables, concentrating on only two. Therefore, this association rule cannot claim to be global and cannot be considered a model in the strict sense. It represents a pattern that is local to these records and variables only.

Then again, finding interesting local patterns is one of the most important goals of data mining. Sometimes, uncovering a pattern within the data can lead to the deployment of new and profitable initiatives. For example, recall from the churn data set (Chapter 3) that those customers who belonged to the VoiceMail Plan were at considerably lower risk of churning than other customers (see Figure 23.5). This finding affected only 922 (27.663%) of the 3333 records and only two of the variables, and is thus to be considered a local pattern. Nevertheless, the discovery of this nugget could lead to policy changes that, if properly deployed, could lead to increased profits for the cell phone company.


Figure 23.5 Profitable pattern: VoiceMail Plan adopters less likely to churn.


Exercises

1. Describe the two main methods of representing market basket data. What are the benefits and drawbacks of each?

2. Describe support and confidence. Express the formula for confidence using support.

3. Restate the a priori property in your own words.

For the following several exercises, consider the following data set from Quinlan1 shown as Table 23.8. The goal is to develop association rules using the a priori algorithm for trying to predict when a certain (evidently indoor) game may be played. Therefore, unlike the vegetable stand example, we may restrict our itemset search to items that include the attribute play.

Table 23.8 Weather data set for association rule mining

No. Outlook Temperature Humidity Windy Play
1 Sunny Hot High False No
2 Sunny Hot High True No
3 Overcast Hot High False Yes
4 Rain Mild High False Yes
5 Rain Cool Normal False Yes
6 Rain Cool Normal True No
7 Overcast Cool Normal True Yes
8 Sunny Mild High False No
9 Sunny Cool Normal False Yes
10 Rain Mild Normal False Yes
11 Sunny Mild Normal True Yes
12 Overcast Mild High True Yes
13 Overcast Hot Normal False Yes
14 Rain Mild High True No

4. Let φ = 3. Generate the frequent 1-itemsets.

5. Let φ = 3. Generate the frequent 2-itemsets.

6. Let φ = 3. Generate the frequent 3-itemsets.

7. Using 75% minimum confidence and 20% minimum support, generate one-antecedent association rules for predicting play.

8. Using 75% minimum confidence and 20% minimum support, generate two-antecedent association rules for predicting play.

9. Multiply the observed support times the confidence for each of the rules in Exercises 7 and 8, and rank them in a table.

10. Verify your manually found results using association rule software.

11. For each of the association rules found above by the a priori algorithm, find the J-measure. Then order the rules by J-measure. Compare the ordering with that from the a priori support × confidence ordering.

12. Find the value of the J-measure for the sixth rule from Figure 23.5.

Hands-On Analysis

Use the churn data set, given at the book series web site, for the following exercises. Use the Churn_Training_File. Filter out all variables except the following: VMail Plan, Intl Plan, CustServ Calls, and Churn. Set CustServ Calls to be ordinal. Allow the three predictors to be in either antecedent or consequent, but do not allow Churn to be in the antecedent.

13. Set the minimum antecedent support to 1%, the minimum rule confidence to 5%, and the maximum number of antecedents to 1. Use rule confidence as your evaluation measure.

  1. Find the association rule with the greatest lift.
  2. Report the following for the rule in (a).
    1. Number of instances
    2. Support % (as defined in this chapter)
    3. Confidence %
    4. Rule support %
    5. Lift
    6. Deployability
  3. Using hand calculations, show how the measures were calculated.
  4. Explain, in terms of this data, what each of the measures in (c) means (you can skip (i)).

14. Set the minimum antecedent support to 1%, the minimum rule confidence to 5%, and the maximum number of antecedents to 1.

  1. Generate rules using confidence difference as your evaluation measure with evaluation measure lower bound = 40. Explain what this evaluation measure means.
  2. For the rules that are generated, use hand calculations to compute the reported evaluation measure, and show that the evaluation measure lower bound has been met.
  3. Generate rules using confidence difference as your evaluation measure with evaluation measure lower bound = 30.
  4. Select the rule with the highest deployability. Explain why the deployability of this rule is greater than the rule we found in Question 13a.

15. Set the minimum antecedent support to 1%, the minimum rule confidence to 5%, and the maximum number of antecedents to 1.

  1. Generate rules using confidence ratio as your evaluation measure with evaluation measure lower bound = 40. Explain what this evaluation measure means.
  2. Select the rule involving Intl Plan. Use hand calculations to compute the reported evaluation measure, and show that the evaluation measure lower bound has been met.

16. Compare the results from Exercise 13 with the results from the EDA and decision tree analysis in Chapters 3 and 6. Discuss similarities and differences. Which analysis format do you prefer? Do you find a confluence of results?

17. Apply the GRI algorithm to uncover association rules for predicting either churn or non-churn behavior. Specify reasonable lower bounds for support and confidence.

18. Compare the results from the a priori algorithm with those of the GRI algorithm. Which algorithm yields a richer set of rules, and why? Which algorithm is probably preferable for this particular data set? Why?
