The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in "Mining Association Rules between Sets of Items in Large Databases", available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. That is, given a pattern P₁ and another pattern P₂, we want to know whether P₂ is frequently present whenever P₁ can be found in D, and we denote this by writing P₁ ⇒ P₂. To make this more precise, we need a concept for rule frequency similar to that of support for patterns, namely confidence. For a rule P₁ ⇒ P₂, confidence is defined as follows:

conf(P₁ ⇒ P₂) = supp(P₁ ∪ P₂) / supp(P₁)

This can be interpreted as the conditional support of P₂ given P₁; that is, if we were to restrict D to all the transactions supporting P₁, the support of P₂ in this restricted database would be equal to conf(P₁ ⇒ P₂). We call P₁ ⇒ P₂ a rule in D if its confidence exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P₁ the antecedent and P₂ the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent's length is 1, for simplicity.
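
To make the definition concrete, here is a minimal sketch in plain Python. The names support, confidence, and toy_db are hypothetical helpers introduced only for illustration, and the four transactions are made-up data, not our running example. Support is taken here as the fraction of transactions containing a pattern, which is an assumption; the confidence ratio is the same if absolute counts are used instead.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(pattern: Iterable[str], database: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `pattern`."""
    items = frozenset(pattern)
    return sum(1 for t in database if items <= t) / len(database)


def confidence(antecedent: Iterable[str],
               consequent: Iterable[str],
               database: List[Transaction]) -> float:
    """conf(P1 => P2) = supp(P1 ∪ P2) / supp(P1)."""
    p1, p2 = frozenset(antecedent), frozenset(consequent)
    supp_p1 = support(p1, database)
    if supp_p1 == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(p1 | p2, database) / supp_p1


# Hypothetical toy database (illustration only, not the running example).
toy_db: List[Transaction] = [
    frozenset({"x", "y", "z"}),
    frozenset({"x", "y"}),
    frozenset({"x", "z"}),
    frozenset({"y", "z"}),
]

# Confidence of {x} => {y} computed from the definition.
conf_xy = confidence({"x"}, {"y"}, toy_db)

# The "conditional support" view: keep only the transactions that
# support {x}, then measure the support of {y} in what is left.
restricted = [t for t in toy_db if "x" in t]
conditional_supp_y = support({"y"}, restricted)

print(conf_xy, conditional_supp_y)  # both values equal 2/3 up to rounding
```

Both computations agree, which is exactly the conditional-support reading of confidence: {x} occurs in three transactions, and {y} is present in two of them.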

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is present in only two transactions, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

{a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}

{c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}

{f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}

{m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}

{p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support value of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.
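
As a rough sketch of why this is straightforward, the following plain Python function derives rules with a consequent of length 1 from a precomputed table of pattern supports. It is not Spark's actual implementation; rules_from_supports, the Rule alias, and the toy_supports table are hypothetical names and made-up values used only for illustration. The sketch also assumes that the table contains every sub-pattern of each frequent pattern, and it accepts a rule when its confidence is at least the threshold.

```python
from typing import Dict, FrozenSet, List, Tuple

# (antecedent, consequent, confidence)
Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def rules_from_supports(supports: Dict[FrozenSet[str], float],
                        min_confidence: float) -> List[Rule]:
    """Derive rules with a single-item consequent from pattern supports."""
    rules: List[Rule] = []
    for pattern, supp_pattern in supports.items():
        if len(pattern) < 2:
            continue  # a rule needs a non-empty antecedent and a consequent
        for item in pattern:
            antecedent = pattern - {item}
            supp_antecedent = supports.get(antecedent)
            if not supp_antecedent:
                continue  # antecedent missing from the table, skip it
            # conf(P1 => P2) = supp(P1 ∪ P2) / supp(P1); no pass over
            # the database is needed, only two table lookups.
            conf = supp_pattern / supp_antecedent
            if conf >= min_confidence:
                rules.append((antecedent, frozenset({item}), conf))
    return rules


# Hypothetical support counts (illustration only, not the running example).
toy_supports: Dict[FrozenSet[str], float] = {
    frozenset({"x"}): 3,
    frozenset({"y"}): 3,
    frozenset({"z"}): 3,
    frozenset({"x", "y"}): 2,
    frozenset({"x", "z"}): 2,
    frozenset({"y", "z"}): 2,
}

for antecedent, consequent, conf in rules_from_supports(toy_supports, 0.6):
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 3))
```

Each frequent pattern of length k yields at most k candidate rules here, one per item moved to the consequent, so the cost is dominated by finding the frequent patterns in the first place.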

At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.
