The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. What I mean is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P⇒ P2. To make this more precise, we need a concept for rule frequency similar to that of support for patterns, namely confidence. For a rule P⇒ P2, confidence is defined as follows:

conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)

This can be interpreted as the conditional support of P2 given to P1; that is, if it were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P⇒ P2). We call P⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call Pthe antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent's length is 1, for simplicity.

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is just present in two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

 {a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
 {c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
 {f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
 {m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
 {p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support value of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.

At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset