Frequent pattern mining

When presented with a new data set, a natural sequence of questions is:
  • What kind of data do we look at; that is, what structure does it have?
  • Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data?
  • How do we assess what is frequent; that is, what are good measures of relevance and how do we test for them?

On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data.

To introduce some of the key notions of frequent pattern mining, let's first consider a prototypical example: shopping carts. Which products customers are interested in and buy has been of prime interest to marketers around the globe for a very long time. While online shops certainly help in analyzing customer behavior further, for instance, by tracking browsing data within a shopping session, the question of which items have been bought and which patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that the only events we can track are the actual payment transactions for items.

This data alone, for instance, grocery shopping carts from supermarkets or online shops, leads to quite a few interesting questions, and we will focus mainly on the following three:

  • Which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often bought together in one shopping session. Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other to improve the shopping experience or for promotional value, even if they don't seem to belong together at first sight. In the case of an online shop, this sort of analysis might be the basis for a simple recommender system.
  • Based on the previous question, are there any interesting implications or rules to observe in shopping behavior? Continuing with the shopping cart example, can we establish associations such as: if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but it also requires clarifying what we consider to be often, that is, what frequent means.
  • Note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow the data with more information. One aspect we will focus on is the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask: which sequences of items can often be found in our transaction data? For instance, the purchase of a larger electronic device might be followed by additional utility items.
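
To make the three questions a little more tangible, here is a minimal plain-Python sketch that answers each of them by brute-force counting on a handful of carts. It is not Spark code and not one of the algorithms discussed in this chapter; the toy transactions and the helper names pair_counts, confidence, and ordered_pair_counts are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each list is one cart, in the order items were added.
transactions = [
    ["bread", "butter", "cheese"],
    ["bread", "butter"],
    ["beer", "diapers"],
    ["bread", "butter", "cheese", "beer"],
    ["diapers", "beer", "bread"],
]

def pair_counts(carts):
    """Question 1: in how many carts does each unordered pair of items occur?"""
    counts = Counter()
    for cart in carts:
        # Sort the distinct items so that each pair has one canonical form.
        for pair in combinations(sorted(set(cart)), 2):
            counts[pair] += 1
    return counts

def confidence(carts, antecedent, consequent):
    """Question 2: of the carts containing `antecedent`, which fraction
    also contains `consequent`?"""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [c for c in carts if antecedent <= set(c)]
    if not with_antecedent:
        return 0.0
    both = [c for c in with_antecedent if consequent <= set(c)]
    return len(both) / len(with_antecedent)

def ordered_pair_counts(carts):
    """Question 3: in how many carts was item a added before item b?"""
    counts = Counter()
    for cart in carts:
        seen = set()  # count each ordered pair at most once per cart
        for i, a in enumerate(cart):
            for b in cart[i + 1:]:
                if a != b and (a, b) not in seen:
                    seen.add((a, b))
                    counts[(a, b)] += 1
    return counts

print(pair_counts(transactions).most_common(3))
print(confidence(transactions, ["bread", "butter"], ["cheese"]))
print(ordered_pair_counts(transactions)[("beer", "diapers")])
```

Counting every pair explicitly like this only works for tiny data sets, since the number of candidate item combinations grows very quickly with the number of distinct items; this is one reason to look for dedicated pattern mining algorithms.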

The reason we focus on these three questions in particular is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to them. Specifically, we will carefully introduce FP-growth, association rules, and PrefixSpan, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have motivated so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection.
