Pattern mining on MSNBC clickstream data

Having spent a considerable amount of time explaining the basics of pattern mining, let's finally turn to a more realistic application. The data we will discuss next comes from server logs of http://msnbc.com (and, for news-related pages, in part from http://msn.com) and represents a full day's worth of browsing activity, in terms of page views, of the users of these sites. The data was collected in September 1999 and has been made available for download at http://archive.ics.uci.edu/ml/machine-learning-databases/msnbc-mld/msnbc990928.seq.gz. After storing this file locally and unzipping it, you will find that the msnbc990928.seq file essentially consists of a header and space-separated rows of integers of varying length. The following are the first few lines of the file:

% Different categories found in input file:

frontpage news tech local opinion on-air misc weather msn-news health living business msn-sports sports summary bbs travel

% Sequences:

1 1 
2 
3 2 2 4 2 2 2 3 3 
5 
1 
6 
1 1 
6 
6 7 7 7 6 6 8 8 8 8 

Each row in this file is a sequence of the encoded page visits of a single user within that day. Page visits have not been collected at the most granular level but rather grouped into 17 news-related categories, which are encoded as integers. The names of these categories are listed in the preceding header and are mostly self-explanatory (with the exception of bbs, which stands for bulletin board service). The n-th item in this list corresponds to category n; for example, 1 stands for frontpage, while travel is encoded as 17. For instance, the fourth user in this file hit opinion once, while the third had nine page views in total, starting and ending with tech.
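To make the encoding concrete, here is a small sketch in plain Scala (the categories array and the thirdUser value are illustrative names, not part of the original text) that decodes one of the sample sequences back into category names:

// Category names in header order; code n corresponds to index n - 1
val categories = Array("frontpage", "news", "tech", "local", "opinion", "on-air",
  "misc", "weather", "msn-news", "health", "living", "business", "msn-sports",
  "sports", "summary", "bbs", "travel")

// The third sample sequence from above, decoded into readable category names
val thirdUser = Array(3, 2, 2, 4, 2, 2, 2, 3, 3)
println(thirdUser.map(code => categories(code - 1)).mkString(" -> "))
// prints: tech -> news -> news -> local -> news -> news -> news -> tech -> tech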

It is important to note that the page visits in each row have indeed been stored chronologically; that is, this really is sequential data with respect to page visit order. In total, data for 989,818 users has been collected; that is, the data set has precisely that number of sequences. Unfortunately, it is unknown how many URLs have been grouped to form each category, but we do know that this number ranges rather widely, from 10 to 5,000 per category. See the description available at http://archive.ics.uci.edu/ml/machine-learning-databases/msnbc-mld/msnbc.data.html for more information.

Just from the description of this data set, it should be clear that all three pattern mining problems we have discussed so far can be applied to it: we can search for sequential patterns in this sequential database and, neglecting the sequentiality, analyze both frequent patterns and association rules. To do so, let's first load the data using Spark. In what follows, we will assume that the header of the file has been removed and that a Spark shell session has been started from the folder in which the sequence file is stored:

import org.apache.spark.rdd.RDD

// Each line is a space-separated sequence of integer-encoded page categories
val transactions: RDD[Array[Int]] = sc.textFile("./msnbc990928.seq").map { line =>
  line.split(" ").map(_.toInt)
}
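As a quick sanity check (a small sketch, assuming the header lines have indeed been stripped), counting the records should match the 989,818 user sessions mentioned earlier:

transactions.count()  // expect 989818 sequences once the header is removed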

With the sequence file loaded into an RDD of integer-valued arrays, recall from earlier sections that one of the assumptions of transactions in frequent pattern mining is that item sets are, in fact, sets and thus contain no duplicates. To apply FP-growth and association rule mining, we therefore have to remove duplicate entries, as follows:

val uniqueTransactions: RDD[Array[Int]] = transactions.map(_.distinct).cache()
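To get a feel for how much information this deduplication discards, one could, for example, count how many sessions actually contained repeat visits to a category (an illustrative check, not part of the original analysis):

// Number of sessions containing at least one repeated category
val withRepeats = transactions.filter(t => t.distinct.length < t.length).count()
println(s"$withRepeats of ${transactions.count()} sessions contain repeat visits")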

Note that not only did we restrict each transaction to distinct items, but we also cached the resulting RDD, which is recommended for all three pattern mining algorithms. This allows us to run FP-growth on this data, for which we have to find a suitable minimum support threshold t. So far, in the toy examples, we have chosen t to be rather large (between 0.6 and 0.8). It is not realistic to expect any patterns to have such large support values in larger databases. Although we only have to deal with 17 categories, browsing behavior can vary drastically from user to user. Instead, we choose a support value of just 5% to gain some insights:

import org.apache.spark.mllib.fpm.FPGrowth

val fpGrowth = new FPGrowth().setMinSupport(0.05)
val model = fpGrowth.run(uniqueTransactions)
val count = uniqueTransactions.count()

// Print each frequent item set together with its relative support
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq / count.toDouble)
}

The output of this computation shows that, for t=0.05, we recover only 14 frequent patterns.

There are perhaps fewer patterns than you might have expected, and among those, all but one have a length of 1. Less surprising is the fact that the front page is hit most often, with 31%, followed by the on-air and news categories. The single length-2 pattern tells us that the front page and news pages together were visited by only 7% of the users that day, and no other pair of site categories was visited by more than 5% of the user base. Categories 5, 15, 16, and 17 (opinion, summary, bbs, and travel) don't even make the list. If we repeat the experiment with a t value of 1% instead, the number of patterns increases to a total of 74.
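One way to carry out this second run (a sketch; the fpGrowthLow name is chosen here purely for illustration) is to rebuild the model with the lower threshold on the same cached RDD, reusing the model variable for the steps that follow, which the Spark shell allows:

// Re-run FP-growth with a 1% minimum support on the cached transactions
val fpGrowthLow = new FPGrowth().setMinSupport(0.01)
val model = fpGrowthLow.run(uniqueTransactions)
model.freqItemsets.count()  // should now report 74 frequent patterns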

Let's see how many length-3 patterns are among them:

model.freqItemsets.collect().foreach { itemset =>
  if (itemset.items.length >= 3)
    println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq / count.toDouble)
}

Running this against an FPGrowth instance with a minimum support value of t=0.01 yields 11 such patterns.

As one could have guessed, the most frequent length-1 patterns are also predominant among the 3-patterns. Within these 11 patterns, 10 involve the front page and nine involve news. Interestingly, the misc category, which according to the earlier analysis was visited only 7% of the time, shows up in a total of four 3-patterns. If we had more information about the underlying user groups, it would be interesting to follow up on this pattern. Speculatively, users with an interest in many miscellaneous topics end up in this mixed category alongside a few others.
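To follow up on this observation programmatically, one could, for instance, list just the 3-patterns involving the misc category, which is encoded as 7 (again, an illustrative sketch rather than part of the original walkthrough):

// Frequent 3-patterns that contain the misc category (code 7), with relative support
model.freqItemsets
  .filter(itemset => itemset.items.length == 3 && itemset.items.contains(7))
  .collect()
  .foreach(itemset => println(itemset.items.mkString("[", ",", "]") + ", " +
    itemset.freq / count.toDouble))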

Following this up with an analysis of association rules is technically easy; we just run the following lines to get all the rules with a confidence of at least 0.4 from the existing FP-growth model:

val rules = model.generateAssociationRules(confidence = 0.4)

// Print each rule as [antecedent=>consequent] together with its rounded confidence
rules.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" +
    rule.consequent.mkString(",") + "]," + (100 * rule.confidence).round / 100.0)
}

Note how we can conveniently access the antecedent, consequent, and confidence of the respective rules. In the printed output, the confidence is rounded to two decimals this time.

Again, naturally, the most frequent length-1 patterns show up in many of the rules, most notably frontpage as a consequent. Throughout this example, we have chosen the support and confidence values so that the outputs stay short and the counts are easy to validate manually, but let's do some automated calculations on the rule set regardless:

// Total number of rules
rules.count
// Rules with frontpage (encoded as 1) as the consequent
val frontPageConseqRules = rules.filter(_.consequent.head == 1)
frontPageConseqRules.count
// Of these, rules with news (encoded as 2) in the antecedent
frontPageConseqRules.filter(_.antecedent.contains(2)).count

Executing these statements, we see that about two-thirds of the rules have front page as the consequent, that is, 14 of 22 rules in total, and among these, nine contain news in their antecedent.

Moving on to the sequence mining problem for this data set, we first need to transform our original transactions into an RDD of type Array[Array[Int]], since nested arrays are how sequences are encoded for PrefixSpan in Spark, as we have seen before. While somewhat obvious, it is still worth pointing out that with sequences, we do not have to discard the information carried by repeated items, as we just did for FP-growth.

In fact, we even gain more structure by imposing sequentiality on individual records. To do the transformation just indicated, we simply do the following:

// Wrap each page view in its own item set, preserving the original visit order
val sequences: RDD[Array[Array[Int]]] = transactions.map(_.map(Array(_))).cache()

Again, we cache the result to improve the performance of the algorithm, this time PrefixSpan. Running the algorithm itself works as before:

import org.apache.spark.mllib.fpm.PrefixSpan

val prefixSpan = new PrefixSpan().setMinSupport(0.005).setMaxPatternLength(15)
val psModel = prefixSpan.run(sequences)

We set the minimum support value very low, at 0.5%, to get a slightly bigger result set this time. Note that we also restrict the search to patterns with no more than 15 sequence items. Let's analyze the distribution of frequent sequence lengths by running the following:

psModel.freqSequences
  .map(fs => (fs.sequence.length, 1))
  .reduceByKey(_ + _)
  .sortByKey()
  .collect()
  .foreach { case (length, n) => println(s"$length: $n") }

In this chain of operations, we first map each sequence to a key-value pair consisting of its length and a count of 1. We then proceed with a reduce operation that sums up the values by key, that is, we count how often each length occurs. The rest is just sorting and formatting of the resulting length distribution.

The longest frequent sequence turns out to have a length of 14, which means, in particular, that our maximum pattern length of 15 did not restrict the search space and that we found all sequential patterns for the chosen support threshold of t=0.005. Interestingly, most of the frequent sequential patterns have between two and six touch points on http://msnbc.com.

To complete this example, let's see what the most frequent pattern of each length is and what the longest sequential pattern actually looks like. Answering the first question will also answer the second, since there is only one length-14 pattern. Computing this can be done as follows:

psModel.freqSequences
  .map(fs => (fs.sequence.length, fs))   // key each frequent sequence by its length
  .groupByKey()                          // group all sequences of the same length
  .map { case (_, seqs) =>               // keep only the most frequent sequence per length
    seqs.reduce((f1, f2) => if (f1.freq > f2.freq) f1 else f2)
  }
  .map(_.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]"))
  .collect()
  .foreach(println)

Since this is one of the more complicated RDD operations we've considered so far, let's discuss all the steps involved. We first map each frequent sequence to a pair consisting of its length and the sequence itself. This may seem a bit strange at first, but it allows us to group all the sequences by length, which we do in the next step. Each group consists of its key and an iterator over frequent sequences. We map each group to its iterator and reduce the sequences by keeping only the one with the largest frequency. To display the result of this operation properly, we use mkString twice to create a readable string from the nested arrays, which would otherwise not print nicely. The result of this chain is the most frequent pattern of each length.
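As a side note on the design of this chain: groupByKey collects all frequent sequences of a given length into one group before reducing them. The same result can be computed with reduceByKey, which keeps only the current maximum per key and combines partial results on the map side. A possible variant (a sketch, not from the original text) looks like this:

// Keep only the most frequent sequence per length, without materializing full groups
psModel.freqSequences
  .map(fs => (fs.sequence.length, fs))
  .reduceByKey((f1, f2) => if (f1.freq > f2.freq) f1 else f2)
  .values
  .map(_.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]"))
  .collect()
  .foreach(println)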

We discussed earlier that the front page was the most frequent item by far, which makes a lot of sense intuitively, since it is the natural entry point to a website. However, it is a bit of a surprise that the most frequent sequences of all lengths, for the chosen threshold, consist of front page hits only. Apparently, many users spend a lot of time, and clicks, in and around the front page, which might be a first indication of its advertising value compared to the pages of the other categories. As we indicated in the introduction of this chapter, analyzing data like this, especially when enriched with other data sources, can be of tremendous value to the owners of the respective websites, and we hope to have shown how frequent pattern mining techniques can play their part in doing so.
