Extracting Patterns from Clickstream Data

In real-world data, the relationships between individual measurements or events are usually intricate and highly complex. The guiding example for this chapter is the click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting because groups of users typically show many patterns in their browsing behavior and may follow certain rules. Insights about user groups are of interest at least to the company running the website, and may well be the focus of its data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance to find malicious behavior, can be technically very challenging. It is immensely valuable to be able to understand and implement both the algorithmic and the technical sides.

In this chapter, we will look into two topics in depth: pattern mining and working with streaming data in Spark. The chapter is split into two main sections. In the first, we will introduce the three pattern mining algorithms that Spark currently comes with and apply them to an interesting dataset. In the second, we will take a more technical view and address the core problems that arise when deploying a streaming data application using the algorithms from the first part. In particular, you will learn the following:

  • The basic principles of frequent pattern mining.
  • Useful and relevant data formats for applications.
  • How to load and analyze a clickstream dataset generated from user activity on http://MSNBC.com.
  • How to understand and compare the three pattern mining algorithms available in Spark, namely FP-growth, association rules, and PrefixSpan.
  • How to apply these algorithms to MSNBC click data and other examples to identify the relevant patterns.
  • The very basics of Spark Streaming and what use cases can be covered by it.
  • How to put any of the preceding algorithms into production by deploying them with Spark Streaming.
  • How to implement a more realistic streaming application with click events aggregated on the fly.

By construction, this chapter becomes more technically involved towards the end, but Spark Streaming also lets us introduce yet another very important tool from the Spark ecosystem. We start off by presenting some of the basic questions of pattern mining and then discuss how to address them.
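
To make the most basic of these questions concrete (which sets of items occur together often enough to be called frequent?), here is a minimal sketch in plain Python. It is not Spark code and not one of the three algorithms named above; it simply counts the support of every itemset by brute force, which is only feasible for tiny inputs. The function name frequent_itemsets, the min_support parameter, and the sample sessions are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    Brute force: enumerates every subset of every transaction, so the cost
    grows exponentially with transaction length.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical sessions of visited page categories, invented for illustration.
sessions = [
    ["frontpage", "news", "sports"],
    ["frontpage", "news"],
    ["news", "weather"],
    ["frontpage", "sports"],
]
result = frequent_itemsets(sessions, min_support=0.5)
```

With a minimum support of 0.5, an itemset has to appear in at least two of the four sessions to be kept, so rare combinations such as news together with weather are dropped. Avoiding the exhaustive enumeration done here is what dedicated pattern mining algorithms are designed for.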
