Deploying a pattern mining application

The example developed in the last section was an interesting playground in which to apply the algorithms we carefully laid out throughout the chapter, but we have to recognize that we were simply handed the data. At the time of writing this book, it was often part of the culture of building data products to draw a line in the sand between data science and data engineering at almost exactly this point: between real-time data collection and aggregation on the one hand, and the (often offline) analysis of that data on the other, with reports of the insights gained fed back into the production system. While this approach has its value, it also has drawbacks. By not taking the full picture into account, we might, for instance, not know exactly how the data was collected, and missing information like this can lead to false assumptions and, eventually, wrong conclusions. While specialization is both useful and necessary to a certain degree, practitioners should at the very least strive for a basic end-to-end understanding of their applications.

When we introduced the MSNBC data set in the last section, we said that it had been retrieved from the server logs of the website. That statement drastically simplified what such a pipeline entails, so let us take a closer look:

  • High availability and fault tolerance: Click events on a website need to be tracked without downtime at any point throughout the day. Some businesses, especially those handling payment transactions, such as online shops, cannot afford to lose such events.
  • High throughput of live data and scalability: We need a system that can store and process such events in real time and cope with significant load without slowing down. For instance, the roughly one million unique users in the MSNBC data set translate, over the course of a day, to an average of about 11 active users per second (10^6 users / 86,400 seconds). In practice there are many more events to keep track of, especially considering that page views were the only events measured here.
  • Streaming data and batching thereof: In principle, the first two points could be addressed by writing events to a sufficiently sophisticated log. However, we have not yet touched on aggregating the data, for which we preferably need an online processing system. First, each event has to be attributed to a user, who must be identified by some sort of ID. Next, we have to think about the concept of a user session. While the user data in the MSNBC data set was aggregated at a daily level, this is not granular enough for many purposes. It makes more sense to analyze users' behavior over the periods in which they are actually active. For this reason, it is customary to consider windows of activity and to aggregate clicks and other events per window (see the sketch following this list).
  • Analytics on streaming data: Assuming we had a system like the one just described, with access to aggregated user session data in real time, what could we hope to achieve? We would need an analytics platform that allows us to apply algorithms to this data and gain insights from it.
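To make the windowing idea concrete, the following is a minimal sketch using Spark Streaming's DStream API. The socket source, the "userId,pageId" input format, and the particular window and slide durations are assumptions made purely for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object ClickWindows {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("ClickWindows")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Events arrive as "userId,pageId" lines; a local socket source stands
        // in for whatever ingestion mechanism a real deployment would use.
        val clicks = ssc.socketTextStream("localhost", 9999)
          .map(_.split(","))
          .map(fields => (fields(0), fields(1)))

        // Collect each user's page views over a 20-minute window that slides
        // every minute -- a crude stand-in for a user session.
        val sessions = clicks
          .mapValues(page => Vector(page))
          .reduceByKeyAndWindow((a: Vector[String], b: Vector[String]) => a ++ b,
            Minutes(20), Minutes(1))

        sessions.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Here, reduceByKeyAndWindow repeatedly merges each user's page views over the last 20 minutes, which approximates a session without requiring explicit session boundaries.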

Spark addresses these problems with its Spark Streaming module, which we will briefly introduce next. Using Spark Streaming, we will build an application that can at least mock the generation and aggregation of events, so that we can then apply the pattern mining algorithms we have studied to streams of events.
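As a first taste, here is a hypothetical sketch of how click events could be mocked with Spark Streaming's queueStream, which turns a queue of hand-made RDDs into a stream. The application name, the batch interval, and the ranges of user IDs and page categories (17, mirroring the MSNBC page categories) are illustrative assumptions:

    import scala.collection.mutable
    import scala.util.Random

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MockClickEvents {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("MockClickEvents")
        val ssc = new StreamingContext(conf, Seconds(1))

        // queueStream turns a queue of hand-made RDDs into a DStream, which is
        // convenient for testing without a live event source.
        val rddQueue = new mutable.Queue[RDD[(Int, Int)]]()
        val events = ssc.queueStream(rddQueue)
        events.print()
        ssc.start()

        // Push one batch of random (userId, pageCategory) events per second;
        // the 17 categories mirror the MSNBC page categories.
        val rng = new Random
        for (_ <- 1 to 30) {
          val batch = Seq.fill(100)((rng.nextInt(1000), rng.nextInt(17) + 1))
          rddQueue += ssc.sparkContext.parallelize(batch)
          Thread.sleep(1000)
        }
        ssc.stop()
      }
    }

Pushing RDDs into the queue after the context has started is a standard way to simulate a live source, and the resulting DStream could then be fed into windowing logic like that sketched previously.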
