Understanding global aggregations

In the previous section, our recipe provided a snapshot count of events; that is, it provided the count of events at a single point in time. But what if you want a running total of events over some time window? This is the concept of global aggregations:

If we wanted global aggregations, the same example as before (Time 1: 5 blue, 3 green, Time 2: 1 gohawks, Time 4: 2 greens) would be calculated as:

  • Time 1: 5 blue, 3 green
  • Time 2: 5 blue, 3 green, 1 gohawks
  • Time 4: 5 blue, 5 green, 1 gohawks

Within traditional batch processing, this would be similar to a groupByKey or GROUP BY statement. In a streaming application, however, this calculation needs to complete within milliseconds, which is typically too short a time window to perform a GROUP BY calculation. With Spark Streaming global aggregations, the calculation can be completed quickly by performing a stateful streaming calculation. That is, the Spark Streaming framework keeps all of the information needed to perform the aggregation in memory (that is, it keeps the data in state) so that the result can be computed within its small time window.
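As a rough sketch of how this stateful calculation works, consider the per-key update function used with the DStream `updateStateByKey` API: Spark calls it once per key per batch, passing the new values for that key and the state carried over from earlier batches. The variable names and the commented `StreamingContext` setup are illustrative, not from the source; the loop at the bottom is plain Python standing in for Spark's per-key invocation, replaying the example batches from the text.

```python
# Per-key update function for updateStateByKey: merge this batch's
# counts for a key into the running count carried in state.
# running_count is None the first time a key appears.
def update_function(new_values, running_count):
    return sum(new_values) + (running_count or 0)

# In a real job (illustrative names, not from the source):
# ssc = StreamingContext(sc, 1)
# ssc.checkpoint("checkpoint")   # stateful operations require checkpointing
# running_counts = pairs.updateStateByKey(update_function)

# Plain-Python simulation of the example batches from the text:
batches = [
    {"blue": [5], "green": [3]},   # Time 1
    {"gohawks": [1]},              # Time 2
    {},                            # Time 3: no events
    {"green": [2]},                # Time 4
]
state = {}
for batch in batches:
    # Spark applies the update function to every key seen so far
    for key in set(state) | set(batch):
        state[key] = update_function(batch.get(key, []), state.get(key))
print(state)  # {'blue': 5, 'green': 5, 'gohawks': 1}
```

After the final batch, the state matches the Time 4 line of the list above: 5 blue, 5 green, 1 gohawks.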
