Exploring count functions

As seen in Chapter 2, Installing the Elastic Stack with Machine Learning, Elastic ML jobs contain detectors, each of which applies a function to some aspect of the data (for example, a field). The example jobs shown in Chapter 2, Installing the Elastic Stack with Machine Learning, have detectors using metric-based functions operating on metric-based fields (such as CPU utilization). However, the detectors we will explore in this chapter simply count occurrences of things over time.

The three main functions to get familiar with are as follows:

  • Count: Counts the number of documents in the bucket resulting from a query of the raw data index
  • High Count: The same as Count, but will only flag an anomaly if the count is higher than expected
  • Low Count: The same as Count, but will only flag an anomaly if the count is lower than expected

You will see that ML offers a variety of one-sided functions (functions that detect anomalies in only one direction). Additionally, it is important to know that these functions are not counting a field, or even the existence of fields within a document; they are merely counting the documents themselves.
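To make the distinction between the three functions concrete, the following hypothetical Python sketch mimics the decision each one makes. The `tolerance` threshold is an invented stand-in for the probabilistic model that ML actually builds; it is here only to show the directionality:

```python
# Hypothetical sketch of one-sided vs. two-sided count functions.
# "count" flags deviations in either direction, while "high_count" and
# "low_count" only flag deviations above or below the expected value.

def is_anomalous(observed, expected, tolerance, function="count"):
    """Return True if the observed bucket count deviates beyond the
    tolerance in the direction the given function cares about."""
    if function == "high_count":
        return observed > expected + tolerance
    if function == "low_count":
        return observed < expected - tolerance
    # plain "count" is two-sided
    return abs(observed - expected) > tolerance

# A spike of 277 documents against a typical rate of ~30 per bucket:
print(is_anomalous(277, 30, 50, "count"))       # True
print(is_anomalous(277, 30, 50, "high_count"))  # True
print(is_anomalous(277, 30, 50, "low_count"))   # False
```

In the real product there is no fixed tolerance; ML learns the expected count distribution from the data, but the one-sided logic is the same.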

To get a more intuitive feeling for what the Count function does, let's see what a standard (non-ML) Kibana visualization shows us for a particular dataset when that dataset is viewed with a Count aggregation on the Y-Axis and a 10-minute resolution of the Date Histogram aggregation on the X-Axis:
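Conceptually, that Count plus Date Histogram combination just floors each document's timestamp to the start of its bucket and tallies the documents. A minimal Python sketch (with made-up timestamps, not the real dataset) looks like this:

```python
from collections import Counter
from datetime import datetime, timedelta

# Sketch of what a Count aggregation over a Date Histogram computes:
# the number of documents per fixed-width time bucket.

def bucket_counts(timestamps, bucket_minutes=10):
    """Count documents per time bucket (bucket_minutes must divide 60)."""
    counts = Counter()
    for ts in timestamps:
        # Floor the timestamp to the start of its bucket
        floored = ts.replace(minute=ts.minute - ts.minute % bucket_minutes,
                             second=0, microsecond=0)
        counts[floored] += 1
    return counts

# Illustrative timestamps: a burst of five documents, then a lone one
docs = [datetime(2017, 2, 9, 11, 10) + timedelta(seconds=i) for i in range(5)]
docs += [datetime(2017, 2, 9, 11, 25)]
print(bucket_counts(docs))
```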

From the preceding screenshot, we can make a few observations:

  • This vertical bar visualization counts the number of documents in the index for each 10-minute bucket of time and displays the result. We can see, for example, that at the 11:10 AM mark on February 9, the number of documents/events spikes far above the typical rate (the rate at the points in time excluding the spike); in this case, the count is 277.
  • To automate the analysis of this data, we can model it with an ML job. We can use a Single Metric Job, since there is only one time series (a count of all docs in this index). Configuring the job will look like the following, after the initial steps of the Single Metric Job wizard are completed (as described in Chapter 2, Installing the Elastic Stack with Machine Learning):

We can see that the Count aggregation function is used (although High Count would also have been appropriate), and the Bucket span is set to the same value we used when building our Kibana visualization. After running the job, the resulting anomaly is found:

Of course, the anomaly of 277 documents/events is exactly what we had hoped would be found, since this is exactly what we saw when we manually analyzed the data in the vertical bar visualization earlier.
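For reference, a job equivalent to the one the wizard builds can also be expressed as a body for Elasticsearch's ML put-job API. The sketch below is illustrative rather than a complete, copy-paste configuration; in particular, the `@timestamp` time field is an assumption about the index:

```python
# A sketch of the anomaly detection job the wizard creates, expressed
# as a put-job API body. The "count" function and "10m" bucket span
# mirror the wizard settings; the time field name is an assumption.
job_body = {
    "analysis_config": {
        "bucket_span": "10m",        # match the Kibana visualization
        "detectors": [
            {"function": "count"}    # or "high_count" for one-sided
        ],
    },
    "data_description": {
        "time_field": "@timestamp"   # assumed timestamp field
    },
}
```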

Notice what happens, however, if the same data is analyzed with a 60m bucket span instead of a 10m one:

Note that because the rate spike was so short, when the event counts are aggregated over the span of an hour, the spike is diluted by the rest of the hour's events and no longer looks anomalous, so ML does not flag it. This is similar to the situation pointed out in Chapter 1, Machine Learning for IT, where the value of the bucket span has a direct effect on the resulting analysis.
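The dilution effect can be sketched numerically (the counts below are illustrative, not taken from the real dataset):

```python
# How aggregating the same spike over a wider bucket span dilutes it.
ten_min_counts = [30, 28, 31, 277, 29, 30]   # one hour at 10m resolution
typical_10m = 30

# At a 10-minute span, the spike is roughly 9x the typical rate:
print(max(ten_min_counts) / typical_10m)

# At a 60-minute span, the whole hour collapses into a single bucket,
# so the spike is diluted to under 2.4x the typical hourly rate:
print(sum(ten_min_counts) / (typical_10m * 6))
```

A ratio of 9x stands out sharply against the learned baseline; 2.4x may well sit within the normal variation of hourly totals, which is why the 60m job finds nothing.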

As mentioned earlier, the one-sided functions Low Count and High Count are especially useful when trying to find deviations in only one direction. Perhaps you only want to find a drop in orders on your e-commerce site (because a spike in orders would be good news!), or perhaps you only want to spot a spike in errors (because a drop in errors is a good thing too!).

Remember, the Count functions count documents, not fields. If you have a field that represents a summarized count of something, then that will need special treatment as described in the next section.
