Bucketization

Bucketing input data is an important concept to understand in ML. Set with a key parameter at the job level called bucket_span, the input data from the datafeed (described next) is collected into mini batches for processing. Think of the bucket span as a pre-analysis aggregation interval—the window of time in which a portion of the data is aggregated over for the purposes of analysis. The shorter the duration of the bucket_span, the more granular the analysis, but also the higher the potential for noisy artifacts in the data.

The following graph shows the same dataset aggregated over three different intervals:

Aggregations of the same data over three different time intervals

Notice that the prominent anomalous spike seen in the version aggregated over the 5-minute interval becomes all but lost if the data is aggregated over a 60-minute interval due to the fact of the spike's short (<2 minute) duration. In fact, at this 60-minute interval, the spike doesn't even seem that anomalous anymore.

This is a practical consideration for the choice of bucket_span. On one hand, having a shorter aggregation period is helpful because it will increase the frequency of the analysis (and thus reduce the interval of notification on if there is something anomalous), but making it too short may highlight features in the data that you don't really care about. If the brief spike that's shown in the preceding data is a meaningful anomaly for you, then the 5-minute view of the data is sufficient. If, however, a perturbation of the data that's very brief seems like an unnecessary distraction, then avoid a low value of bucket_span.

Some additional practical considerations can be found on Elastic's blog: https://www.elastic.co/blog/explaining-the-bucket-span-in-machine-learning-for-elasticsearch.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset