Influencers in split versus non-split jobs

You might wonder whether it is really necessary to split the analysis by a field, or whether you can simply rely on influencers to identify the offending entity.

Let's remind ourselves of the difference between the purpose of influencers and the purpose of splitting a job. An entity is identified by ML as an influencer if it has contributed significantly to the existence of the anomaly. This notion of deciding influential entities is completely independent of whether or not the job is split. An entity can be deemed influential on an anomaly only if an anomaly happens in the first place. If no anomaly is detected, there is nothing to attribute to an influencer. However, whether the job finds anything anomalous at all can depend on whether or not the job is split into multiple time series. When you split the job, you are modeling (creating a separate analysis for) each entity of the field chosen for the split.

To illustrate this, let's look at one of my favorite demo datasets, called farequote (available in the GitHub repository for this book at https://github.com/PacktPublishing/Machine-Learning-with-the-Elastic-Stack/tree/master/example_data). This dataset is essentially an access log of the number of times a piece of middleware is called in a travel portal to reach out to third-party airlines for a quote of airline fares. The JSON documents look like this:

{
  "@timestamp": "2017-02-11T23:59:54.000Z",
  "responsetime": 251.573,
  "airline": "FFT"
}

The number of events per unit of time corresponds to the number of requests being made, and the responsetime field is the response time of that individual request to that airline's fare quoting web service.
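To make these two quantities concrete, here is a minimal Python sketch that groups documents of this shape into time buckets and computes the count and the mean responsetime for each bucket, either overall or per airline. The sample documents (apart from the one shown above), the 15-minute bucket size, and the helper names bucket_start and summarize are assumptions made for illustration; this is not how ML itself processes the data.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60  # assumed bucket size; the text does not state one

# Made-up documents shaped like the farequote sample above.
docs = [
    {"@timestamp": "2017-02-11T23:46:10.000Z", "responsetime": 250.0, "airline": "FFT"},
    {"@timestamp": "2017-02-11T23:52:30.000Z", "responsetime": 150.0, "airline": "AAL"},
    {"@timestamp": "2017-02-11T23:59:54.000Z", "responsetime": 251.573, "airline": "FFT"},
    {"@timestamp": "2017-02-12T00:03:00.000Z", "responsetime": 100.0, "airline": "AAL"},
]

def bucket_start(ts: str) -> int:
    """Return the epoch second at which the bucket containing ts begins."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    epoch = int(dt.timestamp())
    return epoch - epoch % BUCKET_SECONDS

def summarize(documents, split_field=None):
    """Map (bucket_start, entity) to (count, mean responsetime).

    entity is None when split_field is not given (the unsplit view).
    """
    groups = defaultdict(list)
    for doc in documents:
        entity = doc[split_field] if split_field else None
        groups[(bucket_start(doc["@timestamp"]), entity)].append(doc["responsetime"])
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in groups.items()}

overall = summarize(docs)                           # one series for all airlines
per_airline = summarize(docs, split_field="airline")  # one series per airline

for key, (count, avg) in sorted(per_airline.items()):
    print(key, count, round(avg, 3))
```

The unsplit call produces a single series of buckets, while the split call produces one series per airline. That is the distinction the following cases explore.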

Let's take a look at the following cases:

  • Case 1: An analysis of count over time, not split on airline, but using airline as an influencer

If we analyze the overall count of events (no split), we can see that the prominent anomaly (the spike) in the event volume was determined to be influenced by airline=AAL:

This is quite sensible because the increased occurrence of requests for AAL affects the overall event count (of all airlines together) very prominently.

  • Case 2: An analysis of count over time, split on airline, and using airline as an influencer

If we set partition_field_name=airline to split the analysis so that each airline's count of documents is analyzed independently, we still see that airline=AAL is the most unusual:

  • Case 3: An analysis of mean(responsetime), not split on airline, but using airline as an influencer

In this case, the results are as follows:

Here, remember that the response times of all the airlines are averaged together in each bucket_span, because the job is not split. In this case, the most prominent anomaly (even though it is a relatively minor variation above normal) is shown and is deemed to be influenced by airline=NKS. However, this may be misleading. You see, airline=NKS has a very stable response time during this period, but note that its normal operating range is much higher than that of the rest of the group:

As such, the contribution of NKS to the total aggregate response times of all airlines is more significant than the others. So, of course, ML identifies NKS as the most prominent influencer.

But this anomaly is not the most significant anomaly of responsetime in the dataset! That anomaly belongs to airline=AAL, but it isn't visible in the aggregate because data from all the airlines drowns out the detail. See the next case.
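To see why an aggregate can drown out a single entity, consider a toy Python sketch. The numbers and the entity names (HIGH, LOW1, and so on) are invented for illustration and are not taken from farequote. The relative change used here is only a crude stand-in for unusualness, not the way ML scores anomalies, and the pooled average assumes every entity contributes the same number of requests per bucket.

```python
from statistics import mean

# Synthetic per-bucket mean response times, invented for illustration only.
# HIGH has a stable but high baseline; LOW1 spikes in the last bucket.
series = {
    "HIGH": [2000.0, 2010.0, 1990.0, 2005.0],
    "LOW1": [100.0, 102.0, 98.0, 300.0],
    "LOW2": [100.0, 99.0, 101.0, 100.0],
    "LOW3": [101.0, 100.0, 99.0, 100.0],
}

def relative_change(values):
    """Change of the last value relative to the mean of the earlier values."""
    baseline = mean(values[:-1])
    return (values[-1] - baseline) / baseline

# Unsplit view: a single pooled series that averages all entities per bucket
# (assumes equal request counts per entity, which is a simplification).
pooled = [mean(bucket) for bucket in zip(*series.values())]
pooled_change = relative_change(pooled)

# Split view: each entity is compared only against its own history.
split_change = {name: relative_change(vals) for name, vals in series.items()}

print(f"pooled: {pooled_change:.1%}")
for name, change in split_change.items():
    print(f"{name}: {change:.1%}")
```

In this toy data, the spike triples the response time of LOW1, yet the pooled average moves by less than 10% because HIGH dominates the mean. Viewed per entity, LOW1 is clearly the one that changed.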

  • Case 4: An analysis of mean(responsetime), split on airline, and using airline as an influencer

In this case, the most prominent response time anomaly for AAL properly shows itself when we set partition_field_name=airline to split the analysis:

And there you have it: the moral here is that you should be thoughtful if you are simply relying on influencers to find unusual entities within a dataset of multiple entities. It might be more sensible to model each entity independently!
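For reference, the four cases differ only in the detector function and in whether partition_field_name is set. The following Python sketch builds job configuration bodies in the shape accepted by the ML create anomaly detection job API. The helper name job_config, the case labels, and the 15m bucket_span are assumptions for illustration, since the text does not state the bucket span that was used.

```python
import json

def job_config(function, field_name=None, split_on=None,
               influencers=("airline",), bucket_span="15m"):
    """Build an anomaly detection job body (hypothetical helper).

    bucket_span="15m" is an assumed value, not one taken from the text.
    """
    detector = {"function": function}
    if field_name:
        detector["field_name"] = field_name
    if split_on:
        # Splitting: one independent model per value of this field.
        detector["partition_field_name"] = split_on
    return {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": list(influencers),
        },
        "data_description": {"time_field": "@timestamp"},
    }

cases = {
    "case1": job_config("count"),
    "case2": job_config("count", split_on="airline"),
    "case3": job_config("mean", field_name="responsetime"),
    "case4": job_config("mean", field_name="responsetime", split_on="airline"),
}

print(json.dumps(cases["case4"], indent=2))
```

Note that airline is listed as an influencer in all four configurations. Only the presence of partition_field_name decides whether each airline gets its own model.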
