Bucket results

At the highest level of abstraction are the results at the bucket level. These are the aggregated results for the entire job as a function of time; they answer the question "how unusual was this bucket of time?" To understand the structure and content of bucket-level results, let's query the results for a particular ML job. We will start by looking at the results for a simple, single metric job that has no defined influencers:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range" : { "timestamp" : { "gte": "now-2y" } } },
        { "term" : { "job_id" : "farequote_single" } },
        { "term" : { "result_type" : "bucket" } },
        { "range" : { "anomaly_score" : { "gte" : "90" } } }
      ]
    }
  }
}

Here, the query asks for any bucket results from the last two years where the anomaly_score is greater than or equal to 90. The result looks as follows:

{
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": [
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "farequote_single_bucket_1486656600000_600",
        "_score": 0,
        "_source": {
          "job_id": "farequote_single",
          "timestamp": 1486656600000,
          "anomaly_score": 90.67726,
          "bucket_span": 600,
          "initial_anomaly_score": 85.04854039170988,
          "event_count": 277,
          "is_interim": false,
          "bucket_influencers": [
            {
              "job_id": "farequote_single",
              "result_type": "bucket_influencer",
              "influencer_field_name": "bucket_time",
              "initial_anomaly_score": 85.04854039170988,
              "anomaly_score": 90.67726,
              "raw_anomaly_score": 13.99180406849176,
              "probability": 6.362276028576088e-17,
              "timestamp": 1486656600000,
              "bucket_span": 600,
              "is_interim": false
            }
          ],
          "processing_time_ms": 7,
          "result_type": "bucket"
        }
      }
    ]
  }
}

You can see that just one result record is returned: a single anomalous time bucket (at timestamp 1486656600000, or in my time zone, Thursday, February 9, 2017 11:10:00 A.M. GMT-05:00) that has an anomaly_score greater than 90. In other words, no other time buckets in this range had anomalies scoring that high. Let's look at some key portions of the output to fully understand what this is telling us:
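The timestamps in these result documents are in epoch milliseconds, so converting one to a human-readable time takes a small amount of care. A minimal sketch (standard library only; the division by 1,000 converts milliseconds to the seconds that Python expects):

```python
from datetime import datetime, timezone

# ML result timestamps are epoch milliseconds; Python's datetime
# expects seconds, so divide by 1000 before converting.
ts_ms = 1486656600000
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2017-02-09T16:10:00+00:00
```

This prints the bucket time in UTC; 16:10 UTC corresponds to the 11:10 A.M. GMT-05:00 shown above.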

  • timestamp: The timestamp of the leading edge of the time bucket (in epoch milliseconds).
  • anomaly_score: The current normalized score of the bucket, based upon the range of the probabilities seen over the entirety of the job. The value of this score may fluctuate over time as new data is processed by the job and new anomalies are found.
  • initial_anomaly_score: The normalized score of the bucket at the time that bucket was first analyzed by the analytics. This score, unlike the anomaly_score, will not change as more data is analyzed.
  • event_count: The number of raw Elasticsearch documents seen by the ML algorithms during the bucket's span.
  • is_interim: A flag that signifies whether the bucket is finalized or is still waiting for all of the data within the bucket span to be received. This field is relevant for ongoing jobs that are operating in real time. For certain types of analysis, there could be interim results, despite the fact that not all of the data for the bucket has been seen.
  • bucket_influencers: An array of influencers (and details on them) that have been identified for this current bucket. Even if no influencers have been chosen as part of the job configuration, or there are no influencers as part of the analysis, there will always be a default influencer of the influencer_field_name:bucket_time type, which is mostly an internal record-keeping device to allow for the ordering of bucket-level anomalies in cases where explicit influencers cannot be determined.
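Pulling these fields together, a hypothetical helper like the following (the function name and the returned keys are my own, not part of the ML results API) shows how a bucket result's _source might be summarized, including the drift between the initial and current normalized scores:

```python
# Hypothetical helper: summarize the key fields of a bucket result
# document; the input dict mirrors the _source structure shown above.
def summarize_bucket(source):
    return {
        "timestamp": source["timestamp"],
        "anomaly_score": round(source["anomaly_score"], 2),
        # True if renormalization has moved the score since the bucket
        # was first analyzed.
        "score_drifted": source["anomaly_score"] != source["initial_anomaly_score"],
        "interim": source["is_interim"],
        "influencer_fields": [
            bi["influencer_field_name"]
            for bi in source.get("bucket_influencers", [])
        ],
    }

# Abbreviated version of the _source document shown earlier.
bucket = {
    "timestamp": 1486656600000,
    "anomaly_score": 90.67726,
    "initial_anomaly_score": 85.04854039170988,
    "is_interim": False,
    "bucket_influencers": [{"influencer_field_name": "bucket_time"}],
}
print(summarize_bucket(bucket))
```

For the example bucket, score_drifted is True because renormalization raised the score from roughly 85 to roughly 91 after the bucket was first written.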

If a job does have named and identified influencers, then the bucket_influencers array may look like the following:

          "bucket_influencers": [
{
"job_id": "farequote",
"result_type": "bucket_influencer",
"influencer_field_name": "airline",
"initial_anomaly_score": 85.06429298617539,
"anomaly_score": 99.7634,
"raw_anomaly_score": 15.040566947916583,
"probability": 6.5926436244031685e-18,
"timestamp": 1486656000000,
"bucket_span": 900,
"is_interim": false
},
{
"job_id": "farequote",
"result_type": "bucket_influencer",
"influencer_field_name": "bucket_time",
"initial_anomaly_score": 85.06429298617539,
"anomaly_score": 99.76353,
"raw_anomaly_score": 15.040566947916583,
"probability": 6.5926436244031685e-18,
"timestamp": 1486656000000,
"bucket_span": 900,
"is_interim": false
}
],

Notice that, in addition to the default entry of the influencer_field_name:bucket_time type, there is an entry for the airline field, an influencer identified by the analytics. This is a cue that airline was a relevant influencer discovered at the time of this anomaly. Since multiple influencer candidates can be chosen in the job configuration, it is worth noting that in this case airline is the only influencer field; no other fields were found to be influential. Also note that, at this level of detail, the particular instance of airline (that is, which one) is not disclosed; that information is available when querying at the lower levels of abstraction, which we will discuss next.

Now that we have knowledge of the bucket-level details, we can look at how we can leverage this information for summary alerts. We will cover this later in this chapter.
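As a preview of that idea, the bucket-level query shown at the start of this section can be parameterized so that the job name, score threshold, and lookback window become alerting knobs. A minimal sketch (the function name and defaults are my own; the body it builds matches the query used earlier):

```python
import json

# Hypothetical builder for the bucket-level alert query shown earlier;
# job_id, min_score, and lookback are the knobs an alerting setup
# would tune.
def bucket_alert_query(job_id, min_score=90, lookback="now-2y"):
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": lookback}}},
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "bucket"}},
                    {"range": {"anomaly_score": {"gte": min_score}}},
                ]
            }
        }
    }

print(json.dumps(bucket_alert_query("farequote_single"), indent=2))
```

An alerting process could POST this body to .ml-anomalies-*/_search on a schedule (with a much shorter lookback, such as now-1h) and notify whenever any hits come back.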
