Record results

At a lower level of abstraction, there are results at the record level. Providing the most detail, record results show specific instances of anomalies and essentially answer the question "what entity was unusual and by how much?" To understand the structure and content of record-level results, let's query the results for a particular ML job. We will start by looking at the following results, which are for a simple single-metric job that has no defined influencers:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "now-2y" } } },
        { "term": { "job_id": "farequote_single" } },
        { "term": { "result_type": "record" } },
        { "range": { "record_score": { "gte": 90 } } }
      ]
    }
  }
}

Here, the query is asking for any record results from the farequote_single job with timestamps in the last two years, where the record_score is greater than or equal to 90. The result looks as follows:

{
  "hits": {
    "total": 1,
    "max_score": 0,
"hits": {
"total": 1,
"max_score": 0,
"hits": [
{
"_index": ".ml-anomalies-shared",
"_type": "doc",
"_id": "farequote_single_record_1486656600000_600_0_29791_0",
"_score": 0,
"_source": {
"job_id": "farequote_single",
"result_type": "record",
"probability": 3.3099524615371287e-20,
"record_score": 90.67726,
"initial_record_score": 85.04854039170988,
"bucket_span": 600,
"detector_index": 0,
"is_interim": false,
"timestamp": 1486656600000,
"function": "count",
"function_description": "count",
"typical": [
120.30986417315765
],
"actual": [
277
]
}
}
]
}
}

Let's look at some key portions of the output:

  • timestamp: The timestamp of the leading edge of the time bucket within which this anomaly occurred.
  • record_score: The current normalized score of the anomaly record, based upon the range of probabilities seen over the entirety of the job. The value of this score may fluctuate over time as new data is processed by the job and new anomalies are found.
  • initial_record_score: The normalized score of the anomaly record at the time that bucket was first analyzed by the analytics. Unlike the record_score, this score will not change as more data is analyzed.
  • detector_index: An internal counter to keep track of which detector configuration this anomaly belongs to. Obviously, with a single-detector job, this value will be zero, but it may be non-zero in jobs with multiple detectors.
  • function: A reference to keep track of which detector function was used for the creation of this anomaly.
  • is_interim: A flag that signifies whether the bucket is finalized or still waiting for all of the data within the bucket span to be received. This field is relevant for ongoing jobs that are operating in real time. For certain types of analysis, there could be interim results, despite the fact that not all of the data for the bucket has been seen (see the example query after this list).
  • actual: The actual observed value of the analyzed data in this bucket. For example, if the function is count, then this represents the number of documents that are encountered (and counted) in this time bucket.
  • typical: A representation of the expected or predicted value, based upon the ML model for this dataset.
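
As a concrete illustration of how these fields can be used together, the following query retrieves only finalized (non-interim) record results for the same farequote_single job and sorts them by record_score, descending. This is a minimal sketch; the is_interim filter, the score threshold of 75, and the sort are our illustrative additions, not part of the earlier example:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "farequote_single" } },
        { "term": { "result_type": "record" } },
        { "term": { "is_interim": false } },
        { "range": { "record_score": { "gte": 75 } } }
      ]
    }
  },
  "sort": [
    { "record_score": { "order": "desc" } }
  ]
}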

If a job has splits defined (with by_field_name and/or partition_field_name) and has identified influencers, then the record results documents will contain more information:


"timestamp": 1486656000000,
"partition_field_name": "airline",
"partition_field_value": "AAL",
"function": "count",
"function_description": "count",
"typical": [
17.853294505163284
],
"actual": [
54
],
"influencers": [
{
"influencer_field_name": "airline",
"influencer_field_values": [
"AAL"
]
}
],
"airline": [
"AAL"
]
} …

Here, we can see not only the addition of the partition_field_name and partition_field_value fields (these would have been by_field_name and by_field_value if a by_field were used), but also that an array named after the partition field (airline) was constructed, containing the value of the field instance that was found to be anomalous. Also, as with the bucket results, there is an influencers array that articulates which influencers (and which values of those influencers) are relevant to this anomaly record.

Some of the information in the results document may seem redundant, especially when the only influencer defined in the job is the same field that the analysis is split on. While that is a recommended practice, it causes the record results to seemingly contain superfluous information. Things will look more interesting (and less redundant) if your job configuration has more influencer candidates defined.
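
Because the partition field's name and value are written into each record result, it is also easy to query the results for a single entity from a split job. The following is a minimal sketch, assuming the multi-metric job is named farequote_split (a hypothetical name) and is split on airline:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "farequote_split" } },
        { "term": { "result_type": "record" } },
        { "term": { "partition_field_value": "AAL" } }
      ]
    }
  }
}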

If your job is doing population analysis (via the use of over_field_name), then the record results document will be organized slightly differently, as the reporting is oriented around the unusual members of the population. For example, let's say we have a job that analyzes Apache web logs with the following configuration:


"analysis_config": {
"bucket_span": "15m",
"detectors": [
{
"detector_description": "count by status over clientip",
"function": "count",
"by_field_name": "status",
"over_field_name": "clientip",
"detector_index": 0
}
],
"influencers": [
"clientip",
"status",
"uri"
]
},

Here, an example anomaly record could look like this:


{
  "_index": ".ml-anomalies-shared",
  "_type": "doc",
  "_id": "gallery_record_1487223000000_900_0_-628922254_13",
  "_score": 0,
  "_source": {
    "job_id": "gallery",
    "result_type": "record",
    "probability": 4.593248987780696e-31,
    "record_score": 99.71500910125427,
    "initial_record_score": 99.71500910125427,
    "bucket_span": 900,
    "detector_index": 0,
    "is_interim": false,
    "timestamp": 1487223000000,
    "by_field_name": "status",
    "function": "count",
    "function_description": "count",
    "over_field_name": "clientip",
    "over_field_value": "173.203.78.60",
    "causes": [
      {
        "probability": 4.593248987780688e-31,
        "by_field_name": "status",
        "by_field_value": "404",
        "function": "count",
        "function_description": "count",
        "typical": [
          1.1177332137173952
        ],
        "actual": [
          1215
        ],
        "over_field_name": "clientip",
        "over_field_value": "173.203.78.60"
      }
    ],
    "influencers": [
      {
        "influencer_field_name": "uri",
        "influencer_field_values": [
          "/wp-login.php"
        ]
      },
      {
        "influencer_field_name": "status",
        "influencer_field_values": [
          "404"
        ]
      },
      {
        "influencer_field_name": "clientip",
        "influencer_field_values": [
          "173.203.78.60"
        ]
      }
    ],
    "clientip": [
      "173.203.78.60"
    ],
    "uri": [
      "/wp-login.php"
    ],
    "status": [
      "404"
    ]
  }
},…

This example is the same brute-force authentication attempt against a non-existent WordPress login page that we saw in Chapter 3, Event Change Detection.

Notice, first, that the main orientation is around the over field (in this case, the IP address of the clients hitting the website) and that, once an anomalous IP is found, an array of causes is built to compactly express all of the anomalous things that the IP did in that bucket. Again, many things may seem redundant, but this is primarily because these different ways of recording the information make it easier to aggregate, and easier to display the information in different ways in the Kibana user interface. With that being said, we will see that having access to this detailed information means that we can build very detailed alerts.
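
For example, an alerting mechanism could periodically run a query along the following lines against the population job and read the causes and influencers arrays out of each hit. This is a minimal sketch; the score threshold of 75 and the now-30m look-back window are arbitrary illustrative choices:

GET .ml-anomalies-*/_search
{
  "_source": [ "timestamp", "record_score", "over_field_value", "causes", "influencers" ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "gallery" } },
        { "term": { "result_type": "record" } },
        { "term": { "is_interim": false } },
        { "range": { "record_score": { "gte": 75 } } },
        { "range": { "timestamp": { "gte": "now-30m" } } }
      ]
    }
  }
}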
