Controlling ML via the API

As with just about everything in the Elastic Stack, ML can be fully automated via API calls, including job configuration, execution, and result gathering. In fact, every interaction in the Kibana UI invokes the ML API behind the scenes, so you could even write your own UI if you needed specific workflows or visualizations.
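To illustrate, every ML operation is just an HTTP call. As a minimal sketch, assuming a cluster reachable at localhost:9200 and secured with basic authentication (the elastic user and changeme password below are placeholders for your own credentials), the same job list that Kibana's Job Management page displays can be fetched with any HTTP client:

# host and credentials are placeholders; adjust for your cluster
curl -u elastic:changeme -X GET "localhost:9200/_xpack/ml/anomaly_detectors?pretty"

The remaining examples in this section use the equivalent Kibana Dev Tools (Console) syntax.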

For more in-depth information about the APIs, please refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-apis.html. We won't cover every endpoint here, but a few are worth a detour.

The first is the job creation API, which creates the ML job configuration. For example, if you wanted to recreate the population analysis job shown in the previous example, the following JSON defines a job called my_cpu_job:

PUT _xpack/ml/anomaly_detectors/my_cpu_job
{
  "description": "Processes that use more CPU than others",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "high mean CPU",
        "function": "high_mean",
        "field_name": "system.process.cpu.total.pct",
        "over_field_name": "system.process.name"
      }
    ],
    "influencers": [
      "system.process.name",
      "beat.hostname"
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}

The preceding JSON contains the same configuration we previously assembled by clicking through the Kibana UI, so it is completely equivalent to the job we created there. If you send this to the endpoint, you will get the following JSON response:

{
  "job_id" : "my_cpu_job",
  "job_type" : "anomaly_detector",
  "job_version" : "6.5.1",
  "description" : "Processes that use more CPU than others",
  "create_time" : 1543197011209,
  "analysis_config" : {
    "bucket_span" : "15m",
    "detectors" : [
      {
        "detector_description" : "high mean CPU",
        "function" : "high_mean",
        "field_name" : "system.process.cpu.total.pct",
        "over_field_name" : "system.process.name",
        "detector_index" : 0
      }
    ],
    "influencers" : [
      "system.process.name",
      "beat.hostname"
    ]
  },
  "analysis_limits" : {
    "model_memory_limit" : "1024mb",
    "categorization_examples_limit" : 4
  },
  "data_description" : {
    "time_field" : "@timestamp",
    "time_format" : "epoch_ms"
  },
  "model_snapshot_retention_days" : 1,
  "results_index_name" : "shared"
}

Note that the job_id field needs to be unique when creating the job.

It's important to note that the job also needs to know which index of raw data to analyze and which query to execute against that index. This is part of the datafeed configuration, which has its own API, documented at https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-put-datafeed.html.

An example request to configure the datafeed for a job called my_cpu_job would be the following:

PUT _xpack/ml/datafeeds/datafeed-my_cpu_job
{
  "job_id" : "my_cpu_job",
  "indexes" : [
    "metricbeat-*"
  ]
}

The response would be as follows:

{
  "datafeed_id" : "datafeed-my_cpu_job",
  "job_id" : "my_cpu_job",
  "query_delay" : "106392ms",
  "indices" : [
    "metricbeat-*"
  ],
  "types" : [ ],
  "query" : {
    "match_all" : {
      "boost" : 1.0
    }
  },
  "scroll_size" : 1000,
  "chunking_config" : {
    "mode" : "auto"
  }
}

Notice that the default query against the index is match_all, which means that no filtering will take place. We could, of course, insert any valid Elasticsearch DSL into the query block to perform custom filtering or aggregations. This concept will be covered later in the book.
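For instance, if we only wanted to analyze data from a single host, the datafeed could be created with a term query in place of the default match_all. The following is only a sketch; the hostname web-server-01 is a hypothetical value standing in for one of your own hosts:

PUT _xpack/ml/datafeeds/datafeed-my_cpu_job
{
  "job_id" : "my_cpu_job",
  "indexes" : [
    "metricbeat-*"
  ],
  "query" : {
    "term" : {
      // web-server-01 is a hypothetical hostname
      "beat.hostname" : "web-server-01"
    }
  }
}

Because the datafeed-my_cpu_job datafeed was already created above, you would either delete it first or apply the new query with the update datafeed API rather than re-issuing the PUT.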

There are additional APIs for extracting results and for managing other operational aspects of an ML job. Consult the online documentation for more information.
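As a taste of those operational APIs, the following sketch opens the job, starts its datafeed, and then queries for anomaly records via the get records results API:

POST _xpack/ml/anomaly_detectors/my_cpu_job/_open

POST _xpack/ml/datafeeds/datafeed-my_cpu_job/_start

// 75 is an arbitrary illustrative threshold; records scoring at or above it are returned
GET _xpack/ml/anomaly_detectors/my_cpu_job/results/records
{
  "record_score" : 75
}

This is essentially the sequence the Kibana UI performs when you create and run a job, and the records returned are the same anomalies that the Anomaly Explorer visualizes.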
