The importance and limitations of KPIs

Because of the problem of scale and the desire to make the collected data actionable, it is natural that the first metrics to be tackled for active inspection are those that best indicate performance or operation. The KPIs that an IT organization chooses to measure, track, and flag can span diverse indicators, including the following:

  • Customer-impacting metrics such as application response times or error counts
  • Availability-oriented metrics such as uptime or mean time to repair (MTTR)
  • Business-oriented metrics such as orders per minute, revenue, or number of active users

As such, these types of metrics are usually displayed, front and center, on most high-level operational dashboards or on staff reports for employees ranging from technicians to executives. A quick Google image search for KPI dashboard will return countless examples of charts, gauges, dials, maps, and other eye candy.

While there is great value in such displays of information that can be consumed with a mere glance, there are still fundamental challenges with manual inspection:

  • Interpretation: Difficulty in understanding the difference between normal operation and abnormal, unless that difference is already intrinsically understood by the human.
  • Challenges of scale: Despite the fact that KPIs are already a distillation of all metrics down to a set of important ones, there still may be more KPIs to display than is feasible given the real estate of the screen that the dashboard is displayed upon. The end result may be crowded visualizations or lengthy dashboards that require scrolling/paging.
  • Lack of proactivity: Many dashboards like this do not have their metrics tied to alerts, so they require constant supervision if it is important to know proactively that a KPI is faltering.

The bottom line is that KPIs are an extremely important step in the process of identifying and tracking meaningful indicators of the health and behavior of an IT system. However, merely identifying and tracking a set of KPIs with a visual-only paradigm leaves significant deficiencies in the strategy of a successful IT operations plan.

For this reason, KPIs are great candidates for metrics to be tracked by Elastic's ML. We saw an example of this in Chapter 3, Event Change Detection, with the following data:

{
  "metrictype": "kpi",
  "@timestamp": "2016-02-12T23:11:09.000Z",
  "events_per_min": 22,
  "@version": "1",
  "type": "it_ops_kpi",
  "metricname": "online_purchases",
  "metricvalue": "22",
  "kpi_indicator": "online_purchases"
}

In this case, the KPI (online_purchases) represents the summarized total number of purchases per minute for some online transaction processing system. We also saw that tracking this KPI over time was extremely easy with ML, and that an unexpected dip in online sales (to a value of 921) was detected and flagged as anomalous.

We also saw in Chapter 3, Event Change Detection, that if there was another categorical field in the data that allowed it to be segmented (for example, sales by product ID, product category, geographical region, and so on), then ML could easily split the analysis along that field and analyze each segment in parallel. (In Chapter 6, Alerting on ML Analysis, we'll see how we can easily tie the detected anomalies to proactive alerts.) But with all of that, let's not lose sight of what we're accomplishing here: a proactive analysis of a key metric that someone likely cares about. The number of online sales per unit of time is directly tied to incoming revenue and thus is an obvious KPI.
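
As a rough illustration of the kind of job discussed here, the following Python sketch builds a request body in the general shape of an Elastic ML anomaly detection job configuration. It is only a sketch under stated assumptions: the helper name build_kpi_job, the low_mean function, the 15m bucket span, and the product_id split field are illustrative choices that do not come from the sample data or from the job used in Chapter 3. Only events_per_min and @timestamp are taken from the document shown earlier.

```python
import json


def build_kpi_job(metric_field, bucket_span="15m", split_field=None):
    """Build a job body for tracking one KPI field (hypothetical helper).

    metric_field: numeric field holding the KPI value.
    bucket_span:  analysis interval; "15m" is an assumed default.
    split_field:  optional categorical field to split the analysis on.
    """
    detector = {
        # low_mean flags unexpectedly low values, such as a dip in sales.
        # This is an assumed choice; other functions may suit other KPIs.
        "function": "low_mean",
        "field_name": metric_field,
    }
    if split_field is not None:
        # One independent baseline is modeled per distinct value of the field.
        detector["partition_field_name"] = split_field

    return {
        "description": f"KPI tracking for {metric_field}",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
        },
        "data_description": {"time_field": "@timestamp"},
    }


# A single-metric job on the field from the sample document.
single = build_kpi_job("events_per_min")

# The same analysis split by a hypothetical product_id field.
per_product = build_kpi_job("events_per_min", split_field="product_id")

print(json.dumps(per_product, indent=2))
```

The only difference between the two configurations is the partition field on the detector, which is why splitting the analysis along a categorical field is such a small step once the single-metric job is in place.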

However, despite the importance of knowing that something unusual is happening with our KPI, there is still no insight as to why it is happening. Is there an operational problem with one of the backend systems that supports this customer-facing application? Was there a user interface coding error in the latest release that makes it harder for users to complete the transaction? Is there a problem with the third-party payment processing provider that is relied upon? None of these questions can be answered merely by scrutinizing the KPI.

To get that kind of insight, we will need to broaden our analysis to include other sets of relevant and related information.
