Outage background

This scenario is loosely based on a real application outage, although the data has been simplified and sanitized to protect the identity of the original user. The problem involved a retail application that processed gift card transactions. Occasionally, the application would stop working and transactions could not be processed, which was only discovered when individual stores called headquarters to complain. The root cause could not easily be ascertained by the customer. Because the root cause was never identified, and because the symptoms could be cleared by simply rebooting the application servers, the problem recurred at random and plagued them for months.

The following data was collected and included in the analysis to help understand the origins of the problem:

  • A summarized (1-minute) count of transaction volume (the main KPI)
  • Application logs (semi-structured, text-based messages) from the transaction processing engine
  • SQL Server performance metrics from the database that backed the transaction processing engine
  • Network utilization performance metrics from the network on which the transaction processing engine operates

Accordingly, four ML jobs were configured against the data, as follows (a sketch of two of these configurations appears after the list):

  • it_ops_kpi: Using low_sum on the number of transactions processed per minute
  • it_ops_logs: Using a count by mlcategory detector to count the number of log messages of each type, with dynamic, ML-based categorization delineating the different message types
  • it_ops_sql: Simple mean analysis of every SQL Server metric in the index
  • it_ops_network: Simple mean analysis of every network performance metric in the index
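
To make the job definitions more concrete, the following is a rough sketch of what the first two jobs might look like if defined directly against the anomaly detection API. The index and field names (events_per_min, message, host), the bucket spans, and the shared host influencer are assumptions made here for illustration and were not specified in the original scenario; the real jobs could just as easily have been built with the Kibana job wizards.

```
# Hypothetical sketch: field names (events_per_min, message, host), bucket
# spans, and the shared "host" influencer are assumptions for illustration.

PUT _ml/anomaly_detectors/it_ops_kpi
{
  "description": "Low sum of transactions processed per minute",
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "low_sum(events_per_min)",
        "function": "low_sum",
        "field_name": "events_per_min"
      }
    ],
    "influencers": [ "host" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

PUT _ml/anomaly_detectors/it_ops_logs
{
  "description": "Count of log messages per ML-derived category",
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "detector_description": "count by mlcategory",
        "function": "count",
        "by_field_name": "mlcategory"
      }
    ],
    "influencers": [ "mlcategory", "host" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

The it_ops_sql and it_ops_network jobs would look similar, except with one mean detector per metric field (the kind of multi-metric job the Kibana wizard builds automatically). Declaring a common influencer field across all four jobs is what makes the cross-job correlation described later in this chapter possible.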

These four jobs were configured and run against the data covering the period when the problem occurred in the application. Anomalies were found, especially in the KPI that tracked the number of transactions being processed. In fact, this is the same KPI that we saw at the beginning of this chapter, where an unexpected dip in order processing was the main indicator that a problem was occurring.

However, the root cause wasn't understood until this KPI's anomaly was correlated with the anomalies in the other three ML jobs, which analyzed data from the underlying technology and infrastructure. Let's see how the power of visual correlation and shared influencers allowed the underlying cause to be discovered.
