Chapter 6. The Work Environment

Humans follow incentives, get easily distracted, and are forgetful. Systems keep evolving. Remember this whenever a human operator is expected to become an integral part of an operational process. Some fundamental problems related to monitoring and alerting are due to making false assumptions about human nature; others are due to putting insufficient weight on the importance of change. In general, the problem stems from the perception of how things ought to be, rather than how they actually are. The system is dynamic, many of its parts are in motion, and it is predictable only to a degree. The people who designed it are most often not the ones in charge of 24/7 operations. For that reason, the work environment should foster a flexible culture, one that supports adaptability and encourages growth.

Keeping an Audit Trail

Responding to alerts means dealing with uncertainty. Even in mature IT organizations, changes made by operators, such as new software rollouts, configuration updates, and infrastructure upgrades, account for more than 50% of all outages. Keeping an audit trail and consulting it during early outage indications can, therefore, reduce the initial uncertainty in every second case, giving the troubleshooter a massive advantage.

An accurate and complete audit trail does not necessarily have to come at the cost of high manual overhead. It can be largely automated with the help of a publish-subscribe messaging system, with elements of the infrastructure automatically publishing updates for routine tasks, such as deployments and upgrades. If the idea isn’t clear, think of GitHub’s activity feeds. Such a model works best for big organizations running their systems in a Service-Oriented Architecture (SOA). A team in charge of a service could subscribe to the audit trail feeds of its upstream services, so that any upstream changes are easily identifiable on a timeline.
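To make the idea more concrete, here is a minimal sketch of the publishing side, assuming a Redis instance serves as the message bus; the channel name, event fields, and helper function are hypothetical and would differ between organizations.

    # audit_publish.py: a sketch of automated audit trail publishing.
    # Assumes a Redis instance acts as the message bus; the channel name and
    # event fields are hypothetical stand-ins for organization-specific ones.
    import json
    import socket
    from datetime import datetime, timezone

    import redis

    def publish_audit_event(action, detail, channel="audit.myservice"):
        """Publish a routine operational event (deployment, upgrade, config change)."""
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "host": socket.gethostname(),
            "action": action,   # e.g., "deployment" or "config-update"
            "detail": detail,   # e.g., package name and version
        }
        redis.Redis(host="localhost", port=6379).publish(channel, json.dumps(event))

    if __name__ == "__main__":
        # A deployment wrapper would call this right after a successful rollout.
        publish_audit_event("deployment", "webapp 2.4.1 rolled out to the production fleet")

Downstream teams would then subscribe to the channels of their upstream services and render the incoming events on a shared timeline.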

Working with Tickets

At any given time, most operations teams designate an on-duty operator (the On-call) whose job it is to respond to incoming alerts and manage the ticket queue. In theory, on a typical day the On-call comes in to work, opens the ticket queue, and iterates through the list of tickets in descending order of severity. However, the work is prone to interruptions. When a new issue of high enough severity arrives in the queue, the On-call is expected to drop whatever he is working on to deal with the incoming event.

This theory doesn’t always apply in practice. More typically, the On-call comes in, opens the queue with its list of all-too-familiar, nonactionable tickets, and glances over it to catch any new arrivals. When the queue grows big enough, newly arriving medium-severity tickets go unnoticed in the crowd of their predecessors, and the time to initial response goes up.

On occasion, managers notice an unmanageable number of tickets in the queue and typically try to deal with the problem by allocating more resources. Here are the three most common ways in which this is done:

Incentive Schemes

Letting engineers know that the count of tickets they resolve is a measure of their performance

Allocating a Secondary On-Call

Getting another pair of hands to work on tickets

Occasional Queue Cleanups

Getting an entire team to clean the queue periodically for a day

All of these methods are equally ineffective because they all rely on the same flawed assumption: that a ticket generated from an alert is a unit of work rather than an indication of a problem in the system. In reality, the root of the problem is impaired detectability. To solve it, the alerting configuration should be made more effective.

Anomalous events that pose no customer impact should be recorded, but they must not be a reason for waking up operators in the middle of the night only to confirm the system’s sanity. All alarms that trigger on nonissues should be done away with if there is no evidence that the resulting alerts are actionable. If this policy is not followed, false alarms will cause more harm than good. There are only two ways in which one can respond to a nonissue: ignore it or overreact.

In the former case the detrimental effects will be prolonged and difficult to measure. Initially, the notifications will introduce a mild level of noise; the ticket queue will grow but it will be difficult to pinpoint the reason for this. After a while the operators will get desensitized to real problems and will stop taking tickets seriously. This is where the ball gets dropped. If the neglected problem develops into an outage, no one will understand why the operator had ignored it in the first place.

In the latter case, overreaction, the outcomes can be quite immediate. Let me illustrate this with the example of alarming on cache evictions in a memcache fleet. A cache eviction drops a relatively old entry from the cache when memory runs short, to make space for more frequently used entries. Cache evictions are not by themselves indicative of a problem or of degraded performance. Let’s assume that a high-priority ticket is created when cache evictions are detected. An ambitious operator might at first try to look for the root cause, but failing to find anything obvious, he decides to at least put the alarm out of alert state by restarting the memcache fleet. Now the cache is empty and needs to regenerate itself. In the process the web server fleet must work much harder because it is not being relieved by the caching layer, introducing strain and putting the system at unnecessary risk.
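A more defensible alerting rule, sketched below under the assumption of a hypothetical get_metric() helper and illustrative metric names and thresholds, records evictions but pages only when they coincide with an actual symptom, such as a falling hit ratio or rising backend latency.

    # eviction_check.py: a sketch of an alerting rule that treats cache
    # evictions as context rather than as a paging condition on their own.
    # The metric names, thresholds, and sample readings are illustrative.

    EVICTION_RATE_THRESHOLD = 100.0    # evictions per second
    HIT_RATIO_FLOOR = 0.80             # page only if the hit ratio degrades...
    BACKEND_LATENCY_CEILING_MS = 250   # ...or the web tier visibly suffers

    SAMPLE_METRICS = {                 # stand-in for a metrics store lookup
        "memcache.evictions_per_s": 340.0,
        "memcache.hit_ratio": 0.93,
        "webapp.backend_latency_p99_ms": 120.0,
    }

    def get_metric(name):
        return SAMPLE_METRICS[name]

    def should_alert():
        evictions = get_metric("memcache.evictions_per_s")
        hit_ratio = get_metric("memcache.hit_ratio")
        latency_ms = get_metric("webapp.backend_latency_p99_ms")

        # Evictions alone are normal operation: record them, do not page.
        if evictions < EVICTION_RATE_THRESHOLD:
            return False
        # Page only when evictions coincide with user-visible degradation.
        return hit_ratio < HIT_RATIO_FLOOR or latency_ms > BACKEND_LATENCY_CEILING_MS

    if __name__ == "__main__":
        print(should_alert())   # False: evictions are high, but nothing is degraded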

Root Cause Analysis

The term root cause tends to be interpreted differently by everyone, which leads to numerous breakdowns in communication. This issue can be clearly identified in the process of assigning a root cause at ticket resolution time. The outcome depends on the point of view of the person resolving the ticket. Let me explain the confusion with a deliberately vague example: if an operator intends to reboot a subset of hosts in sequence but mistakenly manages to reboot the entire fleet at once, is this an operator error, a misallocation of responsibilities, a lack of fine-grained tools, bad ACLs, or a problem with the process? With each interpretation, the blame is pointed at someone else. In effect, it depends on who gets asked the question. That subjective approach is not very constructive, but there are ways to avoid falling into this trap.

Root Cause Analyses (RCAs) are carried out to determine why major events had detrimental effects on the production environment. The main goal of an RCA is to establish the real reason behind the fault in order to take an informed corrective action and prevent future recurrences. Effective RCAs must keep two objectives in mind: they must be carried out with sufficient depth, and they must not focus on personal assignment of blame. When executed to find answers rather than a scapegoat, it quickly becomes apparent that the situation was a lot more complex than initially believed and that the problem could have been prevented at many levels, with varying degrees of effort.

The Five Whys

A practical RCA can be carried out via the Five Whys method. The method was developed by Sakichi Toyoda, the founder of Toyota Industries Co., and later used extensively at Toyota Motor Corporation as an efficient problem-solving tool and one of the core concepts in the Toyota production system.

The method also finds application in the analysis of system failures. It provides a practical approach to discovering causal relationships between events at several levels, and it draws a clear distinction between technical difficulties, the situational circumstances that led to them, and deficiencies in planning and resource allocation.

The method instructs us to ask approximately five consecutive, related “Why?” questions about the event, starting with the symptoms. Table 6-1 illustrates the question chain with a generalized example.

Table 6-1. Generalized Example of a Five Whys Analysis

“Why” Question | Answer
Why were the symptoms observed? | Because of an immediate cause.
Why did the immediate cause occur? | Because of an exceptional condition.
Why did the exceptional condition arise? | Because of a special circumstance.
Why was the special circumstance not handled properly or in time? | Due to insufficient X or excessive Y.
Why was there a lack of X or too much Y? |

The first two questions focus on the immediate technical cause and its source, the third “why” tries to find out more about the circumstances that led to the problem, and the last two questions focus on organizational inefficiencies and misallocations and their origin. It’s worth noting that the answers further down the chain become more subjective and open to interpretation. They serve well as conclusions, but may not necessarily be accurate.

The Five Whys method provides only an abstract skeleton for a causal chain of events. In order to get to the bottom of issues, assumptions and deductive logic will not suffice. A fair share of hands-on log mining and data analysis must take place in the process. Let’s walk through the analysis with a more concrete example:

A batch processing system does not accept new job submissions. Why not? The inspection of running jobs shows that a backlog was accumulated. Why the backlog? Performance graphs show reduced processing throughput. Why reduced throughput? Long delays are observed while processing certain batches. Why only selected batches? These batches differ in structure and contain attributes not understood by the system. Why does the system not understand them? The batches were built contrary to technical specification.

Asking the five whys uncovered two contributing factors: the submission of bad input and insufficient input validation. Of the two, the root cause is the lack of sufficient input validation: accepting malformed input should never be the reason for an outage. The corrective action involves implementing an input validation and rejection mechanism.
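As an illustration of that corrective action, here is a minimal sketch of validating and rejecting batch submissions at the boundary; the required attribute set and function names are hypothetical and would come from the actual technical specification.

    # batch_validation.py: a sketch of rejecting malformed batch submissions
    # before they enter the processing pipeline. The attribute names below are
    # hypothetical stand-ins for whatever the specification actually requires.

    REQUIRED_ATTRIBUTES = {"batch_id", "created_at", "records"}
    ALLOWED_ATTRIBUTES = REQUIRED_ATTRIBUTES | {"priority", "comment"}

    class MalformedBatchError(ValueError):
        """Raised when a submitted batch violates the technical specification."""

    def validate_batch(batch):
        missing = REQUIRED_ATTRIBUTES - batch.keys()
        unknown = batch.keys() - ALLOWED_ATTRIBUTES
        if missing:
            raise MalformedBatchError(f"missing required attributes: {sorted(missing)}")
        if unknown:
            raise MalformedBatchError(f"unknown attributes: {sorted(unknown)}")

    def submit(batch):
        # Reject bad input at submission time instead of letting it stall
        # processing and build up a backlog later.
        validate_batch(batch)
        print(f"accepted batch {batch['batch_id']}")   # hand off to the real queue here

    if __name__ == "__main__":
        submit({"batch_id": "B-17", "created_at": "2024-01-01T00:00:00Z", "records": []})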

Extracting Categories

A portion of the answers to the questions in a Five Whys analysis may be used to form a list of root cause categories. Highly specific classifiers are not very useful, as there are too many of them and they get outdated too fast. On the other hand, a classifier that’s too open-ended does not convey meaningful information for the purposes of reporting. Well-formulated categories come from generalizing the answers to the middle questions of the five asked.

The following list of suspected causes was compiled through Five Whys analysis of a sample of tickets. The resulting twelve categories are divided into three main groups: technical errors; monitoring problems, which are used for measuring precision and recall; and other, unidentified faults. The categories describe specific shortcomings; they do not include coinciding events and contributing factors, such as content updates or specific maintenance work that may have led to the problem.

Software Error

Problems that are a direct consequence of software flaws. The category includes software bugs, architectural limitations, and gross inefficiencies with perceivable impact, to be eliminated through the rollout of patched versions.

Misconfiguration

Faults originating from suboptimal or incorrect system settings.

Hardware Error

Physical faults with a visible effect on the system’s operation.

Network Error

Diminished performance traceable to deterioration of the underlying network link.

Data Corruption

Faults incurred in the process of transmission, storage or extraction of data.

Operator Error

Faults that arise as a consequence of mishandling the system through the use of operator privileges. Operator errors come from negligence, inexperience, and the lack of a deep understanding of the system. They occur during migrations, host upgrades, and cruft cleanups, typically due to overaggressive deactivation of parts of the system or lack of adequate preparation.

Capacity Limit

Issues resulting from running a system whose workload exceeds its capacity in normal operation. This category excludes capacity exhaustion caused by operator errors or critical software bugs leading to saturation of computational resources.

Dependency Error

Faults generated by downstream services on which the system depends. Example dependencies include databases, external workflow engines, and cloud services. When dependencies experience downtime they may impair the dependent system’s functionality.

False Alarm

Tickets that come as a result of oversensitive monitoring and bugs in monitoring applications. The incidence of false alarms should be reduced to avoid noise.

Duplicate Ticket

Multiple tickets informing about the same issue, resulting from insufficient aggregation. Only the first ticket in the group should be considered a valid alert. The remainder introduce noise and should be discarded as duplicates. Their incidence should be reduced to avoid desensitizing operators.

Insufficient Monitoring

Tickets created manually as a result of deficiencies in monitoring. This category applies when a lack of relevant metrics and alerting configuration allows a preventable problem to go unnoticed and develop into a critical issue.

Unknown/Other

An unidentified or unclassified group of problems. It is often feared that the “Other” category serves as a dumping ground for neglectful investigators, and for that reason the classifier is sometimes removed. This approach increases operator effort and reduces the accuracy of classification. When the number of items classified as “Other” grows out of proportion, it is a sign that the classification process may be flawed.
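Encoded as a fixed vocabulary, the categories above might look like the following sketch, which a ticketing tool could attach to tickets at resolution time; the enum itself is hypothetical.

    # root_cause.py: a sketch of the root cause categories as a closed
    # vocabulary assigned to tickets at resolution time.
    from enum import Enum

    class RootCause(Enum):
        # Technical errors
        SOFTWARE_ERROR = "software error"
        MISCONFIGURATION = "misconfiguration"
        HARDWARE_ERROR = "hardware error"
        NETWORK_ERROR = "network error"
        DATA_CORRUPTION = "data corruption"
        OPERATOR_ERROR = "operator error"
        CAPACITY_LIMIT = "capacity limit"
        DEPENDENCY_ERROR = "dependency error"
        # Monitoring problems, used for measuring precision and recall
        FALSE_ALARM = "false alarm"
        DUPLICATE_TICKET = "duplicate ticket"
        INSUFFICIENT_MONITORING = "insufficient monitoring"
        # Everything else
        UNKNOWN_OTHER = "unknown/other"

    if __name__ == "__main__":
        # A resolver would pick exactly one category per ticket.
        print(RootCause.MISCONFIGURATION.value)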

Dealing with Anomalies

In large-scale system operation, failure is the norm. Transient errors often occur very briefly, sometimes in spikes at unpredictable intervals. Low-percentage errors occur continuously during normal system operation, with failed events constituting a tiny fraction (not more than 0.1%) of all successful events.
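To make the 0.1% figure concrete, here is a minimal sketch of classifying an observed error rate; the function names and sample numbers are illustrative.

    # error_rate.py: a sketch of checking whether failures stay within the
    # low-percentage band described above.

    LOW_PERCENTAGE_CEILING = 0.001   # 0.1%

    def error_rate(failed, successful):
        total = failed + successful
        return failed / total if total else 0.0

    def is_low_percentage(failed, successful):
        return error_rate(failed, successful) <= LOW_PERCENTAGE_CEILING

    if __name__ == "__main__":
        # Roughly 950 failures against 1.2 million successes is about 0.08%:
        # continuous, worth recording, but within the low-percentage band.
        print(is_low_percentage(failed=950, successful=1_200_000))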

Both types of errors will crop up at large scale and are seen as potential threats to the availability levels agreed on in SLAs. This belief is not unjustified, but it is important to keep a healthy sense of proportion, as the real threat to availability comes from long-lasting outages and not from occasional errors. Despite that fact, there is a tendency for more human effort to be invested in the root cause analysis of petty issues than in the prevention of potentially disastrous outages.

Having said that, low percentage errors are by no means unimportant. They do happen for a reason and often are an early indication of hitting resource limits. This is not necessarily a bad thing—it might just mean that your fleet is not overscaled and that you get optimal value for money! As long as the errors stay at negligible levels, there might be other, more urgent things to worry about.

Learning from Outages

When high-visibility, unplanned outages hit the system they should be dealt with accordingly—a quick response followed by root cause investigation. The root cause is used to drive the corrective action, such as implementation of a safety trigger or rearrangement of components to limit performance bottlenecks.

But the failure could be embraced even further. There is a wealth of information in outages since what we imagine the system to be is not necessarily what’s really out there. Outages driven by extreme conditions often uncover unexpected behaviors of subsystems and components. Only a fraction of them might be relevant to current issue resolution, but the remainder may be an indication of weak spots where they are least expected. There is a strong case for observing failure beyond the root cause, paying special attention to recoverability and resilience of all subsystems and their components.

Using Checklists

Checklists are an extremely useful device for reducing human error. They strengthen consistency in following procedures and shorten the time otherwise spent on improvising solutions. They are particularly suitable for dealing with high-severity events. Many of us experience temporary amnesia and panic in the initial period of coming under pressure. Opening an event response with a checklist is a great way to deal with this.

Well-conceived checklists must have a few characteristics:

  • They must be used for nontrivial tasks that bear a degree of responsibility.

  • They must be designed to check for essentials.

  • They must be short.

If the above conditions are not met, then some or all of the team will not follow the checklist. This defeats the whole purpose.

Checklists may have a detrimental effect if used inappropriately. Take as an example a daily checklist meant to rule out all faults experienced historically.

Going through a long list of past problems to verify their absence is a boring and frustrating task. It is carried out more reliably by system alarms. A quick peek into the ticket queue combined with a dashboard glance-over should yield the same, if not better, results. If it doesn’t, then either the dashboard or the alarm configuration needs to be improved.

Warning

Some things are easier to put into existence than to remove, and checklist items definitely belong to this category. For a checklist to be effective, its items must be meaningful and there must be very few of them.

Creating Dashboards

A dashboard is a collection of top-level performance indicators, gathered in one place to serve as a central point of reference. Dashboards are great tools for conveying the essentials of state information in real time. They are created by system administrators soon after a set of higher-order performance metrics has been identified. Frequent, proactive examination of these metrics is essential and helps operators and administrators stay on top of things.

Creating dashboards is the art of communicating a lot with little. Good dashboards start with a high-level overview, assisted by succinct explanation, and allow an operator to click through to ever finer levels of detail. They use the browser’s real estate wisely. Poor dashboards present the viewer with information overload and data that is not organized systematically.

Dashboards are created to give an overview of the system, but they are sometimes used for watching in anticipation of a problem. That’s fine as long as observing timeseries isn’t someone’s full-time job. Very few people find it rewarding to be allocated solely to following data point fluctuations. Even though our brains are superior pattern-matching engines, expecting humans to act as an alerting system is inherently unreliable. It makes more sense to invest the time in building a sophisticated monitoring system.

Service-Level Agreements

SLAs pose a baselining danger. If a team faces SLAs that can’t be met for purely technical reasons, exactly two things can happen: the team may use the unrealistic SLA as the baseline for monitoring, or it may ignore it and use the system’s real baseline, calculated from current availability and performance figures. The former solution has a more detrimental effect on meeting the SLA, due to the amount of false positives it generates, yet the latter solution is almost never adopted, at least not initially.

Here is why false positives exacerbate the problem: suppose an alarm was created with a threshold reflecting an SLA that cannot be met. As a result, the alarm constantly goes in and out of alert. Suppose this happens a couple of times a day. Being used to frequent alerts, the operators stop investigating the source of the problem and leave the ticket alone (they are “maturing the ticket”). Once the SLA levels are back to normal, the operator closes it off. All is well until a real issue is reported and the operator ignores it, because the problem is assumed to be self-recoverable. The service levels deteriorate even further, subjecting users to an even poorer experience. Tickets that are ever-present in the queue will desensitize operators.
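One way out, sketched below under the assumption that daily availability figures can be pulled from monitoring, is to derive the alert threshold from the system’s real baseline rather than from the contractual target; the margin and sample figures are illustrative.

    # baseline.py: a sketch of setting an alert threshold from observed
    # availability rather than from an SLA target the system cannot meet.
    import statistics

    def availability_baseline(daily_availability, margin=0.0005):
        """Return an alert threshold slightly below typical observed availability.

        daily_availability: daily fractions, e.g., 0.9991 for 99.91%.
        margin: how far below the observed median the alarm should fire.
        """
        return statistics.median(daily_availability) - margin

    if __name__ == "__main__":
        # Thirty days of real figures hovering around 99.9%, against a
        # contractual SLA of 99.99% that cannot currently be met.
        history = [0.9990, 0.9992, 0.9989, 0.9991, 0.9993] * 6
        threshold = availability_baseline(history)
        print(f"alert when daily availability drops below {threshold:.4%}")

The contractual figure still matters, but it is better tracked as a reporting metric than used as an alarm threshold until the technical gap is closed.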

Note

Unrealistic SLAs can be avoided. Inviting a senior member of the technical team to service contract negotiation meetings is always a good idea.

Preventing the Ironies of Automation

With time, increasingly more manual processes become automated. The operators, while still tasked with supervision of the system, will be less and less in touch with system internals. The skill deteriorates when not used. These ironies of automation can be countered, but the process requires conscious cultural effort:

  1. Seek simplicity. Complexity breeds confusion and leads to errors. Simplicity goes hand in hand with consistency. Minimizing confusion and having everyone on the same page expedites recovery.

  2. Automate. You will save time, reduce costs, improve reliability, and have a lot of fun while doing it.

  3. Monitor extensively. Record every relevant piece of information, know where it came from, and be able to pull it up at any time.

  4. Keep SOPs short. De-automation of one end defeats the purpose of automation on the other. The longer your SOP, the higher the likelihood of error.

  5. Encourage learning. Operators are hired for quick and effective response. Some develop strong intuition, but even intuition must be backed by experience, and frankly, it’s not enough. Operators must know what they are doing. It makes sense to give up some productivity in favor of a deepening understanding of the system.

Culture

The human element plays a crucial role in the process. As a result, effective monitoring depends heavily on the right organizational culture. A strong culture is hard to define in absolute terms, but certain characteristics are universal. Successful cultures that drive effectiveness encourage consistency, trust in technology, and a healthy sense of proportion. Ineffective cultures do the opposite. In particular:

  • Assigning a high degree of personal responsibility to boring tasks will sooner or later result in negligence

  • Allowing for multi-step manual procedures guarantees that some of the steps will be missed for a portion of the time

  • Punishing everyone for one person’s mistake with added bureaucracy introduces unnecessary overhead

In all of the above, the problem exists in the process, not in the operator. Processes can be improved through review, streamlining, and automation. Realizing that, yet still keeping the human operator in the picture, helps to drive a healthy culture: one that nudges even the wrong people to do the right thing and turns problems into solutions instead of blame.
