7. Learn from Your Mistakes

Research has long supported the position that we learn more from our failures than from our successes. But we can only truly learn from our failures if we foster an environment of open, honest communication and fold in lightweight processes that help us repeatedly learn from, and get the most out of, our mistakes and failures. Rather than emulate the world of politics, where failures are hidden from others and as a result bound to be repeated, we should strive to create an environment in which we share our failures as antipatterns alongside our best practices. To be successful, we need to learn aggressively, rely on organizations like Quality Assurance (QA) appropriately, expect systems to fail and design for those failures, and treat each failure as a precious learning opportunity.

Rule 27—Learn Aggressively

Do people in your organization think they know everything there is to know about building great, scalable products? Or perhaps your organization thinks it knows better than the customer. Have you heard someone say that customers don’t know what they want? Although it might be true that customers can’t necessarily articulate what they want, that doesn’t mean they don’t know it when they see it. Failing to learn continuously and aggressively, meaning at every opportunity, will leave you vulnerable to competitors who are willing to constantly learn.

Our continuing research on social contagion (also known as viral growth) of Internet-based products and services has revealed that organizations that possess a learning culture are far more likely to achieve viral growth than those that do not. In case you’re not familiar with the terms social contagion or viral growth, the term viral derives from epidemiology (the study of health and illness in populations) and is used in reference to Internet-based companies to explain how things spread from user to user. The exponential growth of users is known as viral growth and implies the intentional sharing of information by people. In nature most people do not intentionally spread viruses, but on the Internet they do in the form of information or entertainment, and the resulting spread is similar to that of a virus. Once this exponential growth starts, it is possible to predict its rate accurately because it follows a power law distribution until the product reaches a point of nondisplacement. Figure 7.1 shows the growth in cumulative users for a product achieving viral growth (solid line) and one that just barely misses the tipping point by less than 10%.
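For a rough quantitative intuition (our own illustration, not a formula taken from the figure), a common simplified model assumes that each user successfully brings in K additional users per sharing cycle, where K is the viral coefficient. Starting from N_0 seed users, cumulative users after t cycles are approximately

\[
N(t) = N_0 \sum_{i=0}^{t} K^{i} = N_0 \, \frac{K^{t+1} - 1}{K - 1}, \qquad K \neq 1,
\]

which grows without bound when K > 1 and flattens out when K < 1. This is one way to think about why narrowly missing the tipping point produces the second, near-miss curve in Figure 7.1.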

Figure 7.1. Viral growth


The importance of creating a culture of learning cannot be overstated. Even if you’re not interested in achieving viral growth but want to produce great products for your customers, you must be willing to learn. There are two areas in which learning is critical. The first, as we have been discussing, is from the customers. The second is from the operations of the business/technology. We discuss each briefly in turn. Both rely on excellent listening skills. We believe that we were given two ears and one mouth to remind us to listen more than we talk.

Focus groups are interesting because you get an opportunity to sit down with your customers and hear what they think. The problem is that they, like most of us, can’t really know how they will react to a product until they get to see and feel it in their own living room or on their own computer. Not to delve too deeply into the philosophical realm, but this is in part caused by what is known as social construction. Put very simply, we make meaning of everything (and we do mean everything—it’s been argued that we do this for reality itself) by labeling things with the meaning that is most broadly held within our social groups. While we can form our own opinions, they are most often just reflections of, or built on, what others believe. So how do you get around this problem of not being able to trust what customers say? Launch quickly and watch your customers’ reactions.

Watching your customers can be done in a number of ways. Simply keeping track of usage and adoption of new features is a great start. The more classic A/B testing is even better. This is when you randomly segment your customers into an A group and a B group and allow the A group access to one version of the product and the B group access to the other version. By comparing results such as abandonment rates, time spent on site, and conversion rates, you can decide which version performs better. Obviously some forethought must be put into the metrics that are going to be measured, but this is a great and fairly accurate way to compare product versions.
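As a minimal sketch of the mechanics (our own illustration; the hashing scheme, metric, and numbers are hypothetical and not drawn from any particular product), a deterministic hash keeps each customer in the same group across visits, and the chosen metric is then compared between groups:

```python
import hashlib


def assign_group(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into the A group or the B group.

    Hashing (experiment, user_id) keeps the assignment stable across
    visits without storing state, and the split is roughly 50/50.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors in a group who completed the target action."""
    return conversions / visitors if visitors else 0.0


# Hypothetical results collected once the test has run its planned course.
results = {
    "A": {"visitors": 10_000, "conversions": 410},
    "B": {"visitors": 10_000, "conversions": 465},
}

for group, data in results.items():
    rate = conversion_rate(data["conversions"], data["visitors"])
    print(f"Group {group}: {rate:.2%} conversion")
```

In practice you would decide the sample size and the comparison metric before the test starts and apply a significance test before declaring a winner.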

The other areas in which you must constantly learn if you want to achieve scalability are technology and business operations. We’ll talk more about this topic in Rule 30, but you must not let incidents or problems pass without learning from them. Every site issue, outage, or downtime is an opportunity to learn how to do things better in the future. If you don’t take time to perform a postmortem on the incident, get to the real root cause, and put that learning back into the organization so that you don’t have that same failure again, then you are bound to repeat your failures. Our philosophy is that while mistakes are unavoidable, making the same mistake twice is unacceptable. If a poorly performing query doesn’t get caught until it goes into production and results in a site outage, then we must get to the real root cause and fix it. In this case the root cause goes beyond the poorly performing query and includes the process and people that allowed it to get to production. By establishing a peer review of all code, a DBA review of all queries, or even a load and performance test, we can minimize the chance that we allow poorly performing queries into our production environment. The key here is to learn from everything—mistakes as well as successes.
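One way to make such reviews repeatable is a lightweight automated gate in the release process. The sketch below is our own illustration, using SQLite’s EXPLAIN QUERY PLAN as a stand-in for whatever database you run; the table and query are hypothetical. It simply flags queries whose plan falls back to a full table scan so that a DBA can review them before release:

```python
import sqlite3


def flag_full_scans(conn: sqlite3.Connection, query: str) -> list:
    """Return the steps of a query plan that scan an entire table.

    SQLite reports "SCAN <table>" when no index is used; other engines
    expose similar plans via their own EXPLAIN commands, so the same
    kind of gate can be adapted to them.
    """
    plan = conn.execute(f"EXPLAIN QUERY PLAN {query}").fetchall()
    return [row[3] for row in plan if row[3].startswith("SCAN")]


# Hypothetical review gate run against a staging copy of the schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

suspect = flag_full_scans(conn, "SELECT total FROM orders WHERE customer_id = 42")
if suspect:
    print("Needs DBA review before release:", suspect)
```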

Rule 28—Don’t Rely on QA to Find Mistakes

Rule 28 has an ugly, slightly misleading, and controversial title meant to provoke thought and discussion. Of course it makes sense to have a team responsible for testing products to identify defects. The issue is that you shouldn’t rely solely on these teams to identify all your defects any more than airlines rely on flight attendants for the safe landings of their planes. At the heart of this view is one simple fact: You can’t test quality into your system. Testing only identifies issues that you created during development, and as a result it is an identification of value that you destroyed and can recapture. Testing typically only finds mistakes, which often require rework that in turn increases the marginal cost per unit of work (functionality) delivered. It is rare that testing, or the group that performs it, identifies untapped opportunities that might create additional value.

Don’t get us wrong—QA definitely has an important role in an engineering organization. It is a role that is even more important when companies are growing at an incredibly fast rate and need to scale their systems. The primary role of QA is to help identify product problems at a lower cost than having engineers perform the same task. Two important derived benefits of this role are increased engineering velocity and an increased rate of defect detection.

These benefits are achieved in much the same fashion as the industrial revolution reduced the cost of manufacturing and increased the number of units produced. By pipelining the process of engineering and allowing engineers to focus primarily on building products (and of course unit testing them), less time is spent per engineer in the setup and teardown of the testing process. Engineers now have more time per day to focus on building applications for the business. Typically we see both output per hour and output per day increase as a result. Cost per unit drops as a result of higher velocity at static cost. Additionally, the headcount cost of a great QA organization is typically lower on a per-head basis than the cost of an engineering organization, which further reduces cost. Finally, because the testing organization is focused on and rewarded for identifying defects, its members don’t have the psychological conflict of finding problems in their own code (as many engineers do) or in the code of a good engineering friend who sits next to them.

None of this argues against pairing engineers and QA personnel together, as in the case of well-run Agile processes. In fact, for many implementations, we recommend such an approach. But the division of labor is still valuable and typically achieves the goals of reducing cost, increasing defect identification, and increasing throughput.

But the greatest, as yet unstated, value of QA organizations arises in the case of hyper growth companies. It’s not that this value doesn’t exist within static companies or companies of lower growth, but it becomes even more important in situations where engineering organizations are doubling (or more) in size annually. In these situations, standards are hard to enforce. Engineers with greater tenure within the organization simply don’t have time to keep up with and enforce existing standards, and even less time to identify the need for new standards that address scale, quality, or availability needs. When a team doubles year over year, then beginning in year three of the doubling, half of the existing “experienced” team has only a year or less of company experience!

That brings us to why this rule is in the chapter on learning from mistakes. Imagine an environment in which managers spend nearly half of their time interviewing and hiring engineers, and in which in any given year half of the engineers (or more) have less than a full year with the company. Imagine how much time the existing, longer-tenured engineers will spend trying to teach the newer engineers about the source code management system, the build environments, the production environments, and so on. In such an environment too little time is spent validating that things have been built correctly, and the number of mistakes released into QA (but hopefully not production) increases significantly.

In these environments, it is QA’s job to teach the organization what is happening from a quality perspective and where it is happening such that the engineering organization can adapt and learn. QA then becomes a tool to help the organization learn what mistakes it is making repeatedly, where those mistakes lie, and ideally how the organization can keep from making them in the future. QA is likely the only team capable of seeing the recurring problems.

Newer engineers, without the benefit of seeing their failures and the impacts of those failures, will likely continue to make them, and the approaches that lead to those failures will become habit. Worse yet, they will likely pass those bad habits on to the newly hired engineers as they arrive. What started out as a small increase in the rate of defects will become a vicious cycle. Everyone will be running around attempting to identify the root cause of the quality nightmare, when the nightmare was bound to happen and is staring them in the face: a failure to learn from past mistakes!

QA must work to identify where a growing organization is having recurring problems and create an environment in which those problems are discussed and eliminated. And here, finally, is the most important benefit of QA—it helps an organization learn from engineering failures. Understanding that they can’t test quality into the system, and unwilling to accept a role as the safety screen behind a baseball catcher that stops uncaught balls, the excellent QA organization seeks to identify systemic failures in the engineering team that lead to later quality problems. This goes beyond the creation of burn-down charts and find/fix ratios; it involves digging into and identifying themes of problems and their sources. Once these themes are identified, they are presented along with ideas on how to solve the problems.

Rule 29—Failing to Design for Rollback Is Designing for Failure

To set the right mood for this next rule we should all be gathered around a campfire late at night telling scary stories. The story we’re about to tell you is your classic scary story, complete with people who hear scary noises in the house but don’t get out. Those foolish people who ignored all the warning signs were us. As heads of engineering and chief technology officers (CTOs), we believed, and had been told by almost every manager, architect, and engineer, that the application was too complex and not capable of being rolled back. We had several outages/issues after code releases that required a mad scramble to “fix-forward” and get a hot fix out later that same day to fully restore service. We lived with these minor inconveniences because we believed that the application was too complex to roll back.

Along came a major infrastructure release that, like all releases that came before, could not be rolled back. This release was the release-from-hell. Everything looked fine during the wee hours of the morning, but when traffic picked up as the East Coast woke up, the site went down. Had we been able to roll back, we could have done so at that point with a few upset customers and a bruised ego but nothing worse. But we couldn’t. So we coddled the site all day, adding capacity, throttling traffic, and so on, trying to keep things working until we had a fix. We pushed a patch late that evening and, without traffic on the site, thought we’d fixed it. The next morning, as traffic increased, the site started having problems again. This pattern of pushing a fix at night, believing it was fixed in the absence of traffic, and finding out the next day that the site still had issues carried on for more than a week.

By the end of that week everyone was exhausted from being up literally days in a row. We finally pushed a patch that completely bypassed the original changes and were able to stabilize the site. While many lessons were learned from that incident, including failures of leadership, the one most relevant for this rule is that all that pain, to us as well as to our customers, could have been avoided had we been able to roll back the code.

One of the actions that came out of our postmortem was that no code could be released unless it could be rolled back. At that point we had no choice but to make that edict; the business had zero tolerance for any more pain of that nature, and every single engineer understood that need as well. Six weeks later, when the next release was ready, we had the ability to roll back. What we all thought were insurmountable challenges turned out to be reasonably straightforward.

The following bulleted points have provided us, and many other teams since then, the ability to roll back. As you’d expect, the majority of the problem with rolling back lies in the database. By going through the application to clean up any outstanding issues and then adhering to some simple rules, every team should be able to roll back.

Database changes must only be additive— Columns or tables should only be added, not deleted, until the next version of code is released that deprecates the dependency on those columns or tables. Once these standards are implemented, every release should have a portion dedicated to cleaning up the last release’s data that is no longer needed.

DDL and DML scripted and tested— The database changes that are to take place for the release must be scripted ahead of time instead of applied by hand, and this must include the rollback script. The two reasons for this are that 1) the team needs to test the rollback process in QA or staging to validate that they have not missed something that would prevent rolling back, and 2) the script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database. (A brief sketch of such scripts appears after this list.)

Restricted SQL queries in the application— The development team needs to disambiguate all SQL by removing all SELECT * queries and adding column names to all UPDATE statements.

Semantic changes of data— The development team must not change the definition of data within a release. An example would be a column in a ticket table that is currently being used as a status semaphore indicating three values such as assigned, fixed, or closed. The new version of the application cannot add a fourth status until code is first released to handle the new status; only then can code be released that utilizes it.

Wire On/Wire Off— The application should have a framework added that allows code paths and features to be accessed by some users and not by others, based on an external configuration. This setting can be in a configuration file or a database table and should allow for both role-based access and random percentage-based access. This framework allows for beta testing of features with a limited set of users and for quick removal of a code path in the event of a major bug in the feature, without rolling back the entire code base. (A minimal sketch of such a framework also appears after this list.)
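To make the first two guidelines concrete, here is a minimal sketch of scripted, rehearsable forward and rollback steps. It is our own illustration: the ticket table, priority column, and helper are hypothetical, and SQLite stands in for your production database.

```python
import sqlite3

# Hypothetical release scripts; in practice these would be files checked
# into source control and run by the DBA or a migration tool, never by hand.
MIGRATE_UP = [
    # Additive only: a new nullable column that the new code will use.
    "ALTER TABLE ticket ADD COLUMN priority TEXT",
    # Backfill with DML so existing rows satisfy the new code's expectations.
    "UPDATE ticket SET priority = 'normal' WHERE priority IS NULL",
]

MIGRATE_DOWN = [
    # The rollback leaves the column in place (still additive); it only
    # reverts data the new code wrote, so the previous release keeps working.
    "UPDATE ticket SET priority = NULL",
]


def apply(conn: sqlite3.Connection, statements: list) -> None:
    """Run one scripted migration step inside a single transaction."""
    with conn:
        for stmt in statements:
            conn.execute(stmt)


# Rehearse both directions against a staging copy before release day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket (id INTEGER PRIMARY KEY, status TEXT)")
apply(conn, MIGRATE_UP)
apply(conn, MIGRATE_DOWN)
```

Because the schema change is additive, rolling back the application code does not require dropping the column; the rollback script only reverts data that the new code wrote.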
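And here is a minimal sketch of a wire on/wire off framework, assuming a simple in-process configuration; the feature name, roles, and percentage are hypothetical, and a production version would read them from the configuration file or database table described above.

```python
import hashlib

# Hypothetical configuration; in practice this would live in a configuration
# file or database table so it can be changed without a code release.
FEATURE_CONFIG = {
    "new_checkout_flow": {
        "enabled": True,
        "allowed_roles": {"beta_tester", "employee"},
        "percentage": 10,  # percent of remaining users wired on
    },
}


def _bucket(user_id: str, feature: str) -> int:
    """Stable 0-99 bucket so a user keeps the same on/off decision."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_wired_on(feature: str, user_id: str, roles: set) -> bool:
    """Decide whether a code path is available to this user."""
    cfg = FEATURE_CONFIG.get(feature)
    if not cfg or not cfg["enabled"]:
        return False  # feature wired off for everyone
    if roles & cfg["allowed_roles"]:
        return True  # role-based access, e.g., beta testers
    return _bucket(user_id, feature) < cfg["percentage"]  # percentage rollout


# Usage at the branch point in application code:
if is_wired_on("new_checkout_flow", user_id="u123", roles={"customer"}):
    pass  # new code path
else:
    pass  # existing code path
```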

We learned a painful but valuable lesson that left scars so deep we never pushed another piece of code that couldn’t be rolled back. Even though we moved on to other positions with other teams, we carried that requirement with us. As you can see from the preceding guidelines, these are not overly complex but rather straightforward rules that any team can apply to gain rollback capability going forward.

Rule 30—Discuss and Learn from Failures

Many of us, when discussing world events at social gatherings, have likely uttered sentences to the effect of “We never seem to learn from history.” But how many of us truly apply that standard to ourselves, our inventions, and our organizations within our work? There exists an interesting paradox within our world of highly available and highly scalable technology platforms: Those systems that are initially built the best fail less often, and as a result their organizations have fewer opportunities to learn. Inherent to this paradox is the notion that each failure of process, systems, or people offers us an opportunity to perform a “postmortem” of the event for the purpose of learning and modifying our systems. A failure to leverage these precious events to improve our people, processes, and technology dooms us to continuing to operate exactly as we do today, which in turn means a failure to improve. A failure to improve, when drawn on a business contextual canvas of hyper growth and therefore a need for aggressive scale, becomes a painting depicting business failure. Too many things happen in our business when we are growing quickly to believe that a solution we designed two years or even one year ago will be capable of supporting a business 10x the size it was when we built the system.

The world of nuclear power generation offers an interesting insight into this need to learn from our mistakes. In 1979, the TMI-2 reactor at Three Mile Island experienced a partial core meltdown, creating the most significant nuclear power accident in U.S. history. This accident became the source of several books, at least one movie, and two important theories on the source and need for learning in environments in which accidents are rare but costly.

Charles Perrow’s Normal Accident Theory hypothesizes that the complexity inherent to modern coupled systems makes accidents inevitable.1 The coupling inherent to these systems allows interactions to escalate rapidly with little opportunity for humans or control systems to interact successfully. Think back to how often you might have watched your monitoring solution go from all “green” to nearly completely red before you could respond to the first alert message.

Todd LaPorte, who developed the theory of High Reliability Organizations, believes that even in the absence of accidents from which an organization can learn, there are organizational strategies to achieve higher reliability.2 While the authors of these theories do not agree on whether the theories can coexist, they share certain common elements. The first is that organizations that fail often have greater opportunities to learn and grow than those that do not, assuming of course that they take the opportunity to learn. The second, which follows somewhat from the first, is that systems that fail infrequently offer little opportunity to learn, and as a result, in the absence of other approaches, the teams and systems will not grow and improve.

Having made the point that learning from and improving after mistakes is important, let’s depart from that subject briefly to describe a lightweight process by which we can learn and improve. For any major issue that we experience, we believe an organization should attack that issue with a postmortem process that addresses the problem in three distinct but easily described phases:

Phase 1 Timeline— Focus on generating a timeline of the events leading up to the issue or crisis. Nothing other than the timeline is discussed during this first phase. The phase is complete once everyone in the room agrees that there are no more items to be added to the timeline. We typically find that even after we’ve completed the timeline phase, people will continue to remember or identify timeline-worthy events in the next phase of the postmortem.

Phase 2 Issue Identification— The process facilitator walks through the timeline and works with the team to identify issues. Was it okay that the first monitor identified customer failures at 8 a.m. but that no one responded until noon? Why didn’t the auto-failover of the database occur as expected? Why did we believe that dropping the user_authorization table would allow the application to start running again? Each and every issue is identified from the timeline, but no corrections or actions are allowed to be made until the team is done identifying issues. Invariably, team members will start to suggest actions, but it is the responsibility of the process facilitator to focus the team on issue identification during Phase 2.

Phase 3 State Actions— Each issue identified in Phase 2 should have at least one action associated with it. The process facilitator walks down the list of issues and works with the team to identify an action, an owner, an expected result, and a time by which it should be completed. Following the SMART principles, each action should be specific, measurable, attainable, realistic, and timely. A single owner should be identified, even though the action may take a group or team to accomplish.

No postmortem should be considered complete until it has addressed the people, process, and technology issues responsible for the failure. Too often we find that clients stop at “a server died” as a root cause for an incident. Hardware fails, as do people and processes, and as a result no single failure should ever be considered the “true root cause” of any incident. The real question for any failure of scalability or availability is to ask “why didn’t the holistic system act more appropriately?” If a database fails due to load, why didn’t the organization identify the need earlier? What process or monitoring should have been in place to help the organization find the issue? Why did the failure take so long to recover? Why isn’t the database split up such that any failure has less of an impact on our customer base or services? Why wasn’t there a read replica that could be quickly promoted as the write database? In our experience, you are never finished unless you can answer “Why” at least five times to cover five different potential problems.

Now that we’ve discussed what we should do, let’s return to the case where we don’t have many opportunities to develop such a system. Weick and Sutcliffe have a solution for organizations lucky enough to have built platforms that scale effectively and fail infrequently.3 Their solution, as modified to fit our needs, is described as follows:

Preoccupation with failure— This practice is all about monitoring our product and our systems and reporting errors in a timely fashion. Success, they argue, narrows perceptions and breeds overconfidence. To combat the resulting complacency, organizations need complete transparency into system faults and failures. Reports should be widely distributed and discussed frequently, such as in a daily meeting to review the operations of the platform.

Reluctance to simplify interpretations— Take nothing for granted and seek input from diverse sources. Don’t try to box failures into expected behavior and act with a healthy bit of paranoia. The human tendency here is to explain small variations as being “the norm,” whereas they can easily be your best early indicator of future failure.

Sensitivity to operations— Look at detailed data at the minute level. Use real-time data and make ongoing assessments and continual updates based on it.

Commitment to resilience— Build excess capability by rotating positions and training your people in new skills. Former employees of eBay operations can attest that DBAs, SAs, and network engineers used to be rotated through the operations center to do just this. Furthermore, once fixes are made the organization should be quickly returned to a sense of preparedness for the next situation.

Deference to expertise— During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem. Consider creating a competency around crisis management such as a “technical duty officer” in the operations center.

Never waste an opportunity to learn from your mistakes, as they are your greatest source of opportunity for positive change. Put a process, such as a well-run postmortem, in place to extract every ounce of learning that you can from your mistakes. If you have a well-designed system that fails infrequently, even under extreme scale, practice organizational “mindfulness” and get close to your data to identify future failures more easily. It is easy to be lured into a sense of complacency in these situations, and you are well served to hypothesize and brainstorm about different failure events that might happen.

Summary

This chapter has been about learning. Learn aggressively, learn from others’ mistakes, learn from your own mistakes, and learn from your customers. Be a learning organization and a learning individual. The people and organizations that constantly learn will always be ahead of those who don’t. As Charlie “Tremendous” Jones, the author of nine books and the recipient of numerous awards, said, “In ten years you will be the same person you are today except for the people you meet and the books you read.” We like to extend that thought: an organization will be the same tomorrow as it is today except for the lessons it learns from its customers, itself, and others.

Endnotes

1 Charles Perrow, Normal Accidents (Princeton, NJ: Princeton University Press, 1999).

2 Todd R. LaPorte and Paula M. Consolini, “Working in Practice But Not in Theory: Theoretical Challenges of ‘High-Reliability Organizations,’” Journal of Public Administration Research and Theory, Oxford Journals, http://jpart.oxfordjournals.org/content/1/1/19.extract.

3 Karl E. Weick and Kathleen M. Sutcliffe, “Managing the Unexpected,” http://www.hetzwartegat.info/assets/files/Managing%20the%20Unexpected.pdf.
