Preface

Prior to the 2014 FIFA World Cup, one of the common stories being discussed at Twitter was how the service routinely became unavailable during the previous FIFA World Cup. In particular, every time Brazil or Japan scored a goal in their matches, the spike in the tweet volume used to take down the service. The Fail Whale (shown below) had become popular with the availability issues during the early days of Twitter. So, one of the goals for the 2014 FIFA World Cup was to have absolutely zero downtime. Further, another key goal was to ensure high performance of the Twitter mobile app—sharing photos or the like should be blazingly fast. How does one go about achieving that?

[Image: the Twitter Fail Whale]

Akin to the preceding anecdote, with the increasing use of Twitter during mega events such as the Super Bowl, another key emphasis was to ensure high availability in spite of spikes in traffic (tweets, retweets, favorites, and DMs). Conceivably, we can analyze the magnitude of past spikes relative to normal traffic and then come up with a first-cut estimate of the magnitude of the next spike. Having said that, should you deploy capacity to handle such one-time events, particularly given that the capacity would most likely be underutilized for most of the year? How do you handle unplanned events such as the power failure that occurred during Super Bowl XLVII in 2013? Ensuring high availability during such events calls for a systematic approach toward architectural design and capacity planning.
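As a rough illustration of such a first-cut estimate, the following sketch computes the peak-to-baseline ratio of past events and applies the worst ratio, plus some headroom, to the current baseline. The function names, the headroom factor, and all traffic figures are hypothetical and chosen only for illustration; they are not measurements from Twitter or any other service.

```python
from statistics import median


def spike_ratio(event_peak: float, baseline: float) -> float:
    """Ratio of the peak request rate during an event to the normal rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return event_peak / baseline


def estimate_next_peak(past_events, current_baseline, headroom=1.25):
    """First-cut estimate of the next spike.

    past_events: list of (peak_rate, baseline_rate) pairs from earlier events.
    current_baseline: today's normal request rate.
    headroom: safety multiplier (an assumption, tune to your risk tolerance).
    """
    ratios = [spike_ratio(peak, base) for peak, base in past_events]
    worst = max(ratios)
    return {
        "median_ratio": median(ratios),
        "worst_ratio": worst,
        "planned_peak": worst * current_baseline * headroom,
    }


# Hypothetical history: (peak requests/sec, baseline requests/sec)
history = [(12_000, 4_000), (25_000, 5_000), (24_000, 6_000)]
plan = estimate_next_peak(history, current_baseline=8_000)
print(plan)
```

Planning against the worst observed ratio is deliberately conservative; whether that capacity should be owned year-round or acquired temporarily is exactly the question raised above.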

Capacity planning has been around since ancient times, with roots in everything from economics to engineering. In a basic sense, capacity planning is resource management. When resources are finite and come at a cost, you need to do some capacity planning. When a civil engineering firm designs a new highway system, it’s planning for capacity, as is a power company planning to deliver electricity to a metropolitan area. In some ways, their concerns have a lot in common with web operations; many of the basic concepts and concerns can be applied to all three disciplines.

Although systems administration has been around since the 1960s, the branch focused on serving websites is still emerging. A large part of web operations is capacity planning and management. Those are processes, not tasks, and they are composed of many different parts. Although every organization goes about it differently, the basic concepts are the same:

  • Ensure that proper resources (servers, storage, network, etc.) are available to handle expected and unexpected loads

  • Have a clearly defined procurement and approval system in place

  • Be prepared to justify capital expenditures in support of the business

  • Have a deployment and management system in place to manage the resources after they are deployed

Why We Wrote and Revised This Book

One of the common frustrations of engineers in an operations organization and of software developers is not having somewhere to turn for help when figuring out how much capacity is needed to keep the website or mobile app running. Existing books on the topic of computer capacity planning were focused on the mathematical theory of resource planning, rather than the practical implementation of the entire process (refer to Appendix C). Further, in an Agile environment, which is the norm today, capacity planning is a continuous process and should be flexible and adaptive to the situation at hand. Basing capacity planning on static theoretical models would be a recipe for failure.

A lot of the literature addressed only rudimentary models of website use cases and lacked specific information or advice. Instead, it tended to offer mathematical models designed to illustrate the principles of queuing theory, which is the foundation of traditional capacity planning. This approach might be mathematically interesting and elegant, and it can be useful in determining what magnitude of a traffic spike can be “absorbed” by the various services, owing to built-in queues, without affecting the availability of a website or mobile app. But it doesn’t help an operations engineer or a software developer who is told that there is a week to prepare for some unknown amount of additional traffic (perhaps due to the launch of a new feature), or who sees the site dying under the weight of a link from Facebook, the New York Times, Reddit, Digg, and so on.
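To make the idea of a queue “absorbing” a spike concrete, here is a minimal sketch using a simple fluid approximation: the queue starts empty, arrival and service rates are constant during the spike, and the queue has a fixed capacity. The function name and all numbers are hypothetical; real services have variable rates, so treat the result as a back-of-the-envelope figure only.

```python
def max_absorbable_spike_seconds(queue_capacity: float,
                                 service_rate: float,
                                 arrival_rate: float) -> float:
    """How long a spike can last before a bounded queue overflows.

    Assumes an initially empty queue and constant rates (requests/sec).
    While arrivals exceed the service rate, the backlog grows at
    (arrival_rate - service_rate) requests per second.
    """
    if arrival_rate <= service_rate:
        # The service keeps up, so the queue never fills.
        return float("inf")
    return queue_capacity / (arrival_rate - service_rate)


# Hypothetical service: queue holds 10,000 requests, drains 500 requests/sec.
print(max_absorbable_spike_seconds(10_000, 500, 700))  # spike at 700 requests/sec
print(max_absorbable_spike_seconds(10_000, 500, 400))  # below service rate
```

Under these assumptions, a spike that outlasts the computed duration begins to drop or reject requests, which is the point at which availability suffers.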

We’ve found most books on web capacity planning were written with the implied assumption that concepts and processes found in nonweb environments such as manufacturing or industrial engineering applied uniformly to website environments, as well. Even though some of the theory surrounding such planning might indeed be similar, the practical application of those concepts doesn’t map very well to the short timelines of website development. In most web development settings, it’s been our observation that change happens too fast and too often to allow for the detailed and rigorous capacity investigations common to other fields. By the time an operations engineer or a software developer comes up with the queuing model for her system, new code is deployed and the usage characteristics have likely already changed dramatically. In a 2016 Association for Computing Machinery (ACM) article titled “Why Google Stores Billions of Lines of Code in a Single Repository,” authors R. Potvin and J. Levenberg (both of Google) mentioned the following:

On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems.

Alternatively, if some other technological, social, or real-world event occurs, it can potentially make the modeling and simulations irrelevant.

What we’ve found to be far more helpful is talking to colleagues in the industry, people who encounter many of the same scaling and capacity issues. Over time, we’ve had contact with many different companies, each employing diverse architectures, and each experiencing different problems. But quite often they shared very similar approaches to solutions. Our hope is that we can illustrate some of these approaches in this book.

The computing landscape has undergone a sea change since the writing of the first edition of this book. Cloud computing was nascent back in 2009; currently, public clouds such as AWS and Azure have grown to businesses of more than $10 billion each. Consequently, not much attention was paid to topics such as autoscaling in the first edition. In a similar vein, public clouds today offer a much larger variety of instance types, including instances based on graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). Although this variety helps drive higher operational efficiency, the task of “optimal” selection of an instance type for a given service has become more challenging. The growth of public clouds to new geographic regions, though good from a disaster-recovery perspective, has a direct impact on capacity planning because you need to account for tasks related to, for example, replication and load balancing.

Back in 2009, Service-Oriented Architecture (SOA) had minimal adoption, and microservices were not on the horizon. As of this writing, serverless is the new kid in town. The interplay between the different microservices and third-party services has a direct impact on the capacity planning process. Besides this, a lot of work has been done in the context of tooling. Given all of this, it was about time to write the second edition of this book.

Last but not least, with new websites and mobile apps springing up every day, there’s always something new in the world of web operations. Consequently, the field has been thriving and conferences such as O’Reilly Velocity and Fluent serve as great forums for folks in the industry to share their insights with the community. In addition, blogs serve as a great resource to learn from the experience of others. Several papers have been written on a multitude of topics related to systems, tools, optimization and methodologies for benchmarking, and so on. To this end, among others, two key additions in the second edition are the “Readings” and “Resources” sections at the end of each chapter. These sections provide a rich source of information if you want to dive deeper into a particular subject.

Focus and Topics

This book is not about building complex models and simulations, nor is it about spending time running benchmarks over and over. It’s not about mathematical concepts such as Little’s Law,1 Markov chains, or Poisson arrival rates.

What this book is about is practical capacity planning and management that can take place in the real world. It’s about using real tools and being able to adapt to changing usage on a website that will (hopefully) grow over time. When you have a flat tire on the highway, you could spend a lot of time trying to figure out the cause, or you can get on with the obvious task of installing the spare and getting back on the road. This is the approach we are presenting to capacity planning: adaptive, not theoretical. Keep in mind a good deal of the information in this book will seem a lot like common sense—this is a good thing. Quite often the simplest approaches to problem solving are the best ones, and capacity planning is no exception. This book covers the process of capacity planning for growing websites, including measurement, procurement, and deployment. We’ll discuss some of the more popular and proven measurement tools and techniques. And toward that end, we have kept the discussion platform-agnostic.

Of course, it’s beyond the scope of this book to cover the details of every database, web server, caching server, and storage solution. Instead, we’ll use examples of each to illustrate the process and concepts. The intention is to be as generic as possible when it comes to explaining resource management; it’s the process itself we want to emphasize. For example, a database is used to store data and provide responses to queries. Most of the more popular databases allow for replicating data to other servers, which improves redundancy and performance and shapes architectural decisions. This book does not cover the technical implementation of replication with Postgres, Oracle, or MySQL (a topic for other books). Instead, it covers what replication means in terms of planning capacity and deployment.

Essentially, this book is about measuring, planning, and managing growth for a web application, regardless of the underlying technologies one chooses.

Audience for This Book

This book is for systems, storage, database, and network administrators; software developers; engineering managers; and, of course, capacity planners.

It’s intended for people who hope (or perhaps fear) their website or mobile app will grow like those of Facebook, Instagram, Snap, WhatsApp, YouTube, Twitter, and others: companies that underwent the trial-by-fire process of scaling up as their usage skyrocketed. The approaches in this text come from real experience with sites for which traffic has grown both heavily and rapidly. If you expect the popularity of your site or app to dramatically increase the amount of traffic it receives, please read this book.

Organization of the Material

Chapter 1, Goals, Issues, and Processes in Capacity Planning, presents the issues that arise over and over on heavily trafficked websites.

Chapter 2, Setting Goals for Capacity, illustrates the various concerns involved with planning for the growth of a web app and how capacity fits into the overall picture of availability and performance.

Chapter 3, Measurement: Units of Capacity, discusses capacity measurement and monitoring.

Chapter 4, Predicting Trends, explains how to turn measurement data into robust (i.e., not susceptible to anomalies) forecasts and how trending fits into the overall planning process.

Chapter 5, Deployment, discusses concepts related to deployment: automation of installation, configuration, and management.

Chapter 6, Autoscaling, discusses concepts related to autoscaling in the cloud.

Appendix A, Virtualization and Cloud Computing, discusses where virtualization and cloud services fit into a capacity plan.

Appendix B, Dealing with Instantaneous Growth, offers insight into what you can do in capacity crisis situations, and some best practices for dealing with site outages.

Appendix C, Capacity Tools, is an annotated list of measurement, installation, configuration, and management tools highlighted throughout the book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, filenames, Unix utilities, and command-line options.

Constant width

Indicates the contents of files, the output from commands, and generally anything found in programs.

Constant width bold

Shows commands or other text that should be typed literally by the user, and parts of code or files highlighted to stand out for discussion.

Constant width italic

Shows text that should be replaced with user-supplied values.

O’Reilly Safari

NOTE

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into a product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “The Art of Capacity Planning, Second Edition by Arun Kejariwal and John Allspaw. Copyright Arun Kejariwal, John Allspaw, 978-0-596-51857-8.”

If you feel your use of code examples falls outside of fair use or the permission given above, feel free to contact us at [email protected].

We’d Like to Hear from You

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.

  • 1005 Gravenstein Highway North

  • Sebastopol, CA 95472

  • 800-998-9938 (in the United States or Canada)

  • 707-829-0515 (international or local)

  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at: http://bit.ly/the-art-of-capacity-planning-2e

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

Acknowledgments

We would like to thank Virginia Wilson for her help with the editing and for coordinating the technical review process. Also, we would like to thank Brian Anderson for jump-starting the writing of this second edition of the book. Thanks to Bryce Yan, Charles Border, and Coburn Watson for the technical review.

Most importantly, Arun would like to thank his wife, Pallavi Pharkya, for her understanding of his absence during the writing of this book. Likewise, John would like to thank his wife, Elizabeth Kairys, for her encouragement and support in this insane endeavor.

1 Little, J. D. C. (1961). A Proof for the Queuing Formula: L = λW.
