APPENDIX B

Dealing with Instantaneous Growth

Sometimes, events occur beyond our control, foresight, and budget. An unexpected incident—technological or otherwise—can wipe out all of our future projections. There are no magic theories or formulas to banish the capacity woes in these situations, but you might be able to lessen the pain.

Besides catastrophes—like a tornado destroying a datacenter—the biggest problem you are likely to face is too much traffic. Ironically, becoming more popular than you can handle could be the worst web operations nightmare that you have ever experienced. You might be fortunate enough to have a popular piece of content that is the target of links from all over the planet, or launch a new killer feature that draws more attention than you ever planned. This can be as exciting as having your name in lights, but you might not feel so fortunate at the time it’s all happening.

From a capacity point of view, you can’t do much instantaneously. If you are being hosted in a public cloud, it’s possible to add capacity relatively quickly depending on how it will be used—but this approach has limits. Adding servers can only solve the “I need more servers” problem. It can’t solve the more difficult architectural problems that can pop up when you least expect them.

It’s not uncommon for edge cases to arise (probably more often than routine capacity issues!) that tax the infrastructure in ways you hadn’t expected. For example, back at Flickr, a user had automated his webcam to take a photo of his backyard, upload it to Flickr, and tag it with the Unix timestamp every minute of every day. This made for interesting database side effects, because the system was never expected to generate that many unique tags for so many photos. Further, there were users who had very few photos but many thousands of tags on each one. Each of these cases shed light on the limits of the database.

Mitigating Failure

The following tips and tricks are for worst-case scenarios, when other options for increasing capacity are exhausted, and substantially changing the infrastructure itself is impossible for the moment. It should be said that this type of firefighting scenario is most of what capacity planning aims to avoid; yet sometimes it’s simply unavoidable. These tips and tricks aren’t meant to be exhaustive—just a few things that can help when the torrent of traffic comes and the servers are dying under load.

Graceful Degradation and Disabling Heavy Features

One contingency is to disable some of the site’s heavier features. Building in the ability to turn certain features on or off gives capacity and operations teams significant room to respond, even in the absence of a massive traffic event. Having a quick, one-line configuration parameter in the application with values of on or off can be of enormous value, particularly when that feature is either the cause of a problem or contributing to unacceptable performance. For example, you can have the web servers perform geographic (country) lookups based on client IP addresses for logged-out users in an effort to deduce their language preferences. It’s an operation that enhances the user experience, but it is yet another function the application must handle.

Back at Flickr, this feature was turned on with the launch of the localized version of the service in seven different languages. It almost immediately placed too much load on the mechanisms that carried out the country lookups. The problem turned out to be an artificial throttle placed on the rate of requests the geo server could handle, which was tuned too conservatively. The issue was isolated and fixed by lifting the throttle to a more acceptable level and then turning the feature (which is mostly transparent to users) back on. Had a quick on/off switch not been implemented for that feature—in other words, if it had been hardcoded within the application—it would have taken more time to troubleshoot, disable, and fix. During that time, the site would have been in a degraded state, or possibly even down.
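
A minimal sketch of what such a switch can look like, written in Python purely for illustration; the flag file, flag names, and fallback behavior are assumptions, not Flickr’s actual implementation:

    import json

    def load_flags(path="feature_flags.json"):
        # Operators edit this small file (or a config system) without a deploy.
        try:
            with open(path) as f:
                return json.load(f)          # e.g. {"geo_lookup": false}
        except FileNotFoundError:
            return {}                        # no file: every feature stays on

    def feature_enabled(name, flags):
        return flags.get(name, True)

    flags = load_flags()
    if feature_enabled("geo_lookup", flags):
        pass   # perform the per-request country lookup as usual
    else:
        pass   # degraded mode: skip the lookup and fall back to a default locale

The important property is that the decision lives in configuration that operators can flip in seconds, not in code that requires a build and deploy.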

Ideally, you should work with the product, development, design, and operations groups to identify an agreed-upon set of features to which you can apply on/off switches. When faced with a choice between having the full site go down and operating it with a reduced set of features, the compromise is sometimes an easy one to make. This is particularly important in domains such as ecommerce, where there is an immediate impact on the bottom line.

Here’s an anecdotal example: a large news organization was serving web pages with the tallied results of a United States presidential election. Its web servers were running close to capacity all day. On the night of the election, traffic overwhelmed the organization. It had no spare servers to add quickly, and the website began crumbling, serving broken images and pages with images but no other content. The decision was quickly made to stop logging. Remember, this was before any large-scale traffic counting services were available, and all traffic metrics for ad serving were taken from the logs written by the origin servers themselves, audited alongside any ad-serving logs. Disabling logging allowed the site to continue and bought some time to assemble more equipment to handle the load. For hours, the site had no concrete way to measure how much traffic it received, on what was certainly its busiest traffic day up until that point. The decision to stop logging all traffic was the correct one. The relief on the disk systems was enough to allow the servers to recover and serve the rest of the traffic spike, which lasted into the early hours of the next day.

Likewise, Netflix at times forwent personalization and served generic recommendations in the interest of better response time. Today, most websites use a wide variety of third-party services for advertising, analytics, social media presence, and so on. In times of stress, you can turn off one or more of these third-party services, provided they are not core to the user experience.

Baked Static Pages and Beyond

Another technique frequently employed by sites that encounter heavy and erratic traffic is to convert a dynamic page to a static HTML page. This can be either terribly difficult or very easy, depending on how dynamic the page is to begin with—increasingly, with today’s websites, a large percentage of client time is spent executing JavaScript code. You can gain some safety by building static versions of only the most popular and least dynamic pages.

Converting pages from dynamic to static is called baking a web page. An example of where this works well is a news page showing recent photos that is updated every two or three hours. Under normal conditions, the obvious design is a dynamic page that reads in the photos of the hour from a database or other content management system. Under duress, you could hardcode the image URLs into the page and change them manually as needed.
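
Here is a rough Python sketch of that idea; the template, file name, and photo query are all invented for the example:

    from string import Template

    PAGE_TEMPLATE = Template("""<html><body>
    <h1>Photos of the hour</h1>
    $photo_list
    </body></html>""")

    def fetch_recent_photo_urls():
        # Stand-in for the database/CMS query the dynamic page would run.
        return ["/img/a.jpg", "/img/b.jpg"]

    def bake(output_path="photos_of_the_hour.html"):
        items = "\n".join('<img src="%s">' % url for url in fetch_recent_photo_urls())
        with open(output_path, "w") as f:
            f.write(PAGE_TEMPLATE.substitute(photo_list=items))
        # Re-run this (rebake) whenever the underlying content changes.

    if __name__ == "__main__":
        bake()

The web server then serves the written file directly; every request that would have touched the database now costs a file read instead.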

Baking a page into static HTML clearly breaks a lot of functionality found in today’s more dynamic websites, but static pages come with some operational advantages:

  • They don’t initiate database lookups.

  • They can be served very fast. Static content can display up to 10 times faster than dynamic pages that must wait for other backend services.

  • They are easy to cache. If you need even more speed, you can cache static pages quite easily via reverse-proxy caching. This, of course, can introduce an entirely new layer of complexity, but if you are already using caching for other parts of a site, it can be implemented easily.

The disadvantages of baking static HTML pages under duress are also worth noting:

  • You need a framework in which to bake and rebake those pages quickly and easily. Ideally, you should have a single command or button on a web page that will replace the original dynamic page with the static HTML replacement, and also reverse the operation. This takes some time and effort to develop.

  • You need to track what is where so that when changes do happen, you can propagate them. The generation of the static content should be synchronized with the content origin (usually a database). If changes are made to the database, those changes must be reflected (rebaked) to all of the static pages that include the content.

Beyond static pages, you can employ many other approaches to minimize the impact on user experience during times of high traffic. For instance, you can substitute low-resolution images for high-resolution ones. In a similar vein, you can substitute HD video for 8K UHD/4K UHD video.
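
As a small illustration, a page could pick the asset variant based on a site-wide degraded-mode flag; the flag and the variant naming convention below are assumptions made up for this sketch:

    DEGRADED_MODE = False   # flipped on by operators during a traffic spike

    def image_url(photo_id):
        # Hypothetical convention: each photo has a full-size and a
        # low-resolution variant already generated at upload time.
        suffix = "_low.jpg" if DEGRADED_MODE else "_full.jpg"
        return "/photos/%s%s" % (photo_id, suffix)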

NOTE

For more information, read High Performance Images: Shrink, Load, and Deliver Images for Speed by C. Bendell et al. (O’Reilly), as well as the article “Image Optimization” by I. Grigorik.

Cache but Serve Stale

Caching is used in many different components within backend infrastructures. Caching frequently requested objects from clients (or other backend server layers) can have a significant positive effect on performance and scalability, but it also requires careful deployment and increases the cost of management. Normally, caching done this way accelerates content delivery, and the freshness of each cached object is controlled and monitored by headers that indicate the age of an object and how long it’s desirable to serve the cached version of it.

As an extension to baking pages, and to take more advantage of caching, you can relax the content’s freshness requirements. This is usually a lot easier than building static pages where there were none before, but it involves its own complexities. In the context of images and videos, the use of Content Delivery Networks (CDNs)—which are typically distributed around the globe—has become the norm. This reduces response time by obviating the need to pull content from the origin server. CDNs also purge (remove and update) content constantly so that fresh content is delivered. At Netflix, the use of a CDN was key to minimizing buffering when a subscriber streamed a video. Likewise, at Twitter, the use of a CDN was key to ensuring fast delivery of photos and videos in tweets.
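
At the application level, relaxed freshness can be as simple as being willing to serve a stale cached copy when the origin is struggling. The following Python sketch is illustrative only; the class name, TTL, and fallback policy are assumptions, not any particular site’s implementation:

    import time

    class StaleTolerantCache:
        """Serve a stale cached copy when the origin cannot be reached."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.entries = {}                # key -> (value, fetched_at)

        def get(self, key, fetch_from_origin):
            entry = self.entries.get(key)
            if entry and (time.time() - entry[1]) < self.ttl:
                return entry[0]              # still fresh
            try:
                value = fetch_from_origin(key)
            except Exception:
                if entry:
                    return entry[0]          # origin struggling: serve stale
                raise                        # nothing cached; surface the error
            self.entries[key] = (value, time.time())
            return value

At the HTTP layer, the equivalent is loosening cache lifetimes in response headers so that proxies and CDNs keep serving content a little longer during a spike.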

Handling Outages

When failure comes knocking at the door (and sadly, it will at some point), there are a number of steps you can take to minimize the pain for users. Good customer service requires strong and effective communication between the operations and customer care groups, so users are promptly informed about site outages and problems with bugs, capacity, or performance. We want to share some of the lessons we’ve learned serving a strong and vocal online community during emergencies and outages. If a kitchen is flooded but a plumber is underneath the sink, you at least have the feeling that someone has recognized the problem and is trying to resolve it. A good plumber will give updates on the cause of the problem and what must be done to fix it.

Web applications are different: you can’t see someone working on a problem, and users can sometimes feel left in the dark. Our experience has been that users are much more forgiving of problems when you keep them in the information loop. To this end, you should set up forums in which users can report bugs and issues, as well as a blog (hosted outside of your own datacenter so that it can’t be affected by outages) where the service provider can post updates on what’s going on if the site or app is down. Nowadays, many services also post status updates on Twitter.

An entire book can be written on the topic of customer care for online communities. Unfortunately, this isn’t that book. But from a web operations perspective, site outages can—and do—happen. How your organization handles them is just as important as how long it takes to get back up and running.
