5. Don’t Duplicate Your Work

No one wants to do the same thing over and over again, unless perhaps you are a professional musician or athlete. For most of us, repetitious work is mind-numbing, boring, and tedious. The people who read this book are probably not folks who find personal fulfillment using a torque wrench to drive in the same five screws over and over again every day. Yet, surprisingly, many of us perform seemingly small tasks over and over again to the detriment of the scalability of our platforms. The three rules we discuss in this chapter are the three most common drivers of duplicated work and value-defeating requirements that we see in our consulting practice day in and day out. Some of them might strike you as obvious or even odd. We entreat you to dig within your organizations and engineering efforts, as we suspect that you might find some of these value killers lurking about in the shadows.

Rule 17—Don’t Check Your Work

Carpenters and woodworkers have an expression: “Measure twice and cut once.” You might have learned such a phrase from a high school wood shop teacher—one who might have been missing a finger. Missing digits aside, the logic behind such a statement is sound and born of experience through practice. It’s much better to validate a measurement before making a cut, as a bad measurement increases production waste by producing a useless board of the wrong size. We won’t argue with such a plan. Instead, we aim to eliminate waste of a different kind: the writing and subsequent immediate validation of just-written data.

We’ve been surprised over the last several years at how often we find ourselves asking our clients, “What do you mean you are reading and validating something that you just wrote?” Sometimes clients have a well-thought-out reason for their actions, though we have yet to see one with which we agree. More often than not, the client cops a look that reminds us of a child who just got caught doing something he or she knew should not be done. The claims of those with well-thought-out (albeit, in our opinion, value-destroying) answers are that their applications require an absolute guarantee that the data not only be written but also be written correctly. Keep in mind that most of our clients have SaaS or commerce platforms—they aren’t running nuclear power facilities, sending people into space, controlling thousands of passenger-laden planes in flight, or curing cancer. Fear of failed writes and calculations has long driven extra effort on the part of many a developer. This fear, perhaps justified in the dark ages of computing, was at least partially responsible for the fault-tolerant computer designs developed by Tandem and Stratus in the late 1970s and early 1980s, respectively. The primary driver of these systems was to increase mean time to failure (MTTF) through “redundant everything,” including CPUs, storage, memory, memory paths, storage paths, and so on. Some models of these computers necessarily compared the results of computations and storage operations along parallel paths to validate that the systems were working properly. One of the authors of this book developed applications for an aging Stratus minicomputer, and in the two years he worked with it, the system never identified a failure in computation between the two processors or a failed write to memory or disk.

Today those fears are far less well founded than they were in the late 1970s through the late 1980s. In fact, when we ask clients who first write something and then attempt to immediately read it how often they find failures, the answer is fairly consistent: “Never.” And the chances are that unless they fail to act upon an error returned from a write operation, they will never experience such an event. Sure, corruption happens from time to time, but in most cases that corruption is identified during the actual write operation. Rather than doubling your activities, and thereby halving the number of transactions you can perform on your storage, databases, and systems, simply look at the error codes returned from your operations and react accordingly. As a side note, the most appropriate protection against corruption is to properly implement high availability and keep multiple copies of data around, such as a standby database or replicated storage (see Chapter 9, “Design for Fault Tolerance and Graceful Failure”). Ideally you will ultimately implement multiple live sites (see Chapter 3, “Design to Scale Out Horizontally,” Rule 12).
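To make this concrete, here is a minimal sketch of our own (not a prescribed implementation) in PHP using PDO that reacts to the error a write operation already reports instead of re-reading the row. The orders table, connection details, and input variables are hypothetical:

<?php
// A minimal sketch: trust the error reporting of the write itself rather than
// re-reading the row. Table, credentials, and variables are placeholders.
$pdo = new PDO('mysql:host=db-master;dbname=store', 'app_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

try {
    $stmt = $pdo->prepare(
        'INSERT INTO orders (user_id, sku, quantity) VALUES (?, ?, ?)');
    $stmt->execute([$userId, $sku, $quantity]);
    // The driver reported success; no follow-up SELECT is needed to "verify" the row.
} catch (PDOException $e) {
    // React to the error the write already returned: log it, retry, or surface a failure.
    error_log('Order insert failed: ' . $e->getMessage());
}
?>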

Of course, not every “write then immediately read” activity is the result of an overzealous engineer attempting to validate what he or she has just written. Sometimes it’s the result of an end user immediately requesting the thing he or she just wrote. The question we ask here is why these clients don’t store frequently used (including recently written) data locally. If you just wrote something and you know you are likely to need it again, just keep it around locally. One common example of such a need is the registration flow for most products. Typically there is a stage at which you want to present to the user the data you are about to commit to the permanent registration “record.” Another is the purchase flow embedded within most shopping cart systems on commerce sites. Regardless of the case, it makes sense to keep around the information you are writing if it is going to be needed in the future. Storing and then immediately fetching is just a wasteful use of system resources. See Chapter 6, “Use Caching Aggressively,” for more information on how and what to cache.
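As a rough illustration of keeping just-written data close at hand, the following sketch caches the registration record at write time so the confirmation page never reads the database for data it just stored. It uses PHP’s Memcached extension; the registrations table, key scheme, and helper names are hypothetical:

<?php
// A sketch only: write once, cache the result, and serve the confirmation page
// from the cache instead of re-reading the database.
$cache = new Memcached();
$cache->addServer('cache1.example.com', 11211);

function saveRegistration(PDO $pdo, Memcached $cache, array $reg) {
    $stmt = $pdo->prepare(
        'INSERT INTO registrations (email, name, plan) VALUES (?, ?, ?)');
    $stmt->execute([$reg['email'], $reg['name'], $reg['plan']]);
    // Keep the just-written data around locally for the next step in the flow.
    $cache->set('registration:' . $reg['email'], $reg, 600);
}

function getRegistration(PDO $pdo, Memcached $cache, $email) {
    $reg = $cache->get('registration:' . $email);
    if ($reg !== false) {
        return $reg;  // served from cache; no second trip to the database
    }
    // Fall back to the database only on a cache miss.
    $stmt = $pdo->prepare(
        'SELECT email, name, plan FROM registrations WHERE email = ?');
    $stmt->execute([$email]);
    return $stmt->fetch(PDO::FETCH_ASSOC);
}
?>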

The point to which all the preceding paragraphs lead is that doubling your activity reduces your ability to scale cost effectively. In fact, it doubles your cost for those transactions. So while you may be engineering a solution to avoid a couple of million dollars in risk associated with failed writes, you may be incurring tens of millions of dollars in extra infrastructure to accomplish it. Rarely, and in our experience never, does this investment in engineering time and infrastructure overcome the risk it mitigates. Reading after writing is bad in most cases because it not only doubles your cost and limits your scalability but also rarely returns risk mitigation commensurate with those costs. There are no doubt cases where it is warranted, though such cases are far fewer in number than many technology teams and businesses claim.

The observant reader may have identified a conflict in our rules. Storing information locally on a system might be indicative of state and certainly requires affinity to the server to be effective. As such, we’ve violated Rule 40. At a high level, we agree, and if forced to make a choice we would always develop a stateless application over ensuring that we don’t have to read what we just wrote. That said, our rules are meant to be nomothetic or “generally true” rather than idiographic or “specifically true.” You should absolutely try not to duplicate your work and absolutely try to maintain a largely stateless application. Are these two statements sometimes in conflict? Yes. Is that conflict resolvable? Absolutely!

The way we resolve such a conflict in rules is to take the 30,000-foot view. We want a system that does not waste resources (like reading what we just wrote) while remaining largely stateless for the reasons we discuss in Chapter 10, “Avoid or Distribute State.” To do this, we decide never to read for the sake of validation. We also agree that there are times when we might desire affinity to a server for speed and scale rather than going back to the datastore to read what we just wrote. This means maintaining some notion of state, but we limit it to transactions where it is necessary for us to read something that we just wrote. While this approach violates our state rules, it makes complete sense: we are introducing state in a limited set of operations where it actually decreases cost and increases scalability, rather than doing the opposite as it so often does.

As with any rule, there are likely exceptions. What if you exist in a regulatory environment that requires that absolutely 100% of all writes of a particular piece of data be verified to exist, be encrypted, and be backed up? We’re not certain such an environment exists, but if it did, there would almost always be ways to meet such requirements without blocking for an immediate read of data that was just written. Here is a bulleted checklist of questions you can answer and steps you can take to eliminate reading what you just wrote and blocking the user transaction to do so:

Regulatory/legal requirement— Is this activity a regulatory or legal requirement? If it is, are you certain that you have read it properly? Rarely does a requirement spell out that you need to do something “in line” with a user transaction. And even if it does, the requirement rarely (probably never) applies to absolutely everything that you do.

Competitive differentiation— Does this activity provide competitive differentiation? Careful—“Yes” is an all-too-common and often incorrect answer to this question. Given the small rate of failures you would expect, it is hard to believe that you will win by correctly handling the .001% of failures that your competitors will have by not checking twice.

Asynchronous completion— If you have to read after writing for the purposes of validation due to either a regulatory requirement (doubtful but possible) or competitive differentiation (beyond doubtful—see above), then consider doing it asynchronously. Write locally and do not block the transaction. Handle any failure to process by re-creating the data from logs, reapplying it from a processing queue, or, worst case, asking the user for it again in the very small percentage of cases where you lose it. If the failure is in copying the data to a remote backup for high availability, simply reapply that record or transaction. Never block the user under any scenario pending a synchronous write to two data sources; the sketch following this list shows one way to decouple the two.
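The following sketch shows one way to implement that decoupling in PHP: the user’s transaction performs a single local write and enqueues the record for out-of-band verification or replication, so the user is never blocked on a second synchronous write. The payments and verification_queue tables are hypothetical:

<?php
// A sketch, not a prescription: write once, enqueue for later verification,
// and return to the user immediately.
function recordPayment(PDO $pdo, $userId, $amount) {
    $pdo->beginTransaction();
    $stmt = $pdo->prepare('INSERT INTO payments (user_id, amount) VALUES (?, ?)');
    $stmt->execute([$userId, $amount]);
    $paymentId = $pdo->lastInsertId();

    // Queue the record for asynchronous verification or copy to a remote store.
    $queue = $pdo->prepare('INSERT INTO verification_queue (payment_id) VALUES (?)');
    $queue->execute([$paymentId]);
    $pdo->commit();

    // A background worker drains verification_queue; any failure is repaired by
    // reapplying the record from the queue or from logs, never by blocking the user.
    return $paymentId;
}
?>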

Rule 18—Stop Redirecting Traffic

There are many reasons that you might want to redirect traffic. A few of these include tracking clicks on content or an advertisement, handling misspelled domains (for example, afkpartners.com instead of akfpartners.com), aliasing or shortening URLs (for example, akfpartners.com/news instead of akfpartners.com/news/index.php), or changing domains (for example, moving the site from akf-consulting.com to akfpartners.com). There is even a design pattern called Post/Redirect/Get (PRG) that is used to avoid duplicated form submissions; essentially, this pattern calls for the post operation on a form submission to redirect the browser, preferably with an HTTP 303 response. All these and more are valid reasons for redirecting users from one place to another. However, like any good tool, redirection can be used improperly (think of using a screwdriver as a hammer) or too frequently (like splitting a cord of wood without ever sharpening your axe). Either problem ends up with less than desirable results. Let’s first talk a little more about redirection according to the HTTP standard.

According to RFC 2616, the Hypertext Transfer Protocol specification,1 there are several redirect codes, including the more familiar 301 Moved Permanently and the 302 Found for temporary redirection. These codes fall under the Redirection 3xx heading and refer to a class of status codes that require further action by the user agent to fulfill the request. The complete list of 3xx codes can be found in the RFC.

So, we’ve agreed that there are many valid reasons for using redirects and the HTTP standard even has multiple status codes that allow for various types of redirects. What then is the problem with redirects? The problem is that they can be performed in numerous ways, some better than others in terms of resource utilization and performance, and they can easily get out of hand. Let’s examine a few of the most popular methods of redirecting users from one URI to another and discuss the pros and cons of each.

The simplest way to redirect a user from one page or domain to another is to construct an HTML page that requests they click on a link to proceed to the real resources they are attempting to retrieve. The page might look something like this:

<html><head></head><body>
<p>Please click
<a href="http://www.akfpartners.com/techblog">here for your requested page</a></p>
</body></html>

The biggest problem with this method is that it requires the user to click again to retrieve the real page he or she was after. A slightly better way to redirect with HTML is to use the meta tag “refresh” to automatically send the user’s browser to the new page. The HTML code for that would look like this:

<html><head>
<meta http-equiv="Refresh"
      content="0; url=http://www.akfpartners.com/techblog" />
</head><body>
<p>In case your page doesn't automatically refresh, click
<a href="http://www.akfpartners.com/techblog">here for your
requested page</a></p>
</body></html>

With this we solved the user interaction problem, but we’re still wasting resources: our Web server must receive a request and respond with a page that the browser has to parse before the redirection occurs. A more sophisticated method of handling redirects is through code. Almost all languages allow for redirects; in PHP the code might look like this:

<?php
header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: http://www.akfpartners.com/techblog" );
exit;  // stop the script here so nothing else is sent after the redirect
?>

This code has the benefit of not requiring the browser to parse HTML; the redirect is communicated through an HTTP status code and the Location header field. In HTTP, header fields contain the operating parameters of a request or response by defining various characteristics of the data transfer. The preceding PHP code results in the following response:

HTTP/1.1 301 Moved Permanently
Date: Mon, 11 Oct 2010 19:39:39 GMT
Server: Apache/2.2.9 (Fedora)
X-Powered-By: PHP/5.2.6
Location: http://www.akfpartners.com/techblog
Cache-Control: max-age=3600
Expires: Mon, 11 Oct 2010 20:39:39 GMT
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=UTF-8

We’ve now improved our redirection by using HTTP status codes in the header fields, but we’re still requiring our server to interpret the PHP script. Instead of redirecting in code, which requires either interpretation or execution, we can ask the server to redirect for us with one of its own embedded modules. In the Apache Web server two primary modules are used for redirecting: mod_alias and mod_rewrite. The mod_alias module is the easier of the two to understand and implement but is not terribly sophisticated in what it can accomplish. It provides the Alias, AliasMatch, Redirect, and RedirectMatch directives. Following is an example of a mod_alias entry:

# Serve requests for /image from the local directory /www/html/image
Alias /image /www/html/image
# Send requests for /service to another host (a 302 redirect by default)
Redirect /service http://foo2.akfpartners.com/service

The mod_rewrite module is far more sophisticated than mod_alias. According to Apache’s own documentation this module is a “killer one”2 because it provides a powerful way to manipulate URLs, but the price you pay is increased complexity. An example rewrite entry that permanently redirects (301 status code) all requests for artofscale.com or www.artofscale.com URLs to theartofscalability.com follows:

RewriteEngine on
# Match either the bare or the www form of the old domain
RewriteCond %{HTTP_HOST} ^artofscale\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.artofscale\.com$
# Preserve the requested path and redirect permanently (301)
RewriteRule ^/?(.*)$ "http://theartofscalability.com/$1" [R=301,L]

To add to the complexity, Apache allows the directives for these modules to be placed in either .htaccess files or the httpd.conf main configuration file. However, using .htaccess files should be avoided in favor of the main configuration file, primarily because of performance.3 When configured to allow the use of .htaccess files, Apache looks in every directory for .htaccess files, thus causing a performance hit whether you use them or not! Also, a .htaccess file is read every time a document is requested, instead of once at startup like the httpd.conf main configuration file.

We’ve now seen some pros and cons of redirecting through different methods, which hopefully will guide us in how to use redirection as a tool. The last topic to cover is making sure you’re using the right tool in the first place. Ideally we want to avoid redirection completely. Among the reasons to avoid redirection when possible: it always delays the user from getting the resource she wants, it consumes computational resources, and there are many ways to get redirection wrong, hurting user browsing or search engine rankings.

A few examples of the ways redirects can go wrong come directly from Google’s page on why URLs are not followed by its search engine bots.4 These include redirect errors, redirect loops, URLs that are too long, and empty redirects. You might think that creating a redirect loop would be difficult, but it is much easier than you think, and while most browsers and bots stop when they detect the loop, servicing those requests still consumes a great deal of resources.

As we mentioned at the beginning of this rule, there are certainly times when redirection is necessary, but with a little thought there are ways around many of these cases. Take click tracking, for example. There are certainly all types of business needs to keep track of clicks, but there might be a better way than sending the user to a server that records the click in an access or application log and then forwards the user to the desired site. One alternative is to use the onClick event handler in the browser to call a JavaScript function. This function can request a 1x1 pixel from a PHP or other script that records the click. The beauty of this solution is that the user’s browser doesn’t have to request a page and receive back a page, or even a header, before it can start loading the desired page.
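As a sketch of what the server side of such a tracking pixel might look like, consider the following PHP script (the script name, log path, and query parameter are hypothetical). The browser’s onClick handler simply sets the source of a new Image object to this script, while the click itself proceeds straight to the destination, so no redirect ever sits in the user’s path:

<?php
// track.php -- a hypothetical click-tracking "pixel." Requested by the browser's
// onClick handler; the user's navigation never waits on it.
$entry = sprintf("%s\t%s\t%s\n",
    date('c'),
    isset($_GET['ad_id']) ? $_GET['ad_id'] : 'unknown',
    isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-');
file_put_contents('/var/log/app/clicks.log', $entry, FILE_APPEND);

// Respond with a standard transparent 1x1 GIF so the image request completes cleanly.
header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
?>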

When it comes to redirects, make sure you first think through ways that you can avoid them. Using the right tool for the job as discussed in Chapter 4, “Use the Right Tools,” is important, and redirects are specialized tools. Once those options fail, consider how best to use the redirect tool. We covered several methods and discussed their pros and cons. The specifics of your application will dictate the best alternative.

Rule 19—Relax Temporal Constraints

In the domains of mathematics and machine learning (artificial intelligence) there is a class of problems, known as constraint satisfaction problems (CSPs), in which the state of a set of objects must satisfy certain constraints. CSPs are often highly complex, requiring a combination of heuristics and combinatorial search methods to be solved.5 Two classic puzzles that can be modeled as CSPs are Sudoku and the map coloring problem. The goal of Sudoku is to fill each nine-square row, each nine-square column, and each nine-square box with the numbers 1 through 9, with each number used once and only once in each section. The goal of a map coloring problem is to color a map so that regions sharing a common border have different colors. Solving this involves representing the map as a graph where each region is a vertex and an edge connects two vertices if the corresponding regions share a border.

A more specific variety of the CSP is a Temporal Constraint Satisfaction Problem (TCSP), which is a representation where variables denote events, and constraints represent the possible temporal relations between them. The goals are ensuring consistency among the variables and determining scenarios that satisfy all constraints. Enforcing what is known as local consistency on the variables ensures that the constraints are satisfied for all nodes, arcs, and paths within the problem. While many problems within machine learning and computer science can be modeled as TCSPs, including machine vision, scheduling, and floor plan design, use cases within SaaS systems can also be thought of as TCSPs.

An example of a temporal constraint within a typical SaaS application would be purchasing an item in stock. There are time lapses between a user viewing an item, putting it in his shopping cart, and purchasing it. One could argue that for the absolute best user experience, the state of the object, whether or not it is available, would ideally remain consistent throughout this process. To do so would require that the application mark the item as “taken” in the database until the user browses off the page, abandons the cart, or makes the purchase.

This is pretty straightforward until we get a lot of users on our site. It’s not uncommon for users to view 100 or more items before they add anything to their cart. One of our clients claims that users look at more than 500 search results before adding a single item to their cart. In this case our application probably needs several read replicas of the database to allow many more people to search and view items than purchase them. Herein lies the problem: most RDBMSs aren’t good at keeping all the data completely consistent between nodes. Even though read replicas or slave databases can be kept within seconds of each other in terms of data consistency, there will certainly be edge cases in which two users want to view the last available inventory of a particular item. We’ll come back and solve this problem, but first let’s talk about why databases make this difficult.

In Chapter 2, “Distribute Your Work,” and Chapter 4, “Use the Right Tools,” we spoke about the ACID properties of RDBMSs (refer to Table 2.1). The one property that makes scaling an RDBMS in a distributed manner difficult is consistency. The CAP Theorem, also known as Brewer’s Theorem after computer scientist Eric Brewer, states that three core requirements exist when designing applications in a distributed environment but that it is impossible to satisfy all three simultaneously. These requirements are expressed in the acronym CAP:

• Consistency—The client perceives that a set of operations has occurred all at once.

• Availability—Every operation must terminate in an intended response.

• Partition tolerance—Operations will complete, even if individual components are unavailable.

The approach that has been derived to deal with this tradeoff is called BASE, an acronym for Basically Available, Soft state, Eventually consistent, which describes architectures that relax the ACID property of consistency in exchange for greater flexibility in how we scale. A BASE architecture allows the databases to become consistent eventually. This might take minutes or even just seconds, but as we saw in the previous example, even milliseconds of inconsistency can cause problems if our application expects to be able to “lock” the data.

The way we would redesign our system to accommodate this eventual consistency would be to relax the temporal constraint. Merely viewing an item would not guarantee to the user that it was available. The application would “lock” the data only when the item was placed into a shopping cart, and this would be done on the primary write copy or master database. Because we have ACID properties, we can guarantee that if our transaction completes and we mark the record of the item as “locked,” that user can continue through the purchase confident that the item is reserved. Other users viewing the item may or may not have it available for them to purchase.
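A minimal sketch of that add-to-cart “lock” follows, using a conditional UPDATE against the master database. The items table, its columns, and the surrounding helper calls are hypothetical:

<?php
// Reserve one unit only when the item enters a cart, and only on the master.
// Viewers on read replicas were never promised the item, so no lock is needed
// while they merely browse.
function reserveItem(PDO $masterDb, $itemId) {
    $stmt = $masterDb->prepare(
        'UPDATE items SET available = available - 1
          WHERE item_id = ? AND available > 0');
    $stmt->execute([$itemId]);
    // rowCount() tells us whether this transaction won the last unit.
    return $stmt->rowCount() === 1;
}

// Usage: only the add-to-cart path touches the master.
// if (reserveItem($masterDb, $itemId)) { addToCart($cart, $itemId); }
// else { showOutOfStock(); }
?>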

Another area in which temporal constraints are commonly found in applications is the transfer of items (such as money) or communications between users. Guaranteeing that user A gets the money, message, or item in her account as soon as user B sends it is easy on a single database. Spreading the data among several copies makes this consistency much more difficult. The way to solve this is to not expect or require the temporal constraint of instant transfer. More than likely it is totally acceptable for user A to wait a few seconds before she sees the money that user B sent, simply because most pairs of users don’t transfer items synchronously in a system. Obviously, synchronous communication such as chat is different.
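One way to relax that constraint is sketched below with a hypothetical pending_transfers table: the sender’s request merely records the intent and returns, and a background worker applies the credit to the recipient’s account moments later:

<?php
// A sketch only: the sender's request records the transfer as pending and
// returns immediately; a background worker later applies it to the recipient's
// account, so user A may wait a few seconds to see what user B sent.
function requestTransfer(PDO $pdo, $fromUser, $toUser, $amount) {
    $stmt = $pdo->prepare(
        'INSERT INTO pending_transfers (from_user, to_user, amount, status)
         VALUES (?, ?, ?, ?)');
    $stmt->execute([$fromUser, $toUser, $amount, 'pending']);
}
?>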

It is easy to place temporal constraints on your system because at first glance they appear to provide the best customer experience. Before doing so, however, consider the long-term ramifications: how much more difficult will that system be to scale because of the constraint?

Summary

We offered three rules in this chapter that deal with not duplicating your work. Start by not double-checking yourself. You employ expensive databases and hardware to ensure your systems properly record transactions and events; don’t expect them not to work. We all need redirection at times, but excessive use of this tool causes all types of problems, from user experience to search engine indexing. Finally, consider the business requirements that you place on your system. Temporal constraints on items and objects make systems difficult and expensive to scale. Carefully consider the real costs and benefits of these decisions.

Endnotes

1 R. Fielding et al., Network Working Group Request for Comments 2616, “Hypertext Transfer Protocol—HTTP/1.1,” June 1999, http://www.w3.org/Protocols/rfc2616/rfc2616.html.

2 Ralf S. Engelschall, “URL Rewriting Guide,” Apache HTTP Server Version 2.2, December 1997, http://httpd.apache.org/docs/current/misc/rewriteguide.html.

3 Apache HTTP Server Version 1.3, “.htaccess Files,” http://httpd.apache.org/docs/1.3/howto/htaccess.html.

4 Google Webmaster Central, Webmaster Tools Help, “URLs Not Followed Errors,” http://www.google.com/support/webmasters/bin/answer.py?answer=35156.

5 Wikipedia, “Constraint satisfaction problem,” http://en.wikipedia.org/wiki/Constraint_satisfaction_problem.
