8. Database Rules

In Chapter 4, “Use the Right Tools,” we discussed Maslow’s Hammer (aka the Law of the Instrument), which put simply is an overreliance on a familiar tool. We noted that one of the most commonly overused tools is the relational database. Recall that relational databases typically give us certain benefits, outlined by the acronym ACID and described in Table 8.1.

Table 8.1. ACID Properties of Databases

Atomicity: All of the operations in a transaction complete, or none of them do; there are no partial transactions.
Consistency: Each transaction moves the database from one valid state to another, preserving all defined rules and constraints.
Isolation: Concurrent transactions do not see each other's intermediate results; they behave as if they were executed serially.
Durability: Once a transaction is committed, its changes survive subsequent failures.

ACID properties are really powerful when we need to split up data into different entities, each having some number of relationships with other entities within the database. They are even more powerful when we want to process a large number of transactions through those entities and relationships: transactions consisting of reads of the data, updates to the data, the addition of new data (inserts or creates), and the removal of certain data (deletes). While we should always strive to find more lightweight and faster ways to perform transactions, sometimes there simply isn’t an easy way around using a relational database, and sometimes the relational database is the best option for our implementation given the flexibility it affords. Whereas Rule 14 argued against using databases where they are not necessary, this chapter, used in conjunction with the rules of Chapter 2 (“Distribute Your Work”), helps us make the most of databases without causing major scalability problems within our architecture.

Rule 31—Be Aware of Costly Relationships

In our personal lives, unless we’re masochistic, we all strive to establish and build relationships that are balanced. Ideally we put into a relationship roughly the same that we get out. When a personal relationship becomes skewed in one person’s favor the other person may become unhappy, reevaluate the relationship, and potentially end it. Although this book isn’t about personal relationships, the same cost = benefit balance that exists in our personal relationships is applicable to our database relationships.

Database relationships are determined by the data model, which captures the cardinality and referential integrity rules of the data. To understand how this occurs and why it is important we need to understand the basic steps involved in building a data model that results in the data definition language (DDL) statements that are used to actually create the physical structure to contain the data, that is, tables and columns. While there are all types of variations on this process, for a relational model the first step generally is to define the entities.

An entity is anything that can exist independently, such as a physical object, event, or concept. Entities can have relationships with each other, and both the entity and the relationship can have attributes describing them. Using the common grammar analogy, entities are nouns, relationships are verbs, and attributes are adjectives or adverbs, depending on what they modify.

Entities are single instances of something, such as a customer’s purchase order, which can have attributes such as an order ID and total value. Grouping entities of the same type together produces an entity set. In our database the entity is the equivalent of the row, and the entity set is the table. The unique attribute that identifies the entity is the primary key of the table. Primary keys enforce entity integrity by uniquely identifying entity instances. The attributes that describe the relationships between entities are the foreign keys. Foreign keys enforce referential integrity by completing an association between two entities of different entity sets. Entity relationship diagrams (ERDs) are the most common way to diagram entities, relationships, and attributes; they also show the cardinality between entity sets: one-to-one, one-to-many, or many-to-many relationships.
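
As a concrete illustration, the DDL below sketches two entity sets with a one-to-many relationship. The table and column names (customer, purchase_order, and so on) are hypothetical, and the syntax shown is typical of MySQL; it is a minimal sketch, not a recommended schema.

CREATE TABLE customer (
  customer_id INT NOT NULL AUTO_INCREMENT,
  name        VARCHAR(100) NOT NULL,
  PRIMARY KEY (customer_id)                  -- entity integrity: each customer instance is unique
);

CREATE TABLE purchase_order (
  order_id    INT NOT NULL AUTO_INCREMENT,
  customer_id INT NOT NULL,                  -- attribute describing the relationship to customer
  total_value DECIMAL(10,2),
  PRIMARY KEY (order_id),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)  -- referential integrity
);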

Once the entities, relationships, and attributes are defined and mapped, the last step in the design of the data model is to consider normalization. The primary purpose of normalizing a data model is to ensure the data is stored in a manner that allows inserts, updates, selects, and deletes (aka CRUD: Create, Read, Update, Delete) to be performed with data integrity. Non-normalized data models have a high degree of data redundancy, which means the risk of data integrity problems is greater. Normal forms build on each other, meaning that for a database to satisfy the requirements of second normal form it must first satisfy first normal form. The most common normal forms are described in the sidebar. If a database adheres to at least the third normal form, it is considered normalized.
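
As a brief sketch of what normalization looks like in practice, consider a hypothetical order table that repeats the customer's name and city on every row; moving those repeating attributes into their own table removes the redundancy. The table names here are illustrative only.

-- Non-normalized: customer attributes repeated on every order row
CREATE TABLE order_flat (
  order_id      INT PRIMARY KEY,
  customer_name VARCHAR(100),
  customer_city VARCHAR(100),
  total_value   DECIMAL(10,2)
);

-- Normalized: customer attributes stored once and referenced by key
CREATE TABLE customer_profile (
  customer_id   INT PRIMARY KEY,
  customer_name VARCHAR(100),
  customer_city VARCHAR(100)
);

CREATE TABLE customer_order (
  order_id    INT PRIMARY KEY,
  customer_id INT,
  total_value DECIMAL(10,2),
  FOREIGN KEY (customer_id) REFERENCES customer_profile (customer_id)
);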

As you have probably figured out by now, the relationships between entities dramatically affect how efficiently the data is stored, retrieved, and updated. These relationships also play a large role in scalability, because they define how we are able to split or shard our database. If we are attempting to perform a Y axis split of our database by pulling out the order confirmation service, this might prove problematic if the order entity is extensively related to other entities. Trying to untangle this web of relationships is difficult after the fact. Time spent up front in the design phase is well worth it when it saves you 10x or 100x the effort later, when you need to split your databases.

One last aspect of data relationships that is important to scalability is how we join tables in our queries. This, of course, is dictated partly by the data model and partly by the developers who are creating reports, new pages in our applications, and so on. We won’t attempt to cover the steps of query optimization in detail here, but suffice it to say that new queries should be reviewed by a competent DBA who is familiar with the data model and should be analyzed for performance characteristics prior to being placed into the production environment.

You have probably noticed that there is a relationship between a desire for increased data integrity through normalization and the degree to which relationships must be used in a database. The higher the normal form, the greater the number of potential relationships as we create tables specific to such things as repeating values. What was once taught as a law years ago in database design (moving up in normal form is good) is now seen as more of a tradeoff in high transaction system design. This tradeoff is similar to the tradeoff between risk and cost, cost and quality, time and cost, and so on; specifically a decrease in one side typically implies an increase in the other. Often to increase scale, we look to reduce normal forms.

When SQL queries perform poorly because of the requirement to join tables, there are several alternatives. The first is to tune the query. If this doesn’t help, another alternative is to create a view, materialized view, summary table, and so on that preprocesses the joins. Another alternative is not to join in the query but rather to pull the data sets into the application and join them in memory in the application. While this is more complex, it moves the processing of the join off the database, which is often the most difficult tier to scale, and puts it in the application server tier, which is easier to scale out with more commodity hardware. A final alternative is to push back on the business requirements. Often our business partners will come up with different solutions when it is explained that the report as requested requires a 10% increase in hardware, while the removal of a single column would make the report trivial in complexity and nearly equivalent in business value.
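
For example, a summary table that preprocesses an expensive join might look like the sketch below, reusing the hypothetical customer and purchase_order tables sketched earlier. The summary would be rebuilt periodically, trading some data freshness for a much cheaper read path.

-- Precompute order counts and totals per customer so report queries avoid the join
CREATE TABLE customer_order_summary AS
  SELECT c.customer_id,
         c.name,
         COUNT(o.order_id)  AS order_count,
         SUM(o.total_value) AS lifetime_value
  FROM customer c
  JOIN purchase_order o ON o.customer_id = c.customer_id
  GROUP BY c.customer_id, c.name;

-- Reports now read the narrow, pre-joined summary table
SELECT name, lifetime_value
FROM customer_order_summary
ORDER BY lifetime_value DESC LIMIT 100;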

Rule 32—Use the Right Type of Database Lock

Locks are a fact of life within a database; they are the way in which databases allow concurrent users while helping to ensure the consistency and isolation components of the ACID properties of a database. But there are many different types of database locks, and even different approaches to implementing them. Table 8.2 offers a brief and high-level overview of different lock types supported in many different open source and third-party proprietary database management systems. Not all of these locks are supported by all databases, and the lock types can be mixed. For instance, a row lock can be either explicit or implicit.

Table 8.2. Lock Types

Explicit: Locks requested directly by the user or developer, for example with a LOCK TABLE statement or a SELECT ... FOR UPDATE.
Implicit: Locks acquired automatically by the database on behalf of a transaction.
Row: Locks a single row of a table.
Page: Locks a page, a fixed-size block of rows within a table.
Extent: Locks a group of contiguous pages.
Table: Locks an entire table.
Database: Locks the entire database, typically for maintenance operations such as schema changes.

If you search a bit, you will find many other types of locks. There are, depending on the type of database, key and index locks that work on the indices that you create over your tables. You may also find a discussion of column locking and ways in which different databases might support that notion. To our knowledge, few if any databases actually support this type of locking, and if it is supported it isn’t used very frequently within the industry.

While locking is absolutely critical to the operations of a database to facilitate both isolation and consistency, it is obviously costly. Typically databases allow reads to occur simultaneously on data, while blocking all reads during the course of a write (an update or insertion) on the element undergoing the operation. Reads then can occur very fast, and many of them can happen at one time, while a write typically happens in isolation. The finer the granularity of the write operation, such as in the case of a single row, the more of these can happen within the database or even within a table at a time. Increasing the size of the object being written or updated, such as updating multiple rows at a time, may require an escalation of the type of lock necessary.

The size or granularity of lock to employ ultimately impacts the throughput of transactions. When updating many rows at a single time within a database, the cost of acquiring multiple row locks and the competition for these rows might result in fewer transactions per second than just acquiring a larger lock of a page, extent, or table. But if too large a lock is grabbed, such as a page when only updating a small number of rows, then transaction throughput will decrease while the lock is held.

Often a component of the database (commonly called an optimizer within many databases) determines what size of element should be locked in an attempt to allow maximum concurrency while ensuring consistency and isolation. In most cases, initially allowing the optimizer to determine what should be locked is your best course of action. This component of the database has more knowledge about what is likely to be updated than you do at the time of the operation. Unfortunately, these systems are bound to make mistakes, and this is where it is critical that we monitor our databases in production and make changes to our DML to make it more efficient as we learn from what happens in our production environments.

Most databases allow performance statistics to be collected that allow us to understand the most common locking conditions and the most common events causing transactions to wait before being processed. By analyzing this information historically, and by monitoring these events aggressively in the production environment, we can identify when the optimizer is incorrectly identifying the type of lock it should use and force the database to use an appropriate type of locking. For instance, if through our analysis we identify that our database is consistently using table locking for a particular table and we believe we would get greater concurrency out of row level locking, we might be able to force this change.
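
In MySQL, for instance, a first look at lock contention can come from the server's built-in counters and engine status output. Counter names and the level of detail vary by version and storage engine, so treat this as a sketch of the approach rather than a recipe.

SHOW GLOBAL STATUS LIKE 'Table_locks%';   -- compares Table_locks_immediate with Table_locks_waited
SHOW ENGINE INNODB STATUS;                -- includes recent lock waits and deadlock information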

Perhaps as important as the analysis of what is causing bottlenecks and what type of locking we should employ is the notion of determining if we can change the entity relationships to reduce contention and increase concurrency. This of course brings us back to the concepts we discussed in Chapter 2. We can, for instance, split our reads across multiple copies of the database and force writes to a single copy as in the X axis of scale (Rule 7). Or we can split up our tables across multiple databases based partially on contention such as in the Y axis of scale (Rule 8). Finally, we may just reduce table size by pulling out certain customer-specific data into multiple tables to allow the contention to be split across these entities such as in the Z axis of scale (Rule 9).

Finally, where we employ databases to do our work we should try to ensure we are choosing the best solution. As we’ve said time and again, we believe that you can scale nearly any product using nearly any set of technologies. That said, most decisions we make will have an impact on either our cost of operating our product or our time to market. There are, for example, some database storage engine solutions that limit the types of locks we can employ within the database, and as a result limit our ability to tune our databases to maximize concurrent transactions. MySQL is an example where the selection of a storage engine such as MyISAM can limit you to table level locks and, as a result, potentially limits your transaction throughput.
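
If, for example, analysis showed that a busy MyISAM table's table-level locks were the bottleneck, moving it to a storage engine with row-level locking is a single statement. The table name below is hypothetical, and the conversion itself locks the table and takes time proportional to its size, so it is not free.

ALTER TABLE order_history ENGINE=InnoDB;  -- InnoDB supports row-level locking; MyISAM does not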

Rule 33—Pass on Using Multiphase Commits

Multiphase commit protocols, which include the popular two-phase commit (2PC) and three-phase commit (3PC), are specialized consensus protocols. The purpose of these protocols is to coordinate processes that participate in a distributed atomic transaction to determine whether to commit or abort (roll back) the transaction.2 Because of these algorithms’ capability to handle systemwide failures of the network or processes, they are often looked to as solutions for distributed data storage or processing.

The basic algorithm of 2PC consists of two phases. The first phase, the voting phase, is where the master storage or coordinator makes a “commit request” to all the cohorts, or other storage devices. All the cohorts process the transaction up to the point of committing and then acknowledge that they can commit, voting “yes.” This begins the second phase, or completion phase, in which the master sends a commit signal to all cohorts, who then commit the data. If any cohort fails during the commit, a rollback is sent to all cohorts and the transaction is abandoned. An example of this protocol is shown in Figure 8.1.

Figure 8.1. 2PC example


So far this protocol probably sounds pretty good, since it provides atomicity of transactions within a distributed database environment. Hold off on your judgment just a short while longer. In the example in Figure 8.1, notice that the app server initiated the transaction, step A. Then all the 2PC steps had to start and complete, step B, before the master database could acknowledge back to the app server that the transaction was indeed completed, step C. During that entire time the app server thread was held up waiting for the SQL query to complete and the database to acknowledge the transaction. This example is typical of almost any consumer purchase, registration, or bidding transaction on the Web where you might try to implement 2PC. Unfortunately, locking up the app server for that long can have dire consequences. While you might think that you either have plenty of capacity on your app servers or that you can scale them out pretty cost effectively since they are commodity hardware, the locking also occurs on the database. Because you’re committing all rows of data involved in the transaction (assuming you have row-level locking capabilities; it’s even worse for block-level locking), you are locking up all those rows until everything commits and gives the “all clear.”
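
To make the phases concrete, MySQL exposes two-phase commit primitives through its XA statements. The sketch below shows one cohort's side of the conversation, using a hypothetical transaction identifier and account table; it does not by itself provide a coordinator, and it does nothing to reduce the lock-and-wait costs described above.

XA START 'txn42';
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
XA END 'txn42';

-- Phase 1 (voting): the cohort promises it can commit, holding its locks
XA PREPARE 'txn42';

-- Phase 2 (completion): the coordinator tells every cohort to commit or roll back
XA COMMIT 'txn42';
-- XA ROLLBACK 'txn42';  -- issued instead if any cohort failed to prepare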

We’ve implemented (or rather failed to implement) 2PC on a large scale, and the results were disastrous and entirely due to the lock and wait nature of the approach. Our database could initially handle thousands of reads and writes a second prior to the 2PC implementation. After introducing 2PC for just a fraction of the calls (less than 2%), the site completely locked up before processing a quarter of the total number of transactions it could previously handle. While we could have added more application servers, the database was not able to process more queries because of locks on the data.

While 2PC might seem like a good alternative to actually splitting your database by a Y or Z axis split (Rules 8 and 9), think again. Pull (or separate) database tables apart the smart way instead of trying to extend the life of your monolithic database with a multiphase commit protocol.

Rule 34—Try Not to Use “Select For Update”

When leveraged properly, cursors are powerful database control structures that allow us to traverse and process data within some result set defined by the query (or operation) of the cursor. Cursors are useful when we plan to specify some set of data and “cursor through,” or process, the rows in the data set in an iterative fashion. Items within the data set can be updated, deleted, or modified, or simply read and reviewed for other processing. The real power of the cursor is as an extension of the capability of the programming language, as many procedural and object-oriented programming languages don’t offer built-in capabilities for managing data sets within a relational database. One potentially troublesome approach in very high transaction systems is the FOR UPDATE clause in SELECT cursors. Because we often are not in control of how long the cursor will be active, the resulting lock on records can cause slowdowns or even near-deadlock scenarios in our product.

In many databases, when the cursor with a FOR UPDATE clause is opened, the rows identified within the statement are locked until either a commit or rollback is issued within the session. The COMMIT statement saves changes and a ROLLBACK cancels any changes. With the issuing of either statement, the locks associated with the rows in the database are released. Additionally, after issuing the commit or rollback, you lose your position within the cursor and will not be able to execute any more fetches against it.
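
A minimal example of this locking behavior, using a hypothetical accounts table, looks like the following; every row matched by the SELECT stays locked until the COMMIT or ROLLBACK.

BEGIN;  -- or START TRANSACTION
SELECT balance FROM accounts WHERE account_id = 42 FOR UPDATE;
-- the row for account 42 is now locked; other writers and other
-- FOR UPDATE readers of this row block until this session finishes
UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;
COMMIT;  -- releases the lock (a ROLLBACK would release it as well)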

Pause for a second now and think back to Rule 32 and our discussion of database locks. Can you identify at least two potential problems with the “Select for Update” cursor? The first problem is that the cursor holds locks on rows within the database while you perform your actions. Granted, in many cases this might be useful, and in some smaller number of cases it might either be unavoidable or may be the best approach for the solution. But these locks are going to potentially cause other transactions to block or wait while you perform some number of actions. If these actions are complex or take some time you may stack up a great number of pending transactions. If these other transactions also happen to be cursors expecting to perform a “select for update” we may create a wait queue that simply will not be processed within our users’ acceptable timeframe. In a Web environment, impatient users waiting on slowly responding requests may issue additional requests with the notion that perhaps the subsequent requests will complete more quickly. The result is a disaster; our systems come to a halt as pending requests stack up on the database and ultimately cause our Web servers to fill up their TCP ports and stop responding to users.

The second problem is the mirror image of our first problem and was hinted at previously. Future cursors desiring a lock on one or more rows that are currently locked will wait until the other locks clear. Note that these locks don’t necessarily need to be placed by other cursors; they can be explicit locks from users or implicit locks from the RDBMS. The more locking we have going on within the database, even if some of it is necessary, the more likely we are to have transactions backing up. Long-held locks will engender slower response times for frequently requested data. Some databases, such as Oracle, support the optional keyword NOWAIT, which returns control to the process immediately if the lock cannot be acquired, so that it can perform other work or wait before trying to reacquire the lock. But if the cursor must be processed for a synchronous user request, the end result to the user is the same: a long wait for a client request.

Be aware that some databases default to “for update” for cursors. In fact, the American National Standards Institute (ANSI) SQL standard indicates that any cursor should default to FOR UPDATE unless it includes the clause FOR READ ONLY on the DECLARE statement. Developers and DBAs should refer to their database documentation to identify how to develop cursors with minimal locks.
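
For example, the ANSI-style declaration of an explicitly read-only cursor looks like the sketch below. The cursor and table names are hypothetical, and some products spell the option differently (for example, READ_ONLY in Transact-SQL), so check your database's documentation for the exact syntax.

DECLARE recent_orders CURSOR FOR
  SELECT order_id, total_value
  FROM purchase_order
  ORDER BY order_id DESC
  FOR READ ONLY;   -- no row locks are requested on behalf of this cursor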

Rule 35—Don’t Select Everything

This is a pretty simple and straightforward rule. For most of us the first SQL we learned was

Select * from table_name_blah;

When it returned a bunch of data we were thrilled. Unfortunately, some of our developers either never moved beyond this point or regressed back to it over the years. Selecting everything is fast and simple but really never a good idea. There are several problems with this that we’ll cover, but keep in mind that this mentality of not naming the data we want can be seen in another DML statement, Insert.

There are two primary problems with the Select *. The first is the probability of data mapping problems, and the second is the transfer of unnecessary data. When we execute a select query we’re often expecting to display or manipulate that data, and doing so requires that we map the data into some type of variable. In the following code example there are two functions, bad_qry_data and good_qry_data. As the names should give away, bad_qry_data is a bad example of how you can map a query into an array, and good_qry_data shows a better way of doing it. In both functions, we are selecting the values from the table bestpostpage and mapping them into a two-dimensional array. Since we know there are four columns in the table we might feel that we’re safe using the bad_qry_data function. The problem arises when the next developer needs to add a column to the table and issues a command such as this:

ALTER TABLE bestpostpage ADD remote_host varchar(25) AFTER id;

The result is that column 1 in your mapping ($row[1]) is no longer remote_ip but is now remote_host. A better solution is simply to name all the columns you are selecting and identify them by name when mapping.

function bad_qry_data() {
  $sql = "SELECT * "
    . "FROM bestpostpage "
    . "ORDER BY insert_date DESC LIMIT 100";
  $qry_results = exec_qry($sql);
  $ResArr = array();
  $i = 0;
  // Positional mapping: breaks silently if columns are added or reordered
  while ($row = mysql_fetch_array($qry_results)) {
      $ResArr[$i]["id"] = $row[0];
      $ResArr[$i]["remote_ip"] = $row[1];
      $ResArr[$i]["post_data"] = $row[2];
      $ResArr[$i]["insert_date"] = $row[3];
      $i++;
  } // while
  return $ResArr;
} // function bad_qry_data

function good_qry_data() {
  $sql = "SELECT id, remote_ip, post_data, insert_date "
    . "FROM bestpostpage "
    . "ORDER BY insert_date DESC LIMIT 100";
  $qry_results = exec_qry($sql);
  $ResArr = array();
  $i = 0;
  // Associative mapping by column name: unaffected by added or reordered columns
  while ($row = mysql_fetch_assoc($qry_results)) {
      $ResArr[$i]["id"] = $row["id"];
      $ResArr[$i]["remote_ip"] = $row["remote_ip"];
      $ResArr[$i]["post_data"] = $row["post_data"];
      $ResArr[$i]["insert_date"] = $row["insert_date"];
      $i++;
  } // while
  return $ResArr;
} // function good_qry_data

The second big problem with Select * is that usually you don’t need all the data in all the columns. While the actual lookup of additional columns isn’t resource consuming, the transfer of all that additional data from the database server to the application server can add up to significant amounts when that query gets executed dozens or even hundreds of times per minute for different users.

Lest you think this is all about the much-maligned Select statement, Insert can fall prey to the exact same problem of unspecified columns. The following SQL statement is perfectly valid as long as the column count of the table matches the number of values being entered. It will break when an additional column is added to the table, which might cause an issue with your system but should at least be caught early in testing.

INSERT INTO bestpostpage
VALUES (1, '10.97.23.45', 'test data', '2010-11-19 11:15:00');

A much better way of inserting the data is to use the actual column names, like this:

INSERT INTO bestpostpage (id, remote_ip, post_data, insert_date)
VALUES (1, '10.97.23.45', 'test data', '2010-11-19 11:15:00');

As a best practice, do not get in the habit of using Select or Insert without specifying the columns. Besides wasting resources and being likely to break, or potentially even corrupt data, it also prevents you from rolling back. As we discussed in Rule 29, building the capability to roll back is critical to both scalability and availability.

Summary

In this chapter we discussed rules that will help your database scale. Ideally, we’d like to avoid the use of relational databases because they are more difficult to scale than other parts of systems, but sometimes their use is unavoidable. Given that the database is often the most difficult part of the application to scale, particular attention should be paid to these rules. When the rules presented in this chapter are combined with rules from other chapters such as Chapter 2, you should have a strong base of do’s and don’ts to ensure your database scales.

Endnotes

1 E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM 13 (6): 377-387.

2 Wikipedia, “Two-phase commit,” http://en.wikipedia.org/wiki/Two-phase_commit_protocol.
