Chapter 7. Database Structure and Performance Tuning
How long are you willing to wait for a computer to respond to your request for information? 30 seconds? 10 seconds? 5 seconds? In truth, we humans aren't very patient at all. Even five seconds can seem like an eternity when you're waiting for something to appear on the screen. A database that has a slow response time to user queries usually means that you will have dissatisfied users.
Slow response times can be the result of any number of problems. You might be dealing with a client workstation that isn't properly configured, a poorly written application program, a query involving multiple join operations, a query that requires reading amounts of data from disk, a congested network, or even a DBMS that isn't robust enough to handle the volume of queries submitted to it.
One of the duties of a database administrator (DBA) is to optimize database performance (also known as performance tuning). This includes modifying the design—where possible—to avoid performance bottlenecks, especially involving queries.
For the most part, a DBMS takes care of storing and retrieving data based on a user's commands without human intervention. The strategy used to process a data manipulation request is handled by the DBMS's query optimizer, a portion of the program that determines the most efficient sequence of relational algebra operations to perform a query.
Although most of the query optimizer's choices are out of the hands of a database designer or application developer, you can influence the behavior of the query optimizer and also optimize database performance to some extent with database design elements. In this chapter you will be introduced to several such techniques.

Joins and Database Performance

In Chapter 6 we discussed the use of joins as part of the theory of relational database design. Joins, however, can also have a major impact on query performance. The extent of the impact depends on your DBMS and how it implements a join.
From a relational algebra point of view, a join can be implemented using two other operations: product and restrict. As you will see, this sequence of operations requires the manipulation of a great deal of data and, if used by a DBMS, can result in very slow query performance.
The restrict operation retrieves rows from a table by matching each row against logical criteria (a predicate). Those rows that meet the criteria are placed in the result table, and those that do not meet the criteria are omitted. Restrict, which was originally called select, cannot choose columns; you get every column in the table.
Note: The confusion with the term select arises because the SQL query command is also “select.” The SQL select is a command that triggers many relational algebra operations, some of which are determined by the DBMS rather than the user.
The product operation (the mathematical Cartesian product) makes every possible pairing of rows from two source tables. In Figure 7-1, for example, the product of the customer and order tables that you saw in Chapter 6 produces a result table with 48 rows (6 customers times the 8 orders). The customer number column appears twice because it is part of both source tables.
B9780123747303000097/gr1.jpg is missing
Figure 7-1
The product of the customer and order tables.
Note: Although 48 rows might not seem like a lot, consider the size of a product table created from tables with 1000 or even 10,000 rows! The manipulation of a table of this size can tie up a lot of disk I/O and CPU time.
In some rows, the customer number is the same. These are the rows that should be included in a join. We can therefore apply a restrict predicate to the product table to end up with the same table provided by the join you saw in Chapter 6. The predicate's logical condition can be written:
customer.customer_numb = order.customer_numb
The rows that are chosen by this predicate appear in boldface in Figure 7-2; those eliminated by the predicate are in regular type. Notice that the boldface rows are exactly the same as those in the result table of the join from Chapter 6.
B9780123747303000097/gr2.jpg is missing
Figure 7-2
The product of the customer and orders tables after applying a restrict predicate.
Note: Although this may seem like a highly inefficient way to implement a join, it is actually quite flexible, in particular because the relationship between the columns over which the join is being performed doesn't have to be equal. A user could just as easily request a join where the value in table A is greater than the value in table B.
Because of the processing overhead created when performing joins in this way, some database designers make a conscious decision to leave tables unnormalized. For example, if Antique Opticals always accessed the line items at the same time it accessed order information, then a designer might choose to combine the order item and order data into one table, knowing full well that the unnormalized relation exhibits anomalies. The benefit is that the retrieval of order information will be faster than if it were split into two tables.
Should you leave unnormalized relations in your database to achieve better retrieval performance? In this author's opinion, there is rarely any need to do so. First, not all DBMSs implement a join in this way. Before you decide not to normalize tables, investigate how your DBMS performs a join. In addition, there are ways to prepare SQL queries (in particular, using uncorrelated subqueries) that can produce the same result as a join but without actually performing the join. That being the case, it does not seem worth the problems that unnormalized relations present to leave them in the database. Careful writing of retrieval queries can provide performance that is nearly as good as that of retrieval from unnormalized relations.
Note: For a complete discussion of writing SQL queries to avoid joins, see the author's book SQL Clearly Explained, also published by Morgan Kaufmann.

Indexing

Indexing is a way of providing a fast access path to the values of a column or a concatenation of columns. New rows are typically added to the bottom of a table, resulting in a relatively random order of the values in any given column. Without some way of ordering the data, the only way the DBMS can search a column is by sequentially scanning each row from top to bottom. The larger a table becomes, the slower a sequential search will be.
Note: On average, in a table of N rows, a sequential search will need to examine N/2 rows to find a row that matches a query predicate. However, the only way for the DBMS to determine that no rows match the predicate is to examine all N rows. A table with 1000 rows requires on average looking at 500 rows; an unsuccessful search requires consulting all 1000 rows. However, the fast searching techniques provided by indexes require looking at about six rows to find a matching row; an unsuccessful search requires consulting about ten rows.
The alternative to indexing for ordering the rows in a table is sorting. A sort physically alters the position of rows in a table, placing the rows in order starting with the first row in the table. Most SQL implementations do sort the virtual tables that are created as the result of queries when directed to do so by the SQL query. However, SQL provides no way to sort base tables, and there is good reason for this. Regardless of the sorting method used, as a table grows large (hundreds of thousands to millions of rows), sorting takes a very long time.
Keeping a table in sorted order also means that on average half of the rows in the table will need to be moved to make room for a new row. In addition, searching a sorted base table takes longer than searching an index, primarily because the index search requires less disk access. The overhead in maintaining indexes is far less than that required to sort base tables whenever a specific data order is needed.
The conceptual operation of an index is diagrammed in Figure 7-3. (The different weights of the lines have no significance other than to make it easier for you to follow the crossed lines.) This illustration shows Antique Opticals’ item relation and an index that provides fast access to rows in the table based on the item's title. The index itself contains an ordered list of keys (the titles) along with the locations of the associated rows in the item table. The rows in the item table are in relatively random order. However, because the index is in alphabetical order by title, it can be searched quickly to locate a specific title. Then the DBMS can use the information in the index to go directly to the correct row or rows in the item table, thus avoiding a slow sequential search of the base table's rows.
B9780123747303000097/gr3.jpg is missing
Figure 7-3
Indexing.
Once you have created an index, the DBMS's query optimizer will use the index whenever it determines that using the index will speed up data retrieval. You never need to access the index again yourself unless you want to delete it.
When you create a primary key for a table, the DBMS automatically creates an index for that table, using the primary key column or columns in the primary key as the index key. The first step in inserting a new row into a table is therefore verification that the index key (the primary key of the table) is unique in the index. In fact, uniqueness is enforced by requiring the index entries to be unique, rather than by actually searching the base table. This is much faster than attempting to verify uniqueness directly on the base table because the ordered index can be searched much more rapidly than the unordered base table.

Deciding Which Indexes to Create

You have no choice as to whether the DBMS creates indexes for your primary keys; you get them whether you want them or not. In addition, you can create indexes on any column or combination of columns you want. However, before you jump headfirst into creating indexes on every column in every table, you must consider some trade-offs:
Indexes take up space in the database. Given that disk space is relatively inexpensive today, this is usually not a major drawback.
▪ When you insert, modify, or delete data in indexed columns, the DBMS must update the index as well as the base table. This may slow down data modification operations, especially if the tables have a lot of rows.
▪ Indexes definitely speed up access to data.
The trade-off is therefore generally between update speed and retrieval speed. A good rule of thumb is to create indexes for foreign keys and for other columns that are used frequently for queries that apply criteria to data. If you find that update speed is severely affected, you may choose at a later time to delete some of the indexes you created.
Also avoid indexes on columns that contain nondiscriminatory data. Nondiscriminatory columns have only a few values throughout the entire table, such as Boolean columns that contain only true and false. Gender (male or female) is also nondiscriminatory. Although you may search on a column containing nondiscriminatory data—for example, a search for all open orders—an index will not provide much performance enhancement because the DBMS must examine so many keys to complete the query.

Clustering

The slowest part of a DBMS's actions is retrieving data from or writing data to a disk. If you can cut down on the number of times the DBMS must read from or write to a disk, you can speed up overall database performance. The trick to doing this is understanding that a database must retrieve an entire disk page of data at one time.
The size of a page varies from one computing platform to another; it can be anywhere from 512 bytes to 4 K, with 1 K being typical on a PC. Data always travel to and from disk in page-sized units. Therefore, if you store data that are often accessed together on the same disk page (or pages that are physically close together), you can speed up data access. This process is known as clustering and is available with many large DBMSs (for example, Oracle).
Note: The term clustering has another meaning in the SQL standard. It refers to groups of catalogs (which in turn are groups of schemas) manipulated by the same DBMS. The use of the term in this section, however, is totally distinct from the SQL meaning.
In practice, a cluster is designed to keep together rows related by matching primary and foreign keys. To define the cluster, you specify a column or columns on which the DBMS should form the cluster and the tables that should be included. Then, all of the rows that share the same value of the column or columns on which the cluster is based are stored as physically close together as possible. As a result, the rows in a table may be scattered across several disk pages, but matching primary and foreign keys are usually on the same disk page.
Clustering can significantly speed up join performance. However, just as with indexes, there are some trade-offs to consider when contemplating creating clusters:
▪ Because clustering involves physical placement of data in a file, a table can be clustered on only one column or combination of columns.
▪ Clustering can slow down performance of operations that require a scan of the entire table because clustering may mean that the rows of any given table are scattered throughout many disk pages.
▪ Clustering can slow down insertion of data.
▪ Clustering can slow down modifying data in the columns on which the clustering is based.

Partitioning

Partitioning is the opposite of clustering. It involves the splitting of large tables into smaller ones so that the DBMS does not need to retrieve as much data at any one time. Consider, for example, what happens to Antique Opticals’ order and order items tables over time. Assuming that the business is reasonably successful, those tables (especially order items) will become very large. Retrieval of data from those tables will therefore begin to slow down. It would speed up retrieval of open orders if filled orders and their items could be separated from open orders and their items.
There are two ways to partition a table: horizontally and vertically. Horizontal partitioning involves splitting the rows of a table between two or more tables with identical structures. Vertical partitioning involves dividing the columns of a table and placing them in two or more tables linked by the original table's primary key. As you might expect, there are advantages and disadvantages to both.

Horizontal Partitioning

Horizontal partitioning involves creating two or more tables with exactly the same structure and splitting rows between those tables. Antique Opticals might use this technique to solve the problem with the order and order items tables becoming increasingly large. The database design might be modified as follows:
open_order (order_numb, customer_numb, order_date)
open_order_items (order_numb, item_numb, quantity, shipped?)
filled_order (order_numb, customer_numb, order_date)
filled_order_items (order_numb, item_numb, quantity, shipped?)
Whenever all items in an open order have shipped, an application program deletes rows from the open order and open order items table and inserts them into the filled order and filled order items table. The open order and open order items tables remain relatively small, speeding up both retrieval and modification performance. Although retrieval from filled order and filled order lines will be slower, Antique Opticals uses those tables much less frequently.
The drawback to this solution occurs when Antique Opticals needs to access all of the orders and/or order items at the same time. A query whose result table includes data from both sets of open and filled tables must actually be two queries connected by the union operator. (The union operation creates one table by merging the rows of two tables with the same structure.) Performance of such a query will be worse than that of a query of either set of tables individually. Nonetheless, if an analysis of Antique Opticals’ data access patterns reveals that such queries occur rarely and that most retrieval involves the open set of tables, then the horizontal partitioning is worth doing.
The only way you can determine whether horizontal partitioning will increase performance is to examine the ways in which your database applications access data. If there is a group of rows that are accessed together significantly more frequently than the rest of the rows in a table, then horizontal partitioning may make sense.

Vertical Partitioning

Vertical partitioning involves creating two or more tables with selected columns and all rows of a table. For example, if Antique Opticals accesses the titles and prices of their merchandise items more frequently than the other columns in the item table, the item table might be partitioned as follows:
item_titles (item_numb, title, price)
item_details (item_numb, distributor, release_date, . . .)
The benefit of this design is that the rows in the smaller item titles table will be physically closer together; the smaller table will take up fewer disk pages and thus support faster retrieval.
Queries that require data from both tables must join the tables over the item number. Like most joins, this will be a relatively slow operation. Therefore, vertical partitioning makes sense only when there is a highly skewed access pattern from the columns of a table. The more often a small, specific group of columns is accessed together, the more vertical partitioning will help.
For Further Reading
Atonini, Christian, Troubleshooting Oracle Performance. (2008) Apress.
Delaney, Kalen; Agarwal, Sunil; Freedman, Craig; Talmage, Ron; Machanic, Adam, Inside Microsoft SQL Server 2005: Query Tuning and Optimization. (2007) Microsoft Press.
Harrison, Guy, Oracle SQL High-Performance Tuning. (2000) Prentice Hall PTR.
IBM Redbooks, A Deep Blue View of DB2 Performance. (2006) IBM.com/Redbooks.
Mittra, Sitansu S, Database Performance Tuning and Optimization. (2002) Springer.
Schwartz, Baron; Zaitsev, Peter; Tkachenko, Vadim; Zawony, Jeremy; Lentz, Arien; Balling, Derek, High Performance MySQL: Optimization, Backups, Replication, and More. (2008) O’Reilly.
Tow, Dan, SQL Tuning. (2003) O’Reilly.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset