Chapter 15

Understand and design indexes

This chapter dives into indexing of all kinds—not just clustered and nonclustered indexes—including practical development techniques for designing indexes. It mentions memory-optimized tables throughout, including hash indexes for extreme write workloads and columnstore indexes for extreme read workloads. The chapter reviews missing indexes and index usage, and then introduces statistics—how they are created and updated—along with important performance-related options for statistics objects. Finally, it explains special types of indexes for niche uses.

In SQL Server you have access to a variety of indexing tools in your toolbox.

We’ve had clustered and nonclustered indexes in all 21st-century versions of SQL Server—those two rowstore index types that are the bread and butter of SQL Server. We cover those in the first half of this chapter, including important new options for SQL Server 2022.

Introduced in SQL Server 2012, columnstore indexes presented a new and exciting way to perform analytical queries on massive amounts of compressed data. They became an essential tool for database developers, and this chapter discusses them in detail. SQL Server 2014 brought memory-optimized tables and their uniquely powerful hash indexes for latchless querying on rapidly changing data. You can now even combine these two concepts, creating columnstore indexes on memory-optimized tables, allowing for live analytical-scale queries on streamed data. First, though, we’re going to dive into the index design concepts.

All scripts for this book are available for download at https://www.MicrosoftPressStore.com/SQLServer2022InsideOut/downloads.

Design clustered indexes

Let’s be clear about what a clustered index is, and then state the case for why every table in a relational database should have one, with very few exceptions.

First, we will discuss rowstore clustered indexes. It is also possible to create a clustered columnstore index. We discuss that later in this chapter.

Whether you are inheriting and maintaining a database or designing the objects within it, there are important facts to know about clustered indexes. In the case of both rowstore and columnstore indexes, the clustered index stores the data rows for all columns in the table. In the case of rowstore indexes, the table data is logically sorted by the clustered index key; in the case of clustered columnstore indexes, there is no key. Memory-optimized tables don’t have a clustered index structure inherent to their design but could have a clustered columnstore index created for them.

Choose a proper rowstore clustered index key

There are four marks of a good clustered index key for most OLTP applications—or, in the case of a compound clustered index key, of the first column listed. The column order matters. Let’s review four key factors that will help you understand what role the clustered index key serves, and how best to design one:

  • Increasing sequential value. A value that increases with every row inserted (such as 1,2,3…, or an increasing point in time, or an increasing alphanumeric) is useful in efficient page organization. This means the insert pattern of the data as it comes in from the business will match the loading of rows onto the physical structures of the table.

    A column with the identity property, or populated by a value from a sequence object, matches this perfectly. Use date and time data only if it is highly unlikely to repeat, and then strongly consider using the datetimeoffset data type to avoid repeated data once annually during daylight saving time changes.

  • Unique. A clustered index key does not need to be unique, but in most cases it should be. (The clustered key also does not need to be the primary key of the table, or the only uniqueness enforced in the table.) A unique (or near-unique) clustered index means efficient seeks. If your application will be searching for individual rows out of this table regularly, you and the business should know what makes those searches unique.

    Unique constraints, whether nonclustered or clustered, can improve performance on the same data and create a more efficient structure. A unique constraint is implemented in the same way as a unique rowstore nonclustered index.

    If a clustered index is declared without the UNIQUE property, a second key value is added in the background: a four-byte integer uniquifier column. SQL Server must have some way to uniquely identify each row. The key from the rowstore clustered index is used as the row locator for nonclustered indexes, which leads to the next factor.

  • Nonchanging. Choose a key that doesn’t change—ideally a system-generated key that isn’t visible to end-user applications or reports. In general, when end users can see data, they will eventually see fit to change that data. You do not want clustered index key values to ever change (much less PRIMARY KEY values). A system-generated or surrogate key of sequential values (like an IDENTITY column) is ideal. A column that combines system- or application-generated values, such as dates and times or numbers, would work too.

    The negative impact of changing the clustering key includes the possibility that the first two aforementioned guidelines would be broken. If the clustered key is also a primary key, updating the key’s values could also require cascading updates to enforce referential integrity. It is much easier for everyone involved if only columns with business value are exposed to end users and, therefore, can be changed by end users. In normalized database design, we would call these natural keys as opposed to surrogate keys.

  • Narrow data type. The decision with respect to data type for your clustered index key can have a large impact on table size, the cost of index maintenance, and the efficiency of queries at scale. The clustered index key value is also stored with every nonclustered index key value, meaning that an unnecessarily wide clustered index key will also cause unnecessarily wide nonclustered indexes on the table. This can have a very large impact on storage on drives and in memory at scale.

    The narrow data type guidance should also steer you away from using the uniqueidentifier field, which is 16 bytes per row, or four times the size of an integer column per row, and twice as large as a bigint. It also steers away from using wide strings, such as names, addresses, or URLs.

The clustered index is an important decision in the structure of a new table. For the vast majority of tables designed for relational database systems, however, the decision is fairly easy. An identity column with an INT or BIGINT data type is the ideal key for a clustered index because it satisfies the aforementioned four recommended qualities of an ideal clustered index. A procedurally generated timestamp or other incrementing time-related value, combined with a unique, autoincrementing number, also provides for a common, albeit less-narrow, clustered index key design.
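A minimal sketch of a table built around these four guidelines might look like the following. (The table and column names here are hypothetical, used only for illustration.)

CREATE TABLE dbo.OrderHeader (
    OrderID bigint IDENTITY(1,1) NOT NULL, -- narrow, unique, sequential, nonchanging surrogate key
    OrderNumber nvarchar(20) NOT NULL,     -- natural key visible to users, enforced separately
    OrderDate datetimeoffset(0) NOT NULL,
    CONSTRAINT PK_OrderHeader PRIMARY KEY CLUSTERED (OrderID),
    CONSTRAINT UQ_OrderHeader_OrderNumber UNIQUE (OrderNumber)
);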

When a table is created with a primary key constraint and no other mention of a clustered index, the primary key’s columns become the clustered index’s key. This is typically safe, but a table with a compound primary key or a primary key that does not begin with a sequential column could result in a suboptimal clustered index. It is important to note that the primary key does not need to be the clustered index key, but often should be. It is possible to create nonunique clustered indexes or to have multiple unique columns or column combinations in a table.

When combining multiple columns into the clustered index key, keep in mind that the column order of an index, clustered or nonclustered, does matter. If you decide to use multiple columns to create a clustered index key, the first column should still align as closely as possible with the other three guidelines, even if it alone is not unique.

In the sys.indexes catalog view, the clustered index is always identified as index_id = 1. If the table is a heap, there will instead be a row with index_id = 0. This row represents the heap data.
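For example, a quick way to find heaps in the current database, or to confirm which index is the clustered index, is a sketch along these lines:

SELECT TableName = s.name + '.' + t.name, i.index_id, i.name, i.type_desc
FROM sys.indexes AS i
INNER JOIN sys.tables AS t ON t.object_id = i.object_id
INNER JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE i.index_id IN (0, 1) -- 0 = heap, 1 = clustered index
ORDER BY TableName;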

The case against intentionally designing heaps

Without a clustered index, a table is known as a heap. In a heap, the Database Engine uses a structure known as row identifier (RID), which uniquely identifies every row for internal purposes. The structure of the heap has no order when it is stored. RIDs do not change, so when a record is updated, a forwarding pointer is created in the old location to point to the new. Also, if the row that has the forwarding pointer is moved to another page, it gets another forwarding pointer. Even deleted rows can have forwarding pointers! If that sounds like it is complicated or would increase the amount of I/O activity needed to store and retrieve the data, you’re right.

Furthering the performance problems associated with heaps, table scans are the only method of access for reading from a heap structure unless a nonclustered index is created on the heap. It is not possible to perform a seek against a heap; however, it is possible to perform a seek against a nonclustered index that has been added to a heap. In this way, a nonclustered index can provide an ordered copy of some of the table’s data in a separate structure.

One edge case for designing a table purposely without a clustered index is if you would only ever insert into a table. Without any order to the data, you might reap some benefits from rapid, massive data inserts into a heap. Other types of writes to the table (deletes and updates) will likely require table scans to complete and likely be far less efficient than the same writes against a table with a clustered index.

Deletes and updates typically leave wasted space within the heap’s structure, which cannot be reclaimed even with an index rebuild operation. To reclaim wasted space inside a heap without re-creating it, you must, ironically, create a clustered index on the table, then drop the clustered index. You can also use the ALTER TABLE ... REBUILD Transact-SQL (T-SQL) command to rebuild the heap.
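For example, assuming a hypothetical heap named dbo.StagingImport, either approach reclaims the wasted space:

-- Option 1: rebuild the heap in place
ALTER TABLE dbo.StagingImport REBUILD;

-- Option 2: create and then drop a clustered index, which rewrites the heap
CREATE CLUSTERED INDEX CX_StagingImport_Temp ON dbo.StagingImport (ImportID);
DROP INDEX CX_StagingImport_Temp ON dbo.StagingImport;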

The perceived advantage of heaps for workloads exclusively involving inserts can be easily outweighed by the significant disadvantages whenever accessing that data—when query performance would necessitate the creation of a clustered and/or nonclustered index. Table scans and RID lookups for any significant number of rows are likely to dominate the cost of any execution plan accessing the heap. Without a clustered index, queries reading from a table large enough to gain significant advantage from its inserts would perform poorly.

Microsoft’s expansion into modern unstructured data platforms, including integration with Azure Data Lake Storage Gen2, S3-compatible storage, Apache Spark, and other architectures, is likely to be more appropriate when rapid, massive data inserts are required. This is especially true for when you will continuously collect massive amounts of data and then only ever analyze the data in aggregate. These alternatives, integrated with the Database Engine starting with SQL Server 2016, or a focus of new Azure development such as Azure Synapse, would be superior to intentionally designing a heap.

Further, adding a clustered index to optimize the eventual retrieval of data from a heap is nontrivial. Behind the scenes, the Database Engine must write the entire contents of the heap into the new clustered index structure. If any nonclustered indexes exist on the heap, they also will be re-created, using the clustered key instead of the RID. This will likely result in a large amount of transaction log activity and tempdb space being consumed.

Understand the OPTIMIZE_FOR_SEQUENTIAL_KEY feature

Earlier in this chapter, we sang the praises of a clustered index key with an increasing sequential value, such as an integer based on an identity or sequence. For very frequent, multithreaded inserts into a table with an identity or sequence, the “hot spot” of the page in memory with the “next” value can provide some I/O bottleneck. (Long term, this is still likely preferable to fragmentation-upon-insertion, as explained in the previous section, and only surfaces at scale.)

A useful feature introduced in SQL Server 2019 is the OPTIMIZE_FOR_SEQUENTIAL_KEY index option, which improves the concurrency of the page needing rapid inserts for rowstore indexes from multiple threads.

You might observe a high amount of the PAGELATCH_EX wait type on sessions performing inserts into the same table. You can observe this with the dynamic management view (DMV) sys.dm_exec_session_wait_stats, or at an instance aggregate level with sys.dm_os_wait_stats. You should see this wait type drop when the new OPTIMIZE_FOR_SEQUENTIAL_KEY index option is enabled on indexes in tables that are written to by multiple requests simultaneously. Note that this isn’t the PAGEIOLATCH_EX wait type, which is associated with physical I/O on data pages, but PAGELATCH_EX, which is associated with contention on pages in memory.

  • For more information on observing wait types with DMVs, including the differences between the PAGEIOLATCH_EX and PAGELATCH_EX wait types, see Chapter 8.
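As a quick check, a sketch like the following surfaces both wait types at the instance level:

SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('PAGELATCH_EX', 'PAGEIOLATCH_EX')
ORDER BY wait_time_ms DESC;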

Let’s take a look at implementing OPTIMIZE_FOR_SEQUENTIAL_KEY. In our contrived example, multiple T-SQL threads frequently executing single-row inserting statements mean that the top two predominant wait types accrued via the DMV sys.dm_os_wait_stats are WRITELOG and PAGELATCH_EX. As Chapter 8 explained, the WRITELOG wait type is fairly self-explanatory—sending data to the transaction log—while PAGELATCH_EX is an indication of a “hot spot” page, symptomatic of rapid concurrent inserts into a sequential key.

By enabling OPTIMIZE_FOR_SEQUENTIAL_KEY on your rowstore indexes—both clustered and nonclustered—you should see some reduction in PAGELATCH_EX and the introduction of a small amount of a new wait type, BTREE_INSERT_FLOW_CONTROL, which is associated with the new OPTIMIZE_FOR_SEQUENTIAL_KEY setting.

Note

The new OPTIMIZE_FOR_SEQUENTIAL_KEY option is not available when creating columnstore indexes or for any indexes on memory-optimized tables.

You can query the value of the OPTIMIZE_FOR_SEQUENTIAL_KEY option with a new column added to sys.indexes of the same name. Note that this new option is not enabled by default, so it must be enabled manually on each index. The option is retained after an index is disabled and then rebuilt. For existing rowstore indexes, you can change the new OPTIMIZE_FOR_SEQUENTIAL_KEY option without a rebuild operation, with the following syntax:

ALTER INDEX PK_table1 ON dbo.table1
SET (OPTIMIZE_FOR_SEQUENTIAL_KEY = ON);
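To see which indexes currently have the option enabled, a quick query of the sys.indexes column mentioned earlier might look like this:

SELECT TableName = o.name, IndexName = i.name, i.optimize_for_sequential_key
FROM sys.indexes AS i
INNER JOIN sys.objects AS o ON o.object_id = i.object_id
WHERE o.is_ms_shipped = 0
ORDER BY TableName, IndexName;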

Note

There are only a handful of index settings that can be set like this without a rebuild operation, so the syntax might look a little unusual. The index options ALLOW_PAGE_LOCKS, ALLOW_ROW_LOCKS, OPTIMIZE_FOR_SEQUENTIAL_KEY, IGNORE_DUP_KEY, and STATISTICS_NORECOMPUTE can be set without a rebuild.

Design rowstore nonclustered indexes

Each table should almost always have a clustered index that both defines the order of and becomes the data structure for the data in the table. Nonclustered indexes provide additional copies of the data in vertically filtered sets, sorted by nonprimary columns.

You should approach the design of nonclustered indexes in response to application query usage, and then verify over time that you are benefitting from indexes. (You can read more about index usage statistics later in this chapter.) You might also choose to design a unique rowstore nonclustered index to enforce a business constraint. A unique constraint is implemented in the same way as a unique rowstore nonclustered index. This might be a valuable part of the table’s behavior even if the resulting unique rowstore nonclustered index is never queried.
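For example, either of the following statements (using hypothetical table and column names) enforces the same uniqueness; the constraint form simply documents the business rule more explicitly:

-- As a unique constraint (implemented as a unique nonclustered index)
ALTER TABLE dbo.Products
ADD CONSTRAINT UQ_Products_ProductCode UNIQUE (ProductCode);

-- As an explicitly created unique nonclustered index
CREATE UNIQUE NONCLUSTERED INDEX IDX_U_Products_ProductCode
ON dbo.Products (ProductCode);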

Here are the properties of ideal nonclustered indexes:

  • Broad enough to serve multiple queries, not just designed to suit one query.

  • Well-ordered keys eliminate unnecessary sorting in high-value queries.

  • Well-stocked INCLUDE lists prevent lookups in high-value queries, but can grow out of control if missing index DMV suggestions are adopted without investigation.

  • Proven beneficial usage over time in the sys.dm_db_index_usage_stats DMV.

  • Unique when possible (keep in mind a table can have multiple uniqueness criteria).

  • Key column order matters, so in a compound index, it is likely best for the most selective (with the most distinct values) columns to be listed first.

Nonclustered indexes on disk-based tables are subset copies of a rowstore table that take up space on storage and in memory. (On memory-optimized tables, indexes interact with the disk much differently. More on that later in this chapter.) You must spend time maintaining all nonclustered indexes. They are kept transactionally consistent with the data in the table, serving a limited, reordered set of the data in a table. All writes (including deletes) to the table data must also be written to the nonclustered index (in the case of updates, when any indexed column is modified) to keep it up to date. (On columnstore optimized tables, this happens a little differently, with a delta store of change records.)

The positive benefit rowstore nonclustered indexes can have on SELECT, UPDATE, and DELETE queries that don’t use the clustered index, however, is potentially very significant. Keep in mind that some write queries might appear to perform more quickly because accessing the data that is being changed can be optimized, as with accessing the data in a SELECT query. Your applications’ writes will slow with the addition of nonclustered indexes, and adding many nonclustered indexes will certainly contribute to poor write performance. You can be confident that creating any one well-designed nonclustered index will contribute to reads greatly, and not have a perceivable impact on writes.

You should not create nonclustered indexes haphazardly or clumsily; you should plan them, modify and combine them when appropriate, and review them regularly to make sure they are still useful. (See the section “Understand and provide index usage” later in this chapter.) However, nonclustered indexes represent a significant source of potential performance tuning that every developer and database administrator should be aware of, especially in transactional databases. Remember: Always look at the “big picture” when creating indexes. Very rarely does a single query rise to the level of importance of justifying its own indexes.

Note

Starting with SQL Server 2019, the RESUMABLE syntax can be used when creating an index with the ONLINE syntax. ALTER INDEX and CREATE INDEX statements can similarly be paused and resumed. For more on RESUMABLE index maintenance, see Chapter 8.
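For example, a resumable online index creation and a later pause and resume might look like this sketch. (The index and table names are hypothetical; the PAUSE would be issued from another session while the operation is running.)

CREATE NONCLUSTERED INDEX IDX_NC_OrderLines_OrderID
ON dbo.OrderLines (OrderID)
WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60 MINUTES);

-- Pause the operation, for example ahead of a busy period, and resume it later
ALTER INDEX IDX_NC_OrderLines_OrderID ON dbo.OrderLines PAUSE;
ALTER INDEX IDX_NC_OrderLines_OrderID ON dbo.OrderLines RESUME;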

Understand nonclustered index design

Let’s talk about what we meant a moment ago when we said, “You should not create nonclustered indexes haphazardly or clumsily.” When should you create a nonclustered index, and how should you design them? How many should you add to a table?

Even though adding nonclustered indexes on foreign key columns can be beneficial if those referencing columns will frequently be used in queries, it’s rare that a useful nonclustered index will be properly designed with a single column in mind. This is because outside of joins on foreign keys, it is rare that queries will be designed to both seek and return a single column from a table. Starting a database design with indexes on foreign key columns is useful; that doesn’t mean, however, that they can’t be changed to have more columns at the end of the key list or in the include list.

Further, when designing nonclustered rowstore indexes, you should always be aware of all the indexes on a table and look for opportunities to combine indexes with overlapping keys. In the next section, we’ll talk about index keys and overlapping index keys.

Choose a proper index

When creating new nonclustered indexes for a table, you must always compare the new index to existing indexes. The order of the key of the index matters. In T-SQL, it looks like this:

CREATE NONCLUSTERED INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);

In this index, InvoiceID and StockItemID are defined as the key. Using Object Explorer in SQL Server Management Studio (SSMS), you can view the index properties to see the same information. This nonclustered index represents a copy of two of the columns of data in the InvoiceLines table, sorted by the column InvoiceID first, and then by StockItemID. Neither of these columns is the primary key or the first key of the clustered index, which we can assume is InvoiceLineID.

To emphasize that the order of key columns in a nonclustered index matters, the two indexes that follow are completely different structures, and will best serve different queries. It’s unlikely that a single query would have much use for both nonclustered indexes, though SQL Server can still choose to use an index with less-than-optimal key order rather than scan a clustered index:

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);
CREATE INDEX IDX_NC_InvoiceLines_StockItemID_InvoiceID
ON [Sales].[InvoiceLines] (StockItemID, InvoiceID);

The columns with the most distinct values are more selective and will usually best serve queries if they are listed before less-selective columns in the index order. Note, though, that the order of columns in the INCLUDE portion of a nonclustered index (more on that later) does not matter.

Remember also from the previous section on clustered indexes that the clustered index key is already inside the key of the nonclustered index. There might be scenarios when the missing indexes feature (more on this later) suggests adding a clustered key column to your nonclustered index. It does not change the size of a nonclustered index to do this; the clustered key is already in a nonclustered index. The only caveat is that the order of the nonclustered index keys still determines the sort order of the index. So, having the clustered index key column(s) in your nonclustered index key won’t change the index’s size, but could change the sort order of the keys, creating what is essentially a different index when compared to an index that doesn’t include the clustered index key column(s).

The default sort order for index column values is ascending. If you want to sort in descending order, you must be explicit in that as in the query that follows. If queries frequently call for data to be sorted by a column in descending order, which might be common for queries looking for the most recent data, you could provide that key value like this:

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID DESC, StockItemID);

Creating the key’s sort order incorrectly might not matter to some queries. For example, the nested loop operator does not require data to be sorted in a particular order, so different sort orders in the keys of a nonclustered index might not make a significant impact to the execution plan. On the other hand, a merge join operator requires data in both inputs to the operator to be sorted in the same order, so changing the sort order of the keys of an index—especially the first key—could simplify an execution plan by eliminating unnecessary sort operators. This is among the strategies of index tuning to consider. Remember to review the query plan performance data that the Query Store collects to observe the impact of index changes on multiple queries.

Understand redundant indexes

Nonclustered index keys shouldn’t overlap with other indexes in the same table. Because index key order matters, you must be aware of what is and what isn’t an overlapping index. Consider the following two nonclustered indexes on the same table:

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, UnitPrice, Quantity);
CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);

Both indexes lead with InvoiceID and StockItemID. The first index includes additional data. The second index is completely overlapped. Queries may still use the second index, but because the leading key columns match the other index, the larger index will provide very similar performance gains. If you drop IDX_NC_InvoiceLines_InvoiceID_StockItemID, you’ll have fewer indexes to maintain, and fewer indexes to take up space in memory and on disk. The space it requires, the space in memory it consumes when used, and the effort it takes to keep the index up to date and maintained could all be considered redundant. The index IDX_NC_InvoiceLines_InvoiceID_StockItemID isn’t needed and should be dropped, and queries that used it will use IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity.

Note

Before dropping an index entirely, you can disable it. This keeps the index definition metadata so you can re-create it later if needed. While disabled, the index does not consume any resources.
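For example, if usage data shows the redundant index is never read, you might disable it first and drop it only after a full business cycle passes without regressions:

-- Disable: the index data is dropped, the metadata is kept, and the index is no longer maintained
ALTER INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID ON [Sales].[InvoiceLines] DISABLE;

-- Re-enable later, if needed, by rebuilding it
ALTER INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID ON [Sales].[InvoiceLines] REBUILD;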

Consider, then, the following two indexes:

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, UnitPrice, Quantity);
CREATE INDEX IDX_NC_InvoiceLines_StockItemID_InvoiceID
ON [Sales].[InvoiceLines] (StockItemID, InvoiceID);

Note that the second index’s keys are in a different order. This is physically and logically a different structure than the first index.

Does this mean both of these indexes are needed? Probably. Some queries might perform best using keys in the second index’s order. The Query Optimizer can still use an index with columns in a suboptimal order—for example, to scan the smaller structure rather than the entire table. The Query Optimizer might instead find that an index seek and a key lookup on a different index is faster than using an index with the columns in the wrong order.

You can verify whether or not each index has been used in the sys.dm_db_index_usage_stats DMV, which we discuss later in this chapter, in the section “Understand and provide index usage.”

The Query Store can be an invaluable tool to discover queries that have regressed because of changes to indexes that have been dropped, reordered, or resorted.

Understand the INCLUDE list of an index

In the B+ tree structure of a rowstore nonclustered index, key columns are stored through the two major sections of the index object:

  • Branch levels. These are where the logic of seeks happens, starting at a narrow “top” where key data is stored so it can be traversed by SQL Server using binary decisions. A seek moves “down” the B+ tree structure via binary decisions.

  • Leaf levels. These are where the seek ends and data is retrieved. Adding a column to the INCLUDE list of a rowstore nonclustered index adds that data only to the leaf level.

An index’s INCLUDE statement allows for data to be retrievable in the leaf level only, but not stored in the branch level. This reduces the overall size and complexity needed to cover a query’s need. Consider the following query and execution plan (see Figure 15-1) from the WideWorldImporters sample database. (Figure 15-2 shows the properties of the index seek in the following script.)

SELECT CustomerID, AccountsPersonID
FROM [Sales].[Invoices]
WHERE CustomerID = 832;

Figure 15-1 This execution plan shows an index seek and a key lookup on the same table. The Key Lookup represents 99 percent of the cost of the query.


Figure 15-2 The properties of the Index Seek in the previous sample script. Note that CustomerID is in the seek predicate and also in the output list, but that AccountsPersonID is not listed in the output list.

Note in Figure 15-2 that CustomerID is in the seek predicate and also in the output list, but that AccountsPersonID is not listed in the output list. Our query is searching for and returning CustomerID (it appears in both the SELECT and WHERE clauses), but our query also returns AccountsPersonID, which is not contained in the index FK_Sales_Invoices_CustomerID. (SQL Server seeks the nonclustered index, then joins that result with the clustered index by using a key lookup.)

Here is the code of the nonclustered index FK_Sales_Invoices_CustomerID, named because it is for CustomerID, a foreign key reference:

CREATE NONCLUSTERED INDEX [FK_Sales_Invoices_CustomerID]
ON [Sales].[Invoices]
( CustomerID ASC );

To remove the key lookup, add an included column to the nonclustered index so the query can retrieve all the data it needs from a single object:

CREATE NONCLUSTERED INDEX [FK_Sales_Invoices_CustomerID]
ON [Sales].[Invoices]
( CustomerID ASC )
INCLUDE ( AccountsPersonID )
WITH (DROP_EXISTING = ON);
GO

Let’s run our sample query again (see Figure 15-3):

SELECT CustomerID, AccountsPersonID
FROM [Sales].[Invoices]
WHERE CustomerID = 832;

Figure 15-3 The execution plan now shows only an index seek. The key lookup that appeared in Figure 15-1 has been eliminated from the execution plan.

Note in Figure 15-3 that the key lookup has been eliminated. The query was able to retrieve both CustomerID and AccountsPersonID from the same index and required no second probe of the table for AccountsPersonID. The estimated subtree cost, in the properties of the SELECT operator, is now 0.0034015, compared to 0.355919 when the key lookup was present. Although this query was a small example for demonstration purposes, eliminating the key lookup represents a cost reduction by two orders of magnitude, without changing the query.

Just as you do not want to add too many nonclustered indexes, you also do not want to add too many columns unnecessarily to the INCLUDE list of nonclustered indexes. Columns in the INCLUDE list, as you saw in the previous code example, still require storage space. For infrequent queries that return small sets of data, the key lookup operator is probably not worth the cost of storing additional columns in the INCLUDE list of an index. Every time you include a column, you increase the overhead of the index.

In summary, you should craft nonclustered indexes to serve many queries intelligently, you should always try to avoid creating overlapping or redundant indexes, and you should regularly review to verify that indexes are still being used as applications or report queries change. Keep this guidance in mind as you move into the next section!

Create filtered nonclustered indexes

Nonclustered indexes are a sort of vertical partition of a table, but you can also create a horizontal filter of an index. A filtered index has powerful potential uses to serve up prefiltered data. Filtered indexes are, of course, suited only to serving queries with matching WHERE clauses.

Filtered indexes could have particular use in table designs that include a soft delete flag, a processed/unprocessed status flag, or a current/archived flag. Work with developers to identify this sort of table design usage.

Imagine a scenario where a table has millions of rows, with a few rows marked as “unprocessed” and the rest marked as “processed.” An application might regularly query this table using WHERE processed = 0, looking for rows to process. A nonclustered index on the processed column, resulting in a seek operation in the execution plan, would be much faster than scanning the entire table. But a filtered index with the same WHERE clause would only contain the few rows marked “unprocessed,” resulting in a performance gain for the same query with no code changes.

You can easily add a filter to an index. For example:

CREATE INDEX [IX_Application_People_IsEmployee]
ON [Application].[People]([IsEmployee]) WHERE IsEmployee = 1
WITH (DROP_EXISTING = ON);

In your database, look for potential uses of new filtered indexes in columns that are of the bit data type, that use a prefix like “Is” or a suffix like “Flag,” or where queries only ever look for data of a certain type, or for data that is NULL or NOT NULL. Work with developers to identify potential uses when the majority of the data in the table is not needed for many queries.

Adding a filter to an existing index might make it unusable to queries that do not include a matching WHERE clause. Avoid adding a filter to an existing nonclustered index marked unique, as this will change the enforcement of the constraint and the intent of the unique index. Filtered nonclustered indexes can be created with the unique property to enforce filtered uniqueness. For example, this could have potential uses for employee IDs that are allowed to be reused, or in a data warehouse scenario where a table needs to enforce uniqueness on only the active records of a dimension.
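A minimal sketch of a unique filtered index enforcing uniqueness on only the active rows of a dimension (table and column names hypothetical):

CREATE UNIQUE NONCLUSTERED INDEX UQ_DimEmployee_EmployeeCode_Active
ON dbo.DimEmployee (EmployeeCode)
WHERE IsActive = 1;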

Understand the missing indexes feature

The concept of intelligently combining many similar indexes into one super-index is crucial to understanding the utility of using SQL Server’s built-in missing indexes feature. First introduced in SQL Server 2005, the missing indexes feature revolutionized the ability to see the big picture when crafting nonclustered indexes. The missing indexes feature has been passively gathering information on every database since SQL Server 2005 as well as in Azure SQL Database.

The missing indexes feature collects information from actual query usage. SQL Server passively records when it would have been better to have a nonclustered index—for example, to replace a scan for a seek, or to eliminate a lookup from a query’s execution plan. The missing indexes feature then aggregates these requests, counts how many times they have occurred, calculates the cost of the statement operations that could be improved, and estimates the percentage of that cost that would be eliminated (this percentage is labeled the impact). Think of the missing indexes feature as the database wish list of nonclustered indexes.

There are, however, some caveats and limitations with regard to the missing indexes feature. The recommendations in the output of the missing index dynamic management objects (DMOs) will likely include overlapping (but not duplicate) suggestions. Also, only rowstore nonclustered indexes can be suggested—remember this feature was introduced in SQL Server 2005—so clustered columnstore and other types of indexes won’t be recommended. Finally, recommendations are lost when any data definition language (DDL) changes to the table occur and when the SQL Server instance restarts.

You can look at missing indexes any time, with no performance overhead to the server, by querying a set of DMOs dedicated to this feature. The following query concatenates the CREATE INDEX statement for you according to a simple, self-explanatory naming convention. As you can see from the use of system views, this query is intended to be run in a single database:

SELECT mid.[statement], create_index_statement =
      CONCAT('CREATE NONCLUSTERED INDEX IDX_NC_'
    , TRANSLATE(replace(mid.equality_columns, ' ' ,''), '],[' ,'___')
    , TRANSLATE(replace(mid.inequality_columns, ' ' ,''), '],[' ,'___')
    , ' ON ' , mid.[statement] , ' (' , mid.equality_columns
    , CASE WHEN mid.equality_columns IS NOT NULL
     AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
    , mid.inequality_columns , ')'
    , ' INCLUDE (' + mid.included_columns + ')' )
, migs.unique_compiles, migs.user_seeks, migs.user_scans
, migs.last_user_seek, migs.avg_total_user_cost
, migs.avg_user_impact, mid.equality_columns
, mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
ON mig.index_handle = mid.index_handle
INNER JOIN sys.tables t ON t.object_id = mid.object_id
INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE mid.database_id = DB_ID()
-- count of query compilations that needed this proposed index
--AND       migs.unique_compiles > 10
-- count of query seeks that needed this proposed index
--AND       migs.user_seeks > 10
-- average percentage of cost that could be alleviated with this proposed index
--AND       migs.avg_user_impact > 75
-- Sort by indexes that will have the most impact to the costliest queries
ORDER BY migs.avg_user_impact * migs.avg_total_user_cost desc;

At the bottom of this query is a series of filters that you can use to find only the most-used, highest-value index suggestions. If this query returns hundreds or thousands of rows, consider spending an afternoon combining and crafting indexes to improve the performance of the actual user activity that generated this data.

Some indexes returned by the missing indexes queries might not be worth creating because they have a very low impact or have been part of only one query compilation. Others might overlap each other. For example, you might see these three index suggestions:

CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1
ON dbo.gamelog (Team1)
INCLUDE (GameYear, GameWeek, Team1Score, Team2Score);
CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear
ON dbo.gamelog (Team1, GameWeek, GameYear)
INCLUDE (Team1Score);
CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear_Team2
ON dbo.gamelog (Team1, GameWeek, GameYear, Team2);

You should not create all three of these indexes. Instead, you should combine the indexes you deem useful and worthwhile into a single index that matches the order of the needed key columns and covers all the included columns, as well. Here is the properly combined index suggestion:

CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear_Team2
ON dbo.gamelog (Team1, GameWeek, GameYear, Team2)
INCLUDE (Team1Score, Team2Score);

This last index is a good combination of the previous suggestions. It delivers maximum positive benefit to the most queries and minimizes the negative impact to writes, storage, and maintenance. Note that the key columns list overlaps and is in the correct order for each of the previous index suggestions, and that the INCLUDE columns list also covers all the columns needed in the index suggestions. If a column is in the key of the index, it does not need to exist in the INCLUDE of the index.

However, don’t create this index yet. You should still review existing indexes on the table before creating any missing indexes. Perhaps you can combine a new missing index and an existing index, in the key column list or the INCLUDE column list, further increasing the value of a single index.

After combining missing index suggestions with each other and with existing indexes, you are ready to create the index and see it in action. Remember: Always look at the big picture when creating indexes. Rarely does a single query rise to the level of importance of justifying its own indexes. For example, in SSMS, you will sometimes see text suggesting a missing index for this query, as illustrated in Figure 15-4. (This text will be green on your screen, but appears gray in this book.)


Figure 15-4 In the Execution plan tab, in the header of each execution plan, text starting with “Missing Index” will alert you to the possible impact. Do not create this index on the spot!

This is somewhat valuable, but do not create this index on the spot! Always refer to the complete set of index suggestions and other existing indexes on the table, combining overlapping indexes when possible. Consider the green missing index alert in SSMS as only a flag that indicates you should spend time investigating new missing indexes.

To recap, when creating nonclustered indexes for performance tuning, you should do the following:

  1. Use the missing index DMVs to identify new big-picture nonclustered indexes:

    • Don’t create indexes that will likely only help out a single query; few queries are important enough to deserve their own indexes.

    • Consider nonclustered columnstore indexes for very large rowcount tables where queries often have to scan millions of rows for aggregates. (You can read more on columnstore indexes later in this chapter.)

  2. Combine missing index suggestions, being aware of key order and INCLUDE lists.

  3. Compare new index suggestions with existing indexes; perhaps you can combine them.

  4. Review index usage statistics to verify whether indexes are helping you. (More on the index usage statistics DMV in a moment.)

Understand when missing index suggestions are removed

Missing index suggestions are cleared out for any change to the tables—for example, if you add or remove columns or indexes. Missing index suggestions are also cleared out when the SQL Server service is started and cannot be manually cleared easily. (You can take the database offline and back online, which would clear out the missing index suggestions, but this seems like overkill.)

Logically, make sure the missing index data that you have collected is also based on a significant sample of actual production user activity over time spanning at least one business cycle. Missing index suggestions based on development activity might not be a useful representation of intended application activity, though suggestions based on end-user testing or training could be.

Understand and provide index usage

You’ve added indexes to your tables, and they are used over time, but meanwhile the query patterns of applications and reports change. Columns are added to the database, new tables are added, and although you add new indexes to suit new functionality, how does a database administrator ensure that existing indexes are still worth keeping?

SQL Server tracks this information for you automatically with yet another valuable DMV: sys.dm_db_index_usage_stats. Following is a script that measures index usage within a database, combining sys.dm_db_index_usage_stats with other system views and DMVs to return valuable information. Note that the ORDER BY clause places indexes with the fewest read operations (seeks, scans, lookups) and the most write operations (updates) at the top of the list.

SELECT TableName = sc.name + '.' + o.name, IndexName = i.name
     , s.user_seeks, s.user_scans, s.user_lookups
     , s.user_updates
     , ps.row_count, SizeMb = (ps.in_row_reserved_page_count*8.)/1024.
     , s.last_user_lookup, s.last_user_scan, s.last_user_seek
     , s.last_user_update
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
INNER JOIN sys.objects AS o
    ON o.object_id = i.object_id
INNER JOIN sys.schemas AS sc
    ON sc.schema_id = o.schema_id
INNER JOIN sys.partitions AS pr
    ON pr.object_id = i.object_id AND pr.index_id = i.index_id
INNER JOIN sys.dm_db_partition_stats AS ps
    ON ps.object_id = i.object_id AND ps.partition_id = pr.partition_id
WHERE o.is_ms_shipped = 0
--Don't consider dropping any constraints
AND i.is_unique = 0 AND i.is_primary_key = 0 AND i.is_unique_constraint = 0
--Order by table reads asc, table writes desc
ORDER BY user_seeks + user_scans + user_lookups asc, s.user_updates desc;

Any indexes that rise to the top of the preceding query should be considered for removal or redesign, given the following caveats:

  • Before dropping any indexes, you should ensure you have collected data from the index usage stats DMV that spans at least one complete business cycle. The index usage stats DMV is cleared when the SQL Server service is restarted. You cannot manually clear it. If your applications have week-end and month-end reporting, you might have indexes present and tuned specifically for those critical performance periods.

  • Logically, verify that the index usage data that you have collected is based on actual production user activity. Index usage data based on testing or development activity would not be a useful representation of intended application activity.

  • Note the final WHERE clause that ignores unique constraints and primary keys. Even if a nonclustered index exists and isn’t used, if it is part of the uniqueness of the table, it should not be dropped.

Again, the Query Store can be an invaluable tool to monitor for query regression after indexing changes.

  • For more info on the Query Store, see Chapter 14.

Because this DMV, like many others, is cleared when the SQL Server service restarts, consider a strategy of periodically capturing and storing index usage history in persistent tables.
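A minimal sketch of such a collection, assuming a hypothetical dbo.IndexUsageHistory table and a scheduled job that appends the DMV output with a timestamp:

IF OBJECT_ID('dbo.IndexUsageHistory') IS NULL
    CREATE TABLE dbo.IndexUsageHistory (
        capture_time datetime2(0) NOT NULL,
        database_id int NOT NULL,
        object_id int NOT NULL,
        index_id int NOT NULL,
        user_seeks bigint NOT NULL,
        user_scans bigint NOT NULL,
        user_lookups bigint NOT NULL,
        user_updates bigint NOT NULL
    );

INSERT INTO dbo.IndexUsageHistory
SELECT SYSDATETIME(), database_id, object_id, index_id,
       user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID();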

Server-scoped dynamic management views and functions require VIEW SERVER STATE permission on the server. Database-scoped dynamic management views and functions require VIEW DATABASE STATE permission on the database.

Understand columnstore indexes

Columnstore indexes were first introduced in SQL Server 2012, making a splash in their ability to far outperform clustered and nonclustered indexes for aggregations. They were typically used in the scenario of nightly refreshed data warehouses, but now they have beneficial applications on transactional systems, including on memory-optimized tables.

Since their introduction, the evolution of columnstore indexes has greatly expanded their usefulness:

  • Before SQL Server 2016, the presence of a nonclustered columnstore index made the table read-only. This drawback was removed in SQL Server 2016; now nonclustered columnstore indexes are fully featured and quite useful in a variety of applications aside from nightly refresh databases.

  • Starting with SQL Server 2016 with Service Pack 1, columnstore indexes became available in editions below Enterprise edition (though with a limitation on columnstore memory utilization).

  • Snapshot isolation and columnstore indexes are fully compatible. Before SQL Server 2016, using read-committed snapshots and snapshot isolation levels was not supported with columnstore indexes.

  • You can place a clustered columnstore index on a memory-optimized table, providing the ability to do analytics on millions of rows of live real-time online transactional processing (OLTP) data.

  • Starting with SQL Server 2017, a variety of batch mode features grouped under the intelligent query processing (IQP) label increased performance of queries with columnstore indexes.

    • For more about the suite of features under IQP, see Chapter 14.

  • Starting with SQL Server 2019, using the WITH (ONLINE = ON) syntax is supported for creating and rebuilding columnstore indexes.

These key improvements opened columnstore indexes to be used in transactional systems, when tables with millions of rows are read, resulting in million-row result sets.

Columnstore indexes have two compression levels. In addition to the eponymous default COLUMNSTORE compression level, there is also a COLUMNSTORE_ARCHIVE compression option, which further compresses the data at the cost of more CPU when needing to read or write the data.

As with rowstore indexes, you can compress each partition of a table’s columnstore index differently. Consider applying COLUMNSTORE_ARCHIVE compression to partitions of data that are old and rarely accessed. You can change the data compression option for rowstore and columnstore indexes by using the index rebuild operation via the DATA_COMPRESSION option. For columnstore indexes, you can specify the COLUMNSTORE or COLUMNSTORE_ARCHIVE options, whereas for rowstore indexes, you can use the NONE, ROW, and PAGE compression options.
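For example, assuming a hypothetical partitioned fact table with a clustered columnstore index, you could rebuild only its oldest partition with archival compression:

ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD PARTITION = 1
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);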

You cannot change the compression option of a rowstore index to either of the columnstore options, or vice versa. Instead, you must build a new index of the desired type.

The sp_estimate_data_compression_savings system stored procedure can be used to estimate the size differences between the compression options. Starting with SQL Server 2019, this stored procedure includes estimates for the two columnstore compression options. In SQL Server 2022, sp_estimate_data_compression_savings can be used to estimate savings for XML compression as well.

  • See the “XML compression” section at the end of the chapter for more information.
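For example, estimating the savings of archival compression on a hypothetical fact table might look like this:

EXEC sys.sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactSales',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'COLUMNSTORE_ARCHIVE';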

Note

There is currently a three-way incompatibility between the sp_estimate_data_compression_savings system stored procedure, columnstore indexes, and the memory-optimized tempdb metadata feature introduced in SQL Server 2019. You cannot use sp_estimate_data_compression_savings with columnstore indexes if the memory-optimized tempdb metadata feature is enabled. For more information visit https://learn.microsoft.com/sql/relational-databases/system-stored-procedures/sp-estimate-data-compression-savings-transact-sql.

Design columnstore indexes

Columnstore indexes don’t use a B+ tree; instead, they contain highly compressed data (on disk and in memory), stored in a different architecture from the traditional clustered and nonclustered indexes. They are the standard for storing and querying large data warehouse fact tables. Unlike rowstore indexes, there is no “key” or “include” of a columnstore index, only a set of columns that are part of the index. For non-ordered columnstore indexes (the default), the order of the columns in the index definition does not matter. You can create “clustered” or “nonclustered” columnstore indexes, though this terminology is used more to indicate what role the columnstore index is serving, not what it resembles behind the scenes.

Clustered columnstore indexes do not change the physical structure of the table like rowstore clustered indexes. Ordered clustered columnstore indexes sort the existing data in memory by the order key(s) before the index builder compresses them into index segments. Any sorted data overlap is reduced, which allows for better data elimination when querying, and therefore faster performance because the amount of data read from disk is smaller. If all data can be sorted in memory at once, then data overlap can be avoided. Owing to the size of tables in data warehouses, this scenario doesn’t happen often.
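In SQL Server 2022, you specify the ordering column(s) when creating the clustered columnstore index. A sketch, assuming a hypothetical fact table ordered by its date key:

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales
ORDER (OrderDateKey);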

You can also create nonclustered rowstore indexes on tables with a clustered columnstore index, which is potentially useful to enforce uniqueness, and support single row fetching, deleting, and updating. Columnstore indexes cannot be unique, and so cannot replace the table’s unique constraint or primary key. Clustered columnstore indexes might also perform poorly when a table receives updates, so consider the workload for a table before adding a clustered columnstore. Tables that are only ever inserted into, but never updated or deleted from, would be an ideal candidate for a clustered columnstore index.

You can combine nonclustered rowstore and nonclustered columnstore indexes on the same table, but you can have only one columnstore index on a table, including clustered and nonclustered columnstore indexes. You can even create nonclustered rowstore and nonclustered columnstore indexes on the same columns. Perhaps you create both because you want to filter on the column value in one set of queries, and aggregate in another. Or perhaps you create both only temporarily, for comparison.

You can also create nonclustered columnstore indexes on indexed views—another avenue to quick-updating analytical data. The stipulations and limitations regarding indexed views apply, but you would create a unique rowstore clustered index on the view, then a nonclustered columnstore index on the view.

You can add columns in any order to satisfy many different queries, greatly increasing the versatility of the columnstore index in your table. This is because the order of the data is not important to how columnstore indexes work.

The size of the columns in a columnstore index, however, could make a big difference in performance. Each column in a columnstore index should be limited to 8,000 bytes for best performance. Data larger than 8,000 bytes in a row is compressed separately, outside the columnstore compressed row group, requiring more decompression to access the complete row.

While the large object data types varchar(max), nvarchar(max), and varbinary(max) are supported in the key of a columnstore index, they are not stored in-line with the rest of the compressed data, but rather outside the columnstore structure. These data types are not recommended in columnstore indexes. Starting with SQL Server 2017, they are supported, but still not recommended.

The optimal data types for columns in a columnstore index are data types that can be stored in an 8-byte integer value, such as integers and some date- and time-based data types. They allow you to use what is known as segment elimination. If your query includes a filter on such values, SQL Server can issue reads to only those segments in row groups that contain data within a range, eliminating ranges of data outside of the request. The data is not sorted, so segment elimination might not help all filtering queries, but the more selective the value, the more likely it will be useful. Other data types with a size less than 8,000 bytes will still compress and be useful for aggregations, but should generally not appear as a filter, because the entire table will always need to be scanned unless you add a nonclustered index.

Understand batch mode

Batch mode is one of the existing features lumped into the intelligent query processing (IQP) umbrella—a collection of performance improvements. Batch mode has actually been around since SQL Server 2014. Like many other features listed under IQP, batch mode can benefit workloads automatically and without requiring code changes.

Batch mode is useful in the following scenarios:

  • Analytical queries. Usually, these queries use operators like joins or aggregates that process hundreds of thousands of rows or more.

  • Your workload is CPU bound. (If your workload is instead I/O bound, you should consider a columnstore index.)

  • Heavy OLTP workload. You may decide that creating a columnstore index adds too much overhead to your heavy transactional workload, and your analytical workload is not as important.

  • Support. Your application depends on a feature that columnstore indexes don’t yet support.

Batch mode uses heuristics to determine whether it can move data into memory to be processed in a batch rather than row by row. Columnstore indexes reduce I/O and batch processing reduces CPU usage, ultimately speeding up the query.

You’ll see Batch (instead of the default Row) in the Actual Execution Mode of an execution plan operator when this faster method is in use. Batch mode processing appears in the form of batch mode operators in execution plans, and benefits queries that process millions of rows or more.

  • For more information on IQP features, see Chapter 14.

Initially, only queries on columnstore indexes could benefit from batch mode operators. Starting with compatibility level 150 (SQL Server 2019), however, batch mode for analytic workloads became available outside of columnstore indexes. The query processor might decide to use batch mode operators for queries on heaps and rowstore indexes. No changes are needed for your code to benefit from batch mode on rowstore objects, as long as you are in compatibility level 150 or higher.
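Because batch mode on rowstore is tied to the database compatibility level, verifying it (and raising it, if appropriate) is a simple check. For example, for the WideWorldImporters sample database:

SELECT name, compatibility_level FROM sys.databases WHERE name = 'WideWorldImporters';

ALTER DATABASE WideWorldImporters SET COMPATIBILITY_LEVEL = 160; -- SQL Server 2022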

Note

Batch mode has not yet been extended to in-memory OLTP tables or other types of indexes like full-text, spatial, or XML indexes. Batch mode will also not occur on sparse and XML columns.

Understand the deltastore of columnstore indexes

The columnstore deltastore is an ephemeral location where changed data is stored in a clustered B+ tree rowstore format. When certain thresholds of inserted rows are reached, typically 1,048,576 rows, or when a columnstore index is rebuilt or reorganized, a group of deltastore data is “closed.” Then, via a background thread called the tuple mover, the deltastore rowgroup is compressed into the columnstore. The number of rows SQL Server compresses into a rowgroup might be smaller under memory pressure, which happens dynamically.

The COMPRESSION_DELAY option for both nonclustered and clustered columnstore indexes has to do with how long it takes changed data to be written from the deltastore to the highly compressed columnstore.

The COMPRESSION_DELAY option does not affect the 1,048,576 number, but rather how long it takes SQL Server to move the data into the columnstore. If you set the COMPRESSION_DELAY option to 10 minutes, data will remain in the deltastore for at least an extra 10 minutes before SQL Server compresses it. The advantage of data remaining in the deltastore, delaying its eventual compression, could be noticeable on tables that continue to be updated and deleted. Updates and deletes are typically very resource intensive on columnstore indexes. Delete operations in the columnstore are “soft” deleted, marked as removed, and then eventually cleaned out during index maintenance. Updates are actually processed as deletes and inserts into the deltastore.

The advantage of COMPRESSION_DELAY is noticeable for some write workloads, but not all. If the table is only ever inserted into, COMPRESSION_DELAY doesn’t really help. But if a block of recent data is updated and deleted for a period before finally settling in after a time, implementing COMPRESSION_DELAY can speed up the write transactions to the data and reduce the maintenance and storage footprint of the columnstore index.

Changing the COMPRESSION_DELAY setting of the index, unlike many other index settings, does not require a rebuild of the index, and you can change it at any time. For example:

ALTER INDEX [NCCX_Sales_InvoiceLines]
    ON [Sales].[InvoiceLines]
    SET (COMPRESSION_DELAY = 10 MINUTES);

SQL Server can ignore the deltastore for inserts when you insert data in large amounts. This is called bulk loading in Microsoft documentation, but is not related to the BULK INSERT command. When you insert a large amount of data into a table with a columnstore index, SQL Server bypasses the deltastore. This increases the speed of the insert, makes the data immediately available for analytical queries, and reduces the amount of logged activity in the user database transaction log.

Under ideal circumstances, the best number of rows to insert in a single statement is 1,048,576, which creates a complete, compressed columnstore rowgroup. The number of rows that triggers a bulk load is between 102,400 and 1,048,576 rows, depending on memory. If you specify TABLOCK in the INSERT statement, bulk loading data into the columnstore occurs in parallel. The typical caveat about TABLOCK applies here: the table lock might block other operations while the parallel insert runs.
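As a hedged illustration, the following sketch loads from a hypothetical staging table into a hypothetical fact table that has a columnstore index; a batch of at least 102,400 rows can bypass the deltastore, and TABLOCK allows the load to proceed in parallel:

--dbo.FactSale and dbo.FactSale_Staging are hypothetical tables
INSERT INTO dbo.FactSale WITH (TABLOCK)
    (SaleKey, CustomerKey, Quantity, TotalExcludingTax)
SELECT SaleKey, CustomerKey, Quantity, TotalExcludingTax
FROM dbo.FactSale_Staging;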

Demonstrate the power of columnstore indexes

To demonstrate the power of this fully operational columnstore index, let’s review a scenario in which more than 14 million rows are added to the WideWorldImporters.Sales.InvoiceLines table. About half of the rows in the table now contain InvoiceID = 69776.

To demonstrate, start by restoring a fresh copy of the sample WideWorldImporters database from https://learn.microsoft.com/sql/samples/wide-world-importers-oltp-install-configure.

The following sample script drops the existing WideWorldImporters-provided nonclustered columnstore index and adds a new rowstore nonclustered index for comparison. Remember that InvoiceID = 69776 accounts for roughly half the table, so this isn't a "needle in a haystack" situation; even though the plan shows a seek operator, it must read a large range of the index to return the data. If the query could use a selective seek, the nonclustered rowstore index would likely be better. When the query must scan large ranges, columnstore is king.

USE WideWorldImporters;
GO
-- Fill haystack with 3+ million rows
INSERT INTO Sales.InvoiceLines (InvoiceLineID, InvoiceID
, StockItemID, Description, PackageTypeID, Quantity
, UnitPrice, TaxRate, TaxAmount, LineProfit, ExtendedPrice
, LastEditedBy, LastEditedWhen)
SELECT InvoiceLineID = NEXT VALUE FOR [Sequences].[InvoiceLineID]
, InvoiceID, StockItemID, Description, PackageTypeID, Quantity
, UnitPrice, TaxRate, TaxAmount, LineProfit
, ExtendedPrice, LastEditedBy, LastEditedWhen
FROM Sales.InvoiceLines;
GO 3 --Runs the above three times
-- Insert millions of records for InvoiceID 69776
INSERT INTO Sales.InvoiceLines (InvoiceLineID, InvoiceID
, StockItemID, Description, PackageTypeID, Quantity
, UnitPrice, TaxRate, TaxAmount, LineProfit, ExtendedPrice
, LastEditedBy, LastEditedWhen)
SELECT InvoiceLineID = NEXT VALUE FOR [Sequences].[InvoiceLineID]
, 69776, StockItemID, Description, PackageTypeID, Quantity
, UnitPrice, TaxRate, TaxAmount, LineProfit
, ExtendedPrice, LastEditedBy, LastEditedWhen
FROM Sales.InvoiceLines;
GO
--Clear cache, drop other indexes to only test our comparison scenario
DBCC FREEPROCCACHE
DROP INDEX IF EXISTS [NCCX_Sales_InvoiceLines]
ON [Sales].[InvoiceLines];
DROP INDEX IF EXISTS IDX_NC_InvoiceLines_InvoiceID_StockItemID_Quantity
ON [Sales].[InvoiceLines];
DROP INDEX IF EXISTS IDX_CS_InvoiceLines_InvoiceID_StockItemID_Quantity
ON [Sales].[InvoiceLines];
GO
--Create a rowstore nonclustered index for comparison
CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_Quantity
     ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, Quantity);
GO

Now that the data is loaded, you can perform the query again for testing. (See Figure 15-5.) Note we are using the STATISTICS TIME option to measure both CPU and total duration. (Remember to enable the actual query plan before executing the query.)

SET STATISTICS TIME ON;
SELECT il.StockItemID, AvgQuantity = AVG(il.quantity)
FROM [Sales].[InvoiceLines] AS il
WHERE il.InvoiceID = 69776 --1.8 million records
GROUP BY il.StockItemID;
SET STATISTICS TIME OFF;

Figure 15-5 The execution plan of our sample query starts with an Index Seek (NonClustered) on the rowstore nonclustered index. Note the large amount of parallelism operators along the way.

The sample query on InvoiceID 69776 had to work through 1.8 million records and returned 227 rows. With the rowstore nonclustered index, the query costs 4.52 and completes in 844 ms of CPU time (due to parallelism) and 194 ms of total time. (These durations will vary from system to system; the lab environment was a four-core Intel processor with Hyper-Threading enabled.)

Now, let’s create a columnstore index, and watch our analytical-scale query benefit.

Note

Starting with SQL Server 2019, using the WITH (ONLINE = ON) syntax is supported for creating and rebuilding columnstore indexes.

--Create a columnstore nonclustered index for comparison
CREATE NONCLUSTERED COLUMNSTORE INDEX IDX_CS_InvoiceLines_InvoiceID_StockItemID_Quantity
    ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, Quantity) WITH (ONLINE = ON);
GO

Perform the query again for testing. Note again we are using the STATISTICS TIME option to measure both CPU and total duration. (See Figure 15-6.)

SET STATISTICS TIME ON;
--Run the same query as above, but now it will use the columnstore
SELECT il.StockItemID, AvgQuantity = AVG(il.quantity)
FROM [Sales].[InvoiceLines] AS il
WHERE il.InvoiceID = 69776 --1.8 million records
GROUP BY il.StockItemID;
SET STATISTICS TIME OFF;

Figure 15-6 The execution plan of our sample query, now starting with a Columnstore Index Scan (NonClustered).

The sample query on InvoiceID 69776 still returns 227 rows. With the benefit of the columnstore nonclustered index, however, the query costs 1.54 and completes in 47 ms of CPU time and 160 ms of total time. This is a small but telling sample of the power of a columnstore index on analytical-scale queries.

Understand indexes in memory-optimized tables

Memory-optimized tables deliver performance that is far less bound by I/O constraints, providing high-performance, latchless writes. Memory-optimized tables don't use the locking mechanics of pessimistic concurrency, as discussed in Chapter 14. Data rows for memory-optimized tables are not stored in data pages, so the page-based concepts of disk-based tables do not apply.

Memory-optimized tables can have two types of indexes: hash and nonclustered. Nonclustered indexes behave similarly on memory-optimized tables as they do on disk-based tables, whereas hash indexes are better suited to high-performance seeks for individual records. You must create at least one index on a memory-optimized table. Either type of index can be the structure behind the primary key of the table, if you want both the data and the schema to be durable.

Indexes on memory-optimized tables are never durable and are rebuilt whenever the database comes online. The schema of the memory-optimized table is always durable; however, you can choose to have only the schema of the table be durable, not the data. This has utility in certain scenarios, such as a staging table that receives data before it is moved to a durable disk-based or memory-optimized table, but be aware of the potential for data loss. If only the schema of the memory-optimized table is durable (DURABILITY = SCHEMA_ONLY in the CREATE TABLE statement), you do not need to declare a primary key, but you must still define at least one index. Our suggestion is that only in very rare situations should a table not have a primary key constraint, no matter the durability.
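For example, here is a minimal sketch of a schema-only memory-optimized staging table (the table and column names are hypothetical). Even without a primary key, at least one index must be declared inline:

CREATE TABLE dbo.SensorReadingStaging
(
    SensorID int NOT NULL,
    ReadingTime datetime2(3) NOT NULL,
    ReadingValue decimal(10, 2) NOT NULL,
    INDEX IX_SensorReadingStaging_SensorID NONCLUSTERED (SensorID)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);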

Adding indexes to memory-optimized tables increases the amount of server memory needed. There is otherwise no limit to the size of memory-optimized tables in Enterprise edition; however, in Standard edition, you are limited to 32 GB of memory-optimized tables per database.

Earlier versions of SQL Server put a cap on the number of indexes on a memory-optimized table. That cap was raised from 8 to 999 in SQL Server 2017.

Although there is no concept of a rowstore clustered index in memory-optimized tables, you can add a clustered columnstore index to a memory-optimized table, dramatically improving your ability to perform analytical scale queries on the data even as it is rapidly inserted. Because columnstore indexes cannot be unique, they cannot serve as the primary key for a memory-optimized table.

Let’s go over the basics of using hash and nonclustered indexes on memory-optimized tables.

Understand hash indexes for memory-optimized tables

Memory-optimized hash indexes are an alternative to the typical B+ tree internal architecture for index data storage. Hash indexes are best for queries that look for the needle in the haystack, and are especially effective when matching exact values, but they are not effective at range lookups or queries with an ORDER BY.

One other limitation of hash indexes is that they are generally not useful if you don't query all the columns in their key: the WHERE clause must seek on each column in the hash index's key. When a query references only some of the key columns, or even just the first column, hash indexes do not perform well. This differs from B+ tree-based nonclustered indexes, which perform fine if only the first column of the index's key is queried.

Hash indexes are currently available only for memory-optimized tables, not disk-based tables. You can declare them by using the UNIQUE keyword, but they default to a non-unique key. You can create more than one hash index.

There is an additional unique consideration for creating hash indexes. Estimating the best number for the BUCKET_COUNT parameter can have a significant impact. The number should be as close as possible to the number of unique key values that are expected. BUCKET_COUNT should be between 1 and 2 times this number.

Hash indexes always use the same amount of space for a given bucket count, regardless of the number of rows they contain. For example, if you expect the table to have 100,000 unique key values, the ideal BUCKET_COUNT value would be between 100,000 and 200,000.
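A minimal sketch of a durable memory-optimized table sized for roughly 100,000 unique keys might look like the following (the table and column names are hypothetical):

CREATE TABLE dbo.Transactions
(
    TransactionID bigint NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 200000),
    AccountID int NOT NULL,
    Amount money NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
--BUCKET_COUNT is rounded up to the next power of two (here, 262,144)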

Having too many or too few buckets in a hash index can result in poor performance. Too many buckets increases the amount of memory needed and leaves many buckets empty. Too few buckets means many key values share a bucket, resulting in longer row chains that queries must traverse to find the same information.

Ideally, a hash index is declared unique. Hash indexes work best when the key values are unique or at least highly distinct. If the ratio of total rows to unique key values is too high, a hash index is not recommended and will perform poorly. Microsoft recommends a threshold of less than 10 rows per unique value for an effective hash index. If you have data with many duplicate values, consider a nonclustered index instead.

You should periodically and proactively compare the number of unique key values to the total number of rows in the table. It is better to overestimate the number of buckets. You can change the number of buckets in a memory-optimized hash index by using the ALTER TABLE/ALTER INDEX/REBUILD commands. For example:

ALTER TABLE [dbo].[Transactions]
ALTER INDEX [IDX_NC_H_Transactions_1]
REBUILD WITH (BUCKET_COUNT = 524288);
--BUCKET_COUNT always rounds up to the nearest power of two
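To decide whether the current bucket count is appropriate, you can review empty bucket counts and chain lengths with the hash index statistics DMV. A minimal sketch:

SELECT t.name AS table_name, i.name AS index_name,
    his.total_bucket_count, his.empty_bucket_count,
    his.avg_chain_length, his.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS his
INNER JOIN sys.indexes AS i
    ON i.object_id = his.object_id AND i.index_id = his.index_id
INNER JOIN sys.tables AS t
    ON t.object_id = his.object_id;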

Understand nonclustered indexes for memory-optimized tables

Nonclustered indexes behave similarly on memory-optimized tables to how they behave on disk-based tables. Instead of a B+ tree like a rowstore, disk-based nonclustered index, however, they are a variant of the B-tree structure called the Bw-tree, which does not use locks or latches. They outperform hash indexes for queries that sort on the key value(s) of the index, or when the index must be range scanned. Further, if a query does not reference all the columns in a hash index's key, a nonclustered index is generally the more useful choice.

You can declare nonclustered indexes on memory-optimized tables unique. However, the CREATE INDEX syntax is not supported. You must use the ALTER TABLE/ADD INDEX commands or include them in the CREATE TABLE script.
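For example, here is a minimal sketch that adds a nonclustered index to the hypothetical memory-optimized table used earlier in this section:

ALTER TABLE dbo.Transactions
    ADD INDEX IX_Transactions_AccountID NONCLUSTERED (AccountID);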

Neither hash indexes nor nonclustered indexes can serve queries on memory-optimized tables for which the keys are sorted in the reverse order from how they are defined in the index. These types of queries simply can’t currently be serviced efficiently from memory-optimized indexes. If you need another sort order, you need to add the same B-tree index with a different sort order.

Remember: You can also add a clustered columnstore index to a memory-optimized table, dramatically improving your ability to analyze fast-changing data.

Understand index statistics

When we talk about statistics in SQL Server, we do not mean the term generically. Statistics on one or more columns in tables and views are created as needed by the Query Optimizer to describe the distribution of data within indexes and heaps.

Statistics are important to the Query Optimizer to help it make query plan decisions, and they are heavily involved in the concept of cardinality estimation. The SQL Server cardinality estimator (CE) provides accurate estimations of the number of rows that queries will return—a big part of producing query plans.

  • For more on the performance impact of statistics on cardinality estimation, see Chapter 14.

Making sure statistics are available and up to date is essential for choosing a well-performing query plan. Stale statistics that have evaded updates for too long contain information that is quite different from the current state of the table and will likely cause poor execution plans.

There are several options in each database regarding statistics maintenance. Chapter 4, “Install and configure SQL Server instances and features,” reviews some of these, but we present them again here in the context of understanding how indexes are used for performance tuning.

Automatically create and update statistics

Most statistics needed for describing data to SQL Server are created automatically, because the AUTO_CREATE_STATISTICS database option is enabled by default. This results in automatically created statistics objects that follow the _WA_Sys_<column_name>_<object_id_hex> naming convention.

When the AUTO_CREATE_STATISTICS database option is enabled, SQL Server can create single-column statistics objects based on query need. These can make a big difference in performance. You can determine that a statistics object was created by the AUTO_CREATE_STATISTICS = ON behavior because it will have the name prefix _WA. You can also use the following query:

SELECT *
FROM sys.stats
WHERE auto_created = 1;

The behavior that creates statistics for indexes (with a matching name) happens automatically, regardless of the AUTO_CREATE_STATISTICS database option.

Statistics are not automatically created for columnstore indexes. Instead, statistics objects that exist on the heap or the clustered index of the table are used. As with any index, a statistics object of the same name is created; however, for columnstore indexes it is blank, and in place for logistical reasons only. This statistics object is actually populated on the fly and not persisted in storage.

As you can imagine, statistics must also be kept up to date with the data in the table. SQL Server has an option in each database for AUTO_UPDATE_STATISTICS, which is ON by default and should almost always remain on.

You should only ever disable both AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS when requested by highly complex application designs, with variable schema usage, and a separate regular process that creates and maintains statistics, such as Microsoft SharePoint. On-premises SharePoint installations include a set of stored procedures that periodically run to create and update the statistics objects for the wide, dynamically assigned table structures within. If you have not designed your application to intelligently create and update statistics using a separate process from that of the SQL Server engine, we recommend that you never disable these options.

Manually create statistics for on-disk tables

You can also create statistics manually during troubleshooting or performance tuning by using the CREATE STATISTICS statement.

Consider manually creating statistics for large tables, and with design principles similar to how nonclustered indexes should be created. The order of the keys in statistics does matter, and you should choose columns that are regularly queried together to provide the most value to queries.

When venturing into creating your own statistics objects, consider using filtered statistics, which can also be helpful if you are trying to carry out advanced performance tuning on queries with a static filter or specific range of data. Like filtered indexes (covered earlier in this chapter) or even filtered views and queries, you can create statistics with a similar WHERE clause. Filtered statistics are never automatically created.
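For example, the following sketch creates a multi-column, filtered statistics object; the column choice and filter value here are illustrative only:

CREATE STATISTICS ST_InvoiceLines_Recent
    ON Sales.InvoiceLines (InvoiceID, StockItemID)
    WHERE InvoiceID > 60000
    WITH FULLSCAN;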

  • For more information on maintaining index statistics, see Chapter 8.

Understand statistics on memory-optimized tables

Statistics are created and updated automatically on memory-optimized tables, and serve the same role as they do for on-disk structures. Memory-optimized tables require at least one index to be created, and a matching statistics object is created for that index object.

It is recommended to always create memory-optimized tables in databases with the highest compatibility level. Statistics for memory-optimized tables are not automatically updated if the database compatibility level was below 130 when the tables were created. If a memory-optimized table was created under the SQL Server 2014 (12.0) compatibility level, you must manually update the statistics object using the UPDATE STATISTICS command. Then, if the AUTO_UPDATE_STATISTICS database option is enabled, statistics will update as normal.

Understand statistics on external tables

You can also create statistics on external tables—that is, tables that do not exist in the SQL Server database but instead are transparent references to data stored in Azure Blob Storage via PolyBase.

You can create statistics on external tables, but currently, you cannot update them. Creating the statistics object involves temporarily copying the external data into the SQL Server database and then calculating the statistics. To update statistics for these datasets, you must drop and re-create them. Because of the data sizes typically involved with external tables, using the FULLSCAN method to update statistics is not recommended.
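For example, a minimal sketch against a hypothetical external table, using a sample instead of a full scan:

--ext.SalesArchive is a hypothetical external table
CREATE STATISTICS ST_SalesArchive_OrderDateKey
    ON ext.SalesArchive (OrderDateKey)
    WITH SAMPLE 5 PERCENT;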

  • For more on external tables, see Chapter 7.

Understand other types of indexes

There are other types of indexes that you should be aware of, each with specific, limited uses for certain SQL Server features—for example, the Full-Text Search engine, spatial data types, and the xml data type. Though not nearly as common as other types of indexes mentioned in this chapter, they have powerful specific uses.

Understand full-text indexes

If you have chosen to install the SQL Server Full-Text Search feature, you can take advantage of the full-text service (fdhost.exe) to query vast amounts of data using special full-text syntax, looking for word forms, phrases, thesaurus lookups, word proximity, and more. (You can of course choose to add the feature via SQL Server Setup if it is not already installed.)

Because full-text indexes have specific uses for particular architectures and applications, we won't spend much time on them here. Developers might take advantage of the powerful full-text predicates CONTAINS and FREETEXT.

By design, full-text indexes require a unique nonclustered or clustered rowstore index on the table in which they are created, with a single column in the key. We recommend that you give this index an integer key for performance reasons, such as an identity column. Full-text indexes are usually placed on varchar or nvarchar columns, often with large lengths, but you can also place them on xml and varbinary columns.

It is important to understand the options for keeping the full-text index up to date. A full population of the full-text index is effective but consumes more resources than an incremental load strategy. If the table receives writes frequently, you should consider the two possible incremental load strategies. By default, the CHANGE_TRACKING AUTO option enables the change tracking feature on the table that hosts the full-text index, which allows changes to propagate into the full-text index. This asynchronously keeps the full-text data synchronized with minimal overhead. If enabling change tracking is not an option for the table, you can instead add (or use an existing) column with the timestamp data type and then periodically update the full-text index by starting an incremental population. Consider both strategies, along with your requirements for how current the full-text index must be. Both are superior to frequent full populations.
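As a hedged sketch, the following creates a full-text index on the WideWorldImporters Application.People table with automatic change tracking and then queries it with CONTAINS. The catalog name is an assumption, and PK_Application_People is assumed to be the table's single-column unique key index:

CREATE FULLTEXT CATALOG FTCatalog AS DEFAULT; --assumed catalog name
GO
CREATE FULLTEXT INDEX ON Application.People (FullName)
    KEY INDEX PK_Application_People
    ON FTCatalog
    WITH CHANGE_TRACKING AUTO;
GO
SELECT PersonID, FullName
FROM Application.People
WHERE CONTAINS(FullName, N'"Kayla" OR "Kim*"');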

Understand spatial indexes

A spatial index is a special B-tree index that uses a suite of special code and geometry methods to assist in performing spatial and geometry calculations. Developers can use these data structures for non-Euclidean geometry calculations, distance and area calculations on spheres, and more. Spatial indexes can improve the performance of queries with spatial operations.

You can create these indexes only on columns that use the spatial data types geometry or geography, and you can create different types of indexes on the same spatial column to serve different calculations. To create a spatial index, the table must already have a primary key.

You create spatial indexes by using bounding boxes or tessellation schemes for the geometry and geography data types. Consult the developers' intended use of spatial data, along with the documentation at https://learn.microsoft.com/sql/relational-databases/spatial/spatial-indexes-overview, when creating these indexes.
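As a minimal sketch, assuming a hypothetical dbo.Cities table that already has a primary key and a geography column named Location:

CREATE SPATIAL INDEX SIDX_Cities_Location
    ON dbo.Cities (Location)
    USING GEOGRAPHY_AUTO_GRID;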

Understand XML indexes

XML indexes are created for much the same benefit as nonclustered indexes. You use them to prevent the runtime shredding of XML files each time they are accessed, and to instead provide a persistent row set of the XML data’s tags, values, and paths.

Because the xml data type is stored as a BLOB and has an upper limit of 2 GB of data per row, XML data can be massive, and XML indexes can be extremely beneficial to reads. Like nonclustered indexes, they also incur an overhead to writes.

Primary XML indexes prevent the on-demand shredding of the data by providing a reference to the tags, values, and paths. On large XML documents, this can be a major performance improvement.

Secondary XML indexes enhance the performance of primary XML indexes. They are created on either path, value, or property data in the primary XML index and benefit a read workload that heavily uses one of those three methods of querying XML data.
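As a hedged sketch, assuming a hypothetical dbo.PurchaseOrders table with a clustered primary key and an xml column named OrderDetails:

CREATE PRIMARY XML INDEX PXML_PurchaseOrders_OrderDetails
    ON dbo.PurchaseOrders (OrderDetails);
GO
--A secondary XML index targeting path-based queries
CREATE XML INDEX SXML_PurchaseOrders_OrderDetails_Path
    ON dbo.PurchaseOrders (OrderDetails)
    USING XML INDEX PXML_PurchaseOrders_OrderDetails FOR PATH;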

XML compression

SQL Server 2022 introduces XML compression, which can compress off-row XML data for both XML columns and XML indexes, providing a much-needed improvement when storing XML data in SQL Server.

When you enable XML compression, the underlying storage is changed to a compressed binary format, but this doesn’t change how you query the data, and it doesn’t require any application changes.

Although XML compression works in a similar way to data compression, it only affects the xml data type and associated XML indexes. You can run data compression and XML compression side by side on the same tables, but it must be enabled explicitly on primary and secondary XML indexes.
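For example, a hedged sketch enabling XML compression on the hypothetical dbo.PurchaseOrders table and its primary XML index from the earlier sketch:

ALTER TABLE dbo.PurchaseOrders
    REBUILD WITH (XML_COMPRESSION = ON);
GO
ALTER INDEX PXML_PurchaseOrders_OrderDetails
    ON dbo.PurchaseOrders
    REBUILD WITH (XML_COMPRESSION = ON);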

Space saving with XML compression

The similarity with data compression extends to two system objects: the sp_estimate_data_compression_savings system stored procedure and the sys.dm_db_index_physical_stats dynamic management function.

You can use sp_estimate_data_compression_savings to estimate how much space you will save by compressing XML columns and indexes. In SQL Server 2022, this stored procedure takes a new parameter, @xml_compression, which is a Boolean value. You can assign it a bit value of 1, 0, or NULL (the default).
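For example, a hedged sketch estimating XML compression savings for the hypothetical dbo.PurchaseOrders table across all of its indexes:

EXEC sys.sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'PurchaseOrders',
    @index_id = NULL,          --NULL checks all indexes on the table
    @partition_number = NULL,  --NULL checks all partitions
    @data_compression = 'NONE',
    @xml_compression = 1;      --estimate savings from XML compression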

To find out how much space you have saved once you have compressed your XML data, use sys.dm_db_index_physical_stats. The following query returns the average row size in bytes, the number of records, and the number of data pages occupied to show you the benefits of compression:

SELECT o.name,
    ips.partition_number,
    ips.index_type_desc,
    ips.record_count, ips.avg_record_size_in_bytes,
    ips.min_record_size_in_bytes,
    ips.max_record_size_in_bytes,
    ips.page_count, ips.compressed_page_count
FROM sys.dm_db_index_physical_stats ( DB_ID(), NULL, NULL, NULL, 'DETAILED') ips
JOIN sys.objects o ON o.object_id = ips.object_id
ORDER BY record_count DESC;