Chapter 15

Delivering a SQL Server Health Check

WHAT’S IN THIS CHAPTER?

  • Understanding the importance of collecting diagnostic and baseline information
  • Using instance level diagnostic queries
  • Understanding how Microsoft releases updates for SQL Server
  • Using database level diagnostic queries

WROX.COM CODE DOWNLOADS FOR THIS CHAPTER

The wrox.com code downloads for this chapter are found at www.wrox.com/remtitle.cgi?isbn=1118177657 on the Download Code tab. The code is in the Ch15HealthCheck.sql download.

THE IMPORTANCE OF A SQL SERVER HEALTH CHECK

One of the first things you should do with a new or unfamiliar database server is collect as much information as possible about that server — from the details of the hardware and storage subsystem, to the operating system, up to the SQL Server instance itself. You need to know what you are dealing with — whether it is a poorly configured, “ancient” server with a completely inadequate storage subsystem, or, hopefully something much better. This information is a critical starting point for focusing your efforts to properly manage and optimize your database servers. As a database professional, there is really no excuse for not knowing the hardware and configuration details about each of your database servers.

This chapter walks you through a set of queries that you can use for your SQL Server health check, explaining what they mean and how to interpret the results. These queries start at the hardware and instance level, and then enable you to drill down to a particular database to gather more specific information about it.

One roadblock that many database administrators face in collecting this type of information is the bureaucracy within their company or organization. Quite often, someone else in a different department is in charge of provisioning and managing the actual database server hardware and operating system on the database servers. This person is usually a system administrator or system engineer. In larger organizations, a SAN administrator is often in charge of the storage subsystems. These other people, who often have different priorities than you, and usually have relatively little knowledge of SQL Server, can be big obstacles in your quest to gather important information about your database servers. They may view your information-gathering efforts as an unwelcome invasion of their territory and area of expertise, and thus may be reluctant to cooperate with you.

I have often asked other DBAs to tell me how a particular database server is configured and what type of hardware and storage it is using, only to get vague and nearly useless answers that do not help me evaluate the server or solve a performance problem. Many DBAs are not allowed to actually log on to the desktop of their database servers, and are completely at the mercy of their system administrator and SAN administrator for all server and storage configuration and management. Because of this, many DBAs are completely in the dark about anything they cannot easily see from SQL Server Management Studio (SSMS), and they have very little information about the details of their server hardware and storage subsystems. I think this type of organizational policy is a huge mistake, as it greatly hinders the overall effectiveness of database administrators. If you are faced with this type of situation as a DBA, you should make every effort to improve matters by reaching out to your counterparts in other departments to explain what needs to be done and why it is important. Policies can often be changed, so don’t just accept the situation!

However, regardless of any bureaucratic or organizational obstacles in your way, you can still use techniques from within SQL Server Management Studio (SSMS) to gather most of what you need to do a relatively comprehensive SQL Server health check. One of the most useful and easy to use techniques is to run a standard set of dynamic management view (DMV) and dynamic management function (DMF) queries to gather health-check information about your servers, instances, and databases.

RUNNING DMV AND DMF QUERIES

Since they were first introduced in SQL Server 2005, DMVs and DMFs have been useful tools for easily gathering a wealth of valuable information about your hardware, a SQL Server instance, or individual databases in an instance. With each new version of SQL Server, enhancements have been made to these DMVs and DMFs that increase their utility. SQL Server 2012 is no exception, and offers a number of completely new DMVs and DMFs that you can take advantage of during a SQL Server health check.

This set of queries is designed to be run on SQL Server 2012. Most of the queries will also work on SQL Server 2008 R2 Service Pack 1 or later, but some will not because Microsoft made a few late-breaking changes to some DMVs in SQL Server 2012. Some of the queries will also run on SQL Server 2005, SQL Server 2008, and pre-SP1 builds of SQL Server 2008 R2; but the older your version of SQL Server, the fewer of these queries are going to work for you.

In order to run most DMV and DMF queries, you need the VIEW SERVER STATE permission on your SQL Server instance. You will already have this permission if you have system administrator rights within a SQL Server instance, but you should probably create a dedicated login and matching user that has VIEW SERVER STATE permission for use by non-administrators or monitoring applications. Once you have the rights to run DMV and DMF queries, you are ready to get started.
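
If you want to set up such a dedicated login, the T-SQL is straightforward. The following is a minimal sketch, assuming SQL Server authentication; the login name and password are placeholders that you would replace with your own values.

-- Hypothetical example: create a dedicated login for diagnostics
-- and grant it VIEW SERVER STATE (name and password are placeholders)
USE [master];
GO
CREATE LOGIN [DiagnosticsReader] WITH PASSWORD = N'ReplaceWithAStrongPassword1!';
GO
GRANT VIEW SERVER STATE TO [DiagnosticsReader];
GO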

I strongly recommend that you run each of the queries in the following sections one at a time, after reading the background and instructions first. After you get the results of each query, take a few moments to peruse them to ensure that they make sense. Check the notes about how to interpret the results, and consider what you are seeing and whether the results seem to reinforce or contradict other results and metrics that you may have gathered. It’s a good idea to save the results of these queries in individual, labeled tabs in a spreadsheet so that you have a baseline and a record of the changing results over time.

For these server- and instance-level queries, it does not really matter to which database on the instance you are connected. Once you reach the database-specific queries starting at Listing 15-32, don’t forget to change your database context to the particular database you are interested in. This may seem obvious, but I have seen many people run an entire set of database-specific queries while they are still pointed at the master database. This will give you a wealth of useless information about the master database!

First, you want to find out exactly what version, edition, and build of SQL Server you have running on your instance of SQL Server. You also want to know whether it is x64 or x86 and what operating system you are running on. One very simple, non-DMV query, shown in Listing 15-1, gives you all that information, including the compile date and time of your SQL Server build.

LISTING 15-1: SQL Server and operating system information

-- SQL and OS Version information for current instance
SELECT @@VERSION AS [SQL Server and OS Version Info];
 
-- SQL Server 2012 Builds
-- Build            Description
-- 11.00.1055       CTP0
-- 11.00.1103       CTP1
-- 11.00.1540       CTP3
-- 11.00.1515       CTP3 plus Test Update
-- 11.00.1750       RC0
-- 11.00.1913       RC1
-- 11.00.2100       RTM
-- 11.00.2316       RTM CU1
-- 11.00.2325       RTM CU2
-- 11.00.2809       SP1 CTP3 (unsupported in production)

A sample of the results you will get from the preceding query is shown in Listing 15-2.

LISTING 15-2: SQL Server and operating system results

Microsoft SQL Server 2012 RC0 - 11.0.1750.32 (X64) Nov  4 2011 17:54:22 Copyright 
(c) Microsoft Corporation Enterprise Evaluation Edition (64-bit) on Windows NT 6.1
<X64> (Build 7601: Service Pack 1)

This set of results tells you that you are using the x64 Enterprise Evaluation Edition of SQL Server 2012 Release Candidate 0 (RC0), build 1750 running on x64 Windows NT 6.1 Service Pack 1 (which could be either Windows Server 2008 R2 SP1 or Windows 7 SP1). For a production SQL Server instance, you would want to be running Windows Server 2008 R2 or Windows Server 2012, which are x64 only, rather than on Windows 7. Knowing the version, edition, and build of your SQL Server instance helps you determine what features are available within SQL Server. For example, data compression is available only in Enterprise Edition, so if you have Standard Edition or Business Intelligence Edition of SQL Server 2012, you cannot use data compression.

SQL SERVER BUILDS

Microsoft periodically (currently, every eight weeks) releases a cumulative update (CU) for SQL Server. Each update is a collection of typically 10–40 hotfixes for SQL Server that is released as an integrated setup package. About every 12–18 months, Microsoft releases a service pack for a particular major version of SQL Server, such as SQL Server 2012. When a SQL Server service pack is released, it means that you have a new release branch of that major version of SQL Server, which has its own separate cumulative updates. You need to know what release branch your SQL Server instance is on, so that you know which cumulative update to download and install.

Knowing the build number and compile date and time for your instance tells you how old your SQL Server instance is, which is very important. Microsoft initially releases a new major version of SQL Server as the release to manufacturing (RTM) build, which is what you get from a DVD or .iso image of the product. Many organizations install the RTM build of a major version of SQL Server, but never update their SQL Server instances to a newer build, which I think is a huge mistake. I firmly believe that you should make a concerted effort to keep your SQL Server instances up to date with the most current service packs and cumulative updates.

For example, if you installed a new instance of SQL Server 2008 R2 from an .iso image that you downloaded from Microsoft, you would have Build 1600, in the RTM branch. If you then installed SQL Server 2008 R2 SP1, you would have Build 2500, in the Service Pack 1 branch. Any further cumulative updates on this instance would have to be from the Service Pack 1 branch until you moved to a later service pack for SQL Server 2008 R2.
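
If you prefer to get the build number, product level, and edition as discrete values rather than parsing the @@VERSION string, the SERVERPROPERTY function returns them directly. The following is a simple supplementary check you can run alongside Listing 15-1.

-- Returns the build number, product level (RTM, SPn), and edition
-- as separate values that are easy to record in a baseline spreadsheet
SELECT SERVERPROPERTY('ProductVersion') AS [Product Version],
       SERVERPROPERTY('ProductLevel') AS [Product Level],
       SERVERPROPERTY('Edition') AS [Edition];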

This discussion is important for several reasons. First, if you are running on a very old build of a major version of SQL Server, you are much more likely to encounter defects that have been fixed in newer builds of that version, which makes your life as a DBA more difficult. Second, Microsoft eventually retires older release branches (usually one year after the next service pack is released). If you are on a retired release branch, there will be no more cumulative updates for that release branch; and if you ever have to call Microsoft CSS to open a support case, you will receive only limited troubleshooting support until you get on a release branch that is still fully supported. Don’t let this happen to you!

Making the effort to watch for SQL Server service packs and cumulative updates, checking the fix list, downloading them, testing them, and deploying them in your production environment is a beneficial exercise for you and your organization. Deploying these updates using a planned, rolling upgrade strategy enables you to complete the deployments with minimal service interruptions, and it forces you to exercise some of your high-availability (HA), disaster-recovery (DR) infrastructure while keeping your servers up to date. I think this is much better than being a DBA who is overly hesitant to apply any SQL Server updates because you are afraid you might break something.

The next query you want to run, shown in Listing 15-3, will give you a little more information about your operating system on the database server, including the language.

LISTING 15-3: Windows information

-- Windows information (SQL Server 2012)
SELECT windows_release, windows_service_pack_level, 
       windows_sku, os_language_version
FROM sys.dm_os_windows_info WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Gives you major OS version, Service Pack, Edition, 
-- and language info for the operating system

The preceding query simply gives you a little more specific information about the operating system, in a form that is easier to filter and parse than what you get from SELECT @@VERSION. Now that you know more about your operating system, you can start to gather some hardware information.

The query shown in Listing 15-4 indicates how many logical processors you have, the hyperthread ratio of the processors, how many physical CPUs you have, and how much physical memory you have in your database server. It also indicates whether you are running in a virtual machine and when SQL Server was last started. Unfortunately, the hyperthread ratio does not distinguish between logical and physical cores in each physical processor. For example, the laptop I used to write this chapter has a dual-core processor with hyperthreading, and its hyperthread ratio is 4. If my laptop were a quad-core without hyperthreading, my hyperthread ratio would also be 4. Especially with the new, core-based licensing in SQL Server 2012 Enterprise Edition, it is important to distinguish between true physical cores and logical cores. Fortunately, the next two queries will help clear up this confusion.

LISTING 15-4: Hardware information

-- Hardware information from SQL Server 2012 (new virtual_machine_type_desc)
-- (Cannot distinguish between HT and multi-core)
SELECT cpu_count AS [Logical CPU Count], hyperthread_ratio AS [Hyperthread Ratio],
       cpu_count/hyperthread_ratio AS [Physical CPU Count], 
       physical_memory_kb/1024 AS [Physical Memory (MB)], 
       affinity_type_desc, virtual_machine_type_desc, sqlserver_start_time
FROM sys.dm_os_sys_info WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Gives you some good basic hardware information about your database server

The query shown next in Listing 15-5 simply reads the SQL Server Error Log to get the manufacturer and model number of your database server. If you are a true hardware geek like me, simply knowing that you have a Dell PowerEdge R710 server tells you quite a bit by itself, without having to look anything else up. Otherwise, you can do some research with your favorite search engine to find out more details about your server after you know the manufacturer and model number.

LISTING 15-5: Server manufacturer and model

-- Get System Manufacturer and model number from 
-- SQL Server Error log. This query might take a few seconds 
-- if you have not recycled your error log recently
EXEC xp_readerrorlog 0, 1, N'Manufacturer'; 
 
-- This can help you determine the capabilities
-- and capacities of your database server

Knowing the manufacturer and model of your database server enables you to find out important things about it, such as how many processor sockets it has, how many memory slots it has, and how many and what type of PCI-E expansion slots it has. It also tells you what type of processors (Intel or AMD), and what generation(s) of processors, that server model supports.

Now that you know how many physical processors and how many logical cores are visible to your database server, you need to find out the exact model number of the processor(s) in your database server. Once you know this, you can use your favorite search engine to determine each processor’s exact specifications, including how many physical cores it has, whether it supports hyperthreading, TurboBoost, or TurboCore, how large the L2 and L3 cache is, and the rated clock speed of the processor. The query in Listing 15-6 returns the processor description and the rated clock speed from the Windows Registry.

LISTING 15-6: Processor description

-- Get processor description from Windows Registry
EXEC xp_instance_regread 
     N'HKEY_LOCAL_MACHINE',
     N'HARDWARE\DESCRIPTION\System\CentralProcessor\0',
     N'ProcessorNameString';
 
-- Gives you the model number and rated clock speed of your processor(s)

It is very important to know the rated clock speed of the processors in your database server, as it is possible that your processors are not running at their full speed at all times due to some form of power management. By default, Windows Server 2008 and Windows Server 2008 R2 use the Balanced Windows Power Plan. This means that when the processors are not under a high workload, they reduce their clock speed to reduce their electrical power usage. This is great for extending the battery life on laptops, and it is good for reducing electrical power usage in desktop systems and even web servers. Unfortunately, power management is not such a great option for database servers. That’s because when the processor sees a sudden increase in workload, it responds by increasing the clock speed back to full speed. That sounds good so far, but this response to the spike in the workload does not happen quickly enough to avoid a negative effect on query performance. Some short duration, relatively inexpensive queries may not even trigger the throttle-up mechanism, so they are executed while the processor is still running at reduced speed.

In my experience, I have typically seen a 20–25% hit to performance for OLTP workloads when using the Windows default Balanced power plan instead of the High Performance power plan. It depends on which processor you are using, with Intel Nehalem and Westmere processors (see Chapter 2, “Demystifying Hardware”) being particularly vulnerable. Even with the High Performance power plan, it is still possible that your database server is being affected by hardware power management, controlled from the main system BIOS.

To avoid this problem, first make sure your database servers are using the High Performance power plan, not the Balanced power plan. This setting can be changed dynamically, with no reboot of Windows required. Second, use CPU-Z, a free tool available from cpuid.com, to determine the actual clock speed at which your processors are running. If you are using the High Performance power plan and your processor is still not running at full speed, you need to go into the main system BIOS and change its power management settings to either OS control or disabled completely. Depending on your organization, you may have to get your system administrator to make this change for you, as it requires a reboot of the server to get into the main BIOS configuration.

Now that you know a little more about your hardware and whether you are getting the full benefit of the money you spent on it, it is time to collect some information about the SQL Server Services that are installed on the instance. In SQL Server 2008 R2 Service Pack 1 and later, and in SQL Server 2012, you can learn quite a bit about which SQL Server Services are installed and how they are configured and running from the query shown in Listing 15-7.

LISTING 15-7: SQL Server Services information

-- SQL Server Services information from SQL Server 2012
SELECT servicename, startup_type_desc, status_desc,
last_startup_time, service_account, is_clustered, cluster_nodename
FROM sys.dm_server_services WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Gives you information about your installed SQL Server Services, 
-- whether they are clustered, and which node owns the cluster resources

The preceding query tells you exactly which SQL Server Services are installed, their startup type, whether they are running, which account credentials they are using, when they last started, whether they are clustered, and what node they are running on in the cluster. This is all good information to know, and it is very easy to find out with this new DMV. Next, you can get some information about the SQL Server Error Log with the query shown in Listing 15-8.

LISTING 15-8: SQL Server Error Log information

-- Shows you where the SQL Server error log is located and how it is configured
SELECT is_enabled, [path], max_size, max_files
FROM sys.dm_os_server_diagnostics_log_configurations WITH (NOLOCK) 
OPTION (RECOMPILE);
 
-- Knowing this information is important for troubleshooting purposes

This query gives you the file path to the SQL Server Error Log (which is simply a text file that you can open in Notepad in an emergency). If your SQL Server Service ever fails to start, the first place you should look is in the SQL Server Error Log. Of course, if your SQL Server Service is not running, you won’t be able to run this query, so you should run it ahead of time and store the results in your server run book. Next, you will find out whether your database server is using Windows Clustering, with the query shown in Listing 15-9.

LISTING 15-9: Operating system cluster information

-- Get information about your OS cluster 
--(if your database server is in a cluster)
SELECT VerboseLogging, SqlDumperDumpFlags, SqlDumperDumpPath, 
       SqlDumperDumpTimeOut, FailureConditionLevel, HealthCheckTimeout
FROM sys.dm_os_cluster_properties WITH (NOLOCK) OPTION (RECOMPILE);
 
-- You will see no results if your instance is not clustered

This query returns some configuration information about your Windows cluster. If it returns no information, then your operating system on the database server is not clustered (you have a standalone instance), in which case you can skip the query shown in Listing 15-10.

Otherwise, if you are using a cluster, you can use the query shown in Listing 15-10 to get some useful information about your cluster nodes.

LISTING 15-10: Cluster node information

-- Get information about your cluster nodes and their status 
-- (if your database server is in a cluster)
SELECT NodeName, status_description, is_current_owner
FROM sys.dm_os_cluster_nodes WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Knowing which node owns the cluster resources is critical
-- Especially when you are installing Windows or SQL Server updates

This query returns all your cluster nodes, including their status and whether they own the cluster resources. For example, if you have a three-node cluster, this query would return three rows and indicate which node currently owned the SQL Server instance. This is actually important information to know if you are getting ready to do some maintenance on the cluster, such as installing a SQL Server cumulative update, as you would want to first install it on a node that does not own the cluster resources as part of a rolling update.

Next, you want to start gathering some additional information about how your SQL Server instance is configured, which you can do with the query shown in Listing 15-11.

LISTING 15-11: Instance configuration values

-- Get configuration values for instance
SELECT name, value, value_in_use, [description] 
FROM sys.configurations WITH (NOLOCK)
ORDER BY name OPTION (RECOMPILE);
 
-- Focus on
-- backup compression default
-- clr enabled (only enable if it is needed)
-- lightweight pooling (should be zero)
-- max degree of parallelism 
-- max server memory (MB) (set to an appropriate value)
-- optimize for ad hoc workloads (should be 1)
-- priority boost (should be zero)

This query returns current configuration values for a fairly large number of instance-level properties. Some of these properties can be changed using the SSMS graphical user interface, but they can all be changed using the sp_configure system stored procedure.

You should focus on a few key configuration values, which include backup compression default, clr enabled, lightweight pooling, max degree of parallelism, max server memory (MB), optimize for ad hoc workloads, and priority boost. Of these, the first two values that you typically want to change from their default values are backup compression default and optimize for ad hoc workloads (see Chapter 3), both of which should be enabled by setting them to 1. Next, I suggest setting max server memory (MB) to an appropriate, non-default value, taking into account which SQL Server components are installed and running on your SQL Server instance. The idea behind this is to ensure that memory is reserved for the operating system and other SQL Server components, like SQL Server Integration Services (SSIS), so there is sufficient memory for them to operate properly.
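
As a rough illustration of how these changes are made with sp_configure, the following sketch enables the first two settings and sets max server memory (MB); the memory value shown is only a placeholder, not a recommendation, and should be chosen based on the physical RAM in your server and the other components running on it.

-- Illustrative only: change key instance settings with sp_configure
-- The max server memory value below is a placeholder, not a recommendation
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'backup compression default', 1;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
EXEC sp_configure 'max server memory (MB)', 55000;   -- placeholder value
RECONFIGURE;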

In SQL Server 2008 R2 and earlier, the max server memory (MB) setting controlled only the memory used by the SQL Server buffer pool, but in SQL Server 2012 it controls overall memory usage by most other Database Engine components, which means you can probably set the max server memory (MB) setting a little bit higher on SQL Server 2012 than in previous versions (See Chapter 3).

If any databases running on your instance use CLR assemblies, you will have to enable CLR integration by setting clr enabled to 1. Otherwise, it should not be enabled, as it uses some resources on your database server and will increase your attack surface. Most other instance configuration options should usually be left at their default values, unless you have a good reason to change them and you know what you are doing.


NOTE Whenever I see every instance-level property set at its default value, I know that the DBA or other administrator who installed and/or manages this instance is inexperienced or perhaps just an “accidental DBA” who does not understand which instance settings should be changed. It is also possible that this person was just too busy to change anything from the default settings or does not pay sufficient attention to detail (a critical trait for a great DBA).

Next, you are going to find out a little bit about the network configuration settings on the database server with the query shown in Listing 15-12.

LISTING 15-12: TCP Listener information

-- Get information about TCP Listener for SQL Server
SELECT listener_id, ip_address, is_ipv4, port, type_desc, state_desc, start_time
FROM sys.dm_tcp_listener_states WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Helpful for network and connectivity troubleshooting

This DMV, which was added in SQL Server 2008 R2 SP1, tells you which TCP ports are being used by the TCP Listener — for T-SQL, the Service Broker, and database mirroring. This is useful information for troubleshooting general network connectivity and firewall issues. It is also useful for investigating connectivity issues with SQL Server AlwaysOn availability replicas.

The next query, shown in Listing 15-13, provides SQL Server–related information collected from the Windows Registry, using the sys.dm_server_registry DMV that was added in SQL Server 2008 R2 SP1.

LISTING 15-13: SQL Server Registry information

-- SQL Server Registry information 
SELECT registry_key, value_name, value_data
FROM sys.dm_server_registry WITH (NOLOCK) OPTION (RECOMPILE);
 
-- This lets you safely read some SQL Server related 
-- information from the Windows Registry

This query, which was added in SQL Server 2008 R2 SP1, gives you quite a bit of useful information about your SQL Server instance, such as which network protocols are enabled, where the SQL Server main executable is located, and where the SQL Server Agent executable is installed. This is safer and easier to use than the old xp_instance_regread extended stored procedure, although it admittedly does not allow you to query as many different values.
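
For example, if you are only interested in the network protocol entries, you can filter the same DMV on the registry key path. This is just one way to narrow the results; the LIKE pattern below assumes the standard SuperSocketNetLib key used for protocol configuration.

-- Narrow the registry output to the network protocol configuration entries
SELECT registry_key, value_name, value_data
FROM sys.dm_server_registry WITH (NOLOCK)
WHERE registry_key LIKE N'%SuperSocketNetLib%' OPTION (RECOMPILE);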

Next, to investigate whether this instance of SQL Server has been generating any memory dumps, you can use the query shown in Listing 15-14.

LISTING 15-14: SQL Server memory dump information

-- Get information on location, time and size of any memory dumps from SQL Server
SELECT [filename], creation_time, size_in_bytes
FROM sys.dm_server_memory_dumps WITH (NOLOCK) OPTION (RECOMPILE);
 
-- This will not return any rows if you have 
-- not had any memory dumps (which is a good thing)

This query, which was also added in SQL Server 2008 R2 SP1, tells you if and when your SQL Server instance has generated any memory dumps. Hopefully, you will not see any results for this query. If you do, start looking in the SQL Server Error Log(s) that correspond to the times for the SQL Server memory dumps to see if you can find any relevant information about what happened to generate the memory dump. You should also look at the Windows Event logs, and maybe even get ready to open a support case with Microsoft.

Next, to find out how many databases are running on your SQL Server instance, and where they are located, use the query shown in Listing 15-15.

LISTING 15-15: Database filenames and paths

-- File Names and Paths for Tempdb and all user databases in instance 
SELECT DB_NAME([database_id]) AS [Database Name], 
       [file_id], name, physical_name, type_desc, state_desc, 
       CONVERT(bigint, size/128.0) AS [Total Size in MB]
FROM sys.master_files WITH (NOLOCK)
WHERE ([database_id] > 4 
AND [database_id] <> 32767)
OR [database_id] = 2
ORDER BY DB_NAME([database_id]) OPTION (RECOMPILE);
 
-- Things to look at:
-- Are data files and log files on different drives?
-- Is everything on the C: drive?
-- Is TempDB on dedicated drives?
-- Are there multiple data files?

This query returns the file paths and sizes for the data and log files for all the user databases and tempdb. If the type_desc column is ROWS, that means you have a data file, whereas LOG means a transaction log file. This query tells you how many user databases are running on the instance and how large they are, which gives you some idea of the complexity of the server’s workload.

You should be looking to see whether the data and log files are on different drive letters. Some SAN administrators like to provision just one large LUN, which makes it harder to track what is going on from SQL Server and Windows. You also want to ensure that tempdb is not running on the C: drive of the database server (which is what happens with a default standalone installation of SQL Server). You can also see whether there are multiple tempdb data files instead of just one (See Chapter 8), and whether the larger user databases have multiple data files instead of just one large data file.
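
If you find that tempdb has only a single data file on a server with many processor cores, adding data files is a simple change. The following is a hedged sketch; the logical file name, file path, and sizes are placeholders that must match your own drive layout and workload.

-- Illustrative only: add a second tempdb data file
-- The logical name, path, and sizes are placeholders
ALTER DATABASE [tempdb]
ADD FILE (NAME = N'tempdev2',
          FILENAME = N'T:\SQLData\tempdev2.ndf',
          SIZE = 4096MB, FILEGROWTH = 1024MB);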

Next, you are going to discover some key database properties for all the databases on the instance, using the query shown in Listing 15-16.

LISTING 15-16: Database property information

-- Recovery model, log reuse wait description, log file size, log usage size 
-- and compatibility level for all databases on instance
SELECT db.[name] AS [Database Name], db.recovery_model_desc AS [Recovery Model], 
       db.log_reuse_wait_desc AS [Log Reuse Wait Description], 
       ls.cntr_value AS [Log Size (KB)], lu.cntr_value AS [Log Used (KB)],
       CAST(CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT) AS DECIMAL(18,2)) * 100 
       AS [Log Used %], db.[compatibility_level] AS [DB Compatibility Level], 
       db.page_verify_option_desc AS [Page Verify Option], db.is_auto_create_stats_on, 
       db.is_auto_update_stats_on, db.is_auto_update_stats_async_on, 
       db.is_parameterization_forced, 
       db.snapshot_isolation_state_desc, db.is_read_committed_snapshot_on,
       db.is_auto_shrink_on, db.is_auto_close_on
FROM sys.databases AS db WITH (NOLOCK)
INNER JOIN sys.dm_os_performance_counters AS lu WITH (NOLOCK)
ON db.name = lu.instance_name
INNER JOIN sys.dm_os_performance_counters AS ls WITH (NOLOCK)
ON db.name = ls.instance_name
WHERE lu.counter_name LIKE N'Log File(s) Used Size (KB)%' 
AND ls.counter_name LIKE N'Log File(s) Size (KB)%'
AND ls.cntr_value > 0 OPTION (RECOMPILE);
 
-- Things to look at:
-- How many databases are on the instance?
-- What recovery models are they using?
-- What is the log reuse wait description?
-- How full are the transaction logs?
-- What compatibility level are they on?

This query returns all the databases on the instance, including the system databases. For each database, a number of important database properties are listed. First is the recovery model for the database, which can be SIMPLE, FULL, or BULK-LOGGED. Knowing the recovery model for each of your user databases is critically important! Next, you get the log reuse wait description for each database, which tells you what is preventing the active portion of the transaction log from being reused.

One of the most common mistakes made by novice DBAs is to have a database running in the default recovery model of FULL without taking regular transaction log backups. When this happens, the transaction log eventually fills up completely and attempts to autogrow the transaction log file (if autogrow is enabled). If the transaction log is able to autogrow, it continues to do so each time it fills up, until at some point it completely fills up the disk where it is located. When this happens, you will have a read-only database until you do something to correct the issue. While this is happening, your log reuse wait description for that database will show up as LOG BACKUP.

Regularly monitoring your log reuse wait description for each database also alerts you about other problems you need to investigate. For example, if you are using database mirroring and there are any problems with mirroring (such as not being able to send the log activity to the mirror or not being able to apply the log activity fast enough on the mirror), you will see a log reuse wait description of DATABASE MIRRORING. Other common log reuse wait descriptions that bear further investigation include REPLICATION and ACTIVE TRANSACTION.

This query also tells you how large your transaction log file is for each database, and how full it is, which is good information to know! I don’t like to see a transaction log become more than 50% full. If that is happening, you can either make the transaction log file bigger or take more frequent transaction log backups.
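
For reference, a basic compressed transaction log backup looks something like the following; the database name and backup file path are placeholders.

-- Illustrative only: back up the transaction log so the log space can be reused
-- The database name and backup file path are placeholders
BACKUP LOG [YourDatabase]
TO DISK = N'B:\SQLBackups\YourDatabase_Log.trn'
WITH COMPRESSION, CHECKSUM, INIT;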

Finally, you are retrieving a number of other important database-level properties for each database on the instance, including the compatibility level, the page verify option, auto create statistics, auto update statistics, auto update statistics asynchronously (See Chapter 5), forced parameterization, and the snapshot isolation level.

Next, the results of the query shown in Listing 15-17 indicate which database files are seeing the most I/O stalls.

LISTING 15-17: I/O stall information by database file

-- Calculates average stalls per read, per write, 
-- and per total input/output for each database file. 
SELECT DB_NAME(fs.database_id) AS [Database Name], mf.physical_name, 
       io_stall_read_ms, num_of_reads,
       CAST(io_stall_read_ms/(1.0 + num_of_reads) AS NUMERIC(10,1)) AS [avg_read_stall_ms],
       io_stall_write_ms, num_of_writes,
       CAST(io_stall_write_ms/(1.0 + num_of_writes) AS NUMERIC(10,1)) AS [avg_write_stall_ms],
       io_stall_read_ms + io_stall_write_ms AS [io_stalls], 
       num_of_reads + num_of_writes AS [total_io],
       CAST((io_stall_read_ms + io_stall_write_ms)/(1.0 + num_of_reads + num_of_writes) 
       AS NUMERIC(10,1)) AS [avg_io_stall_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
ORDER BY avg_io_stall_ms DESC OPTION (RECOMPILE);
 
-- Helps determine which database files on 
-- the entire instance have the most I/O bottlenecks

This query lists each database file (data and log) on the instance, ordered by the average I/O stall time in milliseconds. This is one way of determining which database files are spending the most time waiting on I/O. It also gives you a better idea of the read/write activity for each database file, which helps you characterize your workload by database file. If you see a lot of database files on the same drive that are at the top of the list for this query, that could be an indication that you are seeing disk I/O bottlenecks on that drive. You would want to investigate this issue further, using Windows Performance Monitor metrics such as Avg Disk Sec/Write and Avg Disk Sec/Read for that logical disk. After you have gathered more metrics and evidence, talk to your system administrator or storage administrator about this issue. Depending on what type of storage you are using (See Chapter 4), it might be possible to improve the I/O performance situation by adding more spindles, changing the RAID controller cache policy, or changing the RAID level. You also might consider moving some of your database files to other drives if possible.

Now, using the query shown in Listing 15-18, you are going to see which user databases on the instance are using the most memory.

LISTING 15-18: Total buffer usage by database

-- Get total buffer usage by database for current instance
SELECT DB_NAME(database_id) AS [Database Name],
COUNT(*) * 8/1024.0 AS [Cached Size (MB)]
FROM sys.dm_os_buffer_descriptors WITH (NOLOCK)
WHERE database_id > 4 -- system databases
AND database_id <> 32767 -- ResourceDB
GROUP BY DB_NAME(database_id)
ORDER BY [Cached Size (MB)] DESC OPTION (RECOMPILE);
 
-- Tells you how much memory (in the buffer pool) 
-- is being used by each database on the instance

This query will list the total buffer usage for each user database running on the current instance. Especially if you are seeing signs of internal memory pressure, you are going to be interested in knowing which database(s) are using the most space in the buffer pool. One way to reduce memory usage in a particular database is to ensure that you don’t have a lot of missing indexes on large tables that are causing a large number of index or table scans. Another way, if you have SQL Server Enterprise Edition, is to start using SQL Server data compression on some of your larger indexes (if they are good candidates for data compression). The ideal candidate for data compression is a large static table that is highly compressible because of the data types and actual data in the table. A bad candidate for data compression is a small, highly volatile table that does not compress well. A compressed index will stay compressed in the buffer pool, unless any data is updated. This means that you may be able to save many gigabytes of space in your buffer pool under ideal circumstances.
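
If you want to evaluate a specific index before compressing it, you can estimate the space savings first and then rebuild the index with compression. The following is a sketch with placeholder schema, table, and index names.

-- Illustrative only: estimate savings, then compress one index
-- Schema, table, and index names are placeholders
EXEC sp_estimate_data_compression_savings 
     @schema_name = N'dbo', @object_name = N'YourLargeTable',
     @index_id = 1, @partition_number = NULL, @data_compression = N'PAGE';

ALTER INDEX [PK_YourLargeTable] ON [dbo].[YourLargeTable]
REBUILD WITH (DATA_COMPRESSION = PAGE);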

Next, you will take a look at which user databases on the instance are using the most processor time by using the query shown in Listing 15-19.

LISTING 15-19: CPU usage by database

-- Get CPU utilization by database
WITH DB_CPU_Stats
AS
(SELECT DatabaseID, DB_Name(DatabaseID) AS [DatabaseName], 
 SUM(total_worker_time) AS [CPU_Time_Ms]
 FROM sys.dm_exec_query_stats AS qs
 CROSS APPLY (SELECT CONVERT(int, value) AS [DatabaseID] 
              FROM sys.dm_exec_plan_attributes(qs.plan_handle)
              WHERE attribute = N'dbid') AS F_DB
 GROUP BY DatabaseID)
SELECT ROW_NUMBER() OVER(ORDER BY [CPU_Time_Ms] DESC) AS [row_num],
       DatabaseName, [CPU_Time_Ms], 
       CAST([CPU_Time_Ms] * 1.0 / SUM([CPU_Time_Ms]) 
       OVER() * 100.0 AS DECIMAL(5, 2)) AS [CPUPercent]
FROM DB_CPU_Stats
WHERE DatabaseID > 4 -- system databases
AND DatabaseID <> 32767 -- ResourceDB
ORDER BY row_num OPTION (RECOMPILE);
 
-- Helps determine which database is 
-- using the most CPU resources on the instance

This query shows you the CPU utilization time by database for the entire instance. It can help you characterize your workload, but you need to take the results with a bit of caution. If you have recently cleared the plan cache for a particular database, using the DBCC FLUSHPROCINDB (database_id) command, it will throw off the overall CPU utilization by database numbers for the query. Still, this query can be useful for getting a rough idea of which database(s) are using the most CPU on your instance.

The next query, shown in Listing 15-20, is extremely useful. It rolls up the top cumulative wait statistics since SQL Server was last restarted or since the wait statistics were cleared by using the DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR) command.

LISTING 15-20: Top cumulative wait types for the instance

-- Isolate top waits for server instance since last restart or statistics clear
WITH Waits AS
(SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,
100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
FROM sys.dm_os_wait_stats WITH (NOLOCK)
WHERE wait_type NOT IN (N'CLR_SEMAPHORE',N'LAZYWRITER_SLEEP',N'RESOURCE_QUEUE',
N'SLEEP_TASK',N'SLEEP_SYSTEMTASK',N'SQLTRACE_BUFFER_FLUSH',N'WAITFOR', 
N'LOGMGR_QUEUE',N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
N'XE_TIMER_EVENT',N'BROKER_TO_FLUSH',N'BROKER_TASK_STOP',N'CLR_MANUAL_EVENT',
N'CLR_AUTO_EVENT',N'DISPATCHER_QUEUE_SEMAPHORE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',
N'XE_DISPATCHER_WAIT', N'XE_DISPATCHER_JOIN', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'ONDEMAND_TASK_QUEUE', N'BROKER_EVENTHANDLER', N'SLEEP_BPOOL_FLUSH',
N'DIRTY_PAGE_POLL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 
N'SP_SERVER_DIAGNOSTICS_SLEEP'))
 
SELECT W1.wait_type, 
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 99 OPTION (RECOMPILE); -- percentage threshold
 
-- Clear Wait Stats 
-- DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

This query will help you zero in on what your SQL Server instance is spending the most time waiting for. Especially if your SQL Server instance is under stress or having performance problems, this can be very valuable information. Knowing that your top cumulative wait types are all I/O related can point you in the right direction for doing further evidence gathering and investigation of your I/O subsystem. However, be aware of several important caveats when using and interpreting the results of this query.

First, this is only a rollup of wait types since the last time your SQL Server instance was restarted, or the last time your wait statistics were cleared. If your SQL Server instance has been running for several months and something important was recently changed, the cumulative wait stats will not show the current actual top wait types, but will instead be weighted toward the overall top wait types over the entire time the instance has been running. This will give you a false picture of the current situation.

Second, there are literally hundreds of different wait types (with more being added in each new version of SQL Server), and only a small number of them are documented in SQL Server Books Online. There is a lot of bad information on the Internet about what many wait types mean, and how you should consider addressing them. Bob Ward, who works for Microsoft Support, is a very reliable source for SQL Server wait type information. He has a SQL Server Wait Type Repository available online at http://blogs.msdn.com/b/psssql/archive/2009/11/03/the-sql-server-wait-type-repository.aspx that documents many SQL Server wait types, including what action you might want to take to alleviate that wait type.

Finally, many common wait types are called benign wait types, meaning you can safely ignore them in most situations. The most common benign wait types are filtered out in the NOT IN clause of the health check query to make the results more relevant. Even so, I constantly get questions from DBAs who are obsessing over a particular wait type that shows up in this query. My answer is basically that if your database instance is running well, with no other signs of stress, you probably don’t need to worry too much about your top wait type, particularly if it is an uncommon wait type. SQL Server is always waiting on something; but if the server is running well, with no other warning signs, you should relax a little.

Next, using the query shown in Listing 15-21, you are going to look at the cumulative signal (CPU) waits for the instance.

LISTING 15-21: Signal waits for the instance

-- Signal Waits for instance
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2))
 AS [%signal (cpu) waits],
CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM (wait_time_ms) AS 
NUMERIC(20,2)) AS [%resource waits]
FROM sys.dm_os_wait_stats WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Signal Waits above 15-20% is usually a sign of CPU pressure

Signal waits are CPU-related waits. If you are seeing other signs of CPU pressure on your SQL Server instance, this query can help confirm or deny the fact that you are seeing sustained cumulative CPU pressure. Usually, seeing signal waits above 15–20% is a sign of CPU pressure.

Now you will take a look at which logins have the most open database connections, using the query shown in Listing 15-22.

LISTING 15-22: Login count information

--  Get logins that are connected and how many sessions they have 
SELECT login_name, COUNT(session_id) AS [session_count] 
FROM sys.dm_exec_sessions WITH (NOLOCK)
GROUP BY login_name
ORDER BY COUNT(session_id) DESC OPTION (RECOMPILE);
 
-- This can help characterize your workload and
-- determine whether you are seeing a normal level of activity

This query is one way to gauge whether you are seeing a normal level of activity on your database server. You can look at the number of connections for each login (especially if you use application-level logins) to determine whether you are seeing a normal workload. For example, if one of your logins typically sees about 150 active connections but you are seeing 350 connections for that login, then you probably have good reason to suspect that your workload has changed and your database server may be working harder than usual.

This query is also good for troubleshooting and confirming database connectivity. If you are seeing active connections for particular logins, then you know that at least some web or application servers are able to connect using that login. You can also use the old DBA trick of using a Microsoft Data Link (.udl file) to verify connectivity from a remote server.

You can easily create a Microsoft Data Link file on any machine running Windows 2000 or newer by creating a new, empty text file and then changing the file extension from .txt to .udl. Then you can double-click on the .udl file and you will open a Data Link Properties dialog. After you enter valid login credentials for the database server and database that you want to connect to, you can click on the Test Connection button to verify that you can make a database connection to that database on that server. This is a good troubleshooting tool that you can use on a web server or application server that does not require any development tools to be installed to verify connectivity from that server to the database server.

Next, using the query shown in Listing 15-23, you will take a look at some current task and pending I/O count information.

LISTING 15-23: Average Task Count information

-- Get Average Task Counts (run multiple times)
SELECT AVG(current_tasks_count) AS [Avg Task Count], 
AVG(runnable_tasks_count) AS [Avg Runnable Task Count],
AVG(pending_disk_io_count) AS [Avg Pending DiskIO Count]
FROM sys.dm_os_schedulers WITH (NOLOCK)
WHERE scheduler_id < 255 OPTION (RECOMPILE);
 
-- Sustained values above 10 suggest further investigation in that area
-- High Avg Task Counts are often caused by blocking or other resource contention
-- High Avg Runnable Task Counts are a good sign of CPU pressure
-- High Avg Pending DiskIO Counts are a sign of disk pressure

This is one query that you will want to run multiple times in quick succession, as the values returned change very frequently, depending on your workload and how your SQL Server instance is running. Any value for these three columns that stays above 10 for a sustained period is cause for some concern. The Average Task Count (per CPU scheduler) is a good indicator of your overall workload level. Sustained high Average Task Counts are often caused by blocking or other resource contention. The Average Runnable Task Count indicates how many tasks are waiting for CPU time on each CPU scheduler. It is a very reliable indicator of CPU pressure. The Average Pending DiskIO Count measures how many pending I/O operations are on each CPU scheduler, so it is a good indicator of overall I/O pressure. This value in particular will jump around a lot as your I/O subsystem is under stress. For all three of these columns, lower values are better than higher values.

The next query, shown in Listing 15-24, will return your CPU utilization history over the last 256 minutes, in one-minute intervals.

LISTING 15-24: CPU utilization history

-- Get CPU Utilization History for last 256 minutes (in one minute intervals)
-- This version works with SQL Server 2008 and above
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks) 
                          FROM sys.dm_os_sys_info WITH (NOLOCK)); 
 
SELECT TOP(256) SQLProcessUtilization AS [SQL Server Process CPU Utilization], 
                SystemIdle AS [System Idle Process], 
                100 - SystemIdle - SQLProcessUtilization 
                AS [Other Process CPU Utilization], 
               DATEADD(ms, -1 * (@ts_now - [timestamp]), 
               GETDATE()) AS [Event Time] 
FROM (SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,    
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') 
             AS [SystemIdle], 
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') 
             AS [SQLProcessUtilization], [timestamp] 
      FROM (SELECT [timestamp], CONVERT(xml, record) AS [record] 
            FROM sys.dm_os_ring_buffers WITH (NOLOCK)
            WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
            AND record LIKE N'%<SystemHealth>%') AS x 
     ) AS y 
ORDER BY record_id DESC OPTION (RECOMPILE);
 
-- Look at the trend over the entire period. 
-- Also look at high sustained Other Process CPU Utilization values

This query shows the recent CPU utilization history for your database server, recorded in one-minute increments. That means it can miss a short spike of CPU activity, but I don't think that's really a big problem. You should be more interested in the trend over the last four hours than in sub-minute spikes. The query gives you CPU utilization by the SQL Server Database Engine, and the sum of all other processes that are running on the database server (the “Other Process CPU Utilization” column). This gives you an idea of CPU pressure caused by SQL Server versus other sources such as management or monitoring software. Ideally, your mission-critical database servers are dedicated SQL Server instances with virtually nothing else running on the machine besides SQL Server. If you see Other Process CPU Utilization above 5% for a sustained period, you should investigate what else is using CPU on your database server.

After looking at your CPU utilization history, it is a good idea to see what is happening with the physical memory at the operating-system level on your database server. You can do that using the query shown in Listing 15-25.

LISTING 15-25: Operating system memory information

-- Good basic information about OS memory amounts and state
SELECT total_physical_memory_kb, available_physical_memory_kb, 
       total_page_file_kb, available_page_file_kb, 
       system_memory_state_desc
FROM sys.dm_os_sys_memory WITH (NOLOCK) OPTION (RECOMPILE);
 
-- You want to see "Available physical memory is high"
-- This indicates that you are not under external memory pressure

This query tells you how much physical memory is in the server, how much physical memory is available, how large the operating system page file is, and how much space is available in the page file. It also signals whether the operating system is low on physical memory, which would mean that SQL Server was under external memory pressure. It is rare to see the operating system signaling that it is under severe memory pressure, especially if you have set the max server memory (MB) instance-level setting in SQL Server to an appropriate value that leaves enough memory available for the operating system.

After looking at memory usage at the operating-system level, you are going to want to take a look at what is happening with SQL Server’s internal memory usage, which you can do using the query shown in Listing 15-26.

LISTING 15-26: SQL server memory information

-- SQL Server Process Address space info 
--(shows whether locked pages is enabled, among other things)
SELECT physical_memory_in_use_kb,locked_page_allocations_kb, 
       page_fault_count, memory_utilization_percentage, 
       available_commit_limit_kb, process_physical_memory_low, 
       process_virtual_memory_low
FROM sys.dm_os_process_memory WITH (NOLOCK) OPTION (RECOMPILE);
 
-- You want to see 0 for process_physical_memory_low
-- You want to see 0 for process_virtual_memory_low
-- This indicates that you are not under internal memory pressure

This query tells you how much memory is actually being used by the SQL Server Database Engine. This information is more reliable than what is displayed in Windows Task Manager. It also tells you whether this SQL Server instance is using locked pages in memory. Finally, it indicates whether the SQL Server process is signaling that it is low on physical or virtual memory.

One classic way of measuring whether SQL Server is under internal memory pressure is to look at its Page Life Expectancy (PLE) (See Chapter 3, “Understanding Memory”), which you can do using the query shown in Listing 15-27.

LISTING 15-27: Page Life Expectancy information

-- Page Life Expectancy (PLE) value for default instance
SELECT cntr_value AS [Page Life Expectancy]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Buffer Manager%' -- Handles named instances
AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);
 
-- PLE is one way to measure memory pressure.
-- Higher PLE is better. Watch the trend, not the absolute value.

This query returns the current Page Life Expectancy (PLE) value, in seconds, for the default instance of SQL Server. PLE is a measurement of how long SQL Server expects a page read into the buffer pool to stay there before it is flushed or evicted. Higher PLE values are better than lower PLE values. You should develop an awareness of the normal range of PLE values for your more important SQL Server instances. That will help you identify a current PLE that is abnormally high or low.

Microsoft has a long-standing recommendation of 300 as a threshold for acceptable PLE, which is often debated in the SQL Server community. One thing that everyone does agree on though is that a PLE value of less than 300 is quite bad. Modern database servers with high amounts of physical memory typically have much higher PLE values than 300. Instead of focusing on the current PLE value, watch the trend over time.
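
On larger servers with multiple NUMA nodes, it can also be useful to look at PLE for each node individually, using the Buffer Node object instead of Buffer Manager. The following variation is a sketch; on a single-node server it simply returns one row that matches the Buffer Manager value.

-- Page Life Expectancy (PLE) per NUMA node
SELECT instance_name AS [NUMA Node], cntr_value AS [Page Life Expectancy]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Buffer Node%' -- Handles named instances
AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);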

After looking at Page Life Expectancy, you are going to want to look at Memory Grants Outstanding, using the query shown in Listing 15-28.

LISTING 15-28: Memory Grants Outstanding information

-- Memory Grants Outstanding value for default instance
SELECT cntr_value AS [Memory Grants Outstanding]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Memory Manager%' -- Handles named instances
AND counter_name = N'Memory Grants Outstanding' OPTION (RECOMPILE);
 
-- Memory Grants Outstanding above zero 
-- for a sustained period is a secondary indicator of memory pressure

This query returns the current value for Memory Grants Outstanding for the default instance of SQL Server. Memory Grants Outstanding is the total number of processes within SQL Server that have successfully acquired a workspace memory grant (refer to Chapter 3). You want this value to be zero if at all possible. Any sustained value above zero is a secondary indicator of memory pressure due to queries that are using memory for sorting and hashing. After looking at Memory Grants Outstanding, you should also look at Memory Grants Pending (which is a much more important indicator of memory pressure), by using the query shown in Listing 15-29.

LISTING 15-29: Memory Grants Pending information

-- Memory Grants Pending value for default instance
SELECT cntr_value AS [Memory Grants Pending]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Memory Manager%' -- Handles named instances
AND counter_name = N'Memory Grants Pending' OPTION (RECOMPILE);
 
-- Memory Grants Pending above zero 
-- for a sustained period is an extremely strong indicator of memory pressure

This query returns the current value for Memory Grants Pending for the default instance of SQL Server. Memory Grants Pending is the total number of processes within SQL Server that are waiting for a workspace memory grant. You want this value to be zero if at all possible. Any sustained value above zero is an extremely strong indicator of memory pressure. Especially if you see any signs of internal memory pressure from the previous three queries, take a closer look at the overall memory usage in SQL Server by running the query shown in Listing 15-30.

LISTING 15-30: Memory clerk information

-- Memory Clerk Usage for instance
-- Look for high value for CACHESTORE_SQLCP (Ad-hoc query plans)
SELECT TOP(10) [type] AS [Memory Clerk Type], 
       SUM(pages_kb) AS [SPA Mem, Kb] 
FROM sys.dm_os_memory_clerks WITH (NOLOCK)
GROUP BY [type]  
ORDER BY SUM(pages_kb) DESC OPTION (RECOMPILE);
 
-- CACHESTORE_SQLCP  SQL Plans         
-- These are cached SQL statements or batches that 
-- aren’t in stored procedures, functions and triggers
--
-- CACHESTORE_OBJCP  Object Plans      
-- These are compiled plans for 
-- stored procedures, functions and triggers
--
-- CACHESTORE_PHDR   Algebrizer Trees  
-- An algebrizer tree is the parsed SQL text 
-- that resolves the table and column names

This query gives you a good idea of what (besides the buffer cache) is using large amounts of memory in SQL Server. One key item to look out for is high values for CACHESTORE_SQLCP, which is the memory clerk for ad hoc query plans. It is quite common to see this memory clerk using several gigabytes of memory to cache ad hoc query plans.

If you see a lot of memory being used by the CACHESTORE_SQLCP memory clerk, you can determine whether you have many single-use, ad hoc query plans using a lot of memory in your procedure cache by running the query shown in Listing 15-31.

LISTING 15-31: Single-use ad-hoc queries

-- Find single-use, ad-hoc queries that are bloating the plan cache
SELECT TOP(20) [text] AS [QueryText], cp.size_in_bytes
FROM sys.dm_exec_cached_plans AS cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_sql_text(plan_handle) 
WHERE cp.cacheobjtype = N'Compiled Plan' 
AND cp.objtype = N'Adhoc' 
AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC OPTION (RECOMPILE);
 
-- Gives you the text and size of single-use ad-hoc queries that 
-- waste space in the plan cache
-- Enabling 'optimize for ad hoc workloads' for the instance 
-- can help (SQL Server 2008 and above only)
-- Enabling forced parameterization for the database can help, but test first!

This query returns the query text and size of your largest (in terms of memory usage) single-use, ad hoc query plans wasting space in your plan cache. If you see a lot of single-use, ad hoc query plans in your plan cache, you should consider enabling the instance-level optimize for ad hoc workloads setting (also see Chapter 3). This setting enables SQL Server 2008 and later to store a much smaller version of the ad hoc execution plan in the plan cache the first time that plan is executed. This can reduce the amount of memory that is wasted on single-use, ad hoc query plans that are highly likely to never be reused. Conversely, sometimes the result of enabling this setting is that more of these smaller, ad hoc plans are stored in the plan cache (because more smaller plans can fit in the same amount of memory as fewer, larger plans), so you may not see as much memory savings as you anticipated.

Even so, I don’t see any good reason not to enable this setting on all SQL Server 2008 and later instances. When I talked to one of the developers at Microsoft who worked on this feature a couple of years ago, the only downside she could see was a scenario in which you had several identical ad hoc query plans that would be executed between two and ten times; in that case, you would take a small hit the second time each plan was executed. That seems like an edge case to me.
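
If you do decide to enable it, optimize for ad hoc workloads is an advanced sp_configure option, so enabling it looks something like this:

-- Enable 'optimize for ad hoc workloads' for the instance (SQL Server 2008 and later)
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;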

DATABASE-LEVEL QUERIES

After running all the server- and instance-level queries, you should have a fairly good idea of which database or databases are the most resource intensive on a particular instance of SQL Server. In order to get more details about a particular database, you need to switch your database context to that database and run a set of database-specific queries. The code in Listing 15-32 shows how to switch your database context using T-SQL. Be sure to change the name of the database to the name of the database that you are interested in investigating further.

LISTING 15-32: Switching to a user database

-- Database specific queries ******************************************************
 
-- **** Switch to a user database *****
USE YourDatabaseName;
GO

This code merely switches your database context to the named database. Be sure to change the name to the database that you want to look at. Many people make the mistake of running these queries while connected to the master system database. If you do that, you will get a lot of mostly useless information about your master database.

After you are sure you are pointing at the correct database, you can find out how large it is with the query shown in Listing 15-33.

LISTING 15-33: Database file sizes and space available

-- Individual File Sizes and space available for current database
SELECT name AS [File Name], physical_name AS [Physical Name],
size/128.0 AS [Total Size in MB],
size/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS [Available Space In MB],
[file_id]
FROM sys.database_files WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Look at how large and how full the files are and where they are located
-- Make sure the transaction log is not full!!

This query shows you where the data and log files for your database are located. It also returns how large and how full they are. It is a good way to help monitor and manage your data and log file sizing and file growth. I don’t like to see my data files getting too close to being 100% full, and I don’t like to see my log files getting more than 50% full. You should manually manage the growth of your data and log files, leaving autogrow enabled only for an emergency.

The next query, shown in Listing 15-34, shows a DMV that enables you to focus solely on your log file size and space used.

LISTING 15-34: Transaction log size and space used

-- Get transaction log size and space information for the current database
SELECT DB_NAME(database_id) AS [Database Name], database_id,
CAST((total_log_size_in_bytes/1048576.0) AS DECIMAL(10,1)) 
AS [Total_log_size(MB)],
CAST((used_log_space_in_bytes/1048576.0) AS DECIMAL(10,1)) 
AS [Used_log_space(MB)],
CAST(used_log_space_in_percent AS DECIMAL(10,1)) AS [Used_log_space(%)]
FROM sys.dm_db_log_space_usage WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Another way to look at transaction log file size and space

This query, using a DMV introduced in SQL Server 2008 R2 Service Pack 1, enables you to directly query your log file size and the space used, as a percentage. It would be relatively easy to use this DMV to write a query that triggers an alert when log file usage exceeds a percentage that you specify. Of course, if you are properly managing the size of your transaction log, along with how often you take transaction log backups when you are in the FULL recovery model, you should not run into problems that often. The obvious exceptions are when something such as database mirroring, replication, or a long-running transaction causes your transaction log to fill up despite frequent transaction log backups.
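
As a rough sketch of such an alert, you could compare used_log_space_in_percent to a threshold of your choosing (the 80 percent value here is just an example) and raise an error that a SQL Server Agent alert or your monitoring tool can pick up:

-- Minimal sketch: raise an error when log space usage exceeds a chosen threshold
DECLARE @Threshold DECIMAL(10,1) = 80.0;  -- example threshold, adjust to your needs

IF EXISTS (SELECT 1 FROM sys.dm_db_log_space_usage
           WHERE used_log_space_in_percent > @Threshold)
    RAISERROR(N'Transaction log space usage has exceeded the threshold.', 16, 1);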

The next query, shown in Listing 15-35, will enable you to gather some I/O statistics by file for the current database.

LISTING 15-35: I/O statistics by file for the current database

-- I/O Statistics by file for the current database
SELECT DB_NAME(DB_ID()) AS [Database Name],[file_id], num_of_reads, num_of_writes, 
io_stall_read_ms, io_stall_write_ms,
CAST(100. * io_stall_read_ms/(io_stall_read_ms + io_stall_write_ms) 
AS DECIMAL(10,1)) AS [IO Stall Reads Pct],
CAST(100. * io_stall_write_ms/(io_stall_write_ms + io_stall_read_ms) 
AS DECIMAL(10,1)) AS [IO Stall Writes Pct],
(num_of_reads + num_of_writes) AS [Writes + Reads], num_of_bytes_read, 
num_of_bytes_written,
CAST(100. * num_of_reads/(num_of_reads + num_of_writes) AS DECIMAL(10,1)) 
AS [# Reads Pct],
CAST(100. * num_of_writes/(num_of_reads + num_of_writes) AS DECIMAL(10,1)) 
AS [# Write Pct],
CAST(100. * num_of_bytes_read/(num_of_bytes_read + num_of_bytes_written) 
AS DECIMAL(10,1)) AS [Read Bytes Pct],
CAST(100. * num_of_bytes_written/(num_of_bytes_read + num_of_bytes_written) 
AS DECIMAL(10,1)) AS [Written Bytes Pct]
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) OPTION (RECOMPILE);
 
-- This helps you characterize your workload better from an I/O perspective

This query returns the number of reads and writes for each file in your database. It also returns the number of bytes read and written for each file in the database, and the number of read I/O and write I/O stalls for each file in the database. Finally, it breaks down the read/write ratio and read/write I/O stall ratio into percentage terms. The point of all this information is to help you better characterize your I/O workload at the database-file level. For example, you might discover that you are doing a lot more writes to a particular data file than you expected, which might be a good reason to consider using RAID 10 instead of RAID 5 for the logical drive where that data file is located. Seeing a lot of I/O stalls for a particular database file might mean that the logical drive where that file is located is not performing very well or simply that the database file in question is particularly active. It is definitely something to investigate further.

Next, with the query shown in Listing 15-36, you are going to take a look at the transaction log Virtual Log File (VLF) count.

LISTING 15-36: Virtual Log File count

-- Get VLF count for transaction log for the current database,
-- number of rows equals the VLF count. Lower is better!
DBCC LOGINFO;
 
-- High VLF counts can affect write performance 
-- and they can make database restore and recovery take much longer

This query simply tells you how many VLFs you have in your transaction log file. Having a large number of VLFs in your transaction log can affect write performance to your transaction log. More important, it can have a huge effect on how long it takes to restore a database, and how long it takes a database to become available in a clustering failover. It can also affect how long it takes to recover a database when your instance of SQL Server is started or restarted. What is considered a large number of VLFs?

I don’t like to see more than a couple of hundred VLFs in a transaction log. For the most part, having fewer VLFs is better than having a large number, but I don’t worry too much until I start seeing more than 200-300. The most common way to get a high VLF count is to create a database in the FULL recovery model with the default settings for the size and autogrowth increment of the transaction log file, and then not take frequent transaction log backups. By default, you start out with a 1MB transaction log file that is set to grow by 10% when autogrow kicks in after the 1MB file fills up completely. The now 1.1MB file will quickly fill up again, and autogrow will make it 10% larger. This happens repeatedly, and each time the transaction log file grows, more VLFs are added to it. If the growth amount is less than 64MB, 4 VLFs are added to the transaction log. If the growth amount is between 64MB and 1GB, 8 VLFs are added. Finally, if the growth amount is over 1GB, 16 VLFs are added.

Knowing this, you can see how a 1MB transaction log file can grow and end up with tens of thousands of VLFs. The way to avoid this is to manually manage your transaction log file size, and to change the autogrowth increment to a more reasonable value. That way you will have fewer growth events (whether manual or autogrows), and therefore a lower VLF count. With a relatively large and active database, I recommend setting the autogrowth increment to 8000MB. This way, you need only a few growth events to grow the transaction log file to a sufficiently large size, which keeps the VLF count much lower.

Picking a good size for your transaction log file depends on a number of factors. First, how much write activity do you think your database will see with its normal workload? You want to figure out how much transaction log activity is generated in an hour, in terms of MB or GB. One easy way to determine this is to take an uncompressed transaction log backup every hour for a full day. This gives you a good idea of your average and peak log generation rates. Make sure that your transaction log file is large enough to hold at least eight hours of normal activity, and consider when and how often you do maintenance activity such as reorganizing or rebuilding indexes, which generate a lot of transaction log activity. Creating new indexes on large tables and loading or deleting a lot of data also creates a lot of transaction log activity.

You should also consider how often you are going to run transaction log backups (in order to help meet your Recovery Point Objective [RPO] and Recovery Time Objective [RTO]). If you need to run very frequent transaction log backups, you may be able to have a somewhat smaller transaction log file. This also depends on how large your database is and how long it takes to do a full database backup. While a full database backup is running, transaction log backups will not clear the log file. If you have a very slow I/O subsystem and a very large database, your full database backups may take a long time to complete. You want to size your transaction log file to be large enough that it never has to autogrow. One disadvantage to having an extremely large transaction log file (besides wasting some disk space) is that it will take quite a bit longer to restore a copy of your database, as SQL Server cannot use Windows Instant File Initialization on transaction log files.

If you discover that you have a very high number of VLFs in your transaction log file, you should take a transaction log backup and then immediately shrink the transaction log file (not the entire database, just the transaction log file). After you do this, check your VLF count. It may not have gone down by much, depending on the prior activity and state of your transaction log. If this is the case, simply repeat the transaction log backup and transaction log file shrink sequence several times until you get the VLF count down. By this time, your transaction log file will probably be very small, so you will want to immediately grow it back, in reasonable increments, to a reasonable size based on the factors previously discussed. Keep in mind that if you decide to grow your transaction log file in 8000MB increments and you have a slow I/O subsystem, you may see a performance impact during the file growth operation, as Windows Instant File Initialization does not work on transaction log files.
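
A hedged sketch of that backup/shrink/regrow sequence follows; the database name, logical log file name, and backup path are placeholders that you would replace with your own values:

-- Back up the log, then shrink only the log file (repeat until the VLF count drops)
BACKUP LOG YourDatabaseName TO DISK = N'D:\SQLBackups\YourDatabaseName_log.trn';
DBCC SHRINKFILE (N'YourDatabaseName_log', 1);

-- Check the VLF count again
DBCC LOGINFO;

-- Then grow the log file back out to a reasonable size in reasonable increments
ALTER DATABASE YourDatabaseName
MODIFY FILE (NAME = N'YourDatabaseName_log', SIZE = 8000MB);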

At this point, it is time to start looking at the query activity on this particular database, using the query shown in Listing 15-37.

LISTING 15-37: Top cached queries by execution count

-- Top cached queries by Execution Count (SQL Server 2012)
SELECT qs.execution_count, qs.total_rows, qs.last_rows, qs.min_rows, qs.max_rows,
qs.last_elapsed_time, qs.min_elapsed_time, qs.max_elapsed_time,
SUBSTRING(qt.TEXT,qs.statement_start_offset/2 +1,
(CASE WHEN qs.statement_end_offset = -1
      THEN LEN(CONVERT(NVARCHAR(MAX), qt.TEXT)) * 2
 ELSE qs.statement_end_offset END - qs.statement_start_offset)/2) 
AS query_text
FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
ORDER BY qs.execution_count DESC OPTION (RECOMPILE);
 
-- Uses several new rows-returned columns (total_rows, last_rows, min_rows, max_rows)
-- to help troubleshoot performance problems

This query simply returns cached queries ordered by execution count. This is useful for getting an idea about your typical workload. Knowing which queries are executed the most often, along with their range of execution times and range of rows returned can be very useful information. Having the query text available enables you to run a query in SSMS, where you can look at the execution plan and the I/O statistics to determine whether there are any tuning opportunities.

Next, you will take a look at a similar query that focuses on cached stored procedures. This query is shown in Listing 15-38.

LISTING 15-38: Top cached stored procedures by execution count

-- Top Cached SPs By Execution Count (SQL Server 2012)
SELECT TOP(250) p.name AS [SP Name], qs.execution_count,
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, GETDATE()), 0) 
AS [Calls/Second],
qs.total_worker_time/qs.execution_count AS [AvgWorkerTime], 
qs.total_worker_time AS [TotalWorkerTime],qs.total_elapsed_time, 
qs.total_elapsed_time/qs.execution_count AS [avg_elapsed_time],
qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY qs.execution_count DESC OPTION (RECOMPILE);
 
-- Tells you which cached stored procedures are called the most often
-- This helps you characterize and baseline your workload

This query returns the top cached stored procedures ordered by execution count. It can help you determine which stored procedures are called most often, and how often they are called. Knowing this is very helpful for baseline purposes. For example, if you know that your most frequently executed stored procedure is normally called 50 times/second, but you later see it being called 300 times/second, you would want to start investigating. Perhaps there was a change to your application(s) that is causing them to call that stored procedure much more often. It might be by design, or it might be a defect in the application, introduced in a recent update. It is also possible that your applications are suddenly seeing more load, either from legitimate users or from a denial-of-service attack.

One thing to keep in mind when examining all these queries that are looking at cached stored procedures is that the stored procedures may have gone into the cache at different times, which skews the numbers for these queries. You should always look at the cached_time in the query results (see Listing 15-39) to see how long the stored procedure has been in the cache.

LISTING 15-39: Top cached stored procedures by average elapsed time

-- Top Cached SPs By Avg Elapsed Time (SQL Server 2012)
SELECT TOP(25) p.name AS [SP Name], qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.total_elapsed_time, qs.execution_count, 
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, 
GETDATE()), 0) AS [Calls/Second], 
qs.total_worker_time/qs.execution_count AS [AvgWorkerTime], 
qs.total_worker_time AS [TotalWorkerTime], qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY avg_elapsed_time DESC OPTION (RECOMPILE);
 
-- This helps you find long-running cached stored procedures that
-- may be easy to optimize with standard query tuning techniques

This query captures the top cached stored procedures by average elapsed time. This is useful because it can highlight long-running (on average) stored procedures that might be very good tuning candidates. If you can make some changes to a stored procedure that previously took 90 seconds that result in it returning in 5 seconds, you will look like a magician to your boss. Conversely, if a long-running query is executed only once a day, optimizing it will not really help your overall workload as much as you might expect. This query is somewhat less sensitive to the cached_time issue, as you are sorting by average elapsed time.

Next, you are going to look at the most expensive stored procedures from an overall CPU perspective, with the query shown in Listing 15-40.

LISTING 15-40: Top cached stored procedures by total worker time

-- Top Cached SPs By Total Worker time (SQL Server 2012). 
-- Worker time relates to CPU cost
SELECT TOP(25) p.name AS [SP Name], qs.total_worker_time AS [TotalWorkerTime], 
qs.total_worker_time/qs.execution_count AS [AvgWorkerTime], qs.execution_count, 
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, GETDATE()), 0) 
AS [Calls/Second],qs.total_elapsed_time, qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY qs.total_worker_time DESC OPTION (RECOMPILE);
 
-- This helps you find the most expensive cached 
-- stored procedures from a CPU perspective
-- You should look at this if you see signs of CPU pressure

This query returns the top cached stored procedures ordered by total worker time. Worker time relates to the CPU cost of a query or stored procedure. Especially if you see signs of CPU pressure from looking at your top cumulative wait types or your CPU utilization history, you should look very closely at the results of this query to figure out which stored procedures were the most expensive from a CPU perspective. The reason you sort by total worker time is that it takes into account the total CPU cost of the stored procedure since it was cached. You might have a stored procedure that is not particularly expensive for a single execution but is called very frequently, resulting in a very high overall CPU cost.

With this query, you do need to pay particular attention to the cached_time column for each stored procedure. The length of time that a stored procedure has been in the cache can have a huge effect on its cumulative cost figures. That’s why I like to periodically clear out the procedure cache so that the stored procedures that are part of my regular workload will be recompiled and go back into the cache at nearly the same time. This makes it much easier to accurately interpret the results of all these stored procedure cost queries. It also has the benefit of clearing out single-use, ad hoc query plans that may be wasting a lot of space in your cache.
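
Clearing the procedure cache is a single command, but it forces every plan to be recompiled, so be deliberate about when you run it on a busy production instance:

-- Clears the entire plan cache for the instance; all plans must be recompiled afterward
DBCC FREEPROCCACHE;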

Next, the query shown in Listing 15-41 will provide some information about your cached stored procedures from a logical reads perspective.

LISTING 15-41: Top cached stored procedures by total logical reads

-- Top Cached SPs By Total Logical Reads (SQL Server 2012). 
-- Logical reads relate to memory pressure
SELECT TOP(25) p.name AS [SP Name], qs.total_logical_reads 
AS [TotalLogicalReads], qs.total_logical_reads/qs.execution_count 
AS [AvgLogicalReads],qs.execution_count, 
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, GETDATE()), 0) 
AS [Calls/Second], qs.total_elapsed_time,qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY qs.total_logical_reads DESC OPTION (RECOMPILE);
 
-- This helps you find the most expensive cached 
-- stored procedures from a memory perspective
-- You should look at this if you see signs of memory pressure

This query returns the top cached procedures ordered by total logical reads. Logical reads equate to memory pressure, and indirectly to I/O pressure. A logical read occurs when a query finds the data that it needs in the buffer pool (in memory). Once data is initially read off of the I/O subsystem, it goes into the SQL Server buffer pool. If you have a large amount of physical RAM and your instance-level max server memory setting is at an appropriately high level, you will have a relatively large amount of space for the SQL Server buffer pool, which means that SQL Server is much more likely to subsequently find what it needs there, rather than access the I/O subsystem with a physical read.

If you are seeing signs of memory pressure, such as persistently low page life expectancy values, high memory grants outstanding, and high memory grants pending, look very closely at the results of this query. Again, you need to pay close attention to the cached_time column to ensure that you are really looking at the most expensive stored procedures from a memory perspective.

After I have identified the top several stored procedure offenders, I like to run them individually (with appropriate input parameters captured from SQL Server Profiler) in SSMS with the SET STATISTICS IO ON command enabled and the graphical execution plan enabled. This enables me to start troubleshooting why the queries in the stored procedure are generating so many logical reads. Perhaps the queries are doing implicit conversions that cause them to ignore a perfectly valid index, or maybe they are using T-SQL functions on the left side of a WHERE clause. Another common issue is a clustered index or table scan due to a missing index. There are many possible reasons why a query has a very large number of logical reads.
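
That troubleshooting step looks something like the following; the stored procedure name and parameter value are hypothetical placeholders:

-- Run a suspect stored procedure with I/O statistics enabled
-- (also enable the graphical execution plan in SSMS)
SET STATISTICS IO ON;
EXEC dbo.YourExpensiveProc @CustomerID = 42;
SET STATISTICS IO OFF;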

If you are using SQL Server 2008 or later and you have Enterprise Edition, you should take a look at SQL Server data compression. Data compression is usually touted as a way to reduce your I/O utilization requirements in exchange for some added CPU utilization. While it does work very well for that purpose (with indexes that are good candidates for compression), it can also reduce your memory pressure in many cases. An index that has been compressed will stay compressed in the buffer pool, until the data is updated. This can dramatically reduce the space required in the buffer pool for that index.
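
If your analysis identifies a good candidate, you can estimate the likely savings first and then rebuild the index with compression; the schema, table, and index names below are hypothetical:

-- Estimate how well a table or index would compress before committing to it
EXEC sp_estimate_data_compression_savings
    @schema_name = N'dbo', @object_name = N'YourLargeTable',
    @index_id = NULL, @partition_number = NULL, @data_compression = N'PAGE';

-- Rebuild a specific index with page compression (Enterprise Edition)
ALTER INDEX IX_YourIndex ON dbo.YourLargeTable
REBUILD WITH (DATA_COMPRESSION = PAGE);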

The next query, shown in Listing 15-42, looks at the most expensive stored procedures from a physical reads perspective.

LISTING 15-42: Top cached stored procedures by total physical reads

-- Top Cached SPs By Total Physical Reads (SQL Server 2012). 
-- Physical reads relate to disk I/O pressure
SELECT TOP(25) p.name AS [SP Name],qs.total_physical_reads 
AS [TotalPhysicalReads],qs.total_physical_reads/qs.execution_count 
AS [AvgPhysicalReads], qs.execution_count, qs.total_logical_reads, 
qs.total_elapsed_time, qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.cached_time 
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
AND qs.total_physical_reads > 0
ORDER BY qs.total_physical_reads DESC, 
qs.total_logical_reads DESC OPTION (RECOMPILE);
 
-- This helps you find the most expensive cached 
-- stored procedures from a read I/O perspective
-- You should look at this if you see signs of I/O pressure or of memory pressure

This query returns the top cached stored procedures ordered by total physical reads. Physical reads equate to disk I/O cost. A physical read happens when SQL Server cannot find what it needs in the SQL Server buffer pool, so it must go out to the storage subsystem to retrieve the data. No matter what kind of storage you are using, it is much slower than physical memory.

If you are seeing signs of I/O pressure, such as I/O-related wait types in your top cumulative wait types query, or high Avg. Disk sec/Read values in Windows Performance Monitor, examine the results of this query very closely. Don’t forget to consider how long a stored procedure has been in the cache by looking at the cached_time column. A very expensive stored procedure that was only recently cached will probably not show up at the top of the list compared to other stored procedures that have been cached for a long period of time.

After identifying the top several stored procedure offenders, run them individually (with appropriate input parameters captured from SQL Server Profiler) in SSMS with the SET STATISTICS IO ON command enabled and the graphical execution plan enabled. This will help you determine why the queries in the stored procedure are generating so many physical reads. Again, after you have exhausted standard query-tuning techniques to improve the situation, you should consider using SQL Server data compression (if you have Enterprise Edition) to further reduce the amount of data being read off of the I/O subsystem. Other options (besides standard query tuning) include adding more physical RAM to your server and improving your I/O subsystem. Perhaps you can add additional spindles to a RAID array, change the RAID level, change the hardware cache policy, and so on.

Next, take a look at the most expensive cached stored procedures for logical writes, using the query shown in Listing 15-43.

LISTING 15-43: Top cached stored procedures by total logical writes

-- Top Cached SPs By Total Logical Writes (SQL Server 2012). 
-- Logical writes relate to both memory and disk I/O pressure 
SELECT TOP(25) p.name AS [SP Name], qs.total_logical_writes 
AS [TotalLogicalWrites], qs.total_logical_writes/qs.execution_count 
AS [AvgLogicalWrites], qs.execution_count,
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, GETDATE()), 0) 
AS [Calls/Second],qs.total_elapsed_time, qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY qs.total_logical_writes DESC OPTION (RECOMPILE);
 
-- This helps you find the most expensive cached 
-- stored procedures from a write I/O perspective
-- You should look at this if you see signs of I/O pressure or of memory pressure

This query returns the most expensive cached stored procedures ordered by total logical writes, meaning simply the stored procedures that are generating the most write activity in your database. You might be surprised to see SELECT type stored procedures show up in this list, but that often happens when the SELECT procedures INSERT intermediate results into a temp table or table variable before doing a later SELECT operation.

Especially with OLTP workloads that see a lot of intensive write activity, you should pay attention to the results of this query. As always, consider the cached_time column before making any judgments. After you have identified the actual top offenders in this query, talk to your developers to see if perhaps they are updating too much information, or updating information too frequently. I would also be looking at the index usage on your most frequently updated tables. You might discover that you have a number of nonclustered indexes that have a high number of writes, but no reads. Having fewer indexes on a volatile, write-intensive table will definitely help write performance. After some further investigation and analysis, you might want to drop some of those unused indexes.

From a hardware perspective, adding more physical RAM to your server might help even out your write I/O workload a little bit. If SQL Server has more RAM in the buffer pool, it will not have to issue automatic checkpoints to write to the data file(s) quite as often. Going longer between automatic checkpoints can help reduce total write I/O somewhat because more data in the same data pages might have been modified over that longer period of time. A system that is under memory pressure will also be forced to have the lazy writer write dirty pages in memory to the disk subsystem more often.

Finally, improving your I/O subsystem, especially the LUN where your transaction log is located, would be an obvious step. Again, adding more spindles to the RAID array, changing from RAID 5 to RAID 10, and making sure your RAID controller hardware cache is used for writes instead of reads will all help write performance.

Next, you will take a look at the most expensive statements within your cached stored procedures for average I/O, using the query shown in Listing 15-44.

LISTING 15-44: Top statements by average I/O

-- Lists the top statements by average input/output 
-- usage for the current database
SELECT TOP(50) OBJECT_NAME(qt.objectid) AS [SP Name],
(qs.total_logical_reads + qs.total_logical_writes) /qs.execution_count 
AS [Avg IO],SUBSTRING(qt.[text],qs.statement_start_offset/2, 
(CASE 
 WHEN qs.statement_end_offset = -1 
 THEN LEN(CONVERT(nvarchar(max), qt.[text])) * 2 
 ELSE qs.statement_end_offset 
 END - qs.statement_start_offset)/2) AS [Query Text]
FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qt.[dbid] = DB_ID()
ORDER BY [Avg IO] DESC OPTION (RECOMPILE);
 
-- Helps you find the most expensive statements for I/O by SP

This query identifies the most expensive cached statements for I/O, ordered by average I/O. If your system is showing any signs of I/O pressure, you should definitely take a look at the results of this query. Even if you are not seeing I/O pressure, it never hurts to be aware of which statements within your stored procedures are causing the most I/O pain.

Next, using the query shown in Listing 15-45, you will look for nonclustered indexes that have more writes than reads.

LISTING 15-45: Possible bad nonclustered indexes

-- Possible Bad NC Indexes (writes > reads)
SELECT OBJECT_NAME(s.[object_id]) AS [Table Name], i.name AS [Index Name], 
i.index_id,user_updates AS [Total Writes], 
user_seeks + user_scans + user_lookups AS [Total Reads],
user_updates - (user_seeks + user_scans + user_lookups) AS [Difference]
FROM sys.dm_db_index_usage_stats AS s WITH (NOLOCK)
INNER JOIN sys.indexes AS i WITH (NOLOCK)
ON s.[object_id] = i.[object_id]
AND i.index_id = s.index_id
WHERE OBJECTPROPERTY(s.[object_id],'IsUserTable') = 1
AND s.database_id = DB_ID()
AND user_updates > (user_seeks + user_scans + user_lookups)
AND i.index_id > 1
ORDER BY [Difference] DESC, [Total Writes] DESC, [Total Reads] ASC OPTION 
(RECOMPILE);
 
-- Look for indexes with high numbers of writes 
-- and zero or very low numbers of reads
-- Consider your complete workload
-- Investigate further before dropping an index!

This query returns all nonclustered indexes in the current database, along with their total writes and total reads ordered by the difference between the number of writes and the number of reads. The idea here is to find indexes that have a lot of writes and very few (or zero) reads. An index that is only written to, but never used for reads, is not useful at all. You are paying the cost to maintain the index, but you are receiving no benefit. Having many “unused” indexes on a table hurts your insert/update/delete performance, and it makes your table and database need more space in the data file(s). It also makes backups and restores take longer to complete.

Keep in mind that these read and write statistics reflect only the period since this instance of SQL Server has been running. Depending on how long your instance has been running, you may not have seen your complete workload yet. For example, some indexes may be used only for monthly reporting queries, meaning they might have a lot more writes than reads during the rest of the month. If you dropped an index like that based on the results of this query, you could cause some serious performance issues when it comes time to run those reporting queries.

In other words, use some caution and common sense before you start dropping indexes solely based on the results of this query. You should always do some further investigation and analysis before you drop an index on an important database.
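
If that further investigation confirms an index really is dead weight across your complete workload cycle, dropping it is a one-line statement (script the index definition out first so you can recreate it if needed); the names here are hypothetical:

-- Drop a confirmed unused nonclustered index
DROP INDEX IX_UnusedIndex ON dbo.YourWriteHeavyTable;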

Next, using the query shown in Listing 15-46, you will look for indexes that SQL Server thinks you would benefit from adding to this database.

LISTING 15-46: Missing indexes by index advantage

-- Missing Indexes current database by Index Advantage
SELECT user_seeks * avg_total_user_cost * (avg_user_impact * 0.01) 
AS [index_advantage], 
migs.last_user_seek, mid.[statement] AS [Database.Schema.Table],
mid.equality_columns, mid.inequality_columns, mid.included_columns,
migs.unique_compiles, migs.user_seeks, migs.avg_total_user_cost, 
migs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS migs WITH (NOLOCK)
INNER JOIN sys.dm_db_missing_index_groups AS mig WITH (NOLOCK)
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details AS mid WITH (NOLOCK)
ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID() -- Remove this to see for entire instance
ORDER BY index_advantage DESC OPTION (RECOMPILE);
 
-- Look at last user seek time, number of user seeks 
-- to help determine source and importance
-- SQL Server is overly eager to add included columns, so beware
-- Do not just blindly add indexes that show up from this query!!!

This query shows you what SQL Server considers to be “missing indexes” ordered by a calculated column called index_advantage. The idea here is that anytime the SQL Server query optimizer determines that a particular index not present in the database would help reduce the cost of a query, it will note that fact. Over time, as your workload runs on your database server, you will likely see a growing number of proposed new indexes returned when you run this query. I strongly caution you to not get overly enthusiastic about creating new indexes based solely on the results of this query. Many people have proudly told me that they wrote a script that automatically creates every single index that SQL Server identifies in this query, which is a huge mistake!

Instead, you should consider a number of factors before you start adding new indexes to your tables. First, consider what type of workload you have and how volatile your table is. If you have an OLTP type of workload, with a lot of writes to your table, you should be much more hesitant about adding new indexes, as more indexes will slow down your insert/update/delete performance on that table. Second, you should look at the last_user_seek column to get an idea of whether this “missing” index would really affect your normal workload. If your last_user_seek is from a few seconds or a few minutes ago, it is more likely to be part of your normal workload. If it is from a few days or a few weeks ago, it is more likely to be from an ad hoc query or a reporting query, and I would be much less inclined to add that index. You should also look at the user_seeks column to get an idea of how many times SQL Server has determined it would need this index, along with the avg_user_impact and avg_total_user_cost columns to help assess how important the index really might be.

You should also consider your existing indexes on the table in question. In many cases this query will recommend a new index that is almost an exact duplicate of an existing index. You need to apply some judgment and common sense and consider your complete workload before you start adding new indexes based solely on the results of this query. Finally, you should be aware that if you make any kind of index change on a particular table, the missing index statistics for that table will be cleared out, and it will take some time (as your workload is running) for the missing index statistics to show any information for that table. To understand how this could bite you, suppose you had a table that needed three new indexes to help an important part of your normal workload. After a thorough analysis, you decide to add the first index. After that index is added, you run this query again, and no results are returned for the table in question. This might lead you to conclude that SQL Server does not need the other two indexes, which would probably be incorrect. You just need to wait for a period of time, depending on your workload, to see whether SQL Server really needs any more indexes on the table. You can use the query shown in Listing 15-45 along with this query to help zero in on which indexes are really needed on each table.
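
If you do conclude that a proposed index is worthwhile, the equality, inequality, and included columns from the missing index DMVs map onto a CREATE INDEX statement along these lines; all the names here are hypothetical:

-- Equality columns come first in the key, followed by inequality columns;
-- included columns go into the INCLUDE clause
CREATE NONCLUSTERED INDEX IX_YourTable_ColA_ColB
ON dbo.YourTable (ColA, ColB)
INCLUDE (ColC, ColD);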

Next, you will look for missing index warnings in the cached execution plans for stored procedures in this database, using the query shown in Listing 15-47.

LISTING 15-47: Missing index warnings for cached plans

-- Find missing index warnings for cached plans in the current database
-- Note: This query could take some time on a busy instance
SELECT TOP(25) OBJECT_NAME(objectid) AS [ObjectName],query_plan, 
cp.objtype, cp.usecounts
FROM sys.dm_exec_cached_plans AS cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE CAST(query_plan AS NVARCHAR(MAX)) LIKE N'%MissingIndex%'
AND dbid = DB_ID()
ORDER BY cp.usecounts DESC OPTION (RECOMPILE);
 
-- Helps you connect missing indexes to specific stored procedures or queries
-- This can help you decide whether to add them or not

This query returns information about cached execution plans that have “missing index” warnings. It gives you the stored procedure name, the query plan, and the use count for that cached execution plan. This can help you decide whether a particular “missing index” is really important or not. You should use this query along with the query shown in Listing 15-46 to help determine whether you should add any new indexes to a particular table.

Next, using the query shown in Listing 15-48, you can find out which tables and indexes are using the most space in the SQL Server buffer pool.

LISTING 15-48: Buffer usage by table and index

-- Breaks down buffers used by current database 
-- by object (table, index) in the buffer cache
SELECT OBJECT_NAME(p.[object_id]) AS [ObjectName], 
p.index_id, COUNT(*)/128 AS [Buffer size(MB)],  COUNT(*) AS [BufferCount], 
p.data_compression_desc AS [CompressionType]
FROM sys.allocation_units AS a WITH (NOLOCK)
INNER JOIN sys.dm_os_buffer_descriptors AS b WITH (NOLOCK)
ON a.allocation_unit_id = b.allocation_unit_id
INNER JOIN sys.partitions AS p WITH (NOLOCK)
ON a.container_id = p.hobt_id
WHERE b.database_id = CONVERT(int,DB_ID())
AND p.[object_id] > 100
GROUP BY p.[object_id], p.index_id, p.data_compression_desc
ORDER BY [BufferCount] DESC OPTION (RECOMPILE);
 
-- Tells you what tables and indexes are 
-- using the most memory in the buffer cache

This query indicates which indexes and tables in the current database are using the most memory in the SQL Server buffer pool. It also shows you whether the index is using any form of data compression. If you see an index that is using a large amount of space in the buffer pool, you should investigate whether that index might be a good candidate for SQL Server data compression, assuming that you have SQL Server 2008 or later Enterprise Edition.

An ideal data compression candidate would be a large, static table that has highly compressible data. In such a case, you might see as much as a 10:1 compression ratio, meaning the compressed index would take up far less space in the buffer pool, and in the data file on disk. In my experience, I have typically seen anywhere from 2:1 up to 4:1 for average compression ratios. A poor data compression candidate would be a smaller, highly volatile table containing data that does not compress very well. In that case, you would most likely be better off without using data compression.

Next, you will find out the size (in terms of row counts) and the data compression status of all the tables in this database, using the query shown in Listing 15-49.

LISTING 15-49: Table names, row counts, and compression status

-- Get Table names, row counts, and compression status 
-- for the clustered index or heap
SELECT OBJECT_NAME(object_id) AS [ObjectName], 
SUM(Rows) AS [RowCount], data_compression_desc AS [CompressionType]
FROM sys.partitions WITH (NOLOCK)
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
AND OBJECT_NAME(object_id) NOT LIKE N'sys%'
AND OBJECT_NAME(object_id) NOT LIKE N'queue_%' 
AND OBJECT_NAME(object_id) NOT LIKE N'filestream_tombstone%' 
AND OBJECT_NAME(object_id) NOT LIKE N'fulltext%'
AND OBJECT_NAME(object_id) NOT LIKE N'ifts_comp_fragment%'
AND OBJECT_NAME(object_id) NOT LIKE N'filetable_updates%'
GROUP BY object_id, data_compression_desc
ORDER BY SUM(Rows) DESC OPTION (RECOMPILE);
 
-- Gives you an idea of table sizes, and possible data compression opportunities

This query returns all your table sizes, including row count and data compression status (for the clustered index), ordered by row counts. It is a good idea to have a notion of how many millions or billions of rows are contained in the larger tables in your database. This is one indirect way of keeping tabs on the growth and activity of your database. Knowing the compression status of the clustered index of your largest tables is also very useful, as it might uncover some good candidates for data compression. As previously discussed, SQL Server data compression can be a huge win in many scenarios if you are able to take advantage of it with Enterprise Edition.

Next, using the query shown in Listing 15-50, you can find out the last time that statistics were updated for all indexes in the database.

LISTING 15-50: Last statistics update for all indexes

-- When were Statistics last updated on all indexes?
SELECT o.name, i.name AS [Index Name],STATS_DATE(i.[object_id], 
i.index_id) AS [Statistics Date], s.auto_created, 
s.no_recompute, s.user_created, st.row_count
FROM sys.objects AS o WITH (NOLOCK)
INNER JOIN sys.indexes AS i WITH (NOLOCK)
ON o.[object_id] = i.[object_id]
INNER JOIN sys.stats AS s WITH (NOLOCK)
ON i.[object_id] = s.[object_id] 
AND i.index_id = s.stats_id
INNER JOIN sys.dm_db_partition_stats AS st WITH (NOLOCK)
ON o.[object_id] = st.[object_id]
AND i.[index_id] = st.[index_id]
WHERE o.[type] = 'U'
ORDER BY STATS_DATE(i.[object_id], i.index_id) ASC OPTION (RECOMPILE);  
 
-- Helps discover possible problems with out-of-date statistics
-- Also gives you an idea which indexes are most active

This query returns the name and several other properties for every clustered and nonclustered index in your database, sorted by the date on which statistics on that index were last updated. This can help you track down performance problems caused by out-of-date statistics, which can lead the SQL Server Query Optimizer to choose a poorly performing execution plan. I like to use this query to discover whether I have old statistics on the more volatile and important tables in the database.

Unless you have a compelling reason not to, it is usually a very good idea to have SQL Server automatically create statistics and automatically update them as the data changes in your tables. Especially for OLTP workloads, I usually like to enable the Auto Update Statistics Asynchronously database setting, which allows the Query Optimizer to use existing statistics while new ones are being generated (instead of waiting for the new ones to be created). This can give you more predictable query performance instead of taking a big performance hit during a statistics update operation.


NOTE It is also a good practice to manually update statistics on a periodic basis as part of your regular database maintenance. Even with Auto Update Statistics enabled, statistics are not updated the moment data changes. To keep the updates from interfering with normal query workloads, the automatic update is triggered only after a certain threshold of data change has been reached. Performing periodic manual statistics updates ensures that you always have up-to-date statistics.
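
As a hedged illustration of both points, the database-level statistics settings and a periodic manual update look something like the following; the database and table names are placeholders:

-- Enable automatic statistics updates, and allow them to run asynchronously
ALTER DATABASE YourDatabaseName SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE YourDatabaseName SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Periodic manual maintenance: update statistics where rows have changed
EXEC sp_updatestats;

-- Or update statistics on one important, volatile table with a full scan
UPDATE STATISTICS dbo.YourVolatileTable WITH FULLSCAN;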

Next, using the query shown in Listing 15-51, you will find out which indexes in the current database have the most fragmentation.

LISTING 15-51: Fragmentation information for all indexes

-- Get fragmentation info for all indexes 
-- above a certain size in the current database 
-- Note: This could take some time on a very large database
SELECT DB_NAME(database_id) AS [Database Name], 
OBJECT_NAME(ps.OBJECT_ID) AS [Object Name], 
i.name AS [Index Name], ps.index_id, index_type_desc,
avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(),NULL, NULL, NULL ,'LIMITED') AS ps
INNER JOIN sys.indexes AS i WITH (NOLOCK)
ON ps.[object_id] = i.[object_id] 
AND ps.index_id = i.index_id
WHERE database_id = DB_ID()
AND page_count > 500
ORDER BY avg_fragmentation_in_percent DESC OPTION (RECOMPILE);
 
-- Helps determine whether you have fragmentation in your relational indexes
-- and how effective your index maintenance strategy is

This query returns every table and index in the current database, ordered by average fragmentation level. It filters out indexes that have fewer than 500 pages, as fragmentation in very small tables is not something you typically have to worry about. Depending on the size of your tables and indexes, and your hardware, this query could take some time to run. This query uses the LIMITED mode option (which is the default if no mode option is specified) when it runs, so it will return less information, but take less time to run than the DETAILED mode.

This query is useful because it can show you the overall condition of your indexes as far as fragmentation goes relatively quickly. Heavily fragmented indexes can reduce your I/O performance and your query performance for some types of queries. It can also increase the space required by your data files.

If you see indexes that have more than 10% fragmentation, you need to decide whether to reorganize them or simply rebuild them. Reorganizing an index is always an online operation, and it can be stopped at any time. It can take longer than simply rebuilding an index and it may not reduce the fragmentation as much as rebuilding the index will. Rebuilding an index can be either an online operation or an offline operation, depending on several factors. The first factor is whether you have SQL Server Standard Edition or SQL Server Enterprise Edition.

If you have Standard Edition, rebuilding an index is always an offline operation. If you have Enterprise Edition, your index rebuild operations can be online or offline depending on a few more factors. With SQL Server 2012, you can rebuild clustered indexes in online mode regardless of which data types your table contains. With earlier versions of SQL Server, you cannot rebuild a clustered index in online mode if your table has any LOB data types, such as nvarchar(max).
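
The two maintenance options discussed here come down to statements like the following, using hypothetical table and index names:

-- Reorganize is always an online operation and can be stopped at any time
ALTER INDEX IX_YourIndex ON dbo.YourTable REORGANIZE;

-- Rebuild is more thorough; ONLINE = ON requires Enterprise Edition
ALTER INDEX IX_YourIndex ON dbo.YourTable REBUILD WITH (ONLINE = ON);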

After you reorganize or rebuild indexes that are heavily fragmented, you may free up a considerable amount of space within your data file(s). The data file will still be the same size, but more free space will be available. This is a good thing! Strongly resist any urge you may have to shrink your data files to reclaim that disk space. Shrinking data files is a very resource-intensive operation that has the unfortunate side-effect of heavily fragmenting your indexes. Do not let your system administrator or SAN administrator talk you into shrinking data files or entire databases on a regular basis.

Finally, don’t make the common mistake of simply rebuilding all your indexes on a regular basis, whether they need it or not. This is a huge waste of resources on your database server. You can find many good index maintenance scripts on the Internet. One very well-regarded solution was developed and is maintained by Ola Hallengren; you can get it from http://ola.hallengren.com.

SUMMARY

In this chapter, you walked through a set of 51 queries that enable you to gather a wealth of useful information about your database server — starting at the hardware level, then at the instance level, and finally at the individual database level. For each query, you have read the underlying rationale for why you would want to run the query and what you can expect to discover. You should now be able to interpret the results of each query, and in many cases also be able to respond to them. For some queries, I provided a more extended discussion of my thoughts and experiences regarding best practices for the area covered by the query.

The purpose behind this detailed examination of queries is that as a database professional, you should be keenly aware of what is going on with your database servers — from the hardware and storage subsystem, to the instance configuration and health, and finally to the properties and performance of your most important individual databases. I believe that you should run this set of queries on every single database server and instance for which you are responsible, saving the results as a baseline. I have never failed to find problems or opportunities for improvement on a database server after running these queries. You should also be able to find and correct many issues in your environment based on what you discover from them.

After you get a baseline, and work on correcting the initial problems that you discover, get in the habit of periodically running these queries again to discover anything new that crops up. A busy database server is a very complex system. You will often find that after you make a change to alleviate a performance bottleneck, the bottleneck simply moves to another part of the system. Even if you have modern, high-performance hardware on a properly configured database instance, with a well-tuned database, you will see changes over time as your data volumes grow and your workload increases or simply changes. After reading this chapter, you have a very good set of tools that you can use to discover and address this inevitable change process.
