Chapter 9

Troubleshooting Methodology and Practices

WHAT’S IN THIS CHAPTER?

  • Methodology for troubleshooting
  • Ten steps to troubleshooting success
  • An approach to working on complex issues with teams

WROX.COM CODE DOWNLOADS FOR THIS CHAPTER

The wrox.com code downloads for this chapter are found at www.wrox.com/remtitle.cgi?isbn=1118177657 on the Download Code tab. The code is in the Chapter 9 download and individually named according to the names throughout the chapter.

INTRODUCTION

SQL Server is a complex product, often tightly integrated with mid-tier and end-user applications. Problems can be many and varied, and the scope of a database administrator (DBA) is broad, covering storage, servers, networking, applications, and meeting business requirements. DBAs often find themselves supporting third-party and in-house-developed applications with varying levels of maturity and robustness. There are no fixed requirements for Transact-SQL development, and application developers often overlook the performance and scalability of code they write. This chapter provides DBAs with guidance based on mature patterns for troubleshooting complex issues and complements the technical aspects of this book.

This chapter focuses on a data-driven methodology for troubleshooting SQL Server problems, consisting of the following three steps (the data collection and analysis may require multiple iterations until the problem’s cause is identified):

1. Define the problem.
2. Iterate through:
  • Data collection
  • Data analysis
3. Validate and implement resolution.

Investing time and effort to develop and refine a troubleshooting methodology helps improve the efficiency and speed with which you troubleshoot problems. Much like planning a trip, the route and endpoint may vary, but identifying the destination and developing an approach to planning an efficient route is a distinct and different skill from driving itself. As you plan subsequent journeys, you can refine your approach, becoming more adept at determining the quickest route and better at estimating the length of time it takes between breaks and optimal departure time.

Troubleshooting SQL Server problems is similar to planning a long car trip. Whereas the rest of this book focuses on how the engine works and the troubleshooting skills themselves, this chapter helps you develop a methodology for troubleshooting, which is every bit as important as those tangible skills.

The book provides one approach for troubleshooting, which you can use to develop or refine your own approach. As you do so, consider roles and responsibilities, communication, reporting, and seeking external help. Reaching a successful resolution can often be achieved by more than one route. Identifying the path of least resistance while achieving the goal is the hallmark of a database professional experienced in troubleshooting complex issues.

APPROACHING PROBLEMS

Developing a professional methodology for managing problems will lead to a less stressful time at work, help make work more rewarding, and differentiate you from others. Although SQL Server is a discrete technology, when problems occur there is often uncertainty about the root cause, and the problem scope is rarely well defined. As such, issues can be passed around support teams with little progress or ownership.

This section of the chapter describes a set of ten principles you can use in order to clearly and efficiently identify and resolve problems with SQL Server. Although many of the details here are not specific to SQL Server problems, they are good practices for troubleshooting many types of complex IT issues.

Ten Steps to Successful Troubleshooting

The following steps provide a detailed methodology for successful and efficient incident resolution. They intentionally separate identification of the root cause and issue resolution. These are different tasks, and many situations require equal (or greater) effort to identify the root cause of an issue versus actually fixing it. Indeed, the fix itself may be trivial, but knowing exactly which fix to make is completely dependent on accurately understanding the problem and its cause; therefore, accurate root cause diagnosis is vital.

To get in front of a complex issue — that is, understand it and resolve it — use the following ten steps:

1. Define the problem — Establish a clear problem statement. The objective is to capture in one or two sentences a summary of the technical problem and success criteria. A detailed explanation will likely be required later, but aim initially to create a concise summary for circulation to interested parties.
2. Ascertain the problem’s impact — The business stakeholders and sponsors often don’t want to know technical details. They want to know the operational and financial impact of the incident. This must be categorized and monetized to the furthest extent possible. For example, if you had a website outage, you should estimate the cost to the organization — e.g., $10,000/ hour. If degraded service is likely, how much will it cost in lost revenue or reputation? If the incident prevents employees from completing their work (e.g., call center workers are unproductive), this can be estimated by the cost of wages plus operational impact (e.g., $10/ hour for 50 call center employees plus any overtime to make callbacks).
3. Engage the correct resources — These could be internal or external. In many enterprise scenarios, it is necessary to formally engage internal resources from other disciplines, such as storage operations, application support, and incident management. There may be external suppliers or third parties who should be engaged, such as hardware manufacturers, software vendors, or implementation consultants. Ensure that all participants are briefed with the same problem description and have a good understanding of the success criteria.
4. Identify potential causes — Meet all necessary parties (physically or virtually) to share the problem description, its impact, and any troubleshooting steps already performed. Consider proposed options to mitigate the impact or work around the problem. Identify any possibility to minimize the immediate impact to the business while a long-term solution is sought.
5. Plan and coordinate tasks across teams — Develop a plan consisting of a number of hypotheses about scenarios that may cause or influence the problem. Seek to prove or disprove each hypothesis by assigning it to a team with the skills and experience necessary to reach a conclusion. The intention is to narrow the focus by eliminating components that are not causing the problem, until eventually the problem component is found. Iterate around this method until each hypothesis is proven or disproven.
6. Select a communication plan and review — Document the plan and agree on who will keep management, end users, and the technical team updated. Mutually agree on a time to reconvene (e.g., every two or four hours may be appropriate). In scenarios with geographically dispersed teams, maintaining an open conference call can assist troubleshooting, but it's still important to plan and execute regular reviews.
7. Identify root cause — After a number of iterations (each iteration should be isolated, repeatable, and narrow in scope), you will have disproved a number of hypotheses and, hopefully, proved one. Once the cause of the problem is understood, progress to the next step to find a fix.
8. Determine solution — This step involves identifying a resolution to the defined and understood cause of the problem.
9. Test and implement — Even if the problem does not exist in the test or pre-production environment, implement the fix there first. This involves making the identified change and confirming no undesired impact, then deploying to the production environment. If possible, ensure a rollback position and be prepared to invoke this plan if necessary.
10. Review — A post-mortem analysis helps prevent recurrence of this issue and similar issues in the future. It can also identify other vulnerable systems within the organization that should be fixed, and it refines the troubleshooting approach itself, ensuring it is as optimized and efficient as possible.

The ten steps outlined above, and described in more detail in the following sections, form a troubleshooting approach you can adapt and simplify as desired. Not all problems require full formal engagement, but adopting even an approximation of these disciplines, such as monetizing the impact of problems and defining a clear problem statement, can help you prioritize your activities.

Behavior and Attitude

In addition to employing a good troubleshooting approach, a positive attitude, combined with determination and persistence in identifying the root cause, definitely helps. A positive attitude leads to better-quality results and faster resolution, and it reduces stress for you and your co-workers during the troubleshooting process. Decomposing problems scientifically with a consistent approach is a proven and effective method, and many of these aspects are within your control.

The following behaviors and attitudes are characteristic of the most effective database professionals when troubleshooting complex problems:

  • Remain calm — Stay objective, no matter how urgent the problem. Project confidence and calmness to your peers, end users, and management, even if they show signs of stress or panic. This reassures them that you are in control and able to resolve the problem. These people are more likely to give you the time and space necessary to investigate and resolve the issue if they trust your capability.
  • Remember that problems are never random — Problems with computers happen for a reason. When you don’t understand the reason, the cause may seem random, but there is always an explanation. Intermittent or infrequent problems in particular appear random; seek to identify patterns or correlating events that could lead to the circumstances that cause the problem.
  • Avoid prejudice — Never assume that you know how to solve a problem until you have a problem description and have done some basic testing. It is not necessary to provide an instant answer; the correct answer with a short delay trumps a quick, inaccurate answer. This habit also builds your credibility with management as a reliable and capable engineer.
  • Avoid looking for fixes — Ensure that finding the cause is your first priority! The people around you will be pressing hard for a fix or an estimated time to fix. The fix is the goal, but you must first lay the foundation by understanding the cause.
  • Think ahead — Proactively consider potential blockers. If you may need to restore the database, start the tape retrieval process in parallel with troubleshooting. This reduces overall downtime and impact if you do need to revert to the backup.

Success Criteria

Once the problem has been defined, recognizing its resolution is usually relatively straightforward. Nonetheless, explicitly agreeing on a set of success criteria helps to structure the troubleshooting steps and provides a positive test case scenario. Otherwise, what constitutes problem resolution can be subjective.

With performance problems, for example, it can be difficult to reach a consensus about what constitutes good-enough performance, which can mean different things to different people. From a DBA’s perspective, it’s often the case that the first few optimizations realize the most performance gains, with each subsequent performance improvement harder to achieve — meaning more effort, more fundamental schema changes, and smaller incremental performance improvement. For this reason, it’s important to agree on the performance objective and when to stop tuning.

Unfortunately, it’s common to see an enterprise spend a lot of time troubleshooting numerous issues that have nothing to do with the main source of the problem. Avoid this by defining both the problem and the success criteria, and seeking agreement with the sponsor; that way, expectations are clear and understood by all parties.

Working with Stakeholders

Stakeholders are a group of people usually consisting of business management, IT management, owners, shareholders, and anyone with an interest in the success or failure of the organization. Most business stakeholders want problems resolved as fast as possible using the fewest possible resources, and managers often feel under pressure to provide answers to users, their superiors, and external stakeholders such as customers, investors, auditors, or the media.

When managers are not well informed or they don’t have confidence in the incident team, this can lead to the undesirable behavior of micro-management. These are the managers who hover, requesting constant updates and generally inhibiting the troubleshooting process. You can avoid this, however, by proactively handling an incident to ensure both that stakeholders have the information they need and that they receive regular updates.

Broadly speaking, managers look first for a solution, then the cause. Database professionals should first attempt to understand the cause, then identify a solution. These opposing approaches can lead to friction, so it’s important to recognize them and respect each other’s priorities.

To minimize friction with management, try enlisting their help by nominating a single spokesperson for the incident. Request that they communicate with stakeholders and anyone who isn’t directly involved in troubleshooting the problem. Agree on a schedule for providing updates and stick to that schedule to reduce distractions, such as requests for information. Identify one person to whom you will provide updates, letting that person communicate with anyone else who needs the information. If more than one person is directly involved in the technical aspects of troubleshooting, nominate just one technical person to talk to the management contact.

Managers can also help by gathering information to determine the problem’s real impact on the business. As a guideline, try to establish the following:

  • How severely is the system affected?
  • How many users cannot work?
  • Is money being lost? If so, quantify the amount.
  • What is the visibility of the issue?
  • Are external customers affected?
  • Could any regulatory or compliance obligations be breached?
  • How serious are the consequences if the problem persists?

Management can also be enlisted to identify mitigating factors. Are any options available to run a degraded service such as manual systems that enable some operations to continue? Encourage managers to generate ideas for a short-term tactical solution while the root cause is investigated and a resolution implemented.

Managers might also be helpful in engaging third parties, initially to make contact and open a dialog, and, in situations in which escalation is required, to engage the right resources to advance a solution. Each of these factors can be used to help shape the solution.

Service-Level Agreements

A service-level agreement (SLA) forms an agreement between IT and the business, or between an outsourcer and an organization. The SLA should define availability and performance metrics for key business applications, and often includes metrics for response and resolution times in the event of an incident. These agreements capture non-functional requirements and are useful for managing business expectations regarding application performance, availability, and incident response time.

Two terms commonly used in storage solution design can be borrowed and adapted to most other areas of IT and business agreements: recovery point objective (RPO) and recovery time objective (RTO). Both can be included within an SLA to govern the data loss and recovery period following an incident.

RTO refers to the amount of time a solution can be down before the system is recovered. This varies according to the type of failure — for example, in the event of a single server failure in a failover cluster, the RTO could reasonably be 1–2 minutes; in the event of a total site loss, it might reasonably be four hours. This RTO metric essentially governs how long IT has to restore service in the event of various types of failures.

RPO refers to how much data loss can be tolerated without impact to the business. In the SQL Server world this commonly determines the frequency of transaction log backups. If, for example, the RPO were five minutes, you would need to take log backups every five minutes to ensure a maximum data loss of the same duration. Combining these facets of an agreement, it would be fairly common for a DBA to agree to configure five-minute log backups, and log shipping to a second location with an RPO of 15 minutes and an RTO of four hours. This would mean bringing the disaster recovery location online within four hours and ensuring a maximum data loss duration of 15 minutes. Agreeing to these objectives ahead of time with the business is an important part of setting and managing expectations.
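To make the five-minute RPO concrete, such an objective is typically met by scheduling a transaction log backup job at that frequency. The following T-SQL is a minimal sketch of the statement such a job might run; the Sales database name and backup share are placeholders, not part of any real configuration:

```sql
-- Hypothetical example: the log backup statement a SQL Server Agent job,
-- scheduled every five minutes, could run to meet a five-minute RPO.
-- Database name and backup path are placeholders.
DECLARE @file nvarchar(260) =
    N'\\BackupShare\Sales\Sales_log_'
    + CONVERT(nvarchar(8), GETDATE(), 112)                         -- yyyymmdd
    + N'_'
    + REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), N':', N'')    -- hhmmss
    + N'.trn';

BACKUP LOG [Sales]
    TO DISK = @file
    WITH CHECKSUM, COMPRESSION;
```

Restoring the chain of five-minute log backups at the disaster recovery site is what keeps the realized data loss within the agreed RPO window.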

Engaging External Help

It is not always necessary or possible to solve a problem without external assistance, particularly when knowledge, experience, or time is lacking. Knowing whom to call, and when, is an important aspect of successful troubleshooting. Often, the objection to hiring a consultant, specialist, or support provider, or to opening a support request with Microsoft Customer Service and Support (CSS), is financial. In reality, many problem scenarios can be much more expensive to resolve without external help. The time, resources, and opportunity costs of taking a long time to solve a problem, solving it in an inappropriate or inefficient way, or not solving it at all can be high. Ensure that all these factors are taken into consideration when deciding if and when to engage outside help.

In some situations, it may be cheaper to engage help immediately — for example, when the day rate for a consultant is half the cost of the revenue lost per day. Similarly, it may be most beneficial to engage a specialist for problems related to rarely used features, as an organization is unlikely to have deep in-house expertise with such features.

Besides cost, another barrier to enlisting external help is a desire to be perceived by the organization as the expert in a particular feature or technology. This can be quite short-sighted, particularly if an incident is causing revenue or reputation damage to the organization. Knowing when to ask for help is a valuable trait, and engaging an external resource also provides the opportunity to learn and increase the value you deliver to the business. Using external resources also provides a firsthand opportunity to see different approaches to troubleshooting, which can be more valuable than the technical skills themselves.

Certain types of problems are well suited to outside help; database corruption is one example. Corruption can be a serious problem, many urban legends and much "common wisdom" surround the best approach to resolving it, and mistakes can easily make a problem worse without addressing its underlying cause.
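Before engaging outside help for suspected corruption, it is reasonable to capture the full extent of the damage so the specialist has accurate information to work from. A consistency check that reports all errors without attempting any repair is a safe starting point; the database name here is a placeholder:

```sql
-- Report every corruption error found, without attempting repair.
-- Run against the suspect database (placeholder name used here).
DBCC CHECKDB (N'Sales')
    WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Save the complete output; the specific error numbers and affected pages it reports are exactly what a corruption specialist or CSS engineer will ask for first.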

If you do engage support, whether from CSS, a consultant, or other outside assistance, you will need to provide some basic information. Consider the following as a starting point:

  • Environment overview (network diagram, application architecture)
  • Problem statement and steps to reproduce
  • Success criteria
  • Key stakeholders
  • Steps already taken to resolve issue and outcome
  • Windows System and Application Event Logs and SQL Server Error Logs
  • Profiler trace containing the problem (if possible)
  • SQLDiag output if it will add value

DEFINING THE PROBLEM

Investing time to understand the problem and application environment often leads to a higher-quality and faster problem resolution. While it is tempting to focus on immediately resolving the problem, complex problems are rarely resolved until causes are fully understood. A thorough understanding of the configuration, patterns, and characteristics of the problem will position you well for resolving the problem.

To learn about the problem, you need to identify the major software and hardware components, review the impact of recent changes, and understand the specific circumstances that cause the problem condition to occur. The following section provides a framework for these aspects. Decomposing the problem into constituent components will help isolate the cause of the problem and identify bottlenecks.

Guidelines for Identifying the Problem

Use the following guidelines to fully comprehend the exact problem you are facing:

  • Construct a diagram of the end-to-end application environment.
  • Obtain visibility of major hardware components, paying special attention to components that may complicate troubleshooting, such as geographically dispersed configurations, local caching, and network load balancing (NLB). Network load balancers can mask a problem with an individual server because the problem server may only serve traffic for 25% of requests (assuming four active servers); therefore, occurrences of the problem can appear random or inconsistent.
  • Gather all relevant logs to a single location:
    • Windows Application and System Event logs
    • SQL Server Error Logs
    • Dump files
    • Application logs
  • Construct a timeline of activities and events leading up to the failure.
  • Retrieve change logs, including any information relating to changes before the problem occurred and any changes or steps carried out in an attempt to resolve the problem.
  • Understand the steps necessary to reproduce the problem. Ensure that you have a repeatable process to reproduce it and, if possible, validate that process in a test environment.
  • Agree on success criteria. Where the problem is repeatable, this is easy. With intermittent problems this can be more difficult, although agreeing to a period of non-occurrence may be valid (e.g., before troubleshooting the problem occurred daily, so if one week passes without the problem you can consider the issue resolved).
  • Understand log context (e.g., client, middle tier, or SQL Server). Pay attention to the time zone on each machine; it may be necessary to normalize time zones when correlating data from multiple sources.
  • Understand the rhythm of the business. This enables you to determine whether the current workload is typical, a seasonal spike, or an unusual pattern.
  • Capture any situations in which the problem does not occur. Understanding these scenarios can also be useful in refining the scope of the problem.

Part of understanding the problem is understanding why the issue is occurring now. If this is a new system, perhaps you haven’t seen this level of load on the system before. If it is an existing system, review your change control documents to see what has changed recently on the system. Any change, even if seemingly unrelated, should be reviewed. This can mean any alteration, no matter how small, such as a Windows or SQL Server patch, a new policy or removed permission, a configuration option, or an application or database schema change.

Isolating the Problem

Are you certain the problem is related to the database tier? How do you know it’s a database problem? Many problems begin life as an application behavior or performance issue, and there may be other software components or interactions that could affect the database platform.

Once you have a good understanding of the problem, decompose it into manageable elements; isolating each component enables you to focus on the problem area fast. The intention of this approach is to eliminate or incriminate each area of the environment. Approach troubleshooting as a series of mini-experiments, each looking to prove or disprove that a specific feature or component is functioning correctly.

The following list describes what to look for when troubleshooting each major problem category:

  • Connectivity issues — Does the problem only occur with one protocol, such as named pipes or TCP/IP? Are some applications, users, client workstations, or subnets able to connect while others cannot? Does the problem occur only with double hops, whereas direct connections work? Will local connections work but remote connections fail? Is the problem related to name resolution (does ping by name work)? Could network routing be the issue (check ping or tracert)? Can you connect using the dedicated administrator connection (DAC)? Try to connect with SQL Authentication as well as using a domain account.
  • Performance issues — For a performance problem you need to determine if the problem is on the client, the middle tier, the server on which SQL Server runs, or the network. If it is an application performance problem, it is essential to establish how much time is consumed in the database tier; for example, if application response time is 10 seconds, is 1 second or 9 seconds consumed by the database response time? Capture slow-running stored procedures, execute these directly on the server, and confirm execution times.
  • Hardware bottlenecks — Identify resource contention around disk, CPU, network, or memory. Using wait stats analysis and the tools discussed in this book, identify the top N worst queries by contended resource (disk, memory, or CPU) and investigate further.
  • SQL Server issues — As well as hardware contention, SQL Server has finite internal resources, such as locks, latches, worker threads, and shared resources such as tempdb. Isolate these problems with wait stats analysis and DMVs, then investigate queries that are causing the resource consumption.
  • Compilation issues — If possible, identify one user query that is slow and execute it directly against the server. A common cause is a sub-optimal query plan resulting from missing or outdated statistics, or from inefficient indexes. Analyze the plan cache to help identify this problem.
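Several of the categories above call for wait statistics analysis. A minimal sketch of the technique is to rank the instance-level waits accumulated since the last restart, filtering out a few wait types that are normally benign (the exclusion list below is illustrative, not exhaustive):

```sql
-- Top resource waits since instance restart, ignoring common idle waits.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    signal_wait_time_ms   -- time spent waiting for CPU after the resource freed
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'WAITFOR',
                        N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT',
                        N'XE_DISPATCHER_WAIT', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```

The dominant wait type indicates which resource (disk, memory, CPU, locks, or latches) to investigate first in the sections that follow.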

Performance Bottlenecks

Performance troubleshooting involves identifying the bottleneck, either live on the system or via a post-mortem review of data collected while the problem occurred. This is often an iterative process: fixing one bottleneck frequently uncovers another, so each cycle identifies and resolves the largest remaining bottleneck until the problem is resolved.

Memory

If you identify a SQL Server memory bottleneck, you have several options to improve performance. The first is to increase physical memory or change the memory configuration. Another approach is to review queries and optimize performance to consume less memory.

If you decide to increase the memory available to SQL Server, you could consider adding more physical memory or increasing the memory assignment for virtual machines (VMs). Improving the use of existing memory without adding more is often more scalable and yields better results. While x86 (32-bit) systems are becoming less common, if you are running SQL Server 2005 or 2008 on a 32-bit system or VM, consider using Address Windowing Extensions (AWE) or /3GB to increase the buffer pool available to SQL Server (the AWE feature was discontinued in SQL Server 2012). However, if you do see memory contention on an x86 server, consider a plan to migrate to an x64 system to resolve the issue; the x64 platform provides increased virtual memory and better memory management.

Aside from physical memory and server configuration, significant performance gains can be made through query tuning to reduce memory requirements. Identify queries that require significant memory grants, such as sorts or hashes, and review the query plans for these scenarios. Try to identify better indexes, and avoid table scans and other operations that force a large number of rows to be read from disk and manipulated in memory.
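To find the memory-hungry queries described above, the memory grants DMV shows which active requests hold, or are still waiting for, large workspace grants for sorts and hashes. A sketch of the idea:

```sql
-- Active queries and their workspace memory grants (sorts and hashes).
SELECT
    mg.session_id,
    mg.requested_memory_kb,
    mg.granted_memory_kb,
    mg.grant_time,          -- NULL while the query is still queued for memory
    t.text AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
ORDER BY mg.requested_memory_kb DESC;
```

Queries appearing here with NULL grant_time are waiting on memory (typically surfacing as RESOURCE_SEMAPHORE waits) and are strong candidates for the index and query tuning just described.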

CPU

CPU problems can manifest as sustained high utilization or occasional spikes. Occasional spikes, especially on a server with a small number of CPUs, can often be safely ignored. Wait statistics record the resource SQL Server or a query is waiting on; capturing this information is a useful way to understand resource bottlenecks and to identify whether CPU contention is the cause of performance problems. Consider server build and configuration options to improve CPU performance, such as increasing the number and speed of CPU cores. In terms of configuration options, review the maximum degree of parallelism to ensure it is optimal for the intended workload.

In many situations, overall performance may be acceptable while the server demonstrates high CPU. As with memory, once you have established CPU is the dominant wait type, identify the top 10 worst-performing queries by CPU and then work through each of these in turn. Look at the query execution plan and identify expensive CPU operations, such as hash joins, sorts, and computed columns. Look for opportunities to reduce CPU workload with new indexes, consolidated indexes, XML indexes, or to improve query design.
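The "top 10 by CPU" list mentioned above can be drawn from the plan cache. A minimal sketch, ranking cached queries by total worker (CPU) time since they entered the cache:

```sql
-- Top 10 cached queries by total CPU (worker) time.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    (qs.total_worker_time / qs.execution_count) / 1000 AS avg_cpu_ms,
    t.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.total_worker_time DESC;
```

Note that these figures reset when a plan leaves the cache or the instance restarts, so interpret them relative to uptime.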

Storage I/O

Storage input/output (I/O) is typically the slowest resource within a server (memory and CPU are orders of magnitude quicker). Therefore, optimizing the storage solution design and configuration (ensuring the solution performs optimally) and being considerate with I/O requests (making fewer of them) are both essential to achieving scalable systems with good performance. Review the PerfMon counters Average Disk Sec/Read and Average Disk Sec/Write to verify that the time to complete a read or write is ideally below 20 milliseconds for OLTP systems (higher may be acceptable for decision support systems). Generally speaking, if storage performs more slowly than this, database performance will be affected. When reviewing storage performance, consider the end-to-end solution. The following elements may affect performance:

  • RAID levels
  • Disk types (enterprise flash disk, SCSI)
  • Dedicated or shared disk arrays
  • Connectivity (InfiniBand, Fibre Channel, iSCSI)
  • HBA cache and queue settings
  • HBA load balancing policy (active/active vs. active/passive)
  • NTFS cluster size
  • Layout and isolation of data, index, log, and tempdb files
  • Storage cache and controllers policy

In addition to ensuring optimal storage performance, be smart with I/O and ensure that the database is not making unnecessary requests. Reviewing and optimizing a query plan to eliminate index scans and replace them with seeks can often deliver an order of magnitude benefit in I/O reduction. It is common to overwhelm the storage solution with inefficient queries, saturating controllers and cache on the storage array.

Reduce I/O workload by improving indexes for more efficient access, make sure statistics are current, tune or increase memory to improve cache performance, or alter queries to avoid unnecessary I/O. Rationalize and consolidate indexes to minimize the overhead of index maintenance. Use Profiler or DMVs to identify the worst-performing queries by reads and writes. In addition, use STATISTICS IO to identify batches within a query that contain high logical I/Os. Usually, identifying the table or view that has the highest number of logical I/Os is sufficient to identify the table or view requiring optimization.
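Alongside the per-query view from Profiler and STATISTICS IO, the virtual file stats DMV gives the storage-side view: average read and write latency per database file since the instance started. A sketch:

```sql
-- Average read/write latency (ms) per database file since instance restart.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id
ORDER BY avg_read_ms DESC;
```

Files averaging well above the 20-millisecond guideline noted earlier point either to a storage configuration problem or to a workload that is issuing far more I/O than necessary.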

Network

Network bottlenecks can look like SQL Server performance problems. When the client cannot receive query results as fast as SQL Server can send them, SQL Server itself can appear slow. Often a particular function within an application is described as slow; in this case, try to determine the database interaction behind that functionality.

SQL Server Profiler can identify which stored procedures, functions, and queries are executed when the application feature is accessed. Sometimes this shows that each query executes quickly, but either a great many queries are executed or there is a large delay between the calls to each query. The latter case usually indicates that the performance problem lies somewhere outside SQL Server.
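
One quick check for whether the client or network is the slow consumer is the ASYNC_NETWORK_IO wait type, which accumulates while SQL Server waits for a client to fetch result sets. A sketch against sys.dm_os_wait_stats (figures are cumulative since the instance started, so compare snapshots over time rather than reading the raw totals):

```sql
-- High, steadily growing ASYNC_NETWORK_IO waits suggest the client
-- application or the network, not SQL Server, is the bottleneck.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'ASYNC_NETWORK_IO';
```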


CONSIDER DISABLING TCP CHIMNEY
TCP Chimney is a network interface card (NIC) technology that by default allows servers to offload some TCP workload to the network card itself. This works well on desktop PCs and application servers, but database servers often transfer large amounts of data to clients.
In this scenario, the offload activity may overwhelm the NIC, and the processing capability of the network card can become a bottleneck. Disable TCP offloading using the NETSH command utility (netsh int tcp set global chimney=disabled) and the NIC drivers.

If you can narrow the problem down to a single stored procedure as the main contributor, break that procedure down into individual queries. Often a single query within the procedure accounts for most of the cost; focus your tuning and optimization effort there.
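
Breaking a procedure down this way can be done with the session options STATISTICS IO and STATISTICS TIME, which report logical reads and CPU/elapsed time per statement. A minimal sketch, where usp_SuspectProcedure is a hypothetical procedure name standing in for the one you are investigating:

```sql
-- Report per-statement I/O and timing to the Messages tab
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.usp_SuspectProcedure;  -- hypothetical: substitute your procedure

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The statement with the highest logical reads or CPU time in the output is usually the one worth tuning first.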

DATA COLLECTION

When the problem is defined and well understood, and the success criteria have been agreed upon, the next step is to gather data. What data you should gather depends on the problem and what (if any) work has already been completed.

It is critical that decisions about remedial action are based on data. Decisions or recommendations without the foundation of empirical data are simply guesses or assumptions. Gathering data is an iterative process that may require several iterations before the complete problem is captured and a conclusive decision can be reached. As mentioned earlier, ensure that stakeholders understand that sufficient data will be required ahead of any recommendations and problem resolution. These stakeholders must also understand the value and long-term benefits of quantitative analysis and evidence-based decisions. Clearly explaining your methodology helps to win their confidence in the process and its outcome.

Because data collection is so important in identifying the root cause, the following section provides an approach to it, with specific guidance to ensure the data collection objectives are met.

Focused Data Collection

Only rarely is complete fault information provided when an incident is escalated. The nature of databases means serverwide problems are immediately obvious, and support teams are notified rapidly; high-impact database problems reach support teams by phone or walk-up much faster than automated alerts or formal support-ticket escalation. Typically, escalated support cases contain insufficient data to make any kind of decision, and further analysis is required before remediation can begin. In the early minutes (sometimes hours) of these incidents, information can be vague while the cause of the problem is unknown. At this point, the scope of the issue is often the complete solution, encompassing all of its components.

During this time, it can be useful to adopt a top-down approach to troubleshooting by starting with Performance Monitor (PerfMon). PerfMon has the advantage of being a serverwide diagnostics tool, and it can be useful for implicating or exonerating the database tier as the root cause of the problem.

The divide-and-conquer method is especially useful when troubleshooting performance problems. This approach takes the complete end-to-end application environment and selects a midpoint between the client and the database server to determine whether the performance problem exists at that point. Based on the outcome, you can then focus on the problem half and iterate through it until the problem component is identified. This approach can work particularly well with the database tier, as calling a number of stored procedures to validate database functionality and responsiveness provides a useful midpoint test for implicating or absolving SQL Server as the cause of the performance problem.
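
Such a responsiveness check at the database tier can be as simple as timing a representative procedure call directly on the server. A sketch, where usp_GetOrderSummary is a hypothetical procedure representing a slow application feature; if it completes quickly when called directly, the bottleneck likely lies above the database tier:

```sql
-- Time a representative procedure call directly against SQL Server
DECLARE @start DATETIME2 = SYSDATETIME();

EXEC dbo.usp_GetOrderSummary;  -- hypothetical: substitute a procedure
                               -- behind the slow application feature

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS elapsed_ms;
```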

Understanding Data Gathering

Data gathering is a balance between collecting sufficient information to capture the problem and not collecting so much data that the collection process itself affects system performance or there is simply too much data to analyze efficiently.


BLACK BOX TRACING
Consider a scenario with an intermittent problem for which there is no clear pattern to occurrences. Black box tracing creates a server-side trace, writing trace data to a circular log file that contains data for a specified time period (e.g., 1 hour or 4 hours). If the problem occurs again and the trace is stopped before the occurrence is overwritten in the log (this could be automated), the trace will contain the problem. You can also set up something similar with Extended Events (see Chapter 13).
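
An Extended Events version of this idea can be sketched as follows: a lightweight, always-on session writing to an in-memory ring buffer, which you can query after the problem occurs. This is a minimal illustration (capturing only high-severity errors); the session name and the events captured are choices you would adapt to the problem at hand:

```sql
-- A minimal "black box"-style Extended Events session using a
-- circular in-memory ring buffer target
CREATE EVENT SESSION [BlackBox] ON SERVER
ADD EVENT sqlserver.error_reported (
    WHERE severity >= 16)                          -- user-correctable errors and above
ADD TARGET package0.ring_buffer (SET max_memory = 4096)  -- KB of buffered events
WITH (STARTUP_STATE = ON);                         -- start with the instance

ALTER EVENT SESSION [BlackBox] ON SERVER STATE = START;
```

Because the ring buffer is circular, old events are overwritten automatically, keeping the monitoring footprint small.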

If the problem can be reproduced, it is much easier to collect data and rapidly refine the scope. If it occurs in a predictable pattern, it is usually possible to restrict data collection to a short period of time and gather all the necessary data.

Conversely, if the problem happens infrequently or without pattern, a different strategy is required. Starting data capture only when the problem occurs often isn't enough, because events occurring just before the problem starts may be important. Therefore, consider using a black box circular trace to enable a continuous lightweight capture that can be stopped when the problem occurs.

Tools and Utilities

The following list summarizes some of the most commonly used data collection tools and analysis utilities. Subsequent chapters contain additional details covering when and how to use each of these tools:

  • PerfMon — Performance Monitor (PerfMon) ships with Windows and can be used to gather information on server resources and services. It can track serverwide information such as CPU and memory usage, I/O statistics, and network activity. Several SQL Server-specific counters can be useful for various troubleshooting and monitoring scenarios.
  • Profiler — SQL Server Profiler can be used to capture statement-level information from within the database engine.
  • XEvents — Extended Events are a lightweight event-driven data-capture feature that can assist troubleshooting while minimizing the monitoring footprint.
  • PSSDiag — This is a wrapper around SQLDiag, PerfMon, and other add-ins. SQLDiag can do anything that PSSDiag can do, but it is not pre-packaged with all the add-ins that PSSDiag may be configured with. PSSDiag is usually configured by a Microsoft support engineer and sent to help troubleshoot a specific problem. It is specific to a certain version of SQL Server, and the add-ins are usually architecture specific (x86 or x64).
  • SQLDiag — SQLDiag ships with SQL Server. It can be used to gather basic environmental information such as the SQL Server Error Logs, Event Logs, and SQL Server configuration settings. It can also be used to capture time-synchronized Profiler and PerfMon information (see Chapter 11).
  • Event Logs (Application, System, and Security) — These logs are often useful, displaying which errors, warnings, and informational messages have occurred in the recent past.
  • Application logs — If the application instrumentation includes writing Error Log output or diagnostic information, these logs can be useful for identifying the cause of a problem.
  • User dumps — If you see an exception in the SQL Server Error Logs, you should also see a mini-dump file with the extension .mdmp. This can be used by Microsoft CSS to help determine why the exception occurred.
  • NetMon — This is a network sniffer that is used to look at data as it is sent over the network. It is often used to diagnose connectivity or Kerberos problems.
  • CMS — Central Management Server is a feature of SQL Server Management Studio that provides a method of storing your SQL Server registrations in a central database. It can be useful in a troubleshooting scenario because you don’t have to remember specific SQL Server instance names — they are already stored in CMS. In addition, you can execute commands against groups of registered servers at once.
  • Management Data Warehouse — This SQL Server Management Studio tool is used for performance trending. You can use it to collect and consolidate various data over time, which you can then analyze to see how performance has changed.
  • Policy-Based Management (PBM) — PBM can be used to validate whether predetermined standards have been followed. Some policies can prevent certain actions from ever occurring.

DATA ANALYSIS

After data collection, data analysis is the second iterative activity required to identify the problem’s root cause. It may be necessary to perform data analysis multiple times for a single problem, including data from multiple sources and formats. The typical starting point is to review PerfMon output to identify bottlenecks or contention with I/O, CPU, memory, or the network resources. Often, once the main bottleneck is resolved, another will appear. At this point it is important to understand the success criteria and SLAs to have a clear goal and know when to stop troubleshooting.

The following list describes several tools and utilities you can use to analyze the data collected:

  • SQL Nexus — This tool, available from www.codeplex.com, imports data into a SQL Server database and generates reports showing the most common performance issues based on that data. It takes Profiler, PerfMon, and PerfStats as input. For example, it can show all statements captured in the trace that were blocked for more than 30 seconds. SQL Nexus is commonly used to show the stored procedures or queries that had the highest duration, were executed most frequently, or used the highest cumulative CPU. With stored procedures, it is possible to drill down to get more specific information on duration and other statistics about individual queries within the stored procedure. SQL Nexus is a great tool for taking a large amount of data and quickly locating pain points that require more thorough examination. You can read all about it in Chapter 12.
  • Profiler — Profiler can be used to replay statements if the right events were captured. A Replay template built into Profiler can be used to capture those events. This is useful to test the same queries repeatedly against a database to which changes have been made, such as modifying indexes and altering file layout. The impact/benefits of these changes can be measured relative to the overall workload using this consistent set of replay workload. Profiler can also be useful for troubleshooting security and connectivity problems.
  • PerfMon — Performance Monitor can be used to isolate CPU, memory, I/O, or network bottlenecks. Another use is to help determine whether SQL Server is the victim of another process (such as anti-virus apps or device drivers), consuming resources such that SQL Server performance is affected.
  • Database Tuning Advisor (DTA) — The DTA can take as input either an individual query or an entire trace workload. It makes recommendations for possible index or partitioning changes that can be tested. Never implement suggestions from DTA without thorough review in the context of the total workload. Analysis with DTA is most effective when a complete workload can be captured in a trace and processed.
  • SQL Server Data Tools — Provides an integrated environment for developers to create, edit, and deploy database schemas. A full discussion of the product is beyond the scope of this section.
  • Debugger — It is possible to debug stored procedures from SQL Server Management Studio beginning with SQL Server 2008.

COUNTING THE COST OF PERFORMANCE PROBLEMS
Performance problems are rarely caused by a single large query executing on a server. More often, the query with the highest cumulative cost is a relatively short, fast query that is executed thousands of times per minute. A stored procedure that takes 200ms to execute but is called thousands of times per minute has a greater impact on server performance than a single query that takes 1.5 seconds to complete. As such, focus your attention on the queries with the highest cumulative cost.
When analyzing data, use aggregates to consider the total cumulative cost (duration, CPU, reads/writes, etc.) rather than simply identifying the single longest-running query. You can use the Performance Dashboard reports or SQL Nexus to identify these queries.
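
This aggregation can also be done directly from the plan cache. The following sketch against sys.dm_exec_query_stats orders by cumulative CPU rather than per-execution cost, so a cheap but very frequently executed query surfaces ahead of a single expensive one:

```sql
-- Top 10 cached batches by cumulative CPU (total_worker_time is in
-- microseconds; divide by 1000 for milliseconds)
SELECT TOP (10)
       qs.execution_count,
       qs.total_worker_time / 1000 AS total_cpu_ms,
       (qs.total_worker_time / qs.execution_count) / 1000 AS avg_cpu_ms,
       st.text AS batch_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;  -- cumulative, not per-execution, cost
```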

VALIDATING AND IMPLEMENTING RESOLUTION

Once the solution has been identified, it should be validated through testing and implemented in production. This process should be as controlled and disciplined as the iterations of collecting and analyzing data. A production problem does not justify a cavalier attitude toward production changes, and professionalism must be maintained even under pressure.

Validating Changes

Changes should always be made in a test environment prior to production. In an ideal scenario, the problem can be reproduced in the test environment, which provides an opportunity to confirm, or validate, that the fix has the desired impact. It is also important to carry out confidence tests to ensure that the change has no undesired impact.

Testing Changes in Isolation

If possible, test each change in isolation. Changing several settings at once can make it harder to identify which change resolved the problem or caused other issues. In addition, it can be harder to roll back multiple changes than single, individual changes. Ensure that you have a thorough understanding of the consequences of any change, including rollback options.

Implementing Resolution

The final step is to implement the resolution in production. Ensure that the change is documented and any impact (such as service restarts) communicated. Note the behavior of the database or application before and after the change, as well as exactly what change was made. Ensure that the success criteria are met once the resolution is implemented, and share your results with the stakeholders.

Once the resolution is implemented and the solution is stabilized, carry out post-mortem analysis and ensure that the root causes are communicated to relevant parties. Identify any other vulnerable systems within the organization and communicate any lessons learned that may help you avoid a recurrence of similar problems in the future.

SUMMARY

Confidence and control are critical to successful troubleshooting. Developing a methodology that can be consistently applied and refined is a valuable skill that enhances your value to an organization.

Your attitude, knowledge of when and whom to ask for help, and ability to communicate with management and stakeholders are essential skills. In many job roles these skills are as important as technical skills to a successful career.

It doesn’t matter if you are investigating a performance problem, resolving a security issue, or fixing a connectivity problem. Use each scenario as an opportunity to implement and refine your troubleshooting approach.

Define the problem, understand the exact circumstances under which the problem manifests itself, and identify the problem’s potential impact on the organization. Consider the end-to-end application, hardware, and software scenario. Use the divide-and-conquer technique to isolate the problem, eliminating or incriminating each component of the problem until you find the root cause. After the cause is understood, develop a fix to resolve the issue, and then test it to validate that it doesn’t cause any undesirable consequences before implementing it in production.
