Before we dive into the specifics of the CLR and .NET, we need to understand performance measurement in general, as well as the many tools available to us. You are only as powerful as the tools in your arsenal, and this chapter attempts to give you a solid grounding and set the stage for many of the tools that will be discussed throughout the book.
Before deciding what to measure, you need to determine a set of performance requirements. The requirements should be general enough to not prescribe a specific implementation, but specific enough to be measurable. They need to be grounded in reality, even if you do not know how to achieve them yet. These requirements will, in turn, drive which metrics you need to collect. Before collecting numbers, you need to know what you intend to measure. This sounds obvious, but it is actually a lot more involved than you may think. Consider memory. You obviously want to measure memory usage and minimize it. But which kind of memory? Private working set? Commit size? Paged pool? Peak working set? .NET heap size? Large object heap size? Individual processor heaps to ensure they are balanced? Some other variant? For tracking memory usage over time, do you want the average over an hour, or the peak? Does memory usage correlate with processing load size? As you can see, there are easily a dozen or more metrics just for the concept of memory alone. And we have not even touched the concept of private heaps or profiling the application to see what kinds of objects are using memory!
Be as specific as possible when describing what you want to measure.
Story: In one large server application I was responsible for, we tracked its private bytes (see the section on Performance Counters in this chapter for more information about various types of memory measurement) as a critical metric and used this number to decide when we needed to do things like restart the process before beginning a large, memory-intensive operation. It turned out that quite a large amount of those “private bytes” were actually paged out over time and not contributing to the memory load on the system, which is what we were really concerned with. We changed our system to measure the working set instead. This had the benefit of “reducing” our memory usage by a few gigabytes. (As I said, this was a rather large application.)
Once you have decided what you are going to measure, come up with specific goals for each of those metrics. Early in development, these goals may be quite malleable, even unrealistic, but should still be based on the top-level requirements. The point at the beginning is not necessarily to meet the goals, but to force you to build a system that automatically measures you against those goals.
Your goals should be quantifiable. A high-level goal for your program might state that it should be “fast.” Of course it should. That is not a very good metric because “fast” is subjective and there is no well-defined way to know you are meeting that goal. You must be able to assign a number to this goal and be able to measure it.
Bad: “The user interface should be responsive.”
Good: “No operation may block the UI thread for more than 20 milliseconds.”
However, just being quantifiable is not good enough either. You need to be very specific, as we saw in the memory example earlier.
Bad: “Memory should be less than 1 GB.”
Good: “Working set memory usage should never exceed 1 GB during peak load of 100 queries per second.”
The second version of that goal gives a very specific circumstance that determines whether you are meeting your goal. In fact, it suggests a good test case.
Another major determining factor in what your goals should be is the kind of application you are writing. A user interface program must at all costs remain responsive on the UI thread, whatever else it does. A server program handling dozens, hundreds, or even thousands of requests per second must be incredibly efficient in handling I/O and synchronization to ensure maximum throughput and keep the CPU utilization high. You design a server of this type in a completely different way than other programs. It is very difficult to fix a poorly written application retroactively if it has a fundamentally flawed architecture from an efficiency perspective.
Capacity planning is also important. A useful exercise while designing your system and planning performance measurement is to consider what the optimal theoretical performance of your system is. If you could eliminate all overhead like garbage collection, JIT, thread interrupts, or whatever you deem is overhead in your application, then what is left to process the actual work? What are the theoretical limits that you can think of, in terms of workload, memory usage, CPU usage, and internal synchronization? This often depends on the hardware and OS you are running on. For example, if you have a 16-processor server with 64 GB of RAM with two 10 GB network links, then you have an idea of your parallelism threshold, how much data you can store in memory, and how much you can push over the wire every second. It will help you plan how many machines of this type you will need if one is not enough.
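The hardware numbers in that example translate into rough ceilings with back-of-envelope arithmetic. Here is a sketch (in Python, purely for illustration; it assumes the two links are 10-gigabit Ethernet and ignores all protocol overhead):

```python
processors = 16
ram_gb = 64
links = 2
link_gbps = 10   # assumption: each link is 10-gigabit Ethernet

# Theoretical ceilings only -- real systems achieve less.
max_truly_concurrent_threads = processors
wire_bytes_per_sec = links * link_gbps * 1_000_000_000 / 8  # bits -> bytes

print(max_truly_concurrent_threads)          # -> 16
print(wire_bytes_per_sec / 1_000_000_000)    # -> 2.5 (GB per second)
```

Even this crude calculation tells you something actionable: if your workload needs to move more than about 2.5 GB per second, one machine of this class will never be enough, no matter how well-tuned the code is.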
You have likely heard the phrase, coined by Donald Knuth, “Premature optimization is the root of all evil.” The context of the quote is in determining which areas of your program are actually important to optimize. This brings us to Amdahl’s Law, which describes the theoretical maximum speedup of a software program through optimization, in particular how it applies to sequential programs and picking which parts of a program to optimize. Micro-optimizing code that does not significantly contribute to overall inefficiency is largely a waste of time. This concept most obviously applies to micro-optimizations at the code level, but it can apply to higher levels of your design as well. You still need to understand your architecture and its constraints as you design or you will miss something crucial and severely hamstring your application. But within those parameters, there are many areas which are not important (or you do not know which sub-areas are important yet). It is not impossible to redesign an existing application from the ground up, but it is far more expensive than doing it right in the first place. When architecting a large system, often the only way you can avoid the premature optimization trap is with experience and examining the architecture of similar or representative systems. In any case, you must bake performance goals into the design up front. Performance, like security and many other aspects of software design, cannot be an afterthought, but needs to be included as an explicit goal from the start.
The performance analysis you will do at the beginning of a project is different from that which occurs once it has been written and is being tested. At the beginning, you must make sure the design is scalable, that the technology can theoretically handle what you want to do, and that you are not making huge architectural blunders that will forever haunt you. Once a project reaches testing, deployment, and maintenance phases, you will instead spend more time on micro-optimizations, analyzing specific code patterns, trying to reduce memory usage, etc.
You will never have time to optimize everything, so start intelligently. Optimize the most inefficient portions of a program first to get the largest benefit. This is why having goals and an excellent measurement system in place is critical—otherwise, you do not even know where to start.
When considering the numbers you are measuring, decide what the most appropriate statistics are. Most people default to average, which is certainly important in most circumstances, but you should also consider percentiles. If you have availability requirements, you will almost certainly need to have goals stated in terms of percentiles. For example:
“Average latency for database requests must be less than 10ms. The 95th percentile latency for database requests must be less than 100ms.”
If you are not familiar with this concept, it is actually quite simple. If you take 100 measurements of something and sort them, then the 95th entry in that list is the 95th percentile value of that data set. The 95th percentile says, “95% of all samples have this value or less.” Alternatively, “5% of requests have a value higher than this.”
The general formula for calculating the index of the Pth percentile of a sorted list is:
0.01 * P * N
where P is the percentile and N is the length of the list.
Consider a series of measurements for generation 0 garbage collection pause time in milliseconds with these values (pre-sorted for convenience):
1, 2, 2, 4, 5, 5, 8, 10, 10, 11, 11, 11, 15, 23, 24, 25, 50, 87
For these 18 samples, we have an average of 17ms, but the 95th percentile is much higher at 50ms. If you just saw the average number, you may not be concerned with your GC latencies, but knowing the percentiles, you have a better idea of the full picture and that there are some occasional GCs happening that are far worse.
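The index formula and the worked example above can be checked in a few lines. This sketch (Python, for illustration) uses the simple nearest-rank style of percentile described in the text; note that statistics packages implement several interpolating variants that can give slightly different answers:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile using the 0.01 * P * N index formula
    from the text, as a 1-indexed position truncated to an integer."""
    n = len(sorted_values)
    rank = int(0.01 * p * n)        # e.g. 0.01 * 95 * 18 = 17.1 -> 17
    rank = max(1, min(rank, n))     # clamp to a valid position
    return sorted_values[rank - 1]  # 1-indexed position -> 0-indexed list

gc_pauses_ms = [1, 2, 2, 4, 5, 5, 8, 10, 10, 11, 11, 11,
                15, 23, 24, 25, 50, 87]

print(round(sum(gc_pauses_ms) / len(gc_pauses_ms)))  # -> 17 (average)
print(percentile(gc_pauses_ms, 95))                  # -> 50
```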
This series also demonstrates that the median value (50th percentile) can be quite different from the average. The average of a series of measurements is easily skewed by a few large values in the upper percentiles.
Percentile values are usually far more important for high-availability services. The higher availability you require, the higher percentile you will want to track. Usually, the 99th percentile is as high as you need to care about, but if you deal in a truly enormous volume of requests, 99.99th, 99.999th, or even higher percentiles will be important. Often, the value you need to be concerned about is determined by business needs, not technical reasons.
Percentiles are valuable because they give you an idea of how your metrics degrade across your entire execution context. Even if the average user or request experience in your application is good, perhaps the 90th percentile metric shows some room for improvement. That is telling you that 10% of your execution is being impacted more negatively than the rest. Tracking multiple percentiles will tell you how fast this degradation occurs. How important this percentage of users or requests is must ultimately be a business decision, and there is definitely a law of diminishing returns at play here. Getting that last 1% may be extremely difficult and costly.
I stated that the 95th percentile for the above data set was 50ms. While technically true, it is not useful information in this case—there is not actually enough data to make that call with any statistical significance, and it could be just a fluke. To determine how many samples you need, just use a rule of thumb: You need one “order of magnitude” more samples than the target percentile. For percentiles from 0-99, you need 100 samples minimum. You need 1,000 samples for 99.9th percentile, 10,000 samples for 99.99th percentile, and so on. This mostly works, but if you are interested in determining the actual number of samples you need from a mathematical perspective, research sample size determination.
Put more exactly, the potential error varies with the square root of the number of samples. For example, 100 samples yields an error range of 90-110, or a 10% error; 1,000 samples yields an error range of 969-1031, or a 3% error.
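The arithmetic behind those ranges is just ±√N (a rough rule of thumb, not a formal confidence interval). A quick sketch:

```python
import math

for n in (100, 1_000, 10_000):
    err = int(math.sqrt(n))   # truncate to whole samples
    print(f"{n} samples: range {n - err}-{n + err}, "
          f"about {round(100 * err / n)}% error")
```

Running this prints 90-110 (10%) for 100 samples, 969-1031 (3%) for 1,000 samples, and 9900-10100 (1%) for 10,000 samples, showing why an extra order of magnitude of samples buys you roughly a threefold reduction in error.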
Do not forget to also consider other types of statistical values: minimum, maximum, median, standard deviations, and more, depending on the type of metric you are measuring. For example, to determine statistically relevant differences between two sets of data, t-tests are often used. Standard deviations are used to determine how much variation exists within a data set.
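Using the GC pause samples from earlier, the standard library of most languages can produce these statistics directly (Python's statistics module shown here as a sketch):

```python
import statistics

gc_pauses_ms = [1, 2, 2, 4, 5, 5, 8, 10, 10, 11, 11, 11,
                15, 23, 24, 25, 50, 87]

print(min(gc_pauses_ms), max(gc_pauses_ms))      # -> 1 87
print(statistics.median(gc_pauses_ms))           # -> 10.5
print(round(statistics.mean(gc_pauses_ms), 1))   # -> 16.9
print(round(statistics.stdev(gc_pauses_ms), 1))  # sample standard deviation
```

The large gap between the median (10.5) and the mean (16.9) is itself a signal: a few slow collections dominate the average, which is exactly the situation where percentiles and standard deviation tell you more than the mean alone.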
If you want to measure the performance of a piece of code, especially to compare it to an alternative implementation, what you want is a benchmark. The literal definition of a benchmark is a standard against which measurements can be compared. In terms of software development, this means precise timings, usually averaged across many thousands (or millions) of iterations.
You can benchmark many types of things at different levels, from entire programs down to single methods. However, the more variability that exists in the code under test, the more iterations you will need to achieve sufficient accuracy.
Running benchmarks is a tricky endeavor. You want to measure the code in real-world conditions to get real-world, actionable data, but creating these conditions while getting useful data can be trickier than it seems.
Benchmarks shine when they test a single, uncontended resource, the classic example being CPU time. You certainly can test things like network access time, or reading files off an SSD, but you will need to take more care to isolate those resources from outside influence. Modern operating systems are not designed for this kind of isolation, but with careful control of the environment, you can likely achieve satisfactory results.
Testing entire programs or submodules is more likely to involve the use of contended resources. Thankfully, such large-scope tests are rarely called for. A quick profile of an app will reveal the spots that use the most resources, allowing you to narrow your focus to those areas.
Small-scope micro-benchmarking most commonly measures the CPU time of single methods, often rerunning them millions of times to get precise statistics on the time taken.
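The shape of such a micro-benchmark is a tight timing loop around the method under test. Below is a deliberately naive sketch (Python, with an arbitrary stand-in function being timed); as discussed later, a real benchmarking library layers warm-up runs, outlier rejection, and statistical reporting on top of this core idea:

```python
import time

def naive_benchmark(fn, iterations=1_000_000):
    """Time fn over many iterations and return mean seconds per call.
    Naive on purpose: no warm-up, no outlier handling, and the loop
    overhead itself is included in the measurement."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations

per_call = naive_benchmark(lambda: sum(range(10)))
print(f"~{per_call * 1e9:.0f} ns per call")
```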
In addition to hardware isolation, there are a number of other factors to consider:
The sample code that accompanies this book has a few quick-and-dirty benchmarks throughout, but for the above reasons, they should not be taken as the absolute truth.
Instead of writing your own benchmarks, you should almost certainly use an existing library that handles many of the above issues for you. I’ll discuss a couple of options later in this chapter.
If there is one single rule that is the most important in this entire book, it is this:
Measure, Measure, Measure!
You do NOT know where your performance problems are if you have not measured accurately. You will definitely gain experience and that can give you some strong hints about where performance problems are, just from code inspection or gut feel. You may even be right, but resist the urge to skip the measurement for anything but the most trivial of problems. The reasons for this are two-fold:
First, suppose you are right, and you have accurately found a performance problem. You probably want to know how much you improved the program, right? Bragging rights are much more secure with hard data to back them up.
Second, I cannot tell you how often I have been wrong. Case in point: While analyzing the amount of native memory in a process compared to managed memory, we assumed for a while that it was coming from one particular area that loaded an enormous data set. Rather than putting a developer on the task of reducing that memory usage, we did some experiments to disable loading that component. We also used the debugger to dump information about all the heaps in the process. To our surprise, most of the mystery memory was coming from assembly loading overhead, not this dataset. We saved a lot of wasted effort.
Optimizing performance is meaningless if you do not have effective tools for measuring it. Performance measurement is a continual process that you should bake into your development tool set, testing processes, and monitoring tools. If your application requires continual monitoring for functionality purposes, then it likely also requires performance monitoring.
The remainder of this chapter covers various tools that you can use to profile, monitor, and debug performance issues. I give emphasis to Visual Studio and software that is freely available, but know there are many other commercial offerings that can in some cases simplify various analysis tasks. If you have the budget for these tools, go for it. However, there is a lot of value in using some of the leaner tools I describe (or others like them). For one, they may be easier to run on customer machines or production environments. More importantly, by being a little “closer to the metal,” they will encourage you to gain knowledge and understanding at a very deep level that will help you interpret data, regardless of the tool you are using.
For each of the tools, I describe basic usage and general knowledge to get started. Sections throughout the book will give you detailed steps for very specific scenarios, but will often rely on you already being familiar with the UI and the basics of operation.
Tip Before digging into specific tools, a general tip for how to use them is in order. If you try to use an unfamiliar tool on a large, complicated project, it can be very easy to get overwhelmed, frustrated, or even get erroneous results. When learning how to measure performance with a new tool, create a test program with well-known behavior, and use the tool to prove its performance characteristics to you. By doing this, you will be more comfortable using the tool in a more complicated situation and less prone to making technical or judgmental mistakes.
While it is not the only IDE, most .NET programmers use Visual Studio, and if you do, chances are this is where you will start to analyze performance. Different versions of Visual Studio come with different tools. This book will assume you have at least the Professional version installed, but I will also describe some tools found in higher versions as well. If you do not have the right version, then skip ahead to the other tools mentioned.
Assuming you installed Visual Studio Professional or higher, you can access the performance tools via the Analyze menu by selecting Performance Profiler (or use the default keyboard shortcut: Alt+F2).
Standard .NET applications will show at least three options, with more available depending on the specific type of application:
If you just need to analyze CPU or look at what is on the heap, then use the first two tools. The Performance Wizard can also do CPU analysis, but it can be a bit slower. However, despite being somewhat of a legacy tool, it can also track memory allocations and concurrency.
For superior concurrency analysis, install the free Concurrency Visualizer, available as an optional extension (Tools | Extensions and Updates… menu).
The Visual Studio tools are among the easiest to use, but if you do not already have the right version of Visual Studio, they are quite expensive. They are also fairly limited and inflexible in what they provide. If you cannot use Visual Studio, or need more capabilities, I describe free alternatives below. Nearly all modern performance measurement tools use the same underlying mechanism (at least in Windows 8/Server 2012 and above kernels): ETW events. ETW stands for Event Tracing for Windows and this is the operating system’s way of logging all interesting events in an extremely fast, efficient manner. Any application can generate these events with simple APIs. Chapter 8 describes how to take advantage of ETW events in your own programs, defining your own or integrating with a stream of system events. Some tools, such as PerfView, can collect arbitrary ETW events all at once and you can analyze all of them separately from one collection session. Sometimes I think of Visual Studio performance analysis as “development-time” while the other tools are for the real system. Your experience may differ and you should use the tools that give you the most bang for the buck.
This section will introduce the general interface for profiling with the CPU profiling options. The other profiler options (such as for memory) will be covered later in the book, in appropriate sections.
When you choose CPU Usage, the results will bring up a window with a graph of CPU usage and a list of expensive methods.
If you want to drill into a specific method, just double-click it on the list, and it will open up a method Call/Callee view.
If that option does not give you enough information, take a look at the Performance Wizard. This tool uses VsPerf.exe to gather important events.
When you choose CPU (Sampling), the profiler collects CPU samples with minimal disruption to your program.
While its interface differs from the CPU Usage view we saw earlier, this view shows you the overall CPU usage on a timeline, with a tree of expensive methods below it. There are also alternate reports you can view. You can zoom in on the graph and the rest of the analysis will update in response. Clicking on a method name in the table will take you to a familiar-looking Function Details view.
Below the function call summary, you will see the source code (if available), with highlighted lines showing the most expensive parts of the method.
There are other reports as well, including:
Instead of sampling, you can choose to instrument the code. This modifies the original executable by adding instructions around each method call to measure the time spent. This can give more accurate reporting for very small, fast methods, but it has much higher overhead in execution time as well as the amount of data produced. Other than a lack of a CPU graph, the report looks and behaves the same as the CPU sampling report. The major difference in the interface is that it is measuring time instead of number of samples.
Visual Studio can analyze CPU usage, memory allocations, and resource contentions. This is perfect for use during development or when running comprehensive tests that accurately exercise the product. However, it is very rare for a test to accurately capture the performance characteristics of a large application running on real data. If you need to capture performance data on non-development machines, say a customer’s machine or in the data center, you need a tool that can run outside of Visual Studio.
For that, there is the Visual Studio Standalone Profiler, which comes with the Professional or higher versions of Visual Studio. You will need to install it from your installation media separately from Visual Studio. On my ISO images for the 2012-2015 Professional versions, it is in the Standalone Profiler directory. For Visual Studio 2017, the executable is VsPerf.exe and is located in C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Team Tools\Performance Tools.
To collect data from the command line with this tool:
VsPerfCmd.exe /Start:Sample /Output:outputfile.vsp
VsPerfCmd.exe /Shutdown
This will produce a file called outputfile.vsp, which you can open in Visual Studio.
VsPerfCmd.exe has a number of other options, including all of the profiling types that the full Visual Studio experience offers. Aside from the most common option of Sample, you can choose:
Trace vs. Sample mode is an important choice. Which one to use depends on what you want to measure. Sample mode should be your default. It interrupts the process every few milliseconds and records the stacks of all threads. This is the best way to get a good picture of CPU usage in your process. However, it does not work well for I/O calls, which will not have much CPU usage, but may still contribute to your overall run time.
Trace mode requires modification of every function call in the process to record time stamps. It is much more intrusive and causes your program to run much slower. However, it records actual time spent in each method, so may be more accurate for smaller, faster methods.
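The trade-off between the two modes can be seen in miniature with any language's profiling hooks. This Python sketch shows the trace-mode idea: a hook fires on every single call, which gives exact counts but adds overhead to every call made, whereas a sampling profiler would instead wake up every few milliseconds and record whichever stacks happen to be running:

```python
import collections
import sys

call_counts = collections.Counter()

def trace_hook(frame, event, arg):
    # Trace mode in miniature: this hook runs on *every* Python call --
    # exact counts, but its cost is added to every call being measured.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def busy(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

sys.setprofile(trace_hook)
for _ in range(100):
    busy(50)
sys.setprofile(None)

print(call_counts["busy"])   # -> 100: every single call was observed
```

A sampler, by contrast, would have observed only the calls that happened to be running at each sampling tick, giving a statistical picture at a fraction of the cost.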
Coverage mode is not for performance analysis, but is useful for seeing which lines of your code were executed. This is a nice feature to have when running tests to see how much of your product the tests cover. There are commercial products that do this for you, but you can do it yourself without much more work.
Concurrency mode records events that occur when there is contention for a resource via a lock or some other synchronization object. This mode can tell you if your threads are being blocked due to contention. See Chapter 4 for more information about asynchronous programming and measuring the amount of lock contention in your application.
Performance counters are some of the simplest ways to monitor your application’s and the system’s performance. Windows has hundreds of counters in dozens of categories, including many for .NET. The easiest way to access these is via the built-in Windows utility Performance Monitor (PerfMon.exe).
Each counter has a category and a name. Many counters also have instances of the selected counter as well. For example, for the % Processor Time counter in the Process category, the instances are the various processes for which there are values. Some counters also have meta-instances, such as _Total or <Global>, which aggregate the values over all instances.
Many of the chapters ahead will detail the relevant counters for that topic, but there are general-purpose counters that are not .NET-specific that you should be familiar with. There are performance counters for nearly every Windows subsystem and these are generally applicable to every program.
However, before continuing, you should familiarize yourself with some basic operating system terminology:
I will use some of these terms throughout the book, especially in Chapter 2 when I discuss garbage collection. For more information on these topics, look at a dedicated operating system book such as Windows Internals. (See the bibliography at the end of the book.)
The Process category of counters surfaces much of this critical information via counters with instances for each process, including:
There are a few other generally useful categories, depending on your application. You can use PerfMon to explore the specific counters found in these categories.
It is surprisingly difficult to find detailed information on performance counters on the Internet, but thankfully, they are self-documenting! In the Add Counter dialog box in PerfMon, you can check the “Show description” box at the bottom to display details on the highlighted counter.
PerfMon also has the ability to collect specified performance counters at scheduled times and store them in logs for later viewing, or even perform a custom action when a performance counter passes a threshold. You do this with Data Collector Sets, which are not limited to performance counter data; they can also collect system configuration data and ETW events.
To set up a Data Collector Set, in the main PerfMon window:
Once done, you can open the properties for the collection set and set a schedule for collection. You can also run them manually by right-clicking on the job node and selecting Start. This will create a report, which you can view by double-clicking its node under Reports in the main tree view.
To create an alert, follow the same process but select the Performance Counter Alert option in the Wizard.
It is likely that everything you will need to do with performance counters can be done using the functionality described here, but if you want to take programmatic control or create your own counters, see Chapter 7 for details. You should consider performance counter analysis a baseline for all performance work on your application.
Event Tracing for Windows (ETW) is one of the fundamental building blocks for all diagnostic logging in Windows, not just for performance. This section will give you an overview of ETW and Chapter 8 will teach you how to create and monitor your own events.
Events are produced by providers. For example, the CLR contains the Runtime provider that produces most of the events we are interested in for this book. There are providers for nearly every subsystem in Windows, such as the CPU, disk, network, firewall, memory, and many, many more. The ETW subsystem is extremely efficient and can handle the enormous volume of events generated, with minimal overhead.
Each event has some standard fields associated with it, like event level (informational, warning, error, verbose, and critical) and keywords. Each provider can define its own keywords. The CLR’s Runtime provider has keywords for things like GC, JIT, Security, Interop, Contention, and more. Keywords allow you to filter the events you would like to monitor.
Each event also has a custom data structure defined by its provider that describes the state of some behavior. For example, the Runtime’s GC events will mention things like the generation of the current collection, whether it was background, and so on.
What makes ETW so powerful is that, since most components in Windows produce an enormous number of events describing nearly every aspect of an application’s operation, at every layer, you can do the bulk of performance analysis with ETW events only.
Many tools can process ETW events and give specialized views. In fact, starting in Windows 8, all CPU profiling is done using ETW events.
To see a list of all the ETW providers registered on your system, open a command prompt and type:
logman query providers
This will produce a large amount of output similar to the following:
Provider GUID
------------------------------------------------------------------
.NET Common Language Runtime {E13C0D23-CCBC-4E12-931B...
ACPI Driver Trace Provider {DAB01D4D-2D48-477D-B1C3...
Active Directory Domain Services: SAM {8E598056-8993-11D2-819E...
Active Directory: Kerberos Client {BBA3ADD2-C229-4CDB-AE2B...
Active Directory: NetLogon {F33959B4-DBEC-11D2-895B...
ADODB.1 {04C8A86F-3369-12F8-4769...
ADOMD.1 {7EA56435-3F2F-3F63-A829...
Application Popup {47BFA2B7-BD54-4FAC-B70B...
Application-Addon-Event-Provider {A83FA99F-C356-4DED-9FD6...
...
You can also get details on the keywords for a specific provider:
D:\>logman query providers "Windows Kernel Trace"
Provider GUID
------------------------------------------------------------------
Windows Kernel Trace {9E814AAD-3204-11D2-9A82...
Value Keyword Description
------------------------------------------------------------------
0x0000000000000001 process Process creations/deletions
0x0000000000000002 thread Thread creations/deletions
0x0000000000000004 img Image load
0x0000000000000008 proccntr Process counters
0x0000000000000010 cswitch Context switches
0x0000000000000020 dpc Deferred procedure calls
0x0000000000000040 isr Interrupts
0x0000000000000080 syscall System calls
0x0000000000000100 disk Disk IO
0x0000000000000200 file File details
0x0000000000000400 diskinit Disk IO entry
0x0000000000000800 dispatcher Dispatcher operations
0x0000000000001000 pf Page faults
0x0000000000002000 hf Hard page faults
0x0000000000004000 virtalloc Virtual memory allocations
0x0000000000010000 net Network TCP/IP
0x0000000000020000 registry Registry details
0x0000000000100000 alpc ALPC
0x0000000000200000 splitio Split IO
0x0000000000800000 driver Driver delays
0x0000000001000000 profile Sample based profiling
0x0000000002000000 fileiocompletion File IO completion
0x0000000004000000 fileio File IO
Unfortunately, there is no good online resource to explain which events exist in the various providers. Some common ETW events for all Windows processes include those in the Windows Kernel Trace category:
To see other events from this provider or others, you can collect ETW events and examine them yourself.
Throughout the book, I will mention the important events you should pay attention to in an ETW trace, particularly from the CLR Runtime provider. For the complete CLR ETW documentation, you can visit https://docs.microsoft.com/dotnet/framework/performance/etw-events-in-the-common-language-runtime.
Many tools can collect and analyze ETW events, but PerfView, originally written by Microsoft .NET performance architect (and writer of this book’s Foreword) Vance Morrison, is one of the best for its sheer power. The previous screenshot of ETW events is from this tool.
PerfView is built upon an ETW processing engine called TraceEvent, which you can reuse yourself (See Chapter 8). But PerfView’s real utility lies in its extremely powerful stack grouping and folding mechanism that lets you drill into events at multiple layers of abstraction.
While other ETW analysis tools can be useful, I often prefer PerfView for a few reasons:
Here are some common questions that I routinely answer using PerfView:
To collect and analyze events using PerfView, follow these basic steps:
During event collection, PerfView captures ETW events for all processes. You can filter events per-process after the collection is complete.
Collecting events is not free. Certain categories of events are more expensive to collect than others. For example, a CPU profile generates a huge number of events, so you should keep the profile time very limited (around a minute or two) or you could end up with multi-gigabyte files that you cannot analyze.
Most views in PerfView are variations of a single type, so it is worth understanding how it works.
PerfView is mostly a stack aggregator and viewer. When you record ETW events, the stack for each event is recorded. PerfView analyzes these stacks and shows them to you in a grid that is common to CPU, memory allocation, lock contention, exceptions thrown, and most other types of events. The principles you learn while doing one type of investigation will apply to other types, since the stack analysis is the same.
You also need to understand the concepts of grouping and folding. Grouping turns multiple sources into a single entity. For example, there are multiple .NET Framework DLLs, and which DLL a particular function is in is not usually interesting for profiling. Using grouping, you can define a grouping pattern, such as “System.*!=>LIB”, which coalesces all System.*.dll assemblies into a single group called LIB. This is one of the default grouping patterns that PerfView applies. If you wanted to, for example, collapse all method calls in the TimeZoneInfo class, you could have a group defined as “mscorlib.ni!System.TimeZoneInfo*->TIMEZONE”. This will cause TIMEZONE to appear throughout your stack in the place of any TimeZoneInfo methods.
Folding allows you to hide some of the irrelevant complexity of the lower layers of code by counting its cost in the nodes that call it. As a simple example, consider where memory allocations occur—always via some internal CLR method invoked by the new operator. What you really want to know is which types are most responsible for those allocations. Folding allows you to attribute those underlying costs to their parents, code which you can actually control. For example, in most cases you do not care about which internal operations are taking up time inside String.Format; you really care about what areas of your code are calling String.Format in the first place. PerfView can fold those operations into the caller to give you a better picture of your code’s performance.
Folding patterns can use the groups you defined for grouping. So, for example, you can just specify a folding pattern of “LIB” which will ensure that all methods in System.* are attributed to their caller outside of System.*.
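To make these two ideas concrete, here is a toy illustration in C# (this is not PerfView’s actual implementation; the frame names and patterns are invented for the example). Grouping renames any frame matching a pattern to a group name, and folding then removes the grouped frames so their cost rolls up to the nearest caller outside the group:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StackGrouping
{
    // Grouping: rename every frame that matches the pattern to a single
    // group name, mimicking a PerfView pattern like "System.*!=>LIB".
    public static IEnumerable<string> Group(
        IEnumerable<string> stack, string pattern, string groupName)
    {
        var regex = new Regex(pattern);
        return stack.Select(f => regex.IsMatch(f) ? groupName : f);
    }

    // Folding: remove grouped frames entirely so that their cost is
    // attributed to the nearest remaining caller above them.
    public static IEnumerable<string> Fold(
        IEnumerable<string> stack, string groupName)
    {
        return stack.Where(f => f != groupName);
    }
}
```

For example, grouping the stack MyApp!Main → System.Core!Where → System!Format with the pattern “^System” into LIB produces Main → LIB → LIB, and folding LIB leaves just Main, with the library cost counted against it.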
The user interface of the stack viewer needs some brief explanation as well.
Controls at the top allow you to organize the stack view in multiple ways. Here is a summary of their usage, but you can click on the ? in the column headers to bring up a help file that gives you more details.
There are a few different view tabs:
In the grid view, there are a number of columns. Click on the column names to bring up more information. Here is a summary of the most important columns:
In the chapters that follow, I will give instructions for solving specific problems with various types of performance investigations. A complete overview of PerfView would be worth a book on its own, or at least a very detailed help file—which just so happens to come with PerfView. I strongly encourage you to read this manual once you have gone through a few simple analyses.
It may seem like PerfView is mostly for analyzing memory or CPU, but do not forget that it is really just a generic stack aggregation program, and those stacks can come from any ETW event. It can analyze your sources of lock contention, disk I/O, or any arbitrary application event with the same grouping and folding power.
CLR Profiler is a possible alternative to PerfView for memory analysis if you want a graphical representation of the heap and the relationships between objects. CLR Profiler can show you a wealth of detail. For example:
I rarely use CLR Profiler because of some of its limitations and age, but it is still occasionally useful. It has unique visualizations that no other free tool currently matches. It comes with 32-bit and 64-bit binaries as well as documentation and the source code.
The basic steps to get a trace are:
This will start the application with profiling active. When you are done profiling, exit the program, or select Kill Application in CLR Profiler. This will terminate the profiled application and start processing the capture log. This processing can take quite a while, depending on the profile duration. (I have seen it take over an hour before.)
While profiling is going on, you can click the “Show Heap now” button in CLR Profiler. This will cause it to take a heap dump and open the results in a visual graph of object relationships. Profiling will continue uninterrupted, and you can take multiple heap dumps at different points.
When it is done, you will see the main results screen.
From this screen, you can access different visualizations of heap data. Start with the Allocation Graph and the Time Line to see some of the essential capabilities. As you become comfortable analyzing managed code, the histogram views will also become an invaluable resource.
Note While CLR Profiler is generally great, I have had a few major problems with it. First, it is a bit finicky. If you do not set it up correctly before starting to profile, it can throw exceptions or die unexpectedly. For example, I always have to check the Allocations or Calls boxes before I start profiling if I want to get any data at all. You should completely disregard the Attach to Process button, as it does not seem to work reliably. CLR Profiler does not seem to work well for truly huge applications with enormous heaps or a large number of assemblies. If you find yourself having trouble, PerfView may be a better solution because of its polish and extreme customizability through very detailed command-line parameters that allow you to control nearly all aspects of its behavior. Your mileage may vary. On the other hand, CLR Profiler comes with its own source code so you can fix it!
The Windows Assessment and Deployment Kit (Windows ADK, also part of the Windows SDK) contains a number of tools that aid in deploying operating systems and applications to machines. Inside it are a pair of tools called Windows Performance Recorder and Windows Performance Analyzer. These tools process ETW events in the same manner as PerfView. However, Windows Performance Analyzer excels in displaying hardware and operating system level information. It can display .NET events as well, but it is not as convenient as PerfView.
To capture a trace, invoke Windows Performance Recorder and start capturing.
After you are done capturing events, click the Save button, which will bring up an interface for you to provide more details, while WPR processes the captured data in the background.
The capture data file can be opened in any tool that can analyze ETW events, but there is a convenient button to open it directly in Windows Performance Analyzer.
Windows Performance Analyzer shows you a list of resource categories along the left-hand side. Double-clicking one opens a detailed view with a graph and a table of details suitable for that resource. For example, the details for memory usage will show you different categories of memory usage, such as active vs. committed memory, paged pool, private pages, and more.
Because this tool focuses more on general operating system resource usage issues, rather than .NET, I will not discuss it further in this book, but it is a useful tool to keep in mind when you are dealing with some classes of performance problems.
WinDbg is a general-purpose Windows Debugger distributed for free by Microsoft. If you are used to using Visual Studio as your main debugger, using this bare-bones, text-only debugger may seem daunting. Do not let it be. Once you learn a few commands, you will feel comfortable and after a while, you will rarely use Visual Studio for debugging except during active development.
WinDbg is far more powerful than Visual Studio and will let you examine your process in many ways you could not otherwise. It is also lightweight and more easily deployable to production servers or customer machines. In these situations, it is in your best interest to become familiar with WinDbg. By itself, however, WinDbg is not that interesting for managed code. To work with managed processes effectively, you will need to use .NET’s SOS extensions, which ship with each version of the .NET Framework. A very handy SOS reference cheat sheet is located at https://docs.microsoft.com/dotnet/framework/tools/sos-dll-sos-debugging-extension. You can also use SOS.dll from Visual Studio, but this is not as straightforward, and there are other benefits to becoming familiar with WinDbg, so I will not cover that scenario.
With WinDbg and SOS together, you can quickly answer questions such as these:
WinDbg is not usually my first tool (that is often PerfView), but it is often my second or third, allowing me to see things that other tools will not easily show. For this reason, I will use WinDbg extensively throughout this book to show you how to examine your program’s operation, even when other tools do a quicker or better job. (Do not worry; I will also cover those tools.)
Do not be daunted by the text interface of WinDbg. Once you use a few commands to look into your process, you will quickly become comfortable and appreciative of the speed with which you can analyze a program. The chapters in this book will add to your knowledge little by little with specific scenarios.
To get WinDbg, you must install the Windows SDK. You can choose to install only the debuggers if you wish.
To get started with WinDbg, do a simple tutorial with a sample program. The program will be basic enough—a straightforward, easy-to-debug memory leak. You can find it in the accompanying source code in the MemoryLeak project (available at http://www.writinghighperf.net).
using System;
using System.Collections.Generic;
using System.Threading;

namespace MemoryLeak
{
    class Program
    {
        static List<string> times = new List<string>();

        static void Main(string[] args)
        {
            Console.WriteLine("Press any key to exit");
            while (!Console.KeyAvailable)
            {
                times.Add(DateTime.Now.ToString());
                Console.Write('.');
                Thread.Sleep(10);
            }
        }
    }
}
Start this program and let it run for a few minutes.
Run WinDbg from where you installed it. It should be in the Start Menu if you installed it via the Windows SDK. Take care to run the correct version, either x86 (for 32-bit processes) or x64 (for 64-bit processes). Go to File | Attach to Process (or hit F6) to bring up the Attach to Process dialog.
From here, find the MemoryLeak process. (It may be easier to check the By Executable sort option.) Click OK.
WinDbg will suspend the process (this is important to know if you are debugging a live production process!) and display any loaded modules. At this point, it will be waiting for your command. The first thing you usually want to do is load the CLR debugging extensions. Enter this command:
.loadby sos clr
If it succeeds, there will be no output.
If you get an error message that says “Unable to find module ‘clr’” it most likely means the CLR has not yet been loaded. This can happen if you launch a program from WinDbg and break into it immediately. In this case, first set a breakpoint on the CLR module load:
sxe ld clr
g
The first command sets a breakpoint on the load of the CLR module. The g command tells the debugger to continue execution. Once you break again, the CLR module should be loaded, and you can now load SOS with the .loadby sos clr command, as described previously.
At this point, you can do any number of things. Here are some commands to try:
!ProcInfo
This prints out some general debugging information about the process as a whole, including environment variables set:
---------------------------------------
Environment
=::=::
=C:=C:\WINDOWS\system32
...many, many environment variables
---------------------------------------
Process Times
Process Started at: 2017 Nov 7 22:5:49.44
Kernel CPU time : 0 days 00:00:00.01
User CPU time : 0 days 00:00:00.01
Total CPU time : 0 days 00:00:00.02
---------------------------------------
Process Memory
WorkingSetSize: 26572 KB PeakWorkingSetSize: 26572 KB
VirtualSize: 717972 KB PeakVirtualSize: 717972 KB
PagefileUsage: 566560 KB PeakPagefileUsage: 566560 KB
---------------------------------------
44 percent of memory is in use.
Memory Availability (Numbers in MB)
Total Avail
Physical Memory 4095 4095
Page File 4095 4095
Virtual Memory 4095 3783
More useful commands:
g
This stands for “Go” and continues execution. You cannot enter any commands while the program is running.
<Ctrl-Break>
This pauses a running program. Do this after you Go (g) to get back control.
.dump /ma d:\memorydump.dmp
This creates a full process dump to the selected file. This will allow you to debug the process’s state later, though since it is a snapshot, of course you will not be able to debug any further execution.
!DumpHeap -stat
DumpHeap shows a summary of all managed objects on the object heap, including their size (just for this object, not any referenced objects), count, and other information. If you want to see every object on the heap of type System.String, type !DumpHeap -type System.String. You will see more about this command when investigating garbage collection.
~*kb
This is a regular WinDbg command, not from SOS. It prints the current stack for all threads in the process.
To switch the current thread to a different one, use the command:
~32s
This will change the current thread to thread 32. Note that thread numbers in WinDbg are not the same as thread IDs. WinDbg numbers all the threads in your process for easy reference, regardless of the Windows or .NET thread ID.
!DumpStackObjects
You can also use the abbreviated version: !dso. This dumps out the address and type of each object from all stack frames for the current thread.
Note that all commands located in the SOS debugging extension for managed code are prefixed with a ! character.
The other thing you need to do to be effective with the debugger is set your symbol path to download the public symbols for Microsoft DLLs so you can see what is going on in the system layer. Set your _NT_SYMBOL_PATH environment variable to this string:
symsrv*symsrv.dll*c:\sym*http://msdl.microsoft.com/download/symbols

Replace c:\sym with your preferred local symbol cache path (and make sure you create the directory). With the environment variable set, both WinDbg and Visual Studio will use this path to automatically download and cache the public symbols for system DLLs. During the initial download, symbol resolution may be quite slow, but once cached, it should speed up significantly. You can also use the .symfix command to automatically set the symbol path to the Microsoft symbol server and a local cache directly:

.symfix c:\sym
If you have not used WinDbg before, do not be afraid to dive in and try it out. Once you memorize a small number of commands, you will be highly productive in no time. Deep mastery of WinDbg will come with time and experience, but it is worth the journey. You can do many types of analysis in WinDbg that are very difficult or impossible to do in other debuggers. See especially Chapter 2’s section for investigating memory issues for many examples of WinDbg usage.
After you have used WinDbg for a while and seen the power available to you, you will likely have the thought, “I wish I could access this stuff programmatically.” Thankfully, you can! Microsoft.Diagnostics.Runtime (nicknamed “CLR MD”) is an open source library available at https://github.com/microsoft/clrmd. It provides access to much of the functionality in SOS.dll in a convenient, easy-to-use API. CLR MD is designed to be a fairly low-level API, allowing you to easily build on top of it to provide richer functionality. In fact, some of PerfView’s functionality is built on top of CLR MD, so if PerfView is not giving you exactly what you need, you can go under the hood, so to speak, to this library, and build what you need.
In this section, I’ll provide an overview of the tool and how to use it, but specific solutions to problems will be found in the relevant sections throughout the book.
Note The library is very much in active development, and you will see differences between the documentation and what is currently implemented. The API may also change further.
You can use this library to both attach to live processes (as a debugger), or open heap dump files on disk. I’ll show examples of both.
To attach to a live process, you just need to supply a process ID. In this example, I’m explicitly starting a new process for convenience. Most examples of CLR MD in this book will come from the AnalyzeProcess sample code project accompanying this book.
static void Main(string[] args)
{
    // Let's create our own process to test with
    var startInfo = new ProcessStartInfo(TargetProcessName);
    startInfo.CreateNoWindow = true;
    startInfo.WindowStyle = ProcessWindowStyle.Hidden;
    var targetProcess = Process.Start(startInfo);
    Thread.Sleep(1000);

    using (DataTarget target = DataTarget.AttachToProcess(
        targetProcess.Id,
        10000, // timeout
        AttachFlag.Invasive))
    {
        PrintDumpInfo(target);
        var clr = target.ClrVersions[0].CreateRuntime();
    }
}

private static void PrintDumpInfo(DataTarget target)
{
    PrintHeader("Target Info");
    Console.WriteLine($"Architecture: {target.Architecture}");
    Console.WriteLine($"Pointer Size: {target.PointerSize}");
    Console.WriteLine("CLR Versions:");
    foreach (var clr in target.ClrVersions)
    {
        Console.WriteLine($"  {clr.Version}");
    }
}
This program will print out the following information:
Target Info
===========
Architecture: X86
Pointer Size: 4
CLR Versions:
v4.7.2115.00
The clr object obtained after calling PrintDumpInfo is the main interface to most of the interesting commands. Using it, you can, for example, iterate over every object in the heap:

var heap = clr.Heap;
foreach (var obj in heap.EnumerateObjects())
{
    int gen = heap.GetGeneration(obj.Address);
    Console.WriteLine(
        $"0x{obj.Address:x} - {obj.Type.Name}" +
        $" - Generation: {gen}");
}
Which produces output similar to:
0x30ec8ac - System.Byte[] - Generation: 0
0x30ecca0 - LargeMemoryUsage.B - Generation: 1
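The same enumeration can feed a per-type summary much like WinDbg’s !DumpHeap -stat output. So that the sketch below runs without a live target, the aggregation works on plain (type name, size) pairs; in real code you would produce those pairs inside the loop above from obj.Type.Name and obj.Size:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeapStats
{
    // Aggregate (type name, object size) pairs into a per-type count and
    // total size, sorted by total size descending (like !DumpHeap -stat).
    public static List<(string Type, int Count, ulong TotalSize)> Summarize(
        IEnumerable<(string Type, ulong Size)> objects)
    {
        return objects
            .GroupBy(o => o.Type)
            .Select(g => (g.Key, g.Count(), (ulong)g.Sum(o => (long)o.Size)))
            .OrderByDescending(s => s.Item3)
            .ToList();
    }
}
```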
In addition to the heap, you can examine code:
foreach (var module in clr.Modules)
{
    foreach (var type in module.EnumerateTypes())
    {
        foreach (var method in type.Methods)
        {
            Console.WriteLine(method.Name);
        }
    }
}
This produces output like this:
Main
GetNewObject
.cctor
ToString
ToString
Equals
You can also open crash dumps. This is slightly more complicated because you must also obtain the mscordacwks.dll file that matches the CLR version(s) present in the dump. When attaching to a live process, this is trivial because it is guaranteed to be present on the machine. With a dump from a different machine, and potentially a different version of the CLR altogether, you must obtain it from that machine or download it from the Microsoft symbol server. This code shows you how to accomplish this:
{
    ...
    string dacFile =
        GetDacFile(
            dataTarget.ClrVersions[0],
            dataTarget);
    var clr = dataTarget.ClrVersions[0].CreateRuntime(dacFile);
    ...
}

private static string GetDacFile(ClrInfo clrInfo,
                                 DataTarget target)
{
    string location = clrInfo.LocalMatchingDac;
    if (string.IsNullOrEmpty(location) || !File.Exists(location))
    {
        // Try to download from the symbol server
        ModuleInfo dacInfo = clrInfo.DacInfo;
        try
        {
            location = target.SymbolLocator.FindBinary(dacInfo);
        }
        catch (WebException)
        {
            return null;
        }
    }
    return location;
}
This method is equivalent to calling CreateRuntime with no arguments, but it is useful to know how to do this yourself in case you have custom needs.
You will see more examples of its power in later chapters, but a summary of some of the things it can tell you:
Note I have seen a couple of issues when using this library to examine the code in a truly huge DLL. The APIs in Microsoft.Diagnostics.Runtime rely on internal .NET APIs that may not have the most efficient implementation. In one case, I was using a dump file to calculate how much JITting had happened in a 500 MB DLL with 80,000 types, and hundreds of thousands of methods. I hit Ctrl-Break after about 36 hours. That is the only DLL I’ve had issues with.
There are many free and paid products out there that can take a compiled assembly and decompile it into IL, C#, VB.NET, or any other .NET language. Some of the most popular include Reflector, ILSpy, and dotPeek, but there are others.
These tools are valuable for showing you the inner details of other people’s code, something critical for good performance analysis. I use them most often to examine the .NET Framework itself when I want to see the potential performance implications of various APIs.
Converting your own code to readable IL is also valuable because it can show you many operations, such as boxing, that are not visible in the higher-level languages.
Chapter 6 discusses the .NET Framework code and encourages you to train a critical eye on every API you use. Tools like ILSpy, dotPeek, and Reflector are vital for that purpose and you will use them frequently as you become more familiar with existing code. You will often be surprised at how much work goes into seemingly simple methods. Analyzing the assemblies of other developers and companies can teach you much about good (or bad) organization, design, and coding practices.
Some other things these tools can show you:
Most tools also have search capability to allow you to find types, methods, fields, or code statements.
MeasureIt is a handy micro-benchmark tool by Vance Morrison (the same author of PerfView). It shows the relative costs of various .NET APIs in many categories, including method calls, arrays, delegates, iteration, reflection, P/Invoke, and many more. It compares all the costs to calling an empty static function as a baseline.
MeasureIt is primarily useful to show you how design choices will affect performance at an API level. For example, in the locks category, it shows you that using ReaderWriterLock is about four times slower than just using a regular lock statement.
It is easy to add your own benchmarks to MeasureIt’s code. It ships with its own source packed inside itself—just run MeasureIt /edit to extract it. Studying this code will give you a good idea of how to write accurate benchmarks. There is a lengthy explanation in the code comments about how to do high-quality analysis, which you should pay special attention to, especially if you want to do some simple benchmarking yourself.
For example, it prevents the compiler from inlining function calls:
[MethodImpl(MethodImplOptions.NoInlining)]
public void AnyEmptyFunction()
{
}
There are other tricks it uses such as working around processor caches and doing enough iterations to produce statistically significant results.
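The essential pattern—warm up first, run enough iterations per measurement, and report statistics across several batches rather than a single number—can be sketched in a few lines. This is a drastic simplification of what MeasureIt actually does (it also subtracts empty-loop overhead and handles outliers), but it shows the shape of a credible micro-benchmark:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class MicroBench
{
    // Returns the mean time per call, in milliseconds, for each batch.
    public static double[] Measure(
        Action action, int batches = 10, int iterationsPerBatch = 1000)
    {
        action(); // warm up: exclude JIT compilation cost from the timings
        var results = new double[batches];
        for (int b = 0; b < batches; b++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterationsPerBatch; i++)
            {
                action();
            }
            sw.Stop();
            results[b] = sw.Elapsed.TotalMilliseconds / iterationsPerBatch;
        }
        return results;
    }

    // Mean and (population) standard deviation of the batch results.
    public static (double Mean, double StdDev) Stats(double[] samples)
    {
        double mean = samples.Average();
        double variance =
            samples.Sum(s => (s - mean) * (s - mean)) / samples.Length;
        return (mean, Math.Sqrt(variance));
    }
}
```

Reporting both the mean and the standard deviation across batches makes it obvious when a result is too noisy to trust.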
MeasureIt is handy because it has a number of built-in measurements of the CLR itself, which can give you a good idea of what the basics cost. If you are interested in benchmarking your own code, then read on to the next section.
The standard in .NET benchmarking is probably the open-source project BenchmarkDotNet. This library handles many of the usual concerns about micro-benchmarking and does much more by:
Getting started is very easy. Here is a simple example, comparing the performance of foreach loops on an array versus an IEnumerable. With simple attribute decoration, you can let the library do almost all the work for you.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;

namespace BenchmarkTest
{
    public class LoopBenchmarks
    {
        static int[] arr = new int[100];

        public LoopBenchmarks()
        {
            for (int i = 0; i < arr.Length; i++)
            {
                arr[i] = i;
            }
        }

        [Benchmark]
        public int ForEachOnArray()
        {
            int sum = 0;
            foreach (int val in arr)
            {
                sum += val;
            }
            return sum;
        }

        [Benchmark]
        public int ForEachOnIEnumerable()
        {
            int sum = 0;
            IEnumerable<int> arrEnum = arr;
            foreach (int val in arrEnum)
            {
                sum += val;
            }
            return sum;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<LoopBenchmarks>();
        }
    }
}
You can run this yourself with the BenchmarkTest sample code.
The output ends with this:
Total time: 00:00:43 (43.64 sec)
// * Summary *
BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-3930K CPU 3.20GHz (Ivy Bridge),
ProcessorCount=12
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
[Host] : .NET Framework 4.7 (CLR 4.0.30319.42000),
32bit LegacyJIT-v4.7.2102.0
DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000),
32bit LegacyJIT-v4.7.2102.0
Method | Mean | Error | StdDev |
--------------------- |----------:|----------:|----------:|
ForEachOnArray | 53.32 ns | 0.2083 ns | 0.1846 ns |
ForEachOnIEnumerable | 561.69 ns | 7.2943 ns | 6.8231 ns |
// * Hints *
Outliers
LoopBenchmarks.ForEachOnArray: Default -> 1 outlier was removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
1 ns : 1 Nanosecond (0.000000001 sec)
// ***** BenchmarkRunner: End *****
// * Artifacts cleanup *
Notice that even for such simple code, it took a full 43 seconds to execute the benchmarks.
You can of course customize how these benchmarks work with additional configuration.
To read more, visit http://benchmarkdotnet.org. Add it to your project directly from Visual Studio by installing the BenchmarkDotNet NuGet package.
The old standby of brute-force debugging via console output is still a valid scenario and should not be ignored. Rather than console output, however, I encourage you to use ETW events instead, as detailed in Chapter 8.
Performing accurate code timing is also a useful capability at times. Never use DateTime.Now for tracking performance data; its resolution is too coarse and it is too slow for this purpose. Instead, use the System.Diagnostics.Stopwatch class to track the time span of small or large events in your program with high accuracy, high precision, and low overhead.
var stopwatch = Stopwatch.StartNew();
...do work...
stopwatch.Stop();
TimeSpan elapsed = stopwatch.Elapsed;
long elapsedTicks = stopwatch.ElapsedTicks;
See Chapter 6 for more information about using times and timing in .NET.
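One convenient pattern is to wrap this boilerplate in a small helper that times a delegate (a trivial sketch, not a substitute for a real benchmark harness):

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Times a single invocation of the given action.
    public static TimeSpan Time(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Usage is then a one-liner: TimeSpan elapsed = Timing.Time(() => DoWork());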
If you want to ensure that your own benchmarks are accurate and reproducible, study the source code and documentation of MeasureIt, which highlight the best practices on this topic. Benchmarking well is often harder than you would expect, and performing benchmarks incorrectly can be worse than doing no benchmarks at all because it will cause you to waste time on the wrong thing. It is often better to use a third-party library like BenchmarkDotNet.
No developer, system administrator, or even hobbyist should be without this great set of tools. Originally developed by Mark Russinovich and Bryce Cogswell and now owned by Microsoft, these are tools for computer management, process inspection, network analysis, and a lot more. Here are some of my favorites:
There are dozens more. You can download this suite of utilities (individually or as a whole) from https://docs.microsoft.com/sysinternals/.
The final performance tool is a rather generic one: a simple database—something to track your performance over time. The metrics you track are whatever is relevant to your project, and the format does not have to be a full-blown SQL Server relational database (though there are certainly advantages to such a system). It can be a collection of reports stored over time in an easily readable format, or just CSV files with labels and values. The point is that you should record it, store it, and build the ability to report from it.
When someone asks you if your application is performing better, which is the better answer?
Yes.
Or:
In the last 6 months, we have reduced CPU usage by 50%, memory consumption by 25%, and request latency by 15%. Our GC rate is down to one in every 10 seconds (it used to be every second!), and our startup time is now dominated entirely by configuration loading (35 seconds).
As mentioned earlier, bragging about performance gains is so much better with solid data to back it up!
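The simplest version of such a database is a CSV file that you append a row to after every test run or release. The file and metric names below are purely illustrative:

```csharp
using System;
using System.Globalization;
using System.IO;

public static class PerfLog
{
    // Append one dated row per metric to a CSV file so that trends can
    // be graphed and compared over time.
    public static void Record(string path, DateTime date,
        params (string Name, double Value)[] metrics)
    {
        using (var writer = new StreamWriter(path, append: true))
        {
            foreach (var (name, value) in metrics)
            {
                writer.WriteLine(string.Format(CultureInfo.InvariantCulture,
                    "{0:yyyy-MM-dd},{1},{2}", date, name, value));
            }
        }
    }
}
```

A call such as PerfLog.Record("perf.csv", DateTime.Today, ("CpuPercent", 42.5), ("WorkingSetMB", 310)) appends two rows; any spreadsheet or plotting tool can then chart the metrics over time.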
You can find many other tools. There are plenty of static code analyzers, ETW event collectors and analyzers, assembly decompilers, performance profilers, and much more.
You can consider the list presented in this chapter as a starting point, but understand that you can do significant work with just these tools. Sometimes an intelligent visualization of a performance problem can help, but you will not always need it.
You will also discover that as you become more familiar with technologies like Performance Counters or ETW events, it is easy to write your own tools to do custom reporting or intelligent analysis. Many of the tools discussed in this book are automatable to some degree.
No matter what you do, measuring performance carries some overhead. CPU profiling slows your program down somewhat, performance counters require memory and/or disk space, and ETW events, as fast as they are, are not free.
You will have to monitor and optimize this overhead in your code just like all other aspects of your program. Then decide whether the cost of measurement in some scenarios is worth the performance hit you will pay.
If you cannot afford to measure all the time, then you will have to settle for some kind of profiling. As long as it is often enough to catch issues, then it is likely fine. However, do not underestimate the people cost of manual performance measurement—often, this can add up to a much higher cost than building a system that can automatically perform measurement for you.
You could also have “special builds” of your software, but this can be a little dangerous. You do not want these special builds to morph into something that is unrepresentative of the actual product.
As with many things in software, there is a balance you will have to find between having all the data you want and having optimal performance.
The most important rule of performance is Measure, Measure, Measure!
Know what metrics are important for your application. Develop precise, quantifiable goals for each metric. Average values are good, but pay attention to percentiles as well, especially for high-availability services. Ensure that you include good performance goals in the design up front and understand the performance implications of your architecture. Optimize the parts of your program that have the biggest impact first. Focus on macro-optimizations at the algorithmic or systemic level before moving on to micro-optimizations. When you are unsure about the performance of an algorithm, utilize benchmarking frameworks to test them.
Have a good foundation of performance counters and ETW events for your program. For analysis and debugging, use the right tools for the job. Learn how to use the most powerful tools like WinDbg and PerfView to solve problems quickly.