Using the .NET Framework

The previous chapter discussed general .NET coding techniques and pitfalls, especially those related to language features. In this chapter, we discuss some of the issues you must consider when using the enormous library of code that ships with .NET. I cannot possibly discuss all of the various subsystems and classes that are part of the .NET Framework, but the purpose of this chapter is to give you the tools you need to do your own investigations into performance, and to be aware of common patterns that you may need to avoid.

The .NET Framework was written with an extremely broad audience in mind (all developers everywhere, really), and is meant to be a general-purpose framework, providing stable, correct, robust code that can handle many situations. As such, it does not emphasize raw performance, and you will find many things you will need to work around in the inner loops of your codebase. The .NET Framework has to work for everybody, everywhere, making few assumptions about the calling code. It will often emphasize correctness and robustness over speed and efficiency. In your own code, however, you can often achieve large gains by understanding your own constraints and assumptions, and making appropriate customizations. This is not a license to start reimplementing String on a new project, but a reminder to be aware of the framework’s limitations where they intersect with the performance-critical areas of your own code.

To get around weaknesses in the .NET Framework (or any 3rd-party library), you may need to use some ingenuity. Some possible approaches are:

  • Use an alternate API with less cost.
  • Redesign your application to not call the API as often.
  • Re-implement some APIs in a more performant manner.
  • Use interop to call a native API that accomplishes the same thing (assuming the marshaling cost is lower).

Understand Every API You Call

The guiding principle of this chapter is this:

You must understand the code executing behind every called API.

To say you have control of your performance is to assert that you know the code that executes in every critical path of your code. You should not have an opaque 3rd-party library at the center of your inner loop—that is ceding control.

You will not always have access to the source of every method you call (though you do always have access to the assembly level!), but there is usually good documentation for all Windows APIs. With .NET, you can use one of the many IL viewing tools out there to see what the Framework is doing. (This ease of inspection does not extend to the CLR itself, which, while it is available as part of .NET Core, is written largely in dense, heavily macroed native code.)

Get used to examining Framework code for anything you are not familiar with. The more that performance is important to you, the more you need to question the implementation of APIs you do not own. Just remember to keep your pickiness proportional to the need for speed.

What follows in this chapter is a discussion of a few general areas you should be concerned with as well as some specific, common classes every program will use.

Multiple APIs for the Same Thing

You will occasionally run into a situation where you can choose among many APIs for accomplishing the same thing. A good example is XML parsing. XML parsing is never fast, exactly, but there are better ways to do it depending on your scenario. There are at least 9 different ways to parse XML in .NET:

  • XmlTextReader
  • XmlValidatingReader
  • XDocument
  • XmlDocument
  • XPathNavigator
  • XPathDocument
  • LINQ-to-XML
  • DataContractSerializer
  • XmlSerializer

Which one you use depends on factors such as ease of use, productivity, suitability for the task, and performance. XmlTextReader is very fast, but it is forward-only and does no validation. XmlDocument is very convenient because it has a full object model loaded, but it is among the slowest options.

It is as true for XML parsing as it is for other API choices: not all options are equal, performance-wise. Some will be faster, but use more memory. Some will use very little memory, but may not allow certain operations. You will have to determine which features you need and measure the performance to determine which API provides the right balance of functionality vs. performance. You should prototype the options and profile them running on sample data.

Collections

.NET provides more than 20 built-in collection types, including concurrent and generic versions of many popular data structures. Most programs will only need to use a combination of these, and you should rarely need to create your own.

Choosing which collections to use depends on many factors, including: semantic meaning in the APIs (push/pop, enqueue/dequeue, add/remove, etc.), underlying storage mechanism and cache locality, speed of various operations on the collection such as Add and Remove, and whether you need to synchronize access to the collection. All of these factors can greatly influence the performance of your program.

Collections to Avoid

Some collections still exist in the .NET Framework only for backward compatibility reasons and should never be used by new code. These include:

  • ArrayList
  • Hashtable
  • Queue
  • SortedList
  • Stack
  • ListDictionary
  • HybridDictionary

The reasons these should be avoided are casting and boxing. These collections store references to Object instances in them so you will always need to cast down to your actual object type.

The boxing problem is even more pernicious. Suppose you want to have an ArrayList of Int32 value types. Each value will be individually boxed and stored on the heap. Instead of iterating through a contiguous block of memory to access each integer, each element access will require a pointer dereference, a heap access (likely hurting locality), and then an unboxing operation to get at the inner value. This is horrible. Use a non-resizable array or one of the generic collection classes instead.
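To make the cost concrete, here is a minimal sketch contrasting the two approaches:

ArrayList boxedList = new ArrayList();
boxedList.Add(42);           // boxes the int onto the heap
int a = (int)boxedList[0];   // cast plus unbox to read it back

List<int> plainList = new List<int>();
plainList.Add(42);           // stored directly in the internal array
int b = plainList[0];        // direct read, no unboxing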

In the early versions of .NET there were some string-specific collections that are now obsolete because of the power of generics. Examples include StringCollection, StringDictionary, NameValueCollection, and OrderedDictionary. They do not necessarily have performance problems per se, but there is no need to even consider them unless you are using an existing API that requires them.

Arrays

The simplest, and likely the most-used, collection is the humble Array. Arrays are ideal because they are compact, using a single contiguous block of memory, which improves processor cache locality when accessing multiple elements (assuming value types—reference types will still jump to other places in the heap). Element access takes constant time and copying is fast. Resizing, however, means allocating a new array and copying the old values into the new object. Many of the more complicated data structures are built on top of arrays.

A common requirement is to be able to pass around a subsegment of an array. One way to do this is to copy the desired values into a new array of the correct size, but this incurs extra CPU and memory costs. A better way is to have a structure that just refers to the relevant portion of the array. Thankfully, there are two such structures already in .NET: ArraySegment<T> and Span<T>.

byte[] fileContents = File.ReadAllBytes("foo.bin");
ArraySegment<byte> header = new ArraySegment<byte>(fileContents, 
                                                   1, 
                                                   24);
ProcessHeader(header);

Many .NET APIs that operate on arrays have a version that takes either an ArraySegment<T> directly, or the individual array reference, offset, and count values. When building your own APIs that operate on arrays, it would be wise to consider making an ArraySegment<T>-compatible version.
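As an illustration, here is a sketch of what the ProcessHeader method from the example above might look like when written against ArraySegment<byte> (a hypothetical implementation, not a .NET API):

private static void ProcessHeader(ArraySegment<byte> header)
{
    // Operate only on the segment's window into the underlying array.
    for (int i = 0; i < header.Count; i++)
    {
        byte b = header.Array[header.Offset + i];
        // ...process each header byte...
    }
}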

Span<T> is similar, but can represent sub-segments of multiple types of contiguous memory and was covered in detail in Chapter 2.

Jagged vs. Multi-dimensional Arrays

There are two ways to allocate multi-dimensional arrays in .NET:

  • Multi-dimensional arrays: a single object with multiple indexes:
const int Size = 50;

private int[,,] multiArray;

void Init()
{
    this.multiArray = new int[Size, Size, Size];
}
  • Jagged arrays: arrays of arrays (multiple objects), each with a single index:
private const int Size = 50;

private int[][][] jaggedArray;

void Init()
{
    this.jaggedArray = new int[Size][][];
    for (int i = 0; i < Size; i++)
    {
        this.jaggedArray[i] = new int[Size][];
        for (int j = 0; j < Size; j++)
        {
            this.jaggedArray[i][j] = new int[Size];
        }
    }
}

They look basically equivalent, but they are very different internally. The initialization code above does not really demonstrate this difference, so consider some code that iterates through all values and modifies each entry.

public void Negate_JaggedArray()
{
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            for (int k = 0; k < Size; k++)
            {
                this.jaggedArray[i][j][k] *= -1;
            }
        }
    }
}

public void Negate_MultiArray()
{
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            for (int k = 0; k < Size; k++)
            {
                this.multiArray[i, j, k] *= -1;
            }
        }
    }
}

The code looks extremely similar, but look at the IL code underneath, just for the inner loop. For the jagged array, it is pretty much what we expect:

IL_000c: ldarg.0
IL_000d: ldfld int32[][][] ArrayPerf.ArrayPerfTest::jaggedArray
IL_0012: ldloc.0
IL_0013: ldelem.ref
IL_0014: ldloc.1
IL_0015: ldelem.ref
IL_0016: ldloc.2
IL_0017: ldelema [mscorlib]System.Int32
IL_001c: dup
IL_001d: ldind.i4
IL_001e: ldc.i4.m1
IL_001f: mul
IL_0020: stind.i4
IL_0021: ldloc.2
IL_0022: ldc.i4.1
IL_0023: add
IL_0024: stloc.2

It is just a series of memory accesses, some math, and stores. By contrast, here is the code for performing the same mutation on the multi-dimensional array:

IL_000c: ldarg.0
IL_000d: ldfld int32[0..., 0..., 0...] 
  ArrayPerf.ArrayPerfTest::multiArray
IL_0012: ldloc.0
IL_0013: ldloc.1
IL_0014: ldloc.2
IL_0015: call instance int32& int32[0..., 0..., 0...]::
  Address(int32, int32, int32)
IL_001a: dup
IL_001b: ldind.i4
IL_001c: ldc.i4.m1
IL_001d: mul
IL_001e: stind.i4
IL_001f: ldloc.2
IL_0020: ldc.i4.1
IL_0021: add
IL_0022: stloc.2

It’s mostly the same—except for that obvious method call in the middle. The difference this can make is demonstrated by the ArrayPerf project in the accompanying source:

Method              Mean      Error      StdDev
Negate_JaggedArray  281.0 us  1.7812 us  1.6661 us
Negate_MultiArray   439.6 us  0.2280 us  0.1780 us

The multi-dimensional array is significantly more expensive to traverse because of that method call. Multi-dimensional arrays could theoretically benefit from better memory locality, since all of their values are laid out consecutively in memory, but that advantage is not exploited here, and there is, frankly, very little reason to use multi-dimensional arrays in any circumstance.

Generic Collections

The generic collection classes are:

  • Dictionary<TKey, TValue>
  • HashSet<T>
  • LinkedList<T>
  • List<T>
  • Queue<T>
  • SortedDictionary<TKey, TValue>
  • SortedList<TKey, TValue>
  • SortedSet<T>
  • Stack<T>

These supersede all of the non-generic versions and should always be preferred. They incur no boxing or casting costs and will have better memory locality for value types (especially for the List-style structures that are implemented using arrays).

Within these collections, though, there can be very large performance differences. For example, Dictionary, SortedDictionary, and SortedList all store key-value relationships, but have very different insertion and lookup characteristics.

When discussing the performance of collections (or more specifically, of algorithms that operate on those collections), it is useful to have an abstract way of comparing them. In computer science, we use Big O notation for this. Briefly, Big O describes performance in terms of the problem size. For example, O(n) means “linear time”—the time required is directly proportional to the problem size, as in a linear search through an unsorted list. O(1) means “constant time”—the operation always takes the same amount of time regardless of the problem size, as in a hash table lookup. See Appendix C for a deeper discussion of Big O notation and more examples.

  • Dictionary is implemented as a hash table and has O(1) insertion and retrieval times.
  • SortedDictionary is implemented as a binary search tree and has O(log n) insertion and retrieval times.
  • SortedList is implemented as a sorted array. It has O(log n) retrieval times, but can have O(n) insertion times in the worst case. If you insert random elements it will need to resize frequently and move the existing elements. It is ideal if you insert all of the elements in order, and then use it for fast lookups.

Of the three, SortedList has the smallest memory requirements. The other two will have much more random memory access, but offer better insertion times on average. Which one of these you use depends greatly on your application’s requirements.

The difference between HashSet and SortedSet is similar to the difference between Dictionary and SortedDictionary.

  • HashSet uses a hash table and has O(1) insertion and removal operations.
  • SortedSet uses a binary search tree and has O(log n) insertion and removal operations.

List has amortized O(1) insertion at the end of the list (inserting anywhere else is O(n)), and O(n) removal and searching. Stack and Queue can only add or remove from one end of the collection, so all of their operations are O(1).

LinkedList has O(1) insertion and removal characteristics, but it should usually be avoided for large numbers of primitive types because it will allocate a new LinkedListNode<T> object for every item you add, which can be wasteful overhead.

To give you an idea of the difference in storage requirements for each collection discussed here, I ran a test program that added 1,000 4-byte integers into each structure and used !ObjSize in WinDbg to see the differences in used space:

  • List: 4,036 bytes
  • Stack: 4,036 bytes
  • Queue: 4,044 bytes
  • SortedList: 8,076 bytes
  • Dictionary: 22,144 bytes
  • LinkedList: 24,028 bytes
  • SortedSet: 24,044 bytes
  • SortedDictionary: 28,076 bytes
  • HashSet: 30,972 bytes

You can see that List, Stack, and Queue are very close to the ideal storage requirements (4,000 bytes of pure data plus minimal overhead), but the others have varying amounts of overhead.

When choosing a data structure, first choose one that makes sense logically from the functionality point of view. If it gives good performance, then great. Otherwise, see if others have better insertion or retrieval characteristics for your scenario, and have acceptable storage requirements.

Concurrent Collections

The concurrent collection classes are safe to use across multiple threads without any additional thread synchronization. All of these classes are located in the System.Collections.Concurrent namespace and are all defined for use with generics:

  • ConcurrentBag<T> (A bag is similar to a Set, but it allows duplicates)
  • ConcurrentDictionary<TKey, TValue>
  • ConcurrentQueue<T>
  • ConcurrentStack<T>

Most of these are implemented internally using Interlocked or Monitor synchronization primitives. Different methods require different levels of synchronization, so you must be careful. For example, calling Count, IsEmpty, ToArray(), and TryUpdate() on ConcurrentDictionary all involve taking a lock. You can and should examine their implementations using an IL reflection tool. Since every API is protected with some synchronization mechanism, frequent access may become expensive. As discussed in chapter 4, it may be better to forgo use of a concurrent collection and synchronize at a higher level.

Pay attention to the APIs for insertion and removal of items from these collections. They all have Try- methods which can fail to accomplish the operation in the case another thread beat them to it and there is now a conflict. For example, ConcurrentStack has a TryPop method which returns a Boolean value indicating whether it was able to pop a value. If another thread pops the last value, the current thread’s TryPop will return false.
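A minimal sketch of this pattern:

var stack = new ConcurrentStack<int>();
stack.Push(42);
...
int value;
if (stack.TryPop(out value))
{
    // We won the race--process value.
}
else
{
    // Another thread emptied the stack first.
}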

ConcurrentDictionary has a few methods which deserve special attention. You can call TryAdd to add a key and value to the dictionary, or TryUpdate to update an existing value. Often, you will not care whether the key is already in the collection—you just want to add the value or update the existing one. For this, there is the AddOrUpdate method, which does exactly that, but rather than having you provide the new value directly, it requires you to pass two delegates: one for add and one for update. If the key does not exist, the first delegate will be called with the key and you will need to return a value. If the key does exist, the second delegate is called with the key and existing value and you need to return a new value (which could just be the existing value). You should pay attention to the issue of delegate caching, described in the previous chapter.

In either case, the AddOrUpdate method will return to you the new value—but it is important to realize that this new value may not be the value from the current thread’s AddOrUpdate call! These methods are thread safe, but not atomic. It is possible another thread calls this method with the same key and the first thread will return the value from the second thread.

There is also an overload of the method that does not have a delegate for the add case (you just pass in a value).

A simple example will be helpful:

dict.AddOrUpdate(
  // Key I'm trying to add
  0, 
  // Delegate to call when adding--
  // return string value based on the key
  key => key.ToString(),
  // Delegate to call when already present--update existing value
  (key, existingValue) => existingValue);

dict.AddOrUpdate(
  // Key I'm trying to add
  0,
  // Value to add if new
  "0",
  // Delegate to call when already present--update existing value
  (key, existingValue) => existingValue);

The reason for having these delegates rather than just passing in the value is that in many cases generating the value for a given key is a very expensive operation and you do not want two threads to do it simultaneously. The delegate gives you a chance to just use the existing value instead of regenerating a new copy. However, note that there is no guarantee that the delegates are called only once. Also, if you need to provide synchronization around the value creation or update, you need to add that synchronization in the delegates themselves—the collection will not do it for you.

Related to AddOrUpdate is the GetOrAdd method which has almost identical behavior.

string val1 = dict.GetOrAdd(
  // The key to retrieve
  0,
  // A delegate to generate the value if not present
  k => k.ToString());

string val2 = dict.GetOrAdd(
  // The key to retrieve
  0,
  // The value to add if not present
  "0");

The lesson here is to be careful when using concurrent collections. They have special requirements and behaviors in order to guarantee safety and efficiency, and you need to understand exactly how they are used in the context of your program to use them correctly and effectively. Because of the semantics around delegates and the need to maintain a consistent data structure, you very well may execute more code than you think manipulating these data structures.

Bit-manipulation Collections

There are a handful of other specialized collections that ship with .NET, but most of them are string-specific or store Object, so can safely be ignored. Notable exceptions are BitArray and BitVector32.

BitArray represents an array of bit values. You can set individual bits and perform Boolean logic on the array as a whole. If you need only 32 bits of data, though, use BitVector32, which is faster and has less overhead because it is a struct (it is little more than a wrapper around an Int32).
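A minimal sketch of BitVector32 usage (it lives in the System.Collections.Specialized namespace):

BitVector32 flags = new BitVector32(0);
int firstBit = BitVector32.CreateMask();          // mask for bit 0
int secondBit = BitVector32.CreateMask(firstBit); // mask for bit 1

flags[firstBit] = true;         // set bit 0
bool isSet = flags[secondBit];  // false--bit 1 was never set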

Initial Capacity

Most collections use a simpler underlying data structure to store their data. Often, this is an array, or a set of arrays. As the collection fills up, these arrays may need to be re-allocated and the data copied. It can be easy to get into pathological cases where you spend most of your time just resizing data structures.

Thankfully, most collections susceptible to this problem provide a parameter in the constructor to pre-allocate a specific size. I recommend you use these parameters often, even in cases where you do not necessarily know the exact size you will end up with. As long as you do not extremely overestimate the capacity requirements, it will usually result in savings by avoiding repeated reallocations. In any case, you should measure this with a profiler.

Collection                          Default Capacity  Can Pre-Allocate
ConcurrentDictionary<TKey, TValue>  31                yes
Dictionary<TKey, TValue>            0                 yes
HashSet<T>                          0                 yes
List<T>                             0                 yes
Queue<T>                            0                 yes
SortedDictionary<TKey, TValue>      0                 no
SortedList<TKey, TValue>            0                 yes
SortedSet<T>                        0                 no
Stack<T>                            0                 yes

HashSet<T> will allocate a prime-sized amount of space on the first insertion. SortedSet<T> and SortedDictionary<TKey, TValue> are implemented as binary trees, which allocate individual nodes as needed, so there is nothing to pre-allocate.
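For example, if you know you will store roughly 10,000 items (a purely illustrative number):

// Pre-allocating avoids repeated internal resizing and copying.
var items = new List<int>(10000);
var lookup = new Dictionary<string, int>(10000);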

Key Comparisons

Data structures which have a key component (e.g., Dictionary<TKey, TValue>) usually also take a comparer parameter, allowing you to decide what kind of key comparisons you want.

As you will see in the String section, doing a more appropriate (even if it is more expensive computationally) comparison is almost always preferable to manipulating the key for comparison’s sake. The typical example is with strings and case-sensitivity.

I have actually seen code similar to the following because the code author did not know how to check keys correctly:

var keyToLookup = "myKey";
var dict = new Dictionary<string, object>();
...
foreach(var kvp in dict)
{
    if (kvp.Key.ToUpper() == keyToLookup.ToUpper())
    {
        ...
    }
}

This creates a tremendous amount of GC and memory waste, as well as unnecessary CPU cost. Beware of LINQ methods that make it easier to do these kinds of lookups as well.

There are a couple of ways to solve this:

  1. Restrict the underlying dataset’s keys to not require case-insensitive comparison. This will allow you to both avoid the memory allocation and get the cheapest comparison option.
  2. Pass a comparison object to the Dictionary<TKey, TValue> constructor that can be used to perform more accurate comparisons and lookups. There are a number of comparers defined for you as static properties on the StringComparer class.
var keyToLookup = "myKey";
var dict = new Dictionary<string, object>(
             StringComparer.OrdinalIgnoreCase);
...
object val;
if (dict.TryGetValue(keyToLookup, out val))
{
    ...
}

Now the dictionary can be correctly used for its intended purpose.

Sorting

An additional consideration: If you ever use any of your own types (especially structs) as values in lists, and those values are inherently orderable, you should implement IComparable<T>.

struct MyType : IComparable<MyType>
{
    public string Name { get; set; }

    public int CompareTo(MyType other)
    {                
        return this.Name.CompareTo(other.Name);
    }
}

If a value can be sorted in arbitrary ways (such as a Person object where sorting could be by birth date or by last name), then it is best to avoid implementing this altogether and avoid the potential confusion.

Creating Your Own Collection Types

I have rarely had the need to create my own collection types from scratch, but the need does occasionally arise. If the built-in types do not have the right semantics for you, then definitely create your own as an appropriate abstraction. When doing so, follow these general guidelines:

  1. Implement the standard collection interfaces wherever they make sense
    • IEnumerable<T>
    • ICollection<T>
    • IList<T>
    • IDictionary<TKey, TValue>
  2. Consider how the collection will be used when deciding how to store the data internally. Pay attention to things like locality-of-reference and favor arrays if sequential access is common.
  3. Do you need to add synchronization into the collection itself? Or perhaps create a concurrent version of the collection?
  4. Understand the run-time complexity of the add, insert, update, find, and remove algorithms. See Appendix C for a discussion of Big O complexity.
  5. Implement APIs that make semantic sense, e.g., Pop for stacks, Dequeue for queues.

Strings

In .NET, strings have two problems:

  1. Strings are immutable.
  2. We want to mutate them.

From those two items spring most of the issues.

By immutable, we mean that, once created, strings exist forever in that state until garbage collected. This means that any modification of a string results in the creation of a new string. Fast, efficient programs generally do not modify strings in any way. Think about it: strings represent textual data, which is largely for human consumption. Unless your program is specifically for displaying or processing text, strings should be treated like opaque data blobs as much as possible. If you have the choice, always prefer non-string representations of data.

String Comparisons

As with so many things in performance optimization, the best string comparison is the one that does not happen at all. If you can get away with it, use enums, or some other numeric data for decision-making. If you must use strings, keep them short and use the simplest alphabet possible.

There are many ways to compare strings: by pure byte value, using the current culture, with case insensitivity, etc. You should use the simplest way possible. For example:

String.Compare(a, b, StringComparison.Ordinal);

is faster than

String.Compare(a, b, StringComparison.OrdinalIgnoreCase);

which is faster than

String.Compare(a, b, StringComparison.CurrentCulture);

If you are processing computer-generated strings, such as configuration settings or some other tightly coupled interface, then ordinal comparisons with case sensitivity are all you need.

All string comparisons should use method overloads that include an explicit StringComparison enumeration. Omitting this should be considered an error.

Finally, String.Equals is a special case of String.Compare and should be used when you do not care about sort order. It is not actually faster in many cases, but it conveys the intent of your code better.

ToUpper and ToLower

Avoid calling methods like ToLower and ToUpper, especially if you are doing this for string comparison purposes. Instead, use one of the IgnoreCase options for the String.Compare method.

There is a bit of a tradeoff, but not much of one. A case-sensitive comparison is faster than a case-insensitive one, but that still does not justify the use of ToUpper or ToLower, which are guaranteed to process every character, while a comparison operation can often make a decision as soon as the first non-equal character is hit—even sooner if the strings differ in length. ToUpper and ToLower also create a new string, allocating memory and putting more pressure on the garbage collector. Just avoid this.
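A before-and-after sketch:

// Avoid: allocates two new strings just to compare them
bool equal = a.ToUpper() == b.ToUpper();

// Prefer: no allocations, and it can bail out early
bool equal2 = String.Equals(a, b, 
                            StringComparison.OrdinalIgnoreCase);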

Concatenation

For simple concatenation of a known (at compile time) quantity of strings, you can usually just use the + operator or the String.Concat method. These are often more efficient in both CPU and memory usage than using a StringBuilder. There is also String.Join, but it uses StringBuilder internally.

string result = a + b + c + d + e + f;

Once the situation becomes more dynamic, it becomes trickier and can depend on factors such as the number of strings, their lengths, and whether you can pre-initialize the result buffer. At this point, consider using StringBuilder, which builds up the string in internal char buffers before you convert it to a String object.

The accompanying sample code contains the project StringConcatPerf, which benchmarks a number of concatenation methods.

As a baseline, it concatenates 10 literal strings, each of length 10. This takes about 0.001ns per iteration because the compiler concatenates adjacent string literals at compile time, leaving essentially nothing to do at runtime. With non-literals, the times break down as follows:

Count = 10, Length = 1

  • StringBuilder (uninitialized): 99ns
  • StringBuilder (initialized): 125ns
  • String.Concat: 176ns

Count = 10, Length = 10

  • String.Concat: 195ns
  • StringBuilder (initialized): 246ns
  • StringBuilder (uninitialized): 402ns

Count = 100, Length = 1

  • StringBuilder (initialized): 771ns
  • StringBuilder (uninitialized): 895ns
  • String.Concat: 1714ns

Count = 100, Length = 10

  • StringBuilder (initialized): 1721ns
  • String.Concat: 1943ns
  • StringBuilder (uninitialized): 2243ns

As you can see, the relative performance depends highly on the input characteristics. The difference between “initialized” StringBuilder and “uninitialized” StringBuilder is whether you pass an argument to the constructor to pre-allocate the needed amount of space:

var sb = new StringBuilder(1024); // pre-initialize
var sb2 = new StringBuilder();    // default 16-char buffer

If at all possible, you should pre-initialize StringBuilder capacity to be at least as large as what you may need to avoid repeated allocations of additional buffers. (StringBuilder does allocate additional buffers in chained chunks, so it does largely avoid the copying problem, but it is still an inefficient use of CPU and memory and can lead to more memory being allocated than you need.)

Formatting

String.Format is an expensive method. In addition to parsing the format string, it will box value type arguments, potentially call custom formatters, and allocate more memory than is strictly needed for the resulting string. Do not use it unless necessary. Avoid it for simple situations like this:

string message = String.Format(
  "The file {0} was {1} successfully.",
  filename, 
  operation);

Instead, just do some simple concatenation:

string message = "The file " + filename + " was " + operation 
  + " successfully";

Reserve use of String.Format for cases where performance does not matter or the format specification is more complex (like specifying how many decimals to use for a double value).

The StringConcatPerf sample project produces some benchmark numbers comparing String.Format vs String.Concat:

  • String.Concat: 225ns
  • String.Format: 440ns

It is nearly twice as expensive to produce the same string with String.Format compared to String.Concat. Use the simplest and cheapest tool for the job.

ToString

Be wary of calling ToString for many classes. If you are lucky, it will return a string that already exists. Other classes will cache the string once generated. For example, the IPAddress class caches its string, but it has an extremely expensive string generation process that involves StringBuilder, formatting, and boxing. Other types may create a new string every time you call it. This can be very wasteful for the CPU and can also impact the frequency of garbage collections.

When designing your own classes, consider the scenarios your class’s ToString method will be called in. If it is called often, ensure that you are generating the string as rarely as possible. If it is only a debug helper, then it likely does not matter what it does.
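If the object's state is immutable, a simple caching sketch looks like this (BuildString is a hypothetical method containing the expensive string generation):

private string toString; // cached on first use

public override string ToString()
{
    if (this.toString == null)
    {
        this.toString = BuildString();
    }
    return this.toString;
}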

ToString is often called by loggers and various string formatting helper methods like String.Format and Console.WriteLine (and related functionality in other classes). Make sure your own classes are designed with this in mind if possible. Some classes should produce a string only for debugging purposes.

Avoid String Parsing

If you can, design your system so that string parsing is not necessary. If data needs to be transformed from strings, can it be done offline or during startup? String processing is often CPU-intensive, repetitive, and memory-heavy—all things to avoid.

If you do need to parse during runtime, use methods that do not throw exceptions in failure circumstances. For example, do not use Int32.Parse, because it will throw a FormatException. Instead, use Int32.TryParse, which simply returns false on failure. Many of the basic .NET types have similar API options.
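For example, given some arbitrary input string:

// Avoid: Int32.Parse throws FormatException on bad input
int value;
if (Int32.TryParse(input, out value))
{
    // Use value.
}
else
{
    // Handle the failure without the cost of an exception.
}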

Substrings

The String class has various methods for returning portions of the string as new strings, and these all allocate new objects. If you can operate on a portion of the string without copying it, consider using the ReadOnlySpan<T> struct described in Chapter 2 to represent a slice of the underlying character data, as in this example:

{
...
    ReadOnlySpan<char> subString = 
      "NonAllocatingSubstring".AsSpan().Slice(13);
    PrintSpan(subString);
...
}

private static void PrintSpan<T>(ReadOnlySpan<T> span)
{
    for (int i = 0; i < span.Length; i++)
    {
        T val = span[i];
        Console.Write(val);
        if (i < span.Length - 1) { Console.Write(", "); }
    }
    Console.WriteLine();
}

The output of this code is:

S, u, b, s, t, r, i, n, g

Avoid APIs that Throw Exceptions Under Normal Circumstances

Exceptions are expensive, as you saw in Chapter 5. As such, they should be reserved for truly exceptional circumstances. Unfortunately, there are some common APIs that defy this basic assumption.

Most basic types have a Parse method, which will throw a FormatException when the input string is in an unrecognized format. For example, Int32.Parse, DateTime.Parse, etc. Avoid these methods in favor of TryParse, which returns false if parsing fails.

Another example is the System.Net.HttpWebRequest class, which will throw an exception if it receives a non-success response (such as a 4xx or 5xx status) from a server. This bizarre behavior is thankfully corrected in the System.Net.Http.HttpClient class in .NET 4.5.

Avoid APIs That Allocate From the Large Object Heap

Recall from Chapter 2 that arrays are basically the only things that can be allocated large enough to be placed on the large object heap. If your own code is allocating these, they should of course be pooled.

Unfortunately, there are .NET APIs that will do this as well. The only way to find these allocations is by profiling the heap, for example with PerfView, which will show you the stacks responsible for large object allocations. You cannot always avoid such APIs, so just be aware that they exist. For example, calling the Process.GetProcesses method will guarantee an allocation on the large object heap. You can avoid repeated LOH allocations by caching its results, calling it less frequently, or avoiding it completely by retrieving the information you need via interop directly from the Win32 API.

Some APIs will pool the large objects they create to avoid repeated allocation on the large object heap (StringBuilder does this, for example), but you would need to analyze their code to determine this. Profiling and going from there should be enough to steer you in the right direction.

Use Lazy Initialization

If your program uses a large or expensive-to-create object that is rarely used, or may not be used at all during a given invocation of the program, you can use the Lazy<T> class to wrap a lazy initializer around it. As soon as the Value property is accessed, the real object will be initialized according to the constructor you used to create the Lazy<T> object.

If your object has a default constructor, you can use the simplest version of Lazy<T>:

var lazyObject = new Lazy<MyExpensiveObject>();
...
if (needRealObject)
{
  MyExpensiveObject realObject = lazyObject.Value;
  ...
}

If construction is more complex, you can pass a Func<T> to the constructor.

var myObject = new Lazy<MyExpensiveObject>(
  () => Factory.CreateObject("A"));
...
MyExpensiveObject realObject = myObject.Value;

Factory.CreateObject is a dummy method that produces an object of type MyExpensiveObject.

If myObject.Value is accessed from multiple threads before initialization has completed, it is possible for more than one thread to attempt to initialize the object. By default, Lazy<T> is thread safe and only a single thread will be allowed to execute the creation delegate and set the Value property. You can modify this behavior with a LazyThreadSafetyMode enumeration value. This enumeration has three values:

  • None: No thread safety. You must ensure that the Lazy<T> object is only ever accessed from a single thread.
  • ExecutionAndPublication: Only a single thread is allowed to execute the creation delegate and set the Value property. This is the default value if you use a constructor of Lazy without this option.
  • PublicationOnly: Multiple threads can execute the creation delegate, but only a single one will initialize the Value property.

You should use Lazy<T> in place of your own singleton and double-checked locking pattern implementations.

If you have a large number of objects and Lazy<T> is too much overhead to use, you can use the static EnsureInitialized method on the LazyInitializer class. This uses Interlocked methods to ensure that the object reference is only assigned to once, but it does not ensure that the creation delegate is called only once. Unlike Lazy<T>, you must call the EnsureInitialized method yourself.

static MyObject[] objects = new MyObject[1024];      

static void EnsureInitialized(int index)
{
  LazyInitializer.EnsureInitialized(ref objects[index], 
      () => ExpensiveCreationMethod(index));
}

Note that by using the delegate here, you will have an extra allocation for the delegate object, in addition to whatever allocations ExpensiveCreationMethod performs.

The Surprisingly High Cost of Enums

You probably do not expect methods that operate on Enum values, a fundamentally integer type, to be very expensive. Unfortunately, because of the requirements of type safety, many simple operations are more expensive than you realize.

Take the Enum.HasFlag method, for example. You likely imagine the implementation to be something like the following:

public static bool HasFlag(Enum value, Enum flag)
{
  return (value & flag) != 0;
}

Unfortunately, what you actually get is something similar to:

// C# code generated by ILSpy
public bool HasFlag(Enum flag)
{
  if (flag == null)
  {
    throw new ArgumentNullException("flag");
  }
  if (!base.GetType().IsEquivalentTo(flag.GetType()))
  {
    throw new ArgumentException(
      Environment.GetResourceString(
        "Argument_EnumTypeDoesNotMatch",
        new object[]
        {
          flag.GetType(),
          base.GetType()
        }));
  }
  return this.InternalHasFlag(flag);
}

This is a good example of the side effects of using a general purpose framework. If you control your entire code base, then you can do better, performance-wise, by making certain assumptions and enforcing constraints not available to the .NET Framework. If you find you need to do a HasFlag test a lot, then do the check yourself:

[Flags]
enum Options
{
  Read = 0x01,
  Write = 0x02,
  Delete = 0x04
}

...

private static bool HasFlag(Options option, Options flag)
{
  return (option & flag) != 0;
}

Enum.ToString is also quite expensive for Enums that have the [Flags] attribute. One option to make it cheaper is to cache all of the ToString calls for that Enum type in a simple Dictionary<EnumType, String>. Or you can avoid writing these strings at all and get much better performance just using the actual numeric value and convert to strings offline.
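A minimal sketch of such a cache, using the Options enum defined above (add synchronization, or use ConcurrentDictionary, if multiple threads are involved):

private static readonly Dictionary<Options, string> OptionStrings = 
    new Dictionary<Options, string>();

private static string ToStringCached(Options value)
{
    string str;
    if (!OptionStrings.TryGetValue(value, out str))
    {
        str = value.ToString(); // expensive [Flags] formatting, done once
        OptionStrings[value] = str;
    }
    return str;
}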

For a fun exercise, see how much code is invoked when you call Enum.IsDefined. Again, the existing implementation is perfectly fine if raw performance does not matter, but you will be horrified if you find out it is a real bottleneck for you!

Story I found out about the performance problems of Enum the hard way, after a release. During a regular CPU profile I noticed that a significant portion of CPU, over 3%, was going to just Enum.HasFlag and Enum.ToString. Excising all calls to HasFlag and using a Dictionary for the cached strings reduced the overhead to negligible amounts.

Tracking Time

Time means two things:

  • Absolute time of day
  • Time span (how long something took)

For absolute times, .NET supplies the versatile DateTime structure. However, calling DateTime.Now is a fairly expensive operation because it has to consider time zone information. Consider calling DateTime.UtcNow instead, which is more streamlined.

Even calling DateTime.UtcNow might be too expensive for you if you need to track a lot of time stamps. If that is the case, get the time once and then track offsets instead, rebuilding the absolute time offline, using the time span measuring techniques shown here.

To measure time intervals, .NET provides the TimeSpan struct. If you subtract two DateTime structs you will get a TimeSpan struct.

However, if you need to measure very small time spans with minimal overhead, use the system’s performance counter via the System.Diagnostics.Stopwatch class, which will return to you a 64-bit number measuring the number of clock ticks since the CPU received power. To calculate the real time difference, take two measurements of the clock tick, subtract them, and divide by the system’s tick frequency (Stopwatch.Frequency). Note that this frequency is not necessarily related to the CPU’s frequency. Most modern processors change their CPU frequency often, but the tick frequency will not be affected.

You can use the Stopwatch class like this:

var stopwatch = Stopwatch.StartNew();
//...do work...
stopwatch.Stop();
TimeSpan elapsed = stopwatch.Elapsed;
long elapsedTicks = stopwatch.ElapsedTicks;

There are also static methods to get a time stamp and the clock frequency, which may be more convenient if you are tracking a lot of time stamps and want to avoid the overhead of creating a new Stopwatch object for every interval.

long receiveTime = Stopwatch.GetTimestamp();
long parseTime = Stopwatch.GetTimestamp();
long startTime = Stopwatch.GetTimestamp();
long endTime = Stopwatch.GetTimestamp();

double totalTimeSeconds = (endTime - receiveTime) / 
    (double)Stopwatch.Frequency;

Finally, remember that values received from the Stopwatch.GetTimestamp method are only valid in the current executing session and only for calculating relative time differences.

Combining the two types of time, you can see how to calculate offsets from a base DateTime object to get new absolute times:

DateTime start = DateTime.Now;
long startTime = Stopwatch.GetTimestamp();
long endTime = Stopwatch.GetTimestamp();

double diffSeconds = (endTime - startTime) 
    / (double)Stopwatch.Frequency;
DateTime end = start.AddSeconds(diffSeconds);

Regular Expressions

A regular expression is a way to search for a string with advanced pattern-matching syntax.

A simple example:

Regex regex = new Regex("<value>(.*)</value>");

This finds any text between some XML <value> tags. Regular expressions can be extremely complex and difficult to understand until you master their syntax and learn how to break them down into manageable chunks.

A more complex-looking expression is:

static Regex regex = 
  new Regex("$s*[-+]?([0-9]{0,3}(,[0-9]{3})*(.[0-9]+)?)");

This extracts dollar values of the form “$-34.19”, “$52”, and a few others.

Regular expressions are not fast. Internally, .NET creates a state machine to traverse the input string and match it against the pattern’s symbols. The costs include:

  • Evaluation time can be long: This depends on the input text and the pattern to match. It is quite easy to write regular expressions that perform poorly. Minor differences in the expression can make significant differences in processing speed and optimizing them in and of themselves is a whole topic unto itself.
  • Assembly generation: With some options, an in-memory assembly is generated on the fly when you create a Regex object. This helps with the runtime performance, but is expensive to create the first time.
  • JIT costs can be high: The code generated from a regular expression can be very long and have patterns that give the JIT compiler fits. Thankfully, recent versions of the JIT have largely improved this.

As far as .NET goes, there are a few things you can do to improve the efficiency of Regex usage:

  • Ensure you are up-to-date with .NET and patches. Specifically, .NET 4.6 has a new JIT compiler that dramatically improves regular expression parsing performance.
  • Create a Regex instance variable rather than using the static methods. The static methods will internally create an instance anyway, but you will not be able to reuse it. For one-offs, these are fine.
  • Create the Regex object with the RegexOptions.Compiled flag. Without this, regular expressions are interpreted. When compiled, expressions are much faster to evaluate, at the expense of steeper initialization costs due to emitting code into an in-memory assembly and then JITting it. You may want to avoid compiling expressions if you are always creating new patterns and rarely reusing earlier ones, as this will pollute the process with a lot of extra code.
  • Do not recreate Regex objects over and over again. Create one, save it, and reuse it to match on new input strings. Never have a pattern like this:
class Foo
{
    private static void Evaluate(string[] inputs)
    {
        for (int i = 0; i < inputs.Length; i++)
        {
            Regex regex = new Regex("<value>(.*)</value>", 
                                    RegexOptions.Compiled);
            Match match = regex.Match(inputs[i]);
            ...
        }
    }
}

Do this instead:

class Foo
{
    private static readonly Regex regex = 
      new Regex("<value>(.*)</value>", 
                RegexOptions.Compiled);
    
    private static void Evaluate(string[] inputs)
    {
        for (int i = 0; i < inputs.Length; i++)
        {
            Match match = regex.Match(inputs[i]);
            ...
        }
    }
}

You may also want to impose a timeout on complex expressions:

Regex regex = new Regex("<value>(.*)</value>", 
                        RegexOptions.Compiled, 
                        TimeSpan.FromSeconds(5));

You can also impose a global timeout via an application domain setting:

AppDomain.CurrentDomain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", 
                                TimeSpan.FromSeconds(10));

As an alternative to regular expressions, you can find or build your own parsing library for very simple patterns like dates or phone numbers that have few variations.

LINQ

The biggest danger with Language Integrated Query (LINQ) is that it has the potential to hide code from you—code for which you cannot be accountable because it is not present in your source file!

To illustrate this, look at a fairly simple example of a method that shifts all values in an array by a certain amount:

public static int[] ShiftLinq(int[] vals, int shiftAmount)
{
    var temp = from n in vals select n + shiftAmount;
    return temp.ToArray();
}

It looks fairly innocuous, but there is a lot going on under the covers. Look at the IL this translates to:

.method public hidebysig static 
    int32[] ShiftLinq (
        int32[] vals,
        int32 shiftAmount
    ) cil managed 
{
    // Method begins at RVA 0x209c
    // Code size 37 (0x25)
    .maxstack 3
    .locals init (
      [0] class LinqCost.Program/'<>c__DisplayClass1_0'
    )

    IL_0000: newobj instance void LinqCost.Program
        /'<>c__DisplayClass1_0'::.ctor()
    IL_0005: stloc.0
    IL_0006: ldloc.0
    IL_0007: ldarg.1
    IL_0008: stfld int32 
        LinqCost.Program/'<>c__DisplayClass1_0'
          ::shiftAmount
    IL_000d: ldarg.0
    IL_000e: ldloc.0
    IL_000f: ldftn instance int32 
        LinqCost.Program/'<>c__DisplayClass1_0'
          ::'<ShiftLinq>b__0'(int32)
    IL_0015: newobj instance void class 
        [mscorlib]System.Func`2<int32, int32>
          ::.ctor(object, native int)
    IL_001a: call class [mscorlib]
        System.Collections.Generic.IEnumerable`1<!!1> 
        [System.Core]System.Linq.Enumerable
        ::Select<int32, int32>(class 
        [mscorlib]System.Collections.Generic.IEnumerable`1
          <!!0>, class [mscorlib]System.Func`2<!!0, !!1>)
    IL_001f: call !!0[] 
        [System.Core]System.Linq.Enumerable
        ::ToArray<int32>(
          class [mscorlib]
          System.Collections.Generic.IEnumerable`1<!!0>)
    IL_0024: ret
} // end of method Program::ShiftLinq

You can see that there are 37 bytes of IL code, including two allocations for closure objects and delegates, as well as two method calls. Some of those method calls will have further allocations. To top it off, the final ToArray call at the end takes an IEnumerable<T> argument, which means that, depending on the source collection, it will not know how long the array is until it gets to the end. That implies continual resizing and copying until the end, then a final copy into a precisely sized array. Whether any of this matters to you depends on the needs of your program. It is certainly far more expensive than an alternative implementation:

public static void Shift(int[] vals, int shiftAmount)
{
    for (int i = 0; i < vals.Length; i++)
    {                
        vals[i] += shiftAmount;                
    }
}

Instead of query syntax, we have a typical for-loop. This translates to the following IL:

.method public hidebysig static 
    void Shift (
        int32[] vals,
        int32 shiftAmount
    ) cil managed 
{
    // Method begins at RVA 0x20d0
    // Code size 27 (0x1b)
    .maxstack 3
    .locals init (
        [0] int32
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: br.s IL_0014
    // loop start (head: IL_0014)
        IL_0004: ldarg.0
        IL_0005: ldloc.0
        IL_0006: ldelema 
                   [mscorlib]System.Int32
        IL_000b: dup
        IL_000c: ldind.i4
        IL_000d: ldarg.1
        IL_000e: add
        IL_000f: stind.i4
        IL_0010: ldloc.0
        IL_0011: ldc.i4.1
        IL_0012: add
        IL_0013: stloc.0

        IL_0014: ldloc.0
        IL_0015: ldarg.0
        IL_0016: ldlen
        IL_0017: conv.i4
        IL_0018: blt.s IL_0004
    // end loop

    IL_001a: ret
} // end of method Program::Shift

The resulting IL is smaller at 27 bytes, but more importantly there are no allocations or method invocations whatsoever.

If you are paying attention, you may notice a rather significant “cheat” in the for-loop version of the code: it modifies the original array, while the LINQ version creates a new one. This is deliberate. LINQ pushes you in the direction of creating new objects rather than reuse existing ones. This is often because LINQ makes such heavy use of the IEnumerable<T> interface, which is the lowest-level interface a collection can implement. This means that semantics more appropriate to the specific data structure cannot be used, and we lose some performance in the process. However, there are methods which will check for the presence of other interfaces like ICollection or IList and make use of those methods if possible. LINQ also takes a rather functional, immutable view of data. The lesson here is that your tools will often enforce design and performance constraints upon you.

So how does the performance actually differ? Here are the benchmark result numbers:

Method     Mean       Error      StdDev     Gen 0    Allocated
ShiftLinq  286.78 ns  1.8867 ns  1.7648 ns  0.0448   236 B
Shift       12.38 ns  0.1568 ns  0.1390 ns  -        0 B

The LINQ version is more than 20 times more expensive than the simpler version and, because of its memory allocations, has a higher likelihood of causing garbage collections.

I want to be clear in stating that LINQ is not a bad technology—it has its place. It is phenomenally convenient at times, and many LINQ queries are perfectly performant, but it can make heavy use of delegates, interfaces, and temporary object allocation if you go crazy with temporary dynamic objects, joins, or complex where clauses.

You can often achieve some significant speedup in time by using Parallel LINQ, but keep in mind this is not actually reducing the amount of work to be done; it is just spreading it across multiple processors. For a mostly single-threaded application that just wants to reduce the time it takes to execute a LINQ query, this may be perfectly acceptable. On the other hand, if you are writing a server that is already using all of the cores to perform processing, then spreading LINQ across those same processors will not help the big picture and may even hurt it. In this case, it may be better to do without LINQ at all and find something more efficient.

If you suspect that you have some unaccounted-for complexity, run PerfView and look at the JITStats view to see the IL sizes and JIT times for methods that involve LINQ. Also look at the CPU usage of those methods once JITted.

It is worth noting that LINQ in .NET Core has made a number of efficiency improvements, such as reducing the number of allocations overall, avoiding delegate allocations, and improving algorithms all over the place to take advantage of known collection sizes (getting around the IEnumerable problem) that can offset some of the costs described here, but not all of them.

In the end, you must choose the appropriate tool for the job, and LINQ may be perfectly acceptable and convenient in most parts of your software, but inappropriate for other parts. Measure where allocations and CPU are going, and if it is in LINQ-heavy code, it is usually easy to convert it. Always let the profiler be the guide.

Reading and Writing Files

There are a number of convenience methods on the File class such as Open, OpenRead, OpenText, and OpenWrite. These are fine if performance is not critical.

If you are doing a lot of disk I/O, then you need to pay attention to the type of disk access you are doing, whether it is random, sequential, or if you need to ensure that the write has been physically written to the device before notifying the application of I/O completion. For this level of detail, you will need to use the FileStream class and a constructor overload that accepts the FileOptions enumeration. You can logically OR multiple flags together, but not all combinations are valid. None of these options are required, but they can provide hints to the operating system or file system on how to optimize file access.

using (var stream = new FileStream(
            @"C:foo.txt", 
            FileMode.Open, 
            FileAccess.Read, 
            FileShare.Read, 
            16384 /* Buffer Size*/, 
            FileOptions.SequentialScan | FileOptions.Encrypted))
{
...  
}

The options available to you are:

  • None: No additional options.
  • Asynchronous: Indicates that you will be doing asynchronous reading or writing to the file. This is not required to actually perform asynchronous reads and writes, but if you do not specify it, the underlying I/O will be performed synchronously without I/O completion ports. (Your thread will still execute asynchronously.) There are also overloads of the FileStream constructor that take a Boolean parameter to specify asynchronous access.
  • DeleteOnClose: Causes the OS to delete the file when the last handle to the file is closed. Use this for temporary files.
  • Encrypted: Causes the file to be encrypted using the current account’s credentials.
  • RandomAccess: Gives a hint to the file system to optimize caching for random access.
  • SequentialScan: Gives a hint to the file system that the file is going to be read sequentially from beginning to end.
  • WriteThrough: Ignores caching and goes directly to the device’s permanent storage. This generally makes I/O slower. The flag will be obeyed by the file system’s cache, but many storage devices also have onboard caches, and they are free to ignore this flag and report a successful completion before it is written to permanent storage.

Random access is bad for any device, such as a hard disk or tape, that needs to seek to the required position. Sequential access should be preferred for performance reasons. However, many computers have SSDs these days where the difference between random and sequential access is minimal or nonexistent.

Most applications will not have to care about the specific style of I/O until they are completely saturating the devices they are accessing. When that occurs, you will need to take into account the type of hardware that exists, as well as what kind of accessing you are performing, and use the above flags accordingly.

Optimizing HTTP Settings and Network Communication

If your application makes outbound HTTP calls, there are a number of settings you can change to optimize network transmission. You should exercise caution in changing these, however, as their effectiveness greatly depends on your network topology and the servers on the other end of the connection. You also need to take into account whether the target endpoints are in a data center you control, or are somewhere on the Internet. You will need to measure carefully to see if these settings benefit you or not.

To change these by default for all endpoints, modify these static properties on the ServicePointManager class (a sketch follows the list):

  • DefaultConnectionLimit: The number of connections per endpoint. Setting this higher may increase overall throughput if the network links and both endpoints can handle it.
  • Expect100Continue: When a client initiates a POST or PUT command it normally waits for a 100-Continue signal from the server before proceeding to send the data. This allows the server to reject the request before the data is sent, saving bandwidth. If you control both endpoints and this situation does not apply to you, turn this off to improve latency.
  • UseNagleAlgorithm: Nagling, described in RFC 896 at https://tools.ietf.org/html/rfc896, reduces per-packet overhead on a network by combining many small packets into a single larger packet. This can reduce overall network transmission overhead, but it can also cause packets to be delayed. On modern networks, this setting should usually be off. Experiment with disabling it and measure whether response times improve.
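
As a sketch, the defaults might be adjusted like this at application startup (the specific values are illustrative, not recommendations):

// These static properties live on System.Net.ServicePointManager.
// Illustrative values only -- measure before and after changing them.
ServicePointManager.DefaultConnectionLimit = 16;
ServicePointManager.Expect100Continue = false;
ServicePointManager.UseNagleAlgorithm = false;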

Some of these settings can also be applied to individual ServicePoint objects, which can be useful if you want to customize settings by endpoint, perhaps to differentiate between local datacenter endpoints and those on the Internet. In addition to the above, the ServicePoint class also lets you control additional parameters (see the sketch after this list):

  • ConnectionLeaseTimeout: Specifies the maximum time in milliseconds that an active connection will be kept alive. Set this to -1 to keep connections alive forever. This setting is useful for load balancing, where you will want to periodically force connections to close so that the process will re-connect to different machines. Setting this value to 0 will cause the connection to close after every request. This is not recommended because making a new HTTP connection is fairly expensive.
  • ConnectionLimit: Specifies the maximum number of connections this endpoint can have.
  • MaxIdleTime: Specifies the maximum time in milliseconds that a connection can remain open but idle. Set this to Timeout.Infinite to keep connections open indefinitely, regardless of whether they are active or not.
  • ReceiveBufferSize: The size of the buffer used for receiving requests. The default is 8 KB. You can use a larger buffer if you regularly get large requests.
  • SupportsPipelining: Allows multiple requests to be sent without waiting for a response between each one. However, the responses are sent back in order. See RFC 2616 at https://tools.ietf.org/html/rfc2616 (the HTTP/1.1 standard) for more information.
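
For example, a sketch of per-endpoint tuning (the URI and the specific values are hypothetical):

// Retrieve the ServicePoint that manages connections to one endpoint.
ServicePoint sp = ServicePointManager.FindServicePoint(
    new Uri("http://internal-service.example.com"));
sp.ConnectionLeaseTimeout = 60 * 1000; // recycle connections every minute
sp.ConnectionLimit = 32;
sp.MaxIdleTime = 10 * 1000;            // close idle connections after 10 seconds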

You can also force an individual HTTP request to close its current connection (after the response has been sent back) by setting the KeepAlive header to false.
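
With HttpWebRequest, for instance, that looks like this (the URL is hypothetical):

// KeepAlive = false sends a Connection: close header, so the
// connection is not reused after this response completes.
var request = (HttpWebRequest)WebRequest.Create("http://example.com/data");
request.KeepAlive = false;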

Story Ensure that what you are transmitting is optimally encoded. While profiling an internal system, we noticed an extremely high memory allocation rate and CPU usage for a particular component. With some investigation, we realized that it was receiving an HTTP response, transforming the received bytes into a base64-encoded string, decoding that string into a binary blob, and finally deserializing that blob back into a strongly typed object. It was wasting bandwidth by needlessly encoding a binary blob as a string, wasting CPU with multiple layers of encoding, and causing more time to be spent in GC because of multiple large object allocations. The lesson is to send only what you need, as compactly as possible. Base64 is rarely, if ever, useful today, especially among internal components. Regardless of whether you are doing file or network I/O, encode the data as ideally as possible. For example, if you need to transmit a series of integers, do not waste CPU, memory, disk space, and network bandwidth wrapping them in XML.
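
A minimal sketch of the compact approach for a series of integers (outputStream and values are assumed to exist):

// Compact: raw integers via BinaryWriter -- four bytes each, no
// text encoding or parsing required on either end.
using (var writer = new BinaryWriter(outputStream))
{
    foreach (int value in values)
    {
        writer.Write(value);
    }
}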

Finally, another word of caution relating to the principle highlighted at the top of this chapter about the general-purpose nature of much of the .NET Framework. The built-in HTTP client, while generally very good and perfectly acceptable for downloading Internet content, may not be suitable for all applications, particularly if your application is very sensitive to latencies at high percentiles, and especially with intra-datacenter requests. If you care about 95th or 99th percentile latencies for HTTP requests, you may have to write your own HTTP client around the underlying WinHTTP APIs to get that last bit of performance. Doing this takes considerable expertise in both HTTP and multithreading in .NET to get right, so you need to justify the effort.

SIMD

Single-Instruction, Multiple-Data features of processors execute a single set of operations across multiple pieces of data. Most modern x64-platform processors from both Intel and AMD support these instructions, and they are critical for parallel computations required in things like gaming, scientific, and mathematical computation. The current JIT compiler as of this writing (4.7.1) makes some limited use of these instructions and registers, mostly through the System.Numerics.Vectors NuGet package.

For an example, consider an algorithm that finds the minimum value in an array. A straightforward serial algorithm would look like this:

int min = int.MaxValue;
for (int i = 0; i < sourceArray.Length; i++)
{
    if (sourceArray[i] < min)
    {
        min = sourceArray[i];
    }
}

The equivalent SIMD version of this is considerably more complex:

var minVector = new Vector<int>(int.MaxValue);
int i = 0;
// Process the array in chunks of the vector's length
for (i = 0; i <= sourceArray.Length - Vector<int>.Count; i += Vector<int>.Count)
{
    var subVector = new Vector<int>(sourceArray, i);
    minVector = Vector.Min(subVector, minVector);
}

// Reduce the min vector to a single scalar minimum
int min = int.MaxValue;
for (int j = 0; j < Vector<int>.Count; j++)
{
    min = Math.Min(min, minVector[j]);
}

// Handle any leftover elements that did not fill a whole vector
for (; i < sourceArray.Length; i++)
{
    min = Math.Min(min, sourceArray[i]);
}

Since Vector uses hardware registers, the length is governed by that hardware and is static, thus the use of the static property Vector<int>.Count in the example above. On my machine, the SIMD version is not any faster than the non-SIMD version. However, the situation is very different for another algorithm.

Here is an algorithm that scales an array by an integer value. The non-SIMD version is very simple:

for (int i = 0; i < this.sourceArray.Length; i++)
{
    this.destinationArray[i] = this.sourceArray[i] * ScaleFactor;
}

The SIMD version is arguably simpler:

this.destinationVector = this.sourceVector * ScaleFactor;
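
Here, sourceVector and destinationVector are Vector<int> fields wrapping chunks of the underlying arrays. Applied to whole arrays, a minimal sketch (array names assumed) looks like this:

// Multiply each full chunk of the source array by ScaleFactor and
// copy the result into the destination array.
int i = 0;
for (i = 0; i <= sourceArray.Length - Vector<int>.Count; i += Vector<int>.Count)
{
    var chunk = new Vector<int>(sourceArray, i);
    (chunk * ScaleFactor).CopyTo(destinationArray, i);
}

// Scale any leftover elements one at a time.
for (; i < sourceArray.Length; i++)
{
    destinationArray[i] = sourceArray[i] * ScaleFactor;
}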

In this case, the performance difference is enormous. Here are some benchmark results for both types of algorithms:

Method         Mean      Error      StdDev
-------------  --------  ---------  ---------
Min_NonSIMD    4.312 ns  0.1615 ns  0.4761 ns
Min_SIMD       8.645 ns  0.0342 ns  0.0320 ns
Scale_NonSIMD  4.764 ns  0.0201 ns  0.0188 ns
Scale_SIMD     2.363 ns  0.0052 ns  0.0046 ns

You can see that scaling the vector is about twice as fast as the non-SIMD version.

There is no guarantee that any arbitrary SIMD version of an algorithm will be faster than its non-SIMD equivalent. Getting the algorithm right depends on context, hardware, and the specifics of your CPU and memory access patterns. There is overhead to using SIMD instructions, and the trick is keeping that overhead smaller than the speedup you gain from the parallelism. Usually, the effectiveness of a SIMD-based algorithm depends on how well you can keep it operating on data within the Vector struct or other System.Numerics data structures; this is the equivalent of keeping the calculations on the hardware. When you transfer values from the Vector back to standard .NET data structures, you lose the benefits of the parallelism and hardware acceleration.
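
Because support varies by hardware and runtime, it is also worth guarding the SIMD path at runtime. A minimal sketch (the two helper methods are hypothetical):

// Vector.IsHardwareAccelerated is false when the JIT cannot map
// Vector<T> onto SIMD registers; fall back to the scalar path then.
if (Vector.IsHardwareAccelerated)
{
    ScaleWithSimd(sourceArray, destinationArray); // hypothetical helper
}
else
{
    ScaleScalar(sourceArray, destinationArray);   // hypothetical helper
}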

Investigating Performance Issues

Many of the techniques for finding issues with .NET Framework performance are exactly the same as with your own code. When you use tools to profile CPU usage, memory allocations, exceptions, contention, and more, you will see the hot spots in the framework just like you see them in your own code.

Note that PerfView will group much of the framework together and you may need to change the view settings to get a better picture of where Framework performance is going.

Performance Counters

.NET has many categories of performance counters. Chapters 2 through 4, which cover garbage collection, JIT compilation, and asynchronous programming, all detail the performance counters for their specific topic. .NET has additional performance counters for the following categories (a sketch of reading one programmatically follows the list):

  • .NET CLR Data: Counters relating to SQL clients, connection pools, and commands.
  • .NET CLR Exceptions: Counters relating to rate of exceptions thrown.
  • .NET CLR Interop: Counters relating to calling native code from managed code.
  • .NET CLR Networking: Counters relating to connections and amount of data transmitted.
  • .NET CLR Remoting: Counters relating to the number of remote calls, object allocations, channels, and more.
  • .NET CLR Data Provider for SqlServer/Oracle: Counters for various .NET database clients.

Depending on your system's configuration, you may see more or fewer categories than these.

Summary

As with all frameworks, you need to understand the implementation details of all the APIs you use. Do not take anything for granted.

Take care when picking collection classes. Consider API semantics, memory locality, algorithmic complexity, and space usage when choosing a collection. Prefer jagged arrays over multi-dimensional arrays. Completely avoid the older-style non-generic collections like ArrayList and Hashtable. Use concurrent collections only when you need to synchronize most or all of the accesses. Consider initial capacity, key comparisons, and sorting ability with collections and the objects they store.

Pay particular attention to string usage and avoid creating extra strings. Prefer simpler APIs over more complex ones like String.Format. Use ReadOnlySpan<T> for substrings, when possible.

Avoid APIs that throw exceptions in normal circumstances, allocate from the large object heap, or have more expensive implementations than you expect.

When using regular expressions, make sure that you do not recreate the same Regex objects over and over again, and strongly consider compiling them with the RegexOptions.Compiled flag.

Pay attention to the type of I/O you are doing and use the appropriate flags when opening files to give the OS a chance to optimize performance for you. For network calls, disable Nagling and Expect100Continue. Only transmit the data you need and avoid unnecessary layers of encoding.

Avoid using reflection APIs to execute dynamically loaded code. Call this kind of code via common interfaces or through code-generated delegates.

If you are doing significant manipulation of arrays and matrices, use the System.Numerics.Vectors NuGet package to take advantage of SIMD instructions.
