12

Load Testing

“If you can fill the unforgiving minute
With sixty seconds’ worth of distance run,
Yours is the Earth and everything that’s in it,
And – which is more – you’ll be a Man, my son!”

– If by Rudyard Kipling, from the book Rewards and Fairies

In the previous chapter, we saw the importance of destructive testing and trying the failure modes of your system when different services are restarted or offline. In this chapter, we will turn to another vital aspect of non-functional testing – load testing.

Load testing is non-functional because it provides no new checks of the functions of your system. All the tests you run here should have been covered before in black- and white-box testing to ensure they work in at least one scenario. Load testing extends that testing to determine the limits of your system – how does it perform during periods of peak activity? Are operations resilient and reliable even when running many millions of times? How fast does your system run during high loading? What CPU, disk, and memory resources does your system need, and are they reasonable?

A separate question involves what happens when you go beyond these limits to deliberately restrict system resources, run for excessive periods, or apply too great a load. This will be considered in the next chapter, Chapter 13, Stress Testing. In this chapter, you will learn about the following topics:

  • How to identify load testing operations
  • Understanding load testing variables
    • Static versus dynamic
    • Soak testing versus spikes in load
  • Designing load-testing applications
  • What to check while load testing
  • How to find race conditions in asynchronous systems
  • Identifying coding inefficiencies
  • Checking the performance of your system
  • Filtering loading errors
  • Debugging loading issues

The question for this chapter is, can your system fill each unforgiving minute with 60 seconds’ worth of program run?

Advantages and disadvantages of load testing

As with many of the techniques in this book, the main advantage of load testing is that it is the only way to find this class of bugs; there is no alternative. Only by deliberately loading the system can you discover the interactions and limitations that appear when your application performs many actions quickly. These are very difficult to predict and plan for, which is why this testing is so essential. The other advantages and disadvantages of load testing are as follows:

Advantages                                 | Disadvantages
-------------------------------------------|-----------------------------
The only way to find this class of issue   | Requires dedicated tools
The only way to measure system performance | Requires high-quality tools
Uncovers hard-to-find issues               | Harder to debug
Finds software inefficiencies              | Less realistic scenarios
                                           | Expensive

Table 12.1 – Advantages and disadvantages of load testing

The other major advantage of load testing is that it is the only way to measure system performance. While the specification might state that the system is capable of handling some amount of load, that is not a value the development team can simply type in. It emerges from the interactions and performance of many different components, which makes it challenging to predict; the only way to be sure is to test it in practice.

To find errors that only occur once in a thousand times, you’ll need to perform that function a thousand times, on average. Such failures may be rare, but depending on how often your customers use that function, they may still happen daily. These are crucial issues that the product owner and development team won’t find by giving the code a quick try, but they will affect your users. They are difficult to reproduce and isolate, so they need a dedicated test plan.

Load testing also finds inefficiencies in your application. For example, if a core function takes 10 times as long to run in your latest release, this is where you should see that. The function might work and pass all other forms of testing, but when you perform thousands of them each day, your system won’t handle the load and will start to fail.

Real-world example – Reading events since the beginning of time

One of the worst bugs I ever shipped could only have been found with load testing. At one cloud provider where I worked, we started to see slow responses and crashes from our units. They appeared seemingly at random, although they mainly affected our busiest sites hosting some of our biggest customers.

These issues were a nightmare to debug since our logs at the time were held locally on units, which were slow and unresponsive. Because our system still provided some service, we couldn’t take units out of commission once they were in this state. Sometimes, the problems would be bad enough to cause crashes, but they returned in the same condition even after a restart.

Eventually, we tracked the issue to disk usage, then to the database, and then to one particular command. Our previous release added an inefficiency that searched through every event stored on a unit. The more time that passed, the worse the problem became until the units slowed down and crashed. It had survived all our testing but failed under load.

Load testing is one of the hardest areas to perform and debug. It requires dedicated tools to generate traffic, a realistic system to run against, and sufficient resources to load it heavily. Load testing tools have to be of high quality, too. If you are looking for a failure that only happens one time in a thousand, then you need your tools to be at least that reliable; otherwise, they will fail before your system does. If they fail earlier, you will end up searching through many spurious errors, or some classes of issues may not appear at all. For instance, if your test system leaks memory faster than the system under test, you may never uncover the real issue.

Load testing generates many ancillary problems, from disks filling with logs to email spam and database tables bloated with test data. Along with performing the loading, you will need scripts to clean up after yourself. This may mean rotating logs, clearing database entries, or restricting what actions the tests perform, for instance, preventing them from sending emails. All that is extra work, but it’s necessary to keep these tests running smoothly.

Debugging load tests is especially hard given the number of operations going on simultaneously. During load testing, the logs will be full of traffic, and multiple services will be in use simultaneously. Picking apart the causes of problems is difficult, especially as you search for issues that do not appear during lighter loading.

There is a major class of issues that you only hit when performing load testing. For instance, deleting a user before it has been fully created may leave the database in an invalid state. While a genuine issue, it will never be encountered by a real user unless they are making and deleting users within milliseconds of each other. That’s something load testing does, but no real user ever would.

You need to filter out that class of error. To do that, you can alter the load tests to work around system limitations, especially by slowing them down to more realistic speeds. Alternatively, you can trigger the issue but filter it out and ignore it, or, if the development team has time, actually fix the problem on the system. While that might sound like an obvious choice, fixing an issue that only testers will ever see is not a high priority. The team could work on other bugs and features that will have a more direct customer impact instead.

Load testing is often expensive, either in terms of the dedicated hardware it requires or the compute time on cloud services you have to pay for. That’s in addition to the time it takes to set up a run. You will need a dedicated environment to run load tests since they are antisocial and may impact other operations: you can’t rely on the results of other tests if load testing is running on the same system at the same time. That alternate system takes resources and time to set up.

Despite these weaknesses, load testing uncovers such critical bugs that it should be a crucial part of your test program. First, we will consider the technical requirements to carry out any load testing.

Prerequisites

Successfully performing load testing has several prerequisites beyond that of other forms of testing:

  • A dedicated test environment: Taking the system to its limits may cause other tests to fail, so they should be run elsewhere.
  • A complete test environment: To see realistic loading issues, you will need a complete, realistic system with the correct subsystems and resources in place.
  • A reliable test environment: You need networks without packet loss and systems without other recurring issues before you can reliably find loading issues.
  • A loaded test environment: Add at least as many entities as are present on your live system.
  • Load generation tools: To rapidly create entities and perform actions on your system:
    • This includes generating web requests or database load, simulating client connections, or running data processing tasks. For more details, see the Load test design section.

All these need to be in place to carry out the tests in this section.

Real-world example – The dummy hardware

In one video conferencing company I worked for, we had a dedicated test environment. However, the hardware for video processing was expensive and unnecessary for many of the tests. While we had some real hardware, others were just virtualized, providing the same functionality but capable of much lower performance.

That arrangement caused regular headaches, and every few months, a tester would complain about tests failing because they were running on the wrong hardware. Finally, we bit the bullet and purchased enough hardware so that the entire system could run realistic tests.

While load tests have more prerequisites than functional tests, you can still perform them early in the release cycle if you have sufficient warning and planning. Since load testing can take significant time and is a great way to flush out issues, it’s well worthwhile getting organized to run it as early as you can.

With a realistic, reliable system in place, you can start to plan your load testing, starting by identifying which operations produce load on your system.

Identifying load operations

A massive surge in traffic might be a good problem to have, but when someone famous retweets your brand or your latest advert goes viral, it is a problem you will face. How will your system perform? That could be your one chance to make a first impression on thousands of potential customers, so you want it to be a good one.

What are the key operations of your system? These are functions your application runs repeatedly and that might cause a burst of load. Users signing up, logging in, and logging out are three common examples. If administrators can manage other users, then the creation and deletion of accounts are two others. Otherwise, your load testing will depend on your business: the number of simultaneous games you can run on your servers or video calls, the data you can gather, the processing you can perform, the downloads you can sustain, and the page impressions you can render. The possibilities are many and varied.

Go back to your core use cases and break them down into stages. What if a thousand users performed them simultaneously? There will be a whole series of operations a user will go through, so pick out each step as a requirement for your load testing.

Real-world example – The halftime break

I worked for a company that manufactured hardware to deliver SMS messages, back before smartphones when texting was the main way people messaged each other. I was onsite in Athens to commission equipment as it went live when Greece reached the European Championship semi-finals. Our kit would instantly face a massive test.

I could see the SMS traffic across the country, in real time, for the network we were supplying. It built up to a peak before kickoff, then thankfully died down during the first half. Traffic plateaued again during halftime, and I braced myself for the final peak as the game ended. We had performed well, and I didn’t want to push our luck.

But the game went to extra time, prolonging my nervous wait even longer. After another dip, there was one last surge at the final whistle, which gradually died off into the evening as the Greeks celebrated their win. My hosts were doubly happy about the performance of their network and their team.

You must have tools that can stress each function of your application individually so that you can choose exactly how to combine them. You can plot curves of variables against each other. For example, your system may support 100 simultaneous downloads and 300 logged-in users, but can it do both simultaneously? Perhaps it can only manage 50 downloads and 100 live users. You don’t need to plot a detailed graph, but you’ll need to pick representative points to see how the variables affect each other.

It’s no use being able to support 100 downloads only if no users are on the system because, in practice, there will always be users. You may need to pick ratios for your loading variables – two logged-in users per download, for instance – so you can specify your system’s capability under realistic usage.
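
As an illustration, here is a minimal Python sketch of turning such a ratio into a concrete operation mix. The operation names are hypothetical placeholders; the point is that the ratio, not the absolute numbers, defines the schedule:

```python
from collections import Counter

# Maintain a fixed ratio of load types: two logged-in users per download
RATIO = {"login_session": 2, "download": 1}

def build_load_mix(total_operations):
    """Expand the ratio into a concrete schedule of operations."""
    weight = sum(RATIO.values())
    return [operation
            for operation, share in RATIO.items()
            for _ in range(total_operations * share // weight)]

print(Counter(build_load_mix(300)))  # 200 login sessions and 100 downloads
```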

System performance is often under-specified, as described in Chapter 2, Writing Great Feature Specifications. Perhaps no one has ever considered the maximum rate of sign-ups your application should be capable of. The product owner may not have a strong opinion, and the developers might not know what rate the system can support. In these cases, pick an agreed rate and test up to it. You’re aiming for a value far beyond practical usage and within current capabilities to keep the product owner and development team happy. Once you’ve picked that number and tested it, set your monitoring to warn you if you ever approach it in live use. You may be able to go beyond it, but that needs another, harsher test.

Dynamic versus static load

You can put two different forms of load on your system: large numbers of entities or significant rates of change. For instance, having a million users configured on your system applies one form of loading; creating one user per second involves the second form. For each of the core operations you have identified, test both a large static configuration and a rapid rate of usage.

You will need to look out for different styles of issues in various cases. With a sizeable static configuration, consider low-level processing such as the following:

  • Database query times
  • Data processing times
  • System resource use:
    • CPU
    • Memory
    • Disk
    • Internal resources such as file handles, database connections, addresses, and so on

Also, consider the effects on the application overall and frontend behavior:

  • Loading times of interfaces, especially those that filter entities from long lists:
    • APIs
    • Web pages
    • App screens
    • Downloads
  • User experience problems:
    • Drop-down lists that are unworkably long
    • Lists that make pages or screens slow to load
    • Lists that require searching or filtering

Problems with static load will be considered further in the Raising system limits section. However, when I refer to load testing in this chapter, I am usually not referring to a large steady-state load but instead to dynamic, rapidly repeating operations. The many possible checks for those tests will be described in the What to check during load testing section.

Soak testing versus spikes of load

For each loading operation, whatever that function is, you can load it with a spike of activity or apply sustained usage. The pass mark for a spike of activity within the system’s capabilities is that the system should be able to batch, record, and smooth the work so that it all gets done and the system returns to its previous state. For testing beyond those limits, see Chapter 13, Stress Testing.

Your application should specify what load it can sustain in terms of constant use and peak activity. For instance, your application might handle 5 sign-ups per second on average over a minute, with peaks of 10 sign-ups per second. If you had 10 sign-ups in one second and 0 the next, that averages to 5, and your system should successfully process them all.

You need two tests: one of the peaks your application can manage and one of the sustained rate it should maintain over the long term. Those ongoing tests are known as soak testing. They ensure the application doesn’t have any memory, CPU, or disk usage bottlenecks that mean it can’t keep up and will eventually fail over time.

That’s a very different test from ensuring sufficient buffers and queues are in place to store and process spikes in load, and you need to try both.
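
To make the difference concrete, here is a minimal Python sketch of pacing the same operation as either sustained or spiked load. The send_signup function is a hypothetical placeholder for whatever operation you are loading:

```python
import threading
import time

def send_signup():
    pass  # placeholder for one real operation, e.g., an HTTP sign-up request

def run_at_rate(operation, rate_per_second, duration_seconds):
    """Fire 'operation' at a steady, evenly paced rate."""
    interval = 1.0 / rate_per_second
    end = time.monotonic() + duration_seconds
    while time.monotonic() < end:
        started = time.monotonic()
        threading.Thread(target=operation).start()  # slow calls must not skew pacing
        time.sleep(max(0.0, interval - (time.monotonic() - started)))

# Soak test: 5 sign-ups per second, sustained for an hour
run_at_rate(send_signup, rate_per_second=5, duration_seconds=3600)

# Spike test: bursts of 10 sign-ups in one second, then a second of idle time -
# the same 5-per-second average, delivered as peaks
for _ in range(60):
    run_at_rate(send_signup, rate_per_second=10, duration_seconds=1)
    time.sleep(1)
```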

Loading combinations

You can perform load testing on many possible operations, but there are even more combinations. Can you sign up new users while many people log in and out? Can customers place new orders if there are lots of downloads in progress? The possibilities are endless, so you will have to select key functions that interact, such as those requiring database updates to check for contention or excessive disk access.

To find unexpected combinations, set up a randomized load test that runs different loading scripts in parallel with each other and checks system performance. I’m a great believer in searching for the unknown unknowns – the interactions that no one has thought of, not even you. Try things for the sake of trying them, even if there is no reason to suspect a bug. Of course, prioritize the higher-risk areas, but test it all. Just because the development team can’t imagine a bug doesn’t mean there isn’t one.
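
A minimal sketch of that idea in Python follows; the three loading scripts are hypothetical stand-ins for whichever operations your system supports:

```python
import random
import threading
import time

def load_logins(stop):
    while not stop.is_set():
        time.sleep(0.1)  # stand-in for logging a user in and out

def load_downloads(stop):
    while not stop.is_set():
        time.sleep(0.1)  # stand-in for starting and completing a download

def load_signups(stop):
    while not stop.is_set():
        time.sleep(0.1)  # stand-in for creating and deleting a user

SCRIPTS = [load_logins, load_downloads, load_signups]

def randomized_load(rounds=10, round_length=60):
    """Each round, run a random combination of loading scripts in parallel."""
    for _ in range(rounds):
        combo = random.sample(SCRIPTS, k=random.randint(1, len(SCRIPTS)))
        print("running:", [script.__name__ for script in combo])
        stop = threading.Event()
        threads = [threading.Thread(target=script, args=(stop,)) for script in combo]
        for thread in threads:
            thread.start()
        time.sleep(round_length)  # hold this combination, then try another
        stop.set()
        for thread in threads:
            thread.join()

randomized_load()
```

While each combination runs, your checks should confirm that every operation still succeeds within acceptable response times.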

Your system can exhibit emergent behaviors: complex patterns arising from simple entities interacting. In nature, these produce beautiful effects such as the symmetry of snowflakes or murmurations of starlings. In your application, their results are likely to be more pernicious, with operations that work individually interacting and blocking each other when combined under load. These can be very hard to predict, which is why this testing is so important.

Randomized load testing is an excellent example of checking for unexpected interactions. As noted in Chapter 6, White-Box Functional Testing, high transaction rates for one operation might prevent others from completing. So, mix and match loading combinations to see which affect each other, measuring the success and response times of all the operations as you perform them.

For all these cases, you will also have to decide what it means to pass a load test; see the What to check during load testing section for more details. To continue that example, the system may perform hundreds of downloads while users are logged in, but the downloads and signing in take 10 times as long as usual. Slowness is often a symptom of overload, and you will have to choose how slow is too slow along with the product owner.

Next, we will consider how to run load tests and the programs to generate this load.

Load test design

While some aspects of testing, such as exploratory and user experience testing, require manual steps, most testing should be automated, and some, such as load testing, absolutely require it. There are many tools available to produce these kinds of requests, such as LoadNinja or WebLOAD, so if they are suitable for your application, I recommend using one of those to get started quickly and see what is possible.

If your application requires other protocols or you want more control over the load test behavior, you can prepare your own scripts. While they may appear as simple as looping through some fixed behaviors, writing a good load script is deceptively complex, so check the plugins and extensions available in existing tools first. Many are also open source, so you can branch the code to customize them.

First, we will consider the case of a client-server architecture in which you are load testing the server, such as a web application or mobile application connecting to core infrastructure. If you want to write your own load runner, consider the following factors.

Load runner architecture for client-server applications

To test application servers, you need to mimic many clients performing realistic connections to them. These should be simplified versions of real clients capable of running multiple instances on a single machine, receiving commands to initiate different tests, and passing their results back to a central location.

To prepare your load test implementation, you’ll need to architect your code so that tests can run in parallel. In general, you want three layers in your architecture: performing individual actions, combining sequences of actions, and overall control over starting and stopping different arrangements, as shown here:

Layer               | Purpose
--------------------|------------------------------------------------------------------------
Test control        | The logic and intelligence for choosing which test sequences to run
Test sequences      | Sequences of actions, such as create-check-delete-check loops
Individual commands | Individual actions such as creating, checking, and deleting entities
Table 12.2 – Layers of load testing logic

For instance, one script might have actions such as creating and deleting users, which is the lowest level. The mid-level script constantly loops, creating users, checking for their existence, deleting them, and then checking their removal. That script takes variables such as usernames and delays between cycles. These lower two levels can be distributed across many remote machines to increase the load you can produce.

A top-level script sits above, running user creation for 5 minutes, for instance, then loading up to the maximum number of configured users and rerunning it.

By clearly separating the layers, you can easily swap out the method of user creation – for instance, using an API instead of a web page – while leaving all the other logic of your load test intact. You can alter the rate of the stress test and, most importantly, you can choose how to combine them without affecting how they are run.
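
Here is a minimal sketch of those three layers in Python. The user commands are in-memory stand-ins; in a real load runner, they would call your API or drive a client:

```python
import time

# Bottom layer - individual commands (in-memory stand-ins for real API calls)
_users = set()

def create_user(name):
    _users.add(name)        # stand-in for, say, POST /users

def user_exists(name):
    return name in _users   # stand-in for GET /users/<name>

def delete_user(name):
    _users.discard(name)    # stand-in for DELETE /users/<name>

# Middle layer - a test sequence: a create-check-delete-check loop
def user_churn(prefix, delay, keep_going):
    count = 0
    while keep_going():
        name = f"{prefix}-{count}"
        create_user(name)
        assert user_exists(name), f"{name} missing after creation"
        delete_user(name)
        assert not user_exists(name), f"{name} still present after deletion"
        count += 1
        time.sleep(delay)

# Top layer - test control: decide which sequences run, and for how long
def control():
    end = time.monotonic() + 300  # run user churn for 5 minutes
    user_churn("loadtest", delay=0.1, keep_going=lambda: time.monotonic() < end)

control()
```

Because only the bottom layer knows how users are created, swapping a web-page driver for an API call leaves the sequence and control layers untouched.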

Other load runner architectures

There are many other architectures and interfaces against which you can run load testing while still using those same three layers of test control, sequences, and individual commands:

  • Database read operations, performing queries on tables
  • Database write operations, adding and changing data
  • Mixed database operations
  • Batches of data processing such as generating statistics and machine learning

For each layer of your system, consider its inputs and how to take them to their limits, both in terms of the number of messages they receive and the volume of incoming data.

Load runner interfaces

Then, you need to choose what interfaces to use when loading your system. These fall into three classes:

  • Realistic clients such as web pages or applications
  • APIs or public, programmatic interfaces
  • Dedicated interfaces for debugging

Each option has strengths and weaknesses. Using genuine clients means your testing is as realistic as possible, mimicking the behavior of large numbers of users. On the downside, you face all the complexities and limitations of the client code. Any bugs or instabilities it suffers from will limit your testing. You are also testing against an interface that isn’t designed to be used by a program. Web pages may change without warning, forcing you to refactor your tests to keep them working, and you’re likely to suffer from inconsistent results.

APIs don’t suffer from that problem. They are published and guaranteed to be backward compatible between releases unless you are given fair warning about deprecated functions. This makes them far more reliable, although less realistic. You are also constrained by the commands and information available from the API. You have a unique use case because you are trying to load the application, so there may be significant omissions that limit your testing.

To get precisely what you need, you must design it yourself, with a dedicated interface for testing. That gives you control over exactly what commands and information are available, provided you have the time to add them to your program. You can also fix any bugs and extend the interface to meet your needs while guaranteeing backward compatibility.

On the downside, a dedicated interface is the least realistic. You may hit issues that only affect that route and which no customer would see, and you may miss the bugs customers encounter when using real clients and interfaces.

None of those options is perfect, so it can be best to use a combination to ensure you have some realistic testing and the control you need.

Load runner functions

Scripts to simply apply load to your application are not enough; you also need scale, visibility, and control.

For small applications, you may be able to generate the scale of load you need on a single machine, firing off requests at rates far higher than any practical usage by a single user or web browser. For larger applications, however, you will need a set of worker machines, guided by a controller, working together to load a given target. You will need different worker and controller scripts and a protocol to control their actions and get feedback on their results.
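
A minimal controller-and-worker sketch using Python’s multiprocessing module might look like the following; a real implementation would distribute the workers across machines with a network protocol instead of local queues:

```python
import random
import time
from multiprocessing import Process, Queue

def worker(commands, results):
    """One worker: execute commands until told to stop, reporting each result."""
    while True:
        command = commands.get()
        if command == "stop":
            break
        start = time.monotonic()
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for a real operation
        results.put(("success", time.monotonic() - start))

if __name__ == "__main__":
    commands, results = Queue(), Queue()
    workers = [Process(target=worker, args=(commands, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for _ in range(1000):          # controller: issue 1,000 operations
        commands.put("user_churn")
    for _ in workers:              # then tell every worker to stop
        commands.put("stop")
    for w in workers:
        w.join()
    timings = [results.get()[1] for _ in range(1000)]
    print(f"average latency: {sum(timings) / len(timings) * 1000:.1f} ms")
```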

Load test results are complex due to the volume of information they produce: timings and many possible outcomes from many actions simultaneously. For further details, see the What to check during load testing section. At this level, be aware that you’ll need to pass results up through your architecture, from the individual calls to the controller’s user interface. You’ll also need aggregation to summarize the outcome and easily flag anomalies. With that data, you can generate all sorts of graphs and charts to visualize your results.

Once you have that visibility, you need to act on it, not least by stopping your load tests when they hit a problem. That can be as simple as halting a script, but in a distributed system, you’ll need to pass that command down to other processes or machines and ideally have them clean up their current actions before exiting cleanly. In more complex implementations, you’ll want the ability to start, stop, and alter different types of load testing in real time, and produce new combinations. Again, that requires messages and functions in the scripts running the load.

Next, we will consider a specific type of load testing: raising system limits.

Raising system limits

Load testing is required for one specific form of feature enhancement: raising application limits. Your product currently supports 50 simultaneous users; if that increased to 100, would it still work? The first test you need to do is load the system up to that level to check for internal limits.

The complexity of these changes is often underestimated because testers perform most of the work rather than developers. Usually, features require more time from developers than testers because designing and implementing a feature takes longer than testing it. That means companies typically have more developers than testers to reflect that difference. However, when raising an internal system limit, the development work may be as easy as changing a single number, from 50 to 100, for instance. On the other hand, the test work may involve developing new tools to reach that higher limit and, once reached, running an extensive test plan.

In a complex system, there can easily be resource limitations that make seemingly simple changes harder to implement.

Real-world example – Out of addresses

In one company where I worked, our clients constantly communicated with our cloud. We could only support so many connections, but we needed to raise that value as we grew. The change was as simple as increasing a static limit, but each connection also required an internal address. We increased the configured limit but couldn’t reach it in practice because we ran out of addresses.

The first test when raising system limits is to ensure that the whole system can support increasing that number with no hidden restrictions. Are new entities created successfully up to the new limit, and are they fully functional? Maybe you can create that many new users, but you have run out of file handles, so they can’t upload profile pictures. Whatever entity you support more of, thoroughly test it when fully loaded.

Then, design a test plan to check the system behavior at the new limit. For a central function, that may need to be an extensive test – for instance, increasing the number of users can have ramifications across the product: on any page that displays users, any database searches based on the users table, and anywhere user lists are searched or filtered. Those may not be obvious, so you’ll need to run a detailed regression test plan.

Look out for user interface effects. As described in the Dynamic versus static load section, high levels of static load can mean lists become so long they are slow to display, hard to filter, and challenging to search. You’ll need pagination and searching to deal with that and indexing on database tables to ensure timely retrieval.

A special case of increasing system limits is raising internal limits, such as the number of threads performing some processing or the number of database connections. While these don’t have effects directly visible to customers, they can have consequences across your system, so the key is always to hit the new limits at the same time as realistic levels of other types of load and look for issues.

The checks you run when performing load testing are complex enough to deserve special attention, as described in the next section.

What to check during load testing

You should watch for monitoring alarms during all your tests, but especially during load testing, which is designed to uncover system issues. If there are memory leaks or leaks of other resources, this is the test to find them. Load testing has to be performed by automated scripts, but writing the checks is at least as much work as generating the load. A single command to change the system’s state may need many tests to verify it. Write a generic check function that you can expand for whatever tests you are performing, and use your system monitoring; see Chapter 10, Maintainability.

At the most basic level, you can run load testing and check for any catastrophic events – the application crashing or unhandled exceptions. The next level of checking is verifying that each operation is successful. For every user creation command, for instance, check that a user exists. You should also routinely check the logs for error messages. For that to be effective, you’ll need to purge the logs of spurious errors to let you see the real ones, again, as discussed in Chapter 10, Maintainability.

This makes load testing sound perfect and pristine – that the system behaves impeccably until a fault causes an operation to fail and an error to appear. In practice, unfortunately, load testing is far messier, with a far lower signal-to-noise ratio. You’ll see many temporary errors that work when retried or slow operations that are borderline unacceptable. Sifting through them is so complex that it deserves its own discussion, as presented in the Filtering load test errors section.

With those basic checks in place, and given that you can distinguish real issues, you can plan sweeping tests of system metrics. You’ll need to adapt these to your product, but check this list for each machine and subsystem in your service; a monitoring sketch follows the list:

  • CPU:
    • High sustained levels
    • Spikes of usage
  • Disk:
    • High usage
    • High rates of increase
  • Memory:
    • High usage
    • High rates of increase
  • System resources:
    • Handles
    • Addresses
    • Database connections
  • Rate of errors
  • Packet loss
  • Latency on operations
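
Here is that monitoring sketch, using the third-party psutil library (one assumption: it is installed via pip install psutil). It polls several of the metrics above and warns on some of the patterns discussed next:

```python
import time

import psutil  # third-party: pip install psutil

def monitor(duration_seconds=600, interval=5):
    """Snapshot key system metrics while a load test runs."""
    end = time.monotonic() + duration_seconds
    while time.monotonic() < end:
        cpu = psutil.cpu_percent(interval=interval)  # averaged over the interval
        memory = psutil.virtual_memory().percent
        disk = psutil.disk_usage("/").percent
        print(f"cpu={cpu:5.1f}%  memory={memory:5.1f}%  disk={disk:5.1f}%")
        if cpu > 95.0:
            print("warning: CPU plateau - other processing may be delayed")
        if memory > 80.0 or disk > 80.0:
            print("warning: high memory/disk - check for leaks or log growth")

monitor()
```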

Each of these has a distinctive pattern that indicates issues. For CPU usage, you are looking for plateaus during which the CPU load is maxed out, showing that other processing may be delayed. Spikes in CPU load also indicate temporary issues. Measuring CPU usage is complex and dependent on memory latency, so in practice, it can be more helpful to measure how much CPU time is free. Dips where this approaches zero indicate overloading.

For disk usage, you are looking for unexpectedly high use, sudden jumps in usage, or continuously increasing usage. If logs and records aren’t regularly cleaned up or moved off critical servers, this is the test that will highlight it.

Memory leaks

Look for similar patterns in memory usage: regular increases or the tell-tale sawtooth wave, which indicates crashes due to memory being exhausted:

Figure 12.1 – A sawtooth pattern of memory usage indicating a memory leak

In this case, the memory usage increases linearly over time, and each drop in the sawtooth wave represents a crash, which recovers application memory. Those crashes are likely to be catastrophic and noticeable for stateful machines, but for stateless machines processing temporary transactions, you may need dedicated monitoring to detect them.

Real-world example – The countdown to Christmas lunch

It was a week before Christmas, and the office was relaxing. We had shipped our last major release of the year and booked a swanky local restaurant for our Christmas lunch. Then, a warning light appeared on our monitoring system.

It was nothing urgent, just a high-memory warning from one of the busiest nodes in our cloud. We ran a private cloud with servers we maintained, so this was our responsibility. Was the new release using more memory? Looking at the graph, we saw the inexorable upward trend of leaking memory. The warning had appeared because it had crossed the 80% usage threshold and would only keep rising. Checking the gradient, we even knew when it would crash: in a couple of hours, right in the middle of Christmas lunch.

We could restart it before it crashed, but that outage would be the same length, and if you were going to suffer downtime, lunch was the best time since it was quietest. We had load-tested some operations before release but missed the one with this error. The development team set about investigating while we prepared to check its recovery and answer the inevitable support queries. Christmas was canceled.

Some memory leaks are in background tasks that run regularly and produce an obvious pattern; others only happen under specific circumstances. It can be difficult to separate legitimate increases in memory use due to storing more information from bugs that use memory but never free it. Load testing helps uncover those issues by performing so many operations that any problems become apparent. However, you have to cover a live system’s full range of functions, or you risk missing the trigger of a memory leak. If you only perform an operation once as part of functional testing, the change in memory usage will be imperceptible. This class of issue requires load testing.

System resource leaks

A similar pattern to a memory leak can also affect internal system resources. If there is a finite pool of addresses, handles, connections, or IDs designed to be reused, then neglecting to recover them will result in a leak, eventually leading to your system running out and failing in some way.

Real-world example – The last database connection

In one company I worked at, we hit an outage due to running out of database connections. The only symptom was sporadic failures because we had no monitoring in place, so as well as increasing the connection limit, we also began to record the number of connections.

Months later, we hit the limit again, and again the monitoring failed. Why hadn’t it reported being out of connections?

The reason was our check to read the number of database connections also needed to connect to the database. When the system had run out of connections, our monitoring could no longer reach the database to tell us.

If system resources aren’t correctly reused, then one day, you will run out. That might be longer than the age of the universe, depending on the number and your rate of usage, or it may surprise you much sooner than that. Try to identify all those resources and IDs, since adding checks and warnings for them is trivial. As a fallback, extensive load testing should burn through resources far faster than live usage, so that will flush out any issues.
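
As an example of how trivial such a check can be, here is a sketch; the numbers are hypothetical, and, as the example above shows, make sure the check itself doesn’t consume the resource it monitors:

```python
def check_resource(name, used, limit, warn_fraction=0.8):
    """Warn long before a finite resource pool runs out."""
    if used >= limit * warn_fraction:
        print(f"warning: {name} at {used}/{limit} ({used / limit:.0%} used)")

# Hypothetical example: a pool of 100 database connections with 85 in use
check_resource("database connections", used=85, limit=100)
```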

Reporting results

The trick to reporting load testing results is aggregation. When performing so many operations across many parts of the system simultaneously, you are generating a tremendous amount of information. Unlike functional tests, where you can examine each feature and its behavior individually, during load testing, you need to summarize the overall results. You will need to measure all the system metrics, as described in the What to check during load testing section; then, in addition, the results of the load operations themselves. The key measures are as follows:

  • How many operations were performed
  • Inbound and outbound data rates
  • How many successes and failures there were
  • Minimum, maximum, average, and standard deviation of operation times
  • Summaries of failures and excessive latency
  • Any errors and warnings that were generated

These measures can be taken at each layer of your system, such as from load balancers, core servers, and databases.

First, you need to record how many load operations you performed and their success rate. Recall Chapter 5, Black-Box Functional Testing, and the different methods of checking API call results. Ideally, you need a separate check on a different interface to verify that an operation was successful. In addition to measuring the success, record the time each operation takes so that you can look for delays. All these results should be fed back to the load runner interface so that they can be summarized and graphed, with unexpected results highlighted. For example, a summary might contain the following data:

Test ID             | 1
Time                | 20.43 minutes
Requests            | 103,472
Successes           | 103,244
Failures            | 228
Success rate        | 99.78%
Minimum latency     | 10.3 ms
Average latency     | 15.98 ms
Maximum latency     | 302.44 ms
Inbound data total  | 10.3 GB
Outbound data total | 30.4 GB

Table 12.3 – Example load test result output
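
A minimal aggregation sketch in Python follows, producing the key measures above from raw per-operation results; the input format, a list of (success, latency) pairs, is an assumption for illustration:

```python
import statistics

def summarize(results):
    """Aggregate raw (success, latency_ms) pairs into a load test summary."""
    latencies = [latency for _, latency in results]
    successes = sum(1 for ok, _ in results if ok)
    return {
        "requests": len(results),
        "successes": successes,
        "failures": len(results) - successes,
        "success rate": f"{successes / len(results):.2%}",
        "minimum latency": f"{min(latencies):.2f} ms",
        "average latency": f"{statistics.mean(latencies):.2f} ms",
        "maximum latency": f"{max(latencies):.2f} ms",
        "latency stdev": f"{statistics.stdev(latencies):.2f} ms",
    }

# Example with three fake results, as (success, latency in milliseconds)
print(summarize([(True, 10.3), (True, 15.2), (False, 302.4)]))
```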

Visualizing load test results is vital to identifying trends and highlighting issues. In this example, the maximum latency (302.44 ms) is significantly larger than the average (15.98 ms). There were over 200 failures, but you don’t know when they occurred. Were they grouped together or evenly spread? By plotting the results over time, you can investigate further. Perhaps there was one period of high latency that then recovered, in which case you need to check the logs at that time:

Figure 12.2 – Example load testing output with a period of high latency

Or perhaps the results were getting steadily worse over time, indicating you should run your test for longer and examine the system resources being used:

Figure 12.3 – Example load testing output with increasing latency

Only by visualizing your results can you see the pattern of failures, and so infer their underlying causes. If all you have are summaries or, worse yet, reams of unaggregated logs, you won’t be able to debug effectively.

Finally, you need an easy way to check for any errors or warnings. Running a system under load is likely to generate many of these, so you need to filter out the expected problems, leaving only the new and interesting ones to be investigated. Do the failures correlate with the period of high latency, or are they separate, indicating that they’re due to different causes?

Examining load testing results can be the most time-consuming part of the whole process, so it’s well worthwhile polishing these interfaces and making them as painless as possible. It takes care and attention, because problems in load testing are often far from obvious, as described next.

Defect hiding

Load testing uncovers some of the toughest bugs to find and fix, especially because one issue can obscure others. On the user interface, in contrast, you can see multiple problems simultaneously, but that’s not always the case with loading. If your application crashes after 3 days of loading due to a memory leak, that will hide the fact that it also crashes after 4 days due to an ID rolling over. You will need to test, investigate, fix, and release a new build for the first issue before you can start to look for subsequent problems.

Because of this, loading results are often on a measured scale rather than passing or failing. You can measure your Mean Time Between Failures (MTBF) while running under load, which averages the time before a malfunction for any reason. You can then convert the MTBF into a pass or fail result – to pass, load testing must run successfully for more than a week, for instance.
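
The arithmetic is simple, as this sketch shows with hypothetical numbers:

```python
def mtbf(total_runtime_hours, failure_count):
    """Mean Time Between Failures: average running time between malfunctions."""
    return total_runtime_hours / failure_count

# A two-week load run (336 hours) with two crashes gives an MTBF of 168 hours,
# exactly on a pass mark of one week - a marginal pass
assert mtbf(336, 2) >= 168
```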

Real-world example – Checking the crash frequency

One of the most challenging bugs I ever encountered stemmed from a known issue. On a hardware platform for video conferencing, we suffered from a known but infrequent crash. Even under loading conditions, taking millions of calls, it took a month or more to hit. No customer was likely to see it, and almost certainly not twice. It was a known issue, very difficult to diagnose, and we lived with it.

Coming up to the end of a long waterfall development cycle, we noticed that this crash wasn’t happening every month anymore; now, it was happening every few days. It still required loading conditions, but it was much more frequent. It had gone unnoticed for months because it was a known issue that we routinely ignored.

Suddenly, we needed a fix for a very difficult issue, which added a long delay to the end of an already long project because we hadn’t spotted it earlier.

Another measure is the rate of failures you see. In a simple, synchronous program that executes from beginning to end, you can expect the same result every time you run it with the same parameters. However, very few systems are so simple. There will typically be multiple different systems interacting, often sending messages to one another and performing other processing while awaiting the result. These asynchronous systems are far more efficient and scalable but introduce new classes of issues such as race conditions.

Again, rather than a simple pass or fail result, you may end up with a failure rate: say, one in a million transactions fails. Along with the product manager, you’ll have to decide what rate is acceptable and counts as a pass. Load-testing bugs can be challenging to find and fix, so the effort to resolve those issues must be weighed against other tasks the development team could be doing. No application is perfect; if you think yours is, that only shows you haven’t tested it enough, because thorough testing will always find a bug. The sign of a great product isn’t having no bugs; it’s having only small bugs that you understand.

Asynchronous processing and race conditions require special consideration and are described in the next section.

Race conditions and asynchronous systems

Unlike synchronous applications, which execute from beginning to end deterministically based on their inputs, asynchronous applications depend on independent systems. Those may be external third parties with which you share information or to which you send commands, or separate parts of your own implementation.

Testing these interactions requires different approaches to find another class of bug. Consider an asynchronous application that sends requests to two different external systems and then waits for their responses:

Figure 12.4 – Sending messages to external systems and receiving replies in order B then C

Application A has a bug and relies on responses coming from Application B before Application C. Generally, that is the case, and Application B processes the queries faster and returns its responses first. However, if Application B is ever delayed, Application C will return first and trigger the bug:

Figure 12.5 – Sending messages to external systems and receiving replies in order C then B

This is a race condition, a particularly nasty class of bugs to reproduce and isolate. After all, Application A does the same thing in both cases, but the outcome is very different.

Load testing is one way to trigger these issues. If Application B is delayed one time in a million, then running tests a million times would eventually hit the bug. That is not a very elegant solution. The other way to trigger these cases is to identify the commands and external calls and deliberately add delays to alter the order of responses. That is helpful testing, although it requires debugging functions in the external applications to introduce latency, as described in Chapter 11, Destructive Testing.

However, that testing requires a complete list of external messages, knowing which messages might interact, and the ability to trigger delays. Load testing will find which interactions matter in practice and flush out unknown unknowns that no one had thought of.
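
To illustrate delay injection, here is a minimal asyncio sketch of the scenario in Figures 12.4 and 12.5; the timings are arbitrary, and in a real test, the delay would be injected into Application B itself:

```python
import asyncio

async def query_b(extra_delay=0.0):
    await asyncio.sleep(0.01 + extra_delay)  # B normally responds quickly
    return "B"

async def query_c():
    await asyncio.sleep(0.05)                # C is normally slower
    return "C"

async def application_a(b_delay):
    tasks = [asyncio.create_task(query_b(b_delay)),
             asyncio.create_task(query_c())]
    # Collect responses in arrival order, as an asynchronous system would
    order = [await task for task in asyncio.as_completed(tasks)]
    print("responses arrived in order:", order)

asyncio.run(application_a(b_delay=0.0))  # usual case: B before C
asyncio.run(application_a(b_delay=0.1))  # injected delay: C before B - the race
```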

In the next section, we will consider a particular source of loading on the system – when starting up and shutting down.

Startup and shutdown

One vital test for your application is how well it starts up and shuts down. While your system might work well running in a steady state, how long does it take to get there, and can it start quickly when under load? For an operations team, the ability to restart a system is an emergency fallback to recover from unknown states, so it has to work. Worse, applications might crash at any time, and you need to know they can recover without manual intervention, which would make outages stretch into minutes or hours instead of seconds.

For each release, check how long your system takes to start up. Have any inefficiencies or extra processes been introduced? As with many gray areas, you’ll have to work with the product owner to decide the time limit. There are two cases to consider – one where your whole application or one subsystem is restarting without any load. The other is to restart one part of your system while it is loaded. Measure the performance in both cases. Recovering from outages is described further in Chapter 11, Destructive Testing, but it is also part of load testing to restart your system with ongoing operations.

Another type of startup is when enabling a new feature. The whole system might run happily in a steady state, but starting a function with wide-ranging effects can require significant resources. This can be harder to identify – most features won’t cause problems, so you’ll need to watch for ones that will.

Real-world example – Turning on reading conferences from Outlook

In one company I worked at, we had a scheduler that let people create meetings in Outlook. As a new feature, we would also read their Outlook calendars and display those meetings in our application. It was a complex feature, requiring us to reconcile meetings created in Outlook with those we’d made, but it finally passed the testing and we turned it on for live users.

Immediately, there were issues as the system tried to load meetings from users across our system. That load resulted in us writing massive logs, which slowed the application so much that requests started to fail. Those requests immediately retried with no delay, causing even more load, logging, and failures.

A feature that had worked perfectly well in our small test environment was not ready for the numbers present on our live cloud.

For features that add significant load to the system, ensure you test them with high operation rates and large static configurations, such as large numbers of users. What happens if many of its transactions fail? Also, check the retry behavior to ensure retries are properly implemented, and look for excessive messaging or logging, which could result in cascades of defects. See Chapter 7, Testing of Error Cases, for more on that.
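
Properly implemented retries back off rather than immediately hammering an already struggling system; here is a minimal sketch, with the retried operation left as a hypothetical placeholder:

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=0.5):
    """Retry a failing operation with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the failure propagate
            # 0.5s, 1s, 2s, 4s... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Usage, with a hypothetical operation:
# with_retries(lambda: read_calendar(user))
```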

Next, we will look at coding inefficiencies that load testing can flush out.

Loading inefficiencies

It’s very easy to write code that works well but scales badly. Load testing is your chance to expose those issues by running new features with the heaviest loads your system is designed to sustain. Recall the example of the database query, which read all events since the beginning of time, gradually slowing the system down as time went on. There are many other examples.

Real-world example – 80 participants hang up

In one video company I worked at, we increased the maximum size of our meeting from 40 to 80 users. It was a massive project requiring changes throughout the system, but we finally got it running and were delighted to see so many participants connect successfully. There were huge congratulations all around; then, we finished the meeting, and all started to hang up.

However, that hanging up took a strangely long time. Panes flickered back and forth, and commands became unresponsive, but finally, after several confusing seconds, everyone disconnected.

Our algorithm for working out who should be shown in a conference didn’t scale well, it turned out. When participant number 80 hung up, the system recalculated where the other 79 panes should appear for the remaining 79 participants. Then, it calculated 78 panes for 78 participants, then 77, and so on. It only processed further commands after it worked out what it should display. Since people generally hang up very quickly at the end of a meeting, calculations that had been fine with 40 participants became noticeably slow with 80.

Look out for n-squared relationships between the number of entities and the processing required, as when calculating conference panes in the preceding example. A good code review can find these coding inefficiencies, but they aren’t easy to spot, and testing is the surest way to catch them. The steps you need were listed previously – identifying the entities in your system, taking them to their maximum values, and then performing a range of normal operations to ensure you still get good response times.
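
The conference example suggests the shape of the fix, sketched here with a stub layout function: defer and batch the expensive recalculation instead of repeating it for every change:

```python
def recalculate_panes(participants):
    pass  # stand-in for a layout calculation costing O(len(participants))

# O(n^2) overall: every hang-up triggers a recalculation for everyone still
# connected - 79 + 78 + ... + 1 pane updates as an 80-person call drains
def drain_quadratic(participants):
    while participants:
        participants.pop()
        recalculate_panes(participants)

# O(n) overall: process the burst of hang-ups, then recalculate once
def drain_batched(participants):
    participants.clear()
    recalculate_panes(participants)

drain_quadratic(list(range(80)))
drain_batched(list(range(80)))
```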

In the next section, we will look at messages between modules and the limitations they face under loading conditions.

Loading messages between modules

Load testing involves firing many messages at an application’s external interfaces, but internal messaging will also have high usage when the system is busy. Another class of issues arises from overloading those communication flows.

Do your internal messages have suitable queues, retries, back-offs, and failure modes? This is another class of issue best probed with load testing to ensure there are no hidden bottlenecks within your system.

This section is hard to describe because both the causes and effects are indirect and can be difficult to trigger with system testing. You will have to check with the development team which operations will cause the highest rate of internal messaging within your system; this is an area where you need white-box insights. It won’t be clear what symptoms internal message failures might produce. Operations could time out or fail; the system might be left in an inconsistent state, with some processes believing a task was complete and others believing it had failed. Fixes for these failures include adding queues or batching messages together to avoid high message rates arriving at the destination.
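
As one illustration, here is a minimal batching sketch in Python; the queue is local, standing in for whatever transport your modules use:

```python
import queue
import threading
import time

def batching_sender(messages, send_batch, max_batch=50, max_wait=0.1):
    """Drain a message queue in batches, so bursts arrive at the destination
    as a few large messages rather than a flood of small ones."""
    while True:
        batch = [messages.get()]  # block until the first message arrives
        deadline = time.monotonic() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(messages.get(timeout=remaining))
            except queue.Empty:
                break
        send_batch(batch)

messages = queue.Queue()
threading.Thread(target=batching_sender,
                 args=(messages, lambda b: print(f"sent {len(b)} messages")),
                 daemon=True).start()
for i in range(200):  # a burst of 200 messages becomes a handful of batches
    messages.put(i)
time.sleep(0.5)
```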

This is one area where the tests and their effects depend very much on your particular application. There are essential bugs to find, but you will have to work with your developers on how to exercise them and check the results.

Next, we will consider the performance of our system, measuring its current baseline so that it’s ready to be compared to future releases.

Performance testing

Is your application slower than the last release? Some services, such as web servers, have relatively low resource requirements and are unlikely to be constrained under normal circumstances. Other applications will hit the limits of the available network, disk access, memory, or processing. You will see symptoms such as increased latency on operations or rising failure rates, which indicate that your system cannot maintain this level of activity.

Programs tend to become larger and more complex over time, which always carries the risk of slowing them down. Whatever limits your system is hitting, part of testing is to ensure that this release doesn’t accidentally have lower performance than the last.

Real-world example – The accidental load test

In a company that provided SMS text message infrastructure, I was onsite to perform user acceptance testing with a large customer. We worked through the test plan, successfully demonstrating all the functionality such as read receipts, redirects, and statistics monitoring.

After finishing one test, I noticed an anomaly: the system reported it was processing 50 messages per second when it was supposed to be idle. I double-checked our loading program, but nothing was running. Where was the load coming from?

We investigated but couldn’t see the source of so many messages, so we decided to repeat the previous tests. Sure enough, partway through, the load on the system jumped to almost 100 messages per second. It turned out we had configured a loop in our redirects. Two systems diverted the same message back and forth as fast as they could, with no protection to stop it. Sending a second message into the loop had doubled the problem. We quickly removed the redirect and added a feature request to prevent it in the future.

Identifying bottlenecks

When you encounter stress or load test failures, it is likely to be because a single subsystem is overloaded. A chain is only as strong as its weakest link, as the saying goes, and your application is only as fast as its slowest component. When you start to hit performance limits such as high latency or failed operations, you need to track down the source of the issue.

This is a great test for your monitoring and logging, as described in Chapter 10, Maintainability. How easy is it to isolate an error and identify the source of the problem, among thousands of operations? That should indicate the subsystem that triggered the problem.

To diagnose the issue, examine those logs and the metrics you measure for that subsystem. Is it running out of CPU or memory, or program resources such as processes or threads? Are queues becoming full, or are operations timing out?

When it is unclear where a problem originates, you can experiment by increasing system resources and rerunning your test. That is especially the case when using cloud computing services where it is easier to assign more resources.

To resolve bottlenecks, you can either expand vertically, by giving more resources to the same applications, or horizontally, by deploying more subsystems that can be used in parallel. Scaling vertically is usually easier to start with as it changes only one aspect of the system, but there is a limit to how fast one subsystem will be able to run. For a longer-term solution, you may need to re-architect your system to be able to use multiple subsystems simultaneously.

Load tests in the release cycle

The critical tests here are a set of standard operations you can run on each release – possibly you process the same file or perform the same task – and record the system resources and time it takes. That sets a baseline to compare against future releases. Your testing is a comparison: if this release requires more resources or time than your baseline, that is a failure until the product owner and development team agree to the new behavior.

A short, sharp load test can be added as part of your CI tests to check the performance after every change. It is very easy to add a database query or function call that is a huge drain on system resources but works fine in limited test environments and only shows issues on realistic data rates or database sizes. Once you have these tests set up, make sure they are run regularly.
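
Such a CI check can be as simple as timing a fixed task against a stored baseline; here is a sketch, where the workload function and baseline figure are hypothetical:

```python
import time

BASELINE_SECONDS = 4.2  # measured on the previous release
TOLERANCE = 1.10        # fail if more than 10% slower than the baseline

def process_standard_file():
    time.sleep(0.1)  # stand-in for the fixed, repeatable workload

def test_standard_task_performance():
    """Run the same standard task each release and compare to the baseline."""
    start = time.monotonic()
    process_standard_file()
    elapsed = time.monotonic() - start
    assert elapsed <= BASELINE_SECONDS * TOLERANCE, (
        f"performance regression: {elapsed:.2f}s against a "
        f"baseline of {BASELINE_SECONDS}s")

test_standard_task_performance()
```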

For real-time systems, even a tiny delay can be a critical bug. If you take 1.1 seconds to process 1 second’s worth of video, for instance, your video calls won’t go well. For systems with less time pressure, you can decide what performance you are happy with and set that as your new baseline.

So far, we have described load testing as simple and obvious, with clear passes, failures, and questions that need to be decided. However, actual results from load testing can be far more complex and require skill and experience to read.

Filtering load test errors

Load testing is messy. This is the most challenging form of testing, requiring all functions and logging to work reliably, as well as a stable, robust system from which to run tests. If functional testing is a scalpel carefully probing your application, load testing is a rugby tackle, using brute force to take it down. When you run load testing, many operations coincide, and the system runs in ways it never usually does. While this may trigger genuine issues, you’ll also hit a large class of problems that are only ever seen under loading conditions. These aren’t useful to find or fix, so the best option is often to work around them.

Real-world example – Users left behind after load testing

When we started load testing in one company I worked at, one of our first tests was to load user creation. We ran a simple loop that created and deleted thousands of users and instantly hit a problem: a small percentage of delete operations failed, leaving users that had been created but not removed. We happily reported the issue we’d found.

It turned out to be a race condition – not in our product, but in our loading script. In our distributed system, it took some time to create the user, so occasionally, we deleted the user before it was fully created. The delete operation failed, then the creation operation finished, leaving the user in place. You would only hit that bug if you deleted users milliseconds after creating them, so while you might hit that problem in load testing, no user would ever see it.
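One possible workaround is to poll until the creation is visible before issuing the delete. This sketch uses a simulated client standing in for a real API, with illustrative delays and timeouts:

    import threading
    import time

    class SimulatedClient:
        """Stand-in for a real API client where creation completes asynchronously."""
        def __init__(self):
            self.users = set()

        def create_user(self, name):
            # Simulate distributed creation finishing a little later
            threading.Timer(0.2, self.users.add, args=[name]).start()

        def user_exists(self, name):
            return name in self.users

        def delete_user(self, name):
            self.users.remove(name)

    def create_then_delete(client, name, timeout_s=5.0):
        """Create a user, wait until it is fully visible, then delete it."""
        client.create_user(name)
        deadline = time.time() + timeout_s
        while not client.user_exists(name):   # poll until creation completes
            if time.time() > deadline:
                raise TimeoutError(f"user {name} never appeared")
            time.sleep(0.05)
        client.delete_user(name)

    create_then_delete(SimulatedClient(), "load-test-user-1")

Waiting for creation to complete makes the script behave like a real user, who would never delete an account milliseconds after requesting it.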

There are several classes of spurious errors you are likely to hit when driving your system so hard. You need to be aware of each of these so that you can work around them and find real issues instead:

  • Problems driving the interface
  • Problems with reporting and logging
  • Problems with your load application
  • Unrealistic operations

As described in the Load runner interfaces section, there are several ways to create load on your system, from user-facing screens and pages to dedicated debugging interfaces. Any problems with the debug interface are not interesting to customers and only affect your tests. While they might be a problem for load testing, they are a low priority compared to issues that impact your users. Load testing real pages can also uncover spurious problems, which only occur when you drive them programmatically rather than as a real user would.

For these issues, and many in this section, the best short-term solution is to work around them. It might not be worth the development team’s time to fix them compared to other work, as no user will see these problems. Change your load tests to avoid the failure and keep going; when there is a lull in development, these interfaces can be improved. Often, the solution is adding a delay to make the load scenario more like actual usage.
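As an illustration of that pacing, a load loop might space out its operations with randomized gaps; the timings here are placeholder values to tune for your own users:

    import random
    import time

    def paced(operations, min_gap_s=0.5, max_gap_s=2.0):
        """Run operations with randomized gaps to mimic real user pacing."""
        for operation in operations:
            operation()
            time.sleep(random.uniform(min_gap_s, max_gap_s))

    # Example: three placeholder operations run with human-like spacing
    paced([lambda n=n: print("step", n) for n in range(3)])

Randomized gaps avoid the artificial lockstep patterns that tight loops create, while still letting you control the overall rate.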

Next, you may hit problems with reporting or logging. Log files may be too large or not rotated correctly, APIs may not be able to handle the volume of requests, disk contention might cause failures, or there may be other similar problems. Again, these issues are aspects of the load test rather than anything a customer is likely to see. APIs should meet their specified performance goals, but those may be significantly lower than the rates you want to run loading at. To work around this, you can change the method of obtaining load results, such as by writing a dedicated results file and turning down the logging verbosity so that less is written.

Result sensitivity

There can be bugs in your test code, in the script that applies the load. Does it have sensible timeouts set on messages? Do you want retries or to report every failure? Does it handle errors from the application correctly?

There is a trade-off between the resilience of your tests and their sensitivity. If an operation fails but works when retried, do you count that as a success or a failure? You will need to get to know your system and how reliable it is. If there is a known issue either in your code or the system as a whole, be ready to filter out those issues. The risk is that a new, important issue manifests in the same way, and you fail to find it.

Unfortunately, there are no easy answers here, so you need to pay close attention and regularly tune your filters. Accepting retries for operations will make your error reporting quieter and let you see new issues clearly, at the cost of possibly missing some errors. Following up on every failure lets you see everything, but can be a huge task.
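One approach, sketched below, is to retry failed operations but record how many attempts each one needed, so retried successes remain visible in your results rather than being silently absorbed. The attempt limit and failure rate are arbitrary examples:

    import random

    def run_with_retries(operation, max_attempts=3):
        """Retry a flaky operation, reporting how many attempts it needed."""
        for attempt in range(1, max_attempts + 1):
            try:
                operation()
                return True, attempt
            except Exception:
                pass
        return False, max_attempts

    # Example: a stand-in operation that fails 30% of the time
    def flaky():
        if random.random() < 0.3:
            raise RuntimeError("simulated failure")

    ok, attempts = run_with_retries(flaky)
    print("succeeded" if ok else "failed", "after", attempts, "attempt(s)")

Counting attempts lets you distinguish clean passes from retried successes, so a rise in retries can still alert you to a new problem even when every operation eventually succeeds.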

Loading reliability

Your loading tools have to be highly reliable. If your scripts fail every hundredth transaction, that will drown out any real failures that happen every thousandth transaction. In general, load scripts are much simpler than the main application, but if you use tools, such as clients to connect to your servers, then you rely on the stability of those clients. You have to fix issues there before you can load your infrastructure properly.

It’s very easy to set a load test running for the weekend, only for it to fail 20 minutes after you leave. These scripts are highly likely to see timeouts and errors, so you need to build in resilience, ideally with a system that will restart the main process if it stops. You will also need persistent storage so that a load test can pick up where it left off rather than relying on state held only in memory.
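A minimal sketch of that persistence, assuming a simple numbered list of operations and a JSON checkpoint file (both illustrative choices), might look like this:

    import json
    import os

    CHECKPOINT_FILE = "load_test_progress.json"

    def load_checkpoint():
        """Resume from the last saved position, or start from zero."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)["next_index"]
        return 0

    def save_checkpoint(next_index):
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"next_index": next_index}, f)

    def run_load_test(operations):
        """Run each operation, persisting progress so a restart can resume."""
        for i in range(load_checkpoint(), len(operations)):
            operations[i]()
            save_checkpoint(i + 1)

    run_load_test([lambda n=n: print("operation", n) for n in range(5)])

In practice, you might checkpoint less frequently to reduce disk writes, trading a little repeated work after a restart for lower overhead while running.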

Load testing, by definition, is unrealistic. No one would be able to perform the rate of transactions you are proposing. The aim is to condense many days or weeks of real usage into just a few hours. If that condensation process produces issues, such as trying to delete users before they were fully created, those aren’t interesting problems, and you need to work around them.

As you can see, a host of problems can arise from loading your system. You will have to get to know your application’s behavior to filter out the uninteresting errors associated with your testing from the real issues your users could hit. Distinguishing those is one of the hardest jobs in testing, so be ready to practice and learn a lot. Even once you have isolated a genuine issue on your system, debugging issues found in load testing presents unique challenges, as considered next.

Debugging load test issues

Functional testing, as described previously, is like taking a scalpel to your application and carefully probing individual functions. Ideally, each tester has a dedicated system, or at least dedicated instances of the core elements, so that you can completely control what happens there. When there is an issue, the logs are silent except for the single operation you performed, aside from any regular background processing. It’s easy to isolate useful information.

When load testing, that’s not the case. Having created a million users, finding the one that failed can be challenging. Performing even a single operation can have a cascade of effects across your system. Creating a single user might involve loading an interface, accepting input, sending that to the backend, and writing it to storage. There may be many other impacts, depending on your system’s architecture.

To debug issues successfully during load testing, you must be thoroughly proficient at debugging functional issues. You need to know the logs, where they are, and what they should look like. You need to be able to search and filter them, and to know which fields are available at each level of the system. And you need to be able to step between different subsystems to trace an operation through various stages to find where the problem arose.
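As a simple illustration of that tracing, assuming your system stamps each request with a shared correlation ID, you could gather every line mentioning one operation across the subsystem logs. The filenames and ID here are hypothetical:

    def trace_operation(log_files, correlation_id):
        """Gather every log line mentioning one operation's ID, per subsystem."""
        matches = []
        for path in log_files:
            with open(path) as f:
                for line in f:
                    if correlation_id in line:
                        matches.append((path, line.rstrip()))
        return matches

    # Hypothetical filenames and ID - substitute your own system's logs
    for source, line in trace_operation(
            ["frontend.log", "backend.log", "storage.log"], "req-81f3"):
        print(source, line)

If your system lacks a shared ID that follows an operation between subsystems, adding one is worth doing before load testing begins, since it turns an impossible search into a simple filter.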

As with the rest of load testing, this is where your skills are put to the test. Testing has a lovely learning curve – you can start acting as a regular user, exploring standard functions, then gradually dive deeper into your system, automating routine tasks and refining your checks to detect subtle issues. Load testing is a complex challenge because, in addition to all the expected operations, new, emergent behavior appears. This is your challenge.

Summary

Load testing uncovers some of the hardest-to-find and most important bugs in your system. These are issues that you’ll never hit by just running some exploratory testing. They require dedicated tools and skills to discover and even more to isolate and debug.

In this chapter, we described identifying load operations, the differences between static and dynamic load, and soak testing versus spikes of operations. You need to consider the design of your application for load testing, the interfaces it should use, and the functions it requires. Different classes of bugs appear when you raise system limits, look for race conditions, identify inefficient code, or test the messages between modules. You need to create a performance baseline to check for higher resource usage and watch for defects that obscure other problems.

Finally, we considered the challenges of filtering and debugging load-testing issues. In the next chapter, we will go one step further and apply load to push your system beyond its limits to test how it copes with usage over and above its design parameters.
