5

Black-Box Functional Testing

“Suppose everyone had a box with something in it: we call it a “beetle.” No one can look into anyone else’s box, and everyone says he knows what a beetle is only by looking at his beetle…

The thing in the box has no place in the language-game at all; not even as a something: for the box might even be empty. No, one can ‘divide through’ by the thing in the box; it cancels out, whatever it is.”

- Ludwig Wittgenstein, Philosophical Investigations

Black-box functional testing is what most people imagine when they think of software testing. Of all the chapters in this book, this one will likely be most familiar to you. If you’ve done any software testing, it will have started here.

Black-box testing covers the happy-path scenarios—the working cases that show the feature behaving as it should under normal conditions. Since this is only one chapter in a much larger book, you can see that testing comprises far more than that, but the main functionality is where your testing should start. Without the basic functionality in place and working, there’s no point continuing on to the finer points of testing. If a user can’t even enter their password, you can’t test how secure it is; if emails fail to send, you can’t test whether the system can handle a thousand in a minute, and so forth. This chapter describes how to test those first cases in detail.

We’ll cover the following topics in this chapter:

  • How to enable new features
  • Different types of testing: API-based, CRUD testing, and negative testing
  • Identifying worst-case scenarios
  • Equivalence partitioning and boundary value analysis
  • How to test different variable types
  • Uncovering hidden defects
  • Optimizing error guessing
  • Determining what checks to perform
  • Precision versus brittleness in automated tests
  • Comparing bugs and features

As we have seen, system testing breaks down into multiple approaches, such as usability, security, and maintainability testing. Here we consider black-box testing, starting with its strengths and weaknesses.

Advantages and disadvantages of black-box testing

The different forms of testing have neatly interlocking advantages and disadvantages. Therefore, unlike in Part 1, Preparing to Test, I won’t consider alternatives to each kind of testing. Each one is suited to finding a distinctive class of bugs but suffers from particular weaknesses. Only by combining them can you mitigate those shortcomings and achieve comprehensive test coverage.

The advantages and disadvantages of black-box testing are as follows:

| Advantages | Disadvantages |
| --- | --- |
| Realistic | May not cover all the code |
| Covers the main functionality | Does not cover error cases |
| Takes a user’s point of view | Difficult to debug |
| No need to learn implementation details | Repeats previous tests |

Table 5.1 – Advantages and disadvantages of black-box testing

The first advantage of black-box testing is that it is realistic, performing the actions your users will take themselves. Other testing areas cover rarer cases: errors and system failures, security attacks, and heavy load, among others. Those cases are essential, but you are unlikely to hit them on any given day. Black-box testing, by contrast, covers the main use cases and the parts of your product that must work well. These tests take the user’s point of view, with no allowance for the code or system architecture. This means you are most likely to hit the issues that will affect real customers.

These tests are also relatively easy to run. You don’t need to know a programming language or understand the code, learn techniques for security testing, or have tools to generate excessive load, as other testing areas require. That means a wide variety of people can carry them out; this part of system testing could be performed by product owners or a documentation team, or it could be outsourced to an external team.

Despite its ease and realism, black-box testing is unlikely to cover all possible code paths because it doesn’t consider the implementation. That requires white-box testing, described in the next chapter. While you might try error cases as part of black-box testing, they aren’t tried systematically here. Because black-box testing exercises the whole system, it is difficult to diagnose the root cause of issues and debug them. Bugs also need to be triaged, which adds delays. And because they cover common cases, these tests are likely to duplicate exploratory testing or the testing the developers performed. However, it is still necessary to step through them systematically and thoroughly to ensure you haven’t missed anything.

As we will see, the disadvantages of the different types of testing aren’t reasons not to carry them out. Instead, they show it’s necessary to perform varied testing. You can mitigate their weaknesses by augmenting tests from one area with those from another. For instance, you can explicitly list error cases (see Chapter 7, Testing of Error Cases) to ensure they are covered separately from the primary use case testing. For complete coverage, you need to consider all these different types of testing.

The following sections consider specific examples of test patterns that repeatedly appear in applications. Firstly, we explain what to test when enabling a new feature.

Enabling new features

How do you enable new functionality? Some features are simple and stateless, such as a new command in an API, a new button, or a new interface for existing functionality. The current system can continue without these features, and enabling them has little effect on the rest of the system. Stateless features are simple, and you can just enable them and begin testing.

Other features are more complex and stateful, such as database migrations or new forms of data storage. Some require several steps, such as enabling a new system, migrating users onto it, then disabling the old system. Breaking changes on interfaces need all downstream systems to be ready for the change. This section considers the complexities around enabling and disabling such stateful features.

Stateful features might be enabled with a feature flag or be available immediately after an upgrade. For this discussion, these two methods are equivalent: all you need is a way to enable the feature by turning the flag on or upgrading and a way to disable the feature by disabling the flag or downgrading.

The first pair of test cases is as follows:

  • Check entities that can be created with the feature enabled
  • Check entities that were created on the old system, then have the feature enabled for them

These will follow different code paths and may reveal different bugs. Your tests need to try both. For each case, you must ensure that the new functionality works and check for regressions in other areas.

For complex upgrades, you will also need to consider dependencies and ordering. Which parts of the system need to be upgraded before the feature can be enabled? Are there any interacting features?

If those upgrades occur over time, there is a matrix of upgrade states the system can be in, and you need to test them all, as seen in the following table:

| Module A | Module B | Feature flag | State |
| --- | --- | --- | --- |
| Downgraded | Downgraded | Disabled | 1. Initial state |
| Upgraded | Downgraded | Disabled | 2. First dependency ready |
| Upgraded | Upgraded | Disabled | 3. Upgrades complete |
| Upgraded | Upgraded | Enabled | 4. Feature fully enabled |

Table 5.2 – The states of a gradual upgrade process

Simple features only have states 1 and 4, with a feature that is on and off. For more complex upgrades, you need to enumerate all the intermediate arrangements to check that the system works in each of them.
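If it helps, the arrangements can be enumerated in code so none are forgotten. This is a minimal sketch assuming the two modules and single flag from Table 5.2; the state names and print-based checks are purely illustrative.

```python
from itertools import product

# The states from Table 5.2, in the order the rollout happens
UPGRADE_STATES = [
    ("Downgraded", "Downgraded", "Disabled"),  # 1. Initial state
    ("Upgraded",   "Downgraded", "Disabled"),  # 2. First dependency ready
    ("Upgraded",   "Upgraded",   "Disabled"),  # 3. Upgrades complete
    ("Upgraded",   "Upgraded",   "Enabled"),   # 4. Feature fully enabled
]

# Enumerate every possible arrangement so you can see which of the
# 2 x 2 x 2 = 8 combinations the planned rollout never visits
all_arrangements = set(product(["Downgraded", "Upgraded"],
                               ["Downgraded", "Upgraded"],
                               ["Disabled", "Enabled"]))
unvisited = all_arrangements - set(UPGRADE_STATES)

for module_a, module_b, flag in UPGRADE_STATES:
    print(f"Test with module A {module_a}, module B {module_b}, feature flag {flag}")
print(f"Arrangements outside the planned rollout: {len(unvisited)}")
```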

Are there any special considerations during an upgrade? For large, risky changes, it is worth implementing a way to gradually move to the new system but keep the existing one in use for some percentage of users. That lets you identify any test escapes without requiring a downgrade of the entire system, but this means there are more states to test.

Consider the following possible problems:

  • Is the upgrade instantaneous, or do any parts of the system need to be restarted to take effect?
  • Is there a significant load placed on the system when the feature is first enabled, for instance, data downloads or processing?
  • Does the upgrade replace old functionality or does it add new functionality to exist alongside it? Do the current methods still work correctly in its presence?
  • How are users routed to the new versus the old system? Are those controls working correctly?
    • For instance, if 10% of users are assigned to the new method, you will need to check a sample of users to see that that figure is applied.

Finally, for complex changes, you need to consider the case of a feature being enabled, disabled, and re-enabled. Again, this is easy for stateless features: a button vanishes on the UI, an API call starts returning 404, and so forth. More complex, stateful features require dedicated code paths to handle the feature being disabled, such as database migrations or alternative processing.

To give a concrete example, consider adding friend connections to the users in your database. Previously, you had users who could sign up and use your product, but now they can look up other people on your system and add them as friends. You enable the feature for some users but then discover a critical bug and have to disable it. What happens to the connections users made while the feature was enabled?

Those code paths require dedicated tests, such as the following examples:

  • Can the feature be successfully disabled?
  • What happens if you enable a feature, create new entities (for example, friend connections), then disable it?
    • What happens to those entities?
    • Does the system go back to its initial state?
  • What happens if you create entities with a new feature, disable it, then re-enable it?
    • Are those entities available again?
    • If your system performs processing over time, how does it handle the gap when it was disabled?

Enumerate any limitations or invalid operations. It may not be worth the development time to fully support disabling a feature, so when a feature is disabled, users will lose any new entities they created. If the product owner is happy with that behavior, you can test to ensure it works correctly, but be sure to identify and describe that case upfront so it isn’t a surprise later.
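Those enable/disable/re-enable sequences lend themselves to a scripted check. Here is a minimal sketch using the friend-connection example above; the set_feature_flag, create_friend_connection, and list_friend_connections helpers are hypothetical stand-ins for however your system toggles flags and creates entities.

```python
# Hypothetical helpers: how flags are toggled and entities are created
# will depend entirely on your system
from myapp.testlib import (set_feature_flag, create_friend_connection,
                           list_friend_connections)

def test_disable_and_reenable_friend_connections():
    # Enable the feature and create an entity with it
    set_feature_flag("friend_connections", enabled=True)
    connection_id = create_friend_connection("alice", "bob")

    # Disable the feature: the entity should become unavailable,
    # and nothing else in the system should crash trying to reach it
    set_feature_flag("friend_connections", enabled=False)
    assert connection_id not in list_friend_connections("alice")

    # Re-enable the feature: check the agreed behavior, whether that is
    # restoring the old connections or deliberately starting from scratch
    set_feature_flag("friend_connections", enabled=True)
    assert connection_id in list_friend_connections("alice")
```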

In my experience, fundamental and far-reaching features often suffer these delays and require multiple attempts to enable them. Unfortunately, that means they are most likely to interact with new features. When you re-enable a feature, you might not want to run its entire test plan again, but just verify the changes and fixes you made since last time. However, you also need to consider any other new features added since the previous test run. Re-enabling a significant feature isn’t as simple as trying the same thing twice; you are aiming for a moving target. Carefully examine not just the feature but also its interactions.

Finally, always check that you can downgrade your system again. This operation should rarely be necessary. Most upgrades will succeed, and of those that fail, you can generally issue a rapid patch rather than having to roll back. However, every so often, an issue is so large that the change will take too long, and you need to get back onto working code as soon as possible. In that case, you have to be confident that downgrading will work. This is a stressful time when people will ask how such a critical bug was written and missed by testing. Those questions have to wait until later.

Tip

Make sure that downgrading is part of your test plan, as well as simply disabling the feature. After upgrading to broken code, the last thing you need is to discover you are stuck because the database migration fails on a downgrade. Always check that you have a way back and that your disaster recovery plan will work.

Turning on features can be as simple as performing an upgrade or enabling an option, but for large, complex changes, the upgrade itself can require a test plan. Work through the different states your system can be in, considering upgrades, creating entities, and downgrades.

The next sections consider the main methods used in black-box testing, from testing APIs and interactions with storage to negative testing and worst-case scenarios. We start with API testing.

Performing API testing

Verifying an API is one of my favorite types of testing. The coding you need is generally straightforward and accessible, requiring few complex tools and only a little setup. Best of all, APIs usually have detailed and exhaustive documentation, either created directly from the source code or as part of customer-facing documents.

Using that documentation as your test basis, API testing involves stepping through each message and field and systematically trying different values. See the Testing variable types section for the important equivalence partitions for common data types. You can choose what level of coverage you need to reach, varying from a single check that fields are accepted to exhaustive testing of valid and invalid inputs.

APIs can vary greatly. Hideous binary, tag-length-value, or fixed-width protocols have to be completely specified in advance and need to be decoded before they become human-readable. If you are using a text-based SOAP or REST-based protocol with labeled text fields, count yourself lucky. But all types of API require significant checks with the same style of tests.

Example API command

Consider a simple API command to create a new user with the following fields:

| Field Name | Type | Required? | Parameters |
| --- | --- | --- | --- |
| UserName | string | Yes | Minimum length: 1, Maximum length: 256 |
| Age | integer | No | |
| Country | string | No | ISO 3166 3-letter code |

Table 5.3 – An example API command to create a user

For simplicity, we will ignore whatever authentication mechanism is in place to restrict access to valid users. This request will generate either a success or failure response. If the API request to create a user is a success, it will return a user ID, which can be used in future API calls, for instance, to update or delete that user:

| Field Name | Type | Notes |
| --- | --- | --- |
| UserID | integer | Used to reference a user in future API commands |

Table 5.4 – An example API response to create a user

If the request fails, there are a limited number of error responses:

| Error Name | Code | Notes |
| --- | --- | --- |
| InvalidInput | 400 | There was an error in the request message |
| UserLimitReached | 405 | The system already has the maximum number of users configured |
| InternalError | 500 | A problem on the server meant the request could not be fulfilled |

Table 5.5 – An example of API error responses to creating a user

There would be other API commands in the same format to query, edit, and delete that user. Other entities in this example system would follow the same pattern, but this example is sufficient to illustrate the key tests to run. In an HTTP-based API, commands to create and edit entities will typically use POST commands, while those retrieving data will use GET commands, so check they are implemented correctly and the alternative is rejected.

Next, check whether the API supports missing out fields. The age and country fields are supposed to be optional, so is the command successful if they are absent, and are the correct defaults set in that case? Conversely, is the message rejected if the required UserName field is missing?

Given the fields are present, test their values. What if the UserName field is too short or too long, the age is out of range, or the country code is invalid? Those cases should generate error responses with code 400; otherwise, it is a bug. Those rejections should also write a warning log line; it shouldn’t be an error because this is an expected failure. For more on that, see Chapter 10, Maintainability.

You then need to test the other cases. These are considered more in Chapter 7, Testing of Error Cases, but the key is to systematically cause each error case. Here, that means reaching the maximum number of users or triggering an internal error, for instance, by blocking the database from writing the new user details.

For the successful case, check whether you receive a user ID. Can that ID be used in subsequent commands? When checking API calls, don’t rely on a successful return code; instead, independently verify the change with a subsequent check, as described in the next section.
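Here is a minimal sketch of those field-level checks in Python using the requests library, assuming a hypothetical /users endpoint that implements the command from Table 5.3; the base URL, paths, and success code are assumptions about one possible implementation.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical server under test

def create_user(payload):
    # Creation is assumed to use POST, as described above
    return requests.post(f"{BASE_URL}/users", json=payload, timeout=10)

# Valid request: optional Age and Country omitted; expect success and a UserID
response = create_user({"UserName": "alice"})
assert response.status_code in (200, 201)
user_id = response.json()["UserID"]

# Missing required field: expect InvalidInput (400)
assert create_user({"Age": 30}).status_code == 400

# Values outside their parameters: expect InvalidInput (400)
assert create_user({"UserName": ""}).status_code == 400                        # below minimum length
assert create_user({"UserName": "a" * 257}).status_code == 400                 # above maximum length
assert create_user({"UserName": "bob", "Country": "XXXX"}).status_code == 400  # not a 3-letter ISO code
```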

Checking API commands

You can check the results of API calls in three main ways:

  • Examining the return value of the API call: Examining the return value of an API call is necessary but not sufficient to measure its success. A success result may only indicate that the application accepted an API call and began processing it, not that it was completed.
  • Performing an API call to check the outcome: Because checking return values is insufficient, you should always check the outcome of API commands independently with at least one other API call. For instance, if you sent a command to create a user, follow up with a request to list users to ensure your entry is included, as sketched after this list.
  • Performing a check on a different interface to check the outcome: Even better, check whether your new value is available via a different interface, for instance, on a web page or other graphical interface. To continue the previous example, after creating a user via the API, check that that user is visible on user lists presented on a web interface. That shows that the new data has been successfully written and can be translated to other outputs. Conversely, you can use APIs to check that data is correctly added via other interfaces for a more robust check.
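Continuing the hypothetical /users endpoint, this sketch shows the second style of check: confirming the result of a create command with an independent API call rather than trusting its return code. The list endpoint and its response format are assumptions.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical server under test

# Create a user and record the ID from the response
create_response = requests.post(f"{BASE_URL}/users", json={"UserName": "carol"}, timeout=10)
assert create_response.status_code in (200, 201)
user_id = create_response.json()["UserID"]

# Independently verify the outcome with a second API call:
# the new user should appear in the user list
list_response = requests.get(f"{BASE_URL}/users", timeout=10)
assert list_response.status_code == 200
assert user_id in [user["UserID"] for user in list_response.json()]

# Better still, also confirm the user is visible on a different interface,
# such as the web UI, which is not shown here.
```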

Within API testing, you will also need to consider the system’s internal state (see Chapter 6, White-Box Functional Testing) and possible error cases (see Chapter 7, Testing of Error Cases). Those are described in more detail in the following chapters but look for common sequences such as creating, updating, and deleting entities. That pattern of data entry and retrieval is common to many systems. In the next section, we move on from APIs to consider storage commands. Those can be triggered by APIs or other interfaces and have their own set of test requirements.

Performing CRUD testing

The four fundamental data operations of stateful computing systems are as follows:

  1. Create
  2. Read
  3. Update
  4. Delete

These apply to any system with data storage, whether in a document or relational databases, text or binary files, or some other method. In all those cases, it is possible to create new records or write new data and then read it back for display or processing. You can update information that has already been written and, finally, delete it. Those operations apply to many systems, so you can use them to guide your testing.

The relative frequencies of those operations will depend on your system. Typical systems, however, will perform reads most of all, followed by updates, creations, and deletions. An HR database, for instance, will list users any time someone looks at the users page. A staff member might have their details updated occasionally after their creation, and as long as there is a positive number of user entries, creations will have outnumbered deletions. So, in terms of frequency, you can expect the order to be Read, Update, Create, Delete. Check whether your system is an exception. Version control software, for instance, tracks all history and often adds records but seldom updates or deletes them. There are other exceptions too, so check the frequencies on your system.

Luckily, the riskiness of those operations is almost the reverse of their usual frequency: Delete, Update, Create, Read.

Testing deletion operations

Deletes are most dangerous because you remove data. You need to be sure that you have removed only the correct data and removed it completely. The rest of the system must be aware of the deletion so it doesn’t attempt to access the missing records.

Real-world example – The dangers of deletion

In one company I worked at, we provided a messaging service where we stored the complete history on private servers that we ran. Messages were immutable – we had no easy way to delete them, and any customer requests to do so needed development time.

As a new feature, in one release, we added messages in meetings. Users could then easily message each other for the duration of a conference, after which messages weren’t available. The feature worked well in testing, and all its functionality in meetings was good. The problems came after conferences ended.

For the first time, messages were deleted, not explicitly, but after meetings, and we had a series of crashes as the code tried to access non-existent entities. Sending more messages was fine; deleting them was far harder.

Trying to reference null pointers or invalid objects is a common cause of crashes in lower-level languages such as C, which lack automatic memory management. In your test plan, focus on deletion operations and what happens afterward. Consider the following cases:

  • Rapid deletion and creation of entities:
    • Is there a race condition around deletion?
    • Is it impossible for other parts of the system to access deleted data?
    • Is there a narrow window when it is still available?
  • Accessing stale entities:
    • For instance, open a list of entities in two browser windows
    • On one, delete an entity, then attempt to reference it from the other
    • Is a reasonable error message displayed?
  • Accessing related entities:
    • After deleting an entity, check all related operations
    • Can it be accessed anywhere?
    • Do any parts of the system still have a reference to it?
  • Recreating entities:
    • After deleting an entity, recreate it in the same place
    • Make a new entity as similar as possible
    • If data can be stored in different locations, move data from location A to B, then back from location B to location A

All those tests are designed to check the completeness and robustness of delete operations. Backups and restores also test deletions, described further in Chapter 11, Destructive Testing. This is a rich source of bugs, so cover it thoroughly in your test plans.
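As one illustration of the stale-entity and recreation cases above, here is a minimal sketch against the hypothetical /users API used earlier; the endpoints and status codes are assumptions about how such a system might respond.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical server under test

# Create, then delete, a user
user_id = requests.post(f"{BASE_URL}/users", json={"UserName": "dave"}, timeout=10).json()["UserID"]
assert requests.delete(f"{BASE_URL}/users/{user_id}", timeout=10).status_code in (200, 204)

# Accessing the deleted entity should fail cleanly, not crash or hang
stale = requests.get(f"{BASE_URL}/users/{user_id}", timeout=10)
assert stale.status_code == 404

# Recreating an entity as similar as possible to the deleted one should succeed
recreated = requests.post(f"{BASE_URL}/users", json={"UserName": "dave"}, timeout=10)
assert recreated.status_code in (200, 201)
# Whether the new entity reuses the old ID is system-specific; check whichever
# behavior your specification promises
```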

Testing update operations

The update command is the next most risky operation, because it also affects existing information. While it isn’t as destructive as a delete command, it is still an unrecoverable operation if you alter the wrong data.

Real-world example – Where’s the WHERE?

This example is a cliché, but I’ve seen it happen for real. In one video conferencing company I worked at, we hit an issue on our live system that required a change to the live database. A senior engineer needed to manually update one customer’s conference number with an SQL command, requiring a WHERE statement to specify the number to change.

The engineer missed the WHERE clause. Instead of changing a single conference number, he set every conference number in the cloud to the same value. Calls failed around the world as the routing went wrong. To his credit, the engineer kept his cool and had it fixed within minutes. We implemented stricter checks on any live database updates from then on, even from senior engineers.

That is an example of an operational issue rather than a bug found in testing, but the same principles apply. Either updating the wrong entries or updating the right entries to the wrong values instantly causes bugs. Watch out for complex data types and translations because they are most likely to go wrong; these include the following examples:

  • Dates
  • Times
  • Currencies
  • Entities with many subcategories
  • Entities with complex data transformations

These are considered in more detail later. The most important class to identify is your system’s intricacies, so you can concentrate your testing there.

Real-world example – Updating subcategories

At one company I worked at, there were six subcategories of users, such as system administrators, external integrations, and our normal users. That complexity was a breeding ground for bugs.

For instance, different system modules counted different subcategories, so the total number of users varied depending on which module you used. When we added new features, such as the ability to disable users, we had to consider the effect on all six subcategories. If the developer missed one, there was unexpected behavior and issues.

Testing creation operations

Creation is the next safest operation since it leaves existing data intact. The worst possible outcome with a creation command should be for it to fail and leave the system in the same state. It’s easy to test for those failures.

Creation can cause broader disruption if it allows you to add duplicate entities. For example, setting up a new machine with a duplicate IP address will disrupt routing to the previous machine that had been working. Your system should flag up and block the creation of duplicate entities wherever possible, and you should raise that as a bug if that’s not the case.

Another subtle bug around creation is for new entities to be created incorrectly, missing certain pieces of data or configuration. This will need careful checking to ensure that new entities are fully functional and not just present. You need to write those checks into test plans and auto-test systems.

Testing read operations

Finally, read operations are safest since they neither change existing data nor add to it. The bugs around read operations are user-facing: are all data values shown correctly? What about long values, numbers with rounding, emojis, and so on? These are considered in more detail in the Testing variable types section.

CRUD testing is a fundamental part of checking your system’s data storage. Here I can only list general considerations; you will need to examine your system in detail to understand its complexities and where bugs are likely to be found. When performing updates and deletions, you need to check for not only incorrectly changing the data you wanted to modify but also changing data that should have been left untouched. That is negative testing, considered next.

Performing negative testing

Testing a change always comes in two parts: was the entity changed correctly, and was everything else left unchanged? That second aspect can sometimes be overlooked, so ensure you include it in your testing. For every positive test case you write, consider the equivalent negative case; here are some examples, followed by a short sketch:

  • If one user is deleted, are other users still available?
  • Does a user have the access they should and are they barred from areas they shouldn’t have access to?
  • Is the configuration updated correctly, and are the other configuration options unchanged?
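Here is a minimal sketch of the first pair, the positive and negative checks around deleting a user, again using the hypothetical /users API; the endpoints are assumptions.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical server under test

# Set up two users
alice_id = requests.post(f"{BASE_URL}/users", json={"UserName": "alice"}, timeout=10).json()["UserID"]
bob_id = requests.post(f"{BASE_URL}/users", json={"UserName": "bob"}, timeout=10).json()["UserID"]

# Positive check: the deleted user is gone
assert requests.delete(f"{BASE_URL}/users/{alice_id}", timeout=10).status_code in (200, 204)
assert requests.get(f"{BASE_URL}/users/{alice_id}", timeout=10).status_code == 404

# Negative check: the other user is left untouched
unaffected = requests.get(f"{BASE_URL}/users/{bob_id}", timeout=10)
assert unaffected.status_code == 200
assert unaffected.json()["UserName"] == "bob"
```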

We already saw the catastrophic example of the missed WHERE clause. The positive test for that change passed – the customer in question did have their conference number updated. The only problem was that all the other numbers in the cloud were changed simultaneously. This is a particular risk when it’s unclear how widely a piece of configuration is used. A setting might appear local to one part of the system but actually be shared in many other contexts. The larger and more complex a system is, the more likely this is to be the case.

A popular bug-tracking tool demonstrates this issue. It has several layers of indirection, with fields, field configurations, and field configuration schemes, among others. Each is highly configurable and can be shared between different projects. You need to carefully check each level to see how widely it is used. Otherwise, making a change will have unexpected consequences for other users.

Real-world example – Fight Club branding

In one company I worked for, we could apply branding to our conferences by overlaying text on the screen. The feature wasn’t used very much, and the configuration was fiddly. One customer was behind on payment, and we wanted to inform their meeting participants. I tested the feature to ensure it still worked on the live cloud by branding my organization’s conferences. Given the context, you can see where this is going.

Unfortunately, my branding configuration wasn’t unique, and I simply inherited the global configuration. When I set branding to apply to my organization, it affected all unbranded live customers too. Anyone who started a conference just then saw my test phrase across the bottom of their screen: Brought to you by Tyler Durden.

To my shame, I thought the test was working just fine until I got a panicked call from a member of our sales team doing a demo for a customer. I’d checked that my change had been successful but not that everything else had been left unchanged.

While positive tests are easy to see, negative cases can be harder to identify. When you make a change, an infinite number of things should be left unaltered, and you will never know whether you’ve caught them all. You will have to judge what areas are likely to change and, therefore, what you should check for each update you make.

Because it’s hard to describe negative test cases, you will have to carefully check what interactions exist in your system. They may be as obvious as shared configuration between different modules, or the connections may be more subtle. You will need to map out the connections for your product.

As well as simple, individual changes, you should work out the most complex operations possible on your system. The following section describes how to identify worst-case scenarios. If they work, any simpler operations should also be successful.

Identifying worst-case scenarios

To maximize the efficiency of your testing, identify the worst-case scenarios and just test them. If they work, then all simpler cases should be successful as well. Many tests will be a subset of complex tests, so it’s fastest just to perform the complex ones. For example, if your users can have multiple copies of many different relationships, add multiple instances of every kind of relationship. If that works, then having just one relationship should work, as well as having multiple copies of just one type because those are simpler cases.

One exception is with newly written code, where you can expect to find issues. Then, if you test the worst-case scenario, you’re likely to hit a problem and possibly several simultaneously. In that case, it’s faster to gradually build up to the most complex configuration. Then you can isolate and debug issues as they arise, and you know which step triggered them. However, once each aspect of the test has passed individually, it’s quickest to combine them.

The hardest part is identifying the most complex configuration for your system, so consider the following guides:

  • Add the maximum number of configurable entities to your system
  • Add all possible types of configurable entities to your system
  • Fill the entity’s history with one of each possible event
  • With the most complex configuration and history in place, perform each system action:
    • Upgrade
    • Migration to an alternative system/location
    • Conversion between types
    • Backup and restore
    • Deletion
  • Perform as many actions simultaneously as possible

If your most complex configuration can survive the most complex system operations, you should be well-placed for simpler commands. However, every so often, complicated cases are not a superset of simpler cases, so watch out for those, as they will need separate tests. I learned that the hard way.

Real-world example – Load testing prototypes

One of the most annoying bugs I ever shipped came from concentrating on worst-case scenarios and missing a simpler case. We were working on a new hardware product that implemented the functionality of our previous platform but with an entirely new architecture and chipset. That meant we had a mature product with years of feature development to test, but anything could be broken. To make matters worse, we only had three prototype units to test on.

We hammered those units, running lots of functional testing, as well as worst-case load and stress tests, putting them through their paces for months, and fixing many bugs before release. We finally shipped, and… they suffered crashes almost immediately. What had I missed?

It turned out that one test we hadn’t done was to leave them alone. Real customers used the units far less than we had, leaving them idle for days at a time. In that case, a buffer wasn’t flushed, so it overran, causing a crash. Doing nothing with the unit was one test that hadn’t been in our plan.

Sometimes there is no single worst case. If two configurations are mutually exclusive, you will need two different cases to cover those. Interactions between conditions are also difficult to judge and write tests for. You rapidly hit combinatorial explosions of possible test cases if you try to cover every combination. Where possible, do everything at once; where that’s not possible, you will have to judge their relative importance.

Identifying worst-case scenarios is essential for designing your test plans, so consider them for each new feature. Next, we consider grouping variables based on their characteristics. The following section describes how best to do that using equivalence partitioning.

Understanding equivalence partitioning

It is impossible to test every possible input into your system, so we have to group them into categories demonstrating the behavior of a whole class of inputs. You are undoubtedly already doing that within your test plans, and its official name is equivalence partitioning. By considering it in the abstract, you can learn how to identify possible test cases quickly and completely.

For instance, to test whether a textbox can handle inputs including spaces, we could use the string “one two”. This string is an example of all possible strings that include a space between words. They are all equivalent, so we have partitioned them together, and that example is our single test to check the whole group. The following example strings test other partitions:

| Partition | Example test |
| --- | --- |
| Strings including spaces | hello world |
| Strings including accented characters | Élan |
| Strings including capital letters | Hello |
| Strings with special characters | ;:!@£$%îèà{[('")}] |

Table 5.6 – Examples of equivalence partitions for strings

A classic example of tests performed with equivalence partitioning is tax rates. For instance, a 0% rate is applied up to £12,000, 20% up to £40,000, and 40% above that. In this case, the partitions are clearly stated in the requirements, and you should choose one value to test within the 0%, 20%, and 40% tax ranges for a total of three tests to cover all three valid partitions.
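A minimal sketch of those three tests, assuming a hypothetical marginal_rate function that implements the bands above and pytest as the test runner:

```python
import pytest

from myapp.tax import marginal_rate  # hypothetical function under test

# One representative value per valid partition
@pytest.mark.parametrize("income, expected_rate", [
    (6_000, 0.0),    # within the 0% band (up to £12,000)
    (25_000, 0.20),  # within the 20% band (£12,000 to £40,000)
    (60_000, 0.40),  # within the 40% band (above £40,000)
])
def test_tax_rate_partitions(income, expected_rate):
    assert marginal_rate(income) == expected_rate
```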

Those aren’t the only test cases you need, of course. You are probably already worried about invalid partitions, considered in Chapter 7, Testing of Error Cases, and the boundaries between different partitions described in the upcoming Using boundary value analysis section. Performing equivalence partition tests does not cover every condition, but it identifies cases you should include.

The tax rate example listed the possible partitions in the specification, but usually, you won’t be lucky enough to get partition boundaries explicitly laid out, and you’ll have to identify them for yourself. Other examples of partitions are included in the following list:

  • Different system states:
    • For example, users who are already members, users registered but without a password, and new users who have not logged in
  • Different options enabled
  • Valid and invalid inputs
  • Functionally different numeric values:
    • Over 18 versus under 18
    • Orders expensive enough to get free shipping
  • Different input data types:
    • Currencies, time zones, languages, and so on

As you can see, there are many possible partitions across many different variables. You will need to analyze your system to find the partitions relevant to you.

Once you have identified the different equivalent partitions, you must write test cases to exercise each one. As with identifying worst-case scenarios, you can often combine testing partitions to reduce the number of test cases. For instance, you could try this string:

Hello ;:!@£$%îèà{[('")}]

If that one works, you have tested accented characters, some special characters, capital letters, and strings with a space simultaneously. Such combinations reduce the total number of tests.

On the downside, if that string fails, you don’t know which aspect caused the failure. When the chance of tests failing is higher, for instance, in new code, it is worth separating the different test conditions, even though it means running more individual tests. Likewise, combining cases can save time with automated testing, but debugging is easier if the conditions are separate. However, if you are stuck running regression tests manually, they are both slow and likely to pass. In that case, combining test conditions can speed up the process.

Look out for partitions due to the implementation that are invisible to the user. For instance, integers may be encoded in 2 bytes up to 65,535 and 4 bytes above that value. Those are two different partitions, and although the requirements do not distinguish between them, the implementation does. This will be considered in more detail in Chapter 6, White-Box Functional Testing.

By applying the principle of equivalence partitioning, we can generate test cases for common inputs, such as strings and files, but you also need to consider the boundaries between partitions. This is covered by boundary value analysis.

Using boundary value analysis

Within a partition, not all values are equally important. Bugs are much more likely on the boundaries of partitions, where off-by-one errors may be present, for instance, if greater-than is used instead of greater-than-or-equals. The idea is simple enough: when there is a boundary, you need to test the values just below and just above it to ensure the divide is in the right place.

As with equivalence partitions, boundaries might be explicitly listed within the specifications or they may be implicit aspects of how a feature was implemented. Examples of explicit boundaries are:

  • Tax rates: The values up to X thousand are taxed at one rate, and values from X to Y thousand are taxed at another
  • Ages: Users below 13 are banned, users between 13 and 18 get a child account, and users over 18 get an adult account
  • Passwords: The length of passwords must be at least eight characters

Examples of implicit boundaries will depend on your product’s implementation details, but look out for values and lists that are implemented in one way up to a certain point, then differently for other values. For instance, integers of different sizes might be stored differently. The implementation will attempt to hide those details, but issues may be present.

The values for integer boundaries are simple enough—for the previous password example, for instance, you need to test that the application rejects passwords with seven characters and accepts passwords that are eight characters long.

For non-integer numeric values, the value just above the boundary needs to take precision into account. If your first £10,000 is tax-free, you need to test what happens at £10,000.01, one penny above that. Given that you can’t claim 20% on a penny, what happens to the rounding? That, too, is part of the boundary analysis. If the feature specification isn’t precise about the behavior, whether it should be rounded up or down, for instance, then you need to add it in.

The age example also has non-integer values. At 12 years and 364 days old, users are rejected; you need to test that value, as well as 13 years and 0 days old. Is age measured any more finely than days? Hopefully not! But that, too, needs to be explicitly stated in the specification.

Boundary value analysis can be two-value, as described previously, or three-value, also known as full boundary analysis. For that, you test the values one lower, equal to, and one above the boundary. So, for the eight-character password, you would try seven-, eight-, and nine-character passwords. Personally, I can’t see a nine-character password failing if an eight-character password was accepted correctly. However, three-value boundary analysis is part of testing theory, so I would be remiss if I neglected to mention it.
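A minimal sketch of those boundary tests for the password length rule, assuming a hypothetical is_valid_password function and pytest as the test runner; the passwords are chosen so that only the length varies.

```python
import pytest

from myapp.auth import is_valid_password  # hypothetical function under test

# Boundary analysis around the minimum length of eight characters
@pytest.mark.parametrize("password, accepted", [
    ("Pa5$wor", False),   # seven characters: just below the boundary, rejected
    ("Pa5$word", True),   # eight characters: on the boundary, accepted
    ("Pa5$word9", True),  # nine characters: only needed for three-value analysis
])
def test_password_length_boundary(password, accepted):
    assert is_valid_password(password) == accepted
```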

As well as the boundaries between values, you also need to check how values will interact with each other, as described in the following section.

Mapping dependent and independent variables

Glaring bugs that happen under a wide range of circumstances are easy to find, even with exploratory testing. If the Logout button never works, you will see that the first time you log out. However, if it usually works but only fails on a certain web browser, a certain browser version, or only after you have changed the browser window size, you might not find the issue. If you don’t test those situations, the Logout button will work in your test. Bugs that only occur under certain circumstances are much harder to find: you can do X, and you can do Y, but you can’t do X and Y together.

Bugs that depend on multiple variables are challenging to find because the test plan you need is huge. First, you have to identify the variables affecting your system. Some essential variables will have far-reaching effects, such as the operating system you’re running on or the web browser you are using. Those are common, but also consider the variables that fundamentally alter your system’s behavior. Maybe it can be used with users logged in or logged out, or by a paid rather than a free user. There may be different interfaces, such as an application, web page, or API, or a white-label versus an own-brand offering. You will need to examine your system to discover the important variables for you.

Some variables are independent and only need to be tested once. For a simple function, such as inviting a new user by their email address, you could have one test with emails with capital letters to ensure they are correctly accepted. A second test could try a shared email domain, such as outlook.com, versus a company email domain, such as microsoft.com, which is reserved for Microsoft employees. Some applications behave differently in those cases, although initially, we’ll assume there is no difference between them.

If those two variables are independent of one another, then we only need to invite two email addresses to cover those cases:

| | Email capitalization | Email domain |
| --- | --- | --- |
| Test case 1 | All lowercase | Company domain |
| Test case 2 | Including capital (uppercase) letters | Shared domain |

Table 5.7 – A test plan with independent variables

However, suppose your application considers company email addresses far more valuable than shared ones. Your application is effectively split in two—it checks the type of email address early in the processing and gives a completely different experience if the email address belongs to a company. Then the email domain would massively change the application’s behavior, and many other variables would depend on it, including whether it correctly handled capitalized email addresses. Now, these are dependent variables, so you have to try every combination of them:

| | Email capitalization | Email domain |
| --- | --- | --- |
| Test case 1 | All lowercase | Company domain |
| Test case 2 | Including capital letters | Company domain |
| Test case 3 | All lowercase | Shared domain |
| Test case 4 | Including capital letters | Shared domain |

Table 5.8 – A test plan with dependent variables

This presents you with a problem. With dependent variables, the length of the test plan has doubled. Dependent variables have a multiplicative effect on the amount of testing you have to do, and the greater the number of options, the larger that effect is. You face a combinatorial explosion of possibilities. What happens if company domains are fine on Windows and macOS but fail on iOS, with long strings, on full moons, or by the last light of Durin’s Day?

Tests can become as obscure as you want, which shows that no test plan can ever be complete. This demonstrates why exhaustive testing is impossible, as one of the ISTQB principles states. However, testing some combinations is achievable, so how deep should you go, and which combinations are important?

This is where the skill of testing begins, and here I can only give guidance and heuristics. Which variables interact can be challenging to predict in advance and will depend on the details of the system you are testing. I can only advise you to identify the different variables and consider each pair carefully. These interactions are perfect places for bugs to hide, and it is the tester’s skill to find and test them. Still, systematically laying out the variables as previously shown and considering them gives you the best chance to identify potential issues.

Aim to identify orthogonal, independent variables, which do not interact and can be considered separately. They are valuable because they do not expand the matrix of possible cases. Conversely, identify the dependent variables and their possible interactions, which can form a multi-dimensional array. If you have five variables that all interact, then it’s a five-dimensional array with potentially scores of tests, depending on how many values each variable can take. Automation can help you step through all the cases systematically, so long as you can spend the time to do so. Even if you choose not to cover every case, it’s important to have identified them all, so you know what you are not testing.
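To see the multiplicative effect in miniature, here is a small sketch comparing the test counts for independent versus fully dependent variables; the variables and their values are illustrative only.

```python
from itertools import product
from math import prod

# Illustrative variables and the values each can take
variables = {
    "email_capitalization": ["all lowercase", "including capitals"],
    "email_domain": ["company", "shared"],
    "browser": ["Chrome", "Edge", "Safari"],
}

# Independent variables: each value only needs to appear once, so the test
# count is driven by the variable with the most values
independent_tests = max(len(values) for values in variables.values())

# Dependent variables: every combination needs its own test
dependent_tests = prod(len(values) for values in variables.values())
all_combinations = list(product(*variables.values()))

print(independent_tests)     # 3
print(dependent_tests)       # 12
print(all_combinations[:2])  # the first two of the twelve combinations
```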

In my time as a tester, there has been a whole class of test escapes caused by failing to test combinations of variables. I got close: we tried A and we tried B, but we didn’t try A and B together. I kicked myself every time we found a combination I had missed. If I had identified this variable and that variable, why hadn’t I tried them together?

Real-world example – Swimming through the sky

My favorite example of two dependent variables didn’t come from my work but from a bug I hit in an old Spyro the Dragon game. At one point, you had to chase a thief around a pre-defined course. When you caught him, you were transported to talk to a character who thanked you for your efforts.

The problem was that part of that course was underwater. If you caught the thief while you were swimming, you were transported out of the water, but the game thought you were still swimming. Since you could swim up as well as down, you could swim through the sky and reach everywhere in the level.

Catching the thief outside the water worked and swimming normally worked, but catching the thief while swimming revealed a fun bug.

You can’t test everything, so there will always be gaps. The best you can do is apply your knowledge of the most likely areas to break and learn from any combinations you miss. A helpful tool to map out dependent variables is decision tables, which are explained in the following section.

Using decision tables

To capture complex interactions between dependent variables and system behavior, it can be helpful to write out the possibilities in a table. This provides a basis for writing test cases and ensures that all conditions are covered. By expanding out the variables in a systematic way, you confirm you haven’t missed any combinations.

Consider a web application with basic or advanced support depending on the operating system and web browser it runs on. The advanced mode isn’t a replacement for the basic mode; some users may choose the basic mode even though the advanced mode is available.

The specification states the following:

  • Chrome supports basic and advanced modes on Windows and macOS
  • Edge supports only basic mode on macOS
  • Safari only supports basic mode and only on macOS
  • Edge supports basic and advanced modes on Windows

Do those requirements cover all possible cases? They do, although that isn’t immediately obvious. We can make those conditions clearer by writing them out, which also shows us how many tests we need in total:

| OS | Browser | App mode | Is it supported? |
| --- | --- | --- | --- |
| Windows | Chrome | Basic | Yes |
| Windows | Chrome | Advanced | Yes |
| Windows | Edge | Basic | Yes |
| Windows | Edge | Advanced | Yes |
| Windows | Safari | Basic | No |
| Windows | Safari | Advanced | No |
| Mac | Chrome | Basic | Yes |
| Mac | Chrome | Advanced | Yes |
| Mac | Edge | Basic | Yes |
| Mac | Edge | Advanced | No |
| Mac | Safari | Basic | Yes |
| Mac | Safari | Advanced | No |

Table 5.9 – A decision table for application support on different operating systems and web browsers

We have two operating systems, three web browsers, and two application modes totaling 2 x 3 x 2 = 12 cases. Even though some arrangements are unsupported, you also need to test those to ensure they fail in a controlled way. Writing out the entire table takes a little time but has the advantage of explicitly listing every test case.
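One way to make sure no combination is skipped is to generate the expanded table programmatically. This sketch encodes the specification above in a hypothetical is_supported helper; in a real test, each row would drive a check that the app either works or fails in a controlled way.

```python
from itertools import product

OSES = ["Windows", "Mac"]
BROWSERS = ["Chrome", "Edge", "Safari"]
MODES = ["Basic", "Advanced"]

def is_supported(os_name, browser, mode):
    # Encodes the specification listed above
    if browser == "Chrome":
        return True                                    # both modes on both OSes
    if browser == "Edge":
        return os_name == "Windows" or mode == "Basic"
    if browser == "Safari":
        return os_name == "Mac" and mode == "Basic"
    return False

# Expand every combination so each one becomes an explicit test case
for os_name, browser, mode in product(OSES, BROWSERS, MODES):
    expected = "Yes" if is_supported(os_name, browser, mode) else "No"
    print(f"{os_name:7} | {browser:6} | {mode:8} | {expected}")
```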

To avoid redundancy, you can collapse lines that don’t depend on a variable. For instance, in the first two lines of the preceding table, Chrome on Windows supports both basic and advanced app modes:

| OS | Browser | App mode | Is it supported? |
| --- | --- | --- | --- |
| Windows | Chrome | Basic | Yes |
| Windows | Chrome | Advanced | Yes |

Table 5.10 – An excerpt of an expanded decision table

We can rewrite that to show that Chrome on Windows is supported regardless of the app mode:

| OS | Browser | App mode | Supported? |
| --- | --- | --- | --- |
| Windows | Chrome | | Supported |

Table 5.11 – An excerpt of a collapsed decision table

Removing all redundancy from the preceding table gives the following collapsed decision table:

| OS | Browser | App mode | Supported? |
| --- | --- | --- | --- |
| | Chrome | | Supported |
| | Edge | Basic | Supported |
| Windows | Safari | | Unsupported |
| Windows | Edge | Advanced | Supported |
| Mac | Edge | Advanced | Unsupported |
| Mac | Safari | Basic | Supported |
| Mac | Safari | Advanced | Unsupported |

Table 5.12 – A collapsed decision table for application support

Note that the first line indicates that both operating systems support Chrome in both app modes. Four lines have collapsed down to one. Start with one column first and collapse it down as far as possible before moving on; otherwise, you can end up with overlaps.

While the collapsed decision table is neater, it no longer shows the explicit tests you need to run, so the expanded decision table can be both simpler to write and more useful in practice.

Boundary value analysis and equivalence partitioning complement decision tables. Equivalence partitioning identifies the variables and their different values, and then you can determine which are dependent and independent of each other. Decision tables expand the complete list of possible test cases, and you can then, if necessary, identify the boundaries between them by performing boundary value analysis.

Where multiple variables interact, it can be helpful to display the dependencies visually. Cause and effect graphs are one way to achieve that, as described next.

Using cause-effect graphing

In the preceding example, application support depended on which operating system and web browser you were using. This can be presented in a cause-and-effect graph, which lets you visually display the relationships between different variables and specific outcomes. These graphs use the standard logic operators AND, OR, and NOT. The following diagram displays an AND relationship:

Figure 5.1 – AND relationship between two variables and an effect

These diagrams use traditional symbols representing the logical operators:

Figure 5.2 – OR relationship between two variables and an effect

The AND and OR operators can take two or more inputs and have their usual truth table outputs. The following diagram displays a NOT relationship:

Figure 5.3 – NOT relationship between variable A and an effect

Using this technique, you can map complex relationships between inputs and outputs to see their effects more clearly. You can create maps of arbitrary complexity, including operators with more than two inputs, multiple causes, intermediate states, and multiple effects. The following diagram shows such a combination:

Figure 5.4 – A complete example cause-effect graph

Note the intermediate states E1 and E2 and the three-input OR gate leading into E1. This shows how causes A through D interact to produce effects 1 and 2. Using this graph, you can determine the truth table linking these variables, as shown here:

| A | B | C | D | E1 | E2 | Effect 1 | Effect 2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |

Table 5.13 – The truth table for the cause-effect graph shown in Figure 5.4

I think you’ll agree that the cause-effect diagram is easier to read, although the table explicitly lists all possibilities to save you from working them out. In practice, however, variables and their connections are rarely this complex. You could separate the causes leading to Effect 1 from those leading to Effect 2, for instance, even if they share some common causes.
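If you do want to cross-check a table like this mechanically, the logic can be written out directly. The gate structure below (a three-input OR of A, B, and NOT C feeding E1; C AND NOT D feeding E2; Effect 1 as E1 AND E2; Effect 2 as NOT E2) is an assumption chosen to be consistent with Table 5.13, not a transcription of Figure 5.4.

```python
from itertools import product

def effects(a, b, c, d):
    # Assumed gate structure, consistent with Table 5.13
    e1 = bool(a) or bool(b) or not c   # three-input OR feeding intermediate state E1
    e2 = bool(c) and not d             # intermediate state E2
    effect_1 = e1 and e2
    effect_2 = not e2
    return int(e1), int(e2), int(effect_1), int(effect_2)

# Regenerate the truth table to cross-check the hand-written version
for a, b, c, d in product([0, 1], repeat=4):
    print(a, b, c, d, *effects(a, b, c, d))
```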

Frankly, if you’re using this technique, I think you have too much time on your hands. The development team should be able to diagnose the cause of a bug with far less information than an exhaustive list of all the conditions under which it appears. So long as you can provide at least one set of steps that reliably reproduces the issue, you shouldn’t need to map out every single dependency.

Important note

I include cause-effect graphing here for completeness, but I don’t recommend you use it in practice.

Having seen how we can map the effect of different variables, we now turn to the values those variables can take. There are common equivalence partitions that cause issues in many contexts that you should use in your testing, as described in the next section.

Testing variable types

In this section, we consider interesting values to use in common variable types, such as the contents of strings or numeric fields. These lists of different test cases are useful in many different scenarios. You should refer back to them throughout this book—they are used in user interface tests, in internal and external API fields, and as values in function calls. Your system should be resilient to exceptional values in these fields at all levels, starting with those entered by the user.

These lists use equivalence partitioning to divide up the space of possible inputs into different categories and pick one example from each category to check your system’s response. Testing these invokes Murphy’s law—if something can go wrong, it will. If your customers can enter these values, one day, somebody will do so. These aren’t all happy-path cases; many require deliberate malicious intent, such as SQL injection, and won’t be entered accidentally. But so long as an input field could receive a value, your application needs to be able to handle it correctly.

Testing generic text input fields

The chances are that your application, whether desktop, mobile, or web page, has a text entry field or two, either for users to enter data or as part of an API. You can try a standard set of tests on every text field, then more specific ones depending on how a field is formatted. First, let’s list the general cases; a sketch of exercising them follows the list:

  1. A null value
  2. A blank value
  3. The maximum length value:
     A. If there isn’t a limit, enter a ridiculously long value and see how the application responds
  4. A value with a space or only comprised of spaces
  5. A value with leading and trailing spaces
  6. A value with uppercase and lowercase characters
  7. A valid entry (for example, see the Testing email text input fields section)
  8. An invalid entry
  9. A value with Unicode characters (accented characters, Mandarin script, and so on)
  10. A value with special characters (punctuation marks, such as [{('!@£")}])
  11. A value with HTML tags (<b>bold</b>)
  12. A value with JavaScript (<script>alert("Shouldn't show an alert box")</script>)
  13. A value with SQL injection (Robert'); SELECT * FROM USERS;):
      A. If you want to be evil, you can drop tables, but SELECT statements are less damaging
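Here is one way to drive those generic values through an automated check, assuming a hypothetical /items endpoint that stores and echoes back a name field; the endpoint, field names, and acceptable status codes are assumptions, and the expected behavior for each value will vary by application.

```python
import pytest    # pytest is assumed as the test runner
import requests

BASE_URL = "https://api.example.com"  # hypothetical server under test

GENERIC_TEXT_VALUES = [
    None,                              # null value
    "",                                # blank value
    "a" * 10_000,                      # ridiculously long value
    "   ",                             # only spaces
    "  padded  ",                      # leading and trailing spaces
    "MixedCase",                       # uppercase and lowercase characters
    "Élan 你好",                        # Unicode characters
    ";:!@£$%îèà{[('\")}]",             # special characters
    "<b>bold</b>",                     # HTML tags
    "<script>alert(\"x\")</script>",   # JavaScript
    "Robert'); SELECT * FROM USERS;",  # SQL injection attempt
]

@pytest.mark.parametrize("value", GENERIC_TEXT_VALUES)
def test_generic_text_input(value):
    response = requests.post(f"{BASE_URL}/items", json={"name": value}, timeout=10)
    # Whatever the outcome, the application must not crash or return a server error
    assert response.status_code < 500
    # If the value was accepted, it should be stored and returned unaltered
    if response.status_code in (200, 201):
        item_id = response.json()["ItemID"]
        stored = requests.get(f"{BASE_URL}/items/{item_id}", timeout=10)
        assert stored.json()["name"] == value
```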

In addition to those tests, certain text fields have more specific formatting conditions, such as email addresses, passwords, or numeric values, as considered next.

Testing email text input fields

For fields containing an email address, you can check all the general cases listed previously and the following in addition:

  1. The presence of a single @ symbol
  2. The presence of a . symbol after @
  3. Are email addresses with an extra . in the username treated as identical?
  4. Are email addresses with a +<anything> suffix to the username treated as identical?
  5. Does the username only contain Latin characters, numbers, and printable characters?
  6. Does the username contain a dot as the first or last character or have two consecutively?
  7. Is the username less than 64 characters?
  8. Do quotation marks work around the username?

There are more complex rules governing the precise format of allowed email addresses. For more details, see https://datatracker.ietf.org/doc/html/rfc5322.
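
As a rough illustration, the following sketch checks a few of these email cases. The is_valid_email and normalize_email functions are simplified placeholders, not a full RFC 5322 implementation, and whether addresses with extra dots or +suffixes count as identical is a product decision to confirm against your specification.

```python
# A rough sketch of email checks. is_valid_email and normalize_email are
# simplified placeholders, not a full RFC 5322 implementation, and the
# "treated as identical" policy is assumed - confirm it against your spec.
import re
import pytest

LOCAL_PART = re.compile(r"^(?!\.)[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]{1,64}(?<!\.)$")

def is_valid_email(address):
    """Placeholder syntax check: one @, a dot in the domain, a sane local part."""
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")
    if ".." in local or "." not in domain:
        return False
    return bool(LOCAL_PART.match(local))

def normalize_email(address):
    """Placeholder policy: ignore +suffixes and dots in the local part."""
    local, domain = address.lower().split("@")
    local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}"

@pytest.mark.parametrize("address, expected", [
    ("user@example.com", True),
    ("userexample.com", False),          # no @ symbol
    ("user@@example.com", False),        # two @ symbols
    ("user@example", False),             # no . after the @
    (".user@example.com", False),        # local part starts with a dot
    ("us..er@example.com", False),       # two consecutive dots
    ("x" * 65 + "@example.com", False),  # local part over 64 characters
])
def test_email_syntax(address, expected):
    assert is_valid_email(address) == expected

def test_plus_suffix_and_dots_map_to_the_same_account():
    # Whether these count as identical is a product decision, not a given
    assert normalize_email("first.last+promo@example.com") == normalize_email("firstlast@example.com")
```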

Real-world example – Signed in or not?

A new starter had trouble accessing an internal system in one place where I worked. He could sign in successfully but couldn’t access any of the tools.

Different user groups had access to different tools on that system, so I checked his username, group memberships, and group permissions, and everything was set up correctly. Everything worked fine for everyone else.

The problem turned out to be that he logged in with a capital letter in his username. Login was not case sensitive, so he could log in successfully, but the check for group membership was case sensitive, so the system didn’t think he belonged to any groups and gave him no access. Using capital letters was the key.

Testing numeric text input fields

For numeric fields, you can check the following variations in addition to the previous general cases for text input:

  1. Non-numeric characters (letters, special characters, Unicode characters, and so on)
  2. 0
  3. Negative numbers
  4. Decimals
  5. Very large values
  6. Very small values
  7. Exponents:

A. Does the textbox accept the letter e, for instance?

Separately, you need to test your application’s handling of those values, as described in the Testing numeric processing section.
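
For example, a numeric quantity field might be exercised as follows. The parse_quantity stub and its rules (whole, non-negative numbers only) are assumptions; adjust the expected results to whatever your specification allows.

```python
# A short sketch for a numeric field, assuming a hypothetical quantity box
# that accepts whole, non-negative numbers only; adjust the expectations to
# whatever your specification allows (decimals, negatives, exponents).
import pytest

def parse_quantity(text):
    """Placeholder parser: digits only - no sign, decimal point, or exponent."""
    return int(text) if text.isdigit() else None

@pytest.mark.parametrize("text, expected", [
    ("7", 7),                        # simple valid value
    ("0", 0),                        # zero
    ("abc", None),                   # non-numeric characters
    ("-3", None),                    # negative number
    ("2.5", None),                   # decimal
    ("1e6", None),                   # exponent - does the box accept the letter e?
    ("99999999999999999999", 99999999999999999999),  # very large value
])
def test_quantity_field(text, expected):
    assert parse_quantity(text) == expected
```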

Testing password text input fields

Password fields may have a simple list of required conditions, or more complex heuristics, which gauge overall complexity. In those cases, complexity in one dimension offsets the lack of sophistication in another. For instance, if the password is long enough, it may not need to include capital (uppercase) letters, whereas shorter passwords need to contain capitals (uppercase) to achieve the same security.

Password complexity heuristics are challenging to test, and you need to know the algorithm to check that it is working successfully. For passwords that follow rules, you can step through each to check it is being applied, as follows:

  1. Check the minimum length requirement
  2. Check that passwords without capitals are rejected
  3. Check that passwords without special characters are rejected:

A. And which special characters are accepted

  4. Check that passwords without numbers are rejected
  5. Check that certain text strings are rejected (for example, 1234, aaaa)
  6. Check that dictionary words are rejected

Remember to test all the previous generic text input cases, such as whether the password fields are vulnerable to SQL injection attacks and whether they accept Unicode characters. All those cases apply here as well.
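
A sketch of stepping through rule-based requirements might look like the following. The specific rules (eight or more characters, a capital, a number, a special character, no common strings) are assumed for illustration; use the rules from your own specification.

```python
# A minimal sketch of rule-based password checks. The rules (eight or more
# characters, a capital, a number, a special character, no common strings)
# are assumptions - use the rules from your own specification.
import re
import pytest

BANNED_SUBSTRINGS = ("1234", "aaaa", "password")

def password_errors(candidate):
    """Placeholder checker returning the list of rules the candidate breaks."""
    errors = []
    if len(candidate) < 8:
        errors.append("too short")
    if not re.search(r"[A-Z]", candidate):
        errors.append("no capital letter")
    if not re.search(r"[0-9]", candidate):
        errors.append("no number")
    if not re.search(r"[^A-Za-z0-9]", candidate):
        errors.append("no special character")
    if any(bad in candidate.lower() for bad in BANNED_SUBSTRINGS):
        errors.append("contains a banned string")
    return errors

@pytest.mark.parametrize("candidate, broken_rule", [
    ("Ab1!", "too short"),
    ("lowercase1!", "no capital letter"),
    ("NoNumbers!", "no number"),
    ("NoSpecial1", "no special character"),
    ("Aaaa1234!x", "contains a banned string"),
])
def test_each_rule_rejects_on_its_own(candidate, broken_rule):
    assert broken_rule in password_errors(candidate)

def test_compliant_password_is_accepted():
    assert password_errors("S3cure!Enough") == []
```

Testing one rule violation per case shows exactly which rule failed to fire, rather than only learning that some combination was rejected.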

Testing time input fields

Date and time pickers may indicate your birth date, the pickup time for a parcel, when a booking is due, or a hundred other possibilities. If your application uses this input type, the following are the interesting values:

  1. Times in the past and future
  2. Times in the far future (for example, 10 years in the future)
  3. Starting and ending at midnight
  4. Going over midnight:

A. Crossing the boundary into other weeks/months/years

  5. Daylight savings changes
  6. User time zones
  7. Changing time zones:

A. Time zone adjustments moving between days

  8. Leap seconds
  9. For appointments: all-day appointments, recurring appointments, recurring appointments with exceptions

As with the other fields, don’t just check that these values are accepted; you also need to test how the system handles them in subsequent processing.
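
The following sketch exercises a few of these partitions using the standard library’s zoneinfo data (Python 3.9+). The duration_minutes helper is a placeholder for however your system measures elapsed time; the interesting part is which boundaries each case crosses.

```python
# A small sketch using the standard library's zoneinfo data (Python 3.9+).
# duration_minutes is a placeholder for however your system measures elapsed
# time; the interesting part is which boundaries each case crosses.
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")

def duration_minutes(start, end):
    """Placeholder: elapsed wall-clock time between two aware datetimes."""
    return (end - start).total_seconds() / 60

def test_booking_over_midnight_crosses_the_year_boundary():
    start = datetime(2023, 12, 31, 23, 30, tzinfo=LONDON)
    end = start + timedelta(hours=1)
    assert (end.year, end.month, end.day) == (2024, 1, 1)

def test_daylight_savings_day_is_only_23_hours_long():
    # Clocks went forward in London on 26 March 2023, so that "day" lost an hour
    before = datetime(2023, 3, 26, 0, 0, tzinfo=LONDON)
    after = datetime(2023, 3, 27, 0, 0, tzinfo=LONDON)
    assert duration_minutes(before, after) == 23 * 60

def test_same_instant_falls_on_different_days_in_different_time_zones():
    utc_evening = datetime(2023, 6, 1, 23, 30, tzinfo=ZoneInfo("UTC"))
    in_tokyo = utc_evening.astimezone(ZoneInfo("Asia/Tokyo"))
    assert in_tokyo.day == 2   # the time zone change moves the date forward
```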

Testing user-facing textboxes

As well as the values you enter into input fields, there are different methods of entering those values. For textboxes in your application, try using them in the following ways:

  1. Copying from the textbox
  2. Pasting into the textbox
  3. Dragging and dropping to and from the textbox
  4. Selecting everything within the textbox
  5. Is the cursor displayed correctly?
  6. Is the length correctly limited?
  7. Is the character set correctly limited?
  8. Does tab correctly move you between textboxes?
  9. Does pressing Enter correctly submit the value?
  10. Is the tooltip or alternative text correct for the textbox?

Testing file uploads

Along with text inputs, many applications need the user to upload files, such as video clips, documents, or other information. If your application includes that functionality, then these are different input types to check:

  1. File types:

A. Check the content type

B. Check the file extension

  2. Smallest/largest file size
  3. Different filenames (see the list of preceding generic text inputs)

The preceding variables apply to any file type. For uploading images, consider these additional tests:

  1. Smallest/largest resolutions
  2. Unusual aspect ratios:

A. Very tall and thin

B. Very wide but short

  3. Advanced features:

A. Interlaced

B. Alpha channels

C. Animated

With all these steps, there are both positive and negative versions—we can test that all supported file types work, for instance, and that other file types are rejected. Test that your application supports up to the maximum file size and rejects those above it. There are separate checks for how these files are processed and other security risks based on their contents. For more details on those, see Chapter 9, Security Testing.
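
As a brief sketch, upload validation checks might look like the following. The rules here (PNG files only, identified by both extension and magic bytes, with an assumed 5 MB limit) are placeholders for your own specification, but they show positive and negative cases side by side.

```python
# A brief sketch of upload validation. The rules (PNG only, identified by
# both extension and magic bytes, with an assumed 5 MB limit) are placeholders
# for your own specification; positive and negative cases sit side by side.
import pytest

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
MAX_BYTES = 5 * 1024 * 1024  # assumed limit

def validate_upload(filename, content):
    """Placeholder: accept .png files whose content really is PNG and fits the limit."""
    if not filename.lower().endswith(".png"):
        return False
    if not content.startswith(PNG_MAGIC):
        return False                    # the extension and content type must agree
    return len(content) <= MAX_BYTES

@pytest.mark.parametrize("filename, content, expected", [
    ("photo.png", PNG_MAGIC + b"\x00" * 100, True),         # valid file
    ("photo.png", b"not really a png", False),              # wrong content type
    ("photo.exe", PNG_MAGIC + b"\x00" * 100, False),        # wrong file extension
    ("photo.png", PNG_MAGIC + b"\x00" * MAX_BYTES, False),  # just over the size limit
    ("naïve 图片.png", PNG_MAGIC, True),                     # Unicode filename
])
def test_upload_validation(filename, content, expected):
    assert validate_upload(filename, content) == expected
```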

Next, we consider processing these inputs, starting with tests to perform on numeric data.

Testing numeric processing

Once data has entered your system, consider how it will be processed. Are values simply entered, then displayed back to the user, or are they combined with other data for graphs and charts? Is any processing needed before data from different sources can be displayed together? This section considers the typical processing of numeric data types.

Wherever your system uses floating-point numbers, check for limitations caused by rounding. Where does the rounding occur in your design? Converting currencies is a common example, although many of these issues apply to any floating-point processing, not just currencies. The considerations when testing conversions, such as between currencies, are as follows:

  • How many decimal places does your system measure exchange rates to?
  • Which exchange rates do you use?
  • How often are exchange rates updated?
  • When exactly during a transaction is the exchange rate applied?
    • If a transaction starts on day X with exchange rate one but finishes on day Y with exchange rate two, which exchange rate is applied?
  • If transactions have multiple steps, which exchange rates are applied?
    • For instance, if a customer pays a deposit, then they pay the full amount, or if they buy a product, then they get a refund.
  • Which currencies do you support?
    • What happens if there is a transaction in an unsupported currency?
  • Are there any double conversions (from currency A to currency B, then back to currency A)?
  • Is each value rounded in the same way before display?
    • Is a value sometimes displayed rounded and sometimes not?

It can be hard to trigger rounding issues deliberately, so plan these tests carefully; it’s helpful to have a large dataset of examples to double-check against.
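
As a small, self-contained illustration of why these questions matter, the following snippet shows two classic pitfalls: binary floating point cannot represent most decimal amounts exactly, and rounding each line item gives a different total from rounding once at the end. The prices and the 20% tax rate are made-up examples.

```python
# Two classic pitfalls, shown with Python's decimal module: binary floating
# point cannot represent most decimal amounts exactly, and rounding each line
# item gives a different total from rounding once at the end. The prices and
# the 20% tax rate are made-up examples.
from decimal import Decimal, ROUND_HALF_UP

# 1. Floating point alone is not exact for money
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# 2. Where the rounding happens changes the answer
def round_2dp(amount):
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

prices = [Decimal("1.13")] * 3
TAX_RATE = Decimal("0.20")

tax_per_line = sum(round_2dp(p * TAX_RATE) for p in prices)  # 3 x 0.23 = 0.69
tax_on_total = round_2dp(sum(prices) * TAX_RATE)             # 3.39 * 0.20 = 0.678 -> 0.68

assert tax_per_line != tax_on_total   # a one-penny discrepancy a customer will notice
```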

Real-world example – Inconsistent rounding

In one company I worked for, we displayed summary values to our customers in table cells shaded in different colors to provide a visual representation. The shading was based on the unrounded value, but numbers were rounded before they were displayed. That meant sometimes the same value was shaded differently – some instances of 47 were light green, and some were dark green.

Check that values are rounded consistently before they are used in any outputs, such as user interfaces.

Having stepped through the details of different test conditions, the next sections consider how to optimize your test plan and your testing, starting with hidden defects, which have the potential to introduce some of the longest project delays.

Uncovering hidden defects

Often, testing can be carried out in parallel, with many parts of the test plan conducted simultaneously. As long as they are independent, the only limit on the number of tests you can run at once is the availability of test systems or testers to carry out the tests. When two tests interact, for instance, by requiring mutually exclusive settings, they either need separate test systems, or you need to run them serially, one after another. Testing serially is much slower, and you should avoid it wherever possible.

Another case where testing has to be run serially is when a bug blocks further testing. That first bug must be fixed before you can run tests and find bugs in the remaining functionality. Any issues that couldn’t be tested are hidden behind the first bug.

For example, if signing up new users doesn’t work, then you won’t be able to find bugs when multiple users sign up simultaneously. The bugs with multiple users are hidden behind the bug signing up new users.

Sometimes it’s obvious that part of the test plan cannot be run until a bug is fixed, as in the preceding example. If a screen on your application doesn’t load, for instance, then you can’t find any other bugs on that screen. This is why exploratory testing is so valuable—it quickly identifies any issues that block large areas of testing and might hide other defects.

At other times, it’s unclear that a particular code path isn’t being run and could hide a bug. For instance, if an application crashes when it receives an invalid API command, that needs to be fixed before you can see whether it returns the correct error to the query.

Defect hiding is also an issue during load testing. A crash that occurs after one day of load testing will obscure crashes after three days of use. The lead-up to those issues might be visible – such as increasing memory usage or performance degradation – but the crash won’t occur. Because the unit can’t run longer than one day without restarting, any longer-term issues are hidden.

To find hidden bugs, you must run a complete cycle of testing: running a build, testing it, raising a bug, fixing it, deploying that fix, and testing again. Depending on your development processes, that may take some time, so these bugs risk delaying projects and releases. The only way to speed that up is to prioritize fixing bugs that might hide others. That’s not always obvious, so keep your eyes open and use your experience.

Another area where your experience is vital is when error guessing. With limited time and resources, you must choose which areas to test most by making educated guesses about which functions are most likely to fail.

Optimizing error guessing

When designing your test plan, you need to identify the relevant inputs and variables, considering the partitions and combinations of partitions to test. You need to use your creativity to choose representative cases and examples to try. Be evil! Choose the worst possible examples, the tests that are most likely to cause breakages. What has the development team not thought of? What do they sometimes forget? What broke last time in a feature like this or in this area?

Error guessing is an official title for the ad hoc process of coming up with those test cases. It relies heavily on experience, so, like exploratory testing, it is best suited to experienced testers or those who can think up interesting new cases. Areas with previous errors are prime candidates to test with your new feature. As another of the ISTQB principles states, bugs cluster together and are not evenly spread throughout an application. By knowing about previous clusters, you can make your best guess about where to find others.

Note that the more unusual your test case, the less likely it is for a real user to encounter it, and the lower priority that bug will be. It is valuable to discover even obscure bugs in case they cause problems for users or indicate an issue that may have a more widespread effect. However, be aware that the most valuable bugs are those in the most common use cases, and bugs requiring artificially convoluted steps may not be prioritized. It is still worth trying these cases because an obscure input tests both the unusual example and the basic functionality at the same time. For instance, when you enter a non-ASCII string into a textbox, you are testing non-ASCII characters and whether that textbox works at all. That means you’re gaining extra testing compared with entering an ASCII string, as described in the Identifying worst-case scenarios section.

Error guessing is not exhaustive, and you should only use it in addition to systematic testing. It provides an extra layer of checks, or it can help inform the generation of the test plan from the feature specification. However, always carry out a rigorous test plan too. Even if an area hasn’t failed in the past and you think a defect is unlikely, carry out those tests for your new feature as well. Then, also guess where you might encounter errors.

Error guessing is also a valuable part of white-box testing, which we examine in more detail in Chapter 6, White-Box Functional Testing. By understanding the architecture and implementation of the code, you can see where possible weaknesses might appear. If there’s a calculation with a division, can you make that value zero? If two different modules have to work together, are they passing the necessary information correctly? Look for the weak spots in the implementation.

What are the known weaknesses in your system? Which functions are a perennial problem? And how do problematic features interact with the new feature you are testing now? Here are a few candidates:

  • Web browser/operating system interoperability
  • Upgrading/migrating database schema
  • Restoring from backup
  • Areas where information has to be sent from one part of the system to another
  • Localization (translation and time zone differences)
  • White-label branding
  • Logging and events
  • Errors and invalid inputs
  • Behavior under load
  • Behavior under poor network conditions

None of these will be a priority for the product management team since none are associated with delivering new user features. Still, they will affect many new features you develop, whether you want them to or not. Since these are important and regular sources of bugs, they are covered in more detail in other sections or chapters, so make sure they are part of your test plans.

Finally, on error guessing, look out for the edges of your feature. Some parts of the feature will be headline functionality, the areas that people are talking about and are of the highest value to the product management team. Those tend to be well-thought-through and have fewer issues. They require systematic testing, of course, but bugs are more likely to be in the other areas with a lower priority, which have had less thought devoted to them. These are the edges of the feature, away from its core functionality. These are areas such as the following:

  • Functions that were dropped or added during the development
  • Functions only used by a few users
  • Functions that support the main feature
  • Areas where the behavior changes between domains

Watch out for any functionality that is added or dropped (or both repeatedly!) during the development cycle. Changes like that require careful thinking through to catch all the knock-on effects. If a feature is added, you need to go back to the start of this book and work through exploratory testing, writing up the feature specification, and having it reviewed all over again. Likewise, if a feature is removed, even if the change appears obvious, you’ll need to consider all the dependent and independent variables to see which need testing again.

Real-world example – What kind of calls?

I once worked on a video conferencing system that could perform many different types of calls—into conferences, directly from person to person, to groups of users, and so on. We were making an architectural change to improve the way calling worked, which required extensive testing, and we were introducing it gradually.

Initially, we would only use the new calling architecture for meetings, not for calling from person to person, so we concentrated our testing there. However, attendees could add other users to the meeting. That was a person-to-person call, so it opened up a whole new test plan. What if the people had called each other before, if they had never called each other, or if they were hosted on a different part of the system?

A seemingly small piece of functionality – the ability to add participants to conferences – greatly increased the necessary testing. We didn’t spot that initially, which caused issues when we started testing it later.

Parts of the feature used by only a small percentage of users or that only exist to support the main feature (such as adding participants to conferences) are also areas that may have received less thought and may harbor issues. They can be difficult to spot since, by definition, they are considered less. It’s your job to divide all functionality in a binary way—either it’s fully implemented and tested or entirely excluded. Either way, you need to think through the consequences. Bugs hide in gray areas, so make everything black and white.

Look for edges in terms of functions, too—the oldest supported browser or the newest. What is the complete list of supported browsers and operating systems? Which is used least? These will be a lower priority for testing because you should concentrate on the most popular cases, but issues are more likely to lurk in the rarer ones.

Experienced testers should perform error guessing because they are most likely to know the weak spots in a system. They gain that experience by paying careful attention to the areas where they’ve found bugs in the past, categorizing them, and recalling them for future use. That vital skill is using feedback in your testing, which is described next.

Using feedback

Test plans are not static documents carved in stone to stand for all time. They should be dynamic, living texts that evolve as you learn more about your features and products. Recall the testing spiral from Chapter 1, Making the Most of Exploratory Testing. Even after the detailed test plan is complete, you need further testing, specification, and discussion cycles to refine and improve your checks.

Those refinements can take several forms. Most benignly, you may need to add details you are happy with to the specification. Perhaps a particular input case or UI element had been missed from the document. If you find it during testing and it works as intended, all you need to do is write it up.

In my experience, such lucky coincidences are rare. If an element was missed from the specification, it’s unlikely a developer will implement it exactly as the product owner intended. Most feedback is, unfortunately, negative—either bugs where the feature doesn’t meet the specification or surprises in areas on which the requirements are silent.

Failure to meet the specification is simple enough. If the product doesn’t match one of the requirement statements, you raise a bug. What you should look out for are themes within the bugs. Identify which features suffer from many bugs and delays and test them even more. As noted previously, bugs cluster together, so the presence of several in the same place suggests there may be more.

Look out for trends between features. The Optimizing error guessing section lists common areas of functionality, such as localization and restoring from a backup, that can be difficult to implement and may interact with many other features. You need to generate that list for your product. What areas consistently cause problems? Ideally, record the areas in which bugs are found in your bug tracking system so you can systematically log what causes issues. Otherwise, you’ll have to rely on memory, which may make specific issues stand out while obscuring more significant trends.

If the feature specification fails to describe the behavior, then you will need a discussion with the product owner to determine what the behavior should be, then you’ll need to update both the specification and the code. Does that omission suggest anything else that the requirements fail to describe? Maybe there is a class of inputs that hadn’t been considered or a whole set of interactions. For each issue you find in the specification, see how it can be generalized.

Also, watch out for bugs that indicate whole classes of potential issues. For instance, if a web UI fails to display one error message, does it show any of them correctly? If an API mishandles one invalid input, does it break with other invalid inputs?

This is all vital information to feed into your test plan to generate other test cases. Follow your nose. If one area performs brilliantly and you can find no significant holes in its specification or functionality, then leave it alone. Concentrate on the parts that have shown problems and haven’t been thought through. Those are the places to spend your time.

To spend more time on an area, you can choose other examples from within equivalence partitions, be strict about testing every supported platform rather than just picking the newest and oldest, and check that area’s interactions with other apparently independent features. Usually, those additional steps wouldn’t be necessary, but if a feature is struggling with bugs, it’s worth spending the extra time to try to find more.

The other way to increase test coverage in a problematic area is to perform finer-grained checks on the product behavior as you use it. Logs, system interfaces, and monitoring systems provide vital feedback you should use throughout your testing, as described in the next section.

Determining what to check

So far in this chapter, we have considered many different tests to run, but we haven’t covered the vital next step in the test process—what should you check to ensure the test passed? There are many levels to consider. Most superficially, you can tell whether the application continues to work as described. In our example of a signup web page, does the user receive an email after entering their address? For everything written and saved, you can check where it is subsequently displayed to the user. Is it correct in all those places?

For changes that transition your program into a new state, does it make that transition correctly in all cases? For inputs that trigger other changes in the application, do you have a complete list of those changes so you can check them all?

The feature specification should include the user experience for each output, so you’ll need to refer back to that extensively for these checks. If anything is missing, for instance, the wording of error messages, you need to get them added and reviewed.

Another part of error guessing is knowing what to check. Perhaps there is a part of your product that has many dependencies and is liable to break due to changes elsewhere. Keep a note of those areas and routinely check them when there are changes, even apparently unrelated ones.

As well as customer-visible effects, you should also check other layers within the system; for example, were there any errors in the logs? For changes that result in stored data being changed, check that data – are the fields in the database being correctly populated? This requires white-box testing and is described in Chapter 6, White-Box Functional Testing. You’ll need to work with the developers to find out exactly which database fields should change and which log lines should be written, so you can spot any that are missing or incorrect.

So long as the logging reports issues, this can be the best way to catch unexpected behavior. Within automated tests, there should be a routine check of the relevant logs for any errors or warnings; their presence means the test has failed, even if the behavior otherwise appears correct. Have a zero-tolerance approach to errors in the logs. As described in Chapter 10, Maintainability, every log error needs a code change, even if only to downgrade the warning.
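
A minimal sketch of that zero-tolerance check might scan the log output captured during a test and fail on any error or warning, with an explicit allow-list for messages that have been reviewed and accepted. The log format here is an assumption; point it at your own services’ logs.

```python
# A minimal sketch of a zero-tolerance log check: scan the log output captured
# during a test and fail on any error or warning, with an explicit allow-list
# for messages that have been reviewed and accepted. The log format is assumed.
import re

SUSPICIOUS = re.compile(r"\b(ERROR|WARN|WARNING)\b")

def assert_no_errors_in_logs(log_text, allowed_patterns=()):
    """Fail if any log line looks like an error, unless it matches a reviewed exception."""
    offending = [
        line for line in log_text.splitlines()
        if SUSPICIOUS.search(line)
        and not any(re.search(pattern, line) for pattern in allowed_patterns)
    ]
    assert not offending, "Unexpected errors in logs:\n" + "\n".join(offending)

# Example: the behavior looked correct, but the logs say otherwise
captured = (
    "2023-06-01 12:00:01 INFO user created\n"
    "2023-06-01 12:00:02 ERROR template missing"
)
try:
    assert_no_errors_in_logs(captured)
except AssertionError as failure:
    print(failure)   # a real test would fail here and flag the hidden problem
```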

Real-world example – Correct rejections for the wrong reason

In one company, we implemented Session Initiation Protocol (SIP), a public protocol for making video calls. We went for a penetration test in which another company would send all manner of invalid messages to check that we correctly rejected them.

They ran the tests, and the results were great—we rejected all the messages. No matter what invalid input we received, none got through. However, looking at the logs, we saw that one of the early messages had sent our application into an error state. It would reject everything after that, even valid requests. We were rejecting invalid calls, but not for the right reason.

We had a bug, and the test was invalid. They needed to change their test procedure to include some valid messages amongst the failures to check that they hadn’t completely broken the system under test, and we needed to fix our bug, which only the logs had indicated.

Remember to watch the logs as you test. However, you’ll need to specify which logs those are. In a system of any size, there are likely to be many processes writing logs at different levels, and knowing which ones are relevant is often far from obvious. One of your first tasks is to identify important logs so that you can watch them all.

Load testing also presents a challenge for checking that your tests are working. Under load conditions, there will be a lot of activity and logging. It’s best to track simple counts of actions successfully completed and overall system health to avoid being overloaded with data. For further details, see Chapter 12, Load Testing.

Finally, keep your eyes open for any other unexpected changes based on your input. This extends the preceding Performing negative testing section – everywhere you can check to ensure a transition is completed successfully, you can also check to ensure it correctly left other elements unchanged. If a code change has unintended consequences, it could result in failures in apparently unrelated parts of the product. That will depend on how well your product is architected to separate different parts of its functionality.

Real-world example – The identical speed tests

One application I tested involved software endpoints connecting to our cloud infrastructure. When the endpoints connected, they measured the network and reported the bandwidth results so we could see the quality of service they could expect.

After one upgrade, the endpoints reconnected successfully, and the system appeared to recover perfectly. However, my colleague noticed that all the endpoints reported precisely the same network bandwidth. A bug in the reporting meant that the endpoint threw away the real result, and they all showed a default value instead.

Feature testing had missed that change, and we would have missed it in live usage too, except for the sharp eyes of my colleague. There was no written test for that, no procedure to say what to look for. You just have to keep your eyes open and follow up on anything where you think to yourself, “that’s odd…”.

Real-world example – The video glitch

My favorite example of an unrelated bug was in a video unit we sold. This was a hardware appliance for performing video conferencing, which customers would buy and own themselves. I sometimes noticed video glitches but couldn’t reproduce the issue reliably.

Eventually, after a lengthy investigation, I finally spotted the cause—when I made configuration changes on the unit, that caused the video to become corrupted briefly. The action of saving changes to the internal memory interrupted the processing of the media, and in a real-time system, even brief delays were immediately visible.

Such unrelated issues are tough to catch. An experienced tester performing manual testing is most likely to spot an unexpected interaction, but that is the most expensive form of testing. Inexperienced testers may not notice issues like that, and they are challenging to write automated tests for since they only perform the checks you have designed.

These examples demonstrate that checks in automated tests should be as broad as possible. In terms of architecture, checks for the system state should be in a separate module that can be improved independently of the tests. When you add a check for a new state, all tests should be able to take advantage of that, not just the one you wrote it for. Those additional checks will take time, so you may need to prioritize them if they slow down test runs. However, err on the side of including checks, even if they appear superfluous, as they are the only way to catch such unexpected interactions.
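
Architecturally, that can be as simple as a shared module of checks that every test calls after its own assertions. In this sketch, the system object and its probe methods are hypothetical placeholders; the point is that a new check added here benefits every test at once.

```python
# A sketch of a shared checks module that every automated test calls after its
# own assertions. The system object and its probe methods are hypothetical
# placeholders; the point is that a new check added here benefits every test.

def check_no_log_errors(system):
    assert not system.recent_log_errors(), "errors found in logs"

def check_database_consistent(system):
    assert system.orphaned_rows() == 0, "orphaned database rows"

def check_no_stuck_processes(system):
    assert not system.unresponsive_processes(), "unresponsive processes"

STANDARD_CHECKS = [
    check_no_log_errors,
    check_database_consistent,
    check_no_stuck_processes,
]

def verify_system_health(system):
    """Call this at the end of every test; add new checks here, not per test."""
    for check in STANDARD_CHECKS:
        check(system)
```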

When designing checks for your test plan, curiosity is key, just as it was when creating the tests themselves. Some checks can be written into the test plan, and you can add as much detail as possible on the outcomes and user experience for a given change. Other bugs are due to unrelated changes in a part of the system that should be unchanged. You can’t check the whole system after every test step, but automated tests should have extensive checking, even if it seems unnecessary. For manual testing, keeping your eyes open and noticing unusual events is essential.

While checking is vital to your testing, it can cause problems in automated tests. If tests check too closely, they can become brittle and prone to breaking after safe, correct changes. That results in false-positive failures and excessive maintenance, so it’s essential to get it right, as explained in the next section.

Trading off precision versus brittleness in automated testing

In manual testing, the checks are only as precise as the test plan specifies and the tester’s experience allows. In automated testing, you have to decide the level of detail of the checks you perform, and there is a trade-off between the precision of your tests and how susceptible they are to breaking in the future, as shown here:

Figure 5.5 – Trading off precision versus resilience in tests

The more precise and detailed you make your tests, the slower and more brittle they will be, liable to break due to small or inconsequential changes. However, in making them faster and more resilient, they will generally become more vague or superficial. While there is no escaping that trade-off, some guidelines can help.

For system testing, you can apply the same rules as you did for the specification – test the behavior, not the implementation. The internal messages, the database values, and the application state don’t matter, so long as the end result is correct. That lets the development team completely re-architect the code if they choose, and the automated tests should keep passing. That is a valuable use case for automated testing: verifying that a code refactor has left the behavior unchanged.

Automated testing requires a very controlled environment. Unexpected inputs result in tests sometimes failing, known as flakes, which obscure genuine problems with the code. It’s vital to keep your tests as simple as possible, with as few dependencies as possible. Ensuring automated testing is independent of the implementation removes its main dependency, making for more resilient tests.

While some aspects of a program are clearly implementations – such as database tables – and others are clearly outputs – such as screens and web pages – others are intermediate. The HTML for a web page, for instance, is an output of the code but produces the final look of the screen. Should that be classed as implementation and ignored or as output and tested?

While individual cases may vary, a rule of thumb is only to test the final outputs. That is what your users will see, so that is where you should concentrate your testing. The development team is free to change any stages in generating that final output, which shouldn’t affect the tests. For instance, very different HTML could produce the same screen, so the tests should focus on the result, not how it was achieved.

There may be exceptions to that rule if an interface is crucial to demonstrate that a system is working; integration tests, for instance, isolate the interfaces between different program modules to perform testing there. However, for system tests, the interfaces are the outputs of the application overall, so testing should focus on them.

The tests themselves can also have different levels of specificity. Should a test check for the presence of text on a screen or its location, for instance? How strict or lax should timeouts be? What tolerance should there be on bandwidth measurements? For each of these, the temptation when writing the test is to be very strict and then make them looser, as intermittent failures cause unnecessary work and investigation. This is a pragmatic approach since it leaves checks strict wherever possible and adds variation where needed. Ideally, these values are explicit in the specification, so you can set your tests to check them.
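
One way to manage that trade-off is to make the strictness an explicit, named value rather than a number buried inside one test. In this sketch, the five-second budget and the 10% bandwidth tolerance are assumed figures that would ideally come straight from the specification; loosening them then becomes a deliberate, reviewable change.

```python
# A sketch of making strictness explicit. The five-second budget and the 10%
# bandwidth tolerance are assumed figures that would ideally come straight
# from the specification; loosening them is then a deliberate, visible change.
import math

PAGE_LOAD_BUDGET_SECONDS = 5.0   # start strict; loosen only if flakes prove it necessary
BANDWIDTH_TOLERANCE = 0.10       # accept measurements within 10% of nominal

def assert_within_budget(measured_seconds, budget=PAGE_LOAD_BUDGET_SECONDS):
    assert measured_seconds <= budget, f"took {measured_seconds:.1f}s, budget {budget:.1f}s"

def assert_bandwidth_close(measured_mbps, nominal_mbps, tolerance=BANDWIDTH_TOLERANCE):
    assert math.isclose(measured_mbps, nominal_mbps, rel_tol=tolerance), (
        f"measured {measured_mbps} Mbps, expected {nominal_mbps} Mbps within {tolerance:.0%}"
    )

# Usage: the tolerances live in one place rather than being buried in each test
assert_within_budget(3.2)
assert_bandwidth_close(9.3, 10.0)
```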

Another approach is to add more detail in problematic areas. The location of text can be checked by comparing screen grabs, for instance, and various services are available for that. For timeouts, you can break down the time taken at each step to see if any are longer than expected. You can add instrumentation to the generation and transmission of data when measuring bandwidth limits. All of that takes extra work, so at this point, you need to trade off the specificity of your tests with the resources you have available to build and maintain them. That detailed testing is possible at a price.

If the maintenance burden of your tests is too high, you will need to revisit your approach. Specific tests or certain checks within those tests may need to be disabled or quarantined to ensure tests pass reliably. Check your tests aren’t dependent on the implementation or external factors, then examine which aspects of the tests are most important. Focus your checks there, and leave the others alone.

Brittleness shows the danger of having too many tests or making them too strict. That is an example of test prioritization, which is considered in more detail next.

Test prioritization

The ideas in this section, and all the following chapters of this book, describe a gold-plated, exhaustive test plan that will comprehensively test your product or new feature. You should definitely consider all the test cases described here, but that doesn’t mean you should perform them all. You will be constrained by the available time and resources, which will vary depending on how critical the software is.

Test management is beyond the scope of this book, for instance, how to organize a test team, which roles should perform which tasks, how testing should fit into release cycles, and many other considerations. Here, I am solely concerned with designing test plans and the tests to run. That means I will list many tests you should consider, many more than you will be able to run in practice. You must choose which to carry out.

Even when you decide not to run certain tests, it’s important to be aware of those you have left out and have consciously chosen to skip. That lets you gauge the risk of each release far more accurately than patchy test coverage with no clear measure of which checks you are missing. Those tests should be documented so you can choose to run more if the risks increase, for instance, because you are making changes in that area of code.

Conversely, writing too many tests can result in an unmaintainable system where there are so many test failures that you can’t see real bugs. As we saw in Chapter 4, Test Types, Cases, and Environments, there is a testing pyramid where you aim to have many unit tests, fewer integration tests, and the fewest system tests.

Which system tests you should run will depend on your system, but the following list presents guidelines that apply to many systems:

  • Test all the core happy-path cases through your system
  • Once a feature is working, test the worst-case scenarios to ensure tests are as efficient as possible
  • Concentrate testing in areas where you’ve found bugs before; they are most likely to break again
  • Have different levels of tests, some you can run on every change and a large set to be run nightly or on demand
  • Perform extensive checks on logs, database state, user interface behavior, and so on for all the tests you run

I can provide no more than guidance in this case; you will have to decide carefully exactly which tests to run on your system. Another gray area is where people’s expectations about product behavior differ. That is the fine line between bugs and feature requests, which is examined next.

Comparing bugs and features

It can be difficult to distinguish between bugs and feature requests in software applications. Despite your best efforts to clarify the feature specification (see Chapter 2, Writing Great Feature Specifications), it is easy for ambiguity to creep in. Within those gray areas, your assumptions about what functionality should work may differ from the implementation.

For instance, sometimes new features aren’t initially available in all situations. Maybe you can create users on your system and have recently added an API, but for now the API is read-only, with no command to add a user. APIs usually have a precise specification stating which calls have been implemented, but other interactions between features are less clear.

Another problematic area is what should happen under degraded conditions. On poor networks, what quality of service is acceptable? On low-resolution screens, how should the user interface adapt? Beware of your assumptions here; while you might expect worse performance under suboptimal conditions, the quality you expect might differ from what the product owner had in mind. You can end up raising invalid bugs about poor performance that is actually expected or, worse, accepting degraded performance that should be better. This needs a detailed explanation in the feature specification; if you don’t have those details, discuss them with the developers and product owners to make the expectations clear.

These gray areas are why you must keep asking questions and refining the specification as you test. If your checks are sufficiently detailed, you should find many questions to ask as you go along.

Summary

In this chapter, you have learned the key considerations when performing black-box functional system testing. This is the core of the testing since it ensures that all the main behavior of this new feature is correct, and further testing will build on these results. If the feature doesn’t work correctly even once, then you can’t check the complete user experience or its behavior under load.

Black-box testing needs to systematically cover the entire feature specification while also extending it by trying out ad hoc cases that weren’t explicitly listed. You should also maintain a sense of curiosity to see what happens as you step through the feature’s different states. You find all the best bugs when you go off the test plan, so always extend the cases as written to try new combinations and check other outputs.

This chapter reviewed the various ways to approach black-box testing, from API and CRUD testing, negative testing, and identifying worst-case scenarios to equivalence partitioning and boundary analysis, which apply over a wide range of settings. When dividing test cases into equivalent partitions, you must choose which particular case to trial. This can be informed by your previous experience, finding the known weak spots in your system since bugs tend to cluster together.

This chapter provided standard equivalence partitions for some common data types. It stressed the use of feedback and error guessing to let your application guide where you could concentrate your testing and finished by considering the precision of your checks, test prioritization, and what should count as a bug or a feature.

With the black-box testing complete, you can gain a lot of confidence that a feature is usable, and you can consider giving it to internal users or beta testers. The happy path and working cases have been verified, but the testing is far from over. Black-box testing will always be limited as it does not consider how the feature was implemented, which is described in the next chapter on white-box testing.
