4 Quality Characteristics for Technical Testing

Quality in a product or service is not what the supplier puts in. It is what the customer gets out and is willing to pay for. A product is not quality because it is hard to make and costs a lot of money, as manufacturers typically believe. This is incompetence. Customers pay only for what is of use to them and gives them value. Nothing else constitutes quality.

–Peter F. Drucker

The fourth chapter of the Advanced Level Syllabus – Technical Test Analyst is primarily concerned with non-functional testing as defined in ISO 9126. In the previous ISTQB Advanced syllabus, we called these topics quality attributes.

This chapter has four common learning objectives that apply to each applicable section. We list them once here rather than repeating them in every section.

Common Learning Objectives

TTA-4.x.1 (K2) Understand and explain the reasons for including maintainability, portability, and resource utilization tests in a testing strategy and/or test approach.

TTA-4.x.2 (K3) Given a particular product risk, define the particular non-functional test type(s) which are most appropriate.

TTA-4.x.3 (K2) Understand and explain the stages in an application’s life cycle where non-functional tests should be applied.

TTA-4.x.4 (K3) For a given scenario, define the types of defects you would expect to find by using non-functional testing types.

Beyond the common learning objectives, we shall cover seven sections, as follows:

1. Introduction

2. Security Testing

3. Reliability Testing

4. Efficiency Testing

5. Maintainability Testing

6. Portability Testing

7. General Planning Issues

4.1 Introduction

Learning objectives

Recall of content only.

In this chapter, we will discuss testing many of the various quality characteristics with which technical test analysts (TTAs) are expected to be familiar. While most of the topics we will discuss fall under the heading of nonfunctional testing, ISTQB also adds security testing, which ISO 9126 considers a subcharacteristic of functionality. Security testing often falls to the TTA because of its intrinsically technical nature.

Many of the test design techniques used in nonfunctional testing are discussed in the ISTQB Foundation syllabus. Other design techniques and a deeper discussion of the techniques introduced in the Foundation syllabus can be found in the companion book to this one, Advanced Software Testing, Vol. 1. A common element between the test analyst and technical test analyst specialties is that both require an understanding of the quality characteristics that really matter to our customers in order to recognize typical risks, develop appropriate testing strategies, and specify effective tests.

As shown in Table 4–1, the ISTQB Advanced syllabi assign certain topics to test analysts and others to technical test analysts. Although ISO 9126 has been superseded by the ISO 25000 series, ISTQB has decided to continue using ISO 9126 as a reference because the changes do not affect our testing discussion and ISO 9126 is still widely used in the testing community.

Table 4–1 ISO 9126 testing responsibilities

Image

Since we will be covering these topics in this chapter, it is important to understand exactly what the ISO 9126 standard entails. The standard consists of four separate parts:

1. The first part, ISO 9126-1, is the quality model itself, enumerating the six quality characteristics and the subcharacteristics that go along with them.

2. ISO 9126 Part 2 references external measurements that are derived from dynamic testing. Sets of metrics are defined for assessing the quality subcharacteristics of the software. These external metrics can be calculated by mining the incident tracking database and/or making direct observations during testing.

3. ISO 9126 Part 3 references internal measurements that are derived from static testing. Sets of metrics are defined for assessing the quality subcharacteristics of the software. These measurements tend to be somewhat abstract because they are often based on estimates of likely defects.

4. ISO 9126 Part 4 is for quality-in-use metrics. This defines sets of metrics for measuring the effects of using the software in a specific context of use. We will not refer to this portion of the standard in this book.

Each characteristic in ISO 9126 contains a number of subcharacteristics. We will discuss most of them. One specific subcharacteristic that appears under each main characteristic is compliance. This subcharacteristic is not well explained in ISO 9126: in each case, it is a measurement of how compliant the system is with applicable regulations, standards, and conventions, but while the standard offers a few metrics to measure this compliance (and we will discuss those), it does not say which regulations, standards, and conventions it refers to. Since this is an international standard, that should not be surprising.

For each compliance subcharacteristic, ISO 9126 has almost exactly the same text (where XXXX is replaced by the quality characteristic of your choice):

Internal (or external) compliance metrics indicate a set of attributes for assessing the capability of the software product to comply to such items as standards, conventions, or regulations of the user organization in relation to XXXX.

It is up to each organization, and to its technical test analysts, to determine which federal, state, county, and/or municipal standards apply to the system they are building. Because these can be quite diverse, we will emulate ISO 9126 and not discuss compliance further beyond the metrics mentioned earlier.

ISO 9126 lists a number of metrics that could be collected from both static (internal) and dynamic (external) testing. Some of the internal metrics are somewhat theoretical and may be problematic for less mature organizations (for example, how many bugs were actually found in a review compared to how many were estimated to be there). Many of these metrics would require a very high level of organizational maturity to track. For example, several of them are based on the “number of test cases required to obtain adequate test coverage” from the requirements specification documentation. What is adequate coverage? ISO 9126 does not say. That is up to each organization to decide. It is Jamie’s experience that the answer to that question is often vaporous.

Question: What is adequate? Answer: Enough!1

Jamie and Rex have followed different career paths and have different experiences when it comes to static testing. While Jamie has worked for some organizations that did some static testing, they were more the exception than the rule. The ones that did static testing did not track many formal metrics from that testing. Rex, on the other hand, has worked with far more organizations that have performed static testing and calculated metrics from the reviews (what ISO 9126 refers to as internal metrics).

After each section, we will list some of the internal and external metrics to which ISO 9126 refers. We are not claiming that these will always be useful to track; each organization must decide which metrics it believes will give it value. We are listing them because the authors of the ISO 9126 standard believed they could add some value to some organizations some of the time, and we believe they may add something meaningful to the discussion.

4.2 Security Testing

Learning objectives

TTA-4.3.1 (K3) Define the approach and design high-level test cases for security testing.

Security testing is often a prime concern for technical test analysts. Because security risks are so often hidden, subtle, or side effects of other characteristics, we should put special emphasis on testing for them.

Typically, other types of failures have symptoms that we can find, through either manual testing or the use of tools. We know when a calculation fails—the erroneous value is patently obvious. Security issues often have no symptoms, right up until the time a hacker breaks in and torches the system. Or, maybe worse, the hacker breaks in, steals critical data, and then leaves without leaving a trace. Ensuring that people can’t see what they should not have access to is a major task of security testing.

The illusion of security can be a powerful deterrent to good testing since no problems are perceived. The system appears to be operating correctly. Jamie was once told to “stop wasting your time straining at gnats” when he continued testing beyond what the project manager thought was appropriate. When the system was later hacked, the first question on that manager’s lips was how Jamie had missed that security hole.

Another deterrent to good security testing is that reporting bugs and the subsequent tightening of security may markedly decrease perceived quality in performance, usability, or functionality; the testers may be seen as worrying too much about minor details, hurting the project.

Not every system is likely to be a target, but the problem, of course, is trying to figure out which ones will be. It is simple to say that a banking system, a university system, or a business system is going to be a target. But some systems might be targeted for political reasons or monetary reasons, and many might just be targeted for kicks. Vandalism is a growth industry in computers; some people just want to prove they are smarter than everyone else by breaking something with a high profile.

4.2.1 Security Issues

We are going to look at a few different security issues. This list will not be exhaustive; many security issues were discussed in Advanced Software Testing: Volume 1 and we will not be duplicating those. Here’s what we will discuss:

  • Piracy (unauthorized access to data)

  • Buffer overflow/malicious code insertion

  • Denial of service

  • Reading of data transfers

  • Breaking encryption meant to hide sensitive materials

  • Logic bombs/viruses/worms

  • Cross-site scripting

As we get into this section, please allow us a single global editorial comment: security, like quality, is the responsibility of every single person on the project. All analysts and developers had better be thinking about it, not as an afterthought, but as an essential part of building the system. And every tester at every level of test had better consider security issues during analysis and design of their tests. If every person on the project does not take ownership of security, our systems are just not going to be secure.

4.2.1.1 Piracy

There are a lot of ways that an intruder may get unauthorized access to data.

SQL injection is a hacker technique that causes a system to run an SQL query where one is not expected. Buffer overflow bugs, which we will discuss in the next section, may allow this, but so might intercepting an authorized SQL statement on its way to a web server and modifying it. For example, a query is sent to the server to populate a certain page, but a hacker modifies the underlying SQL to get other data back.
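
To make the mechanism concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table, column names, and payload are hypothetical. It shows how concatenating user input into the SQL text lets an attacker rewrite the query, while binding the input as a parameter keeps it inert data. Most DBMS client libraries offer an equivalent parameter-binding facility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 250.0)")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable: the input is concatenated into the SQL text, so the payload
# rewrites the WHERE clause and the query returns every row in the table.
vulnerable = conn.execute(
    "SELECT owner, balance FROM accounts WHERE owner = '" + user_input + "'"
).fetchall()
print("vulnerable query returned:", vulnerable)

# Safer: the input is bound as a parameter, so it is treated purely as data
# and this query matches nothing.
parameterized = conn.execute(
    "SELECT owner, balance FROM accounts WHERE owner = ?", (user_input,)
).fetchall()
print("parameterized query returned:", parameterized)
```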

Passwords are really good targets for information theft. Guessing them is often not hard; organizations may require periodic update of passwords and the human brain is not really built for long, intricate passwords. So users tend to use patterns of keys (q-w-e-r-t-y, a-s-d-f, etc.). Often, they use their name, their birthday, their dog’s name, or even just p-a-s-s-w-o-r-d. And, when forced to come up with a hard-to-remember password, they write it down. The new password might be found just underneath the keyboard or in their top-right desk drawer. Interestingly enough, Microsoft published a study that claims there is no actual value and often a high cost for changing passwords frequently.2 From their mouths to our sysadmin’s ear...

The single biggest security threat is often the physical location of the server. A closet, under the sysadmin’s desk, or in the hallway are all places where we have seen servers, leaving them accessible to whoever happens to be passing by.

Temp (and other) files and databases can be targets if they are unencrypted. Even the data in the EXE or DLL files might be discovered using a common binary editor.

Good testing techniques include careful testing of all functionality that can be accessed from outside. When a query is received from beyond the firewall, it should be checked to ensure that it is accessing only the expected area in the database management system (DBMS). Strong password control processes should be mandatory, with regular testing to make sure they are enforced. All sensitive data in the binaries and all data files should be encrypted. Temporary files should be obfuscated and deleted after use.

4.2.1.2 Buffer Overflow

It seems like every time anyone turns on their computer nowadays, they are requested to install a security patch from one organization or another. A good number of these patches are designed to fix the latest buffer overflow issue. A buffer is a chunk of memory that is allocated by a process to store some data. As such, it has a finite length. Problems occur when the data to be stored is longer than the buffer. If the buffer is allocated on the stack and the data is allowed to overrun the size of the buffer, important information kept on the stack might also get overwritten.

When a function in a program is called, a chunk of space on the stack, called a stack frame, is allocated. Various forms of data are stored there, including local variables and any statically declared buffers. At the bottom of that frame is a return address. When the function is done executing, that return address is picked up and the thread of execution jumps there. If the buffer overflows and that address is overwritten, the system will almost always crash because the next execution step is not actually executable code. But suppose a hacker overflows the buffer skillfully, determining exactly how far to overflow it so that the return address is overwritten with a pointer to their own code? When the function returns, rather than going back to where it was called from, it goes to the line of code the hacker wants to run. Oops! You just lost your computer.

Anytime you have a data transfer between systems, buffer overflow can be a concern—especially if your organization did not write the connecting code. A security breach can come during almost any DLL, API, COM, DCOM, or RPC call. One well-known vulnerability was in the FreeBSD setlocale() function in the libc library.3 Any program calling it was at risk of a buffer overflow bug.

All code using buffers should be statically inspected and dynamically stressed by trying to overflow it. Testers should investigate all available literature when developers use public libraries. If there are documented failures in the library, extra testing should be put into place. If that seems excessive, we suggest you check out the security updates that have been applied to your browser over the last year.
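
One way a tester can apply that advice from the outside, without source access, is to probe an interface with progressively larger inputs and watch for the point at which behavior changes. The sketch below is a minimal example of such a probe; the host, port, message format, and assumed 256-byte field limit are all hypothetical.

```python
import socket

HOST, PORT = "test-server.example.com", 7000  # hypothetical service under test

def probe(payload: bytes, timeout: float = 5.0) -> bool:
    """Send one payload; return True if the service answers, False if it
    drops the connection, refuses it, or times out (possible overflow symptoms)."""
    try:
        with socket.create_connection((HOST, PORT), timeout=timeout) as sock:
            sock.sendall(payload + b"\n")
            return bool(sock.recv(1024))
    except OSError:  # covers timeouts, resets, and refused connections
        return False

# Grow the payload past the documented maximum field length (assumed 256 here)
# and well beyond, watching for the first size at which behavior changes.
for size in (255, 256, 257, 1024, 65536):
    ok = probe(b"A" * size)
    print(f"payload of {size} bytes -> {'answered' if ok else 'NO RESPONSE'}")
```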

4.2.1.3 Denial of Service

Denial of service attacks are sometimes simply pranks pulled by bored high schoolers or college students who band together and try to access the same site so intensively that other users cannot get in. Often, however, these attacks are planned and carried out with overwhelming force. For example, recent military invasions around the world have often started with well-planned attacks on military and governmental computer facilities.4 Unknown perpetrators have also attacked specific businesses by triggering denial of service attacks from “bots,” zombie machines that were taken over through successful virus attacks.

The intent of such an attack is to cause severe resource depletion of a website, eventually causing it to fail or slow down to unacceptable speeds.

A variation on this is a single HTTP request crafted to contain thousands of slashes, causing the web server to spin its wheels trying to decode the URL.

There is no complete answer to preventing denial of service attacks. Validation of incoming requests can prevent the thousands-of-slashes attack. For the most part, the best an organization can do is try to make the server and website as efficient as possible. The more efficiently a site runs, the harder it will be to bring it down.
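
As a hedged illustration of that kind of input validation, the sketch below applies cheap structural checks to a request URL before any expensive decoding is attempted; the limits shown are illustrative, not recommendations.

```python
from urllib.parse import urlparse

MAX_PATH_LENGTH = 2048    # illustrative limits; tune for your application
MAX_PATH_SEGMENTS = 50

def is_sane_request_path(url: str) -> bool:
    """Cheap structural checks applied before any expensive URL decoding."""
    path = urlparse(url).path
    if len(path) > MAX_PATH_LENGTH:
        return False
    if path.count("/") > MAX_PATH_SEGMENTS:
        return False
    return True

assert is_sane_request_path("https://example.com/catalog/item/42")
assert not is_sane_request_path("https://example.com/" + "/" * 10000)
```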

4.2.1.4 Data Transfer Interception

The Internet is built on protocols whereby multiple computers transfer data to one another. Many of the IP packets that any given computer is passing back and forth are not meant for that computer itself; the computer is just acting as a conduit for the packets to get from here to there. The main protocol used on the Web, HTTP, does not encrypt the contents of the packets. That means that an unscrupulous organization might actually save and store the passing packets and read them.

All critical data that your system is going to send over the Internet should be strongly encrypted to prevent peeking. Likewise, the HTTPS protocol should be used if the data is sensitive.

HTTPS is not infallible. It essentially relies on trust in a certificate authority (VeriSign, Microsoft, or others) to tell us whom we can trust, and it encrypts the traffic between client and server. However, it is considerably more secure than HTTP.
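
These transport-level expectations can be checked automatically. The following sketch assumes the third-party requests library is available and uses a hypothetical URL; it verifies that a plain HTTP request is redirected to HTTPS and that any session cookies set over HTTPS carry the Secure flag.

```python
import requests  # third-party library, assumed to be installed

SITE = "http://www.example.com/login"  # hypothetical page under test

# 1. A plain HTTP request should be redirected to an HTTPS URL.
response = requests.get(SITE, allow_redirects=False, timeout=10)
assert response.status_code in (301, 302, 307, 308), "no redirect issued"
assert response.headers.get("Location", "").startswith("https://"), \
    "redirect does not switch to HTTPS"

# 2. Session cookies set over HTTPS should carry the Secure flag so the
#    browser never sends them over an unencrypted connection.
secure_response = requests.get(SITE.replace("http://", "https://"), timeout=10)
for cookie in secure_response.cookies:
    assert cookie.secure, f"cookie {cookie.name!r} is not marked Secure"
```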

4.2.1.5 Breaking Encryption

Even with encryption, not all data will be secure. Weak encryption can be beaten through brute force attacks by anyone with enough computing power. Even when encryption is strong, the key may be stolen or accessed, especially if it is stored with the data. Because strong encryption tends to be time intensive, the amount and type used in a system may be subject to performance trade-offs. The technical test analysts should be involved in any discussions as to those design decisions.

Because encryption is mathematically intense, it usually can be beaten by better mathematical capabilities. In the United States, the National Security Agency (NSA) often has first choice of graduating mathematicians for use in their code/cipher breaking operations. If your data is valuable enough, there is liable to be someone willing to try to break your security, even when the data is encrypted.

The only advice we can give is to use the strongest (legal) encryption you can afford, never leave the keys where they can be accessed, and certainly never, never store the key with the data. Testers should include testing against these points in every project. Many security holes are opened when a “quick patch is made that won’t affect anything.”5
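
As one hedged illustration of the never-store-the-key-with-the-data rule, the sketch below uses the third-party cryptography package (assumed to be available) to encrypt a record with a key that arrives from a separate, access-controlled location, represented here by an environment variable.

```python
import os
from cryptography.fernet import Fernet  # third-party package, assumed installed

# The key is provisioned out of band (key vault, HSM, etc.) and reaches the
# process through an environment variable -- never written next to the data.
# The generate_key() fallback is only so this demonstration runs standalone.
key = os.environ.get("APP_DATA_KEY") or Fernet.generate_key()
cipher = Fernet(key)

record = b"SSN=123-45-6789;limit=50000"
token = cipher.encrypt(record)          # this is what gets persisted

# Only a process holding the key can recover the plaintext.
assert cipher.decrypt(token) == record
```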

4.2.1.6 Logic Bombs/Viruses/Worms

We need to add to this short list of possible security problems the old standbys, viruses, worms, and logic bombs.

A logic bomb is a chunk of code that is placed in a system by a programmer and gets triggered when specific conditions occur. It might be there for a programmer to get access to the code easily (a back door), or it might be there to do specific damage. For example, in June 1992, an employee of the US defense contractor General Dynamics was arrested for inserting a logic bomb into a system that would delete vital rocket project data. It was alleged that his plan was to return as a highly paid consultant to fix the problem once it triggered.6

There are a great number of stories of developers inserting logic bombs that would attack in case of their termination from the company they worked for. Many of these stories are likely urban legends. Of much more interest to testers is when the logic bombs are inserted via viruses or worms.

Certain viruses have left logic bombs that were to be triggered on a certain date: April Fools’ Day or Friday the 13th are common targets.

Worms are self-replicating programs that spread throughout a network or the Internet. A virus must attach itself to another program or file to become active; a worm does not need to do that.

For testing purposes, we would suggest the best strategy is to have good antivirus software installed on all systems (with current updates, of course) and a strict policy of standards and guidelines for all users to prevent the possibility of infection.

To prevent logic bombs, all new, impacted, and changed code should be subjected to some level of static review.

4.2.1.7 Cross-Site Scripting

This particular vulnerability pertains mainly to web applications. Commonly abbreviated as XSS, this exposure allows an attacker to inject malicious code into the web pages viewed by other consumers of a website. When another user connects to the website, they get a surprise when they download what they thought were safe pages.

This vulnerability is unfortunately widespread; an attack can occur anytime a web application mixes input from a user into the output it generates without validating or encoding it.

Since the victim of a malicious script is downloading from a trusted site, the browser may execute it blindly. At that point, the evil code has access to cookies, session tokens, and other sensitive information kept by the browser on the client machine.

According to OWASP.org (see section 4.2.1.8 for a discussion), this is one of the most common security holes found on the Web today. Luckily, there are several ways for a development group to protect themselves from XSS; they can be found on the OWASP site.7

When you’re testing a website, there are several things that should always be tested thoroughly:

  • Look closely at GET requests. These are more easily manipulated than other calls (such as POSTs).

  • Check for transport vulnerabilities by testing against these criteria:

    Are session IDs always sent encrypted?

    Can the application be manipulated to send session IDs unencrypted?

    What cache-control directives are returned in responses to calls passing session IDs?

    Are such directives always present?

    Is GET ever used with session IDs?

    Can POST be interchanged with GET?

Black-box tests should always include the following three phases:

1. For each web page, find all of the web application’s user-defined variables and figure out how input can reach them. These may include HTTP parameters, POST data, hidden fields on forms, and preset selection values for check boxes and radio buttons.

2. Analyze each of the variables listed in step 1 by trying to submit a predefined set of known attack strings that may trigger a reaction from the client-side web browser being tested.

3. Analyze the HTML delivered to the client browser, looking for the strings that were submitted. Identify any special characters that were not properly encoded.
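
A minimal sketch of phases 2 and 3 of that procedure follows, assuming the third-party requests library and a hypothetical search page that echoes a q parameter: each attack string is submitted, and the returned HTML is scanned for the payload appearing unencoded.

```python
import requests  # third-party library, assumed installed

PAGE = "https://www.example.com/search"   # hypothetical page under test
PARAM = "q"                               # user-defined variable found in phase 1

ATTACK_STRINGS = [
    "<script>alert(1)</script>",
    "\"><img src=x onerror=alert(1)>",
    "javascript:alert(1)",
]

for payload in ATTACK_STRINGS:
    html = requests.get(PAGE, params={PARAM: payload}, timeout=10).text
    if payload in html:
        # The payload came back without encoding -- a likely XSS exposure.
        print(f"POSSIBLE XSS: parameter {PARAM!r} reflects {payload!r} unencoded")
    else:
        print(f"parameter {PARAM!r} appears to encode or filter {payload!r}")
```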

Given the space we can afford to spend on this topic, we can barely scratch the surface of this common security issue. However, the OWASP site has an enormous amount of information for testing this security issue and should be consulted; it always has the latest exploits that have been found on the Web.

4.2.1.8 Timely Information

Many years ago, Jamie was part of a team that did extensive security testing on a small business website. After he left the organization, he was quite disheartened to hear that the site had been hacked by a teenager. It seems like the more we test, the more some cretin is liable to break in just for the kicks.

We recently read a report that claimed a fair number of break-ins were successful because organizations (and people) did not install security patches that were publicly available.8 As with leaving the server in an open, unguarded room, all of the security testing in the world will not prevent what we like to call stupidity attacks.

If you are tasked with testing security on your systems, there are a variety of websites that might help you with ideas and testing techniques. Access to timely information can keep you from falling victim to a “me too” hacker who exploits known vulnerabilities.9 Far too often, damage is done by hackers because no one thought of looking to see where the known vulnerabilities were.

Figure 4–1 shows an example from the Common Vulnerabilities and Exposures (CVE) site.10 This international website is free to use; it is a dictionary of publicly known security vulnerabilities. The goal of this website is to make it easier to share data about common software vulnerabilities that have been found.

Image

Figure 4–1 List of top 25 security vulnerabilities from the CVE website

The site contains a huge number of resources like this, a list of the top 25 programming errors of 2010. By facilitating information transfer between organizations, and giving common names to known failures, the organization running this website hopes to make the Internet a safer community.

Image

Figure 4–2 CWE cross section

In the United States, the National Vulnerability Database, run by the National Institute of Standards and Technology (NIST) and sponsored by the National Cyber Security Division of the Department of Homeland Security (DHS), has a website to provide a clearinghouse for shared information called Common Weakness Enumeration (CWE).11 See Figure 4–2.

Image

Figure 4–3 Possible attack example from CAPEC site

A related website (seen in Figure 4–3) is called Common Attack Pattern Enumeration and Classification (CAPEC12). It is designed to not only name common problems but also to give developers and testers information to detect and fight against different attacks. From the site “about” page:

Building software with an adequate level of security assurance for its mission becomes more and more challenging every day as the size, complexity, and tempo of software creation increases and the number and the skill level of attackers continues to grow. These factors each exacerbate the issue that, to build secure software, builders must ensure that they have protected every relevant potential vulnerability; yet, to attack software, attackers often have to find and exploit only a single exposed vulnerability. To identify and mitigate relevant vulnerabilities in software, the development community needs more than just good software engineering and analytical practices, a solid grasp of software security features, and a powerful set of tools. All of these things are necessary but not sufficient. To be effective, the community needs to think outside of the box and to have a firm grasp of the attacker’s perspective and the approaches used to exploit software. An appropriate defense can only be established once you know how it will be attacked.13

The objective of the site is to provide a publicly available catalog of attack patterns along with a comprehensive schema and classification taxonomy.

Finally, there is the website of the Open Web Application Security Project (OWASP), which is useful for testing.14 This not-for-profit project is dedicated to improving the security of application software. It is a wiki type of website, so your mileage may vary when using it. However, we have found a large number of resources on it that are useful when testing for security vulnerabilities, including code snippets, threat agents, attacks, and other information.

So, there are lots of security vulnerabilities and a few websites to help. What should you do as a technical test analyst to help your organization?

As you might expect, the top of the list has to be static testing. This should include multiple reviews, walk-throughs, and inspections at each phase of the development life cycle as well as static analysis of the code. These reviews should include checking adherence to standards and guidelines for all work products in the system. While this adds bureaucracy and overhead, it also allows the project team to carefully look for issues that will cause problems later on. Checklists and taxonomies of known security vulnerabilities should feed these reviews to improve the chance of finding problems before going live.

What should you look for? Certain areas will be most at risk. Communication protocols are an obvious target, as are encryption methods. The configurations in which the system is going to be used may be germane. Look for open ports that may be accessed by those who wish to breach the software.

Don’t forget to look at processes that the organization using the system will employ. What are the responsibilities of the system administrator? What password protocols will be used? Many times the system is secure, but the environment is not. What hardware, firmware, communication links, and networks will be used? Where will the server be located? Since the tester likely will not have the ability to follow up on the software in the production environment, the next best thing is to ensure that documentation be made available to users of the software, listing security guidelines and recommendations.

Static analysis tools, preferably ones that can match patterns of vulnerabilities, are useful, and dynamic analysis tools should also be used by both developers and testers at different levels of test.

There are hundreds, possibly thousands, of tools that are used in security testing. If truth be known, these are the same tools that the hackers are going to use on your system. We did a simple Google search and came up with more listings for security tools than we could read. The tools seem to change from day to day as more vulnerabilities are found. Most of these tools seem to be open source and require a certain competence to use.

Understanding what hackers are looking for and what your organization could be vulnerable to is important. Where is your critical data kept? Where can you be hurt worst?

If any of these steps are met with, “We don’t know how to do this,” then the engagement (or hiring) of a security specialist may be indicated. Frankly, in today’s world, it is not a question of if you might be hit; it is really a question of when and how hard. At the time we write this, our tips are already out of date. Security issues are constantly morphing, no matter how much effort your organization puts into them.

Helen Keller once said, “Security is mostly superstition. It does not exist in nature.” In software, security only comes from eternal vigilance and constant testing.

4.2.1.9 Internal Security Metrics

Access Auditability: A measure of how auditable the access log is. To calculate, count the number of access types that are being logged correctly as required by the specifications. Compare that number with the number of access types that are required to be logged in the specifications. Calculation is done by using the formula

X = A/B

where A is the number of access types that are being logged as called for in the specifications and B is the number of access types required to be logged in the specifications. The result will be 0 <= X <= 1. The closer to 1, the more auditable this system would be. This metric is targeted at both the analysts writing the requirements and the developers reviewing them. Obviously, this metric (and others like it) is useful only if you are collecting measurements of this type.
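
For example, if the specifications require 12 access types to be logged and a review confirms that only 9 are logged correctly, then X = 9/12 = 0.75 (hypothetical counts, shown only to illustrate the arithmetic). The same simple ratio pattern recurs in most of the ISO 9126 metrics that follow.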

Access Controllability: A measure of how controllable access to the system is. To calculate, count the number of access controllability requirements implemented correctly as called for in the specifications and compare with the number of access controllability requirements actually in the specifications. Calculation is done by using the formula

X = A/B

where A is the number of access controllability requirements implemented correctly as called for in the specifications and B is the total number of access controllability requirements in the specifications. The result will be 0 <= X <= 1. The closer to one, the more controllable this system would be. This metric is targeted at both analysts writing the requirements and the developers reviewing them.

Data Corruption Prevention: A measure of how complete the implementation of data corruption prevention is. To calculate, count the number of implemented instances of data corruption prevention as specified and compare that with the number of instances of operations or access specified in the requirements as capable of corrupting or destroying data. Calculation is done by using the same formula:

X = A/B

A is the number of implemented instances of data corruption prevention as called for in the specifications that are actually confirmed in review, and B is the number of instances of operations/accesses identified in the requirements as capable of corrupting or destroying data. The result will be 0 <= X <= 1. The closer to one, the more complete this system’s prevention of data corruption would be. This metric is targeted at developers reviewing the specifications. If there are multiple security levels defined, this metric would likely be aimed at the highest levels of security.

Data Encryption: A measure of how complete the implementation of data encryption is. To calculate, count the number of implemented instances of encryptable/decryptable data items as specified and compare with the number of instances of data items requiring data encryption/decryption facilities as defined in the requirements. Calculation uses the same formula:

X = A/B

A is the number of implemented instances of encryptable/decryptable data items called for in the specifications that are confirmed in review. B is the total number of data items that require data encryption/decryption facilities as defined in the specifications. The result will be 0 <= X <= 1. The closer to 1, the more complete this encryption would be. This metric is targeted at developers reviewing requirements, designs, and source code.

4.2.1.10 External Security Metrics

Access Auditability: A measure of how complete the audit trail concerning “user accesses to the system and to the data” is. To calculate, evaluate the amount of access that the system recorded in the access history database. Use the formula

X = A/B

where A is the number of user accesses to the system and data that are actually recorded in the access history database and B is the total number of user accesses to the system and data occurring during the evaluation time. For this metric to be meaningful, the tester would need to count the number of times attempts to access the system and data were made during the testing. This testing should be performed by penetration tests to simulate attacks. “User access to the system and data” may include “virus detection record” for virus protection. The result will be 0 <= X <= 1. The closer to 1.0, the better the auditability should be.

Access Controllability: A measure of how controllable access to the system actually is. To calculate, count the number of different detected illegal operations compared to the number of different illegal operations possible as defined in the specifications. Again we use the same formula:

X = A/B

A is the number of actually detected different types of illegal operations, and B is the number of illegal operations defined in the specifications. Penetration tests should be performed to simulate an attack when performing this measurement. Attacks should include unauthorized persons trying to create, delete, or modify programs or information. The result will be 0 <= X <= 1. The closer to 1.0, the better the access controllability will be.

Data Corruption Prevention: A measure of the frequency of data corruption events. This is done by measuring the occurrences of both major and minor data corruption events. This is calculated by using the formulae

X = 1 – A/N

Y = 1 – B/N

Z = A/T or B/T

where A is the number of times a major data corruption event occurred and B is the number of times a minor data corruption event occurred. N is the number of test cases that were run trying to cause a data corruption event. T is the amount of time spent actually testing. This requires intensive abnormal operation testing to be done trying to cause corruption. Penetration15 tests should be run to simulate attacks on the system. For X and Y, the closer the measurement is to 1.0, the better. For Z, the closer the measurement to 0, the better (i.e., a longer period of operation is measured).
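
As a hypothetical worked example: if 200 test cases designed to provoke corruption are run over 40 hours of testing and they produce 2 major and 10 minor data corruption events, then X = 1 – 2/200 = 0.99, Y = 1 – 10/200 = 0.95, and Z = 2/40 = 0.05 major events per hour of testing.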

4.2.1.11 Exercise: Security

Using the HELLOCARMS System Requirements document, analyze the risks and create an informal test design for security.

4.2.1.12 Exercise: Security Debrief

We picked system requirement 010-040-040, which states, “Support the submission of applications via the Internet, providing security against unintentional and intentional security attacks.”

As soon as we open this system up to the Internet, security issues (and thus testing) come to the forefront.

During our analysis phase, we would try to ascertain exactly how much security testing had already been done on HELLOCARMS itself and the interoperating systems. Since up until now the systems had been reasonably closed, we would expect to find some untested holes. These holes would prompt an estimate for testing them, to make sure we have the resources we need.

Next, as part of our analysis, we would investigate the most common web security holes on sites mentioned earlier in this chapter: CVE (Common Vulnerabilities and Exposures), NVD (National Vulnerability Database), CAPEC (Common Attack Pattern Enumeration and Classification), and OWASP (Open Web Application Security Project). We would want all the help we could find.

We would ensure that TTAs were active in static testing at every level as the design and code were being developed.

Our test suite would likely contain tests to cover the following:

  • Injection flaws where untrusted data is sent to our site trying to trick us into executing unintended commands

  • Cross-site scripting where the application takes untrusted data and sends it to the web browser without proper validation or escaping

  • Testing the authentication and session management functions to make sure unauthorized users are not allowed to log in

  • Testing HELLOCARMS code to make sure direct objects (files, directories, database keys, etc.) were not available from the browsers

  • Ensuring that no unencrypted or lightly encrypted data was sent to browsers (including making sure the keys are not sent with the data)

  • Checking to make sure all certificates are validated correctly so that corrupted or invalid certificates are not accepted

  • Testing any links on our pages to ensure that we use only trusted data in our links (would not want a reputation for forwarding our customers to malware sites)

4.3 Reliability Testing

Learning objectives

TTA-4.4.1 (K3) Define the approach and design high-level test cases for the reliability quality characteristic and its corresponding ISO 9126 subcharacteristics.

While we believe that software reliability is always important, it is essential for mission-critical, safety-critical, and high-usage systems. As you might expect, reliability testing can be used to reduce the risk of reliability problems. Frequent bugs underlying reliability failures include memory leaks, disk fragmentation and exhaustion, intermittent infrastructure problems, and lower-than-feasible time-out values.

ISO 9126 defines reliability as “the ability of the software product to perform its required functions under stated conditions for a specified period of time, or for a specified number of operations.” While some of our information can come from evaluating metrics collected from other testing, we can also test for reliability by executing long suites of tests repeatedly. Realistically, this will require automation to get meaningful data.

A precise, mathematical science has grown up around the topic of reliability. This involves the study of patterns of system use, sometimes called operational profiles. The operational profile will often include not only the set of tasks but also the users who are likely to be active, their behaviors, and the likelihood of certain tasks occurring. One popular definition describes an operational profile as being a “quantitative characterization of how a system may be used.”

Operational profiles can help allocate resources for architectural design, development, testing, performance analysis, release, and a host of other activities that may occur during the software development life cycle (SDLC). For reliability testing, operational profiles can help the organization understand just how reliable the system has to be in the performance of its mission, which tasks are in most need of reliability testing, and which functions need to be tested the most.
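
As a minimal sketch of how an operational profile can drive test selection (the tasks and probabilities below are hypothetical), each run draws operations in roughly the proportions in which real users would exercise them.

```python
import random

# Hypothetical operational profile: task -> probability of occurrence.
OPERATIONAL_PROFILE = {
    "check_balance":  0.55,
    "transfer_funds": 0.25,
    "pay_bill":       0.15,
    "update_profile": 0.05,
}

def next_operation() -> str:
    """Pick the next task to exercise, weighted by the operational profile."""
    tasks = list(OPERATIONAL_PROFILE)
    weights = list(OPERATIONAL_PROFILE.values())
    return random.choices(tasks, weights=weights, k=1)[0]

# Generate a session of 1,000 operations distributed like real-world usage.
session = [next_operation() for _ in range(1000)]
print({task: session.count(task) for task in OPERATIONAL_PROFILE})
```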

Overt reliability testing is really only meaningful at later stages of testing, including system, acceptance, and system integration testing. However, as you shall see, we can use metrics to calculate reliability earlier in the SDLC.

It is interesting to compare electronic hardware reliability to that of software. Hardware tends to wear out over time; in other words, there are usually physical faults that occur to cause hardware to fail. Software, on the other hand, never wears out. Limitations in software reliability over time almost always can be traced to defects originating in requirements, design, and implementation of the system.16

Image

Figure 4–4 Hardware reliability graph

In Figure 4–4, you see a graph that shows—in general—the reliability of electronic hardware over time. Most failures will occur in the first few weeks of operation. If it survives early life, odds are very good that the electronic hardware will continue working for a long time—often past the point that it is targeted to be replaced by something faster/smarter/sexier.

Image

Figure 4–5 Software reliability graph

Compare that with the software reliability graph of Figure 4–5. Software has a relatively high failure rate during the testing and debugging period; during the SDLC, the failure rate will trend downwards (we hope!). While hardware tends to be stable over its useful life, however, software tends to start becoming obsolete fairly early in its life, requiring upgrades. The cynical might say that those upgrades are often not needed but pushed by a software industry that is not making money if it is not upgrading product. The less cynical might note that there are always features missing that require upgrades and enhancements. Each upgrade tends to bring a spike of failures (hence a lowering of reliability), which then tails off over time until the next upgrade.

Finally, as can be seen in Figure 4–5, as we get upgrades, complexity also tends to increase, which tends to lower reliability somewhat.

While there are techniques for accelerated reliability testing for hardware—so-called Highly Accelerated Life Tests (or HALT)—software reliability tests usually involve extended-duration testing. The tests can be a small set of prescripted tests, run repeatedly, which is fine if the workflows are very similar. (For example, an ATM could be tested this way.) The tests might be selected randomly from a pool of different tests, which would work if the variation between tests were something that could be predicted or limited. (For example, an e-commerce system could be tested this way.) The tests can also be generated on the fly, using some statistical model (called stochastic testing). For example, telephone switches are tested this way, because the variability in called number, call duration, and the like is very large. The test data can also be randomly generated, sometimes according to a model.

In addition, standard tools and scripting techniques exist for reliability testing. This kind of testing is almost always automated; we’ve never seen an example of manual reliability testing that we would consider anything other than an exercise in self-deception.
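
A skeleton of such an automated, extended-duration run might look like the following; the test pool, failure rate, and soak duration are hypothetical. Tests are drawn at random for a fixed period, failures are logged with timestamps, and MTBF is computed at the end.

```python
import random
import time

def run_one_test(name: str) -> bool:
    """Placeholder for invoking a real automated test; returns pass/fail."""
    return random.random() > 0.001   # hypothetical 0.1% failure rate

TEST_POOL = [f"workflow_{i}" for i in range(50)]  # hypothetical test pool
SOAK_SECONDS = 8 * 60 * 60   # real soak runs last hours or days; shrink to dry-run

start = time.monotonic()
failures = []
while time.monotonic() - start < SOAK_SECONDS:
    test = random.choice(TEST_POOL)
    if not run_one_test(test):
        failures.append((time.monotonic() - start, test))  # timestamp + test name

elapsed_hours = (time.monotonic() - start) / 3600
mtbf = elapsed_hours / len(failures) if failures else float("inf")
print(f"{len(failures)} failures in {elapsed_hours:.1f} h; MTBF = {mtbf:.1f} h")
```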

Given a target level of reliability, we can use our reliability tests and metrics as exit criteria. In other words, we continue until we achieve a level of consistent reliability that is sufficient. In some cases, service-level agreements and contracts specify what “sufficient” means.

Many organizations use reliability growth models in an attempt to prognosticate how reliable a system may be at a future time. As failures are found during testing, the root causes of those failures should be determined and repaired by developers. Such repairs should cause the reliability to be improved. A mathematical model can be applied to such improvements, measuring the given reliability at certain steps in the development process and extrapolating just how much better reliability should be in the future.

A variety of different mathematical models have been suggested for these growth models, but the math behind them is beyond the scope of this book. Some of the models are simplistic and some are stunningly complex. One point we must make, however: whichever reliability growth model an organization chooses to use, it should eventually be tuned with empirical data; otherwise, the results are liable to be meaningless. We have seen a lot of resources expended on fanciful models that did not turn out to be rooted anywhere close to reality. What this means is that, after the system is delivered into production, the test team should still monitor the reliability of the system and use that data to determine the correctness of the reliability model chosen.
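
As one hedged illustration, the sketch below fits a widely cited growth model, the Goel-Okumoto mean value function mu(t) = a(1 - exp(-b*t)), to hypothetical weekly cumulative failure counts using SciPy (assumed to be available) and then extrapolates a few weeks ahead. Whichever model is chosen, its predictions should later be checked against production data, as noted above.

```python
import numpy as np
from scipy.optimize import curve_fit  # SciPy assumed available

def goel_okumoto(t, a, b):
    """Expected cumulative failures by time t: a = total defects, b = detection rate."""
    return a * (1.0 - np.exp(-b * t))

# Hypothetical data: cumulative failures observed at the end of each test week.
weeks = np.arange(1, 9)
cumulative_failures = np.array([12, 21, 28, 33, 37, 40, 42, 43])

(a_hat, b_hat), _ = curve_fit(goel_okumoto, weeks, cumulative_failures, p0=(50, 0.3))
print(f"estimated total defects: {a_hat:.0f}, detection rate: {b_hat:.2f}/week")
print(f"predicted cumulative failures by week 12: {goel_okumoto(12, a_hat, b_hat):.0f}")
```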

In the real world, reliability is a fickle thing. For example, you might expect that the longer a system is run without changes, the more reliable it will become. Balancing that is the observation that the longer a system runs, the more people might try to use it differently; this may result in using some capability that had never been tried before and, incidentally, causing the system to fail.

One metric that has been used for measuring reliability is defect density (the number of defects per thousand lines of source code [KLOC] or per function point). Cyclomatic complexity, essential complexity, number of modules, and counts of certain constructs (such as GOTOs) have all been used to try to correlate complexity, size, and programming techniques with the number and distribution of failures.

Object-oriented development has its own ways of measuring complexity, including number of classes, average number of methods per class, cohesion in classes and coupling between classes, depth of inheritance used, and many more.

We tend to try to develop reliability measures during test so that we can predict how the system will work in the real world. But balancing that is the fact that the real world tends to be much more complex than our testing environment. As you might guess, that means there are going to be a whole host of issues that are then going to affect reliability in production. In test, we need to allow for the artificiality of our testing when trying to predict the future. The more production-like our testing environment, the more likely our reliability growth model will be meaningful.

So what does it all mean? Exact real numbers for reliability might not be as meaningful as trends. Are we trending down in failures (meaning reliability is trending up)?

4.3.1 Maturity

One term that is often used when discussing reliability is maturity. This is one of three subcharacteristics that are defined by ISO 9126. Maturity is defined as the capability of the system to avoid failure as a result of faults in the software. We often use this term in relation to the software development life cycle; the more mature the system, the closer it is to being ready to move on to the next development phase.

In maturity testing, we monitor software maturity and compare it to desired, statistically valid goals. The goal can be the mean time between failures (MTBF), the mean time to repair (MTTR), or any other metric that counts the number of failures in terms of some interval or intensity. Of course, there are failures and then there are failures. Maturity testing requires that all parties involved come to agreement as to what failures count toward these metrics. During maturity testing, as we find bugs that are repaired, we’d expect the reliability to improve. As discussed earlier, many organizations use reliability growth models to monitor this growth.

4.3.1.1 Internal Maturity Metrics

Fault detection: A measure of how many defects17 were detected in the reviewed software product, compared to expectations. This metric is collected by counting the number of bugs detected in review and comparing that number to the number estimated to be found in this phase of static testing. The metric is calculated by the formula

X = A / B

where A is the actual number of bugs detected (from the review report) and B is the estimated number expected.18 If X varies significantly above or below 1.0, then something unexpected has happened that requires further investigation.

Fault removal: A measure of how many defects that were found during review are removed (corrected) during design and implementation phases. The metric can be calculated by the formula

X = A / B

where A is the number of bugs fixed prior to dynamic testing (either during requirements, design, or coding) and B is the number of bugs found during review. The result will be 0 <= X <= 1. The closer the value is to one, the better. A fault removal value of exactly one would mean that every detected defect had been removed.

Test adequacy: A measure of how many of the required test cases are covered by the test plan. To calculate this metric, count the number of test cases planned (in the test plan) and compare that value to the number of test cases required to “obtain adequate test coverage” using the formula

X = A / B

where A is the number of test cases designed in the test plan and confirmed in review and B is the number of test cases required. Our research into exactly how that required number of test cases should be determined turned up no specific model in the ISO 9126 standard. Without a specified model for the number of tests required to achieve the needed level of coverage, we do not find this metric useful. After all, tests can come from the requirements, risk analysis, use cases, tester experience, and literally dozens of different places. How do you quantify that? Some of Rex’s clients are able to use substantial, statistically valid, parametric models to predict B, in which case this metric is useful.

4.3.1.2 External Maturity Metrics

Estimated latent fault density: A measure of how many defects remain in the system that may emerge as future failures. This metric depends on using a reliability growth estimation model as we discussed earlier in the chapter. The formula for this metric is as follows:

X = {abs(A1 - A2)} / B

A1 is the total number of predicted latent defects in the system, A2 is the total number of actually occurring failures, and B is the product size. To get the predicted number of latent defects, you count the number of defects that are detected through testing during a specified trial period and then predict the number of future failures using the reliability growth model. The actual count of failures found comes from incident reports. Interestingly enough, ISO 9126-2 does not define how size (B) is measured; it should not matter as long as the same measurement is used in the reliability growth model. Some organizations prefer KLOC (thousands of lines of source code), while others prefer to measure function points.
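
As a hypothetical worked example: if the growth model predicts 40 latent defects (A1), 30 failures have actually been observed (A2), and the product size is 25 KLOC (B), then X = abs(40 – 30)/25 = 0.4 latent defects per KLOC.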

There is the very real possibility that the number of actual failures may exceed the number estimated. ISO 9126-2 recommends reestimating, possibly using a different reliability growth model if this occurs. The standard also makes the point that estimating a larger number of latent defects should not be done simply to make the system look better when fewer failures are actually found. We’re sure that could never happen!

A conceptual issue with this metric is that it appears to assume that there is a one-to-one relationship between defects and failures. As Rex points out, one latent defect may cause 0 to N failures, invalidating that relationship and possibly making this metric moot. However, the reliability growth model may have this concept built in, allowing for such an eventuality. The standard suggests trying several reliability growth models and finding the one that is most suitable. In some ways, we feel that is like shopping for a lawyer based on them saying what you want to hear…

Failure density against test cases: A measure of how many failures were detected during a defined trial period. In order to calculate this metric, use the formula

X = A1 / A2

where A1 is the number of detected failures during the period and A2 is the number of executed test cases. A desirable value for this metric will depend on the stage of testing. The larger the number in earlier stages (unit, component, and integration testing), the better. The smaller the number in later stages of testing and in operation, the better. This is a metric that is really only meaningful when it is monitored throughout the life cycle and viewed as a trend rather than a snapshot in time. Although ISO 9126-2 does not mention it, the granularity of test cases might skew this metric if there are large differences between test cases at different test levels. A unit test case will typically test a single operation with an object or function, while a system test case can easily last for a couple of hours and cause thousands of such operations to occur within the code. If an organization intends to use this metric across multiple test levels, they must derive a way of normalizing the size of A2. It is likely most useful when applied within a single test level rather than compared across test levels. The standard does note that testing should include appropriate test cases: normal, exceptional, and abnormal tests.

Failure resolution: A measure of how many failure conditions are resolved without reoccurrence. This metric can be calculated by the formula

X = A1 / A2

where A1 is the total number of failures that are resolved and never reoccur during the trial period and A2 is the total number of failures that were detected. Clearly, every organization would prefer that this value be equal to one. In real life, however, some failures are not resolved correctly the first time; the more often failures reoccur, the closer to zero this metric will get. The standard recommends monitoring the trend for this metric rather than using it as a snapshot in time.

Fault density: A measure of how many failures were found during the defined trial period in comparison to the size of the system. The formula is as follows:

X = A / B

A is the number of detected failures and B is the system size (again, ISO 9126-2 does not define how size is measured). This is a metric where the trend is most important. The later the stage of testing, the lower we would like this metric to be. Two caveats must be made when discussing fault density. Duplicate reports on the same defect will skew the results, as will erroneous reports (where there was not really a defect, but instead the failure was caused by the test environment, bad test case, or other external problem). This metric can be a good measure of the effectiveness of the test cases. A less charitable view might be that it is a measure of how bad the code was when first released into test.

Fault removal: A measurement of how many defects have been corrected. There are two components to this metric: one covering the actually found defects and one covering the estimated number of latent defects. The formulae are

X = A1 / A2

Y = A1 / A3

where A1 is the number of corrected defects, A2 is the total number of actually detected defects, and A3 is the total number of estimated latent defects in the system. In reality, the first formula measures what proportion of the found defects have been removed. If the value of X is one, every defect found was removed; the smaller the number, the more defects are being left in the system. As with some of the other metrics, this measurement is more meaningful when viewed as a trend rather than isolated in time.

If the organization does not estimate the number of latent defects (A3), this metric clearly cannot be used because A3 would equal zero, causing a division by zero when the calculation is made.

The value of Y will normally be 0 <= Y <= 1. If the value of Y is greater than one, the organization may want to investigate whether the software is particularly buggy or whether the estimate based on the reliability growth model was faulty. If Y is appreciably less than one, the organization may want to investigate whether the testing was adequate to detect all of the defects. Remember that duplicate incident reports will skew this metric. The closer Y is to 1.0, the fewer defects should remain in the system (assuming that the reliability model is suitable).

Mean time between failures (MTBF): A measure of how frequently the software fails in operation. To calculate this metric, count the number of failures that occurred during a defined period of usage and compute the average interval between the failures. Not all failures are created equal. For example, minor failures such as misspellings or minor rendering failures are not included in this metric. Only those failures that interrupt availability of the system are counted. This metric can be calculated in two ways:

X = T1 / A

Y = T2 / A

T1 is the total operation time, and T2 is the sum of all the time intervals where the system was running. The second formula can be used when there is appreciable time spent during the interval when the system was not running. In either case, A is the total number of failures that were observed during the time the system was operating.

Clearly, the greater the value of X or Y, the better. This metric may require more research as to why the failures occurred. For example, there may be a specific function that is failing while other functionality may be working fine. Determination of the distribution of the types of failures may be valuable, especially if there are different ways to use the system.
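
For example, the two MTBF calculations might be sketched in Python as follows (the hours are invented):

    def mtbf(total_operation_time, failures, running_time=None):
        """X = T1 / A and, when idle time matters, Y = T2 / A.
        Times may be in any consistent unit (hours here)."""
        x = total_operation_time / failures
        y = running_time / failures if running_time is not None else None
        return x, y

    # Example: 720 hours in the observation period, 600 of them actually
    # running, and 4 availability-interrupting failures observed
    x, y = mtbf(total_operation_time=720, failures=4, running_time=600)
    print(x)  # 180.0 hours between failures
    print(y)  # 150.0 hours between failures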

Test coverage: A measure of how many test cases have been executed during testing. This metric requires us to estimate how much testing it would require to obtain adequate coverage based on…something. As we noted earlier, the standard is hazy on exactly how many tests it takes to get “adequate coverage” and exactly where those tests come from. The formula to calculate this metric is as follows:

X = A / B

A is the number of test cases actually executed and B is the estimated number of test cases, from the requirements, that are needed for adequate coverage. The closer this value comes to one, the better.

Test maturity: A measure of how well the system has been tested. This is used, according to ISO 9126-2, to predict the success rate the system will achieve in future testing. As before, the formula consists of

X = A / B

where A is the number of test cases passed and B is the estimated number of test cases needed for adequate coverage based on the requirements specification documentation. There is that term adequate coverage again. If your organization can come up with a reasonable value for this, some of these metrics would be very useful. In this particular case, the standard does make some recommendation as to the type of testing to be used in this metric. The standard recommends stress testing using live historical data, especially from peak periods. It further recommends using user operation scenarios, peak stress testing, and overloaded data input. The closer this metric is to one, that is, the more test cases that pass in comparison to all that should be run, the better.

4.3.2 Fault Tolerance

The second subcharacteristic of reliability is fault tolerance, defined as the capability of a system to maintain a specified level of performance in case of software faults. When fault tolerance is built into the software, it often consists of extra code to avoid and/or survive and handle exceptional conditions. The more critical it is for the system to exhibit fault tolerance, the more code and complexity are likely to be added.

Other terms that may be used when discussing fault tolerance are error tolerance and robustness. In the real world, it is not acceptable for a single, isolated failure to bring an entire system down. Undoubtedly, certain failures may degrade the performance of the system, but the ability to keep delivering services—even in degraded form—is a required feature that should be tested.

Negative testing is often used to test fault tolerance. We partially or fully degrade the system operation via testing while measuring specific performance metrics. Fault tolerance tends to be tested at each phase of testing.

During unit test we should test error and exception handling with interface values that are erroneous, including out-of-range, poorly formed, or semantically incorrect values. During integration test we should test incorrect inputs from the user interface, files, or devices. During system test we might test incorrect inputs from the OS, interoperating systems, devices, and/or user inputs. Valuable testing techniques include invalid boundary testing, exploratory testing, state transition testing (especially looking for invalid transitions), and attacks. In Chapter 6, we discuss fault injection tools, which will be useful for this type of testing.
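
As an illustration of this kind of negative testing at the unit level, the following sketch uses Python’s built-in unittest module against a hypothetical transfer_funds function (the function and its validation rules are invented for the example) to check that erroneous interface values are rejected rather than silently accepted:

    import unittest

    def transfer_funds(amount, currency):
        """Hypothetical unit under test: rejects bad interface values explicitly."""
        if not isinstance(amount, (int, float)):
            raise TypeError("amount must be numeric")
        if amount <= 0 or amount > 10_000:
            raise ValueError("amount out of permitted range")
        if currency not in ("USD", "EUR", "GBP"):
            raise ValueError("unsupported currency")
        return True  # the real transfer logic would go here

    class NegativeTransferTests(unittest.TestCase):
        def test_out_of_range_amount_is_rejected(self):
            with self.assertRaises(ValueError):
                transfer_funds(-50, "USD")

        def test_poorly_formed_amount_is_rejected(self):
            with self.assertRaises(TypeError):
                transfer_funds("fifty", "USD")

        def test_semantically_incorrect_currency_is_rejected(self):
            with self.assertRaises(ValueError):
                transfer_funds(100, "XYZ")

    if __name__ == "__main__":
        unittest.main()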

In several different metrics, ISO 9126 uses the term Fault Pattern. This is defined as a “generalized description of an identifiable family of computations”19 that cause harm. These have formally defined characteristics and are fully definable in the code. The theory is that if we can define the common characteristics of computations in a certain area that cause harm, we can use those as a vocabulary that will help us find instances of the fault pattern in an automated way so they can be removed. A website we introduced earlier when discussing security, CWE, was created to categorize many of the diverse fault patterns that have been discovered. CWE stands for Common Weakness Enumeration and has the following charter:

International in scope and free for public use, CWE provides a unified, measurable set of software weaknesses that is enabling more effective discussion, description, selection, and use of software security tools and services that can find these weaknesses in source code and operational systems as well as better understanding and management of software weaknesses related to architecture and design.

Each fault pattern is fully described and defined with the following information, allowing developers to identify and remove them when finding them in their code.

  • Full description

  • When introduced

  • Applicable platforms

  • Common consequences

  • Detection methods

  • Examples in code

  • Potential mitigations

  • Relationships to other patterns

4.3.2.1 Internal Fault Tolerance Metrics

Failure avoidance: A measure of how many fault patterns were brought under control to avoid critical and serious failures. To calculate this metric, count the number of avoided fault patterns and compare it to the number of fault patterns to be considered, using the formula

X = A / B

where A is the number of fault patterns that were explicitly avoided in the design and code and B is the number to be considered. The standard does not define where the number of fault patterns to be considered is to come from. We believe their assumption is that each organization will identify the fault patterns that they are concerned about and have that list available to the developers and testers, perhaps using a site like CWE.

Incorrect operation avoidance: A measure of how many functions are implemented with specific designs or code added to prevent the system from incorrect operation. As with failure avoidance, we count the number of functions that have been implemented to avoid or prevent critical and serious failures from occurring and compare them to the number of incorrect operation patterns that have been defined by the organization. While the term operation patterns is not defined formally, examples of incorrect operation patterns to be avoided include accepting incorrect data types as parameters, incorrect sequences of data input, and incorrect sequences of operations. Based on the examples given, operation patterns are very similar to fault patterns, defining specific ways that the system may be used incorrectly. Once again, calculation of the metric is done by using the formula

X = A / B

where A is the number of incorrect operations that are explicitly designed to be prevented and B is the number to be considered as listed by the organization. ISO 9126-3 does not define exactly what it considers critical or serious failures; our assumption is that each organization will need to compile their own list, perhaps by mining the defect database.

An example of avoidance might be having precondition checks before processing an API call where each argument is checked to make sure it contains the correct data type and permissible values. Clearly, the developers of such a system must make trade-offs between safety and speed of execution.
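
A minimal sketch of such precondition checking, using a hypothetical API handler written in Python, might look like the following; every argument is validated before any real processing occurs, trading a little execution speed for safety:

    def set_thermostat(target_temp, mode):
        """Hypothetical API call that validates every argument up front."""
        # Precondition: correct data types
        if not isinstance(target_temp, (int, float)):
            raise TypeError("target_temp must be numeric")
        if not isinstance(mode, str):
            raise TypeError("mode must be a string")
        # Precondition: permissible values
        if not 5.0 <= target_temp <= 35.0:
            raise ValueError("target_temp outside the permitted range")
        if mode not in ("heat", "cool", "auto", "off"):
            raise ValueError("unknown mode")
        # Only now do the actual work
        return {"target": float(target_temp), "mode": mode}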

4.3.2.2 External Fault Tolerance Metrics

Breakdown avoidance: A measure of how often the system causes the breakdown of the total production environment. This is calculated using the formula

X = 1 - A / B

where A is the number of total breakdowns and B is the number of failures. The problem that an organization might have calculating this metric is defining exactly what a breakdown is. The term breakdown in ISO 9126 is defined as follows: “the execution of user tasks is suspended until either the system is restarted or that control is lost until the system is forced to be shut down.” If minor failures are included in the count, this metric may appear to be closer to one than is meaningful.

The closer this value is to one (i.e., the number of breakdowns is closer to 0), the better. For example, if there were 100 total failures (90 minor and 10 major) and one breakdown, then including minor failures would yield the following measurement:

1 – 1/100 = 0.99

Not including them would yield a wholly different number:

1 – 1/10 = 0.90

Ideally, a system will be engineered to be able to handle internal failures without causing the total breakdown of the system; this of course usually comes about by adding extra fault sensing and repair code to the system. ISO 9126-2 notes that when few failures do occur, it might make more sense to measure the time between them—MTBF—instead of this metric.
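
The skewing effect of minor failures can be made concrete with a short Python sketch using the numbers from the example above:

    def breakdown_avoidance(breakdowns, failures):
        """X = 1 - A / B per ISO 9126-2."""
        return 1 - breakdowns / failures

    # 100 total failures (90 minor, 10 major) and 1 breakdown
    print(breakdown_avoidance(breakdowns=1, failures=100))  # 0.99 (minor failures counted)
    print(breakdown_avoidance(breakdowns=1, failures=10))   # 0.90 (major failures only)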

Failure avoidance: A measure of how many fault patterns were brought under control to avoid critical and serious failures. Two examples are given: out of range data and deadlocks. Remember that these metrics are for dynamic testing. In order to capture this particular metric, we would perform negative tests and then calculate how often the system would be able to survive the forced fault without having it tumble into a critical or serious failure. ISO 9126-2 gives us examples of impact of faults as follows:

  • Critical: The entire system stops or serious database destruction occurs.

  • Serious: Important functionality becomes inoperable with no possible workarounds.

  • Average: Most functions are still available, but limited performance occurs with workarounds.

  • Small: A few functions experience limited performance with limited operation.

  • None: Impact does not reach end user.

The metric is calculated using the familiar formula

X = A / B

where A is the number of avoided critical and serious failure occurrences against test cases for a given fault pattern and B is the number of executed test cases for the fault pattern. A value closer to one, signifying that the user will suffer fewer critical and serious failures, is better.

Incorrect operation avoidance: A measure of how many system functions are implemented with the ability to avoid incorrect operations or damage to data. Incorrect operations that may occur are defined in this case to include the following:

  • Incorrect data types as parameters

  • Incorrect sequence of data input

  • Incorrect sequence of operations

To calculate this measurement, we count the number of negative test cases that fail to cause critical or serious failures (i.e., the negative test cases pass) compared to the number of test cases we executed for that purpose. We use the formula

X = A / B

where A is the number of test cases that pass (i.e., no critical or serious failures occur) and B is the total number run. The closer this metric is to one, the better the incorrect operation avoidance. Of course, this metric should be matched with a measure of coverage to be meaningful. For example, if we run 10 tests and all pass, our measurement would be 1. If we run 1,000 tests and 999 of them pass, the measurement would be 0.999. We would argue that the latter value would be much more meaningful.

4.3.3 Recoverability

The next subcharacteristic for reliability is called recoverability, defined as the capability to reestablish a specified level of performance and recover the data directly affected in case of a failure.

Clearly, recoverability must be built into the system before we can test it. Assuming that the system has such capabilities, typical testing would include running negative tests to cause a failure and then measure the time that the system takes to recover and the level of recovery that occurs. According to ISO 9126, the only meaningful measurement is the recovery that the system is capable of doing automatically. Manual intervention does not count.

Failover functionality often is built into a system when the consequences of a software failure are so unthinkable that they must never be allowed to occur. High availability is a term that is often used in these scenarios. There are three principles of high availability engineering that should be tested:

  • No single point of failure (add redundancy)

  • Reliable failover when a failure does occur

  • Immediate detection of failures when they occur (required to trigger failover)

If you consider this list, high availability really requires all three subcharacteristics of reliability (ignoring compliance for the moment) to be present (and hence tested). Jamie spent some time consulting at an organization that handled ATM and credit card transactions. That organization was a perfect example of one that required high availability.

When someone inserts their card in an ATM, or swipes it at a gasoline station or grocery store, they are not willing to wait for hours for the charge to be accepted. Most people are barely willing to wait seconds. When a fault occurs and the processing of the card is interrupted, it is not permissible for the transaction to get lost, forgotten, or duplicated.

To be fair, not all of the magic for recoverable systems comes from the software. While working for the organization just noted, Jamie learned that a great deal of the high availability came from special hardware (Tandem computers) and a special operating system. However, all associated software had to be built with the utmost reliability and tested thoroughly to ensure that every single transaction was tracked, fulfilled, completed, and logged.

Having said that, much recoverability does come completely—or mostly—from software systems.

Redundant systems may be built to protect groups of processors, discs, or entire systems. When one item fails, the recoverability includes the automatic failover to still-working items and the automatic logging of the failure so that the system operator will know that the system is working in a degraded mode. Testing this kind of system includes triggering various failures to ascertain exactly how much damage the system can take and still remain in service.

In certain environments, it is not sufficient to have arrays of similar systems to ensure recoverability. If all of the items of an array are identical, a failure mode that takes out one portion of the array may also affect the other portions. Therefore, some high availability systems are designed with multiple dissimilar systems. Each system performs exactly the same task, but does it in a different way. A failure mode affecting one of the systems would be unlikely to affect the others. Testing such a system would still require failover testing to ensure that the entire system remains in service after the fault.

Backup and restore testing also fits the definition of recoverability testing. This would include testing the physical backup and restoration tasks and the processes to actually perform them as well as testing the data afterward to ensure that the recovered data was actually complete and correct.

Many organizations have found that they do, indeed, have backups of all their data. Unfortunately, no one ever tested that the data was valid. After a catastrophe, they found that the resources they saved by not fully testing their data recovery paled in comparison to the losses incurred when that data restore did not work correctly. Oops!

4.3.3.1 Internal Recoverability Metrics

Restorability: A measure of how capable the system is of restoring itself to full operation after an abnormal event or request. This metric is calculated by counting the number of restoration requirements that are implemented by the design and/or code and comparing them to the number in the specification documents. For example, if the system has the ability to redo or undo an action, it would be counted as a restoration requirement. Also included are database and transaction checkpoints that allow rollback operations. The formula for the measurement again consists of a ratio

X = A / B

where A is the number of restoration requirements that are confirmed to be implemented via reviews and B is the number called for in the requirements or design documents. This value will evaluate to 0 <= X <= 1. The closer the value is to one, the better.

Restoration effectiveness: A measure of how effective the restoration techniques will be. Remember that all of these are internal metrics that are expected to come from static testing. According to ISO 9126-3, we get this measurement by calculation and/or simulation. The intent is to find the number of restoration requirements (defined earlier) that are expected to meet their time target when those requirements specify a specific time target. Of course, if the requirements do not have a time target, this metric is moot. The measurement has the now-familiar formula of

X = A / B

where A is the number of implemented restoration requirements meeting the target restore time and B is the total number of requirements that have a specified target time. For example, suppose that the requirements specification not only requires the ability to roll back a transaction, but it also defines that it must do so within N milliseconds once it is triggered. If this capability is actually implemented, and we calculate/simulate that it would actually work correctly, the metric would equal 1/1. If not implemented, the metric would be 0/1.

4.3.3.2 External Recoverability Metrics

Availability: A measure of how available the system is for use during a specified period of time. This is calculated by testing the system in an environment as much like production as possible, performing all tests against all system functionality, and using the formulae

X = To / (To + Tr)

Y = A1 / A2

where To is the total operation time and Tr is the amount of time the system takes to repair itself (such that the system is not available for use). We are only interested in time during which the system was able to automatically repair itself; in other words, we don’t measure the time when the system is unavailable while manual maintenance takes place. A1 is the total number of times that a user was able to use the software successfully when they tried to, and A2 is the total number of times the user tried to use the software during the observation period.

In the above formulae, X is the proportion of time the system was available (the closer to one, the more available the system was), while Y is the proportion of attempts in which a user was able to successfully use the system. The closer Y approaches 1, the better the availability.
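
A minimal sketch of the two availability calculations, with invented observation data, might look like this:

    def availability(operation_time, repair_time, successful_uses, attempted_uses):
        """X = To / (To + Tr); Y = A1 / A2."""
        x = operation_time / (operation_time + repair_time)
        y = successful_uses / attempted_uses
        return x, y

    # Example: 495 hours operating, 5 hours spent in automatic self-repair,
    # and 9,900 successful uses out of 10,000 attempts
    x, y = availability(495, 5, 9_900, 10_000)
    print(x)  # 0.99
    print(y)  # 0.99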

Mean down time: A measure of the average amount of time the system remains unavailable when a failure occurs before the system eventually starts itself back up. As we said before, ISO 9126-2 is only interested in measuring this value when the restoration of functionality is automatic; this metric is not used when a human has to intervene with the system to restart it. To calculate this measurement, measure the time the system is unavailable after a failure during the specified testing period using the formula

X = T / N

where T is the total amount of time the system is not available and N is the number of observed times the system goes down. The closer X is to zero, the better this metric is (i.e., the shorter the unavailable time).

Mean recovery time: A measure of the average time it takes for the system to automatically recover after a failure occurs. This metric is measured during a specified test period and is calculated by the formula

X = Sum(T) / N

where T is the total amount of time to recover from all failures and N is the number of times that the system had to enter into recovery mode. Manual maintenance time should not be included. Smaller is clearly better for this measure. ISO 9126-2 notes that this metric may need to be refined to distinguish different types of recovery. For example, the recovery of a destroyed database is liable to take much longer than the recovery of a single transaction. Putting all different kinds of failures into the same measurement may cause it to be misleading.
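
One way to apply that refinement (our own sketch, not something the standard prescribes) is simply to group the observed recovery times by failure type before averaging:

    from collections import defaultdict

    def mean_recovery_times(observations):
        """observations: iterable of (failure_type, recovery_minutes) pairs.
        Returns the mean recovery time per failure type."""
        grouped = defaultdict(list)
        for failure_type, minutes in observations:
            grouped[failure_type].append(minutes)
        return {ftype: sum(times) / len(times) for ftype, times in grouped.items()}

    observed = [("transaction", 0.5), ("transaction", 0.7), ("database", 95.0)]
    print(mean_recovery_times(observed))
    # {'transaction': 0.6, 'database': 95.0}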

Restartability: A measure of how often a system can recover and restart providing service to users within a target time period. As before, this metric is only concerned with automatic recovery (where no manual intervention occurs). The target time period likely comes from the requirements (e.g., a specified service-level agreement is in place). To calculate this metric, we count the number of times that the system must restart due to failure during the testing period and how many of those restarts occurred within the target time. We use the formula

X = A / B

where A is the number of restarts that were performed within the target time and B is the total number of restarts that occurred. The closer this measurement is to one the better. The standard points out that we may want to refine this metric because different types of failures are liable to have radically different lengths of recovery time.

Restorability: A measure of how capable the system is of restoring itself to full operation after an abnormal event or request. In this case, we are only interested in the restorations that are defined by the specifications. For example, the specifications may require that the system be able to restore certain operations, including database checkpoints, transaction checkpoints, redo functions, undo functions, and so on. This metric is not concerned with other restorations that are not defined in the specifications. Calculation of this metric is done using the formula

X = A / B

where A is the number of restorations successfully made and B is the number of restoration cases tested based on the specifications. As before, values closer to one, showing better restorability, are preferable.

Restore effectiveness: A measure of how effective the restoration capability of the system is. This metric is calculated using the now familiar formula:

X = A / B

A is the number of times restoration was successfully completed within the target time and B is the total number of restoration cases performed when there is a required restoration time to meet. Values closer to one show better restoration effectiveness. Once again, target time is likely defined in the requirements or service-level agreement.

4.3.4 Compliance

4.3.4.1 Internal Compliance Metrics

Reliability compliance: A measure of the capability of the system to comply with such items as internal standards, conventions, or regulations of the user organization in relation to reliability. This metric is calculated by using the formula:

X = A / B

A is the number of regulations we are in compliance with, as confirmed by our static testing, and B is the total number of regulations, standards, and conventions that apply to the system.

4.3.4.2 External Compliance Metrics

Reliability compliance: A metric that measures how compliant the system is with applicable regulations, standards, and conventions. This is measured by using the formula

X = 1 - A / B

where A is the number of reliability compliance items that were specified but were not implemented during the testing and B is the total number of reliability compliance items specified. ISO 9126-2 lists the following places where compliance specifications may exist:

  • Product description

  • User manual

  • Specification of compliance

  • Related standards, conventions, and regulations

The closer to one, the better; 1 would represent total compliance with all items.

4.3.5 An Example of Good Reliability Testing

The United States National Aeronautics and Space Administration (NASA) is responsible for trying to keep mankind in space. When its software fails, it might mean the end of a multibillion-dollar mission (not to mention many lives).

The NASA Software Assurance Standard, NASA-STD-8739.8, defines software reliability as a discipline of software assurance that does the following:

1. Defines the requirements for software controlled system fault/failure detection, isolation, and recovery

2. Reviews the software development processes and products for software error prevention and/or reduced functionality states

3. Defines the process for measuring and analyzing defects and defines/derives the reliability and maintainability factors

NASA uses both trending and predictive techniques when looking at reliability.

Trending techniques track the metrics of failures and defects found over time. The intent is to develop a reliability operational profile of a given system over a specified time period. NASA uses four separate techniques for trending:

1. Error seeding:20 Estimates the number of errors in a program by using multistage sampling. Defects are introduced to the system intentionally. The number of unknown errors is estimated from the ratio of induced errors to noninduced errors from debugging data.

2. Failure rate: Studies the failure rate per fault at the failure intervals. The theory goes that as the remaining number of faults changes, the failure rate of the program changes accordingly.

3. Curve fitting: NASA uses statistical regression analysis to study the relationship between software complexity and the number of faults in a program as well as the number of changes and the failure rate.

4. Reliability growth: Measures and predicts the improvement of a program’s reliability throughout the testing process. Reliability growth also represents the failure rate of the system as a function of time and the number of test cases run.

NASA also uses predictive reliability techniques: These assign probabilities to the operational profile of a software system. For example, the system has a 5 percent chance of failure over the next 60 operational hours. This clearly involves a capability for statistical analysis that is far beyond the capabilities of most organizations (due to lack of resources and skills).

Metrics that NASA collects and evaluates can be split into two categories: static and dynamic.

The following measures are static measures:

1. Line count, including lines of code (LOC) and source lines of code (SLOC)

2. Complexity and structure, including cyclomatic complexity, number of modules and number of GOTO statements

3. Object-oriented metrics, including number of classes, weighted methods per class, coupling between objects, response for a class, number of child classes, and depth of inheritance tree

Dynamic measures include failure rate data and the number of problem reports.

The question that must be asked when discussing reliability testing is, With which faults are we going to be concerned? A complex system can fail at hundreds or thousands of places. There is a great deal of cost involved in reliability testing. As the reliability needs of the organization go up, the costs escalate even faster. Therefore, an organization must plan the testing carefully. While NASA can afford to pull out all the stops when it comes to reliability testing, most other organizations are not so lucky.

The following events may be of concern:

  • An external event that should occur does not, a device that should be on line is not (or has degraded performance), or an interface or process that the system needs is not available

  • The network is slow or not available, or it suddenly crashes

  • Any of the operating system capabilities that the system relies on are not available or are degraded

  • Inappropriate, unexpected, or incorrect user input occurs

A test team must determine which of these (or other possibilities) are important for the system’s mission. The team would create [usually negative] tests to degrade or remove those capabilities and then measure the response of the system. Measurements from these tests are then used to determine if the system reliability was acceptable.

4.3.6 Exercise: Reliability Testing

Using the HELLOCARMS System Requirements document, analyze the risks and create an informal test design for reliability.

4.3.7 Exercise: Reliability Testing Debrief

We selected two related requirements, 020-010-020 and 020-010-030. The first, set in release two, requires that fewer than five (5) failures in production occur per month. The second requires that the number of failures in production be less than one (1) per month by release four. In essence, we are going to test mean time between failures (MTBF). Frankly, we might challenge this kind of a firm requirement in review because it sets a (seemingly) arbitrary value that may be impossible to meet within project constraints.

However, since the requirement is firm, it strikes us as an opportunity to create a long-running automated test that could be run over long periods (overnight, weekends, or perhaps on a dedicated workstation running for weeks).

This would depend on having automated tests available that exercise the GUI screens of the Telephone Banker. In order to be useful, the tests would need to be data-driven or keyword-driven tests, tests that can be run randomly with a very large data set.

Based on the workflow of the Telephone Banker, we would create a variety of scenarios, using random customer data. These scenarios would be as follows:

  • Accepted and rejected loans of all sorts and amounts

  • Accepted but declined-by-customer loans

  • Hang-ups and disconnects

  • Insurance accepts and declines

The defect theory that we would be testing is that the system may have reliability issues, especially when unusual scenarios are run in random order. Each test would clearly need to check for expected versus actual results. We would be looking for the number of failures that occur within the testing time period so we can get a read on the overall reliability of the system over time.
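
A skeleton of such a randomized, data-driven run might look like the following Python sketch. Everything here is illustrative: the scenario names simply mirror the list above, the weights are invented, and run_scenario is a placeholder for whatever keyword-driven or GUI automation is actually available for the Telephone Banker screens.

    import random

    # Hypothetical weighting of how often each scenario type should occur
    SCENARIOS = {
        "accepted_loan": 40,
        "rejected_loan": 25,
        "accepted_but_declined_by_customer": 15,
        "hang_up_or_disconnect": 10,
        "insurance_accept_or_decline": 10,
    }

    def run_scenario(name, customer):
        """Placeholder: the real version would drive the Telephone Banker GUI
        and compare actual results against expected results."""
        return True

    def soak_run(customer_pool, iterations, seed=42):
        random.seed(seed)  # reproducible random ordering for debugging
        names, weights = list(SCENARIOS), list(SCENARIOS.values())
        failures = 0
        for _ in range(iterations):
            scenario = random.choices(names, weights=weights, k=1)[0]
            customer = random.choice(customer_pool)
            if not run_scenario(scenario, customer):
                failures += 1
        return failures  # feed this count into the per-build reliability trend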

Each test build’s metrics would be compared to the previous builds’ to determine if the maturity of the system is growing (fewer failures per time period would indicate growing maturity).

Fault tolerance metrics would be extrapolated by determining how often the entire system fails when a single transaction fails as compared to being able to continue running further transactions despite the failure.

In those cases where the entire system does fail, recoverability would be measured by the amount of time it takes to get the entire HELLOCARMS system up and running again.

4.4 Efficiency Testing

Learning objectives

TTA-4.5.1 (K3) Define the approach and design high-level operational profiles for performance testing.

ISO 9126 defines efficiency as the capability of the software product to provide appropriate performance relative to the amount of resources used, under stated conditions. When speaking of resources, we could mean anything on the system, software, hardware, or any other abstract entities. For example, network bandwidth would be included in this definition.

Efficiency of a distributed system is almost always important. When might it not be important? Several years ago, Jamie was teaching a class in Juneau, Alaska. After class, he struck up a conversation—in a bar—with a tester who said he worked at the Alaska Department of Transportation. Discussion turned to a new distributed system that was about to go live, allowing people from all over the state to renew their driver’s licenses online. When Jamie asked about performance testing of the new system, the tester laughed. He claimed that a good day might have 10 to 12 users on the site. While the tester was likely exaggerating, the point we should draw from it is that efficiency testing, like all other testing, must be based on risk. Not every type of testing must always be done to every software system.

Efficiency failures can include slow response times, inadequate throughput, reliability failures under conditions of load, and excessive resource requirements. Efficiency defects are often design flaws at their core, which make them very hard to fix during late-stage testing. So, efficiency testing can and should be done at every test level, particularly during design and coding (via reviews and static analysis).

There are a lot of myths surrounding performance testing. Let’s discuss a few.

Some testers think that the way to performance test is to throw hundreds (thousands?) of virtual users against the system and keep ramping them up until the system finally breaks down. The truth is that most performance testing is done while measuring the working system without causing it to fail. There is a kind of performance testing that does try to find the breaking point of the system, but it is a small part of the entire range of ways we test.

A second myth states that we can only do performance testing at the end of system test. This is dangerously wrong and we will address it extensively. As noted, performance testing, like all other testing, should be pervasive throughout the life cycle.

Last, we often hear the myth that a good performance tester only needs to know about a performance tool. Learn the tool and you can walk into any organization and start making big money running tests tomorrow. Turns out this is also dangerously false. We will discuss all of the tasks that must be done for good performance testing before we ever turn on a tool.

4.4.1 Multiple Flavors of Efficiency Testing

There is an urban myth that the native Inuit peoples have more than 50 different names for snow, based on nuances in snow that they can see. While researching this story, we found there is wide dispute as to whether this is myth or provable fact.21 If factual, the theory is that they have so many names because to the Inuit, who live in the snow through much of the year, the fine distinctions are important while to others the differences are negligible. It depends on your viewpoint. Consider that in America we have machines that have very little difference between them; these go by the names Chevy, Buick, Cadillac, Ford, and so on. Show them to an Inuit and they might fail to see any big distinction between them.

One of the most talented performance testers that Jamie ever met once showed Jamie a paper he was writing that enumerated some 40 different flavors of performance testing. Frankly, as Jamie read it, he did not understand many of the subtle differentiations the author was making. However, listening to others review the paper was an education in itself as they discussed subtle differences that Jamie had never considered.

The one thing we know for sure is that efficiency testing covers a lot of different test types. Shortly we will provide a sampling of the kinds of testing that might be performed. We have used definitions from the ISTQB glossary and ISTQB Advanced syllabus when available. Other definitions come from a performance testing class that was written by Rex. A few of the definitions come from a book by Graham Bath and Judy McKay.22 In each case, we tried to pick definitions that a wide array of testers have agreed on.

Most of these disparate test types go by the generic name performance testing. From the ISTQB glossary comes the following definition for performance testing itself: “the process of testing to determine the performance of a software product.”

A better definition can be constructed by blending the glossary definition of performance with that of performance testing as follows: Testing to evaluate the degree to which a system or component accomplishes its designated functions, within given constraints, regarding processing time and throughput rate.

A classic performance or response-time test looks at the ability of a component or system to respond to user or system inputs within a specified period of time, under various legal conditions. It can also look at the problem slightly differently, by counting the number of functions, records, or transactions completed in a given period; this is often called throughput. The metrics vary according to the objectives of the test.

So, with that in mind, here are some specific types of efficiency testing:

  • Load testing: A type of performance testing conducted to evaluate the behavior of a component or system with increasing load (e.g., numbers of parallel users and/or numbers of transactions) to determine what load can be handled by the component or system. Typically, load testing involves various mixes and levels of load, usually focused on anticipated and realistic loads. The loads often are designed to look like the transaction requests generated by certain numbers of parallel users. We can then measure response time or throughput. Some people distinguish between multiuser load testing (with realistic numbers of users) and volume load testing (with large numbers of users), but we’ve not encountered that too often.

  • Stress testing: A type of performance testing conducted to evaluate a system or component at or beyond the limits of its anticipated or specified workloads or with reduced availability of resources such as access to memory or servers. Stress testing takes load testing to the extreme and beyond by reaching and then exceeding maximum capacity and volume. The goal here is to ensure that response times, reliability, and functionality degrade slowly and predictably, culminating in some sort of “go away I’m busy” message rather than an application or OS crash, lockup, data corruption, or other antisocial failure mode.

  • Scalability testing: Takes stress testing even further by finding the bottlenecks and then testing the ability of the system to be enhanced to resolve the problem. In other words, if the plan for handling growth in terms of customers is to add more CPUs to servers, then a scalability test verifies that this will suffice. Having identified the bottlenecks, scalability testing can also help establish load monitoring thresholds for production.

  • Resource utilization testing: Evaluates the usage of various resources (CPU, memory, disk, etc.) while the system is running at a given load.

  • Endurance or soak testing: Running a system at high levels of load for prolonged periods of time. A soak test would normally execute several times more transactions in an entire day (or night) than would be expected in a busy day to identify any performance problems that appear after a large number of transactions have been executed. It is possible that a system may stop working after a certain number of transactions have been processed due to memory leaks or other defects. Soak tests provide an opportunity to identify such defects, whereas load tests and stress tests may not find such problems due to their relatively short duration.

  • Spike testing: The object of a spike test is to verify system stability during a burst of concurrent user and/or system activity to varying degrees of load over varying time periods. This type of test might look to verify a system against the following business situations:

    A fire alarm goes off in a major business center and all employees evacuate. The fire alarm drill completes and all employees return to work and log into an IT system within a 20-minute period.

    A new system is released into production and multiple users access the system within a very small time period.

    A system or service outage causes all users to lose access to a system. After the outage has been rectified, all users then log back onto the system at the same time.

    Spike testing should also verify that an application recovers between periods of spike activity.

  • Reliability testing: Testing the ability of the system to perform required functions under stated conditions for a specified period of time or number of operations.

  • Background testing: Executing tests with active background load, often to test functionality or usability under realistic conditions.

  • Tip-over testing: Designed to find the point where total saturation or failure occurs. The resource that was exhausted at that point is the weak link. Design changes (ideally) or more hardware (if necessary) can often improve handling and sometimes response time in these extreme conditions.

There are lots more, but our brains hurt. Unless we have a specific test in mind, we are just going to call all efficiency type testing by the umbrella name of performance testing for this chapter.

Not all of the tests just listed are completely disjoint; we could actually run some of them concurrently by changing the way we ramp up the load and which metrics we monitor.

To be meaningful, no matter which of these we run, there is much more to creating a performance test than buying a really expensive tool with 1,000 virtual user licenses and cranking up the volume. We will discuss how to model a performance test correctly in the next section.

We have been in a number of organizations that seemed to believe that performance testing could not even be started until late in system testing. The theory goes that performance testing has to wait until the system is pretty well complete, with all the functionality in and mostly working.

Of course, if you wait until then to do the testing and you find a whole bundle of bugs when you do test (usually the case), then your organization will have the choice of two really bad options: delay the delivery of the system into production while fixes are made (that could happen, but don’t hold your breath), or go ahead and deliver a crippled system while desperately scrambling to fix the worst of the defects. The latter is what we have mostly seen occur. Consider the melodrama of the 2013 rollout of healthcare.gov in the United States, perhaps the poster child for why not to skip the performance testing.

Good performance testing, like most good testing, should be distributed throughout all of the phases of the SDLC:

  • During the development phases: From requirements through implementation, static testing should be done to ensure meaningful requirements and designs from an efficiency viewpoint.

  • During unit testing: Performance testing of individual units (e.g., functions or classes) should be done. All message and exception handling should be scrutinized; each message type could be a bottleneck. Any synchronization code, use of locks, semaphores, and threading must be tested thoroughly, both statically and dynamically.

  • During integration testing: Performance testing of collections of units (builds, backbones, and/or subsystems) should be performed. Any code that transfers data between modules should be tested. All interfaces should be scrutinized for deadlock problems.

  • During system testing: Performance testing of the whole system should be done as early as possible. The delivery of functionality into test should be mapped so that those pieces that are delivered can be scheduled for the performance testing that can be done.

  • During acceptance testing: Demonstration of performance of the whole system in production should be performed.

Realism of the test environment generally increases with each level, until system test, which should (ideally) test in a replica of the production or customer environment.

4.4.2 Modeling the System

In the early days of performance testing, many an organization would buy or lease a tool, pick a single process, record a transaction, and immediately start testing. They would create multiple virtual users using the same profile, using the same data. To call such testing meaningless would be putting it mildly.

If a performance test is to be meaningful, there are a lot of questions that must be answered that are more important than, How many users can we get on the system at one time? Come to think of it, that question (by itself) is about as meaningful as the old question of how many teens you can squeeze into a phone booth.

Here then are some important questions that should be asked before we get into the physical performance testing process.

What is the proposed scope of the effort; exactly which subsystems are we going to be testing? Which interfaces are important? Which components are we going to be testing? Are we doing the full end-to-end customer experience or is there a particular target we are after? Which configurations will be tested? These are just some of the questions we need to think about.

How realistic is the test going to be? If production has several hundred massive servers and we are going to be testing against a pair of small, slow, ancient servers, any extrapolation we would attempt to do would still come up with a meaningless answer since we have no way of knowing the network bandwidth, load balancers, bottlenecks, and issues in the real production architecture.

How many concurrent users do we expect? Average? Peak? Spike? What tasks are they going to be doing? Odds are really good that all of the users will not be touching the same record with the same user ID.

What is the application workload mix that we want to simulate? In other words, how many different types of users will we be simulating on the system and what percentages do they make up (e.g., 20 percent login, 40 percent search, 15 percent checkout, etc.)?
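
To make the workload mix concrete, here is a small sketch of an operational profile expressed as percentages; the login/search/checkout figures come from the example just given, the remainder is invented so the mix totals 100 percent, and the helper converts the profile into virtual-user counts for a given total load.

    # Hypothetical operational profile: percentage of concurrent users per activity
    WORKLOAD_MIX = {
        "login": 20,
        "search": 40,
        "checkout": 15,
        "browse_account": 25,  # invented filler so the mix totals 100 percent
    }

    def virtual_users_per_activity(total_users, mix=WORKLOAD_MIX):
        assert sum(mix.values()) == 100, "workload mix must total 100 percent"
        return {activity: round(total_users * pct / 100) for activity, pct in mix.items()}

    print(virtual_users_per_activity(500))
    # {'login': 100, 'search': 200, 'checkout': 75, 'browse_account': 125}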

And while we are at it, how many different application workload mixes do we want on the system while we are testing? Many systems support several different concurrent applications running on the same servers. Testing only one may not be meaningful.

In that same vein, is virtualization going to be used? Will we be sharing a server with other virtualized processes? Will our processes be spread over multiple servers? Our research and discussions with a number of performance testers show that there are a lot of different opinions as to how virtualization can affect performance testing.

Which back-end processes are going to be running during the testing? Any batch processes? Any dating processes? Month end processing? Those processes are going to happen in real life; do we need to model them for this test?

Be of good cheer—there are dozens more questions, but performance testing is possible to do successfully.

To give an example of a coherent methodology that an organization might use for doing performance testing of a web application, we have pulled one from the Microsoft Developer Network.23 This methodology consists of seven steps as detailed in the following sections.

4.4.2.1 Identify the Test Environment

Identify the test environment—and the production environment, including hardware, software, and network configurations.

Assess the expected test environment and evaluate how it compares to the expected production environment. Clearly, the closer our test system is to the expected production system, the more meaningful our test results can be. Remember, the production environment will undoubtedly already have a certain amount of load on it.

Balanced against that is the cost. Replicating the environment exactly as it is found in production is usually not going to happen. Somewhere we need to strike a balance.

Understand the tools and resources available to the test team. Having the latest and greatest of every tool along with an unlimited budget for virtual users would be a dream. If you work in the kinds of organizations that we have worked in, dreaming of that is as close as you will get.

Identify challenges that must be met during the process. Realistically, consider what is likely to happen. Many software people are unquenchable optimists; we just know that everything is going to go right this time. While we can hope, we need to plan as realists—or, as the Foundation syllabus says, be professional pessimists.

This first step is likely to be one that is revisited throughout the process as compromises are made and changes are made. Like risk analysis, which we discussed in Chapter 1, we need to always be reevaluating the future based on what we discover during the process.

4.4.2.2 Identify the Performance Acceptance Criteria

Identify the goals and constraints for the system. Remember that many in the project may not have thought these issues through. Testers can help focus the project on what we really can achieve. There are three main ways of looking at what we are interested in:

  • Response time: user’s main concern

  • Throughput: often the business’ concern

  • Resource utilization: system concerns

Identify system configurations that might result in the most desirable combinations of the items in the preceding list. This might take some doing since people in the project may not have considered these issues yet.

Identify project success criteria—how will you know when you are done? While it is tempting for testers to want to determine what success looks like, it is up to the project as a whole to make that determination. Our job is to capture information that allows the other project members to make informed decisions as to pass/fail. So which metrics are we going to collect? Don’t try to capture every different metric that is possible. Settle on a given set of measurements that are meaningful to proving success—or disproving it.
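
For instance, the agreed criteria might be captured as explicit thresholds that the collected metrics are later checked against; the numbers below are invented placeholders.

    # Hypothetical acceptance criteria agreed on with the stakeholders
    ACCEPTANCE_CRITERIA = {
        "p90_response_time_ms": 2_000,   # response time: the user's concern
        "throughput_tx_per_sec": 150,    # throughput: the business' concern
        "peak_cpu_utilization_pct": 80,  # resource utilization: the system concern
    }

    def check(measured):
        """Compare measured values against the criteria; returns a pass/fail map."""
        return {
            "p90_response_time_ms":
                measured["p90_response_time_ms"] <= ACCEPTANCE_CRITERIA["p90_response_time_ms"],
            "throughput_tx_per_sec":
                measured["throughput_tx_per_sec"] >= ACCEPTANCE_CRITERIA["throughput_tx_per_sec"],
            "peak_cpu_utilization_pct":
                measured["peak_cpu_utilization_pct"] <= ACCEPTANCE_CRITERIA["peak_cpu_utilization_pct"],
        }

    print(check({"p90_response_time_ms": 1_850,
                 "throughput_tx_per_sec": 163,
                 "peak_cpu_utilization_pct": 88}))
    # {'p90_response_time_ms': True, 'throughput_tx_per_sec': True, 'peak_cpu_utilization_pct': False}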

4.4.2.3 Plan and Design Tests

Model the system as mentioned earlier to identify key scenarios and likely usage.

Determine how to simulate the inevitable variability of scenarios—what do different users do and how do they do it? What is the business context in which the system is used? Focus on groups of users; look for common ways they interact with the system.

Define test data—and enough of it! Remember that different user groups will likely have distinctive differences in the data they use. Log files from production can be very helpful in gathering data information. Don’t forget to review the data you will be using with the actual users themselves when possible; they can help you find what you might have overlooked.

Make sure you consider timing as part of the data collection. Different groups will work at different rates. Not accounting for actual work patterns will very likely skew results. Don’t forget user abandonment; not every task is completed by all users. Consolidate all of the above suggestions into different models of system usage to be tested.

4.4.2.4 Configure the Test Environment

Prepare the test environment, tools, and resources needed to execute the models designed in the preceding step (plan and design tests). Don’t forget to generate background load as closely as possible to production. Validate that your environment matches production to the extent that it can and document where it doesn’t. Differences between test and production environments must be taken into account or the test results will not model reality.

Determine the schedule for when the necessary features will become available (i.e., match up with the SDLC). Not all functionality will likely be available on day one of testing.

And, finally, instrument the test environment to enable collection of the desired metrics.

4.4.2.5 Implement the Test Design

Implement the test design. Create the performance testing scripts using the tools available.

Ensure that the data parameterization is as needed. This is a good place to double-check that you have sufficient data to run your tests. If you are performing soak testing, you will need a lot of data.

Smoke test the design and modify scripts as needed. One phrase Jamie remembers vividly from his five years of Latin language classes: Quis custodiet ipsos custodes?24 Who will guard the guardians themselves? Nonvalidated tests could easily be giving us bogus information. Always ensure—before beginning the actual testing—that the test scripts are meaningful and performing the actual tasks you are expecting them to. Make sure to ask yourself, “Do these results make sense?” Do not report the results of the smoke test as part of the official test results.

4.4.2.6 Execute the Test

Run the tests, monitoring the results. Work with database, network, and system personnel to facilitate the testing (first runs often show serious issues that must be addressed). Ideally, any issues would have been addressed during the validation of the scripts, but expecting the unexpected is pretty much par for the course in performance as with all testing.

Validate the tests as being able to run successfully, end to end. Run the testing in one- to two-day batches to constantly ask the reasonableness question: Are the results we are getting sensible? Beware of a common mistake made among scientists, however. When the results are not what was expected, scientists sometimes believe that the tests are invalid rather than that there might be something wrong with their hypotheses. It might just be that your expectations were wrong.25

Execute the validated tests for the specified time and under the defined conditions to collect the metrics. It often makes sense to repeat tests to ensure that the results are similar. If they are not similar, why not? Often there are hidden factors that might be missed if tests are only run once. When you stop getting valuable information, you have run the tests enough.

4.4.2.7 Analyze the Results, Tune and Retest

Analyze the completed metrics. Do they prove what you wanted to prove? If they do not match expected results, why not?

Consolidate and share results data with stakeholders. As with all other testing, remember that you must report the results to stakeholders in a meaningful way. The fact that you had 1,534 users active at the same time is really cool; however, the stakeholders are more interested in whether the system will support their needed business goals.

Tune the system. Can we make changes to the system that positively affect its performance? This becomes more important the closer to production your test environment is. Small changes can often create huge differences in the performance of the system. Our experience is that those changes are often negative. This task will, of course, depend on time and resources.

How do you know when you are done with efficiency testing? According to the MS test guide:

When all of the metric values are within accepted limits, none of the set thresholds have been violated, and all of the desired information has been collected, you have finished testing that particular scenario on that particular configuration.

You may or may not have the time and resources to completely finish performance testing your system to the extent mentioned here. If not, don’t let it stop you from performing those parts you can afford. Don’t let the perfect be the enemy of the good!

4.4.3 Time Behavior

ISO 9126 identifies three subcharacteristics for efficiency: performance (time behavior), resource utilization, and compliance.

The good thing for technical test analysts is that many of our measurements for efficiency testing are completely quantifiable. We can measure them directly. Well, kind of. The truth is, every time we make a measurement, we can get an exact value. It took 3 milliseconds for this thing to occur. We had 100 virtual users running concurrently, doing this, this, and this.

In performance testing, we often have to run the same test over and over again so that we can average out the results. Because the system is extremely complex, and other things may be going on, and timing of all the things going on may not be completely deterministic, and CPU loading may be affected by internal processes and a hundred other things...well, you get the picture. So, we run the tests multiple times, measure the things we want to an exact amount, and average them over the multiple runs.
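
A trivial sketch of that averaging step, assuming the raw response times have already been collected from whatever tool is in use:

    import statistics

    def summarize(response_times_ms):
        """Collapse repeated measurements into the figures usually reported."""
        return {
            "mean": statistics.mean(response_times_ms),
            "median": statistics.median(response_times_ms),
            "p90": statistics.quantiles(response_times_ms, n=10)[-1],  # ~90th percentile
        }

    run_1 = [120, 130, 125, 128, 900]  # note the single outlier
    run_2 = [118, 127, 131, 124, 126]
    print(summarize(run_1 + run_2))

Reporting a percentile alongside the mean matters because a single outlier (like the 900 ms value above) can badly distort an average.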

When Jamie was in the military, they used to joke about how suppliers met government specifications. Measure something with a micrometer and then cut it with a chainsaw. Sometimes that is how we feel about the metrics we get from performance testing.

Time-critical systems, which include most safety-critical, real-time, and mission-critical systems, must be able to provide their functions in a given amount of time. Even less critical systems like e-commerce and point-of-sale systems should have good time response to keep users happy.

Time behavior uses measurements that look at the amount of time it takes to do something. The following measurements are possible:

  • Number of context switches per second

  • Network packet average round-trip time

  • Client data presentation time

  • Amount of time a transaction takes between any pair of components
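
To make these measurements concrete, here is a minimal sketch, in Python, of how a single transaction might be timed end to end and the results averaged over repeated runs. The run_transaction() function is a placeholder for whatever the test harness actually drives; it is not part of any real tool.

    import time
    import statistics

    def run_transaction():
        # Placeholder: in a real harness this would submit one request
        # and wait for the corresponding response.
        time.sleep(0.05)

    def measure_transaction_times(runs=30):
        """Time the same transaction repeatedly and summarize the samples."""
        samples = []
        for _ in range(runs):
            start = time.perf_counter()          # high-resolution timer
            run_transaction()
            samples.append(time.perf_counter() - start)
        return {
            "mean_s": statistics.mean(samples),
            "worst_s": max(samples),
            "stdev_s": statistics.stdev(samples),
        }

    if __name__ == "__main__":
        print(measure_transaction_times())

Averaging over many runs, as described earlier, smooths out the nondeterministic factors (background processes, network jitter, and so on) that make any single measurement suspect.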

Let’s discuss four separate scenarios where we may discover time behavior anomalies when performance testing:

  • Slow response under all load levels

  • Slow response under moderate loading of the system where the amount of loading is expected and allowed

  • Response that degrades over time

  • Inadequate error handling when loaded

First, let’s discuss the underlying graph that we will use to illustrate these bugs. In Figure 4–6, the vertical scale represents the average amount of time transactions are taking to process. In general, the less time a transaction takes to execute, the happier the user will be. The horizontal scale shows the transaction arrival rate; in other words, how many transactions the system is trying to execute over a specified time.

Normally, as more and more transactions arrive, we would expect that the system may get a little slower (shown by the line getting a little higher on the graph the farther to the right it travels). Ideally, the line will stay in the gray, lower area, which is labeled the acceptable performance area. As long as the line stays in the gray, we are within the expected performance range and we would expect our users to be satisfied with the service they are getting.

Image

Figure 4–6 Unacceptable performance at any load example

In Figure 4–6, you can see that we definitely have a problem. Even with no loading at all, the performance is just barely acceptable; as load just begins to ramp up, we move immediately out of the acceptable range. This is something we would expect [hope] to find during functional testing, long before we start performance testing. However, we often miss it because functional testing tends to test the system with a single user.

Some of the issues that might be causing this include a bad database design or implementation in which accessing data simply takes too long. Network latency may be problematic, or the server might be too loaded with other processes. This is a case where monitoring a variety of different metrics should quickly point out the problem.

In Figure 4–7, we start out well within the acceptable range. However, there is a definite knee before we get to 400 transactions per hour. Whereas the response had been degrading slightly and in a linear fashion, all of a sudden the degradation gets much faster, moving rapidly out of the acceptable range at about 500 transactions per hour.

This is representative of a resource reaching its capacity limit and saturating. Looking at the key performance indicator metrics at this point will generally show this; we may have high CPU utilization, insufficient memory, or some other similar problem. Again, the problem could also be that there are background processes that are chewing up the resources.

Image

Figure 4–7 Slow response under moderate loading example

Image

Figure 4–8 Response that degrades over time example

In Figure 4–8, we show several curves. The first, solid, line shows a sample run early in the test. The next, dashed, line shows a run that was made somewhat later, and the dotted line shows a run made even later into the test.

What we are seeing here is a system that is degrading with time. The exact same load run later in the test was markedly slower than the previous run, and the third run was worse yet.

This looks like a classic case of memory leaking, or disk access slowing down due to fragmentation. Notice that there is no knee in this graph; no sudden dislocation of resources. It is like a balloon slowly losing air; if the system kept running, we would expect performance to eventually reach unacceptable levels even at low loading.

Image

Figure 4–9 Inadequate error handling example

Finally, in Figure 4–9, we see a system that does not look too bad right up until it is fairly heavily loaded. At about 900 transactions per hour, we see a knee where response starts rapidly rolling off. Is this good or bad? Anytime you see a graph or are presented with metrics, remember that everything is relative. It might be really good if the server was rated at 500 transactions per hour, but in this case we want to achieve 1,000 transactions per hour.

Suppose it were rated at 900 transactions per hour, and it is [barely] in tolerance; what else does the graph show? Looking at the legend on the graph, the assumption is that error handling is problematic. The real concern should be seen as what is happening at the very tail end of the curve. The only way the curve can go down after being in the unacceptable range is if it is sloughing off requested transactions. In other words, more transactions are being requested, but the system is denying them. These transactions may be explicitly denied (not good but understandable to the user) or simply lost (which would be totally unacceptable).

Possible causes of the symptoms might be insufficient resources, queues and stacks that are too small, or time-out settings that are too short.

Ideally, performance testing will be run with experts standing by to investigate anomalies. Unlike other testing, where we might write an incident report to be read sometime later by the developer, the symptoms of performance test failures are often investigated right away, while the test continues. In this case, the server, network, and database experts are liable to be standing by to troubleshoot the problems right away.

We will discuss the tools that they are likely to be using in Chapter 6.

4.4.3.1 Internal Time Behavior Metrics

Response time: A measure of the estimated time to perform a given task. This measurement estimates the time it will take for a specific action to complete based on the efficiency of the operating system and application system calls. Any of the following might be estimated:

  • All (or parts) of the design specifications

  • Complete transaction path

  • Complete modules or parts of the software product

  • Complete system during test phase

Clearly, a shorter time is better. The inputs to this measurement are the known characteristics of the operating system added to the estimated time in system calls. Both developers and analysts would be targets for this measurement.

Throughput time: A measure of the estimated number of tasks that can be performed over a unit of time. This is measured by evaluating the efficiency of the resources of the system that will be handling the calls as well as the known characteristics of the operating system. The greater this number, the better. Both developers and analysts would be targets for this measurement.

Turnaround time: A measure of the estimated time to complete a group of related tasks performing a specific job. As in the other time-related metrics, we need to estimate the operating system calls that will be made and the application system calls involved. The shorter the time, the better. The same entities listed for response time can all be estimated for this metric. Both developers and analysts would be targets for this measurement.

4.4.3.2 External Time Behavior Metrics

ISO 9126-2 describes several external metrics for efficiency. Remember that these are to be measured during actual dynamic testing or operations. The standard emphasizes that these should be measured over many test cases or intervals and averaged since the measurements fluctuate depending on conditions of use, processing load, frequency of use, number of concurrent users, and so on.

Response time: A measure of the time it takes to complete a specified task. Alternately, how long does it take before the system responds to a specified operation? To measure this, record the time (T1) a task is requested. Record the time that the task is complete (T2). Subtract T1 from T2 to get the measurement. Sooner is better.

Mean time to response: A measure of the average response time for a task to complete. Note that this metric is meaningful only when it is measured while the task is performed within a specified system load in terms of concurrent tasks and system utilization. To calculate this value, execute a number of scenarios consisting of multiple concurrent tasks to load the system to a specified value. Measure the time it takes to complete the specified tasks. Then calculate the metric using the formula

X = Tmean / TXmean

where Tmean is the average time to complete the task (for N runs) and TXmean is the required mean time to response. The required mean time to response can be derived from the system specification, from user expectation of business needs, or through usability testing to observe the reaction of users. This value will evaluate as 0 <= X. We would want this to measure less than but close to 1.0.

Worst case response time: A measure of the absolute limit on the time required to complete a task. This metric is subjective, in that it asks if the user will always get a reply from the system in a time short enough to be tolerable for that user. To perform this measurement, emulate a condition where the system reaches maximum load. Run the application and trigger the task a number of times, measuring the response each time. Calculate the metric using the formula

X = Tmax / Rmax

where Tmax is the maximum time any one iteration of the task took and Rmax is the maximum required response time. This value will evaluate as 0 <= X. We would want this to measure less than but close to 1.0.
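
As a simple illustration, both ratios can be computed directly from a set of measured response times. In this sketch the measured values, the required mean (TXmean), and the required worst case (Rmax) are all assumed numbers chosen for the example, not values from any real specification.

    def mean_time_to_response_ratio(measured_times, required_mean):
        """X = Tmean / TXmean; 0 <= X, ideally less than but close to 1.0."""
        t_mean = sum(measured_times) / len(measured_times)
        return t_mean / required_mean

    def worst_case_response_ratio(measured_times, required_max):
        """X = Tmax / Rmax; 0 <= X, ideally less than but close to 1.0."""
        return max(measured_times) / required_max

    # Ten measured response times in seconds, with a required mean of
    # 2.0 seconds and a required worst case of 3.0 seconds (assumed values).
    times = [1.4, 1.7, 1.6, 1.9, 2.1, 1.5, 1.8, 1.6, 2.0, 1.7]
    print(mean_time_to_response_ratio(times, required_mean=2.0))  # about 0.87
    print(worst_case_response_ratio(times, required_max=3.0))     # 0.70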

Throughput: A measure of how many tasks can be successfully performed over a given period of time. Notice that these are likely to be different tasks, all being executed concurrently at a given load. To calculate this metric, use the formula

X = A / T

where A is the number of completed tasks and T is the observational time period. The larger the value, the better. As before, this measurement is most interesting when comparing the mean and worst case throughput values as we did with response time.

Mean amount of throughput: The average number of concurrent tasks the system can handle over a set unit of time, calculated using the same formula (X = Xmean / Rmean), where Xmean is the average throughput and Rmean is the required mean throughput.

Worst case throughput ratio: The absolute limit on the system in terms of the number of concurrent tasks it must perform. To calculate it, use the analogous ratio (X = Xmax / Rmax), where Xmax is the worst case observed throughput and Rmax is the required worst case throughput.

Turnaround time: A measure of the wait time the user experiences after issuing a request to start a group of related tasks until their completion. As we might expect, this is most meaningful when we look at the mean and worst case turnaround times as follows.

Mean time for turnaround: The average wait time the user experiences compared to the required turnaround time as calculated by (X = Tmean / TXmean). This should be calculated at a variety of load levels.

Worst case turnaround time ratio: The absolute acceptable limit on the turnaround time, calculated in the same way we saw before (X = Tmax / Rmax).

Waiting time: A measure of the proportion of time users spend waiting for the system to respond. We execute a number of different tasks at different load levels and measure the time it takes to complete the tasks. Then, calculate the metric using the formula

X = Ta / Tb

where Ta is the total time the user spent waiting and Tb is the actual task time when the system was busy. The measurement will evaluate as 0 <= X. An efficient system, able to multitask efficiently, will have a waiting time of close to zero.

4.4.4 Resource Utilization

For many systems, including real-time, consumer-electronics, and embedded systems, resource usage is important. You can’t always just add a disk or add memory when resources get tight, as the NASA team managing one of the Mars missions found out when storage space ran out.26

Resource utilization uses measurements of the actual or projected resource usage required to perform a task.

Within this area there are dozens of different metrics that can be captured. Here we have listed some of the most important metrics that an organization might want to collect.27

  • Processor utilization percentage at key points of the test.

  • Available memory, both RAM and virtual, at different points through the test. That includes memory page usage.

  • Top n processes active—remembering that some of them may not be part of the test but may be internal or external processes running concurrently.

  • Length of queues (processor, disk, etc.) at any given time.

  • Disk saturation and usage

  • Network errors—both in- and outbound
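
As one possible way to gather a few of the metrics just listed, the following sketch samples processor, memory, disk, and network-error figures at fixed intervals using the third-party psutil library (assuming it is installed). The sampling interval and the particular metrics chosen are illustrative only.

    import time
    import psutil  # third-party library: pip install psutil

    def sample_resources():
        """Take one snapshot of a few key resource-utilization metrics."""
        mem = psutil.virtual_memory()
        net = psutil.net_io_counters()
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),   # 1-second CPU sample
            "mem_available_mb": mem.available / (1024 * 1024),
            "swap_percent": psutil.swap_memory().percent,
            "disk_used_percent": psutil.disk_usage("/").percent,
            "net_errors_in": net.errin,
            "net_errors_out": net.errout,
        }

    if __name__ == "__main__":
        # Sample every 30 seconds while the load test runs (illustrative loop).
        for _ in range(10):
            print(sample_resources())
            time.sleep(30)

In practice these snapshots would be written to a log with time stamps so they can be correlated with the load profile after the test.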

Remember: too many measurements waste time and resources; too few and you don’t learn what you need to know. Metrics are a science, an art, and perhaps the most frustrating thing we deal with when testing. Following are recommended ISO 9126 measurements in which an organization might be interested.

4.4.4.1 Internal Resource Utilization Metrics

All of these metrics are targeted at developers in reviews.

I/O utilization: A measure of the estimated I/O utilization to complete a specified task. The value of this metric is the number of buffers that are expected to be required (calculated or simulated). The smaller this value is, the better. Note that each task will have its own value for this metric.

I/O utilization message density: A measure of the number of error messages relating to I/O utilization in the lines of code responsible for making system calls. To calculate this value, count the number of error messages pertaining to I/O failure and warnings and compare that to the estimated number of lines of code involved in the system calls using the formula

X = A / B

where A is the number of I/O-related error messages and B is the number of lines of code directly related to system calls. The greater this ratio, the better.

Memory utilization: A measure of the estimated memory size that the software system will occupy to complete a specified task. This is a straightforward estimation of the number of bytes. Each different task should be estimated. As expected, the smaller this estimated footprint, the better.

Memory utilization message density: A measure of the number of error messages relating to memory usage in the lines of code responsible for making the system calls that are to be used. To calculate this metric, count the number of error messages pertaining to memory failure and warnings and compare that to the number of lines of code responsible for the system calls, using the formula

X = A / B

where A is the number of memory-related error and warning messages and B is the number of lines of code directly related to the system calls. The greater this ratio, the better.

Transmission utilization: A measure of the amount of transmission resources that will likely be needed based on an estimate of the transmission volume for performing tasks. This metric is calculated by estimating the number of bits that will be transmitted by system calls and dividing that by the time needed to perform those calls. As always, because this is an internal value, these numbers must all be either calculated or simulated.

4.4.4.2 External Resource Utilization Metrics

External resource utilization metrics allow us to measure the resources that the system consumes during testing or operation. These metrics are usually measured against the required or expected values.

I/O devices utilization: A measure of how much the system uses I/O devices compared to how much it was designed to use. This is calculated using the formula

X = A / B

where A is the amount of time the devices are occupied and B is the specified time the system was expected to use them. Less than and nearer to 1.0 is better.

I/O related errors: A measure of how often the user encounters I/O-type problems. This is measured at the maximum rated load using the formula

X = A / T

where A is the number of warning messages or errors encountered and T is the user operating time. The smaller this measure, the better.

Mean I/O fulfillment ratio: A measure of the number of I/O-related error messages and/or failures over a specified length of time at the maximum load. This is compared to the required mean using the formula

X = Amean / Rmean

where Amean is the average number of I/O error messages and failures over a number of runs and Rmean is the required28 average number of I/O-related error messages. This value will evaluate as 0 <= X. Lower is better. For example, assume that we allow 10 error messages over a certain period, and we actually average 2. That would give us an X value of 0.2. On the other hand, if we average 9 error messages over that period, we would get an X of 0.9.

User waiting time of I/O devices utilization: A measure of the impact of I/O utilization on the waiting time for users. This is a simple measurement of the waiting times required while I/O devices operate. As you might expect, the shorter this waiting time, the better. This should be measured at the rated load.

Maximum memory utilization: A measure of the absolute limit on memory required to fulfill a specific function. Despite its name, this measurement actually looks at error messages rather than the number of bytes needed in memory. This is measured at maximum expected load using the formula

X = Amax / Rmax

where Amax is the maximum number of memory-related error messages (taken from one run of many) and Rmax is the maximum (allowed) number of memory-related error messages. The smaller this value, the better.

Mean occurrence of memory error: A measure of the average number of memory related error messages and failures over a specified length of time and specified load on the system. We calculate this metric using the same formula as before:

X = Amean / Rmean

where Amean is the average number of memory error messages over a number of runs and Rmean is the maximum allowed mean number of memory-related error messages. The measure will evaluate as 0 <= X; the lower, the better.

Ratio of memory error/time: A measure of how many memory errors occurred over a given period of time and specified resource utilization. This metric is calculated by running the system at the maximum rated load for a specified amount of time and using the formula

X = A / T

where A is the number of memory-related warning messages and system errors that occurred and T is the amount of time. The smaller this value, the better.

Maximum transmission utilization: A measure of the ratio of the actual number of transmission-related error messages to the allowed (required) number while running at maximum load. This metric is calculated using the formula

X = Amax / Rmax

where Amax is the maximum number of transmission-related error messages (taken from one run of many) and Rmax is the maximum (allowed) number of transmission-related error messages. The smaller this value, the better.

Mean occurrence of transmission error: A measure of the average number of transmission-related error messages and failures over a specified length of time and utilization. This is measured while the system is at maximum transmission load. Run the application under test and record the number of errors due to transmission failure and warnings. The calculation uses the formula

X = Amean / Rmean

where Amean is the average number of transmission-related error messages and failures over multiple runs and Rmean is the maximum allowed number as defined earlier. The smaller the number, the better.

Mean of transmission error per time: A measure of how many transmission-related error messages were experienced over a set period of time and specified resource utilization. This value is measured while the system is running at maximum transmission loading. The application being tested is run and the number of errors due to transmission failures and warnings are recorded. The metric is calculated using the formula

X = A / T

where A is the number of warning messages or system failures and T is the operating time being measured. Smaller is better for this metric.

Transmission capacity utilization: A measure of how well the system is capable of performing tasks within the expected transmission capacity. This is measured while executing concurrent tasks with multiple users, observing the transmission capacity, and comparing it to specified values using the formula

X = A / B

where A is the measured transmission capacity and B is the specified transmission capacity designed for the software to use. Less than and nearer to one is better.

4.4.5 Compliance

4.4.5.1 Internal Compliance Metric

ISO 9126-3 defines efficiency compliance as a measure of how compliant the system is estimated to be against applicable regulations, standards, and conventions. To calculate this, we use the formula

X = A / B

where A is the number of items related to efficiency compliance that are judged—in reviews—as being correctly implemented and B is the total number of compliance items. The closer this value is to one, the more compliant it is.

4.4.5.2 External Compliance Metric

Finally, ISO 9126-2 defines an external efficiency compliance metric. This is a measure of how compliant the efficiency of the product is with respect to applicable regulations, standards, and conventions. Calculation of this metric is done using the formula

X = 1 - A / B

where A is the number of efficiency compliance items that have not been implemented during testing and B is the total number of efficiency compliance items that have been specified for the product. The closer to one, the better.

4.4.6 Exercise: Efficiency Testing

Using the HELLOCARMS system requirements document, analyze the risks and create an informal test design for efficiency testing.

4.4.7 Exercise: Efficiency Testing Debrief

We selected requirement 040-010-080. This requirement states that “once a Senior Banker has made a determination, the information shall be transmitted to the Telephone Banker within two (2) seconds.”

Once again, we would use automation to test this requirement. This one interests us because of the issue of measuring time on two different workstations. If their real-time clocks were set to appreciably different times, then any measurements that we could make would be suspect.

Jamie actually had a similar problem a few years ago that he had to solve; we would use the same solution here. The solution consists of writing a simple listener application on a separate workstation. When the Telephone Banker’s workstation is triggered to escalate a loan to the Senior Banker, a message is sent to the listener, which logs it in a text file using a time stamp from its own real-time clock. Automation on the Senior Banker workstation will handle the request. When it finishes, it sends a message to the same listener, which logs it in the same file, again with a time stamp. When the Telephone Banker automation, which has been in a waiting state for the return, gets the notification, it sends another message to the listener.
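
A minimal version of such a listener, sketched here in Python, simply accepts short event messages over the network and logs each one with a time stamp taken from its own clock. The port number, log file name, and message contents are arbitrary choices for illustration, not part of the HELLOCARMS design.

    import socket
    from datetime import datetime

    HOST, PORT = "0.0.0.0", 5005            # illustrative values
    LOG_FILE = "escalation_timing.log"

    def run_listener():
        """Log every received event message with this machine's time stamp."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # simple UDP listener
        sock.bind((HOST, PORT))
        with open(LOG_FILE, "a") as log:
            while True:
                data, addr = sock.recvfrom(1024)
                stamp = datetime.now().isoformat(timespec="milliseconds")
                log.write(f"{stamp}\t{addr[0]}\t{data.decode(errors='replace')}\n")
                log.flush()

    if __name__ == "__main__":
        run_listener()

The automation on each workstation would send one-line messages (for example, "ESCALATION_SENT" or "DECISION_RETURNED") to this listener with sendto(); because every entry in the log file is stamped by the same clock, the elapsed times can be compared without worrying about clock differences between the two workstations.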

Note that this same test would also satisfy the conditions necessary to test requirement 040-010-070, which requires no more than a one-second delay for the escalation to occur.

We are disregarding the transport time from both automated workstations. This may be problematic, but we are going to assume (with later testing to confirm) that three local test workstations in the same lab running on the same network will incur pretty much the same transport time, canceling them out. Even so, because the times we are testing are relatively large (one second and two seconds), we believe the testing would likely be valid.

4.5 Maintainability Testing

Learning objectives

Only common learning objectives.

Maintainability refers to the ability to update, modify, reuse and test the system. This is important for most systems because most will be updated, modified, and tested many times during their life. Often pieces of systems and even whole systems are used in new and different situations.

Why all of the changes to a system? Remember that when we discussed reliability, we said that software does not wear out, but it does become obsolete. We will want new and extended functionality. There will also be patches and updates to make the system run better. New environments will be released that we must adapt to, and interoperating systems will be updated, usually requiring updates on our system.

Jamie remembers one of his first software experiences; they had worked for over six months putting together and delivering the new system. Jamie was so glad to see it go that he spoke out loud, “Hope I never see that software again!” Everyone laughed at him, not believing that he did not know how often that boomerang was going to come back at them.

What Jamie did not know at the time was that only a small fraction of the overall cost of a system was spent in the original creation and rollout. On the day you ship that first release, you can be pretty sure that 80 percent or more of the eventual cost has not yet been incurred. Or, as the Terminator said, “I’ll be back.”

As discussed in the ISTQB Foundation level syllabus, we cannot do maintenance testing on a system that is not already in production. After we ship the first time, we start what some call the SMLC: the software maintenance life cycle.

Okay, quick! Just off the top of your head, come up with a dynamic maintainability test for HELLOCARMS. We’ll wait. Hmmmmmmm.

Tough to do, isn’t it?

The simple fact is that much of maintainability testing is not going to be done by scripting test cases and then running them when the code gets delivered. Many, if not most, maintainability defects are invisible to dynamic testing.

Maintainability defects include hard-to-understand code, environment dependencies, hidden information and states, and excessive complexity. They can also include “painted yourself into a corner” problems when software is released without any practical mechanism for updating it in the field. For example, think of all the problems Microsoft had stabilizing their security-patch process in the mid-2000s.

Design problems. Conceptual problems. Standards and guidelines (or lack of same) compliance. We have a good way of finding these kinds of issues. It’s called static testing. From the first requirements to the latest patch, there is likely no better way to ensure that the system is maintainable. This is one case where a new tool, a couple of scripted tests, or reasonably attentive testers are not going to magically transform the pig’s ear into a silk purse.

Management must be made to understand the investment that good maintainability is going to entail. This must be seen as a long-term investment because much of the reward is going to come down the road. And, it is the worst kind of investment to try to sell—one that is mostly invisible. If we do a good job building a maintainable system, how do we show management the rewards in a physical, tangible way?

Well, we won’t have as many patches, but we can’t prove without a doubt it was due to the investment. Our maintenance programmers will make fewer mistakes, leading to fewer regression bugs, but we can’t necessarily point to the mistakes we did not make.

This is not just theoretical. Jamie was a test lead in a small start-up organization and he kept on arguing to put some standards and guidelines around the code. To spend some of the design time thinking about making sure that the application was going to be changeable. To spend some extra time documenting the assumptions we were making. To write self-documenting code using naming conventions. For each argument he made, he was challenged to prove the payback in empirical terms. He pretty much failed and the system that was built turned out to be a nightmare.

To say that we need to test the above-mentioned issues is not to say you shouldn’t test updates, patches, upgrades, and migration. You definitely should. You should not only test the software pieces, but also the procedures and infrastructure involved. For example, verifying that you can patch a system with a 100 MB patch file is all well and good until you find you have forgotten that real users will have to download this patch through a slow and balky Internet connection and will need a PhD in computer science to hand-install half of the files!

Here are a few of the project issues that exacerbate maintainability problems.

Schedules: Get the system out the door. Push it, prod it, nudge it, just get it out the door! Is it maintainable? Come on, we don’t have time for joking. Often, the consensus of the team seems to be that we can always fix it when we have time. Let us ask the question that needs to come after that statement: Do we ever have time? As soon as we get this thing out the door, the next project is right there, filling our inbox. In our entire careers, we have never enjoyed that downtime we were led to expect, when we could catch up on the things we shelved.

Frankly, this may not be the fault of the team, entirely. The human brain seems to have hard wiring in it for this short-term gratification versus long-term benefit calculation.

If I eat this cookie now, I intellectually know that I will have to work out—sometime later—much harder than I like to. But what the heck—the cookie looks so good right now.29

Testers must learn to push the idea of short-term pain, long-term value when it comes to maintainability.

Optimism: If we can get it to work now, it likely will always work. Why invest a lot of effort in improving maintainability when we probably won’t have any problems with it?

We wish we had a nickel for every time we heard a development manager (or project manager) say that they have hired the very best people available so of course it will work. Robert Heinlein once wrote about the optimism of a religious man sitting in a poker game with four aces in the hole...

Of course, when the project doesn’t work out successfully, it must have been the failure of the testers. And the regression bugs were simply one-time things. And on and on. While each incident is unique, the pattern of failures isn’t.

Contracts: These are often the problem. The contract might specify the minimum functionality that has to be delivered. We don’t have time to build a better system—it is not what they asked for. We low-balled the estimate to get the job, so we can’t afford to make it good too.

Initiation: One issue that we have seen repeatedly is the idea of initiation into the club. Many developers start their career doing maintenance programming on lousy systems. They have “paid their dues”! One might think that this would teach the necessity of building a maintainable system—and sometimes it does. But often we run into the mind-set of “we had to do it; you should have to do it.”

Jamie’s wife is a nurse, so he has had the chance to socialize with a number of doctors at holiday parties, picnics, and such. You might be surprised how little sympathy there is for interns and residents who often have to work 36- to 48-hour shifts. The prevailing opinion of many doctors is that “we had to do it and we survived—they should do it also.” It goes with the territory! Frankly, we hate that phrase.

Lack of ownership is clearly a problem. Maintainability, as well as quality, should be owned by the team as a whole, but rarely is.

Short timer syndrome: And finally, this one is probably not as big a problem as many think. We have occasionally heard that, “I won’t be here anymore when the bill comes due.” We call that short-timer thinking and used to see it in the United States military during the sixties and seventies when the military draft was the norm rather than voluntary service. “I’m outa here in 17 days. I’m so short I can’t see over a nickel if I was standing on a dime!”

There are probably a dozen more issues that we have not thought of. The fact is that education is the solution to many of the reasons given for ignoring maintainability. But, we have to make sure that the reasons we give to insist on maintainable development are colored green. It is about dollars or euros or pounds, or whatever term you think in. Money is the reason we should care about maintainability. Time and resources are important, but the tie-in to cost must be made for management to care.

Because maintainability is such a broad category, perhaps the best way to discuss it is to break it up into its subcategories as defined by the ISO 9126 standard. These include analyzability, changeability, stability, testability, and compliance.

4.5.1 Analyzability

The definition of analyzability, as given in ISO 9126, is the capability of the software product to be diagnosed for deficiencies or causes of failures or for the parts to be modified to be identified. In other words, how much effort will it take to diagnose defects in the system or to identify where changes can be made when needed?

Here are four common causes of poor analyzability in no particular order.

In the old days, we called it spaghetti code. Huge modules tied together with GOTO or jump constructs. No one we know still writes code like that, but some techniques are still being used that are not much better.

One of the basic tenets of Agile programming is a tactic called refactoring. Part of refactoring is that, when you find yourself writing the same code more than once, you rewrite it as a callable function and then call that function in each place it is needed. This is a great idea that every programmer should follow. Instead, what many programmers do is copy and paste code they want to reuse. Each module then begins to be a junior version of spaghetti code. Code should be modular. Modules should be relatively small, unless speed is of paramount importance. Each module should be understandable. Thomas McCabe understood this when he came up with cyclomatic complexity.
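
As a deliberately simple, hypothetical illustration of the copy-and-paste problem, suppose the same discount calculation has been pasted into both a billing module and a quoting module. Refactoring pulls it into one callable function so that a later change is made in exactly one place.

    # Before: the same expression pasted into billing.py and quotes.py;
    # a change to the discount rule has to be found and repeated in both.

    # After refactoring: one definition, called wherever it is needed.
    def discounted_total(price: float, quantity: int) -> float:
        """Return the line total, applying a 5 percent bulk discount over 10 units."""
        discount = 0.05 if quantity > 10 else 0.0
        return price * quantity * (1 - discount)

    # billing.py and quotes.py now both call discounted_total(price, quantity).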

Back in the 1990s, while working at a large, multinational company, Jamie worked with a group that completely rewrote the operating system for a very popular mini-computer (going from PL/MP to C++). One main intention was to create a library of C++ classes that were reusable throughout the operating system. Management found that the library was not really being used extensively, so they investigated. It turned out, the group was told, that when a programmer needed a particular class, they were likely to search through the library for less than 10 minutes before giving up and writing their own class. What made this confusing was that writing their own class might take several days, and at that point, they would have a class where the code was not yet debugged. Had they searched a little longer, they likely would have found a completely debugged module that supplied the functionality they needed. Management termed this behavior the “not created here” problem. Had the company spent money on a librarian/archivist, they likely would have had no issues.

The second reason for poor analyzability is lack of good documentation. Many organizations try to save time by limiting the documentation that is created. Or, sometimes when documentation is required, it is just done poorly. Or, after changes are made, the documentation is not updated. Or, documentation is not under version control so there are a dozen different versions of a document floating around. Whatever the reason, good documentation helps us understand and analyze a system more easily.

The third reason for poor analyzability is poor—or nonexistent—standards and guidelines. Each programmer is liable to program in their own style—unless they are told to follow some kind of standards. These standards might include the following items:

  • Indentation, white space, and other structure guidelines

  • Naming conventions

  • Modular guidelines (At the company mentioned earlier, our rule of thumb was that no module should be longer than one printed page.)

  • Meaningful error messages and standard exception handling

  • Meaningful comments

Clearly, following some standards would help us analyze the system more easily.

The fourth reason involves code abstraction. Code abstraction, theoretically, is a good thing; like all good things, however, you can get too much of it. Object-oriented code is supposed to have a level of abstraction that allows the developers to build good, inheritable classes. By hiding the gory details in superclasses, a developer can ensure that other developers who inherit from those classes don’t depend on the details in their implementation. That way, if the implementation details have to change, it should not cause failures in the derived classes.

For example, suppose we supply a calculation for the sine of an angle. How that calculation works should not matter to any consumer of the calculation—as long as the calculation is correct. If we decide in a later release to change the way we make the calculation, it should not break any existing code that uses the value calculated. However, it is conceivable that a clever developer may decide to utilize some side effect of our original method of calculation because they understood how we originally made it. Now, changing our code is likely to break their code, and worse, we would not be aware of it.
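
A small, hypothetical sketch of that situation: the Trig class below promises only that sine() returns the sine of an angle. A consumer who reaches past that promise and uses the internal lookup table will break as soon as a later release computes the value differently.

    import math

    class Trig:
        """Public contract: sine(angle_degrees) returns the sine of the angle."""

        # Internal detail of release 1: a nearest-degree lookup table.
        _table = {d: math.sin(math.radians(d)) for d in range(360)}

        @classmethod
        def sine(cls, angle_degrees: float) -> float:
            return cls._table[round(angle_degrees) % 360]

    # A "clever" consumer that depends on the implementation detail:
    #     degrees_covered = len(Trig._table)
    # If release 2 replaces the table with a direct call to math.sin(),
    # Trig.sine() still honors its contract, but this consumer breaks.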

The more abstraction there is in a module, however, the harder it is to understand exactly what is being done. Each organization must decide how much to abstract and how much to clarify.

The fix for most of these problems is the same. Good, solid standards and guidelines. Enforcement via static testing at all times—especially when time is pinched. No excuses. Organizations that make it clear that poor analyzability is an important class of defects on its own that will not be tolerated often do not suffer from problems with this quality subcharacteristic.

4.5.1.1 Internal Analyzability Metrics

ISO 9126-3 defines a number of internal maintainability metrics that an organization may wish to track. These should help predict the level of effort required when modifying the system.

Activity recording: A measure of how thoroughly the system status is recorded. This is calculated by counting the number of items that are found to be written to the activity log as specified compared to how many items are supposed to be written based on the requirements. The calculation of the metric is made using the formula

X = A / B

where A is the number of items implemented that actually write to the activity log as specified (as confirmed in review) and B is the number of items that should be logged as defined in the specifications. The closer this value is to one, the more complete the logging is expected to be.

Readiness of diagnostic function: A measure of how thorough the provision for diagnostic functions is. A diagnostic function analyzes a failure and provides an output to the user or log with an explanation of the failure. This metric is a ratio of the implemented diagnostic functions compared to the number of required diagnostic functions in the specifications and uses the same formula to calculate it:

X = A / B

A is the number of diagnostic functions that have been implemented (as found in review) and B is the required number from the specifications. This metric can also be used to measure failure analysis capability in the system and ability to perform causal analysis.

4.5.1.2 External Analyzability Metrics

ISO 9126-2 defines external maintainability metrics that an organization may wish to track. These help measure such attributes as the behavior of the maintainer, the user, or the system when the software is maintained or modified during testing or maintenance.

Audit trail capability: A measure of how easy it is for a user (or maintainer) to identify the specific operation that caused a failure. ISO 9126-2 is somewhat abstract in its explanation of how to record this metric: it says to observe the user or maintainer who is trying to resolve failures. This would appear to be at odds with the way of calculating the metric, which is to use the formula

X = A / B

where A is the number of data items that are actually logged during the operation and B is the number of data items that should be recorded to sufficiently monitor status of the software during operation. As in many of the other metrics in this standard, an organization must define for itself exactly what the value for B should be. How many of the different error conditions that might occur should actually be logged? A value for X closer to one means that more of the required data is being recorded. An organization must see this as a tradeoff; the more we want to log, the more code is required to be written to monitor the execution, evaluate what happened, and record it.

Diagnostic function support: A measure of how capable the diagnostic functions are in supporting causal analysis. In other words, can a user or maintainer identify the specific function that caused a failure? Like the previous metric, the method of application merely says to observe the behavior of the user or maintainer who is trying to resolve failures using diagnostic functions. To calculate, use the formula

X = A / B

where A is the number of failures that can be successfully analyzed using the diagnostic function and B is the total number of registered failures. Closer to one is better.

Failure analysis capability: A measure of the ability of users or maintainers to identify the specific operation that caused a failure. To calculate this metric, use the formula

X = 1 - A / B

where A is the number of failures for which the causes are still not found and B is the total number of registered failures. Note that this is really the flip side of the previous metric diagnostic function support. This metric is a measure of how many failures we could not diagnose, where the previous metric is a measure of how many we did. Closer to 1.0 is better.

Failure analysis efficiency: A measure of how efficiently a user or maintainer can analyze the cause of failure. This is essentially a measure of the average amount of time used to resolve system failures and is calculated by the formula

X = Sum(T) / N

where T is the amount of time for each failure resolution and N is the number of problems resolved. Two interesting notes are included in the standard for this metric. ISO 9126-2 says that only the failures that are successfully resolved should be included in this measurement; however, it goes on to say that failures not resolved should also be measured and presented together. It also points out that person-hours rather than simply hours might be used for calculating this metric so that effort may be measured rather than simply elapsed time.
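
A small sketch of how this metric might be tracked from incident records follows, keeping resolved and unresolved failures separate as the standard suggests and measuring effort in person-hours. The record structure and field names here are invented for illustration.

    def failure_analysis_efficiency(incidents):
        """Return (mean person-hours per resolved failure, count still unresolved).

        X = Sum(T) / N is computed over resolved failures only; unresolved
        failures are counted and reported alongside, per ISO 9126-2's note.
        """
        resolved = [i for i in incidents if i["resolved"]]
        unresolved_count = len(incidents) - len(resolved)
        if not resolved:
            return None, unresolved_count
        mean_effort = sum(i["person_hours"] for i in resolved) / len(resolved)
        return mean_effort, unresolved_count

    # Example with assumed data: three resolved failures and one still open.
    incidents = [
        {"resolved": True, "person_hours": 4.0},
        {"resolved": True, "person_hours": 1.5},
        {"resolved": True, "person_hours": 6.5},
        {"resolved": False},
    ]
    print(failure_analysis_efficiency(incidents))  # (4.0, 1)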

Status monitoring capability: A measure of how easy it is to get monitored data for operations that cause failures during the actual operation of the system. This metric is generated by the formula

X = 1 - A / B

where A is the number of cases where the user or maintainer tried but failed to get monitor data and B is the number of cases where they attempted to get monitored data during operation. The closer to one the better, meaning they were more successful in getting monitor data when they tried.

4.5.2 Changeability

The second subcharacteristic for maintainability is changeability. The definition in ISO 9126 is the capability of the software product to enable a specified modification to be implemented. Once again, no dynamic test cases come to mind to ensure changeability. There are a number of metrics that ISO 9126 defines, but they are all retrospective, essentially asking, “Was the software changeable?” after the fact.

Virtually all of the factors that influence changeability are design and implementation practices.

Problems based on design include coupling and cohesion. Coupling and cohesion are terms that reference how a system is split into modules. Larry Constantine is credited with pointing out that high coupling and low cohesion are detriments to good software design.

Coupling refers to the degree that modules rely on the internal operation of each other during execution. High coupling could mean that there are a lot of dependencies and shared code between modules. Another possibility is that one module depends on the way the other module does something. If that dependency is changed, it breaks the first module. When high coupling is allowed, it becomes very difficult to make changes to a single module; changes here generally mean that there are more changes there and there and there. Low coupling is desirable; each module has a task to do and is essentially self-contained in doing it.

There are a number of different types of coupling:

  • Content coupling: when one module accesses local data in another.

  • Common coupling: when two or more modules share global data.

  • External coupling: when two or more modules share an external interface or device.

  • Control coupling: when one module tells another module what to do.

  • Data-structure coupling: when multiple modules share a data structure, each using only part of it.

  • Data coupling: when data is passed from one module to another—often via parameters.

  • Message coupling: when modules are not dependent on each other; they pass messages without data back and forth.

  • No coupling: when modules do not communicate at all.

Some coupling is usually required; modules usually have to be able to communicate. Generally, message coupling or data coupling would be the best options here.
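
A tiny, contrived example of the difference: the first pair of functions below is common-coupled through a shared global variable, so changing one silently changes the behavior of the other, while the last function is data-coupled, receiving everything it needs as a parameter.

    # Common coupling: both functions depend on the same global state.
    tax_rate = 0.07                       # shared global variable

    def price_with_tax_common(price):
        return price * (1 + tax_rate)     # silently depends on the global

    def apply_holiday_sale():
        global tax_rate
        tax_rate = 0.0                    # changes the behavior of the function above

    # Data coupling: the dependency is explicit and local to each call.
    def price_with_tax(price, rate):
        return price * (1 + rate)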

Cohesion, on the other hand, describes how focused the responsibilities of the module are. In general, a module should do a single thing and do it well. Indicators that cohesion is low include when there are many functions or methods in the module that do different things that are unrelated or when they work on completely different sets of data.

In general, low coupling and high cohesion go together and are a sign that changeability is going to be good. High coupling and low cohesion are generally symptoms of a poor design process and an indicator of poor changeability.

Changeability problems caused by improper implementation practices run the gamut of many of the things we were told not to do in programming classes.

Using global variables is one of the biggest problems; it causes a high degree of coupling in that a change to the variable in one module may have any number of side effects in other modules.

Hard-coding values into a module should be considered a serious potential bug. Named constants are a much better way for developers to write code; when they decide to change the value, all instances are changed at once. When hard-coded values (which some call magic numbers) are used, invariably some get changed and some don’t.
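
For example (a deliberately trivial sketch), compare a magic-number version of a calculation with one that names the value once; a maintainer who changes the named constant changes every use at the same time.

    # Magic number: 86400 is repeated wherever a day length is needed;
    # miss one occurrence during a change and the values silently disagree.
    def seconds_until_expiry_magic(days):
        return days * 86400

    # Named constant: the value is defined once and changed in one place.
    SECONDS_PER_DAY = 24 * 60 * 60

    def seconds_until_expiry(days):
        return days * SECONDS_PER_DAY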

Hard coding design to the hardware is also a problem. In the interest of speed, some developers like to program right down to the metal of the platform they are writing for. That might mean using implementation details of the operating system, hardware platform, or device that is being used. Of course, when time passes and hardware, operating systems and/or devices change, the software is now in trouble.

The more complex the system, the worse changeability is affected. As always, there is a trade-off in that sometimes, we need the complexity.

The project Jamie was on that we mentioned earlier, rewriting a mid-range computer operating system, was an example of software engineering done right! Every developer and most testers were given several weeks of full-time object-oriented development and design training. They were moving 25 million lines of PL/MP (an elegant procedural language) code to C++, not as a patch but a complete rewrite. They spent a lot of time coming up with standards and guidelines that every developer had to follow. They made a huge investment in static testing with training for everyone.

Low coupling and high cohesion were the buzzwords du jour. Perhaps buzzword is incorrect; they truly believed in what they were doing. Maintainability was the rule.

Their rule of thumb was to limit any method, function, or piece of code to what would fit on one page printed out. Maybe 20 to 25 lines of code. They used good object-oriented rules with classes, inheritance, and data hiding. They used messaging between modules to avoid any global variables. They had people writing libraries of classes and testing the heck out of them so they could really get reuse.

So you might assume that everything went fabulously. Well, the final result was outstanding, but there were more than a few stumbles along the way. They had one particular capability that the operating system had to deliver; their estimate was that using the new processor, they had to complete the task within 7,000 CPU cycles. After the rewrite, they found that it took over 99,000 CPU cycles to perform the action. They were off by a huge amount. And, because this action might occur thousands of times a minute, their design needed to be completely rethought. The fact is low coupling, high cohesion, inheritance, and data hiding have their own costs.

Every design decision has trade-offs. For many systems, good design and techniques are worth every penny we spend on them. But when speed of execution is paramount, very often those same techniques do not work well. In order to make this system work, they had to throw out all of the rules. For this module, they went back to huge functions with straight procedural code, global variables, low cohesion, and high coupling. And, they made it fast enough. However, as you might expect, it was very trouble prone; it took quite a while to stabilize it.

Poor documentation will adversely affect changeability. Where a maintenance programmer must guess at how to make changes, to infer as to what the original programmer was thinking, changeability is going to be greatly impacted. Documentation includes internal (comments in the code, self-documenting code through good naming styles, etc.) and external documentation, including high- and low-level design documents.

4.5.2.1 Internal Changeability Metrics

Internal changeability metrics help in predicting the maintainer’s or user’s effort when they’re trying to implement a specified modification to the system.

Change recordability: A measure of how completely changes to specifications and program modules are documented. ISO 9126-3 defines these as change comments in the code or other documentation. This metric is calculated with the formula

X = A / B

where A is the number of changes that have comments (as confirmed in reviews) and B is the total number of changes made. The value should be 0 <= X <= 1, where the closer to one, the better. A value near 0 indicates poor change control.

4.5.2.2 External Changeability Metrics

External changeability metrics help measure the effort needed when you’re trying to implement changes to the system.

Change cycle efficiency: A measure of how likely it is that a user’s problem can be solved within an acceptable time period. This is measured by monitoring the interaction between the user and the maintainer and recording the time between the user’s initial request and the resolution of their problem. The metric is calculated by using the formula

Tav = Sum(Tu) / N

where Tav is the average amount of time, Tu is the elapsed time for the user between sending the problem report and receiving a revised version, and N is the number of revised versions sent. The shorter Tav is, the better; however, large numbers of revisions would likely be counterproductive to the organization, so a balance should be struck.

Change implementation elapse time: A measure of how easily a maintainer can change the software to resolve a failure. This is calculated by using the formula

Tav = Sum(Tm) / N

where Tav is the average time, Tm is the elapsed time between when the failure is detected and the time that the failure cause is found, and N is the number of registered and removed failures. As before, there are two notes. Failures not yet found should be excluded, and effort (in person-hours) may be used instead of elapsed time. Shorter is better.

Modification complexity: This is also a measure of how easily a maintainer can change the software to solve a problem. This calculation is made using the formula

T = Sum(A / B) / N

where T is the average time to fix a failure, A is the work time spent making a specific change, B is the size of the change, and N is the number of changes. The size of the change may be the number of code lines changed, the number of changed requirements, the number of changed pages of documentation, and so on. The shorter this time, the better. Clearly, an organization should be concerned if the number of changes is excessive.

Software change control capability: A measure of how easily a user can identify a revised version of the software. This is also listed as how easily a maintainer can change the system to solve a problem. It is measured using the formula

X = A / B

where A is the number of items actually written to the change log and B is the number of change log items planned such that we can trace the software changes. Closer to one is better, although if there are few changes made, the value will tend toward zero.

4.5.3 Stability

The third subcharacteristic of maintainability is stability, defined as the ability of the system to avoid unexpected effects from modifications of the software. After we make a change to the system, how many defects are going to be generated simply from the change?

This subcharacteristic is essentially the side effect of all of the issues we dealt with in changeability. The lower the cohesion, the higher the coupling, and the worse the programming styles and documentation, the lower the stability of the system.

In addition, we need to consider the quality of the requirements. If the requirements are well delineated, well understood, and competently managed, then the system will tend to be more stable. If they are constantly changing and poorly documented and understood, then not so much.

System timing matters to stability. In real-time systems or when timing is critical, change will tend to throw timing chains off, causing failures in other places.

4.5.3.1 Internal Stability Metrics

Stability metrics help us predict how stable the system will be after modification.

Change impact: A measure of the frequency of adverse reactions after modification of the system. This metric is calculated by comparing the number of adverse impacts that occur after the system is modified to the actual number of modifications made, using the formula:

X = 1 - A / B

where A is the number of adverse impacts and B is the number of modifications made. The closer to one, the better. Note that, since there could conceivably be multiple adverse conditions coming from a sloppy change-management style, this metric could actually be negative. For example, suppose one change was made, but three adverse reactions were noted. The calculation, using this formula would be as follows:

X = 1 - 3/1 ==> -2.
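
Expressed as a tiny helper (a sketch only), the calculation and its possible negative result look like this:

    def change_impact(adverse_impacts, modifications):
        """X = 1 - A / B; closer to one is better, and the value can go negative."""
        return 1 - adverse_impacts / modifications

    print(change_impact(3, 1))   # -2.0, matching the worked example above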

Modification impact localization: A measure of how large the impact of a modification is on the system. This value is calculated by counting the number of affected variables from the modification and comparing it to the total number of variables in the product using the formula

X = A / B

where A is the number of affected variables as confirmed in review and B is the total number of variables. The definition of an affected variable is any variable in a line of code or computer instruction that was changed. The closer this value is to zero, the less the impact of modification is likely to be.

4.5.3.2 External Stability Metrics

Change success ratio: A measure of how well the user can operate the software system after maintenance without further failures. There are two formulae that can be used to measure this metric:

X = Na / Ta

Y = {(Na / Ta) / (Nb / Tb)}

Na is the number of cases where the user encounters failures after the software is changed, Nb is the number of times the user encounters failures before the software is changed, Ta is the operation time (a specified observation time) after the software is changed, and Tb is the time (a specified observation time) before the software is changed. Smaller and closer to 0 is better. Essentially, X and Y represent the frequency of encountering failures after the software is changed. The specified observation time is used to try to normalize the metric so we can compare release to release metrics better. Also, ISO 9126-2 suggests that the organization may want to differentiate between failures that come after repair to the same module/function and failures that occur in other modules/functions.

Modification impact localization: This is also a measure of how well the user can operate the system without further failures after maintenance. The calculation uses the formula

X = A / N

where A is the number of failures emerging after modification of the system (during a specified period) and N is the number of resolved failures. The standard suggests that the “chaining” of the failures be tracked; that is, the organization should differentiate between a failure that is attributed to the change for a previous failure and failures that do not appear to be related to the change. Smaller and closer to zero is better.

4.5.4 Testability

Testability is defined as the capability of a software product to be validated after change occurs. This certainly should be a concern to all technical test analysts.

A number of issues can challenge the testability of a system.

One of our all-time favorites is documentation. When documentation is poor or nonexistent, testers have a very hard time trying to figure out what to test. When a requirement or functional specification clearly states that “the system works this way!” then we can test to validate that it does. When we have no requirements, no previous system, no oracle as to what to expect, testing becomes a crap shoot. Is it working right? Shrug! Who knows?

Related to documentation is our old standby, lack of comments in the code and poor naming conventions, which make it harder to understand exactly what the code is supposed to do.

Implementing independent test teams can lead to unintentional (or even sometimes intentional) breakdowns in communications. Good communication between the test and development teams is important when dealing with testability.

Certain programming styles make the code harder to test. For example, object orientation was designed with data-hiding as one of its main objectives. Of course, data-hiding can also make it really difficult to figure out whether a test passed or not. And multiple levels of inheritance make it even harder; you might not know exactly where something happened, which class (object) actually was responsible for the action that was to be taken.

Lack of instrumentation in the code causes testability issues. Many systems are built with the ability to diagnose themselves; extra code is written to make sure that tasks are completed correctly and to log issues that occur. Unfortunately, this instrumentation is often seen as fluff rather than being required.

And as a final point, data issues can cause testability issues on their own. This is a case where better security and good encryption may make the system less testable. If you cannot find, measure, or understand the data, the system is harder to test. Like so much in software, intelligent trade-offs must be made.

4.5.4.1 Internal Testability Metrics

Internal testability metrics measure the expected effort required to test the modified system.

Completeness of built-in test function: A measure of how complete any built-in test capability is. To calculate this, count the number of implemented built-in test capabilities and compare it to how many the specifications call for. The formula to use is

X = A / B

where A is the number of built-in test functions implemented (as confirmed in review) and B is the number required in the specifications. The closer to one, the more complete.

Autonomy of testability: A measure of how independently the software system can be tested. This metric is calculated by counting the number of dependencies on other systems that have been simulated with stubs or drivers and comparing it to the total number of dependencies on other systems. As in many other metrics, the formula

X = A / B

is used, where A is the number of dependencies that have been simulated using stubs or drivers and B is the total number of dependencies on other systems. The closer to one the better. A value of one means that all other dependent systems can be simulated so the software can (essentially) be tested by itself.

Test progress observability: A measure of how completely the built-in test results can be displayed during testing. This can be calculated using the formula

X = A / B

where A is the number of implemented checkpoints as confirmed in review and B is the number required in the specifications. The closer to one, the better.
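Because all three internal testability metrics are review-based ratios of the same shape, a single helper is enough to compute them. The sketch below uses our own naming and hypothetical counts, and it guards against an empty specification, for which the ratio is undefined.

def review_ratio(implemented, required):
    # Generic X = A / B for the internal testability metrics. Returns None
    # when nothing is required, since the ratio is undefined in that case.
    return implemented / required if required else None

built_in_test_completeness = review_ratio(implemented=8, required=10)   # built-in test functions
autonomy_of_testability = review_ratio(implemented=3, required=4)       # stubbed dependencies vs. all dependencies
test_progress_observability = review_ratio(implemented=5, required=5)   # checkpoints implemented vs. required
print(built_in_test_completeness, autonomy_of_testability, test_progress_observability)  # 0.8 0.75 1.0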

4.5.4.2 External Testability Metrics

Testability metrics measure the effort required to test a modified system.

Availability of built-in test function: A measure of how easily a user or maintainer can perform operational testing on a system without additional test facility preparation. This metric is calculated by the formula

X = A / B

where A is the number of cases in which the maintainer can use built-in test functionality and B is the number of test opportunities. The closer to one, the better.

Re-test efficiency: A measure of how easily a user or maintainer can perform operational testing and determine whether the software is ready for release or not. This metric is an average calculated by the formula

X = Sum(T) / N

where T is the time spent to make sure the system is ready for release after a failure is resolved and N is the number of resolved failures. Essentially, this is the average retesting time after failure resolution. Note that nonresolved failures are excluded from this measurement. Smaller is better.
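In code, this is simply a mean over the retesting times recorded for resolved failures; the sketch below uses hypothetical durations and our own naming.

def retest_efficiency(retest_hours):
    # External testability: X = Sum(T) / N, the average time spent confirming
    # the system is ready for release after each resolved failure.
    # Unresolved failures are simply not included in the list.
    return sum(retest_hours) / len(retest_hours)

print(retest_efficiency([1.5, 0.5, 2.0, 1.0]))  # 1.25 hours on average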

Test restartability: A measure of how easily a user or maintainer can perform operational testing with checkpoints after maintenance. This is calculated by the formula

X = A / B

where A is the number of test cases in which the maintainer can pause and restart the executing test case at a desired point and B is the number of cases where executing test cases are paused. The closer to one, the better.

4.5.5 Compliance

4.5.5.1 Internal Compliance Metric

Maintainability compliance is a measure of how compliant the system is estimated to be with regard to applicable regulations, standards, and conventions. The ratio of compliance items implemented (as based on reviews) to those requiring compliance in the specifications is calculated using the formula

X = A / B

where A is the correctly implemented items related to maintainability compliance and B is the required number. The closer to one, the more compliant the system is.

4.5.5.2 External Compliance Metric

Maintainability compliance metrics measure how close the system adheres to various standards, conventions, and regulations. Compliance is measured using the formula

X = 1 - A / B

where A is the number of maintainability compliance items that were not implemented during testing and B is the total number of maintainability compliance items defined. Closer to one is better.

4.5.6 Exercise: Maintainability Testing

Using the HELLOCARMS system requirements document, analyze the risks and create an informal test design for maintainability, using one requirement as the basis.

4.5.7 Exercise: Maintainability Testing Debrief

Maintainability is an interesting quality characteristic for testers to deal with. Most maintainability issues are not amenable to our normal concept of a dynamic test, with input data, expected output data, and so on. Certainly some maintainability testing is done that way, when dealing with patches, updates, and so forth.

For this exercise, we are going to select requirement 050-010-010: Standards and guidelines will be developed and used for all code and other generated materials used in this project to enhance maintainability.

Our first effort, therefore, done as early as possible, would be to review the programming standards and guidelines with the rest of the test team and the development group—assuming, of course, that we have standards and guidelines. If there were none defined, we would try to get a cross-functional team busy defining them.

The majority of our effort would be during static testing. Starting (specifically for this requirement) at the low-level design phase, we would want to attend reviews, walk-throughs, and inspections. We would use available checklists, including Marick’s, Laszlo’s, and our own internal checklists (discussed in Chapter 5) based on defects found previously.

Throughout each review, we would be asking the same questions: Are we actually adhering to the standards and guidelines we have? Are we building a system that we will be able to troubleshoot when failures occur? Are we building a system with low coupling and high cohesion? Is it modular? How much effort will it take to test?

Since these standards and guidelines are not optional, we would work with the developers to make sure they understood them, and then we would start processing exceptions to them through the defect tracking database as with any other issues.

Beyond the standards and guidelines, there would still be some dynamic testing of changes made to the system, specifically for regression after patches and other modifications. We would want to mine the defect tracking database and the support database to try to learn where regression bugs have occurred. New testing would be predicated on those findings, especially if we found hot spots where failures occurred with regularity.

Many of our metrics would have to come from analyzing other metrics. How hard was it to figure out what was wrong? (Analyzability) When changes are needed, how much effort and time does it take to make them? (Changeability) How many regression bugs are found (in test and in the field) after changes are made? (Stability) And, how much effort has it taken for testers to be able to test the system? (Testability)

4.6 Portability Testing

Learning objectives

Only common learning objectives.

Portability refers to the ability of the application to be installed into, used in, and perhaps moved to various environments. Of course, the first two are important for all systems. In the case of PC software, given the rapid pace of changes in operating systems, cohabiting and interoperating applications, hardware, bandwidth availability, and the like, being able to move and adapt to new environments is critical too.

Back when the computer field was just starting out, there was very little idea of portability. A computer program started out as a set of patch cords connecting up logic gates made out of vacuum tubes. Later on, assembly language evolved to facilitate easier programming. But still no portability—the assembler was based on the specific CPU and architecture that the computer used. The push to engineer higher-level languages was driven by the need for programs to be portable between systems and processors.

A number of classes of defects can cause portability problems, but certainly environment dependencies, resource hogging, and nonstandard operating system interactions are high on the list. For example, changing shared Registry keys during installation or removing shared files during uninstallation are classic portability problems on the Windows platform.

Fortunately, portability defects are amenable to straightforward test design techniques like pairwise testing, classification trees,30 equivalence partitioning, decision tables, and state-based testing. Portability issues often require a large number of configurations for testing.

Some software is not designed to be portable, nor should it be. If an organization designs an embedded system that runs in real time, we would expect that portability is the least of its worries. Indeed, in a review, we would question any compromises made to make the system portable if those compromises had the potential to marginalize the operation of the system. However, there may come a day when the system must be moved to a different chip, a different system. At that point, it might be good if the system had some portability features built into it.

More than the other quality characteristics we have discussed in this chapter, portability requires compromises. A technical test analyst should understand the need for compromise but still make sure the system, as designed and delivered into test, is still suitable for the tasks it will be called to do.

The best way to discuss portability is to look at each of its subcharacteristics. This is a case of the total being a sum of its parts. Very little is published about portability without specifying these subcharacteristics: adaptability, replaceability, installability, coexistence, and compliance.

4.6.1 Adaptability

Adaptability is defined as the capability of the system to be adapted for different specified environments without applying actions other than those provided for that purpose.

In general, the more tightly a system is designed to fit a particular environment, the more suitable it will be for that environment and the less adaptable to other environments. Adaptability, for its own sake, is not all that desirable, frankly. On the other hand, adaptability for solid business or technical reasons is a very good idea. It is essential to understand the business (or technical) case in determining which trade-offs are advantageous.

When Jamie was a child, his mother read about a mysterious piece of clothing called a “Hawaiian muumuu.” They lived in a small town in the early 1960s; she was excited to be able to order such an exotic item. When she ordered from the catalog, it said, “one size fits all.” Jamie learned from that muumuu that, while one size fits all, it also fits nothing. The thing his mother was sent was enormous—they kidded about using it for a tent.

So what is the point? If we try to write software that will run on every platform everywhere, it likely will not fit any environment well. There are programming languages—such as Java—that are supposed to be able to “write once, run anywhere.” The ultimate portability! However, Java runs everywhere by having its own runtime virtual machine for each different platform. The Java byte code is portable, but only at a huge cost of engineering each virtual machine for each specific platform.

You don’t get something for nothing. Adaptability comes at a price: more design work, more complexity, more code bloat, and those tend to come with more defects. So, when your organization is looking into designing adaptability, make sure you know what the targeted environments are and what the business case is.

When testing adaptability, we must check that an application can function correctly in all intended target environments. Confusingly, this is also commonly referred to as compatibility testing. As you might imagine, when there are lots of options, specifying adaptability or compatibility tests involves pairwise testing, classification trees, and equivalence partitioning.

Since you likely will need to install the application into the environment, adaptability and installation might both be tested at the same time. Functional tests should then be run in those environments. Sometimes, a small sample of functions is sufficient to reveal any problems. More likely, many tests will be needed to get a reasonable picture. Unfortunately, many times a small amount of testing is all that organizations can afford to invest. Given the potentially enormous size of this task, our adaptability testing is often insufficient. As always in testing, the decision of how much to test, how deeply to dig in, will depend on risk and available resources.

There might also be procedural elements of adaptability that need testing. Perhaps data migration is required to move from one environment to another. In that case, we might have to test those procedures as well as the adaptability of the software.

4.6.1.1 Internal Adaptability Metrics

Adaptability metrics help predict the impact on the effort to adapt the system to a different environment.

Adaptability of data structures: A measure of how adaptable the product is to data structure changes. This metric is calculated using the formula

X = A / B

where A is the number of data structures that are correctly operable after adaptation, as confirmed in review, and B is the total number of data structures needing adaptation. The closer to one, the better.

Hardware environmental adaptability: A measure of how adaptable the software is to the hardware-related environmental change. This metric is specifically concerned with hardware devices and network facilities. The formula

X = A / B

is used, where A is the number of implemented functions that are capable of achieving required results in the specified multiple hardware environments, as confirmed in review, and B is the total number of functions that are required to have hardware adaptation capability. The closer to one, the better.

Organizational environment adaptability: A measure of how adaptable the software is to organizational infrastructure change. This metric is calculated by the formula

X = A / B

where A is the number of implemented functions that are capable of achieving required results in multiple specified organizational and business environments, as confirmed in review, and B is the total number of functions requiring such adaptability. The closer to one, the better.

System software environmental adaptability: A measure of how adaptable the software product is to environmental changes related to system software. The standard specifically lists adaptability to operating system, network software, and co-operated application software as being measured. This metric uses the formula

X = A / B

where A is the number of implemented functions that are capable of achieving required results in specified multiple system software environments, as confirmed in review, and B is the total number of functions requiring such capability. The closer to one, the better.

Porting user friendliness: A measure of how effortless the porting operations on the project are estimated to be. This metric uses the same formula:

X = A / B

A is the number of functions being ported that are judged to be easy to adapt, as confirmed in review, and B is the total number of functions that are required to be easy to adapt. The closer to one, the better.

4.6.1.2 External Adaptability Metrics

Adaptability metrics measure the behavior of the system or user who is trying to adapt the software to different environments.

Adaptability of data structures: A measure of how easily a user or maintainer can adapt software to data in a new environment. This metric is calculated using the formula

X = A / B

where A is the number of data items that remain operable, without limitation, after adaptation to the new environment and B is the number of data items that were expected to be operable in the new environment. The larger the number (i.e., the closer to one), the better. The data items here include such entities as data files, data tuples,31 data structures, databases, and so on. When calculating this metric, the same type of data should be used for both A and B.

Hardware environmental adaptability: A measure of how easily the user or maintainer can adapt the software to the environment. This metric is calculated using the formula

X = 1 - A / B

where A is the number of tasks that were not completed or did not work to adequate levels during operational testing with the new environment hardware and B is the total number of functions that were tested. The larger (closer to one) the better. ISO 9126-2 specifies this metric is to be used in reference to adaptability to hardware devices and network facilities. That separates it from the next metric.

Organizational environment adaptability: A measure of how easily the user or maintainer can adapt software to the environment, specifically the adaptability to the infrastructure of the organization. This metric is calculated much like the previous one using the formula

X = 1 - A / B

where A is the number of tasks that could not be completed or did not meet adequate levels during operational testing in the user’s business environment and B is the total number of functions that were tested. This particular metric is concerned with the environment of the business operation of the user’s organization. This separates it from the next similar measure. Larger is better.

System software environmental adaptability: This is also a measure of how easily the user or maintainer can adapt software to the environment, specifically adaptability to the operating system, network software, and co-operated application software. This metric is also calculated using the same formula:

X = 1 - A / B

A is the number of tasks that were not completed or did not work to adequate levels during operational testing with operating system software or concurrently running application software, and B is the total number of functions that were tested. Again, larger is better.

Porting user friendliness: This final adaptability metric also is a measure of how easily a user can adapt software to the environment. In this case, it is calculated by adding up all of the time that is spent by the user to complete adaptation of the software to the user’s environment when the user attempts to install or change the setup.
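The external adaptability metrics follow two patterns: a 1 - A/B ratio over operational test results and, for porting user friendliness, a simple total of the time spent. Here is a minimal sketch of both, with hypothetical numbers and our own naming.

def environment_adaptability(failed_tasks, tested_functions):
    # X = 1 - A / B, where A = tasks that failed or fell short of adequate
    # levels in the new environment and B = functions tested. Used for the
    # hardware, organizational, and system software variants of the metric.
    return 1 - failed_tasks / tested_functions

def porting_user_friendliness(setup_minutes):
    # Total time the user spends adapting the software to their environment,
    # summed across installation and setup attempts.
    return sum(setup_minutes)

print(environment_adaptability(failed_tasks=3, tested_functions=60))  # 0.95
print(porting_user_friendliness([20, 15, 5]))                         # 40 minutes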

4.6.2 Replaceability

Replaceability is the capability of the system/component to be used in place of another specified software product for the same purpose in the same environment. We are concerned with checking whether software components within a system can be exchanged for others.

The Microsoft style of system architecture has been a primary driver of the concept of software components, although Microsoft did not invent the idea. Remote procedure calls (RPCs) have been around a long time, allowing some of a system’s processing to be done on an external CPU rather than having all processing performed on one. In Windows, the basic design was for much of the application functionality to be placed outside the EXE file and into replaceable components called dynamic link libraries (DLLs). Early Windows functionality was mainly stored in three large DLLs. For testers, the idea of split functionality has created a number of problems; any tester who has sat for hours trying to emerge from DLL hell where incompatible versions of the same file cause cryptic failures can testify to that.

However, over the years, things have gotten better. From the Component Object Model (COM) to the Distributed Component Object Model (DCOM), all the way to service-oriented architecture (SOA), the idea of having tasks removed from the central executable has become more and more popular. Few organizations would now consider building a single, monolithic executable file containing all functionality. Many complex systems now consist of commercial off-the-shelf (COTS) components wrapped together with some connecting code. HELLOCARMS is a perfect example of that.

The design of the Microsoft Office suite is a pretty good example of replaceable/reusable components—even if it often does not seem that way. Much of the Office functionality is stored in COM objects; these may be updated individually without replacing the entire EXE. This architecture allows Office components to share, upgrade, and extend functionality on the fly. It also facilitates the use of macros and automation of tasks between the applications.

Many applications now come with the ability to use different major database management system (DBMS) packages. Moving forward, many in the industry expect this trend to only accelerate.

Testers must consider this whole range of replaceable components when they consider how they are going to test. The best way to consider distributed component architecture, from RPCs to COTS packages, is to think of loosely coupled functionality where good interface design is paramount. Essentially, we need to consider the interface to understand what to test. Much of this testing, therefore, is integration-type testing.

We should start with static testing of the interface. How will we call distributed functionality, and how will the modules communicate? In integration test, we want to test all of the different components that we expect may be used. In system test, we certainly should consider the different configurations that we expect to see in production.

Low coupling is the key to successful replaceability. When designing a system, if the intent of the design is to allow multiple components to be used, then coupling too tightly to any one interface will make those components irreplaceable. At that point, the system becomes dependent on external modules that are likely not controlled by our organization.

This is an issue that must be considered by management when moving along a path of component-based systems. When everything was in one executable, we could responsibly test all of that functionality. With the growth of decentralization through replaceability of components, the question of who is responsible for testing what becomes paramount. That is a discussion we leave to the Advanced Test Management book, Advanced Software Testing, Vol. 2.

4.6.2.1 Internal Replaceability Metrics

Replaceability metrics help predict the impact the software may have on the effort of a user who is trying to use the software in place of other specified software in a specific environment and context of use.

Continued use of data: A measure of the amount of original data that is expected to remain unchanged after replacement of the software. This is calculated using the formula

X = A / B

where A is the number of data items that are expected to be usable after replacement, as confirmed in review, and B is the number of old data items that are required to be usable after replacement. The closer to one, the better.

Function inclusiveness: A measure of the number of functions expected to remain unchanged after replacement. This measurement is calculated using the formula

X = A / B

where A is the number of functions in the new software that produce similar results as the same functions in the old software (as confirmed in review) and B is the number of old functions. The closer to one the better.

4.6.2.2 External Replaceability Metrics

Replaceability metrics measure the behavior of the system or user who is trying to use the software in place of other specified software.

Continued use of data: A measure of how easily a user can continue to use the same data after replacing the software. Essentially, this metric measures the success of the software migration. The formula

X = A / B

is used, where A is the number of data items that are able to be used continually after software replacement and B is the number of data items that were expected to be used continuously. The closer to one, the better. This metric can be used both for a new version of the software and for a completely new software package.

Function inclusiveness: A measure of how easily the user can continue to use similar functions after replacing a software system. This metric is calculated the same way, using the formula

X = A / B

where A is the number of functions that produce similar results in the new software where changes have not been required and B is the number of similar functions provided in the new software as compared to the old. The closer this value is to one, the better.

User support functional consistency: A measure of how consistent the new components are to the existing user interface. This is measured by using the formula

X = 1 - A / B

where A is the number of functions found by the user to be unacceptably inconsistent to that user’s expectation and B is the number of new functions. Larger is better, meaning that few new functions are seen as inconsistent.

4.6.3 Installability

Installability is the capability of a system to be installed into a specific environment. Testers have to consider uninstallability at the same time.

There is good news and bad news about installability testing. Conceptually it is straightforward. That is the good news. We must install the software, using its standard installation, update, and patch facilities, onto its target environment or environments. How hard can that be? Well, that is the bad news. There are an almost infinite number of possible issues that may arise during that testing.

Here are just some of the risks that must be considered:

  • We install a system whose success depends on all of the other software that the new system relies on working correctly. Are all co-installed systems working correctly? Are they all the right versions? Does the install procedure even check?

  • We find that the typical people involved in doing the installation can’t figure out how to do it properly, so they are confused, frustrated, and making lots of mistakes (resulting in systems left in an undefined, crashed, or even completely corrupted state). This type of problem should be revealed during a usability test of the installation documentation. You are testing the usability of the install, right?

  • We can’t install the software according to the instructions in an installation or user’s manual or via an installation wizard. This sounds straightforward, but notice that it requires testing in enough different environments that you can have confidence that it will work in most if not all final environments as well as looking at various installation options like partial, typical, or full installation. Remember—the install is a software system unto itself and must be tested.

  • We observe failures during installation (e.g., failure to load particular DLLs) that are not cleaned up correctly, so they leave the system in a corrupted state. It’s the variations in possibilities that make this a challenge.

  • We find that we can’t partially install, can’t abort the install, or can’t uninstall.

  • We find that the installation process or wizard will not shield us from—or perhaps won’t even detect—invalid hardware, software, operating systems, or configurations. This is likely to be connected to failures during or after the installation that leave the system in an undefined, crashed, or even completely corrupted state.

  • We find that trying to uninstall and clean up the machine destroys the system software load.

  • We find that the installation takes an unbearable amount of time to complete, or perhaps never completes.

  • We can’t downgrade or uninstall after a successful or unsuccessful installation.

  • We find that some of the error messages are neither meaningful nor informational.

  • Upon uninstallation, too few—or too many—modules are removed.

By the way, for each of the types of risks we just mentioned, we have to consider not only installation problems, but also similar problems with updates and patches.

Not only do these tests involve monitoring the install, update, or patch process, but they also require some amount of functionality testing afterward to detect any problems that might have been silently introduced. Because, at the end of the day, the most important question to ask is, When we are all done installing, will the system work correctly?

And, just because it was not already interesting enough, we have to think about security. During the install, we need to have a high level of access to be able to perform all of the tasks. Are we opening up a security hole for someone to jump into?

How do we know that the install worked? Does all the functionality work? Do all the interoperating systems work correctly?

At the beginning of discussing installability, we said it was a good news, bad news scenario, the good news being that the install was conceptually straightforward. We lied. It’s not. The best way we know to deal with install testing is to make sure that it is treated as a completely different component to test. Some organizations have a separate install test team; that actually makes a lot of sense to us.

And one final note. As an automator, Jamie once thought it would be great to take all of the stuff we just talked about and automate the entire process—test it all by pushing a button. Unfortunately, we’ve never personally seen that done.

At a recent conference, Jamie talked to an automator who claimed that her group had successfully automated all of their installation tasks. However, she was woefully short on details of how, so we don’t know what to believe about her story.

With all of the problems possible in trying to test install and uninstall, with all of the different ways it can fail, it takes a human brain to deal with it all. Until our tools and methodologies get a whole lot better, we think we will continue to be doing most of this testing manually.

During Jamie’s first opportunity at being lead tester on a project, he decided to facilitate better communication between the test team and the support team by setting up a brown-bag lunch with both teams. They were testing a very complex system that included an AS/400 host module, a custom ODBC driver, and a full Windows application. We were responsible for testing everything that we sold.

During the lunch, Jamie asked the support team to list the top 10 customer complaints. It turned out that 7 of the top 10 complaints were install related. Oops! We weren’t even testing the install because Jamie figured it was not a big deal. He had come from an organization where they tested an operating system—there the install was tested by another group in another state.

Very often, install complaints rank very high in all problems reported to support.

4.6.3.1 Internal Installability Metrics

Installability metrics help predict the impact on a user trying to install the software into a specified environment.

Ease of set-up retry: A measure of how easy it is expected to repeat the setup operation. This is calculated using the formula

X = A / B

where A is the number of implemented retry operations for set-up, confirmed in review, and B is the total number of set-up operations required. The closer to one, the better.

Installation effort: A measure of the level of effort that will be required for installation of the system. This metric is calculated using the formula

X = A / B

where A is the number of automated installation steps, as confirmed in review, and B is the total number of installation steps required. Closer to one (i.e., fully automated) is better.

Installation flexibility: A measure of how customizable the installation capability is estimated to be. This is calculated using the formula

X = A / B

where A is the number of implemented customizable installation operations, as confirmed in review, and B is the total number required. The closer to one, the more flexible the installation.
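If the installation design is captured as a simple inventory of steps, the internal installability ratios can be derived directly from it. The sketch below uses a hypothetical step list; the field names and steps are ours, purely for illustration.

# Hypothetical install-step inventory captured during a design review.
install_steps = [
    {"name": "copy binaries",     "automated": True,  "retryable": True},
    {"name": "create database",   "automated": True,  "retryable": True},
    {"name": "register services", "automated": True,  "retryable": False},
    {"name": "edit site config",  "automated": False, "retryable": True},
]

installation_effort = sum(s["automated"] for s in install_steps) / len(install_steps)
ease_of_setup_retry = sum(s["retryable"] for s in install_steps) / len(install_steps)
print(installation_effort, ease_of_setup_retry)  # 0.75 0.75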

4.6.3.2 External Installability Metrics

Installability metrics measure the impact on the user who is trying to install the software into a specified environment.

Ease of installation: A measure of how easily a user can install software to the operational environment. This is calculated by the formula

X = A / B

where A is the number of cases in which a user succeeded in changing the install operation for their own convenience and B is the total number of cases in which a user tried to change the install procedure. The closer to one, the better.

Ease of set-up retry: A measure of how easily a user can retry the set-up installation of the software. The standard does not address exactly why the retry might be needed, just that the retry is attempted. The metric is calculated using the formula

X = 1 - A / B

where A is the number of cases where the user fails in retrying the set-up and B is the total number of times it’s attempted. The closer to one, the better.

4.6.4 Coexistence

The fourth subcharacteristic we need to discuss for the portability characteristic is called coexistence testing—also called sociability or compatibility testing. This is defined as the capability to coexist with other independent software in a common environment sharing common resources. With this type of testing, we check that one or more systems that work in the same environment do so without conflict. Notice that this is not the same as interoperability testing because the systems might not be directly interacting. Earlier, we referred to these as “cohabiting” systems, though that phrase is a bit misleading since human cohabitation usually involves a fair amount of direct interaction.

It’s easy to forget coexistence testing and test applications by themselves. This problem is often found in groups that are organized into silos and where application development takes place separately in different groups. Once everything is installed into the data center, though, you are then doing de facto compatibility testing in production, which is not really a good idea. There are times when we might need to share testing with other project teams to try to avoid coexistence problems.

With coexistence testing, we are looking for problems like the following:

  • Applications have an adverse impact on each other’s functionality when loaded on the same environment, either directly (by crashing each other) or indirectly (by consuming all the resources). Resource contention is a common point of failure.

  • Applications work fine at first but then are damaged by patches and upgrades to other applications because of undefined dependencies.

  • DLL hell. Shared resources are not compatible and the last one installed will work, breaking the others.

Assume that we just installed this system. How do we know what other applications are on that system, much less which ones are going to fail to play nice? This is yet another install issue that must be considered. In systems where there is no shared functionality (e.g., no shared DLLs), this is less important.

One solution that is becoming more common is the concept of virtual machines. We can control everything in the virtual machine, so we can avoid direct resource contention between processes.

4.6.4.1 Internal Coexistence Metrics

Coexistence metrics help predict the impact the software may have on other software products sharing the same operational hardware resources.

Available coexistence: A measure of how flexible the system is expected to be in sharing its environment with other products without adverse impact on them. The formula

X = A / B

is used where A is the number of entities with which the product is expected to coexist and B is the total number of entities in the production environment that require such coexistence. Closer to one is better.

4.6.4.2 External Coexistence Metrics

Coexistence metrics measure the behavior of the system or user who is trying to use the software with other independent software in a common environment sharing common resources.

Available coexistence: A measure of how often a user encounters constraints or unexpected failures when operating concurrently with other software. This is calculated using the formula

X = A / T

where A is the number of constraints or failures that occur when operating concurrently with other software and T is the time duration of operation. The closer to zero, the better.
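Unlike most of the ratios in this chapter, this metric is normalized by operation time rather than by a count. A minimal sketch with hypothetical numbers:

def coexistence_failure_rate(concurrent_failures, operation_hours):
    # External coexistence: X = A / T, constraints or failures observed while
    # running alongside other software, per hour of operation.
    return concurrent_failures / operation_hours

# Hypothetical soak run: 3 resource-contention failures over a 120-hour window.
print(round(coexistence_failure_rate(3, 120.0), 4))  # 0.025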

4.6.5 Compliance

4.6.5.1 Internal Compliance Metrics

Portability compliance metrics help assess the capability of the software to comply with standards, conventions, and regulations that may apply. It is measured using the formula

X = A / B

where A is the number of correctly implemented items related to portability compliance, as confirmed in review, and B is the total number of compliance items. The closer to one, the more compliant the system.

4.6.5.2 External Compliance Metrics

Portability compliance metrics measure the number of functions that fail to comply with required conventions, standards, or regulations. This metric uses the formula

X = 1 - A / B

where A is the number of portability compliance items that have not been implemented and B is the total number of portability compliance items that are specified. The closer to one, the better. ISO 9126-2 notes that this is a metric that works best when seen as a trend, with increasing compliance coming as the system becomes more mature.
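Since the standard recommends watching this value as a trend, a small script like the sketch below (hypothetical release data, our own naming) can tabulate compliance release over release.

def portability_compliance(not_implemented, total_items):
    # External portability compliance: X = 1 - A / B.
    return 1 - not_implemented / total_items

# Hypothetical trend across three releases; the value should rise as the
# system matures.
for release, missing in [("R1", 6), ("R2", 3), ("R3", 1)]:
    print(release, round(portability_compliance(missing, 20), 2))  # 0.7, 0.85, 0.95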

4.6.6 Exercise: Portability Testing

Using the HELLOCARMS system requirements document, analyze the risks and create an informal test design for portability using one requirement.

4.6.7 Exercise: Portability Testing Debrief

Portability testing consists of adaptability, installability, coexistence, replaceability, and compliance subattributes. Because HELLOCARMS is surfaced on browsers, we find the compelling subattribute to be adaptability. Therefore, we have selected requirement 060-010-030 for discussion. It reads: “HELLOCARMS shall be configured to work with all popular browsers that represent five percent (5%) or more of the currently deployed browsers in any countries where Globobank does business.”

Our first effort would be to try to get a small change to this requirement during the requirements review period. The way it is written, it appears that, by release three, we need to be concerned about all versions of browsers rather than just the latest two versions as expressed in requirements 060-010-010 and 060-010-020. We hope this is an oversight and will move forward in our design assuming that we only need the latest two versions.

This particular requirement is not enforced until release three. However, we would start informally testing it with the first release. This is because we would not want the developers to have to remove technologies they used after the first two releases simply because they are not compatible with a seldom-used browser that still meets the five percent threshold. We would make sure that we stressed this upcoming requirement at low-level design and code review meetings.

We would have to survey what browsers are available. This entails discovering what countries Globobank is active in and performing Web research on them. We would hope to get our marketing group interested in helping out to prevent spending too much time on the research ourselves.

We would create a matrix of all the possible browsers that meet the criteria, including the current version and one previous version for each. We would also build into that matrix popular operating systems and connection speeds (dialup and two speeds of wideband).

This matrix is likely to be fairly large. We do not cover pairwise techniques in this book.32 However, if we did not know how to deal with this powerful configuration testing technique, we would enlist a test analyst to help us out. We would spread out our various planned tests over the matrix to get acceptable coverage, focusing most tests on those browsers/operating systems/speeds that represented most of our prospective users.
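To give a feel for the size of the problem, the sketch below builds the full cross-product matrix for a handful of hypothetical factors (the browser, operating system, and speed names are placeholders, not taken from the HELLOCARMS documents). A pairwise tool, or a test analyst applying the technique, would then select a much smaller subset that still covers every pair of factor values.

from itertools import product

browsers = ["Browser A (current)", "Browser A (previous)",
            "Browser B (current)", "Browser B (previous)"]
operating_systems = ["OS 1", "OS 2", "OS 3"]
connection_speeds = ["dial-up", "wideband-low", "wideband-high"]

full_matrix = list(product(browsers, operating_systems, connection_speeds))
print(len(full_matrix))  # 36 configurations before any pairwise reduction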

After release, we would make sure to monitor reported production failures through support, ensuring that we were tracking environment-related failures. We would use that information to tweak our testing as we move into the maintenance cycle.

4.7 General Planning Issues

Learning objectives

Only common learning objectives.

Let’s close this chapter with some general planning issues related to nonfunctional testing. It’s often the case that nonfunctional testing is overlooked or underestimated in test plans and project plans. As we’ve seen throughout this chapter, though, nonfunctional facets of system behavior can be very important, even critical to the quality of the system. Therefore, such omissions can create serious risks for the product, and thus the project. As was mentioned earlier, the failure to address nonfunctional risks—specifically those related to performance—led to a disastrous project release in the case of the United States healthcare.gov website.

While not wanting to comment on the specifics of that example, one common reason for such omissions is that many test managers lack adequate technical knowledge to understand the risks. As technical test analysts, we can contribute to the quality risk analysis and test planning processes to help ensure that such risks are not overlooked or minimized. However, technical test analysts do not always have the management perspective needed to see these gaps in advance. This section is designed to help you anticipate the gaps and help the test manager address them.

The first significant gap that can exist is a lack of understanding of stakeholder needs in terms of nonfunctional requirements. Business stakeholders especially are often highly attuned to functional issues and can explain what accurate, suitable behavior looks like; the systems with which the system under test must interoperate; and the regulations with which the system must comply. However, when asked about other behaviors, they’ll say, “Yes, of course the system must be secure, easy to use, reliable, quick, useful in any supported configuration, and ready for updates with new features and bug fixes at a moment’s notice.” Those are hardly useful requirements for security, usability, reliability, efficiency, portability, and maintainability. The opacity of stakeholder requirements is yet another hurdle to the test manager’s ability to plan for nonfunctional testing.

This is where you, the intrepid technical test analyst, come in. You should—especially after the material discussed in this chapter—be ready to have a more detailed discussion with stakeholders about what exactly they want in terms of nonfunctional behavior. Of course, you need to make sure that these discussions include all relevant stakeholders. These stakeholders include business stakeholders such as customers and users. They also include operations and maintenance staff. They include developers, architects, designers, and database administrators. As discussed in Chapter 1, including the broadest cross-section of stakeholders gives the most complete view of the issues. This applies not only to risks but also to planning considerations and requirements.

In some cases, the response to your queries about nonfunctional requirements, especially performance, reliability, security, and usability, can be, “It should work at least as well as the existing system,” or, “It should be as good as our competitors’ systems.” While this can be frustrating at first, notice that such a response actually provides you with a real, live, usable test oracle—which is what you were looking for anyway. If a stakeholder has given such a response, don’t be frustrated by it—thank them for giving you a realistic, existing system to measure your systems against.

Another significant challenge to nonfunctional testing is that tools are often required to carry out such tests. Reliability and performance tests are usually impossible to perform without tools. Some security testing can be done manually, but most hackers use tools as well as their twisted wiles, so security testers must use these tools as well. Open-source and other free tools are available, but many of our clients opt to use commercial tools for reasons discussed in Chapter 6. So, the time and money required to acquire, implement, and deploy the tools should be included in planning; even if you do use tools that are open-source or free, the cost of deploying the tools can be significant.

In addition, these tools are not easy to use. If you have no experience with a particular tool—or, worse yet, a whole type of tool—simply trying to use them yourself, no matter how technical you are, might result in false positives, false negatives, and a lot of unproductive thrashing around. So, the plan, as well as the budget and estimate, must include the costs of training people, learning how the tool works in your environment, and possibly even hiring consultants to help you use the tool, at least initially.

In some cases, a tool might be needed, but no commercial, open-source, or free tool exists. In this case, you might need to build your own tool. The implications of this approach are discussed to some extent in Chapter 6, and more extensively in the ISTQB Advanced Test Manager syllabus.33 Tool development is typically a significant development effort, though in some cases open-source components can be used to build the tool at a very reasonable cost. Any tool you build—just like any software your organization builds—should be assumed buggy, and so testing is required. The tool is an important asset, and so documentation is necessary. In order to use the tool in the future, maintenance will be needed. If the tool is used for safety-critical applications such as medical devices or avionics, certification of the tool according to regulatory guidelines is also required. All of these activities take time and cost money and thus must be planned.

And that’s not all. In many cases, in order to yield meaningful results, nonfunctional tests must be run in production-like environments. For example, performance and reliability tests are very sensitive to resource constraints, so running them in scaled-down environments can be misleading. Rex has seen security tests in production environments yield very different results than those same tests run in testing environments, just because the production environments were configured differently.

Replicating a production environment can be very easy and cheap, very hard and expensive, or somewhere in between. At the easy end, thanks to virtualization and cloud computing, whistling up an exact replica of an environment that already exists in a cloud computing setting can be quite simple, though the cost might still be high if the capacity of the environment is large. If you have a large, complex, unique, and expensive production environment—for example, a supercomputer or mainframe networked to an unusual configuration of servers and clients—then the cost, difficulty, and time lag associated with replicating that environment could be prohibitive in all but the most high-risk situations.

When environments cannot be replicated, most of Rex’s clients tend to resort to one of two options. The first, and probably most popular, is to test in a scaled-down environment. This can work for security testing, if great care is taken to preserve exactly—and we mean exactly—the configuration of the production environment, if not the scale of its resources. However, for performance and reliability testing, in the absence of exceptionally accurate models of how changes in resource capacity will affect behavior, it’s highly likely that such tests, while revealing some important defects, will also miss other important defects.

Another option is to test in the production environment. Here there can be no question about the realism of the test environment. However, there is a significant issue of how you insulate the users from the tests. Putting load on the system, as is necessary in performance testing (at least on a transient basis) and reliability testing (for a much more extended period), has the possibility of affecting real users. Security tests could result in the leakage of actual, sensitive customer data, perhaps to a wider world than just the testers. In addition, testing in production can simply have embarrassing consequences. Rex is aware of an instance where testing in production resulted in test data being sent to real customers, which might seem trivial except that the test data was a letter whose salutation read “Dear idiot” and the recipients were every customer the company had.

The test plan must carefully consider all these issues and risks. At the least, planning for the environment should be done carefully to avoid any conflict between the functional and nonfunctional tests. Timing will often be an issue, especially if production replica environments or actual production environments are used, but in some cases even when standard test environments can be used.

In addition to test environments, test data is often a challenge for nonfunctional testing. Especially in the case of performance, reliability, and security testing, using realistic data, including configuration data, is critical to obtaining meaningful results and finding important defects. One way to ensure realistic test data is to test using production data. However, production data often contains sensitive information such as personal identifying information, health status, financial information, and the like. In all cases, businesses, governments, nonprofits, and other organizations have an ethical duty to protect the public from harm that arises from the misuse of such data, though recent history shows that organizations are often less than diligent in this regard. When the pangs of conscience are not enough—and lately they don’t seem to be—governments step in, imposing civil and criminal requirements to be good stewards of such data.

So, whether for reasons of good organizational citizenship or due to fear of lawsuits or worse, an increasing number of our clients are adopting data protection policies. These policies restrict who can access production data, which often has the side effect of preventing testers from using the data in its pure form. Fortunately, there are tools and techniques that can be used to anonymize production data as part of the process of transferring the data into the test environment, making it impossible for a malevolent tester (or just a malevolent person with access to the test data) to do harm to individuals or society. Unfortunately, these tools are often expensive. Even when price is not an issue, the process of anonymizing and transferring often huge volumes of data from the production environment into the test environment can be a small project all by itself. Either way, obtaining nonfunctional test data often requires advanced planning.
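The core idea behind most of these tools is to replace each sensitive field with a value that keeps its shape and uniqueness but cannot be traced back to a real person. The sketch below shows one minimal form of this idea, salted one-way hashing; the field names and salt are hypothetical, and a real anonymization pipeline would do far more (format-preserving masking, referential integrity across tables, and so on).

import hashlib

def pseudonymize(value, salt):
    # Replace a sensitive field with a salted one-way hash so the test data
    # stays unique and joinable without exposing the original value.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

# Hypothetical production row being copied into the test environment.
row = {"customer_id": "C-1001", "name": "Jane Roe", "ssn": "999-99-9999"}
masked = dict(row, name=pseudonymize(row["name"], "2024-refresh"),
              ssn=pseudonymize(row["ssn"], "2024-refresh"))
print(masked)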

In some cases, the issues associated with test data are not the volume or provenance but the nature of the data itself. Testing involves the use of a test oracle, which means a way of determining whether the test has passed or failed. However, application security increasingly involves transmitting data in encrypted forms. Encryption makes the data unreadable by those who shouldn’t be able to read it, such as, for example, hackers looking to steal credit card information from a customer database. Of course, encryption also makes it impossible for us to check that credit card number in the customer database to see whether it was saved properly—at least, properly implemented encryption will.

So, some clever solutions must be used to check the data. Partial oracles are one way of doing so. Using service virtualization to replicate a third-party service that has the authorization to receive the unencrypted information is another way of doing so. Asking the developers to install a backdoor that allows you to circumvent the encryption might seem clever at first, but it is in fact very, very stupid in most cases. These backdoors have a way of being discovered, and you don’t want to be the person who asked it to be put there when it is.

Speaking of service virtualization, this is just one of the organizational considerations involved in nonfunctional testing. Applications often—in fact, these days, usually—interact with other applications via various services available on the internal company network or the Internet. You’ll need to work with the teams that offer these services to identify the interfaces involved, how they work, whether test interfaces are available, and so forth. If you can coordinate a test interface with them, so much the better, provided they agree to support the interface. If you can’t get a test interface, or if you simply don’t want to rely on such an interface, you might decide to use service virtualization tools to replace the interface.

In some cases, nonfunctional tests must extend end to end, across all the applications, in which case you need people in other groups involved. Or perhaps you don’t have all the skills required to run a certain set of tests, such as security tests. If, for whatever reason, the skills, expertise, and involvement of other groups is required to successfully carry out nonfunctional tests, this involvement will need to be planned and coordinated. You’ll need to help the test manager identify these groups and the specific involvement necessary.

4.8 Sample Exam Questions

1. You are new to the organization and have been placed in a technical testing role. You are asked to investigate a number of complaints from customers who fear that they may have security issues with the software. Symptoms include virus warnings from security software, missing and corrupted documents, and the system suddenly slowing down. Which two of the following websites might you access to get help finding causes for these symptoms?

A. Wikipedia

B. CAPEC

C. SUMI

D. CVE

E. WAMMI

2. Nonfunctional testing has never been done at your organization, but your new director of quality has decided that it will be done in the future. And, she wants metrics to show that the system is getting better. One metric you are calculating is based on a period of testing that occurred last week. You are measuring the time that the system was actually working correctly compared to the time that it was automatically repairing itself after a failure. Which of the following metrics are you actually measuring?

A. MTBF

B. Mean down time

C. Mean recovery time

D. Availability

3. You find out that marketing has put a new claim into the literature for the system, saying that the software will work on Windows 95 through Win 7. Which of the following nonfunctional attributes would you most likely be interested in testing and collecting metrics for?

A. Adaptability

B. Portability

C. Coexistence

D. Stability

4. You are doing performance testing for the system your company sells. You have been running the system for over a week straight, pumping huge volumes of data through it. What kind of testing are you most likely performing?

A. Stress testing

B. Soak testing

C. Resource utilization testing

D. Spike testing

5. Rather than developing all of your own software from the ground up, your management team has decided to use available COTS packages in addition to new code for an upcoming project. You have been given the task of testing the entire system with a view to making sure your organization retains its independence from the COTS suppliers. Which of the following nonfunctional attributes would you most likely investigate?

A. Replaceability

B. Portability compliance

C. Coexistence

D. Adaptability
