Before diving into examples of how to test Python code, the nature of tests must be discussed in more detail. Why do we want to have tests? What do we gain from them? What are the downsides? What makes a good test; what’s a bad test? How can we classify tests? And how many of which kinds of tests should we write?

1.1 What Do We Want from a Test?

Why bother with writing tests at all? There are a number of reasons why we want to write or, at least, have tests.

It is not uncommon to have several tests in a test suite, written in response to different needs.

Fast Feedback

Every change to code comes with the risk of introducing bugs. Research shows that somewhere in the range of 7% to 20% of all bug fixes introduce new bugs.1

Wouldn’t it be great if we could find those bugs before they find their way to the customer? Or even before your colleagues see them? This is not just a question of vanity. If you receive quick feedback that you have introduced a bug, you are more likely to remember all the details of the part of the code base you just worked on, so fixing the bug tends to be much faster when you get fast feedback.

Many test cases are written to give this kind of fast feedback loop. You can often run them before you ever commit your changes to the source control system, and they make your work more efficient and keep your source control history clear.


Confidence

Related to the previous point, but worth mentioning separately, is the confidence boost you can get from knowing that the test suite will catch simple mistakes for you. In most software-based businesses, there are critical areas where serious bugs could endanger the whole business. Just imagine you, as a developer, accidentally mess up the login system of a health-care data management product, and now people see others’ diagnoses. Or imagine that automatic billing charges the wrong amount to customers’ credit cards.

Even non-software businesses have had catastrophic failures from software errors. Both the Mars Climate Orbiter2 and the first launch of the Ariane 5 rocket3 ended with the loss of the vehicle, owing to software issues.

The criticality of their work puts emotional stress on software developers. Automated tests and good development methodology can help alleviate this stress.

Even if the software that people are developing is not mission-critical, risk aversion can cause developers or maintainers to make the smallest change possible and put off necessary refactoring that would keep the code maintainable. The confidence that a good test suite provides can enable developers to do what is necessary to keep the code base from becoming the proverbial big ball of mud.4

Debugging Aid

When developers change code, which in turn causes a test to fail, they want the test to be helpful in finding the bug. If a test simply says “something is wrong,” this knowledge is better than not knowing about the bug. It would be even more helpful if the test could provide a hint to start debugging.

If, for example, a test failure indicates that the function find_shortest_path raised an exception, rather than returning a path, as expected, we know that either that function (or one it called) broke, or it received wrong input. That’s a much better debugging aid.
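The difference between "something is wrong" and a useful hint can be sketched in a few lines. This is only an illustration: the helper describe_failure is hypothetical, and the body of find_shortest_path is a deliberately broken stand-in, because the text names the function but does not define it.

```python
def describe_failure(func, args, expected):
    """Run func(*args); return None on success or a specific failure message."""
    call = f"{func.__name__}{args!r}"
    try:
        actual = func(*args)
    except Exception as exc:
        # Naming the function and the exception tells us where to start looking.
        return f"{call} raised {type(exc).__name__}: {exc}"
    if actual != expected:
        return f"{call} returned {actual!r}, expected {expected!r}"
    return None


# Hypothetical, deliberately broken stand-in for the function under test.
def find_shortest_path(graph, start, goal):
    return graph[start][goal]  # raises KeyError for unknown nodes


message = describe_failure(find_shortest_path, ({}, "A", "B"), ["A", "B"])
```

The message names the function, its arguments, and the exception type, which is usually enough to decide whether to look at the function itself or at its input.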

Design Help

The Extreme Programming (XP) 5 movement advocates that you should practice test-driven development (TDD). That is, before you write any code that solves a problem, you first write a failing test. Then you write just enough code to pass the test. Either you are done, or you write the next test. Rinse and repeat.

This has obvious advantages: you make sure that all code you write has test coverage and that you don’t write unnecessary or unreachable code. However, TDD practitioners have also reported that the test-first approach helped them write better code. One aspect is that writing a test forces you to think about the application programming interface (API) that the implementation will have, and so you start implementing with a better plan in mind. Another reason is that pure functions (functions whose return value depends only on the input and that don’t produce side effects or read data from databases, etc.) are very simple to test. Thus, the test-first approach guides the developer toward a better separation of algorithms or business logic from supporting logic. This separation of concerns is an aspect of good software design.
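As a small sketch of that separation, consider a price calculation. All names here (total_price, total_price_from_file, the layout of the JSON file) are assumptions made up for illustration. The pure function can be tested with plain values, while the file handling stays in a thin wrapper.

```python
import json


def total_price(items, discount_rate):
    """Pure function: the result depends only on the arguments."""
    subtotal = sum(price * quantity for price, quantity in items)
    return round(subtotal * (1 - discount_rate), 2)


def total_price_from_file(path):
    """Supporting logic: reads the input, then delegates to the pure function."""
    with open(path, encoding="utf-8") as handle:
        order = json.load(handle)
    return total_price(order["items"], order["discount_rate"])


def test_total_price_applies_discount():
    # No files, no database: just input values and an expected result.
    assert total_price([(10.0, 2), (5.0, 1)], 0.1) == 22.5
```

Testing total_price requires no setup at all, which is exactly the property that nudges a test-first developer toward writing more functions of this kind.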

It should be noted that not everybody agrees with these observations. Counterpoints come from experience, or from the argument that some code is much harder to test than to write, so that requiring tests for everything wastes effort. Still, the design help that tests can provide is a reason why developers write tests and so should not be missing here.

Specification of the Product

The days of big, unified specification documents for software projects are mostly over. Most projects follow some iterative development model, and even if there is a detailed specification document, it is often outdated.

When there is no detailed and up-to-date prose specification, the test suite can take the role of specification. When people are unsure how a program should behave in a certain situation, a test might provide the answer. For programming languages, data formats, protocols, and other things, it might even make sense to offer a test suite that can be used for validating more than one implementation.
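A test suite that validates more than one implementation could look roughly like the following sketch. The specification (how a made-up configuration format spells Boolean values) and both implementations are hypothetical.

```python
# Hypothetical specification: each case is (input text, expected value).
SPEC_CASES = [
    ("true", True),
    ("false", False),
    ("TRUE", True),
    (" false ", False),
]


def check_conformance(parse_bool):
    """Return the specification cases that the given implementation gets wrong."""
    failures = []
    for text, expected in SPEC_CASES:
        try:
            actual = parse_bool(text)
        except Exception as exc:
            actual = exc  # an exception counts as a wrong answer
        if actual is not expected:
            failures.append((text, expected, actual))
    return failures


def parse_bool_strict(text):
    return {"true": True, "false": False}[text]


def parse_bool_lenient(text):
    return {"true": True, "false": False}[text.strip().lower()]
```

Here check_conformance(parse_bool_lenient) returns an empty list, while parse_bool_strict fails the two cases that involve upper case and surrounding whitespace. The list of cases answers the question "how should the program behave?" for anyone who writes a third implementation.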

1.2 Downsides of Tests

It would be disingenuous to keep quiet about the downsides that tests can have. These downsides should not deter you from writing tests, but being aware of them will help you decide what to test, how to write the tests, and, maybe, how many tests to write.


Effort

It takes time and effort to write tests. So, when you are tasked with implementing a feature, you not only have to implement the feature but also write tests for it, resulting in more work and less time to do other things that might provide direct benefit to the business. Unless, of course, the tests provide enough time savings (for example, through not having to fix bugs in the production environment and clean up data that was corrupted through a bug) to amortize the time spent on writing the tests.

Extra Code to Maintain

Tests are code themselves and must be maintained, just like the code that is being tested. In general, you want the least amount of code possible that solves your problem, because the less code you have, the less code must be maintained. Think of code (including test code) as a liability rather than an asset.

If you write tests along with your features and bug fixes, you have to change those tests when requirements change. Some of the tests also require changing when refactoring, making the code base harder to change.


Brittleness

Some tests can be brittle, that is, they occasionally give the wrong result. A test that fails even though the code in question is correct is called a false positive. Such a test failure takes time to debug, without providing any value. A false negative is a test that does not fail when the code under test is broken. A false negative test provides no value either but tends to be much harder to spot than false positives, because most tools draw attention to failed tests.
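Both failure modes are easy to produce by accident. In the following sketch, the function is_weekend and all three tests are hypothetical examples. One test depends on the day it is run, one uses fixed dates, and one never checks anything.

```python
import datetime


def is_weekend(day):
    """Code under test (hypothetical): Saturday and Sunday count as weekend."""
    return day.weekday() >= 5


def brittle_test():
    # Depends on the day the test runs: it fails on every Saturday and Sunday,
    # although is_weekend() is correct. That is a false positive.
    assert not is_weekend(datetime.date.today())


def robust_test():
    # Fixed dates make the outcome independent of when the test runs.
    assert is_weekend(datetime.date(2024, 6, 1))      # a Saturday
    assert not is_weekend(datetime.date(2024, 6, 3))  # a Monday


def vacuous_test():
    # Never fails, even if is_weekend() is broken, because the result is
    # not checked. That is a false negative waiting to happen.
    is_weekend(datetime.date(2024, 6, 1))
```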

Brittle tests undermine the trust in the test suite. If deployment of a product with failing tests becomes the norm because everybody assumes those failed tests are false positives, the signaling value of the test suite has dropped to zero. You might still use it to track which of the tests failed in comparison to the last run, but this tends to degenerate into a lot of manual work that nobody wants to do.

Unfortunately, some kinds of tests are very hard to do robustly. Graphical user interface (GUI) tests tend to be very sensitive to layout or technology changes. Tests that rely on components outside your control can also be a source of brittleness.

False Sense of Security

A flawless run of a test suite can give you a false sense of security. This can be due either to false negatives (tests that should fail but do not) or missing test scenarios. Even if a test suite achieves 100% statement coverage of the tested code, it might miss some code paths or scenarios. Thus, you see a passing test run and take that as an indication that your software works correctly, only to be flooded with error reports once real customers get in contact with the product.
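A minimal sketch shows how full statement coverage can coexist with an undetected bug. The function average_rating and its test are hypothetical.

```python
def average_rating(ratings, skip_zeros):
    """Hypothetical code under test: mean of the ratings, optionally ignoring zeros."""
    if skip_zeros:
        ratings = [rating for rating in ratings if rating != 0]
    return sum(ratings) / len(ratings)


def test_average_rating():
    # This single test executes every statement of average_rating() ...
    assert average_rating([0, 4, 2], skip_zeros=True) == 3.0

# ... but the path with skip_zeros=False is never taken, and an empty list
# (or a list of only zeros with skip_zeros=True) raises ZeroDivisionError.
```

The test run is green and every statement was executed, yet the first customer without any ratings triggers an error.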

There is no direct solution for the overconfidence that a test suite can provide. Only through experience with a code base and its tests will you get a feeling for realistic confidence levels that a green (i.e., passing) test run provides.

1.3 Characteristics of a Good Test

A good test is one that combines several of the reasons for writing tests, while avoiding the downsides as much as possible. This means the test should be fast to run, simple to understand and maintain, give good and specific feedback when it fails, and be robust.

Maybe somewhat surprisingly, it should also fail occasionally, albeit when one expects the test to fail. A test that never fails also never gives you feedback and can’t help you with debugging. That doesn’t mean you should delete a test for which you never recorded a failure. Maybe it failed on a developer’s machine, and he or she fixed the bug before checking in the changes.

Not all tests can fit all of the criteria for good tests, so let’s look at some of the different kinds of tests and the trade-offs that are inherent to them.

1.4 Kinds of Tests

There is a traditional model of how to categorize tests, based on their scope (how much code they cover) and their purpose. This model divides code that tests for correctness into unit, integration, and system tests. It also adds smoke tests, performance tests, and others for different purposes.

Unit Tests

A unit test exercises—in isolation—the smallest unit of a program that makes sense to cover. In a procedural or functional programming language, that tends to be a subroutine or function. In an object-oriented language such as Python, it could be a method. Depending on how strictly you interpret the definition, it could also be a class or a module.

A unit test should avoid running code outside the tested unit. So, if you are testing a database-heavy business application, your unit test should still not perform calls to the database, access the network for API calls, or touch the file system. There are ways to substitute such external dependencies for testing purposes that I will discuss later, though if you can structure your code to avoid such calls, at least in most units, all the better.
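One common way to substitute a dependency is a mock object from the standard library's unittest.mock module. The sketch below assumes a hypothetical function count_active_users that receives a database connection with an execute method, as sqlite3 connections have. The test passes in a stand-in instead of a real connection.

```python
from unittest import mock


def count_active_users(connection):
    """Unit under test (hypothetical): counts rows flagged as active."""
    cursor = connection.execute("SELECT active FROM users")
    return sum(1 for (active,) in cursor.fetchall() if active)


def test_count_active_users():
    # A stand-in for the database connection; no real database is touched.
    connection = mock.Mock()
    connection.execute.return_value.fetchall.return_value = [(1,), (0,), (1,)]

    assert count_active_users(connection) == 2
    connection.execute.assert_called_once_with("SELECT active FROM users")
```

Passing the connection in as an argument is a design choice that makes the substitution trivial. Code that opens its own connection internally is harder to isolate.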

Because access to external dependencies is what makes most code slow, unit tests are usually blazingly fast. This makes them a good fit for testing algorithms or core business logic.

For example, if your application is a navigation assistant, there is at least one algorithmically challenging piece of code in there: the router, which, given a map, a starting point, and a target, produces a route or, maybe, a list of possible routes, with metrics such as length and expected time of arrival attached. This router, or even parts of it, is something that you want to cover with unit tests as thoroughly as you can. That includes strange edge cases that might cause infinite loops, as well as sanity checks, for instance, that a journey from Berlin to Munich doesn’t send you via Rome.

The sheer volume of test cases that you want for such a unit makes other kinds of tests impractical. Also, you don’t want such tests to fail, owing to unrelated components, so keeping them focused on a unit improves their specificity.
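To make this concrete, here is a toy router with a few unit tests. It is a sketch under assumptions: the function shortest_route, the map format, and the distances (rough, illustrative values in kilometers) are all made up, and the search is a textbook version of Dijkstra's algorithm rather than anything a real navigation product would use.

```python
import heapq


def shortest_route(roads, start, target):
    """Dijkstra's algorithm; returns (distance, path) or None if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        distance, city, path = heapq.heappop(queue)
        if city == target:
            return distance, path
        if city in visited:
            continue
        visited.add(city)
        for neighbor, length in roads.get(city, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (distance + length, neighbor, path + [neighbor]))
    return None


# Hypothetical road map with illustrative distances.
ROADS = {
    "Berlin": {"Leipzig": 190, "Rome": 1500},
    "Leipzig": {"Berlin": 190, "Nuremberg": 280},
    "Nuremberg": {"Leipzig": 280, "Munich": 170},
    "Rome": {"Berlin": 1500, "Munich": 920},
    "Munich": {"Nuremberg": 170, "Rome": 920},
    "Island": {},
}


def test_route_avoids_detour():
    distance, path = shortest_route(ROADS, "Berlin", "Munich")
    assert path == ["Berlin", "Leipzig", "Nuremberg", "Munich"]
    assert distance == 640


def test_start_equals_target():
    assert shortest_route(ROADS, "Berlin", "Berlin") == (0, ["Berlin"])


def test_unreachable_target_terminates():
    # The map contains cycles; the search must still stop and report failure.
    assert shortest_route(ROADS, "Berlin", "Island") is None
```

Each test runs in well under a millisecond and touches nothing but the router, so a failure points directly at the routing code.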

Integration Tests

If you assembled a complex system such as a car or a spacecraft from individual components, and each component works fine in isolation, what are the chances the thing as a whole works? There are so many ways things could go wrong: some wiring might be faulty, components want to talk through incompatible protocols, or maybe the joints can’t withstand the vibration during operation.

It’s no different in software, so one writes integration tests. An integration test exercises several units at once. This makes mismatches at the boundaries between units obvious (via test failures), enabling such mistakes to be corrected early.
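A typical boundary mismatch is a disagreement about units. In this sketch, all three functions are hypothetical. Each of the two units is correct on its own, and only a test that runs them together can show whether the conversion between them is right.

```python
def measure_distance_m(start, end):
    """Unit 1 (hypothetical): returns a distance in meters."""
    return abs(end - start) * 1000


def travel_time_h(distance_km, speed_kmh):
    """Unit 2 (hypothetical): expects a distance in kilometers."""
    return distance_km / speed_kmh


def plan_trip(start, end, speed_kmh):
    """Combines both units; the conversion happens at the boundary."""
    distance_km = measure_distance_m(start, end) / 1000
    return travel_time_h(distance_km, speed_kmh)


def test_plan_trip_integration():
    # Exercises both units together. Dropping the "/ 1000" conversion in
    # plan_trip() would make this fail, while unit tests for
    # measure_distance_m() and travel_time_h() would still pass.
    assert plan_trip(0, 100, 50) == 2.0
```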

System Tests

A system test puts a piece of software into an environment and tests it there. For a classical three-tiered architecture, a system test starts from input through the user interface and tests all layers down to the database.

Where unit tests and integration tests are white box tests (tests that require and make use of the knowledge of how the software is implemented), system tests tend to be black box tests. They take the user’s perspective and don’t care about the guts of the system.

This makes system tests the most realistic, in terms of how the software is put under test, but they come with several downsides.

First, managing dependencies for system tests can be really hard. For example, if you are testing a web application, you typically first need an account that you can use for login, and then each test case requires a fixed set of data it can work with.

Second, system tests often exercise so many components at once that a test failure doesn’t give good clues as to what is actually wrong. A developer must then look at each test failure, often only to find out that it is unrelated to the changes that were made.

Third, system tests expose failures in components that you did not intend to test. A system test might fail owing to a misconfigured Transport Layer Security (TLS) certificate in an API that the software uses, and that might be completely outside of your control.

Last, system tests are usually much slower than unit and integration tests. White box tests allow you to test just the components you want, so you can avoid running code that is not interesting. In a system test for a web application, you might have to perform a login, navigate to the page that you want to test, enter some data, and then finally do the test you actually want to do. System tests often require much more setup than unit or integration tests, increasing their runtime and lengthening the time until one can receive feedback about the code.

Smoke Tests

A smoke test is similar to a system test, in that it tests each layer in your technology stack, though it is not a thorough test for each. It is usually not written to test the correctness of some part of your application but, rather, that the application works at all in its current context.

A smoke test for a web application could be as simple as a login, followed by a call to the user’s profile page, verifying that the user’s name appears somewhere on this page. This does not validate any logic but will detect things like a misconfigured web server or database server or invalid configuration files or credentials.

To get more out of a smoke test, you can add a status page or API end point to your application that performs additional checks, such as for the presence of all necessary tables in a database, the availability of dependent services, and so on. Only if all those runtime dependencies are met will the status be “OK,” which a smoke test can easily determine. Typically, you write only one or two smoke tests for each deployable component but run them for each instance you deploy.
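The check behind such a status page might be sketched like this. The table names and the function names are assumptions, and SQLite from the standard library stands in for whatever database the application really uses. The smoke test itself only has to compare the result to "OK".

```python
import sqlite3

REQUIRED_TABLES = {"users", "orders"}  # hypothetical schema


def status(connection):
    """Return "OK" only if every required table exists, else a description."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    missing = REQUIRED_TABLES - {name for (name,) in rows}
    if missing:
        return "missing tables: " + ", ".join(sorted(missing))
    return "OK"


def smoke_test(connection):
    assert status(connection) == "OK"
```

In a real deployment, the status function would sit behind an HTTP end point and would also probe dependent services. The principle stays the same: the application reports on its own runtime dependencies, and the smoke test checks for a single, simple answer.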

Performance Tests

The tests discussed so far focus on correctness, but nonfunctional qualities, such as performance and security, can be equally important. In principle, it is quite easy to run a performance test: record the current time, run a certain action, record the current time again. The difference between the two time recordings is the runtime of that action. If necessary, repeat and calculate some statistics (e.g., median, mean, standard deviation) from these values.
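The procedure just described fits into a few lines of Python. The helper names measure and summarize are made up for this sketch. It uses time.perf_counter, a clock intended for measuring short durations, and the statistics module from the standard library.

```python
import statistics
import time


def measure(action, repetitions=5):
    """Run action repeatedly and return the individual runtimes in seconds."""
    runtimes = []
    for _ in range(repetitions):
        started = time.perf_counter()
        action()
        runtimes.append(time.perf_counter() - started)
    return runtimes


def summarize(runtimes):
    """Condense a list of runtimes into a few summary statistics."""
    return {
        "median": statistics.median(runtimes),
        "mean": statistics.mean(runtimes),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(runtimes) if len(runtimes) > 1 else 0.0,
    }
```

A call such as summarize(measure(lambda: sorted(range(100_000)))) yields the three statistics for one action. The numbers will differ from run to run and from machine to machine.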

As usual, the devil is in the details. The main challenges are the creation of a realistic and reliable test environment, realistic test data, and realistic test scenarios.

Many business applications rely heavily on databases. So, your performance test environment also requires a database. Replicating a big production database instance for a testing environment can be quite expensive, both in terms of hardware and licensing costs. So, there is temptation to use a scaled-down testing database, which comes with the risk of invalidating the results. If something is slow in the performance tests, developers tend to say “that’s just the weaker database; prod could handle that easily”—and they might be right. Or not. There is no way to know.

Another insidious aspect of environment setup is the many moving parts when it comes to performance. On a virtual machine (VM), you typically don’t know how many CPU cycles the VM got from the hypervisor, or if the virtualization environment played funny tricks with the VM memory (such as swapping out part of the VM’s memory to disk), causing unpredictable performance.

On physical machines (which underlie every VM as well), you run into modern power-management systems that control clock speed, based on thermal considerations, and in some cases, even based on the specific instructions used in the CPU.6

All of these factors make performance measurements much less deterministic than you might naively expect from such a deterministic system as a computer.

1.5 Summary

As software developers, we want automated tests to give us fast feedback on changes, catch regressions before they reach the customer, and provide us enough confidence in a change that we can refactor code. A good test is fast, reliable, and has high diagnostic value when it fails.

Unit tests tend to be fast and have high diagnostic value but only cover small pieces of code. The more code a test covers, the slower and more brittle it tends to become, and its diagnostic value decreases.

In the next chapter, we will look at how to write and run unit tests in Python. Then we will investigate how to run them automatically for each commit.

