Chapter 3: The Untestables

Some things make testing more challenging. Many of them are global variables of one sort or another. Global variables cause spooky action at a distance: do the same thing twice and you may get a different result. Tests may pass or fail at random, depending on the order in which they are run.

How to make any code testable

In general there are two options, the first one being simpler:

  • Pass in the result of the untestable thing as a parameter.

Before

fn():
  ...
  ☠️☠️☠️
  ...

After

fn(something):
  ...
  ...

  • Extract the untestable thing to a method and override it in a subclass. Or in a language with first-class functions, extract it to a function and pass in the untestable function as a parameter.

Before

fn():
  ...
  ☠️☠️☠️
  ...

After

fn():
  ...
  something()
  ...
  
something():
  ☠️☠️☠️

Tests can then replace the untestable thing with a value object or a test double.
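
For example, in TypeScript (a minimal sketch; the greeting example and the currentSession global are made up for illustration):

// a global variable: the untestable thing
const currentSession = { userName: "admin" };

// Option 1: pass in the result of the untestable thing as a parameter
function greet(userName: string): string {
  return `Hello, ${userName}!`;
}

// Option 2: pass in the untestable thing itself as a function
function greetVia(getUserName: () => string): string {
  return `Hello, ${getUserName()}!`;
}

// production code uses the real thing
greet(currentSession.userName);
greetVia(() => currentSession.userName);

// tests pass a plain value or a stub function
greet("Alice");
greetVia(() => "Alice");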

Test doubles

If the SUT (system under test) is not a pure function and is hard to test together with its real dependencies, those dependencies can be replaced with test doubles. The dependencies can be provided as method or constructor arguments (aka dependency injection).

There are five main categories of test doubles:

  • Dummy is a placeholder to make the code compile, but doesn't affect the SUT.
  • Stub returns data to the SUT e.g. using hard-coded method return values.
  • Spy records how the SUT calls the spy, so that the test can afterwards assert on the recorded data.
  • Mock contains pre-recorded expectations on how the SUT should call it, and will itself automatically verify those expectations. Typically implemented with a mocking framework.
  • Fake is a simplified implementation of a dependency, not appropriate for production use, e.g. persistence layer based on a hashmap.
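
As a sketch of the first two interaction styles in TypeScript (the Greeter, UserDirectory and Mailer names are made up for illustration):

interface UserDirectory {
  findName(userId: string): string;
}
interface Mailer {
  send(to: string, message: string): void;
}

// the SUT, with its dependencies injected through the constructor
class Greeter {
  constructor(private users: UserDirectory, private mailer: Mailer) {}
  greet(userId: string): void {
    const name = this.users.findName(userId);
    this.mailer.send(name, `Hello, ${name}!`);
  }
}

// Stub: returns hard-coded data to the SUT
const userStub: UserDirectory = { findName: () => "Alice" };

// Spy: records how the SUT calls it, for the test to assert on afterwards
class MailerSpy implements Mailer {
  sent: Array<{ to: string; message: string }> = [];
  send(to: string, message: string): void {
    this.sent.push({ to, message });
  }
}

const mailerSpy = new MailerSpy();
new Greeter(userStub, mailerSpy).greet("user-1");
// the test would assert that mailerSpy.sent contains { to: "Alice", message: "Hello, Alice!" }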

Only mock types you own.

London school of TDD

Mock objects were invented at a London meetup, which gave birth to a mock-based outside-in approach to TDD, commonly called London style TDD. This is in contrast to Detroit/Chicago style TDD, where the code is typically written bottom-up and dependencies are faked only when they complicate testing (so named because Chrysler's C3 project, which gave birth to Extreme Programming, happened in Detroit). They are also known as the mockist and classicist styles.

London style TDD focuses on the communication protocols between objects sending messages to each other. It goes hand in hand [1] with Alan Kay's original vision of object-oriented programming, where objects are like individual computers on a network, sending messages to each other (in which sense Erlang is the most object-oriented programming language).

When using mock objects, it's important to understand the object-oriented style for which they were created. Otherwise, over-mocking may lead to tight coupling between tests and implementation details. The best description of how mock objects were meant to be used is the book Growing Object-Oriented Software, Guided by Tests (Steve Freeman, Nat Pryce 2009).

Singletons

Singleton is an anti-pattern. It is the object-oriented equivalent of a global variable. Instead, just create one instance at the application's entry point and pass it to the code that needs it.
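
A sketch of the difference in TypeScript (the Config example is made up for illustration):

// Anti-pattern: the class itself enforces a single, globally accessible instance
class ConfigSingleton {
  private static readonly instance = new ConfigSingleton("https://example.com");
  static getInstance(): ConfigSingleton {
    return ConfigSingleton.instance;
  }
  private constructor(readonly baseUrl: string) {}
}

// "Just create one": a plain class; the application entry point constructs
// a single instance and passes it to the code that needs it, and tests
// are free to construct their own
class Config {
  constructor(readonly baseUrl: string) {}
}

const productionConfig = new Config("https://example.com");
const testConfig = new Config("http://localhost:8080");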

File system

The file system is a global variable which persists between test executions.

If a test needs to write to the disk, create a unique temporary directory on test setup and delete it recursively on teardown.

If the test process is killed or there are file locks, the teardown may not be able to delete the temporary directory. Avoid using /tmp and instead place the temporary directory inside the project directory, under the build target directory, so that any stale directories will be removed on a clean build.
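
A sketch of this in TypeScript for Node.js, assuming a Jest/Vitest-style beforeEach/afterEach and a target/temp directory under the build output (the exact path is an assumption):

import * as fs from "node:fs";
import * as path from "node:path";

let testDir: string;

beforeEach(() => {
  // a unique temporary directory per test, under the build target directory instead of /tmp
  fs.mkdirSync(path.join("target", "temp"), { recursive: true });
  testDir = fs.mkdtempSync(path.join("target", "temp", "test-"));
});

afterEach(() => {
  // delete recursively; force ignores the error if it is already gone
  fs.rmSync(testDir, { recursive: true, force: true });
});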

Database

The database is a global variable which persists between test executions.

Make it easy to run the tests locally. Docker Compose makes it simple to start a database without having to install it. For cloud-only databases, a development instance in the cloud is necessary. If more than one person (or process) uses the same database, care must be taken to isolate the tests from parallel test runs.

Usually tests create the database schema on test setup and remove it on teardown. Another style is to remove and recreate the schema on test setup, which makes it the responsibility of the next test to clean up what the previous test produced. Clean-before allows peeking at the data after a test run, but you would still need to run just a single, focused test, so commenting out the teardown or adding a sleep gives the same effect with the clean-after approach. (Mom always told you to clean up after yourself.)

In focused integration tests, it may be possible to run each test in a rollback-only transaction. This should make tests faster by avoiding the need to recreate the database schema for each test. If more than one thread is involved or the SUT is complex, this strategy is usually not possible.
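
A sketch of the rollback-only style with node-postgres, assuming a Jest/Vitest-style test runner and that the SUT runs its queries through this same client (connection settings are assumed to come from the environment):

import { Client } from "pg";

const client = new Client();

beforeAll(async () => {
  await client.connect();
});

beforeEach(async () => {
  await client.query("BEGIN");
});

afterEach(async () => {
  // undo everything the test wrote, so the next test starts from a clean state
  await client.query("ROLLBACK");
});

afterAll(async () => {
  await client.end();
});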

Tests may create their own test data, or there may be a shared set of test data in the database. The former makes tests more understandable and decoupled from each other. The latter can also be used for testing database migrations.

The test schema name may be hard-coded or unique. Unique names for each test make it possible to run tests in parallel. If the test process is killed, test teardown is never executed, so the tests should automatically remove stale test schemas (especially if using a shared long-running database instead of a local container/VM), or you will eventually learn the database's soft and hard limits.

Never run tests against a production database. One safeguard is for the tests to only connect to a database whose name starts with "test".
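
A sketch combining unique schema names and the naming safeguard, again with node-postgres (the database and environment variable names are made up):

import { randomUUID } from "node:crypto";
import { Client } from "pg";

const databaseName = process.env.TEST_DATABASE ?? "test_myapp";
if (!databaseName.startsWith("test")) {
  throw new Error(`Refusing to run tests against database "${databaseName}"`);
}

// a unique schema per test run makes parallel runs possible
const schema = "test_" + randomUUID().replaceAll("-", "");
const client = new Client({ database: databaseName });

beforeAll(async () => {
  await client.connect();
  await client.query(`CREATE SCHEMA "${schema}"`);
  await client.query(`SET search_path TO "${schema}"`);
});

afterAll(async () => {
  await client.query(`DROP SCHEMA "${schema}" CASCADE`);
  await client.end();
});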

Database tests can be made faster by disabling fsync or using a RAM disk.

Dead ends

You could replace the database with an in-memory fake implementation for tests (e.g. a hashmap). It will make the tests faster, but it requires maintaining two parallel implementations: the real and the fake persistence layer. It works in simple cases, but gets harder the more database code there is. Even when using contract tests to make the implementations functionally equivalent, they will be leaky abstractions with non-obvious differences (transactions, foreign key constraints etc.). It's better to decouple business logic from persistence: you won't need to fake dependencies if you have no dependencies.¹

Some people use an embedded in-memory database in tests and a different database in production, for example HSQLDB vs PostgreSQL. This is a road to madness. Even if SQL is a standard,² each implementation is different, so you will anyway need to run the tests against both databases. It might avoid having to install a database, and the data will be removed after the test process exits, but nowadays docker compose up -d db is easy, and even with an in-memory database you will need to handle isolation between test cases. Speed is not an argument either: a PostgreSQL that is already running is faster than an HSQLDB that needs to start on every test run, to say nothing of runtime performance. Most importantly, you would be limited to the subset of SQL that works on both databases, or you would need to maintain alternative versions of the queries; you would also miss out on useful database-specific features such as triggers/stored procedures and range types. Summa summarum: use the same technology in tests as in production.

You just saved 5+ years of experimenting.

Network sockets

Network socket port numbers are a global variable at the operating system level.

If using a shared continuous integration server, there can be many builds running on the same machine and they compete for the same port numbers. Even when running tests locally, you will typically have an application instance for manual testing running at the same time as the automated tests. The local development instance may use hard-coded ports, but the tests should allocate a random free port for the database and web server.

Most servers can be told to listen on port 0, in which case the operating system will assign them an unused port number. After the server has started, you can find out which port was assigned and use that in the tests. docker compose port is handy for that. Another approach is to programmatically bind a socket to port 0, check which port number was assigned, close the socket, and then use that port number for starting the actual server; a port collision should be unlikely.
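
A sketch of the latter approach in TypeScript for Node.js:

import { createServer, type AddressInfo } from "node:net";

// Bind to port 0 so the OS picks an unused port, then release it
// and give that port number to the actual server.
function findFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      const port = (server.address() as AddressInfo).port;
      server.close(() => resolve(port));
    });
  });
}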

P.S. Docker by default binds to the 0.0.0.0 network interface and bypasses the firewall, so your development servers will be publicly accessible even if your firewall is configured to block all incoming connections. Always bind explicitly to 127.0.0.1 when publishing container ports to the host (e.g. "127.0.0.1:8080:8080" instead of "8080:8080").

Time

Time is a global variable which is ever changing (hopefully monotonically increasing).

Code which reads the current time (e.g. using new Date()) is inherently untestable. Instead, pass in the current time as a method parameter, or inject a clock which can be replaced with a fake clock in tests.
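
A minimal sketch of an injectable clock in TypeScript (the Clock interface and isExpired function are made up for illustration):

interface Clock {
  now(): Date;
}

const systemClock: Clock = { now: () => new Date() };

// code under test depends on the Clock abstraction, not on new Date()
function isExpired(expiresAt: Date, clock: Clock): boolean {
  return clock.now().getTime() > expiresAt.getTime();
}

// tests inject a fixed clock, so the result is deterministic
const fixedClock: Clock = { now: () => new Date("2024-01-01T12:00:00Z") };
isExpired(new Date("2023-12-31T00:00:00Z"), fixedClock); // true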

Concurrency

The order of memory reads and writes between parallel threads, and the operating system's context switching, are unpredictable global variables.

If a test fails randomly, don't ignore it as an outlier. The code has a concurrency bug, either in the production code or in the test code. Save any stack traces and logs of the failure, and inspect the code ruthlessly until you know why it failed. It's important to know the memory model of the programming language and the CPU.

Minimize the amount of code that needs to be thread-safe. Use concurrency abstractions which allow most of the code to be single-threaded. Immutability makes the code easier to reason about, even in single-threaded code.

Don't use sleep() in tests. The sleep time is either too long, making the tests slower, or too short, making them flaky (i.e. they fail randomly). Instead, react to events or use polling.
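
A sketch of a polling helper in TypeScript (the name and timings are made up; the test framework's own timeout should still act as a backstop):

async function waitUntil(
  condition: () => boolean,
  timeoutMs = 5000,
  pollIntervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error("timed out waiting for the condition to become true");
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}

// in a test:
// await waitUntil(() => mailerSpy.sent.length >= 1);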

Concurrency primitives such as CountDownLatch and CyclicBarrier are useful for unit testing concurrent code. With them you can make thread 1 wait at point A until thread 2 has arrived at point B.

Testing cannot prove that code is thread-safe, but together with code review, you can get quite far by writing a test which executes lots of tasks in parallel and then asserts invariants about what the tasks did. For example, each write happened exactly once, each task saw a consistent view of the state, tasks could read their own writes, and so on.
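
A sketch of this style of test in TypeScript, assuming a Jest/Vitest-style test runner (the register function stands in for the real SUT; in a multi-threaded language the tasks would run on real threads):

const registered: string[] = [];

async function register(id: string): Promise<void> {
  // imagine reads and writes to shared state here
  registered.push(id);
}

test("every write happens exactly once", async () => {
  const ids = Array.from({ length: 1000 }, (_, i) => `task-${i}`);
  await Promise.all(ids.map((id) => register(id)));

  // invariants: nothing was lost and nothing was duplicated
  expect(registered).toHaveLength(1000);
  expect(new Set(registered).size).toBe(1000);
});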

Always have a timeout for asynchronous tests, in case the code gets stuck in an infinite loop or a deadlock, or never sends an expected event. The timeout needs to be long enough not to be triggered randomly when the computer is overloaded, but short enough that you don't need to wait long for the tests to fail, especially if the total wait time is NumberOfTests * Timeout.

Randomness

It's desirable for tests to pass or fail reliably. But what if the code being tested is meant to have randomness? If you can no longer assert exact values, you will need to approach it like property-based testing and assert invariants.

For example, let's test a function which returns random integers between 1 and 10. You can call it lots of times and check that all values are within the range 1 to 10. You can also check that, with a sufficiently large sample size, each of the integers between 1 and 10 is returned at least once. You can further check that the values are returned in an unpredictable order: build a few lists of the same length from the return values, and check that the list contents are different. Depending on the domain, there might be other restrictions as well. For example, when dealing cards from a deck, each card appears exactly once.
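
A sketch of these invariants in TypeScript, assuming a Jest/Vitest-style test runner (randomIntBetween1and10 stands in for the code under test):

function randomIntBetween1and10(): number {
  return Math.floor(Math.random() * 10) + 1;
}

test("returns random integers between 1 and 10", () => {
  const values = Array.from({ length: 1000 }, () => randomIntBetween1and10());

  // every value is within the range
  for (const value of values) {
    expect(value).toBeGreaterThanOrEqual(1);
    expect(value).toBeLessThanOrEqual(10);
  }
  // with a large sample, every possible value appears at least once
  expect(new Set(values).size).toBe(10);
  // the order is not predictable: two independent samples differ
  const moreValues = Array.from({ length: 1000 }, () => randomIntBetween1and10());
  expect(moreValues).not.toEqual(values);
});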

But even if you assert that the random values are not predictable, once in a blue moon the values could be returned in a seemingly predictable order¹ and your tests would fail. To improve repeatability, you could always use the same seed for the pseudorandom number generator. Or better yet, choose randomly from a couple of different hard-coded seeds, so that the tests cannot be coupled to any single predictable random order.

(Testing whether something is true randomness is outside the scope of this course. That's in the realm of mathematics and not TDD.)

User interface

Tests should be sensitive to behavior changes and insensitive to structure changes. This is even more important in the user interface. Changing the visual style or layout of the UI should not break behavioral tests.

There are patterns like Passive View which try to separate the logic and the visuals of the UI, to make the logic more testable. With the advent of React, UI components can be written as stateless functions, which makes testing them easier.

Unit testing web app components

Asserting on the innerText of a component (after whitespace normalization) produces tests which are decoupled from visual changes.

Asserting the presence/absence of a CSS class is useful for testing logic that is observable only visually. Make sure to use the same constant for the presence and absence checks; a misspelled CSS class is always absent.
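
A sketch of both assertions in TypeScript, assuming the component has been rendered into an element in a DOM-based test (the helper and class name are made up):

// whitespace normalization keeps the assertion decoupled from layout and styling
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// shared by production code and tests, so a typo cannot slip in
export const HIGHLIGHTED_CLASS = "highlighted";

// in a test, after rendering the component into `element`:
// expect(normalizeWhitespace(element.innerText)).toBe("3 items in cart, total 15 €");
// expect(element.classList.contains(HIGHLIGHTED_CLASS)).toBe(true);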

End-to-end testing web apps

Don't click buttons directly in test code. Create an automation layer of high-level operations and call those. The tests should focus on what the system does, and the automation layer on how the system does it. That way, when the UI changes, only the automation layer needs to be updated, instead of fixing all tests individually.

Prefer selecting elements based on the visible text on the button/link/label; it makes the tests easier to read. But don't be afraid to add extra IDs, classes and data attributes to simplify testing.
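
A sketch of such an automation layer in TypeScript, assuming Playwright-style locators (the shop domain, method names and selectors are made up):

import { expect, type Page } from "@playwright/test";

// the automation layer knows how the UI works
class ShopDriver {
  constructor(private page: Page) {}

  async addToCart(product: string) {
    await this.page.getByRole("button", { name: `Add ${product} to cart` }).click();
  }

  async expectCartTotal(total: string) {
    await expect(this.page.getByTestId("cart-total")).toHaveText(total);
  }
}

// the tests only say what the system should do:
//   const shop = new ShopDriver(page);
//   await shop.addToCart("Coffee");
//   await shop.expectCartTotal("4,50 €");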

Have only a few end-to-end tests. They are slow and flaky. Prefer unit tests. Set a hard limit for how many end-to-end tests the whole application may have (≤10 even for big apps) and stick to it. End-to-end tests should only check that things are wired together, not behavioral correctness. Overreliance on end-to-end tests can grind development to a halt.

Visual testing

It's hard to write an assertion that something looks good. But for a human it's easy to check it visually, and the computer can compare whether the pixels have changed since the last approval.

There are tools like Storybook for rendering UI components in various states, and it's possible to take a screenshot of the result and check whether it has changed.
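
For example, with Playwright's screenshot comparison (a sketch; the Storybook URL and story id are assumptions, and other visual testing tools work similarly):

import { expect, test } from "@playwright/test";

test("primary button looks the same as before", async ({ page }) => {
  // render a single Storybook story in isolation
  await page.goto("http://localhost:6006/iframe.html?id=button--primary");
  // the first run records a baseline image; later runs fail if the pixels change
  await expect(page).toHaveScreenshot("button-primary.png");
});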

Optimize the diff for humans. Even video and audio can be diffed as an image.


Proceed to Chapter 4: Legacy code or Exercises