Unit Tests Aren't Free

19 January 2019

When I was thinking about leaving Google, and interviewing at lots of companies, I tried to ask at least one interviewer per company about unit tests. Google’s culture was very pro-unit test - it was basically impossible to get a PR approved without unit tests, there was a ton of internal documentation and education about how to write unit tests (even in the bathroom. Seriously), etc. etc. - and I thought that I could probably learn some important information about a company’s engineering culture by asking this question. During one of my interviews, the interviewer said, “well, we really don’t have that many unit tests. We have really good monitoring and generally the code coverage of people using our product is better than the code coverage we’d get by adding unit tests, so we’ve built a system where we can see things go wrong very quickly and fix them very quickly.”

This blew my mind. At first I wanted to argue! Of course tests are good! Aren’t you just testing on your customers? It took me a while to see the underlying wisdom in this philosophy - it’s not a philosophy of “unit tests are bad and we should test only in production”; rather, it’s a philosophy of “we should be efficient and practical in how we build and maintain confidence in our systems.”

I hadn’t considered the spectrum of usefulness of tests before, largely because I’d been living in the strange world of Google. When you can devote nearly infinite resources to any particular problem, some cost:benefit calculations become very strange. For instance, when you ran unit tests at Google, your tests ran in parallel on dozens of machines in a data center, dedicated (for the time the tests were running) to your tests. There were rigidly defined rules governing what tests could or couldn’t do, which let those machines further optimize how they ran the tests. And there was rich dependency information for even further optimization. The visible “costs” of tests decrease sharply when you hire a large team of very skilled people devoted to decreasing those costs. When you buy enough hardware to multiply that effect, it drives the “visible cost an engineer feels for tests” close to zero.

Google was right to place a large emphasis on unit tests, but their calculus is unique: a surfeit of computing power, money to hire a large team of people to optimize how unit tests run, enormous scale, and (in no small part because of that scale) a relatively slow deploy process.

Not every company is like Google. Indeed, there are maybe 3 or 4 other companies in the world that have a similar set of freedoms and constraints. Unfortunately, it’s very easy to look at what Google (or one of those other 3 or 4 companies) does in the abstract and think it’s the right thing for everyone. It’s not. (nb: this doesn’t just apply to unit tests)

It’s not hard to think of companies that have a fast deploy process, moderate scale (large scale by any objective measure, but still orders of magnitude smaller than Google), and fewer (if any) engineers devoted solely to improving how unit tests run. This is a profoundly more common scenario than Google’s. So why apply Google’s philosophy in a place where the landscape is completely different?

Let’s think about the benefits of unit tests. These are the top four I’ve seen. There are certainly many more, but these are pretty broad, and most tests will fulfill at least one of these:

Make it easier to write code: When working in a big, complex product, manually testing your changes is often time-consuming or prone to failure. Unit tests can give you a consistent way to run your code, inspect it in the debugger, log error messages, etc. Using unit tests in this way can be an enormous time-saver. You can also use existing unit tests to ensure that your code is getting called the way you think it should be, so the benefit of these types of unit tests can be long-lived.
Nudge you towards writing better code: Code that’s easy to test can also be easier to understand: more modular, with more distinct lines of dependency and looser coupling between functions or objects. Succinct tests can and often do lead to succinct code.
Communicate intent: By reading the unit tests for your function or class, other engineers get to see how you expected your code to be used, and probably more importantly, how you didn’t expect it to be used. Assertions in many test frameworks encourage you to write light documentation; e.g. Hamcrest’s assertThat function accepts an optional description string as its first parameter.
Build confidence that code is continuing to run as expected: This can make it easier to refactor or otherwise modify the code that’s under test, especially in non-type-safe languages where there’s no compiler to check that, for example, the correct number of parameters is being passed to a function. But even with type safety (or a static analyzer like Phan), unit tests can give you a qualitative improvement in the safety of your changes.
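
To make the "communicate intent" point concrete, here is a minimal sketch in Python's built-in unittest module (not Hamcrest, and not code from any project mentioned here). The parse_port function and its rules are hypothetical, invented only for illustration. The optional msg argument plays the same role as an assertion description string.

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical function under test: parse a TCP port from a string."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_accepts_numeric_string(self):
        # The msg argument doubles as light documentation of intent.
        self.assertEqual(
            parse_port("8080"),
            8080,
            msg="a plain numeric string is the expected input",
        )

    def test_rejects_port_zero(self):
        # Documents how the function is NOT meant to be used.
        with self.assertRaises(ValueError):
            parse_port("0")
```

Saved in a file, these tests can be run with `python -m unittest`. A reader who never opens parse_port still learns from the tests that it expects a numeric string and refuses out-of-range ports.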

These things can add up to faster development with higher confidence in its correctness and safety. But they don’t always have that effect; there are costs associated with the above benefits. If we can be explicit about some costs of the tests in addition to their benefits, it gives us a framework to consider the value of a given test.

Make it harder to modify code: raise your hand if you’ve ever had the onerous task of modifying a whole bunch of unit tests to make them pass when you make a simple change to a function’s output. In the most pathological case this might be changing the contents of a returned string, but there are countless ways this can happen, and frequently all the work you do to make the tests pass doesn’t give you any more confidence in your code or the legacy code you’re modifying. You can of course counter that this is a symptom of poorly written tests; I would respond that these were perfectly well-written tests for the goal they were probably trying to achieve (making it easier to write the code in the first place). The frequency of this particular cost is high enough that it probably comes from people’s best intentions and most efficient behaviors at the time they were writing the code and the tests. It’s worth noting that this is a cost that even Google couldn’t escape - I wish I could recover the weeks of my life I spent modifying hardcoded XMPP stanzas in test files.
Sometimes testability of code can reduce the readability or succinctness of code. People joke about classes like FooSingletonFactoryFactory in Java, but classes like this exist to fulfill a pattern that’s imposed by unit tests, not a pattern that helps people understand code. I’m reasonably sure anyone reading this article can think of awkward-verging-on-unreadable constructions in code they’re working on that exist to mollify the restrictions imposed by the confluence of their language, their general test framework, and the particulars of their own project.
Running tests takes time. Here at Mailchimp, every test function spawns its own process. Unless it is explicitly decorated to the contrary, every test function gets a set of clean, empty databases. This is magic in the sense that tests won’t be flaky because of opaque state dependencies modified by other tests, but it’s resource-intensive, to put it mildly. Multiply the work done for each test by the large number of tests, and this safety net takes significant time - about 10 minutes when running in parallel on our (very powerful) Jenkins boxes, well over an hour locally. When I worked on Windows, running all the tests took over 48 hours, so this isn’t a problem unique to Mailchimp. In the absence of committed and ongoing optimization, lots of tests take lots of time to run. We can optimize the stuff that’s common to every test, of course, and then we can take the next step of ensuring that unit tests don’t access the network (they do and will), don’t access the database unless absolutely necessary (they do and will), don’t have arbitrary sleep commands (they do and will), and so on ad infinitum. There are also ways to mitigate this (e.g. annotating tests based on potentially slow dependencies), but that’s non-trivial work, and we aren’t alone in not having a team dedicated to doing this work.
Tests give you confidence. Isn’t this a good thing? Many times, it is. But there’s confidence, and there’s false confidence, and it’s frequently hard to tell them apart. Many times it’s very easy to think “okay, this code is under test already, I can feel safe about my changes.” This isn’t bad - it’s human nature. But existing tests might or might not cover the thing we’re changing, and even figuring out whether they do can take longer than writing a new single-use test case. Sometimes (many times) there’s time pressure, whether internal or external, that fights against doing that check, probably not even consciously. That tension would be very different if the lack of tests were more obvious, and might encourage different, safer behavior. Paradoxically, there are times when the lack of a safety net can lead to increased safety.
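
Two of the costs above are easy to show in a few lines. The sketch below uses Python's unittest module and is purely illustrative: render_greeting, the RUN_SLOW_TESTS environment variable, and the slow decorator are all hypothetical names, not how Mailchimp or Google actually do it. The first test is the brittle kind that breaks when a returned string changes; the second asserts only what callers depend on; the third shows one possible way to annotate a test with a slow dependency so it can be skipped by default.

```python
import os
import unittest


def render_greeting(name: str) -> dict:
    """Hypothetical function under test: structured data plus display text."""
    return {"name": name, "text": f"Hello, {name}!"}


# Hypothetical opt-in switch: slow tests run only when RUN_SLOW_TESTS=1 is set.
RUN_SLOW = os.environ.get("RUN_SLOW_TESTS") == "1"
slow = unittest.skipUnless(RUN_SLOW, "slow test; set RUN_SLOW_TESTS=1 to run")


class GreetingTest(unittest.TestCase):
    def test_exact_text(self):
        # Brittle: fails whenever the wording or punctuation changes.
        self.assertEqual(render_greeting("Ada")["text"], "Hello, Ada!")

    def test_mentions_name(self):
        # Less brittle: asserts only the property callers rely on.
        self.assertIn("Ada", render_greeting("Ada")["text"])

    @slow
    def test_with_slow_dependency(self):
        # Stand-in for a test that needs a database, the network, or a sleep.
        self.assertEqual(render_greeting("Ada")["name"], "Ada")
```

Even in this toy version the trade-off shows up: the tagging has to be applied and maintained by hand, and a skipped test is one more way for the suite to look greener than it is.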

There are other costs, some general, and some unique to particular organizations. But the four above are the ones that I’ve consistently seen since unit testing became popular (yes, there was a time we wrote entire working - well, “working”, I include Outlook in this set - applications without any unit tests at all).

A counter to every one of the above points is, “well, those are characteristics of bad tests (or bad testing frameworks), why not just make your tests good?” The point is, that’s not free either. There are absolutely times when it’s worthwhile to spend the time to improve a particular test rather than just deleting it, but there are times when that’s not the case. I’ve seen ample evidence that, in the absence of significant work and time, tests become “bad” in the sense that they fall prey to one or more of the above pathologies. Sometimes that work and time is best spent improving the tests themselves (or the testing frameworks, or the infrastructure on which the tests run); sometimes it’s best spent elsewhere.

Given that, the above gives us a simple scaffolding to help us think about how to apply some cost-benefit analysis to a particular test. Very often we’ll write tests as a harness to make sure our code is working - is it worth keeping those tests around? How are the tests impacting the perceived confidence about the code under test - is high confidence warranted? If not, could we delete the test? Similarly, maybe you’re refactoring an important code path that has test coverage, but the coverage is old and is causing you grief as you refactor. Is it worth thinking about how to refactor the tests before refactoring the code, or is it worth adding better observability and alerting, and removing the tests as you refactor? Either answer is possibly correct - you need to use good judgment here. But the answer isn’t automatically “improve the existing tests and/or add more” - it could be “improve some tests, add some monitoring, remove some tests.”

Nothing will be completely black and white. The answer to every interesting problem is “it depends.” We should embrace that ambiguity and shy away from absolutes - unit tests are neither good nor bad in the abstract, quantity of unit tests isn’t an absolute measure of anything other than quantity of unit tests, and we shouldn’t value adding a test more or less than we value consciously deciding not to add one, or to remove one.