One of the most useful classes I took in college was an introductory statistics class intended to discourage Poli. Sci. majors from continuing in their course of study (I was one at the time). The interesting thing about it was that it included the study and analysis of research as a core part of the coursework. We learned how to put a study together, how to develop controls, and the formulas used to determine whether our results were statistically significant.

TDD

So when Phil Haack announced that Research Supports the Effectiveness of TDD, I was more than a little interested in seeing what the linked report actually contained. Phil quotes from the abstract:

We found that test-first students on average wrote more tests and, in turn, students who wrote more tests tended to be more productive. We also observed that the minimum quality increased linearly with the number of programmer tests, independent of the development strategy employed.

Phil has obviously read the rest of the report and quotes his favorite passages, which seem to do as his title suggests. One of the things I worry about when I see things supporting the latest and greatest software development practices, however, is a strong tendency towards confirmation bias: looking for confirmation of current theories and overlooking counter-indicators.

So, being the curious type, and since TDD is something I’m keeping an eye on as a practice I might want to adopt myself someday, I dug into the report.

The Report

The report itself is remarkably well thought out. Though the sample size is small (a mere 24 students completed the exercise), it looks like the authors were very careful to make their study relevant to how software development actually works. They created tasks that built on and eventually altered previous functionality, communicated requirements as user stories (meaning the participants had to design the projects themselves, including class layout, interfaces, etc.), and the administrators had a large suite of black-box tests to confirm functional implementation (i.e., they tested the actual functionality delivered and ignored the structure, design, and tests).
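To make that last point concrete, here is a minimal sketch of what a black-box acceptance check might look like. The module and function names are my own invention, not the study’s actual tasks; the point is simply that such a check exercises only the delivered behavior through its public interface and ignores everything else.

    import unittest

    # Hypothetical public entry point of a participant's solution; the
    # study's actual tasks and interfaces are not reproduced here.
    from solution import calculate_total


    class BlackBoxAcceptanceTest(unittest.TestCase):
        """Exercises only the delivered behavior through its public
        interface; the internal design and the participant's own unit
        tests are deliberately ignored."""

        def test_empty_order_totals_zero(self):
            self.assertEqual(calculate_total([]), 0)

        def test_total_sums_item_prices(self):
            self.assertEqual(calculate_total([5, 10, 25]), 40)


    if __name__ == "__main__":
        unittest.main()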

It’s important to note that both the TDD and non-TDD groups created and used unit tests. Indeed, the groups are described as “Test First” and “Test Last” throughout the paper.

The authors include their data and results and even mention counter-indications (as good report authors should). This is a good thing. If you have a turn for that kind of thing, I recommend reading through it. It isn’t long, particularly as these things go.

Lies, Damn Lies, and . . .

Unfortunately, the authors disappoint me when I compare the abstract to the data in the paper. Indeed, the report becomes an example of why I don’t give much credence to abstracts anymore. Clever academics have long used abstracts to support the conclusions they desire, and this seems to be one such case. If you’re careful (as these authors are), you can even do it without actually lying. Each of the following statements from the abstract is true:

  • The test-first students on average wrote more tests.
  • Students who wrote more tests tended to be more productive.
  • The minimum quality increased linearly with the number of tests.

Note that only the first is specific to TDD (“Test First”). The other two stand alone, though most readers won’t catch that. Here are some equally true statements based on the data (pages 9 and 10 of the report, if you want to read along with me in your book):

  • The control group (non-TDD or “Test Last”) had higher quality in every dimension—they had higher floor, ceiling, mean, and median quality.
  • The control group produced higher quality with consistently fewer tests.
  • Quality was better correlated with the number of tests for the TDD group (an interesting point of differentiation that I’m not sure the authors caught).
  • The control group’s productivity was highly predictable as a function of number of tests and had a stronger correlation than the TDD group.

So TDD’s relationship to quality is problematic at best. Its relationship to productivity is more interesting. I hope there’s a follow-up study, because the productivity numbers simply don’t add up for me. There is an undeniable correlation between productivity and the number of tests, but that correlation is actually stronger in the non-TDD group (which had a single outlier, compared to roughly half of the TDD group falling outside the 95% band).
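For anyone who wants to see what “stronger correlation” means in practice, here is a minimal sketch using made-up numbers rather than the study’s data: compute the Pearson correlation between tests written and productivity for each group, and see which group hugs its trend line more tightly.

    import numpy as np

    # Made-up illustrative numbers, NOT the study's data:
    # (tests written, productivity score) pairs for each group.
    test_last = np.array([[5, 40], [8, 52], [11, 63], [14, 75], [17, 90]])
    test_first = np.array([[9, 35], [12, 70], [15, 55], [18, 88], [21, 60]])

    for name, group in [("Test Last", test_last), ("Test First", test_first)]:
        tests, productivity = group[:, 0], group[:, 1]
        # Pearson correlation between number of tests and productivity.
        r = np.corrcoef(tests, productivity)[0, 1]
        print(f"{name}: r = {r:.2f}")

    # A higher r means the points sit closer to a single trend line,
    # i.e. fewer outliers relative to the fitted relationship.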

My Interpretation

One of the things my statistics class pounded home was that correlation doesn’t equal causation. That’s particularly important to remember with reports like this one. Indeed, I think causation is a real problem with this study, at least insofar as “proving TDD effective” is concerned. It actually serves to undercut many common TDD claims of superiority.

Productivity is an example where causality is far from certain. It makes sense to me that more productive programmers write more tests, if only because productive programmers feel like they have the time to do things the way they know they should. Even with “Test First,” the emotional effect of “being productive” is going to have an impact on the number of tests you create before moving on to the next phase. (Note: you have to be careful here, because the study measures tests per functional unit; it’s not that productive programmers create more tests overall, but that they create more tests per functional unit.) Frankly, I think it’s natural to feel (and respond to) this effect even without a boss hanging over your shoulder waiting to evaluate your work.

In other words, you create more tests because you are a productive programmer rather than being a more productive programmer because you create more tests. At least, it makes sense to me that it would be so.

Truly problematic, however, are the quality results. That’s simply a disaster if you propound “Test First” as a guarantor of quality. I mean, sure, the number of unit tests per functional unit suggests a minimum quality through increased testing, but that’s only interesting in situations where minimum quality is important (like, for example, at NASA or in code embedded in medical equipment). The lack of any other correlation here is pretty pronounced any way you care to slice the data. Having a big clump in that upper-left quadrant is troubling enough, but then having the “Test Last” group nearly double the “Test First” group in the over-90% quality range is something that should be noticed and highlighted.

While correlation doesn’t equal causation, the lack of correlation pretty much requires a lack of causation.

Since quality is obviously not related to the number of tests (at least in this study), I think it is highly disingenuous to highlight the possible connection between number of tests and minimum quality. It may be the only positive thing you can say about quality and unit testing here, but it is hardly the only thing the data reveals. The lack of correlation between number of tests and quality is incredibly important and of broad general interest whether you are using formal TDD or not.

Thoughts and Wishes

I really wish that the authors had included a third group that did no unit testing at all. Since they broadened their scope after the fact to make statements about the efficacy of unit testing overall, it would be nice to have a baseline that didn’t do any unit testing yet went through the same process. The study setup was ideal for including such a group, since both the productivity and quality measurements were entirely independent of whatever testing was done (or not done).

That said, something occurred to me while reading this study that hadn’t before: quality can be a property of unit tests as well as of production code. That’s a huge duh, but consider what it could mean. For one, it means I’d love to see a follow-up that considers the quality of the tests themselves. I would expect to see a relationship between the quality of the product and the quality of the tests, if only because quality programmers will be competent in both arenas. This might be such a duh, though, that people simply assume test quality matches production quality. Still, I wonder whether that is actually the case.

More interesting still would be the correlation between “Testing First” and unit test quality. Ask yourself this: which unit tests are going to be better, the ones created before you’ve implemented the functionality or those created after? Or, considered another way, at which point do you have a better understanding of the problem domain, before implementing it or after? If you’re like me, you understand the problem better after implementing it and can thus create more relevant tests after the fact than before. I know better where the important edge cases are and how to expose them, and that could mean my tests are more robust.
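As an illustration (the function and its cases are my own invention, not anything from the study), the first test below is the obvious, happy-path check you tend to write before implementing anything, while the later two capture the kind of edge cases you usually only discover once you’ve wrestled with the implementation.

    import unittest

    # Hypothetical function under test, invented purely for illustration:
    # sums the integers in a comma-separated string.
    def sum_csv(text: str) -> int:
        if not text.strip():
            return 0
        return sum(int(part) for part in text.split(",") if part.strip())


    class SumCsvTests(unittest.TestCase):
        # The kind of test that tends to come first, before implementation:
        # the obvious happy path.
        def test_sums_simple_list(self):
            self.assertEqual(sum_csv("1,2,3"), 6)

        # The kinds of tests that tend to come after implementation, once
        # the awkward corners of the problem have been discovered.
        def test_empty_string_is_zero(self):
            self.assertEqual(sum_csv(""), 0)

        def test_tolerates_whitespace_and_trailing_comma(self):
            self.assertEqual(sum_csv(" 4 , 5 ,"), 9)


    if __name__ == "__main__":
        unittest.main()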

And yes, I realize after the fact that this is an inversion of section 6.2, which is quoted at length by Phil. I wish the authors had spent as much time in section 6.1 explaining the reasons for the quality results instead of what looks to me like an attempt to excuse them. That’s the problem with confirmation bias: results that run counter to expectations aren’t as interesting, so they receive cursory exploration at best.

Anyway, without question, testing first leads to having more tests per functional unit. The question is whether that is valuable. This study would seem to indicate that it probably isn’t, at least if quality is your intended gain. But then, I’m not that surprised that the number of tests doesn’t correspond to quality, just as I’m not surprised that the number of lines of code doesn’t correspond to productivity.