I’m a huge soccer fan, and I’m happily following the MLS Cup playoffs even though the local team was eliminated last week. Last night’s match between Real Salt Lake (RSL) and the Chicago Fire went to penalty kicks before one team finally prevailed. After the game ended, I went to mlsnet.com to watch the highlights and check out some of the stats. When I got there, the front page had this headline and teaser:

[screenshot of the mlsnet.com headline and teaser]
Quick – which team won? Did the Fire edge Real Salt Lake, or did RSL outlast the Fire?
If you read a bit more, you’ll see that “RSL will face the Galaxy in the 2009 MLS Cup”, so if you go with majority rules you’ll be correct, since RSL did indeed edge the Fire last night. Headline errors aren’t all that uncommon (e.g. Dewey Defeats Truman), so I don’t fault the news site at all. Unfortunately, a very close relative of this kind of error, the false positive, has been bugging the crap out of me lately, and this headline reminded me that it’s past time to share my thoughts.
Let’s say you have 10,000 automated tests (or checks, for those of you who speak Boltonese). We had a million or so on a medium-sized project I was involved with once, so 10k seems like a fair enough number for this example. Let’s also say that 98% of the tests are currently passing, and 2% (or 200 tests) are failing. This, of course, doesn’t mean you have 200 product bugs. Chances are that many of these failures are caused by the same product bug (and hopefully you have a way of discovering this automatically, because investigating even 200 failures manually is about as exciting as picking lint off of astroturf). Buried in those 200 failures are false positives – tests that fail because of bugs in the test rather than bugs in the product. I’ll be nice and say that 5% of the failures are false positives (you’re welcome to do your own math on this one). Now we’re down to 10 failures that aren’t really failures. You may be thinking that’s not too big of a deal – it’s only 0.1% of the total tests, and looking a bit closer at 10 tests to see what’s going on seems worth the overall sacrifice in test code quality. Testers in this situation either just ignore these test results or quickly patch the tests without much further thought.
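If you want to check the arithmetic yourself, here’s a minimal back-of-the-envelope sketch. The 10,000-test suite, the 2% failure rate, and the 5% false positive rate are all the assumptions from this example, not measurements from a real project:

```python
# Back-of-the-envelope math for the example above. Every rate here is
# an assumption from this post, not a measurement from a real suite.
total_tests = 10_000
failing = int(total_tests * 0.02)        # 2% of the suite is failing -> 200
passing = total_tests - failing          # -> 9,800

false_positive_rate = 0.05               # assume 5% of failures are test bugs
false_positives = int(failing * false_positive_rate)  # -> 10

print(f"{failing} failing, {passing} passing")
print(f"{false_positives} false positives, or "
      f"{false_positives / total_tests:.1%} of the whole suite")
```

Swap in your own suite size and rates; the interesting part is what happens when you apply that same 5% to the other side of the ledger.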
This worries me to no end. If 5% of your failing tests aren’t really failing, I think it’s fair to say that 5% of your passing tests aren’t really passing. I doubt that you (or the rest of the testers on your team) are capable of making mistakes only in the failing tests – you have crappy test code everywhere. A minute ago, you may have been ok with only 10 false positives out of 10k tests, but by the same 5% rate, 490 of your 9,800 “passing” tests are passing even though they should be failing – false negatives. Now feel free to add zeroes if you have more automated tests. I also challenge you to examine all 9,800 passing tests to see which 490 are the “broken” ones.
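For completeness, here’s the passing-side version of the same sketch – again using this post’s assumed rates, not real data:

```python
# The same assumed 5% test-bug rate, applied to the passing side of the suite.
passing = 9_800                    # the 10,000 tests minus the 200 failures
false_negative_rate = 0.05         # assumption: test bugs are spread evenly
false_negatives = int(passing * false_negative_rate)   # 5% of 9,800 -> 490

print(f"{false_negatives} of {passing} 'passing' tests may be hiding real failures")
```

The asymmetry is what makes this nasty: the 10 false positives at least show up in a failure report, but nobody goes looking at green results, so the 490 sit there indefinitely.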
Yet we (testers) continue to write fragile automation. I’ve heard quotes like “It’s not product code, why should it be good?”, or “We don’t have time to write good tests”, or “We don’t ship tests, so we can’t make them as high quality as shipping code”. So we deal with false positives, ignore the inverse problem, and bury our heads in the sand rather than write quality tests in the first place. In my opinion, it’s beyond idiotic – we’re wasting time, we’re wasting money, and we’re breeding the wrong habits in every tester who sets out to write automation.
But I remain curious. Are my observations consistent with what you see? Please convince me that I shouldn’t be as worried (and angry) as I am about this.