Conflicting Results

I’m a huge soccer fan, and I’m happily following the MLS Cup even though the local team was eliminated last week. Last night’s match between Real Salt Lake (RSL) and the Chicago Fire went to penalty kicks before one team finally prevailed. After the game ended, I went to mlsnet.com to watch the highlights and check out some of the stats. When I got there, the front page had this headline and teaser:

[Image: mlsnet.com headline and teaser]

Quick – which team won? Did the Fire edge Real Salt Lake, or did RSL outlast the Fire?

If you read a bit more, you’ll see that “RSL will face the Galaxy in the 2009 MLS Cup”, so if you go with majority rules you’ll be correct – RSL did indeed edge the Fire last night. Headline errors aren’t all that uncommon (e.g. Dewey Defeats Truman), so I don’t fault the news site at all. Unfortunately, a very close relative of this kind of error, the false positive, has been bugging the crap out of me lately, and this headline reminded me that it’s past time to share my thoughts.

Let’s say you have 10,000 automated tests (or checks, for those of you who speak Boltonese). We had a million or so on a medium-sized project I was involved with once, so 10k seems like a fair enough sample size for this example. Let’s say that 98% of the tests are currently passing and 2% (or 200 tests) are failing. This, of course, doesn’t mean you have 200 product bugs. Chances are that many of these failures are caused by the same product bug (and hopefully you have a way of discovering this automatically, because investigating even 200 failures manually is about as exciting as picking lint off of astroturf). Buried in those 200 failures are false positives – tests that fail due to bugs in the test rather than bugs in the product. I’ll be nice and say that 5% of the failures are false positives (you’re welcome to do your own math on this one). Now we’re down to 10 failures that aren’t really failures. You may be thinking that’s not too big of a deal – it’s only 0.1% of the total tests, and looking at 10 tests a bit more closely to see what’s going on is definitely worth the overall sacrifice in test code quality. Testers in this situation either just ignore these test results or quickly patch them without much further thought.

This worries me to no end. If 5% of your failing tests aren’t really failing, I think it’s fair to say that 5% of your passing tests aren’t really passing. I doubt that you (or the rest of the testers on your team) are capable of making mistakes only in the failing tests – you have crappy test code everywhere. A minute ago, you may have been ok with only 10 false positives out of 10k tests, but by the same logic, 490 of your 9,800 “passing” tests (5% of them) are passing even though they should be failing. Now feel free to add zeroes if you have more automated tests. I also challenge you to examine all 9,800 passing tests to see which 490 are the “broken” ones.
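
To spell out the arithmetic, here’s a minimal sketch in Python using the same hypothetical numbers (the 5% test-bug rate is my assumption from above, applied to both the failing and the passing buckets):

```python
# Hypothetical numbers from the example above; adjust to taste.
total_tests = 10_000
failing = int(total_tests * 0.02)        # 200 failing tests
passing = total_tests - failing          # 9,800 passing tests

test_bug_rate = 0.05                     # assumed rate of bugs in test code

false_positives = int(failing * test_bug_rate)   # fail, but the product is fine
false_negatives = int(passing * test_bug_rate)   # pass, but they shouldn't

print(false_positives)   # 10  -> the failures that at least get investigated
print(false_negatives)   # 490 -> the "passing" tests nobody ever looks at
```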

Yet we (testers) continue to write fragile automation. I’ve heard excuses like “It’s not product code, why should it be good?”, “We don’t have time to write good tests”, and “We don’t ship tests, so we can’t make them as high quality as shipping code”. So we deal with false positives, ignore the inverse problem, and bury our heads in the sand rather than write quality tests in the first place. In my opinion, it’s beyond idiotic – we’re wasting time, we’re wasting money, and we’re breeding the wrong habits in every tester who sits down to write automation.
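
To make the “passing test that should be failing” idea concrete, here’s an illustrative (entirely made-up) sketch of a common shape for that kind of test bug – the result is never asserted and exceptions are swallowed, so the check passes no matter what the product does:

```python
import unittest

def product_add(a, b):
    # Stand-in for the product under test; the bug here is deliberate.
    return a - b

class AdditionChecks(unittest.TestCase):
    def test_add_false_negative(self):
        try:
            # Bug 1: the comparison is evaluated but never asserted.
            product_add(2, 3) == 5
        except Exception:
            # Bug 2: any crash is swallowed, which also becomes a silent pass.
            pass

    def test_add_real_check(self):
        # What the check should do: assert the result and let unexpected
        # exceptions propagate as failures.
        self.assertEqual(product_add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()
```

Run as-is, the first check stays green despite the broken “product”, while the second one correctly fails.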

But I remain curious. Are my observations consistent with what you see? Please convince me that I shouldn’t be as worried (and angry) as I am about this.

Comments

  1. I’m pretty confident we have the false negative problem. I try to put in self-tests for my automation frameworks to prove that my methods are at least capable of failure (something along the lines of the sketch after the comments). I catch a lot of test bugs this way, unfortunately.

    We do typically try to hold test code to a high quality bar though, so we’re not quite as guilty of that problem.

  2. Never mind automated tests – I’m currently having problems reviewing manual tests that passed but should have failed 🙁

    (and I wish it was only 5% of them)

    [Alan] The numbers in my experience are worse too. But I thought the story was scary enough already.

  3. > Please convince me that
    > I shouldn’t be as worried
    >(and angry) as I am about this

    Hi Alan,

    It’s all about the approach.
    Since most testers have only just started studying programming, they end up reproducing all the errors and wrong assumptions that mature developers left behind long ago.

    A brilliant tester won’t necessarily create brilliant code… most likely it will be crappy code, especially if record/playback or another code generation tool is used.

    So don’t blame automated tests – they just reproduce the garbage that was put in.

    [Alan] Believe me, I’m not blaming automation – I’m blaming the testers who find it perfectly acceptable to create poor automation.

  4. This is exactly the problem I found when I started automating my tests, and it left me totally crestfallen because I had worked so hard on them.

    My passing tests that should have been failing showed me that special attention must be paid if one is developing for test.

    Separating testing and coding is much harder than it would appear on the surface, and it’s not as simple as “developers” writing “better” code than “testers.” There is complexity in the problem set of evaluating results.

  5. Is it fair to assume that the type of coding errors which cause a test to incorrectly pass and errors which cause an incorrect fail are of equal probability?

    For safety, many electrical appliances are designed so that if something goes wrong, they will refuse to work rather than refusing to stop. Is there a way we can engineer our tests so that if there’s a bug, it’s more likely to be a bogus fail than a bogus pass?

    Sorry, I only have questions, no answers.

    [Alan] – but they’re good questions, and enough questions may lead to an answer. Thanks for the comment.

  6. PS Your blog may have a bug: My previous comment was posted at 1:42, but the time stamp displayed is 2:42.

    [Alan] – oops – thanks for pointing that out. I set up the wrong UTC offset when I set up the blog. Should be fixed now.

  7. We humans are not yet good at programming the evaluation portion of tests. For that reason, the accuracy is woefully bad. A tweep told me “Automating checks is nearly 100% automated — meaning there is no need for human to evaluate the result – 0 or 1 …” If 0 or 1 were always an accurate pass or fail on the system under test alone, then the automated check would not need human evaluation. That has never yet been the case in ANY of the automation I’ve worked so hard to make maintainable. I think wishing for that is the wrong goal.

    Any automation which makes it easier for a human to evaluate the quality of important aspects of the software faster, cheaper, and better, start to finish, than they could without it is automation that is a success! I think our goals for automation are wrong, and the areas where there is great potential for innovation – such as model-based testing, tests that vary themselves with generated random data, and even tests which detect change and are partly manual but help a tester be more powerful – are discounted because of this one stupid idea that somehow human evaluation can be skipped.

    The automation can help testers do more. It is good at that. It only sucks at replacing testers because so far we aren’t good enough at programming machines to evaluate and test for us.

    Sorry, Alan, this is a big rant in your blog, but I get fired up about this stuff because the goals and motivations are harmful and they could be so much better!

  8. Preaching to the choir! My tests aren’t fail-proof, but I try to practice safe coding. That means it’s less of a headache to maintain, more modular for reusing functions, and more readable for someone who might want to update it.

    If I expect the developers to provide me with good code to test, I feel that I owe them good code to test it with. Now, do we take it as far as writing unit tests for our test scripts? Can I get a QA crew to QA QA?
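
As mentioned in comment 1, here’s a rough sketch of the “prove my checks can fail” idea. The titles_match helper is a hypothetical stand-in for whatever verification code your framework actually provides; the point is to feed it input that must fail and confirm it really does report a failure:

```python
import unittest

def titles_match(actual, expected):
    # Hypothetical verification helper from an imaginary automation framework.
    return actual.strip().lower() == expected.strip().lower()

class TitlesMatchSelfTests(unittest.TestCase):
    def test_detects_a_mismatch(self):
        # Deliberately mismatched input: if the helper wrongly reports a match,
        # this self-test fails and exposes a check that could never catch a bug.
        self.assertFalse(titles_match("Wrong Title", "Expected Title"))

    def test_accepts_a_match(self):
        self.assertTrue(titles_match("  Expected Title ", "expected title"))

if __name__ == "__main__":
    unittest.main()
```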
