Conflicting Results

I’m a huge soccer fan, and I’m happily following the MLS Cup even though the local team was eliminated last week. Last night’s match between Real Salt Lake (RSL) and the Chicago Fire went to penalty kicks before one team finally prevailed. After the game ended, I went to mlsnet.com to watch the highlights and check out some of the stats. When I got there, the front page had this headline and teaser:

[Image: mlsnet.com headline and teaser showing conflicting results]

Quick – which team won? Did the Fire edge Real Salt Lake, or did RSL outlast the Fire?

If you read a bit more, you’ll see that “RSL will face the Galaxy in the 2009 MLS Cup”, so if you go with majority rules you’ll be correct, since RSL did indeed edge the Fire last night. Headline errors aren’t all that uncommon (e.g. Dewey Defeats Truman), so I don’t fault the news site at all. Unfortunately, a very close relative of this kind of error, the false positive, has been bugging the crap out of me lately, and this headline reminded me that it’s past time to share my thoughts.

Let’s say you have 10,000 automated tests (or checks for those of you who speak Boltonese). We had a million or so on a medium-sized project I was involved with once, so 10k seems like a fair enough sample size for this example. For the purpose of this example, let’s say that 98% of the tests are currently passing, and 2% (or 200 tests) are failing. This, of course, doesn’t mean you have 200 product bugs. Chances are that many of these failures are caused by the same product bug (and hopefully you have a way of discovering this automatically, because investigating even 200 failures manually is about as exciting as picking lint off of astroturf). Buried in those 200 failures are false positives – tests that fail due to bugs in the test rather than bugs in the product. I’ll be nice and say that 5% of the failures are false positives (you’re welcome to do your own math on this one). Now we’re down to 10 failures that aren’t really failures. You may be thinking that’s not too big of a deal – it’s only 0.1% of the total tests, and looking at 10 tests a bit closer to see what’s going on is definitely worth the overall sacrifice in test code quality. Testers in this situation either just ignore these test results or quickly patch them without too much further thought.

This worries me to no end. If 5% of your failing tests aren’t really failing, I think it’s fair to say that 5% of your passing tests aren’t really passing. I doubt that you (or the rest of the testers on your team) are capable of making mistakes only in the failing tests – you have crappy test code everywhere. A minute ago, you may have been ok with only 10 false positives out of 10k tests, but I also think that 490 of your “passing” tests are passing even though they should be failing. Now feel free to add zeroes if you have more automated tests. I also challenge you to examine all 9,800 passing tests to see which 490 are the “broken” ones.
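To make the arithmetic concrete, here’s a quick back-of-the-envelope version of the numbers above (plain Python; the 5% rate of buggy tests is the same assumption, applied to both the failing and the passing buckets):

```python
# Back-of-the-envelope math for the 10,000-test example above.
total_tests = 10_000
failing = int(total_tests * 0.02)      # 200 failing tests
passing = total_tests - failing        # 9,800 passing tests

test_bug_rate = 0.05                   # assume 5% of tests are themselves buggy
false_positives = int(failing * test_bug_rate)   # 10 "failures" caused by the test, not the product
false_passes = int(passing * test_bug_rate)      # 490 "passes" that should be failing

print(f"Failures that aren't product bugs: {false_positives}")
print(f"Passes hiding product bugs:        {false_passes}")
```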

Yet we (testers) continue to write fragile automation. I’ve heard quotes like, “It’s not product code – why should it be good?”, or “We don’t have time to write good tests”, or “We don’t ship tests, so we can’t make them as high quality as shipping code”. So, we deal with false positives, ignore the inverse problem, and bury our heads in the sand rather than write quality tests in the first place. In my opinion, it’s beyond idiotic – we’re wasting time, we’re wasting money, and we’re breeding the wrong habits in every tester who thinks of writing automation.

But I remain curious. Are my observations consistent with what you see? Please convince me that I shouldn’t be as worried (and angry) as I am about this.

What I Do

When I meet new people, they often ask, “what do you do?” The answer I give initially, and the one I hope to get away with is, “I work at Microsoft.”

It rarely works. They inevitably follow up with, “what do you do there?” – which, for better or for worse, is a much more difficult question to answer. Depending on their technical knowledge (and my mood), I’ll say something between “I work on a team that does technical training, internal consulting, and cross-company projects for engineers”, “I’m the Director of Test Excellence”, and “I stop people from being stupid”. It was a much easier question to answer when I worked on a product team, but I like the job, so I’ll deal with the moments of awkwardness.

I thought I’d write down a longer answer for those who are curious (or want to help me with a better definition).

The biggest thing I’m working on this year is helping engineers across the company have a common concept of software quality. This includes working with marketing on customer perception of quality, plus a lot of talking with people from around the company to see which practices are common and to discover practices that should be shared more widely. It’s a hard problem to solve (and there are a lot more pieces to it), but it’s a fun challenge.

I’m also working on a variety of small projects to increase collaboration among testers and other engineers at the company. With nearly 10,000 testers, there’s not nearly enough sharing of ideas, practices or innovation among people solving problems that are likely much more similar than people realize. Every time I see a duplication of effort or the same question asked on a distribution list for the 3rd time in a month I’m reminded of how much more work there is to do in this area.

Edit: I forgot a big chunk worth adding

A reasonably sized chunk of our organization’s work is technical training for engineers at Microsoft. My team teaches some classes, but I work with vendors to teach and design a fair number of test-related courses worldwide. I also own scheduling and prioritization of technical courses for what we call MSUS (shorthand for all MS engineers in the US outside of Redmond). It’s sort of a thankless job, but it needs to be done and I don’t mind doing it.

The bulk of the rest of the time goes to what I call “being Alan”. I organize and schedule meetings for our company-wide test leadership team and test architect group, and chair our quality and test experts community. I also function as chief-of-staff for my boss’s staff meetings (he attends the meeting on alternate weeks, and I take care of the agenda and flow of the meetings every week). I participate in a few virtual teams / micro-communities (e.g. cross-company test code quality initiatives or symposiums on testability). I’m on the board for sasqag, and put in a few hours a month keeping things alive on the msdn tester center. I give talks to internal teams a few times a month and mentor half-a-dozen testers in various roles around the company. Finally (and most importantly), I manage a team of 6 people who work on similar projects, as well as teach testing courses. It helps a lot that the team is so smart and so motivated, because I’m most likely not the best manager in the world.

Beyond all that, I spend probably too much time staying connected with what’s going on in the testing world outside of MS. I blog a few times a week, speak at a few conferences a year, and tweet once in a while. It can be a balance problem sometimes, but I think it’s important enough to make a significant effort to keep up.

There’s probably more, but I think that covers most of it. Now you know.

Stuff I Wrote

I just put together a collection of my published works (it’s not a long list). I also have an article coming out in a Korean testing magazine – I’ll see if I can get a link once it’s out.

I’ve been writing less lately while I turn my attention toward my often neglected day job. I have a few projects on the horizon, and I’ll add them to the list if (or as) they come to fruition.

It’s a Beautiful Day

I may have mentioned this on the old blog, but I’m pretty sure I haven’t mentioned it here yet. O’Reilly media recently released Beautiful Testing – a collection of essays from a variety of testing professionals (including yours truly).

[Image: Book cover of Beautiful Testing]

I received my copy over the weekend (much to the annoyance, I’m sure, of several other authors who are about to rebel against the empire if their copies don’t show up soon). I’m happy to have mine, and although I read the entire book in digital format, I’ve been flipping through it off and on for the last 3 days. I’m thrilled to be a part of it, but I have to tell you that I’m even more excited after reading it again and finally holding it in my hands. The variety of information, styles, and knowledge is fascinating – each essay opening up different possibilities and questions to ponder. It’s a fun read that I hope you check out. Best yet, the proceeds from the book all go to buying mosquito nets to help prevent malaria in Africa – what a great opportunity to get some practical testing advice and help out those in need!

Settling on Quality?

Oh my – another quality post. I’m afraid I’m starting a trend for myself, but I have a story to share.

As all gainfully employed workers in the tech field will tell you, we all have side jobs as tech support for all members of our immediate and extended families. This weekend, my mother-in-law opened a support ticket with me regarding her laptop – it was crashing randomly (that’s all the details you get when your m-i-l opens a support ticket).

So – I turned on her laptop, let it boot, then dealt with message after message from applications starting up and telling me stuff I didn’t care about. A backup program telling me that it needed a product key, an external hard drive utility telling me the drive wasn’t connected (duh), and an OEM replacement for the Windows wireless config launching to tell me I’m connected to a wireless network. The experience was annoying. But there’s a bigger problem.

As I was looking at the 3 different web browsers installed and the few dozen or so other random programs and utilities, my first thought was “no wonder she’s having computer problems – she’s installed every app under the sun”. I always try to keep my main work machines somewhat “clean” – only installing applications I consider tried and true for fear that anything else will mess something up. Then I realized that’s wrong – I should be able to install whatever the hell I want without fear of losing overall quality (who knows – maybe I can and it’s all a mental problem on my end). The point is that we (computer users) don’t seem to expect software to work. We’re not as surprised, alarmed, or pissed off as we should be when software doesn’t work correctly. Honestly – I’ve belittled people in the past for calling things bugs when they’re 99.99% user error, but I was wrong – user error or not, that .01% matters.

Ok, so software sucks. It really doesn’t matter – it’s still a profitable industry. That’s true, but I wonder how long it will be true. I wonder if something horrible (even worse than Windows ME :}) has to happen before the world demands higher quality software. My hope is that we can start making better software long before something like that happens.

Oh – as far as my mother-in-law’s computer goes, there was a crash dump on the machine. I attached a debugger and poked around a crash in the wireless driver. I put a later rev of the driver on the machine and so far, so good. I hope it stays that way…for at least a little while.

Finding Quality

Leave it to Adam Goucher to beat me to the punch. When I proposed that breaking your definition of quality down into a manageable set of ilities is a reasonable method for improving customer-perceived quality, the logical next step was to figure out which of the ilities you need to care about. Adam’s suggestion was:

Want to improve v2? Talk to customers of v1 and ask which of the ilities they suffer the most from. And/or talk to people who didn’t buy your software and ask them which ility chased them away. Those are the ones that count.

Perfect answer, but keep in mind that the questions you ask are critical. You can’t ask “do you want the product to be more reliable”, or even “from this list, choose the one you care about the most”. The former question will always result in a “yes” answer, and the second will probably just result in confusion. Instead, you can ask questions like “tell me what you like most about the software (or what you dislike the most)”. Ask open-ended questions and take notes – take lots of notes. After you’ve talked to a good sample of customers, break the notes into individual comments and start sticking them on a wall. Look for affinity – start grouping items and looking for themes. Then, see if an ility aligns with a theme. Eventually, you’ll have a bunch of big fat quality bulls-eyes on the wall waiting for you to address.
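If you want a rough first pass before the sticky notes go up, a tally like the sketch below can help; the keyword-to-ility map and the sample comments are entirely made up for illustration:

```python
from collections import Counter

# Hypothetical mapping from phrases in customer comments to ilities.
ILITY_KEYWORDS = {
    "reliability": ["crash", "hang", "lost my work", "freezes"],
    "usability": ["confusing", "couldn't find", "too many clicks"],
    "performance": ["slow", "takes forever", "lags"],
    "compatibility": ["doesn't work with", "older version", "other browser"],
}

def tally_ilities(comments):
    """Count how many raw interview comments touch each ility."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for ility, keywords in ILITY_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                counts[ility] += 1
    return counts

# Made-up notes standing in for real interview transcripts.
notes = [
    "The app crashes when I import large files",
    "Export is confusing -- I couldn't find the PDF option",
    "Search takes forever on my laptop",
]
print(tally_ilities(notes).most_common())
```

It won’t replace reading the notes yourself, but it’s a cheap way to see which themes keep coming up.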

I have one minor nit where Adam missed the mark – you don’t have to wait until v1 is out to collect this data. If your software team is worth their salt, they’ve defined the customer segments they care about long before v1 hits the street. Interview customers from those segments and ask them questions like “this product does foo, what do you expect a high quality product that does foo to do?”, or “what would make you want to use a product like this?” or “what would make a product that does foo unusable for you?”

The fun part is that I’ve sort of done this (in a very general way), and have a list (that certainly won’t work for every piece of software in the world, but is worth discussing). I haven’t yet figured out how to push the dial on these ilities, but that’s what I’m going to try and figure out – using this blog as a sounding board while I think.

In Search of Quality

If you ever want to start a great discussion / debate with a group of testers, ask them to define quality and come up with measurements. “Quality” is such an overloaded term that it’s hard to get people to agree. The Weinberg fans (I suppose I’m one of them) will cite their mentor and say “Quality is Value (to some person)”. To that, I say fine – define Value and tell me how to measure that! Most of the time, I think measuring quality is similar to how Potter Stewart defined pornography when he said “it’s hard to define, but I know it when I see it”. I’ll admit that in many situations, the value of gut-feel and hunches about quality outweighs some of the quantitative attempts some organizations use. Unfortunately, I see many quality “experts” throw the baby out with the bathwater and dismiss quantitative metrics simply because they’re easy to get wrong.

If you ask most teams how they measure quality, they’ll probably tell you they measure code coverage, count test cases, track bugs found during different activities, and a number of other engineering practices. They’re not improving product quality – they’re improving engineering quality. Now, improving engineering quality is a good thing too – it does a lot to decrease risk and increase confidence, but it does diddly-squat to improve the customer’s perception of quality. So here’s a conundrum – how do you measure perception before you can get someone to perceive it? One way is to define scenarios, and attempt to use the software like we think customers will use it, all the while noting where it’s difficult or where something goes wrong, then working to get those issues fixed. In the end, we still cross our fingers and hope they’ll like what we gave them.

But I’m wondering if there’s a better way to make software our customers like (and perceive to be of high quality). Wikipedia has a great list of the ilities – attributes that lead to system quality – but the list is huge. If you attempt to improve that many items at once, you may as well work on nothing at all. But suppose you knew which of those ilities were most important to your most important customer segments. My hypothesis is that if you focus on pushing the bar on a small set of quality attributes, customer perception of quality will improve. It’s not easy (again – why some people just give up completely on metrics), but I think it can work.

Think of this scenario: You’re leading a team on v2 of their software. You ask them to “improve quality”. If their heads don’t spin and explode, they may ask to hire more testers or make plans to increase code coverage, or they may just ignore you because quality is too hard to measure. Or, you could tell your team they need to focus on reliability, usability, and compatibility (areas you know your customers care most deeply about). You provide them with checklists and other aids to help them think about these areas more deeply as they work on v2. You may even come up with some measurements or models that show how much the team is improving in those areas. I’m fairly confident one of those approaches will lead to quality software 99 times out of 100.

I’ll dig into some of my favorite ilities and speculate how to improve them in future posts.

Ed: added the rest of the Weinberg quote because so many people were annoyed I left it out.

Welcome

I’ve been blogging for nearly 5 years now. When I first started, I didn’t think I wanted to be a blogger – I just wanted a place to interact with customers. I quickly realized that I liked writing, so I started studying the craft and used blogging to practice it.

Five years later, I’ve written half a dozen magazine articles, most of one book, a chapter of another, and hundreds of blog posts. Now it’s time for something new.

Today, I’m not exactly sure what’s going to be new, but I’ve known for a while that I wanted to move my blog away from the msdn hosted blogs and onto a new site. I’ve owned angryweasel.com for close to a decade now, and decided that it was time to use it (you can find the story about the name here – or at least part of it).

The content will remain the same – mostly, at least – but I feel a bit more free moving my thoughts, notes, and ideas to my own little island on the web.

More to come

Why bugs don’t get fixed

I’ve run into more and more people lately who are astounded that software ships with known bugs. I’m frightened that many of these people are software testers and should know better. First, read this “old” (but good) article from Eric Sink. I doubt I have much to add, but I’ll try.

Many bugs aren’t worth fixing. “What kind of tester are you”, I can hear you shout, “Testers are the champions of quality for the customer!” I’ll repeat myself again (redundantly if I need to …) Many bugs aren’t worth fixing. I’ll tell you why. To fix most bugs, you need to change code. Changing code requires both resources (time), and it introduces risk. It sucks, but it’s true. Sometimes, the risk and investment just aren’t worth it, so bugs don’t get fixed.

The decision to fix or not to fix isn’t (or shouldn’t be) entirely hunch based. I like using the concept of user pain to help make this decision. There are 3 key factors I consider to determine user pain. These are:

  1. Severity – what’s the impact of the bug – does it crash the program? Does the customer lose data? Or is it less severe? Is there an easy workaround? Is it just a cosmetic issue?
  2. Frequency – how often will users hit this issue? Is it part of the main flow of the program, or is the issue hidden in an obscure feature? Minor issues in mainline scenarios may need to be fixed, but ugly stuff in an obscure feature may slide.
  3. Customers Impacted – if you’ve done your work up front, you have an idea of who your customers are, and an idea of how many users are in (or how many you would like to be in) each of your customer segments. From there, you need to determine if the issue will be hit by every user, or just a subset. If you have the ability to track how customers are using your product you can get more accurate data here.

From here, make up a formula. Assign a value scale to each of the above and apply some math – you can do straight addition, multiplication, or add weights based on your application and market. For our purposes, let’s just add and use a 10-point scale for each factor :}.
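As a sketch, the straight-addition version (with optional weights, since your application and market may value the factors differently) is tiny; the function name and the sample call are just illustrative:

```python
def user_pain(severity, frequency, customers_impacted, weights=(1, 1, 1)):
    """User-pain score: each factor on a 10-point scale, optionally weighted.
    With the default weights this is the straight addition used in the
    examples that follow."""
    w_sev, w_freq, w_cust = weights
    return severity * w_sev + frequency * w_freq + customers_impacted * w_cust

# A crashing bug (10) in a mainline scenario (10) hitting 80% of a segment (8):
print(user_pain(10, 10, 8))  # -> 28
```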

Bug #1, for example, is a crashing bug (10pts) in a mainline scenario (10pts) impacting 80% of the customer segment (8pts). At 28pts on the user pain scale, I bet we’re going to fix this one.

Bug #2 is an alignment issue (2pts) in a secondary window (2pts) in an area used by a few “legacy” users (2pts). At 6 pts, this is a likely candidate to not get fixed.

Unfortunately, they’re not all that easy. Bug #3 is a data loss bug (10pts). It occurs in one of the main parts of the application, but only under certain circumstances (5pts) (btw – numbers are completely made up and subjective). Customer research shows that it’s hardly ever used (2pts). At 17 pts, this one could go either way. On one hand, it’s probably not worth the investment to fix. As long as the issue is understood, and there are no blind spots, leaving the bug in place is probably the right thing to do.

On the other hand, you have to weigh this with the rest of the bugs in the system. The Broken Window theory applies here – if there are too many of these medium-threshold bugs in the app, quality (or at the very least, the perception of quality) will suffer. You need to consider every bug in the context of the rest of the (known) bugs in the system and use this knowledge to figure out where the line is between what gets fixed and what doesn’t.

It sucks that the industry ships software with known bugs – but given the development tools and languages we have today, there isn’t a sensible alternative.

Edit:

As this sits in my head, I think I’ve missed a fourth factor in the formula: Ship Date. The proximity of the ship date plays into the fix/don’t-fix decision as much as the factors above. I’m not sure, however, whether it’s a fourth factor in the math, or whether the threshold of user pain that turns into a bug fix simply changes as the ship date approaches.

Who Owns Quality?

On request from Adam Goucher – another excerpt from How We Test Software at Microsoft.  BTW – Adam wrote a review of HWTSAM here – although Linda Wilkinson beat him to the clever title.

This is from a section on quality in chapter 16. It’s something I believe strongly in and would love to hear your comments.

Many years ago when I would ask the question, “who owns quality,” the answer would nearly always be “The test team owns quality.” Today, when I ask this question, the answer is customarily “Everyone owns quality.” While this may be a better answer to some, W. Mark Manduke of SEI has written: “When quality is declared to be everyone’s responsibility, no one is truly designated to be responsible for it, and quality issues fade into the chaos of the crisis du jour.” He concluded that “…when management truly commits to a quality culture, everyone will, indeed, be responsible for quality.”[1] A system where everyone truly owns quality requires a culture of quality. Without such a culture, all teams will make sacrifices against quality. Development teams may skip code reviews to save time, program management may cut corners on a specification, or fudge a definition of “done”, and test teams may change their goals on test pass or coverage rates deep in the product cycle. Despite many efforts to put quality assurance processes into place, it is a common practice among engineering teams to make exceptions in quality practices to meet deadlines or other goals. While it’s certainly important to be flexible in order to meet ship dates or other deadlines, quality often suffers because of a lack of a true quality owner.

Entire test teams may own facets of quality assurance, but they are rarely in the best position to champion or influence the adoption of a quality culture. Senior managers could be the quality champion, but their focus is justly on the business of managing the team, shipping the product, and running a successful business. While they may have quality goals in mind, they are rarely the champion for a culture of quality. Management leadership teams (typically the organization leaders of Development, Test, and Program Management) bear the weight of quality ownership for most teams. These leaders own and drive the engineering processes for the team, and are in the prime organizational position for evaluating, assessing, and implementing quality-based engineering practices. Unfortunately, it seems that quality software and quality software engineering practices are rarely their chief concerns throughout any product engineering cycle.

Senior management support for a quality culture isn’t entirely enough. In a quality culture, every employee can have an impact on quality. Many of the most important quality improvements in manufacturing have come from suggestions by the workers. In the auto industry, for example, the average Japanese autoworker provides 28 suggestions per year, and 80% of those suggestions are implemented[2].

Ideally, engineers from all disciplines within Microsoft are making suggestions to improve quality. Where a team does not have a culture of quality, the suggestions are few, and precious few of those are implemented. Cultural apathy for quality will then lead to other challenges with passion and commitment among team members.


[1] STQE Magazine. Nov/Dec 2003 (Vol. 5, Issue 6)

[2] The Visionary Leader, Wall, Solum, and Sobul