Yesterday, I read a mail sent to an email alias I’m on, where the author was asking why tool X wasn’t enabled on his latest build. The mail looked something like this (genericized to protect the innocent).
foo.service doesn’t appear to be working
- I installed the build from <build_path>
- I verified the binaries existed <where they should exist>
- I queried to see if foo.service was running – it wasn’t
- I queried another way, and it didn’t show anything either
- When I run my tests, they fail and tell me that the service isn’t running
- I thought the service wasn’t started, so I tried starting it, but that also failed (error text: Failed to retrieve the fizzbaz from the bogatorium)
At first glance, it looks like there’s a real problem here. I try to avoid the “works on my machine” comments, but I thought it was strange that nobody else had seen this. I assumed the Problem was Between the Keyboard And the Chair (PBKAC).
At first, that seemed to be exactly the problem. You see, there is no foo.service. It’s actually called fizz.service, but it’s run as part of the foo toolset, so it’s an easy misunderstanding. If they had queried for fizz.service, they would have seen it happily running.
Their tests failed because they had an invalid command line for their test (or more specifically, they specified an invalid address for the machine where they wanted to run the tests).
And then they got that strange error message when attempting to start the service because they were trying to start a service that was already running (interestingly, they did manage to try to start the fizz service rather than the foo service at this point).
Three errors, all different, yet relatively easy to see as symptoms of a common problem. Except they weren’t.
Absolutely, definitely, beyond a shadow of a doubt, PBKAC.
But maybe not. Actually, definitely not. The princess is in another castle – or between a different keyboard and chair.
PBKAC was definitely in play when the user tried to query for the wrong service. My service names above are silly, but our service names are actually nearly as confusing. You only make this mistake once, but it’s not too difficult to make.
When the tests failed, the error said the service wasn’t running. The error was 100% accurate (the service wasn’t running because the user connected to the wrong machine). What if, as a courtesy, that error message said something like, “The service isn’t running. Are you connected to a valid target machine?” I bet that would have set off a few light bulbs rather than generated more confusion.
That message about the fizzbaz and the bogatorium isn’t far from what the user actually saw. What they saw was, again, a 100% accurate statement of what went wrong – it just gave no clue about what may have caused it.
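To make the idea concrete, here is a minimal sketch in Python of error messages that state the fact and then suggest a likely cause. Everything in it is made up for illustration: the function names, the `known_targets` set, and the `already_running` flag are assumptions, not how our actual tools work.

```python
def service_not_running_message(service, target, known_targets):
    """State what went wrong, then suggest a likely cause when one is detectable."""
    message = f"{service} isn't running on {target}."
    # Hypothetical check: the tool knows which target machines are valid.
    if target not in known_targets:
        message += " Are you connected to a valid target machine?"
    return message


def start_failure_message(service, already_running):
    """Explain a failed start instead of surfacing only the low-level error."""
    message = f"Failed to start {service}."
    # Hypothetical flag: the tool checked the service state before reporting.
    if already_running:
        message += " It is already running; did you mean to restart it?"
    return message


if __name__ == "__main__":
    print(service_not_running_message("fizz.service", "badhost", {"lab-01"}))
    print(start_failure_message("fizz.service", already_running=True))
```

The messages are still 100% accurate. They just spend one extra sentence pointing the user at the most probable mistake.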
Granted, these were internal tools, but that’s a horrible excuse to confuse someone. I bet I’ve seen hundreds (if not thousands) of error messages like this – but very few that offer even the tiniest bit of troubleshooting or diagnostic advice. It’s easy to blame the user when they do something “wrong” (or unexpected), but ultimately, it’s rarely their fault. And in the end, if they can’t do what they want to do with your software, they’ll take their business and money elsewhere.
And then, it’s your problem alone.