(Note: We can skip over this tedious nonsense. It’s only here for illustration.)
Root factors causing a single bug to be caught in QA instead of Dev, by another team instead of my own:
- Browse Error: Was Inadvertently Raised in Severity
  - Functional bug is introduced (my bad)
  - But it’s present in a separate area from the active work
- Handling Discrepancy: Data-page errors differ by context
  - Caused unit tests to pass in a single run but fail in the suite
  - This was superficially identical in behavior to a past issue
- Past Issue: Unit test discrepancies encountered previously
  - New unit test failures are written off as sporadic, pre-existing platform issues
  - Bug escapes from Dev into QA (my bad)
- Automation Frequency: Too Low
  - Not captured in QA until encountered by another team
- Configuration Toggle: Not Set
  - Because it was disabled, no messages were captured (my bad)
- Legacy Configs: Updated Without Notice
  - Added confusion into the mix, with other unrelated failures thrown in our test suite (keep your test suites clean, kids)
The Bug ⇔ PI ∧ AF ∧ CT ∧ LC ∧ BE ∧ HD
This is about as concise as you can possibly be, as far as describing what we could call “The Bug.” Six complex primary factors, where taking away any one of them would have stopped The Bug from “escaping” our team – which is our working definition of error.
Of course, this kind of overview can only be established after the fact, only after you’ve gone through all the steps of troubleshooting, investigating as many nodes of information as you can manage in order to hunt down the root kernels of the problem, the meaningful nodes.
When you’re just starting out, you can’t possibly predict how complex the underlying problem truly is, or how long it will take you to resolve it.
All in all, that process took us three full days of work to sort everything out.
… and it was not really a very significant or meaningful bug, standard stuff.
Did it take us a “long” time to resolve The Bug?
How do we even answer that question?
Could we have given a meaningful answer at the outset?
Could we have derived a meaningful estimate of The Bug’s “time to resolution?”
Management sure seems to like having that question answered, but firm conclusions are hard to come by.
And while I’m not certain, I’m confident our team’s chaotic approach of “social annealing” performs much better than any deterministic alternative.
Let me explain.
What are the different ways we could answer this question? (Other than the “wild-ass guess” approach, even though it’s my personal favorite.)
One deterministic approach would be to evaluate the range of possible investigation lengths.
For instance, if we could establish the shortest and longest paths to confirming the six primary factors, we can build a scale on which we can say whether or not our team was truly “fast” or “slow.” In other words, if we know the upper and lower limit, we can provide exactly the kind of answer business folks would love to have whenever they ask “how long should this take?”
Sounds simple enough.
But, even assuming modest complexity for each of the six post-facto monoids of investigative uncertainty, it would have taken a deterministic approach roughly the age of several universes in order to evaluate all the different nodes of information and paths of investigation, thereby establishing the shortest path to those six “simple facts,” or “primary facts,” or “primitive factors.”
One way we can show this is to represent The Bug’s discovery phase as a directed graph, an investigation tree of sorts, where each node in the tree presents the investigator with a piece of information, as well as a choice of what to investigate next. (The first piece of information would obviously be encountering The Bug itself.)
We know there are at least six nodes in this tree, based on the primary factors above, but necessarily there would have to be others. There would be at least a handful of somewhat less meaningful nodes in the tree, weaker signals that only suggest a problem in a given location, rather than confirm it. These are the breadcrumbs that can either lead our investigator closer to the six prime factors, or that can lead them away and waste the investigator’s time.
Given this framing, the next question would be, “How many nodes of meaningful information were there to potentially be investigated?” Answering this would definitely be… impossible. But for the sake of illustration, we can assume an absurdly low number – say, 20. Assuming that each of our six primary factors was embedded among 20 non-primary factors, that puts the total number of unique investigation factors at 126 (6 primary + 6 × 20 non-primary = 126).
That seems manageable.
This is a vast oversimplification, of course. The real number of breadcrumbs within The Bug’s investigation tree would naturally be much, much, much higher. But still, for now we can use 126 as our conservative estimate, safely assuming that The Bug’s potential investigation tree is at least that complex.
Then, we can add another simplification, assuming only two choices to be made for every node of information encountered at every point in The Bug’s investigation path. That means that every meaningful item of information uncovered during the investigation suggests at most two other nodes that should be investigated next, and that the order of gathering the information is meaningful (i.e. there is a more optimal investigation path depending on the investigator’s ability to make the right choices for each breadcrumb followed).
Note: Since the order of gathering information is meaningful, it’s possible for the same piece of information to recur as multiple nodes in the tree. This makes sense in The Bug’s investigation context if, for example, something might not look out of place until the second time it’s encountered, perhaps only after you realize the configuration behind it is off, or something roughly similar.
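The tree framing above can be sketched as a tiny data structure. Everything here – the class name, the toy clues – is invented purely for illustration; the real investigation obviously wasn’t encoded this way:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class InfoNode:
    """One piece of information uncovered during the investigation."""
    clue: str                 # what the investigator observes at this node
    is_primary: bool = False  # True if this is one of the six primary factors
    # At most two follow-up leads, per the two-choices-per-node simplification.
    # The same clue may recur deeper in the tree, since order of discovery matters.
    choices: Tuple[Optional["InfoNode"], Optional["InfoNode"]] = (None, None)

# A toy fragment of the tree, rooted at encountering The Bug itself:
toggle = InfoNode("configuration toggle is not set", is_primary=True)
flaky = InfoNode("unit tests fail only when run as part of the full suite")
root = InfoNode("The Bug is reported by another team", choices=(flaky, toggle))
```

Each node the investigator visits either confirms a primary factor or just hands them another fork in the road.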
Two choices per node. 126 nodes.
Putting these two over-simplifications in place, we can arrive at a lower limit for the complexity of The Bug’s investigation tree, but… its implication is that there would be at least
2¹²⁶ − 1
possible paths of troubleshooting to consider before completing the deterministic evaluation of the range of paths, which is what we’d need in order to determine a precise value for the expected investigation length.
And again, if you have more than two choices at each node of the investigation… watch out. And we haven’t even considered the “investigation cost” of each unit path being traversed. Yeesh.
Even assuming a deterministic computer with a processing rate of 1,000,000 nodes of information per second (pretty damn fast), it would still take
~ 8.5 × 10³¹
seconds to completely evaluate the possible paths of troubleshooting with the necessary degree of certainty.
In other words, it would still take that computer just over 2.7 × 10²⁴ years to determine the best, fastest path to resolution.
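The arithmetic is easy to check. A quick sketch, using only the constants from the text:

```python
# Number of unique investigation factors: each of the 6 primary factors
# sits among 20 non-primary ones, plus the 6 primaries themselves.
nodes = 6 * 20 + 6                      # 126

# Lower bound on distinct troubleshooting paths, two choices per node.
paths = 2**nodes - 1                    # ~ 8.5e37 paths

rate = 1_000_000                        # nodes of information per second
seconds = paths / rate                  # ~ 8.5e31 seconds
years = seconds / (60 * 60 * 24 * 365)  # ~ 2.7e24 years
```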
And while my team is also pretty damn fast, we process info… a lot slower than a million units per second.
(A random sampling or a genetic algorithm to find an average path length would perform better, of course. You could also recast the investigation as a Bernoulli multi-armed bandit, which would give you another non-deterministic approach. But of course, none of these methods would give the exact answer.)
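To make the random-sampling idea concrete, here’s a minimal Monte Carlo sketch. It assumes an idealized tree where each breadcrumb either confirms a primary factor or just leads to further choices – the probability below is invented for illustration (6 primaries out of the 126 factors), not measured from any real investigation:

```python
import random

def sample_path_length(p_primary=6 / 126, max_steps=100_000, rng=random):
    """Random walk down the investigation tree: at each node, the
    investigator either confirms one of the six primary factors (with
    the invented probability p_primary) or follows another lead."""
    confirmed, steps = 0, 0
    while confirmed < 6 and steps < max_steps:
        steps += 1
        if rng.random() < p_primary:
            confirmed += 1  # one more primary factor nailed down
        # otherwise: pick one of the two next leads and keep digging
    return steps

random.seed(0)
lengths = [sample_path_length() for _ in range(1_000)]
avg = sum(lengths) / len(lengths)  # averages near 126 steps under these assumptions
```

A thousand simulated investigations take milliseconds, versus the 10²⁴-year deterministic sweep – the catch being, as noted, that you only get an estimate of the average path, never the exact shortest one.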
This is, in a certain sense, just the basic math of evaluating uncertainties, or really any complex processes with uncertain outcomes.
Those who do any kind of investigative work will sympathize.
It’s also the same reason that most types of progress bar are an outright lie.
There is just no realistic way to predict complex outcomes with that level of confidence; and in those scenarios, such information cannot be extracted no matter how severe the demand.
So… how long should it take?
We’ll tell you as soon as we know.
The fastest way to answer the question is almost always to just solve the problem itself.
Or… guess, if you have to.
Three days doesn’t seem so bad in that light, but of course – tell that to management 😀
(Also my management is awesome and didn’t really care at all. Get you an employer like that.)
(Update, days later: Whelp… we subsequently encountered a latent bug that we could only see after getting this one corrected, so you could meaningfully say that The Bug itself was only ever really a single node in the investigation path of an even larger, more complex “container bug.”)
At the end of the day, I guess there must be some kind of real value baked into all these human hunches of ours.
We solve these kinds of computationally immense problems all the time – pretty much whenever at least two humans team up together.
In other words, groups of human brains can do some pretty impressive stuff.
When grouped together, human mindsets are the most complex objects in the known universe.
Maybe we should start treating them that way.
Maybe we should tremble before the cosmic might inherent to any given group, any given team, any given coalition, any given mindset.
Maybe we should show some damn respect.
— yoav golan