Making metrics work
This is part three in a short series on the nature of metrics generally and within software development particularly.
In the previous article we outlined three fundamental characteristics of bad metrics: Goodhart’s problem, Drucker’s problem, and the context problem. There are also simple practical problems, and we’ll address those here too.
"When a measure becomes a target, it ceases to be a good measure. And the target no longer means what you think it does."
Goodhart's Law (restated)
So don’t make it a target and it won’t be a problem. Right?
Easy to understand but hard to implement: Goodhart’s problem is essentially one of trust and discipline.
We know from simple experience that this is exceptionally hard to solve. We've already cited a number of high-profile examples*, but it’s damn near ubiquitous in any organisation of even moderate size. Trust just doesn’t scale very well. The reasons for that are surprisingly deep-rooted in our evolution and culture, and well beyond the scope of this series.
So, some practical steps.
First, try to create the conditions for trust. If you aren’t there already, this generally means some amount of organisational change, both structural and cultural - which is why the solution is hard. But if you can… make smaller, more autonomous teams. Distribute decision making. Connect people to both higher and lower objectives. Disconnect team funding from project funding. Remove project funding models entirely if you can!
A model for establishing trust? Source: John Cutler, via Twitter
Second, avoid “giving” metrics. Instead make teams responsible for conceiving and owning the metrics for the product and services they are trying to implement. A team feels less temptation to game a metric if it belongs to them and not a manager. Again, this will probably mean organisational change because to do this effectively a team needs to include the research and data and design people as well as engineers, and start solving the whole problem. It will also likely mean education on metrics - they can read this series for a start!
Third, accept the possibility of failure and don’t play blame games. Critically that means dissociating business metrics from individual performance assessments. Anything else means you are immediately pressuring the metric with what are essentially survival or status concerns (such as inferred or actual threats to authority, reputation, employment, etc). Don’t back people into corners.
Basecamp’s Shape Up and Amplitude’s North Star models are excellent reference points for how these three factors can work operationally, but it’s up to every organisation to get itself to a place where these can work culturally.
Of course these are just best-laid schemes, so where possible it is valuable to look for counter-balancing metrics: other metrics that should flag red if the first is being gamed, and vice versa. The classic call-centre example is caller queue times being balanced by resolution rates. Even in high-trust environments it’s always worth policing yourself for honesty, because the slope is slippery.
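The call-centre pairing can be sketched as a simple check that reads both metrics together, so that one going suspiciously green while its counterpart goes red raises a flag. The function name and thresholds here are invented for illustration:

```python
# Counter-balancing metrics: each metric is paired with another that
# should flag red if the first is being gamed. Thresholds are illustrative.

def check_counterbalance(avg_queue_seconds: float, resolution_rate: float) -> str:
    """Flag when a 'good' queue time is achieved at the cost of resolutions."""
    queue_ok = avg_queue_seconds <= 120        # target: answer within 2 minutes
    resolution_ok = resolution_rate >= 0.80    # target: 80% of calls resolved
    if queue_ok and not resolution_ok:
        # Queue times look great but problems aren't being solved --
        # the classic signature of a gamed metric.
        return "suspect: queue time may be gamed"
    if not queue_ok and resolution_ok:
        return "suspect: resolutions may be gamed"
    # At this point both are OK or both are failing.
    return "healthy" if queue_ok else "red: both metrics failing"

print(check_counterbalance(90, 0.62))   # fast answers, unresolved calls
print(check_counterbalance(100, 0.91))  # genuinely healthy
```

The point is not the thresholds but the pairing: neither metric is read in isolation.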
Solving Goodhart’s problem “simply” means removing the sources of those pressures. In any competitive space there will naturally be a continuous temptation to reassert them in the name of stimulating maximal effort - especially where those higher up (which really means further away) in the hierarchy have a nebulous feeling of threat or uncertainty, or are themselves pressured by the ultimate metrics for an organisation: share prices or vote shares. Pressure tends to cascade down like that.
Nonetheless, Goodhart’s law tells us that pressure will not only fail but corrupt the system, making the situation worse, not better. Terminal on long enough timelines.
At some point empathy and courage are mandatory to resist this. Have strength; trust people.
"What gets measured gets managed — even when it’s pointless to measure and manage it, and even if it harms the purpose of the organisation to do so."
Drucker’s maxim (restated)
Just don’t manage what you shouldn’t, and there’s no problem. Or alternatively: don't start none, won't be none. Harder to understand but fortunately easier to implement: Drucker’s problem is essentially one of focus.
So what shouldn’t you manage? You shouldn’t manage what you have no plan for.
What's the question?
Data is everywhere, and it is tempting because evolution favoured those who could pick up the signals and infer the patterns indicating that maybe a toothy thing in the bushes was about to eat you. Evolutionarily favourable even if you were wrong: the cost of a false positive was maybe some lost opportunity, but the cost of a false negative was no more opportunities for you, ever again.
This naturally selected mode of processing information is inductive reasoning - gather some data, then create a best-fit theory. Inductive reasoning is by no means inherently bad; as a species we survived the toothy things, for example, and it’s the basis of the scientific method. But when it comes to a world of easy data, Drucker is warning us that the approach is harmful - and indeed philosophy formally warns us about the limits of induction too (the problem of induction).
So let’s try and apply the opposite approach, that of deductive reasoning. Here we start with a theory and then validate with observation. Now we’re not trying to actually build a deductive model of the world here, but the approach forces us to consider intent.
Start by considering what question you are asking the system. What would indicate an answer to that question? Is it positive or negative feedback? Think of your metrics as these questions.
At this point you might have a few different candidate metrics, but you’ve already excluded everything else. Now consider: what values of those metrics should trigger activity? And what would those reactions actually be? If you can’t conceive what you would actually do in the event of a “red” metric, then the metric is just reactive noise and you’re getting lost in Drucker’s maze.
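These questions, triggers and reactions can be written down explicitly, which also makes the noise easy to filter out. A minimal sketch, with invented metric names and thresholds:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Metric:
    question: str                      # what are we asking the system?
    trigger: Callable[[float], bool]   # what value turns this "red"?
    reaction: Optional[str]            # what will we actually do about it?

metrics = [
    Metric("Are new users reaching their first success quickly?",
           trigger=lambda minutes: minutes > 10,
           reaction="Review the onboarding flow with the design pair"),
    Metric("How many dashboard views did we get this week?",
           trigger=lambda views: views < 1000,
           reaction=None),  # no conceivable reaction: reactive noise
]

# Keep only the metrics we would genuinely act on; the rest is Drucker's maze.
actionable = [m for m in metrics if m.reaction is not None]
print(len(actionable))  # → 1
```

If the `reaction` field is honestly unfillable, the metric doesn’t survive the filter - which is exactly the discipline the deductive approach is meant to force.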
Predict the future, don't explain the past
Here we need to introduce the idea of leading and lagging indicators. Leading indicators tell you something about what is going to happen; lagging indicators tell you what has already happened. A classic example from personal health: your weight is a lagging indicator, while your calorie intake is a leading indicator.
Because they look at the past, lagging indicators are easy to measure, but of course it’s too late to do anything about them. Because they look to the future, leading indicators are harder to measure but far more actionable.
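The health example can be made concrete: this sketch projects the lagging indicator (weight) from the leading one (intake), using the commonly quoted rule of thumb of roughly 7,700 kcal of surplus per kilogram of body weight. The maintenance figure is invented for illustration:

```python
# Leading vs lagging: the calorie surplus (leading) lets us project the
# weight trend (lagging) before the scales confirm it.

DAILY_MAINTENANCE_KCAL = 2500   # hypothetical maintenance intake
KCAL_PER_KG = 7700              # rough rule-of-thumb conversion

def projected_weekly_weight_change(daily_intakes: list[float]) -> float:
    """Project weight change (kg/week) from recent intake -- actionable now,
    unlike the scale reading, which only reports what already happened."""
    avg_surplus = sum(daily_intakes) / len(daily_intakes) - DAILY_MAINTENANCE_KCAL
    return avg_surplus * 7 / KCAL_PER_KG

week = [2800, 2750, 2900, 2700, 2850, 2800, 2900]
print(round(projected_weekly_weight_change(week), 2))  # → 0.29 (kg/week)
```

The scale will eventually report the same story, but only the intake figure gives you something to act on today.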
If your metric has purpose, and it’s a leading indicator with clear potential triggers and reactions that you can articulate, then you have solved Drucker’s problem. Be cautious, though, not to fall into the McNamara fallacy. Deductive reasoning must challenge its premises continuously, and the best way to do that is with qualitative information, or again a counter-balancing metric.
Being context sensitive
There are actually (at least) two challenges here, both linked. One is being conscious of and sensitive to ever changing contexts. The other is recognising what your current context actually is and what to do about it.
The practice of sabermetrics developed in baseball (and made famous by the book and film Moneyball) wasn’t the introduction of metrics to the game. Almost uniquely, metrics had existed at the very centre of the game - in the form of the box score - for over 100 years before sabermetrics came about. No, it arose from a recognition of a context shift in the way professional baseball was played.
Recognising the shift. Source: Moneyball, Columbia Pictures
For example, batting average - the fraction of at-bats in which a player records a hit - dominates traditional offensive metrics. Makes sense: hitting safely more often is a positive indicator of good outcomes, right? Problematically, though, it loses the difference between a hit worth a single base and one producing a home run, and therefore the actual value of the hit. In the modern era players have increasingly focused on waiting for the right opportunity for the big hit, because getting the ball into the stands has no defence. While the value of the hits is going up (more home runs) there are fewer small hits, so the traditional metric indicates artificially poor performance.
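The information batting average discards is easy to see in code. Slugging percentage (total bases per at-bat) is a long-standing alternative that weights each hit by its value; the players and numbers below are invented:

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits per at-bat -- blind to whether a hit was a single or a home run."""
    return hits / at_bats

def slugging(singles: int, doubles: int, triples: int, homers: int,
             at_bats: int) -> float:
    """Total bases per at-bat -- weights each hit by its actual value."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    return total_bases / at_bats

# Two invented players, each with 30 hits in 100 at-bats: identical averages.
print(batting_average(30, 100))      # → 0.3 for both players
print(slugging(30, 0, 0, 0, 100))    # Player A: all singles    → 0.3
print(slugging(10, 0, 0, 20, 100))   # Player B: 20 home runs   → 0.9
```

Batting average rates the two players identically; slugging shows Player B producing three times the value per at-bat.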
Billy Beane of the Oakland Athletics achieved success by recognising the context shift and exploiting it with updated metrics while his peers were trapped in standard practice. Don’t get complacent. Challenge your metrics. Dump them and reinvent periodically, even if only to validate.
Dave Snowden’s Cynefin (pronounced kuh-NEV-in) framework is one of the best devices for understanding where you are now (and what to do about it). In a broad sense it’s most critical to understand when you are in the unordered, emergent domains, and when you are in ordered, linear domains. Thought-work activities (like software development) are almost always going to be emergent, as described in the last article.
But the picture can get muddied. Systems are composed of smaller sub-systems which have their own domain properties. Complex processes can and will have clear processes within them; activities within a domain needn’t be of the domain. Also, the function of management is to progress a system towards stability - towards clear domains - so over a long enough timeline a context should shift through the very actions within it.
This is a deep topic in itself. For the moment let it suffice that it is unlikely linear metrics will be valuable in software development. Indeed you are unlikely to be doing software development if you can use such metrics effectively.
You need to be mindful that systems are holistic, and consider the consequences of metrics on both more global and more local layers. These are largely going to be unexpected consequences, so the best way to understand them is simply to ask; over time, metrics developed at multiple tiers can provide objective insights.
Lastly, two simple mechanical traits: frequency and obtainability.
The rate of measurement of a metric must be at least as frequent as the rate we alter the system, or we will be unable to clearly discern cause and effect or react in a timely manner. In practice most metrics should be effectively “live”.
The measurement itself must be readily obtainable: without undue effort or calculation, and without unreliable access (e.g. where owned by a disconnected team or requiring direct customer research). Any of these undermines confidence in the measurement itself (and probably the frequency too).
Good tooling that collects and exposes metrics via dashboards solves most problems in these areas, sometimes with a little authority shifting. Fortunately the world has largely solved these issues; I mention them only as a caution against regression and an argument in favour of investment.
The good metric
Let’s summarise all of this by characterising the qualities we expect to see in a good metric:
Is owned by the team itself
Is dissociated from individuals’ performance
Answers a question: has a clearly articulated purpose
Is actionable: has clearly articulated triggers and reactions
Is a leading indicator
Gets reviewed for relevance often
Is frequent and obtainable
Fundamentally, all metrics will stimulate behaviour. You have to ask yourself: what behaviour will this metric create, and is it desirable? In the last article in the series we’ll talk about how WORTH applies these ideas in practice, and where we’re heading next.
* I'm not quite sure if the UK Gov now abandoning publication of COVID-19 metrics constitutes a Goodhart problem or something else, but it ain't good. Fortunately software teams are accountable.
At WORTH we believe that knowledge sharing should be free, enabling and impactful. Want further insight into our thoughts and ideas? Sign up to our newsletter.