When the Chair of the House of Commons Education Committee asked Michael Gove (Secretary of State for Education at the time) about comparative performance measurement between schools, this happened:
Chair: If “good” requires pupil performance to exceed the national average, and if all schools must be good, how is this mathematically possible?
Michael Gove: By getting better all the time.
(Full transcript here)
Now, sniggers to one side, there are a few important points here. The first is that I don’t disagree with striving to get better all the time; nor do I think performance shouldn’t be measured. I also believe it can be useful to understand apparent differences in comparative peer performance.
So, what’s the problem?
Well, it’s the way it’s so often done – league tables.
Here’s an example using police forces, although you could replace them with schools, hospitals or other institutions, if you like.
League tables are over-simplified, misleading, fundamentally illegitimate charlatans of the performance world; they purport to convey information about comparative peer performance, when in fact they are little more than mirages. They lie to you. They tell you stuff that isn’t there. They set you off on thought processes and assumptions that are utterly unwarranted. (A bit like slightly more elaborate binary comparisons. Ugh!) But the most dangerous thing about them is that they appear so plausible.
A notable problem with league tables is that they are routinely methodologically unsound and notoriously unstable. (This is particularly true of league tables constructed from complex public sector data). Due to statistical considerations I won’t inflict on you here, it is often mathematically impossible to neatly rank institutions in the tidy fashion we are so used to (i.e. one at the top, one at the bottom, and the remainder nicely stacked in between, from best to worst). You see, in league table world, about half of those ranked end up as ‘below average’, and someone is always bottom.
Not everyone can be above the national average! Why not? Because it’s an average.
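To make the arithmetic concrete, here’s a tiny sketch with made-up numbers (six hypothetical forces, invented figures for illustration only):

```python
# Hypothetical performance figures for six forces (illustrative numbers only).
rates = [62.1, 58.4, 60.3, 59.7, 61.0, 57.9]
mean = sum(rates) / len(rates)

above = [r for r in rates if r > mean]
below = [r for r in rates if r < mean]

print(f"mean = {mean:.2f}")                                   # mean = 59.90
print(f"above average: {len(above)}, below average: {len(below)}")
# Whatever the numbers happen to be, it is arithmetically impossible
# for every value to sit above its own mean.
```

Swap in any figures you like; roughly half will always land below the mean, because that’s what a mean is.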
What we should be doing is trying to establish if there are significant differences between peers, and this can be done very simply in a couple of ways, as demonstrated by Stick Child…
In this first example, the six police forces we saw earlier are assessed against each other, taking into account confidence intervals in the data. (Don’t worry if you’re unfamiliar with the term, just trust me that it’s important). As you can see, this tells us that two forces are performing significantly differently to the other four (i.e. their confidence intervals don’t overlap with the others’). We can’t, however, neatly rank them from ‘best’ to ‘worst’, because we can’t separate the ‘top’ two from each other, and we can’t separate the other four from each other.
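The overlap check itself is simple enough to sketch in a few lines of code. The forces, point estimates and interval half-widths below are all invented for illustration; the logic is just “two forces differ significantly only if their intervals don’t overlap”:

```python
# Hypothetical rates with confidence interval half-widths (made-up data).
forces = {
    "A": (72.0, 2.5),
    "B": (70.5, 2.0),
    "C": (61.0, 2.2),
    "D": (60.2, 2.4),
    "E": (59.8, 2.1),
    "F": (58.9, 2.6),
}

def interval(point, half_width):
    return (point - half_width, point + half_width)

def overlaps(a, b):
    lo_a, hi_a = a
    lo_b, hi_b = b
    return lo_a <= hi_b and lo_b <= hi_a

# Compare every pair; only non-overlapping pairs count as different.
names = list(forces)
significant = []
for i, x in enumerate(names):
    for y in names[i + 1:]:
        if not overlaps(interval(*forces[x]), interval(*forces[y])):
            significant.append((x, y))

print("significantly different pairs:", significant)
```

With these invented figures, A and B each differ from C, D, E and F, but A can’t be separated from B, and C, D, E and F can’t be separated from each other, so no full ranking is possible.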
Here’s another way of understanding comparative peer performance in a more contextualised manner:
This time we can observe that the six police forces are all within the boundaries of ‘normality’ (by applying Statistical Process Control methodology). If any of them were outside of the dotted lines we might be concerned that particular force was significantly different from its peers; however, in this case, all six forces are clustered around the mean average (solid horizontal line) and within the range of anticipated performance for the group.
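A rough sketch of the idea behind those dotted lines, assuming control limits set at three standard deviations either side of the group mean (one common SPC convention; the figures are invented):

```python
import statistics

# Hypothetical single-period figures for six forces (illustrative only).
rates = [60.2, 58.7, 61.5, 59.9, 60.8, 58.1]

mean = statistics.mean(rates)
sd = statistics.stdev(rates)   # sample standard deviation

ucl = mean + 3 * sd            # upper control limit (upper dotted line)
lcl = mean - 3 * sd            # lower control limit (lower dotted line)

# A force outside the limits would signal a significant difference;
# everything inside is just normal variation around the group mean.
outliers = [r for r in rates if r > ucl or r < lcl]
print(f"mean={mean:.2f}, limits=({lcl:.2f}, {ucl:.2f}), outliers={outliers}")
```

Here all six values sit comfortably inside the limits, which is exactly the picture described above: nothing to see, just normal variation.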
Therefore, there is absolutely no way the forces should be placed in ranked order – they are likely to move positions each time a snapshot is taken because of normal variation, but as long as they stay within the lines (and ideally, improve as a group), it is wrong to judge performance based on apparent position.
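You can see this reshuffling effect for yourself with a quick simulation. The sketch below deliberately gives all six (hypothetical) forces the same underlying performance and adds only random noise, then ranks them each period:

```python
import random

random.seed(1)

# Six forces with IDENTICAL underlying performance, differing only by
# random noise — a deliberately artificial assumption for illustration.
forces = ["A", "B", "C", "D", "E", "F"]

def snapshot():
    scores = {f: random.gauss(60, 2) for f in forces}
    return sorted(forces, key=lambda f: scores[f], reverse=True)

for period in range(1, 4):
    print(f"period {period} 'league table': {snapshot()}")
# The table reshuffles every period even though nothing real has changed.
```

Anyone reading those tables at face value would see forces “climbing” and “plummeting” when, by construction, every force is performing identically.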
You see, when this happens, we encounter the other big problem associated with the league table mindset – concern about someone’s position in a league table leads to unfair assumptions about performance, unnecessary ‘remedial’ activity to address the perceived deficiencies, pressure from management, sanctions, and so on. And all based on something that essentially isn’t there. Cue gaming and dysfunctional behaviour! Like clockwork.
And a final thought – if league tables are constructed using crime data, are we even measuring the right thing? See this.