Those who work with data have long recognised that, given a large enough sample, most pairs of variables will show statistically significant correlations, because at some level everything is related to everything else. The psychologist Paul Meehl famously called this the ‘Crud Factor’: it leads us to believe there are real relationships in the data when in fact the linkage is trivial.
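To make the Crud Factor concrete, here is a minimal sketch of how the same trivial correlation moves from unremarkable to ‘highly significant’ purely because the sample grows. The correlation of 0.02 and the sample sizes are illustrative assumptions, not figures from Meehl, and the p-value uses a normal approximation to the usual t-test for a correlation coefficient.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a sample correlation r from n observations."""
    # Standard t-statistic for testing whether a correlation differs from zero.
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # Assumption: normal approximation to the t distribution, fine for large n.
    return math.erfc(abs(t) / math.sqrt(2))

weak_r = 0.02  # hypothetical correlation, far too small to matter in practice
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  p={correlation_p_value(weak_r, n):.3g}")
```

The strength of the relationship never changes in this sketch; only the sample size does, which is why statistical significance alone says little about whether a link is worth acting on.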
Nate Silver made the same point when he warned that the number of ‘meaningful relationships’ is not increasing in step with the meteoric increase in the amount of data available. We simply generate a larger number of false positives, an issue endemic in data analytics, which led John Ioannidis to suggest that two-thirds of the findings in medical journals were in fact not robust.
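A rough sketch of why false positives multiply as data sets widen: the number of pairwise relationships that can be tested grows much faster than the number of variables. The function name and the 5 percent threshold are illustrative assumptions, and the calculation assumes the deliberately pessimistic case in which no real relationships exist at all.

```python
from math import comb

def expected_false_positives(n_variables, alpha=0.05):
    """Pairs that can be tested, and how many pass by chance alone at level alpha."""
    n_pairs = comb(n_variables, 2)  # every distinct pair of variables
    # Assumption: every true correlation is zero, so each test has probability alpha of a false alarm.
    return n_pairs, n_pairs * alpha

for n_variables in (10, 100, 1000):
    pairs, false_hits = expected_false_positives(n_variables)
    print(f"{n_variables:>5} variables: {pairs:>7,} pairs, about {false_hits:,.0f} chance 'findings'")
```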
So if we cannot always rely on statistical techniques to cut through swathes of data and find meaningful patterns, where do we turn? Naturally, we look to ourselves. Perhaps this is implicit in the discussion about the qualities of good data scientists: that they are ‘informed sceptics’ who balance judgement and analysis, or that their key qualities are a sense of wonder, a quantitative knack, persistence and technical skills. However, as soon as we recognise that humans are involved in the analysis of data, we need to start exploring the frailties of our judgement, for if there is one thing that behavioural economics has taught us, it is that none of us is immune from misinterpreting data.
One cognitive function that is surely critical for any data scientist is the ability to find order and spot patterns in data. As humans we are excellent at this, and for good evolutionary reasons: our ability to do so is what drives new findings and thus our advancement. But as with all cognitive functions, this strength is also a weakness. The ability is so integral to us that it can tip over into detecting patterns where none exist.
As Thomas Gilovich points out in his classic book ‘How We Know What Isn’t So’, one of the problems we encounter when looking at data is that it is very hard for us to tell when it is random and when it contains genuine patterns. When we look at a truly random sequence we tend to think there are patterns in it, because it somehow looks too ordered or ‘lumpy’. So, for example, when we toss a coin twenty times there is roughly a 50 percent chance of getting four heads in a row, a 25 percent chance of five in a row, and a 10 percent chance of a run of six. But if you give such a sequence to most individuals, they will judge that the runs are patterns in the data and not at all random. This explains the ‘hot hand’ fallacy, where we think we are on a winning streak in whatever it may be, from cards to basketball to football. In each of these areas, where the data is random but happens to include a streak, we massively over-interpret the importance of that pattern.
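The coin-toss figures can be checked exactly rather than taken on trust. The sketch below counts every sequence of twenty fair tosses that avoids a run of heads of a given length, then takes the complement. The function name is my own, and the calculation assumes a fair coin with independent tosses. It gives about 48, 25 and 12 percent for runs of four, five and six heads, close to the rounded figures quoted above.

```python
from fractions import Fraction

def prob_head_run(n_tosses, run_length):
    """Exact probability of at least one run of `run_length` heads in `n_tosses` fair tosses."""
    # ways[j] = number of sequences so far that end in exactly j heads
    # and have not yet contained a run of run_length heads.
    ways = [0] * run_length
    ways[0] = 1
    for _ in range(n_tosses):
        new = [0] * run_length
        new[0] = sum(ways)              # a tail resets the current streak
        for j in range(run_length - 1):
            new[j + 1] = ways[j]        # a head extends the streak by one
        ways = new
    no_run = sum(ways)                  # sequences that never reach the run
    return 1 - Fraction(no_run, 2 ** n_tosses)

for k in (4, 5, 6):
    print(f"run of {k} heads in 20 tosses: {float(prob_head_run(20, k)):.3f}")
```

Streaks that feel remarkable are, on this reckoning, close to a coin flip in their own right, which is exactly why they are so easy to over-interpret.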
This effect applies not only to numeric data but also to the analysis of visuals. This is important, as visualisation is rapidly becoming a key element of Big Data analytics. A good example of the pitfalls comes from the latter part of World War II, when the Germans mounted a particularly intense bombing campaign on London. It was a commonly held view at the time that the bombs were landing in clusters, which made some parts of London more dangerous than others. However, analysis of the data after the war showed that the bombs had in fact landed at random and no part of London was more dangerous than another.
It is easy to see why Londoners concluded at the time that there was a pattern in the bombing, as eyeballing the data retrospectively makes it easy to start seeing one. A more rigorous approach, of course, requires us to generate hypotheses which are then tested on other sets of data.
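One way to replace eyeballing with a test is to lay a grid over the map, count how many squares received zero, one, two or more hits, and compare those counts with what pure chance would produce. The sketch below does this for a simulated scatter, not the wartime records: the number of points, the grid size and the random seed are all arbitrary assumptions chosen for illustration, and the chance benchmark is the Poisson distribution.

```python
import math
import random
from collections import Counter

def simulate_hits(n_points, grid_side, seed=0):
    """Scatter points uniformly over a grid; return how many cells got 0, 1, 2, ... hits."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    hits = Counter()
    for _ in range(n_points):
        cell = (rng.randrange(grid_side), rng.randrange(grid_side))
        hits[cell] += 1
    per_cell = Counter(hits.values())              # cells grouped by hit count
    per_cell[0] = grid_side * grid_side - len(hits)  # cells that were never hit
    return per_cell

def poisson_expected(n_points, n_cells, k):
    """Number of cells expected to receive exactly k hits if hits fall at random."""
    lam = n_points / n_cells
    return n_cells * math.exp(-lam) * lam ** k / math.factorial(k)

observed = simulate_hits(n_points=500, grid_side=24)
for k in range(max(observed) + 1):
    print(f"{k} hits: observed {observed[k]:>3} cells, expected {poisson_expected(500, 576, k):6.1f}")
```

Even in this purely random scatter some squares take several hits while many take none, which is the ‘lumpiness’ the eye reads as clustering. A real analysis would apply the same comparison to the recorded impact sites and follow it with a formal goodness-of-fit test.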
And of course, once we start seeing patterns, we quickly start to generate stories that would explain the data. As Duncan Watts might say, everything is obvious when you know why. We hate uncertainty and strive to reduce it by quickly adopting explanations. The challenge we then face is that we tend to seek out only information which is consistent with the story we have developed. This ‘confirmation bias’ was first identified in a series of experiments in the 1960s, which showed that we seek out data which confirms our theory rather than data which tests it. And when we get new information, we tend to interpret it in a way that is self-serving. So we cement in our misinterpretation of data alarmingly quickly.
Of course, seeing patterns in data is hugely helpful when it allows us to generate hypotheses that we use as the starting point for proper testing of the data. And this is where big data needs to adopt the rigour of an experimental approach, whereby we state our objectives, formulate tangible hypotheses to reflect these objectives and then design experiments to test these hypotheses. Market research should be well placed in this respect, given that our understanding of consumers allows us to identify ways to mine data and to build these hypotheses.
If we fail to do this and just ‘let the data speak for itself’, then we run the risk of being misled by an unholy alliance of the Crud Factor’s spurious statistical correlations and fallible human judgement. Far from Big Data being the end of science, scientific method may in fact prove to be the salvation of big data.
Colin Strong heads up technology research at GfK in the UK. He is particularly interested in the way our lives are increasingly mediated by data – between consumers, brands and government. Colin is interested in exploring the opportunities this represents for a better understanding of human behaviour but also to examine the implications for brand strategy and social policy.