When Expressive Engineering teaches statistics, our clients are typically engineers with access to vast amounts of data from their project domains. Software developers, for example, can easily run their programs on thousands or even millions of test files from their archives.
As Tim Harford of the Financial Times points out in his March feature on big data, the opportunities offered by large datasets can be elusive. It’s an insightful article, and with regard to large datasets I’ll distil its warnings into two key points.
Firstly, the article warns against assuming that a dataset, however large, captures the entire population. When software developers run their programs against large test sets, it’s tempting to assume that the results accurately reflect what will be observed in the real world.
But sampling bias should always be a concern for the data analyst. The Financial Times article cites the 1936 Literary Digest poll fiasco as a case in point (Expressive Engineering has used the same example in its courses for the last couple of years). The poll, with millions of respondents, was far less accurate than the Gallup poll of 3,000, and incorrectly predicted a victory for Republican Alfred Landon over President Roosevelt. The reason? The Literary Digest sampled only those who owned automobiles and telephones, a relatively prosperous and unrepresentative slice of the 1936 electorate.
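The effect is easy to reproduce. Below is a toy Python simulation of the Literary Digest situation: every number (population size, wealth share, support rates) is invented purely for illustration, but the mechanism is the real one. A huge sample drawn only from the wealthy is beaten by a small random sample.

```python
import random

random.seed(42)

# Hypothetical electorate of 1,000,000 voters. Overall, about 55%
# support candidate A, but support is correlated with wealth:
# wealthy voters (30% of the electorate) mostly favour candidate B.
population = []
for _ in range(1_000_000):
    wealthy = random.random() < 0.3
    p_support_a = 0.35 if wealthy else 0.64
    population.append((wealthy, random.random() < p_support_a))

true_support = sum(supports for _, supports in population) / len(population)

# "Big data" sample: 100,000 responses, but only from wealthy voters
# (the analogue of polling automobile and telephone owners in 1936).
wealthy_voters = [s for w, s in population if w]
biased_sample = random.sample(wealthy_voters, 100_000)
big_estimate = sum(biased_sample) / len(biased_sample)

# Small but properly random sample of 3,000 voters.
random_sample = random.sample([s for _, s in population], 3_000)
small_estimate = sum(random_sample) / len(random_sample)

print(f"true support:     {true_support:.3f}")
print(f"biased (n=100k):  {big_estimate:.3f}")
print(f"random (n=3,000): {small_estimate:.3f}")
```

The biased estimate lands around 0.35 regardless of its enormous sample size, while the 3,000-voter random sample sits within a percentage point or so of the truth. More data does not cure a biased sampling frame.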
Closer to the present day, the Financial Times notes: “In 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.” In my own analysis for Cochlear Limited, I found that clinical datasets of physiological data can be unrepresentative of everyday experience, even when large. Data gatherers have a tendency to filter out measurements that don’t look ‘normal’, and as an analyst I had to compensate for this with my own data collection.
Secondly, the article warns against dismissing causal relationships in favour of mere correlations. It cites Google Flu Trends as the latest high-profile victim. In 2009, Google claimed that flu-related searches on its search engine correlated strongly with the actual spread of flu. But four years later, as Nature News reported in 2013, Google Flu Trends “had drastically overestimated peak flu levels”.
I think the Financial Times summarises the problem quite well, so it’s worth quoting directly: “The problem was that Google did not know—could not begin to know—what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. … The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it.”
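The failure mode is worth making concrete. The toy Python simulation below is not Google’s actual model; it is a sketch of the general trap, with all numbers invented for illustration. A predictor is fitted on a year in which searches and flu cases merely share a common seasonal driver; in year two the search behaviour changes (say, media coverage doubles search volume while flu itself is unchanged) and the model drastically overestimates, just as the quote describes.

```python
import math
import random
import statistics

random.seed(0)

def season(week):
    """A common seasonal driver: peaks in winter, troughs in summer."""
    return 1 + 0.8 * math.cos(2 * math.pi * week / 52)

# Year one: flu cases and search counts both track the seasonal driver,
# so they correlate strongly, but neither causes the other.
train_flu = [100 * season(t) + random.gauss(0, 5) for t in range(52)]
train_searches = [50 * season(t) + random.gauss(0, 5) for t in range(52)]

# Fit flu ≈ a * searches + b by ordinary least squares on year one.
mean_s = statistics.mean(train_searches)
mean_f = statistics.mean(train_flu)
cov_sf = sum((s - mean_s) * (f - mean_f)
             for s, f in zip(train_searches, train_flu))
var_s = sum((s - mean_s) ** 2 for s in train_searches)
a = cov_sf / var_s
b = mean_f - a * mean_s

# Year two: the environment changes. Media coverage doubles search
# volume, but actual flu levels are unchanged.
test_flu = [100 * season(t) + random.gauss(0, 5) for t in range(52, 104)]
test_searches = [2 * 50 * season(t) + random.gauss(0, 5)
                 for t in range(52, 104)]

predicted = [a * s + b for s in test_searches]
print(f"mean actual flu:    {statistics.mean(test_flu):.1f}")
print(f"mean predicted flu: {statistics.mean(predicted):.1f}")
```

The fit is excellent on year one and roughly doubles the truth on year two. A purely statistical pattern holds only while the environment that produced it holds.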
For those of us engineers who are in the business of predictive modelling, the era of big data is undoubtedly exciting. Heeding these two timeless warnings, however, will make you a better data analyst and, ultimately, a better engineer.