Saturday, June 14, 2014

Prediction is difficult, especially about the future (and, apparently, about flu)

(Image: Niels Bohr, http://upload.wikimedia.org/wikipedia/commons/5/5e/Niels_Bohr_Date_Unverified_LOC.jpg)

The title of this posting, minus the comment about flu, is a quote attributed to Niels Bohr. I imagine him mumbling this while stewing over the horrors of making prospective predictions of experimental outcomes with no good theory to provide guidance. The concern is well founded and the idea that prediction is difficult is profound -- and it is relevant to much of the "big data" analysis that is currently in vogue.

We should take notice of Bohr's admonition for the reasons so clearly described by Lazer et al. ("The Parable of Google Flu: Traps in Big Data Analysis"), who review the failure of Google Flu Trends (GFT) in 2013. This is an excellent paper containing a direct critique of many issues facing not only GFT specifically but also the larger "big data" movement that is so much in the news today.

Briefly, GFT estimates flu prevalence by mining search terms from users of Google’s search engine and applying algorithms to the results. In the past, GFT's predictions agreed well with CDC surveillance data and were available several days before the CDC's own reports. In 2013, however, it became clear that GFT was substantially overestimating flu levels. Lazer et al. describe the failure and explain several ways in which the GFT approach is problematic.

Early in the paper they capture the essence of the Achilles heel of many "big data" projects at present, noting that:

    “Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. We have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.

Read that again. Every word is important.

The paper goes on to highlight several issues with GFT and what is known about the methodology involved in its predictions. Among other findings, they conclude that a forecasting model far simpler than the elaborate use of huge amounts of data in GFT could have forecast influenza better than GFT has for some time. So why go to the bother of using massive computational resources to compute a result that's so inaccurate?
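To give a feel for what "far simpler" can mean, here is a minimal sketch of a lagged baseline: forecast each week's flu level with the value reported a couple of weeks earlier, then compare its error with that of another estimate. This is my own illustration, not the model Lazer et al. evaluated. The weekly numbers, the two-week lag, and the "inflated" estimate are all invented for the example, and the function names are hypothetical.

```python
def lagged_forecast(series, lag):
    """Forecast each value with the value observed `lag` steps earlier.

    The result lines up with series[lag:], because the first `lag`
    values have no earlier observation to forecast from.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(series[:-lag])


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented weekly flu-activity values (NOT real CDC data).
reported = [1.0, 1.2, 1.5, 2.1, 3.0, 3.6, 3.2, 2.5, 1.8, 1.3]

LAG = 2  # assumed reporting delay, in weeks
targets = reported[LAG:]                    # weeks we can score
baseline = lagged_forecast(reported, LAG)   # "same as two weeks ago"

# A stand-in for an estimate that consistently runs high (also invented).
inflated = [2.0 * value for value in targets]

baseline_mae = mean_absolute_error(targets, baseline)
inflated_mae = mean_absolute_error(targets, inflated)

print(f"lagged baseline MAE:   {baseline_mae:.3f}")
print(f"inflated estimate MAE: {inflated_mae:.3f}")
```

The point is not the particular numbers but the habit: any elaborate estimate should be scored against a trivial baseline like this on the same weeks before anyone gets excited about it.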

Fung, in a recent blog post, provides a frank discussion of what "big data" are and, importantly, what they are not. He describes the OCCAM framework, which amounts to "a more honest assessment of the current state of big data and the assumptions lurking in it". Within this framework, "big data" are:
  • Observational: much of the new data come from sensors or tracking devices that monitor continuously and indiscriminately without design, as opposed to questionnaires, interviews, or experiments with purposeful design
  • Lacking Controls: controls are typically unavailable, making valid comparisons and analysis more difficult
  • Seemingly Complete: the availability of data for most measurable units and the sheer volume of data generated is unprecedented, but more data creates more false leads and blind alleys, complicating the search for meaningful, predictable structure
  • Adapted: third parties collect the data, often for purposes unrelated to the data scientists, presenting challenges of interpretation
  • Merged: different datasets are combined, exacerbating the problems relating to lack of definition and misaligned objectives
(Bullets taken directly from Fung.) Trying to make sense out of data that are poorly characterized or understood seems like a recipe for disaster. Traps aplenty indeed, and Lazer et al. illustrate these traps for GFT in detail.

Such traps must be identified and worked around in sensible, theoretically sound ways. The OCCAM problems with "big data" do not mean that "big data" analysis is not promising. Rather, they mean that we need to be thoughtful when attempting to analyze such data, and that methods need to be developed to rationalize data so that they can produce meaningful results for biomedical and scientific issues.

What would Bohr think about "big data" if he were alive today? Who knows, of course, but I suspect he would be cautious about drawing inferences from any amount of data -- big or not -- unless those data are understood, characterized, and arguably relevant to a clear theoretical framework.

(image source: Wikipedia)
