
Tuesday, September 2, 2014

Why model infectious disease: Ebola

Several weeks ago I wrote a blog post on why modeling infectious disease is useful. Now seems like a good time to highlight a few issues regarding "why model?" in the context of the current Ebola event. Science Insider recently published a very nice piece on Ebola modeling and some initial results from different groups working on the outbreak. Discussing the article with a few colleagues who are not modelers, however, I sensed some skepticism about the past track record of models and about why it's useful to model this outbreak.

As described by many authors previously (see the links in the previous blog), a major use of modeling is to help researchers think carefully about a problem. That's especially true in the current situation, where models can help analyze complex issues. A few examples include:
  • What can be inferred from data in hand, or data that could be collected, to clarify the situation? 
  • Can we infer how quickly the virus is being transmitted and whether it is decreasing, increasing, or staying the same (questions regarding the basic reproduction ratio, R0, and the effective reproduction ratio, Reff)?  
  • If vaccines become available, what coverage and efficacy might be necessary to control the outbreak (i.e., reduce Reff below 1)? What vaccination strategies are likely to make optimal use of resources?
  • Are there combination interventions that might prove effective at reducing the incidence of infection? 
  • What is the likelihood of Ebola cases arriving in distant nations via air travel? 
In short, there are plenty of questions that modeling can help elucidate.
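To make the reproduction-ratio and vaccination questions above concrete, here is a minimal sketch of a standard SIR (susceptible-infectious-recovered) model. Every parameter value is illustrative, not an estimate for Ebola, and the vaccination-threshold formula assumes the simplest homogeneous-mixing setting.

```python
# A minimal SIR sketch illustrating R0, Reff, and the vaccination
# coverage needed to push Reff below 1. All parameter values are
# illustrative, NOT estimates for Ebola.

def simulate_sir(beta, gamma, s0, i0, days, steps_per_day=10):
    """Forward-Euler SIR simulation; returns one (S, I, R) tuple per day."""
    dt = 1.0 / steps_per_day
    s, i, r = s0, i0, 1.0 - s0 - i0
    trajectory = []
    for _day in range(days):
        trajectory.append((s, i, r))
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
    return trajectory

beta, gamma = 0.35, 0.20              # transmission, recovery rates (per day)
r0 = beta / gamma                     # basic reproduction ratio
traj = simulate_sir(beta, gamma, s0=0.999, i0=0.001, days=120)

# The effective reproduction ratio Reff(t) = R0 * S(t) falls as
# susceptibles are depleted; the epidemic wanes once Reff < 1.
reff_start = r0 * traj[0][0]
reff_end = r0 * traj[-1][0]

# Critical vaccination coverage for a vaccine of given efficacy,
# under homogeneous mixing: p_c = (1 - 1/R0) / efficacy
efficacy = 0.9
p_crit = (1.0 - 1.0 / r0) / efficacy

print(f"R0 = {r0:.2f}; Reff falls from {reff_start:.2f} to {reff_end:.2f}")
print(f"Coverage needed at {efficacy:.0%} efficacy: {p_crit:.1%}")
```

With these toy numbers (R0 = 1.75, 90% efficacy), roughly half the population would need to be vaccinated to bring Reff below 1; real estimates would, of course, require fitting such a model to outbreak data.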

One should be skeptical about any epidemiologic method, including mathematical and computer modeling, when the stakes for public health are so high. Ultimately, however, policymakers need timely and defensible analytic guidance to support allocation of scarce resources. Modeling is one component of such guidance.

(image source: David Hartley)

Sunday, June 29, 2014

Why model infectious disease?

People sometimes ask: What use are mathematical models of infectious disease? There are excellent works addressing this question in depth, including McKenzie, Garnett et al, and Grundmann and Hellriegel, among many others. All are recommended reading and offer comprehensive answers from multiple perspectives. Here, I offer a few observations.

Sir Ronald Ross, who discovered that mosquitoes carry the malaria parasite, viewed the modeling process as a way of thinking carefully about epidemiologic issues. The process of constructing a mathematical model, by its very nature, requires that careful, precise ideas are formulated as the model is built. The discipline of writing down and analyzing disease processes can sharpen and inform one's thinking. The history of mathematical modeling and the payoff for malaria research is illustrated beautifully in Smith et al.

The modeling process can also uncover gaps in our knowledge and understanding, often highlighting the need for additional research and expertise to realistically address particular issues. Modeling can thus both facilitate multi- or cross-disciplinary collaborations and identify needed observational or laboratory studies. Examples of models highlighting knowledge gaps for mosquito-borne infections can be seen in Reiner and Perkins et al.

Importantly, models enable virtual experiments and studies, including ones that cannot be carried out easily, if at all, in the real world. Mathematical models are thus tools for analyzing "what if" scenarios, conducting feasibility studies, and carrying out risk assessments. McKenzie illustrates these points clearly for the case of biodefense.

Models also allow us to assess how uncertainty and variation in data affect our ability to make decisions, as Christley et al have recently studied thoughtfully. Mathematical approaches exist, and are commonly applied to models, to deal with uncertainty in quantitative ways, as reviewed recently by Wu et al and illustrated by Okais et al for the case of vaccination.

I also tend to think of models as mechanisms for summarizing, synthesizing, and communicating complex information. It never ceases to amaze me how little space in research papers is devoted to specifying a model relative to the amount of prose needed to explain the model, the data required to run it, and its output. The clear, precise, and economical encapsulation of so much information, typically in just a few lines of equations and a table of parameter values, is very appealing. Mathematics is a much more precise language than the spoken or written word.

Modeling has many uses beyond those touched upon here, some of which will, no doubt, be the topics of future blogs.

Saturday, June 14, 2014

Prediction is difficult, especially about the future (and, apparently, about flu)

The title of this posting, minus the comment about flu, is a quote attributed to Niels Bohr. I imagine him mumbling this while stewing over the horrors of making prospective predictions of experimental outcomes with no good theory to provide guidance. The concern is well founded and the idea that prediction is difficult is profound -- and it is relevant to much of the "big data" analysis that is currently in vogue.

We should take notice of Bohr's admonition for the reasons so clearly described by Lazer et al ("The Parable of Google Flu: Traps in Big Data Analysis"), who review the failure of Google Flu Trends (GFT) in 2013. This is an excellent paper containing a direct critique of many issues facing not only GFT specifically, but also of the larger "big data" movement that is so much in the news today.

Briefly, GFT estimates flu prevalence by mining search terms from users of Google’s search engine and applying algorithms to the results. In the past, GFT's predictions have agreed with CDC surveillance data well, anticipating those data several days earlier than CDC. In 2013, however, it became clear that GFT was substantially overestimating flu levels. Lazer et al describe the failure and explain several ways in which the GFT approach is problematic.
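As a toy illustration of why a purely statistical mapping from search volume to flu levels can go wrong, consider the following sketch. This is emphatically not Google's actual model, and all numbers are synthetic: a linear fit learned while search behavior tracks illness will overestimate prevalence if something else, say media coverage, later inflates flu-related searches.

```python
# Toy sketch of the generic search-based nowcasting idea -- NOT Google's
# actual model. A linear map from search volume to %ILI is fit during a
# period when searches track illness, then applied after search behavior
# shifts (e.g., media-driven queries). All numbers are synthetic.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Training period: search volume is proportional to true %ILI
ili_train = [1.0, 1.5, 2.5, 4.0, 3.0, 2.0, 1.2]
search_train = [10.0 * v for v in ili_train]
a, b = fit_line(search_train, ili_train)

# Later season: media attention doubles search volume at the same true %ILI
true_ili = 2.0
search_now = 2 * 10.0 * true_ili
estimate = a * search_now + b

print(f"True %ILI: {true_ili:.1f}, search-based estimate: {estimate:.1f}")
```

In this contrived example the estimate is exactly double the truth. The mechanism, a shift in search behavior that the fitted model cannot see, is one of the failure modes Lazer et al discuss.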

Early in the paper they capture the essence of the Achilles heel of many "big data" projects at present, noting that
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. We have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
Read that again. Every word is important.

The paper goes on to highlight several issues with GFT and what is known about the methodology behind its predictions. Among other findings, the authors conclude that a forecasting model far simpler than GFT's elaborate use of huge amounts of data could have forecast influenza better than GFT had for some time. So why go to the bother of using massive computational resources to compute a result that's so inaccurate?
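Here is a hedged sketch of what such a "far simpler" model can look like: a persistence-style forecast built from recently reported surveillance values alone. The weekly series below is synthetic (a seasonal bump plus noise); real CDC ILI data are published weekly with a reporting lag.

```python
# Sketch of a simple lag-based forecast of the kind Lazer et al describe:
# predict this week's %ILI from recently reported values alone.
# The weekly series is synthetic (seasonal bump plus noise).
import math
import random

random.seed(0)
ili = [1.5 + 2.5 * math.exp(-((week - 26) ** 2) / 40.0) + random.gauss(0, 0.1)
       for week in range(52)]

def lag_forecast(series, week, lags=2):
    """Forecast a week's value as the average of the last `lags` reports."""
    window = series[week - lags:week]
    return sum(window) / len(window)

errors = [abs(lag_forecast(ili, w) - ili[w]) for w in range(2, 52)]
mae = sum(errors) / len(errors)
print(f"Mean absolute error of the 2-week-lag forecast: {mae:.2f} %ILI")
```

A forecast this naive tracks a smooth seasonal curve closely. The point is not that lag models are good; it is that an approach this cheap sets a baseline that an elaborate big-data system failed to beat.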

Fung, in a recent blog post, provides a frank discussion of what "big data" are and, importantly, what they are not. He describes the OCCAM framework, which amounts to "a more honest assessment of the current state of big data and the assumptions lurking in it". Within this framework, "big data" are:
  • Observational: much of the new data come from sensors or tracking devices that monitor continuously and indiscriminately without design, as opposed to questionnaires, interviews, or experiments with purposeful design
  • Lacking Controls: controls are typically unavailable, making valid comparisons and analysis more difficult
  • Seemingly Complete: the availability of data for most measurable units and the sheer volume of data generated is unprecedented, but more data creates more false leads and blind alleys, complicating the search for meaningful, predictable structure
  • Adapted: third parties collect the data, often for purposes unrelated to those of the data scientists, presenting challenges of interpretation
  • Merged: different datasets are combined, exacerbating the problems relating to lack of definition and misaligned objectives
(Bullets taken directly from Fung.) Trying to make sense out of data that are poorly characterized or understood seems like a recipe for disaster. Traps aplenty indeed, and Lazer et al illustrate these traps for GFT in detail.

Such traps must be identified and worked around in sensible, theoretically sound ways. The OCCAM problems with "big data" do not mean that "big data" analysis is not promising. Rather, they mean that we need to be thoughtful when attempting to analyze such data, and that methods need to be developed to rationalize data so that they can produce meaningful results for biomedical and scientific issues.

What would Bohr think about "big data" if he were alive today? Who knows, of course, but I suspect he would be cautious about drawing inferences from any amount of data -- big or not -- unless those data are understood, characterized, and arguably relevant to a clear theoretical framework.

(image source: Wikipedia)