Showing posts with label big data. Show all posts

Tuesday, May 12, 2015

Intensive care from afar: Caregiver versus patient watcher

A recent NPR story by Michael Tomsic recounted the remarkable story of how the Carolinas HealthCare System monitors ICU patients in 10 of its hospitals from a remote "command center"-like facility. Several critical care specialists staff the center; nurses are present around the clock and doctors work nights. Command center staff also spend time at the hospitals they monitor.

The system began doing this roughly two years ago and has since found that the quiet atmosphere of the command center ("none of the bells and whistles going off that most ICUs need to alert nurses and doctors down the hall that they're needed") allows medical staff in the center to maintain a constant focus on patients. The approach seems to be working for the system: They've observed a higher patient volume, lower mortality rate, and decreased length of stay since opening the center (though, as the article describes, such improvement likely isn't due solely to the remote monitoring program).

The issue of alarm fatigue is recognized as an important patient safety issue, so the idea of placing a group of specialists outside the immediate patient environment for monitoring purposes has a strong rationale. What I found most interesting about the article, however, was revealed in remarks from two nurses interviewed. One observed that "There are things that I'm able to view here [in the command center] — trends that I'm able to view here — that I'm not able to view at the bedside", while another noted that since the command center staff has easy access to patient data, handoffs are better and issues are less likely to be missed.

Assuming that these ICUs are not fundamentally different from ICUs in other facilities, the story highlights an issue that is endemic far beyond this particular set of hospitals: the frequent failure to bring data to the bedside in an effective way. This is ironic, as the big data and IT revolution brags -- incessantly, it sometimes seems -- about delivering data and analytics to the point where they can be most useful. That isn't consistent with the remarks from healthcare workers in this article.

Is caregiving versus patient monitoring an either-or proposition? I doubt it, as I've seen data-driven intensive care delivered reliably over long periods of time. Rather, I think the question is how to make data actionable through delivery to the right people without disrupting their workflow. It's a question for all clinical environments beyond the ICU. We need to make more effective use of routine clinical data.

(image source: Wikipedia)

Saturday, June 14, 2014

Prediction is difficult, especially about the future (and, apparently, about flu)

The title of this posting, minus the comment about flu, is a quote attributed to Niels Bohr. I imagine him mumbling this while stewing over the horrors of making prospective predictions of experimental outcomes with no good theory to provide guidance. The concern is well founded and the idea that prediction is difficult is profound -- and it is relevant to much of the "big data" analysis that is currently in vogue.

We should take notice of Bohr's admonition for the reasons so clearly described by Lazer et al ("The Parable of Google Flu: Traps in Big Data Analysis"), who review the failure of Google Flu Trends (GFT) in 2013. This is an excellent paper containing a direct critique of many issues facing not only GFT specifically, but also of the larger "big data" movement that is so much in the news today.

Briefly, GFT estimates flu prevalence by mining search terms from users of Google’s search engine and applying algorithms to the results. In the past, GFT's predictions have agreed with CDC surveillance data well, anticipating those data several days earlier than CDC. In 2013, however, it became clear that GFT was substantially overestimating flu levels. Lazer et al describe the failure and explain several ways in which the GFT approach is problematic.

Early in the paper they capture the essence of the Achilles heel of many "big data" projects at present, noting that
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. We have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
Read that again. Every word is important.

The paper goes on to highlight several issues with GFT and what is known about the methodology involved in its predictions. Among other findings, they conclude that a forecasting model far simpler than GFT's elaborate use of huge amounts of data could have forecast influenza better than GFT has for some time. So why go to the bother of using massive computational resources to compute a result that's so inaccurate?
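To make the point concrete, the kind of simple baseline Lazer et al have in mind is one that projects forward from recent, lagged CDC surveillance values rather than mining search queries. The sketch below is illustrative only -- the data and weights are made up, and a real comparison would use CDC ILINet figures:

```python
# A minimal lagged-baseline forecast: predict next week's flu activity
# as a weighted combination of the most recent surveillance observations.
# The weekly values and weights below are synthetic, for illustration only.

def lagged_forecast(history, weights=(0.6, 0.3, 0.1)):
    """Forecast the next value as a weighted sum of the last few observations."""
    recent = history[-len(weights):][::-1]  # most recent observation first
    return sum(w * v for w, v in zip(weights, recent))

# Synthetic weekly ILI percentages (rising flu season)
ili = [1.2, 1.5, 2.1, 2.8, 3.5]
print(round(lagged_forecast(ili), 2))  # prints 3.15
```

A model this trivial requires almost no computation, which is what makes the comparison with GFT's massive search-mining pipeline so pointed.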

Fung, in a recent blog post, provides a frank discussion of what "big data" are and, importantly, what they are not. He describes the OCCAM framework, which amounts to "a more honest assessment of the current state of big data and the assumptions lurking in it". Within this framework, "big data" are:
  • Observational: much of the new data come from sensors or tracking devices that monitor continuously and indiscriminately without design, as opposed to questionnaires, interviews, or experiments with purposeful design
  • Lacking Controls: controls are typically unavailable, making valid comparisons and analysis more difficult
  • Seemingly Complete: the availability of data for most measurable units and the sheer volume of data generated is unprecedented, but more data creates more false leads and blind alleys, complicating the search for meaningful, predictable structure
  • Adapted: third parties collect the data, often for purposes unrelated to those of the data scientists, presenting challenges of interpretation
  • Merged: different datasets are combined, exacerbating the problems relating to lack of definition and misaligned objectives
(Bullets taken directly from Fung.) Trying to make sense out of data that are poorly characterized or understood seems like a recipe for disaster. Traps aplenty indeed, and Lazer et al illustrate these traps for GFT in detail.

Such traps must be identified and worked around in sensible, theoretically sound ways. The OCCAM problems with "big data" do not mean that "big data" analysis is not promising. Rather, they mean that we need to be thoughtful when attempting to analyze such data, and that methods need to be developed to rationalize data so that they can produce meaningful results for biomedical and scientific issues.

What would Bohr think about "big data" if he were alive today? Who knows, of course, but I suspect he would be cautious about drawing inferences based on any amount of data -- big or not -- unless those data are understood, characterized, and demonstrably relevant to a clear theoretical framework.

(image source: Wikipedia)

Tuesday, April 22, 2014

Social media, the Manhattan Project, and big data

Social media and the Manhattan Project may seem to be completely different things, but there are facets of each that are similar. Let me explain.

The Manhattan Project was a massive and ambitious R&D project undertaken in World War II that transformed theory and data into a functioning product within a few short years. It was supported by vast resources and it brought people from many different backgrounds together to solve complex problems. Because of its success, the term "Manhattan Project" has become a cliché for any concerted effort to solve "big" problems, like energy or cancer. Although the original project was for weapons development, today the term is used in a positive light -- a project to solve an important but hard problem.

People have studied the Manhattan Project to understand why it was successful. Several factors are thought to have contributed, including:
  • Talent -- the project recruited talented individuals and gave them responsibility for solving key problems.
  • Youth -- the mean age of the scientific and technical staff at Los Alamos, New Mexico, where important work was done, was 25. 
  • Leadership -- divisions of the project were led by people of undisputed accomplishment and they, in turn, were managed, but not micromanaged, by strong leaders.
  • Focus -- the entire project was focused on outcomes and meeting milestones along the way. 
  • Conviction -- participants were deeply committed to achieving the end goal before a wartime enemy did. 
  • Communication -- while much of the project was protected information, free discussion within certain divisions was allowed and found to be necessary for success.   
  • Dispersion -- the project had operations in several states.

Within the context of public health and medicine, social media has many of these attributes. Users tend to be young. They often organize themselves into virtual communities of interest focused on important and complex topics. Users are dispersed throughout the world and are willing to actively communicate on virtually any issue. Significant effort can be brought to bear on a question at an instant's notice. And users tend to follow online leaders and thought influencers.

Because of the number of healthcare providers and consumers using social media, and the diversity and expertise of these users, social media in medicine is an immense resource. It has been used for infectious disease surveillance; patient sentiment analysis; testing treatments and speeding patient recruitment into clinical trials; understanding views on vaccines; and many, many other purposes, too numerous to list here.

Another dimension of medical social media is the potential to produce "big data". In a previous post I discussed the need for theory and context in big data if we are to make sound inferences. Certainly, others have warned about the pitfalls of big data as well. With such social media projects comes the potential to acquire vast amounts of data in a focused, disciplined way. Social media projects hold the promise of collecting big data to address specific research questions.

I don't know if social media can be harnessed into a Manhattan Project for any single issue in healthcare, but it is clear that it is playing important roles throughout the spectrum of medicine. It is a conduit for collecting types and volumes of data previously unimagined, for catalyzing participatory medicine and patient engagement, and for generating and sharing wisdom. The practice of medicine in the 21st Century has important psychosocial, behavioral, economic, and scientific dimensions; social media plays in them all.

(image source: David Hartley)

Wednesday, February 5, 2014

Big data and infection: The need for theory

Richard Hamming (1915-1998) was an American mathematician who did influential and important work in computer science and telecommunications. He spent the bulk of his career at Bell Laboratories and then the Naval Postgraduate School. As part of the latter position, he taught a graduate course in engineering. I recently discovered that there's a book on these lectures, and that the lectures themselves can be found on YouTube. The book, The Art of Doing Science and Engineering: Learning to Learn, is a delightful read and is appropriate for many audiences, including those in the biomedical domain. It oozes wisdom; see, for example, the transcript of a talk he once gave, which ultimately became a chapter in the book.

I hope you acquire a copy of The Art of Doing Science and enjoy it. If you do read it, you will come across one of the pearls of wisdom that speaks to me:
The purpose of computing is insight, not numbers.
How true it is. And if we take "computing" as a synonym for "big data", I think this idea is very applicable to the current trends in bio-medicine. While the methods that are emerging to analyze and draw inferences from extremely large data sets are powerful and hold much promise when applied to relevant data, there are several things to be kept in mind. 

First, often big data sets are collected for one set of goals and purposes, but then used at a later time by researchers interested in completely different subjects. The problem with this is made clear by a chapter in Hamming's book entitled "You get what you measure". It begins (p. 202),
You may think the title means if you measure accurately you will get an accurate measurement, and if not then not; but it refers to a much more subtle thing—the way you choose to measure things controls to a large extent what happens. I repeat the story Eddington told about the fishermen who went fishing with a net. They examined the size of the fish they caught and concluded there was a minimum size to the fish in the sea. The instrument you use clearly affects what you see. 
And so it is with data sets. It is necessary to understand the details of large data sets before throwing machine methods at them in search of an answer to your particular question. How were the data collected? For what purpose were they collected? Are there biases lurking within the data? To what accuracy and precision were they collected? Et cetera.
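Some of those questions can only be answered by reading the data's documentation, but a few can be partially operationalized as routine checks run before any modeling. Here is a minimal sketch of that idea; the record structure, field name, and plausible-range bounds are all hypothetical:

```python
# Pre-analysis sanity audit: before modeling, count missing values and
# values outside a plausible range for a field. Field names and ranges
# here are hypothetical, chosen only to illustrate the habit.

def audit(records, field, lo, hi):
    """Report missingness and out-of-range counts for one field."""
    values = [r.get(field) for r in records]
    missing = sum(v is None for v in values)
    out_of_range = sum(v is not None and not (lo <= v <= hi) for v in values)
    return {"n": len(values), "missing": missing, "out_of_range": out_of_range}

# One record is missing; one looks like Fahrenheit slipped into a Celsius field
data = [{"temp_c": 36.8}, {"temp_c": None}, {"temp_c": 98.6}]
print(audit(data, "temp_c", 30.0, 45.0))  # prints {'n': 3, 'missing': 1, 'out_of_range': 1}
```

Checks like these won't reveal why the data were collected, but they do surface exactly the kind of instrument-driven artifact Eddington's fishermen fell for.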

Second, although machine methods often produce compelling visualizations, elaborate visualizations can be misleading. As Ezra Klein has noted, the trappings of data and charts can be used to make bad arguments sound persuasive. Great looking graphics of data that lack integrity or relevancy aren't necessarily helpful, and may even be harmful.

Third, by themselves, data alone aren't enough. While the value of exploratory data analysis for hypothesis building and related purposes has long been known, the importance of theoretical context seems to have been forgotten (or, worse, recanted) in the era of big data. Without theory, data may tell us facts -- assuming that the data are measured in a known way, are characterized to a known accuracy and precision, and don't have serious biases in them -- but facts lacking context are little more than random "just so" stories. On the other hand, there are very exciting notions being investigated and discussed in the area of what might be called machine-assisted discovery. Some of that conversation is quite spirited.

As will probably come up in future blogs, there are many other things to be borne in mind when dealing with big data, or any data for that matter. Understanding these may help us use the tools better.

(image source: ABE Books)