Wednesday, February 5, 2014

Big data and infection: The need for theory

Richard Hamming (1915-1998) was an American mathematician who did influential and important work in computer science and telecommunications. He spent the bulk of his career at Bell Laboratories and then at the Naval Postgraduate School, where he taught a graduate course in engineering. I recently discovered that there is a book based on these lectures, and that the lectures themselves can be found on YouTube. The book, The Art of Doing Science and Engineering: Learning to Learn, is a delightful read and is appropriate for many audiences, including those in the biomedical domain. It oozes wisdom; see, for example, the transcript of a talk he once gave, which ultimately became a chapter in the book.

I hope you acquire a copy of The Art of Doing Science and enjoy it. If you do read it, you will come across one of the pearls of wisdom that speaks to me:
The purpose of computing is insight, not numbers.
How true it is. And if we take "computing" as a synonym for "big data", I think this idea is very applicable to current trends in biomedicine. While the methods emerging to analyze and draw inferences from extremely large data sets are powerful and hold much promise when applied to relevant data, several things should be kept in mind.

First, often big data sets are collected for one set of goals and purposes, but then used at a later time by researchers interested in completely different subjects. The problem with this is made clear by a chapter in Hamming's book entitled "You get what you measure". It begins (p. 202),
You may think the title means if you measure accurately you will get an accurate measurement, and if not then not; but it refers to a much more subtle thing—the way you choose to measure things controls to a large extent what happens. I repeat the story Eddington told about the fishermen who went fishing with a net. They examined the size of the fish they caught and concluded there was a minimum size to the fish in the sea. The instrument you use clearly affects what you see. 
And so it is with data sets. It is necessary to understand the details of a large data set before throwing machine methods at it in search of an answer to your particular question. How were the data collected? For what purpose were they collected? Are there biases lurking within the data? To what accuracy and precision were they collected? Et cetera.
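Eddington's fishing net can be made concrete with a few lines of code. The sketch below is purely illustrative (the sizes and mesh width are invented): the "instrument" censors the sample, so the observed minimum reflects the net, not the sea.

```python
import random

random.seed(0)

# Fish in the sea, sizes in cm (a made-up uniform population).
true_sizes = [random.uniform(1, 30) for _ in range(10_000)]

# The net: anything smaller than the 5 cm mesh slips through.
mesh = 5.0
caught = [s for s in true_sizes if s >= mesh]

# The observed minimum sits near the mesh size, far above the
# true minimum -- the measurement instrument shaped the "finding".
print(f"true minimum:     {min(true_sizes):.2f} cm")
print(f"observed minimum: {min(caught):.2f} cm")
```

The same censoring happens, less visibly, whenever a data set was gathered with instruments or inclusion criteria chosen for someone else's question.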

Second, although machine methods often produce compelling visualizations, elaborate visualizations can be misleading. As Ezra Klein has noted, the trappings of data and charts can be used to make bad arguments sound persuasive. Great-looking graphics of data that lack integrity or relevance aren't necessarily helpful, and may even be harmful.

Third, data alone aren't enough. While the value of exploratory data analysis for hypothesis building and related purposes has long been known, the importance of theoretical context seems to have been forgotten (or, worse, recanted) in the era of big data. Without theory, data may tell us facts -- assuming that the data are measured in a known way, are characterized to a known accuracy and precision, and don't harbor serious biases -- but facts lacking context are little more than random "just so" stories. On the other hand, there are very exciting notions being investigated and discussed in the area of what might be called machine-assisted discovery. Some of that conversation is quite spirited.

As will probably come up in future posts, there are many other things to be borne in mind when dealing with big data, or any data for that matter. Understanding these may help us use the tools better.

(image source: ABE Books)
