"Big data" is both empty marketing language and a genuine business problem. IT firms are salivating to handle more data, with Google, EMC, Tibco and IBM positioning old tools for a new age. The goal is to handle high-volume, heterogeneous, unstructured data in quantities too monstrous to fit in a standard database. Forget about mega and giga. We’re going to get to know Greek prefixes like peta and exa.
Big data was a theme at this week’s 2012 Post-Approval Summit. For eight years, the meeting has been organized by academic surgeon Richard Gliklich and hosted on his turf, the Longwood portion of Harvard’s medical empire.
In The Ivy
Gliklich’s sizable side venture—the late-phase registry builder and contract research organization (CRO) Outcome—was sold to Quintiles last year. That has put the top CRO in the position of presiding over an elite conference at a medical school with a fabled reputation for scientific rigor.
It is never easy to pick the most thoughtful speaker at Dr. Gliklich’s meeting. This guy can rustle up more MD, PhD and FDA speakers for a single morning of his conference than some large life science meeting companies can find during a whole year. For 2012, one of the most compelling presentations was by Paul Stang, senior director of epidemiology at JNJ.
Stang doubts that big data will produce any Bible-worthy miracles. “It seems to have taken on a life of its own,” he says of the phrase. “I am going to temper your enthusiasm and give you the good, the bad and the ugly.” His central point was intended to deflate the hype coming from IT firms. Big data, by itself, is not axiomatically better. It’s just bigger, Stang insists.
He began by noting that Walmart handles a million retail transactions every hour. The big box store is managing a 2.5 petabyte data warehouse. That is more than 160 times larger than the holdings of the U.S. Library of Congress. The quantity of data held by the California insurer Kaiser Permanente, in turn, was unspecified but said to dwarf what Walmart has.
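The Walmart comparison can be checked with back-of-the-envelope arithmetic. The Library of Congress figure below is an assumption on my part (its digitized print holdings are often estimated at roughly 15 terabytes, though the number is debated):

```python
# Back-of-the-envelope check of the Walmart vs. Library of Congress claim.
# Assumption: the Library of Congress print collection is roughly
# 15 terabytes (a commonly cited, but debated, estimate).
walmart_warehouse_tb = 2.5 * 1000   # 2.5 petabytes expressed in terabytes
library_of_congress_tb = 15         # assumed estimate

ratio = walmart_warehouse_tb / library_of_congress_tb
print(round(ratio))  # about 167 -- consistent with "more than 160 times larger"
```

Under that assumption, the quoted "more than 160 times larger" holds up.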
What’s driving the big data explosion in R&D? Genomics. Imaging. Other presentations at the Quintiles conference suggested that insurance claims, spontaneous adverse event reports, and unstructured text from electronic health records could also tip the scales toward data warehouses on a scale that has never been seen in the life sciences.
Stang went on to say that JNJ employs “informaticists.” It wasn’t clear, at first, if he was serious. It turns out that “informaticist” is a new career. “How many of you have touched an informaticist?” he joked. “Some of them used to be common people just a year ago. Those are the people who scare me the most.”
Innocents At Large
The frightening thing, Stang said, is that informaticists in corporate America or a college dorm may detect newsworthy conclusions in data which they scarcely fathom. Your faithful correspondent had an image of farm kids playing with dad’s chainsaw in a barn.
“We are starting to see more randomized studies of these extra data resources,” Stang said, referring to meta-analyses and data-mining efforts that are using larger and larger datasets. “I get very afraid when I hear of these mashups when the informatics people are combining data from a lot of sources and treating them equally.”
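The mashup worry can be made concrete with a toy sketch. The sources and numbers below are hypothetical, invented only to show how a naive pool treats a curated registry and noisy spontaneous reports as equally trustworthy:

```python
# Toy illustration (hypothetical numbers): naively pooling event counts
# from sources of very different quality treats them as equally reliable.
sources = [
    {"name": "curated registry",    "events": 12,  "patients": 10_000},
    {"name": "spontaneous reports", "events": 480, "patients": 10_000},  # noisy, duplicated, unverified
]

# A naive mashup: sum everything and report one pooled rate.
total_events = sum(s["events"] for s in sources)
total_patients = sum(s["patients"] for s in sources)
naive_rate = total_events / total_patients

print(f"pooled rate: {naive_rate:.3%}")  # the noisy source dominates the signal
```

The pooled rate is driven almost entirely by the unverified source, which is exactly the "treating them equally" problem Stang describes.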
The era in which pharma folks alone control the data about individual drugs has been over for a while. But big data will accelerate the ability of other societal stakeholders to say, we’re seeing a safety signal with this drug that the manufacturer has not reported. It’s possible, of course, that big data could do the opposite, assisting the drug industry by identifying trends that support new regulatory filings or marketing claims.
Dim-bulb informaticists, Stang warned, may sift, extract and compute against towering stacks of data without a nuanced understanding of the limitations in each stack. Depending on the personality of the informaticist, he or she may be compelled to publish alarming conclusions or contact regulatory agencies. Even in the pharmaceutical industry, Stang implied, informaticists may have good intentions but lack the slowly acquired domain knowledge of best practices in epidemiology, clinical care or electronic health record system design.
‘A Data Landfill’
To take just one example, the unavoidable dirtiness or subtlety in a million physician office notes may be something that a bright MBA or armchair epidemiologist can nod knowingly about. Will there be real understanding? Probably not. The potential damage to the reputation of a falsely accused life science company did not need to be spelled out for the audience at the Quintiles conference.
As he pointed out, there are multiple systems of medical diagnosis and reimbursement, with hundreds of thousands of special codes and medical terms that do not map neatly and definitively to each other. “Just because it has a diagnosis code of angina, the person may not have angina,” Stang noted. “Are we just creating a giant data landfill, a giant data dump, where we lose sight of what is real and what is not real? If you have a big enough data set, you can find just about anything. The real trick is how do we get something that is actionable.”
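Stang's coding point can be sketched with a small lookup. The mapping below is illustrative, not real terminology content; actual systems such as ICD-9 and SNOMED CT contain hundreds of thousands of entries with many-to-many relationships:

```python
# Hypothetical, simplified mapping: one billing code can correspond to
# several clinical concepts, so a code's presence does not prove the
# condition. Values below are illustrative, not real terminology content.
billing_to_concepts = {
    "413.9": ["angina, unspecified", "chest pain, rule-out angina"],  # ambiguous
}

def possible_conditions(code: str) -> list[str]:
    """Return every clinical concept a billing code might represent."""
    return billing_to_concepts.get(code, [])

print(possible_conditions("413.9"))
# More than one candidate: the code alone cannot confirm angina.
```

Multiply that ambiguity across hundreds of thousands of codes and it becomes clear why a diagnosis code of angina does not mean the patient has angina.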
It may be useful, Stang added, to draw a distinction between static and dynamic data. Genetic variations, family history and demographic characteristics are static. But pain, moods, and blood pressure levels fluctuate. Stang suggested that basic research may have to establish a more comprehensive view of the ebb and flow of dynamic physiological data over time, presumably in healthy people first. Only then will intelligent big-data conclusions about medical treatments be well-founded.
Stang also suggested that more attention to graphics and charts may prevent confusion—both within the scientific community and outside it. In that regard, his own visuals were attractive, if unintelligibly small when squished into PowerPoint slides. In response to a question from the audience, he reluctantly offered a plug for the Spotfire visualization package from Tibco, saying his team selected it after evaluating many alternatives.
Stang did not utter a word about what JNJ is doing with big data. An unabridged big data case history from someone in the life sciences would be helpful for industry. Some people in the audience will someday wake up to read a New York Times analysis of big data about their own firm's top-selling medication, vaccine or device.
It is not premature to begin practicing how to explain big data research to the public. On the whole, today's news media have the scientific background knowledge of a golden retriever. The pharmaceutical industry, meanwhile, has struggled to help the public grasp old-school, traditional research. The looming prospect of big data, drawing on data tsunamis from hospital networks and insurance companies, will be politically fraught, with industry likely to be locked out of access to some data sources. Big communications will be needed.