BOSTON – Researchers in cutting-edge medical fields like neuroscience and genomics are bumping into big data problems that most businesses will never encounter. The human genome, for example, is so complex that it takes enormous datasets to describe it, yet scientists are nowhere near understanding its makeup well enough to analyze all the data they collect. At the same time, the field advances at such a rapid pace that the results of experiments, and the tools used to run them, risk being obsolete by the time they are ready.
Supply chain or marketing analytics, by comparison, rarely yield discoveries of that magnitude.
But beyond the complex human biology, the heads of these research centers do encounter issues familiar to enterprise IT and business executives: changing the culture to accept data sharing, using visualizations as a tool to aid communication and further discovery, and planning for future cloud adoption.
At a panel discussion hosted by the Massachusetts Technology Leadership Council on Feb. 15, three medical research and bioinformatics experts at Boston-area institutions discussed the challenges they face in collecting, storing, and analyzing medical data.
You think you have problems finding insights in the wealth of your data? These people in medical research have big data problems that would make your hair curl, then fall out.
Michele Clamp, Harvard’s interim director of research computing and director of informatics and scientific applications, can’t tune her system to run the cutting-edge algorithms for analyzing thousands of human genomes.
Her infrastructure budget can't keep up with the amount of data collected; Harvard is already storing 15 petabytes of genomics data.
“We’re hitting the limits,” Clamp said. “We’re getting to the point where we can’t get the data off the disk fast enough to keep the CPUs spinning.”
Matthew Trunnell, CIO at the Broad Institute, a joint research center of Harvard and MIT, has the same problem. He shares another with Clamp: a new algorithm for crunching genomic data comes out every six months, and many are so complex that running them in a parallel environment yields no efficiency gain. Hadoop hasn't proven to be a successful way to crunch the data either; the datasets are so complex that programming the long string of MapReduce jobs takes too long to be useful, and it's only a matter of time before a newer discovery comes along and makes the work obsolete.
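To see why those "long strings" of jobs pile up, consider a toy sketch in plain Python (no Hadoop cluster required): every analysis step becomes its own map/reduce pair, and each stage's output must be wired by hand into the next. The k-mer counting task and the 3-letter window here are hypothetical examples for illustration, not the Broad Institute's actual workload.

```python
# Toy illustration of chained MapReduce stages: each step is a
# separate map/shuffle/reduce pass, wired together manually.
from collections import defaultdict

def run_stage(records, map_fn, reduce_fn):
    """One generic MapReduce stage: map each record, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # map
            groups[key].append(value)          # shuffle (group by key)
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce

# Stage 1: count 3-mers (3-letter substrings) across DNA reads.
def kmer_map(read):
    return ((read[i:i + 3], 1) for i in range(len(read) - 2))

def sum_reduce(key, values):
    return (key, sum(values))

# Stage 2: histogram of k-mer frequencies (how many k-mers occur n times).
def hist_map(pair):
    _, count = pair
    return [(count, 1)]

reads = ["GATTACA", "TACAGAT"]
counts = run_stage(reads, kmer_map, sum_reduce)            # stage 1 feeds...
histogram = dict(run_stage(counts, hist_map, sum_reduce))  # ...stage 2
print(histogram)  # → {2: 3, 1: 4}: three 3-mers occur twice, four occur once
```

Even this two-stage pipeline needs explicit glue between stages; a realistic genomics workflow chains many more, which is the programming burden Trunnell describes, and each new algorithm means rebuilding the chain.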
Peter Bergethon, the head of neuroinformatics and computational neurology at Pfizer Inc.'s research unit in Cambridge, Mass., has his own set of problems. The biologists researching the brain create models so complex that they drive the engineers crazy; the engineers, in turn, simplify the models so much to make the computations tractable that errors creep into the system. In the brain drug business, that won't work.