Big Data Seminar: February 15, 2013
The challenges of big data are beyond storage; the true challenges are handling and analyzing the data.
Kris Joshi, Global Vice President Healthcare at Oracle Health Sciences, opened the panel with a big data challenge in a sector far from life sciences: the US Border Patrol. The Border Patrol is tracking and analyzing approximately 50 billion transactions per day in real time. The good: if there is any cross-correlation of a risk, the appropriate personnel can be notified. The bad: if there is any failure at a center, every border must shut down within 30 minutes.
He then asked the panel (Peter Bergethon, PhD, Head of Neuroinformatics and Computational Neurology, Pfizer; Michele Clamp, Interim Director of Research Computing, Director of Informatics & Scientific Applications, Harvard University; and Matthew Trunnell, CIO, Broad Institute) what big data challenges they faced in their respective roles.
There was broad consensus that data is coming in too fast to scale algorithmic approaches, interact with the data, analyze it, and then repeat the process to get a better output. Within Broad and Harvard, the infrastructure is developed for general-purpose use, and that is no longer a scalable model. Further, data sets in the life sciences sector are immensely different, and often it is not the volume but the variety that causes issues. Michele and Matthew feel the best solution would be to move from relational databases to fuzzier data models. Kris countered that such models, while flexible, lose the traceability of the data that many federal agencies require.
Peter cited the communication barriers between analysts and biologists as another major challenge. As a neurologist, his focus is on the need for big data to help constrain the data sets so they can be put in an environment where they can be predictive.
Asked about the best opportunities to improve alignment, the panel pointed to obtaining more accurate raw data that analysts can trust. In addition, they feel that visualization software is still not where it needs to be for biologists to make sense of the data. This further reinforces the communication barrier between analysts and statisticians on one side and scientists on the other.
On the volume issue, many of the panel members are moving, or plan to move, some of their data into the cloud so that data sources can be shared and collaborated upon. Once taboo, the cloud is beginning to be embraced among life science specialists.
Not surprisingly, the discussion moved to the ever-increasing issue of talent shortage. Not only is there a small world of people who really understand Hadoop, but within the life sciences/biological industry the engineers also need to be able to communicate with the scientists. What is the solution?
Peter discussed how Pfizer offers onsite training, but the costs are increasing so quickly and so much that they have begun outsourcing, and he feels that is the future. Matthew added that there is a huge need for statistical geneticists; currently they just don’t exist.
There is also a cultural component to the skills gap problem: a limited sense of the value of sharing and explaining data to someone else. The data and the algorithms are proprietary, and that limits collaboration. There is also a need for transparency on the application side, which is why many, like the Broad and Harvard, are using software with open licensing. They need to understand how to manipulate the tools to integrate the data sets properly.
It seems that in the end, with money, resources, and improving technology, it is not the infrastructure that causes the true challenges of big data, but the disparate data sets, the ability of scientists to understand what the data is saying, and a major skills gap between analysts and scientists.