Tuesday, April 10, 2012

A Day in the Life of a Data Scientist

Today's Big Data Cluster seminar, Data Science: A Practitioner's Perspective, kicked off with our standard definition of Big Data: volume, variety and velocity. Interestingly, moderator Dave Menninger's research showed that, in the end, volume and velocity were by far what analytics organizations evaluating Big Data technology considered most important. In addition, Menninger's research shows that 90% of analytics organizations use more than one technology to analyze data, further evidence that one size does not fit all. And finally, whether as a sign of an emerging trend or of how difficult it is to carry out something so complex, predictive analytics remains at the bottom of the list of BI (business intelligence) capabilities.
Our panel today included Dan Dunn, Product Operations, HubSpot; Michael Kane, PhD, Associate Research Scientist, Yale Center for Analytical Sciences, Yale University; Mike Keohane, VP of Software Engineering, OwnerIQ; Ian Stokes-Rees, PhD, NEBioGrid, Harvard Medical School.

The first question put to the panelists laid the foundation by establishing the types of work they do, which is primarily exploratory data analysis using simple statistical methods. Michael Kane pointed out that as the methods become more complex, the mathematics becomes more predictive and harder to prove. Dan Dunn said that ultimately the most trustworthy outcomes rely on a combination of data variety and data integrity. This combination allows more dimensions to be added, which creates more links among the data points.

Another key theme of the day was how crucial visualization is, not only for the end user of the report, but also for the data scientist to review along the way. According to Ian Stokes-Rees, an interactive scatter plot with clear axis points is necessary to get complete buy-in from your stakeholders.

While there was some lively debate on the best technologies available, particularly between our academics, the panel was in complete agreement that they are all very difficult to use, which brought us to the skills-gap portion of the day. So, what makes a great data scientist? Someone with experience in mathematics/statistics and computers/programming, and a grasp of business environments. Thus, a true rarity. According to Mike Keohane, data scientists may have depth in only one of those areas, but if they have at least an understanding of the other two, they are still a valuable commodity.

In the end, data science is a team effort. Generally, the person performing the statistical analysis is not the same person who is administering the data or programming the technologies that process it. In addition, having more than one data scientist allows for checks and balances in the analysis. With so many tools and algorithms that can be applied, it is crucial to have a keen understanding of what questions should be asked and what questions are actually being asked.

The final question of the morning: what does the future hold? All panel members agreed that the volume of data will continue to increase, along with its importance in business intelligence.

1 comment:

Dan Dunn said...

I had a lot of fun with this panel. Interesting audience questions make it work. Thanks for inviting me!