Tuesday, December 18, 2012

Big Data in Motion

From: Gregory Piatetsky-Shapiro

On Dec 13, 2012 I attended a Big Data Cluster Roundtable: Big Data in Motion, organized by MassTLC: Mass Technology Leadership Council, @MassTLC.

Here are some of my observations and notes from the meeting.

MassTLC's snazzy new logo conveyed speed and digital know-how. Sara Fraim (@sarafraim) introduced Big Data as one of MassTLC's 9 clusters - others include Cloud, Digital Games, Energy, Healthcare, Mobile, Robotics, Sales & Marketing, and Software Development. One major MassTLC goal is to get synergy from cross-collaboration among people from different communities/clusters.

The roundtable was led by Eric Alterman, Co-Founder and CEO, Flow, and Eric Schnadig, CEO, Tervela.

About 25 people assembled to look at Big Data in Motion, which can be described as Big Data in (almost) real-time, or structuring Big Data streams, or combining Big Data with Complex Event Processing. 'Real-time' for Big Data can be milliseconds or minutes, depending on the application.

Eric Schnadig said that Tervela is focusing on very high-speed connectivity, and is actively used on Wall Street for high-performance trading. He observed that 2012 is the peak of the Big Data hype cycle, and predicted that in 2013 the conversation will shift to more defined segments. The focus of Tervela is Big Data in Motion, or dealing with Big Data I/O - how to move it?

The main issues in moving Big Data are the same as with regular data: bandwidth, security, fault-tolerance.

An audience member remarked that because of the size of Big Data it is easier to move compute to data than to move data to compute.

This concept has also been described as Data Gravity - bigger data is harder to move, and has a stronger pull on applications.

However, Big Data in Motion also deals with capturing data in real-time, reacting to it properly, and, if needed, creating parallel streams, e.g. one to a traditional DW, another to backup for compliance purposes, and another to a compute engine for decisions.
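The idea of splitting one incoming stream into parallel streams can be sketched in a few lines of Python. This is only an illustration, not how Tervela or Flow implement it: the `fan_out` function, the three sink functions, the `amount` field, and the 1000 threshold are all hypothetical names and values made up for the example.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, int]
Sink = Callable[[Event], None]


def fan_out(events: Iterable[Event], sinks: List[Sink]) -> int:
    """Deliver every event to every sink; return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(event)
        count += 1
    return count


# Three in-memory stand-ins for the destinations named in the text.
warehouse: List[Event] = []
compliance_backup: List[Event] = []
decisions: List[tuple] = []


def to_warehouse(event: Event) -> None:
    warehouse.append(event)


def to_backup(event: Event) -> None:
    # Store a copy so later changes to the event do not alter the backup.
    compliance_backup.append(dict(event))


def to_decision_engine(event: Event) -> None:
    # Hypothetical rule: flag large amounts for review.
    if event["amount"] > 1000:
        decisions.append(("review", event["id"]))


events = [{"id": 1, "amount": 250}, {"id": 2, "amount": 5000}]
processed = fan_out(events, [to_warehouse, to_backup, to_decision_engine])
```

In a real system each sink would more likely be a queue or a network connection, and delivery would be asynchronous so that a slow sink does not hold up the others.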

It was also observed that analytics is not the only application for Big Data. Sometimes communication is a better use case. For example, incident information can be visualized in real-time on the map.

Eric Alterman talked about the importance of creating context. Once you have a customer complaint, you then need to retrieve previous complaints and route the case to the appropriate department for action.

Big Data processing is frequently embarrassingly parallel: pieces of the data can be handled independently of one another.
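As a rough sketch of what "embarrassingly parallel" means in practice: each chunk of records below is counted independently, and the partial results are merged only at the end. The record layout, the `count_events` and `parallel_count` names, and the choice of a thread pool are assumptions made for this example; for CPU-bound work in Python a process pool would usually be the better fit.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_events(chunk):
    """Count event types in one chunk; needs no data from other chunks."""
    return Counter(record["type"] for record in chunk)


def parallel_count(chunks, max_workers=4):
    """Process chunks independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(count_events, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


chunks = [
    [{"type": "click"}, {"type": "view"}],
    [{"type": "click"}],
    [{"type": "view"}, {"type": "view"}],
]
totals = parallel_count(chunks)
```

Because no chunk depends on another, adding more workers (or machines) is the main way to go faster, which is why this pattern suits large data sets.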

There are many use cases for processing data while it is in flow, and storage is only the end-point.

There are many use cases for dealing with Big Data in real-time, but information needs to be indexed first for fast access.

For healthcare IT, the data lifecycle is 60-90 days - a glacial pace.

Splunk indexes information as it arrives. Indexing technology like B-trees is 40 years old, but there are more modern methods like LSM trees (Log-Structured Merge Trees).
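To make the LSM idea more concrete, here is a toy sketch in Python. It is not Splunk's implementation or any production design: the `TinyLSM` class, its `memtable_limit` parameter, and the in-memory "runs" standing in for on-disk files are all assumptions for illustration. Writes go to a small in-memory table, which is flushed as a sorted, immutable run when it fills up; reads check the newest data first; compaction merges runs.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge index held entirely in memory."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, unsorted
        self.runs = []              # flushed (keys, values) runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable run.
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.runs.append((keys, values))
        self.memtable = {}

    def get(self, key, default=None):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest run first so recent writes shadow older ones.
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return default

    def compact(self):
        # Merge all runs into one; newer runs overwrite older entries.
        merged = {}
        for keys, values in self.runs:
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.runs = [([k for k, _ in items], [v for _, v in items])] if items else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable is full, so it is flushed as the first run
db.put("a", 10)
db.put("c", 3)    # flushed as the second run
latest_a = db.get("a")
db.compact()
```

The appeal for data in motion is that writes are cheap appends and sorted flushes rather than in-place updates of a tree, at the cost of reads that may need to look in several runs until compaction catches up.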

An important use case of Big Data is one where the receiver of information is a machine: the end user may be an app rather than a human.

Companies will be more successful if they can have mid-level programmers deal with Big Data, not data scientists (lots of work on such tools now).

Ad re-targeting - a good use case for Big Data in 'real-time'.

The old centralized Data Warehouse, built to reduce duplication, does not make sense today when storage is so cheap. There is a move away from one centralized DW to many smaller DWs.

Another use case (per the NYT): large e-tailers are monitoring and changing prices in real-time.

I asked a question: which parts of the Big Data market have the most hype and are due for a correction?

Some responses were:
  • cloud analytics - there are too many platforms, and it is too expensive to move data. One possible exception is Amazon Redshift, which is a fast, petabyte-scale data warehouse service in the cloud. Cloud analytics companies can succeed if they are on the AWS platform.
  • lift (obtained from Big Data Analytics) and the quality of analytics are also over-hyped. See also my HBR blog post.
However, others observed that Big Data Analytics is still in its infancy - there is still a big opportunity.

An underhyped topic is how to store Big Data for a very long time, so that it will survive frequent changes of format.

Overall, an interesting meeting and a good discussion!

Some of the tweets during the event:
  • Lawrence @schwartzlaws Indexing is a bottleneck for Big Data in motion. Need to look at alternatives to traditional B-Trees - at #BigData Roundtable @MassTLC
  • Sara Fraim @sarafraim Not about analyzing each piece but using all data from mult sources at once and streaming to right place @flow @masstlc
  • Sara Fraim @sarafraim the next 'gem' in big data is moving the data from place to place, processing, and rerouting rapidly @masstlc #bigdata
