Friday, February 17, 2012

How to satisfy the demand for Big Analytics on Big Data? Different approaches showcase the strength of the Big Data industry in Massachusetts.

On February 15th I attended the Big Data Disruption Summit at the appropriately named Microsoft NERD, in the high-tech glass building overlooking the Charles River and Boston.

Big Data clearly has a lot of buzz this year, and is approaching the top of the hype cycle according to Gartner. MassTLC's recent report Big Data and Analytics: A Major Market Opportunity for Massachusetts, identified 100+ Big Data related companies in Massachusetts.

The meeting was packed with business leaders, entrepreneurs, and venture capitalists, and the data scientists who were present were very much in demand.

The event was opened by Dr. Michael Stonebraker, a leading database researcher for over 30 years and a serial entrepreneur - he started several successful database companies including Ingres, Illustra, Vertica, VoltDB, and most recently Paradigm4. Michael succeeds at doing several things at once, so some described him as a "parallel" entrepreneur. Click Here to view Michael's presentation on 'What is Big Data'.

After an overview of the 3 Vs of Big Data: Volume, Velocity, and Variety, Stonebraker argued that the focus of analytics in the past was "Little Analytics" on Big Volume, like finding an average closing price of MSFT on all trading days in the last 3 years - a request easily expressed in SQL.
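To make the "Little Analytics" case concrete, here is a minimal sketch of that kind of request, run through Python's built-in sqlite3 module. The table name `daily_prices`, its columns, and the prices are hypothetical placeholders chosen for illustration, not real quotes and not anything shown in the talk.

```python
import sqlite3

def average_close(conn, symbol, start_date):
    """Average closing price for one symbol on or after start_date (ISO date string)."""
    row = conn.execute(
        "SELECT AVG(close) FROM daily_prices "
        "WHERE symbol = ? AND trade_date >= ?",
        (symbol, start_date),
    ).fetchone()
    return row[0]

# Hypothetical table layout and placeholder prices (not real market data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_prices (symbol TEXT, trade_date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO daily_prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-02-13", 30.0),
        ("MSFT", "2012-02-14", 31.0),
        ("MSFT", "2012-02-15", 32.0),
        ("IBM", "2012-02-15", 190.0),
    ],
)

# Three years back from the date of the summit.
msft_avg = average_close(conn, "MSFT", "2009-02-15")
print(msft_avg)
```

The whole request is a single aggregate over a filtered set of rows, which is the kind of work SQL engines handle well even at large volume.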

Now there is demand for "Big Analytics" on Big Data, which may include complex math operations, such as machine learning or clustering. Dr. Stonebraker argued that most of these can be specified as linear algebra operations on array data. A typical inner loop in such algorithms may include matrix multiplication, singular value decomposition (SVD), or linear regression.

He gave an example of Big Analytics, where you need to compute the covariance between closing prices of stocks. Covariance can be expressed easily as array operations, but not so easily in SQL. Imagine computing covariance for all pairs of stocks on the New York Stock Exchange for the last 1000 days. If you could do this, then do it for hourly prices, etc. Click Here to view Michael Stonebraker's presentation on 'How to do Complex Analytics.'
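As a rough illustration of why the covariance example fits array operations so naturally, the sketch below computes the covariance for all pairs of stocks with one matrix product in NumPy. The function name `pairwise_covariance`, the 50-stock universe, and the random "prices" are assumptions made for illustration; this is not the code or the system from the presentation.

```python
import numpy as np

def pairwise_covariance(prices):
    """Sample covariance for every pair of columns.

    prices has shape (n_days, n_stocks); the result has shape
    (n_stocks, n_stocks), with entry (i, j) the covariance of stocks i and j.
    """
    prices = np.asarray(prices, dtype=float)
    # Subtract each stock's mean closing price from its column.
    centered = prices - prices.mean(axis=0)
    # One matrix multiplication covers all pairs at once.
    return centered.T @ centered / (prices.shape[0] - 1)

# Synthetic placeholder data: 1000 trading days, 50 hypothetical stocks.
rng = np.random.default_rng(0)
prices = rng.normal(loc=100.0, scale=5.0, size=(1000, 50))

cov = pairwise_covariance(prices)
print(cov.shape)
```

The same result in SQL would need a self-join over every pair of symbols, which is where the array formulation becomes the more natural one.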

Christopher Ahlberg, CEO of Recorded Future (and co-founder of Spotfire, sold to Tibco in 2007) talked about the unstructured web as the most compelling source of predictive information. He described a project where they are monitoring South American cities for potential unrest by scanning documents from 70,000 sources, with the need to visualize the results quickly. He described the evolution of their architecture from a key-value store to MongoDB + Sphinx. They have many users in finance, and interestingly, processing the overnight accumulation of information is very important for the trading signal in the first second of trading in New York.

Other speakers in the first panel outlined different approaches. Fritz Knabe of Netezza talked about new possibilities enabled by the rapidly falling price of flash storage. When terabytes of memory cost about $1000, much different and faster architectures become possible. However, the seamy underside of the advent of flash is that the bottleneck moves from storage to power supply. This leads to interesting ideas like microservers, which have 8 servers on one board. Click Here to view Fritz's presentation.

Mark Watkins, co-founder of Goby and currently General Manager, Entertainment Content at Telenav, talked about mobile applications. His company is a pioneer in location services and provides a traffic-aware routing engine, which learns from traffic behavior. He also described an already deployed mobile recommendation system at Telenav, which can recommend interesting restaurants, events, and activities to you based on your interests and the data it has. Click Here to view Mark's presentation.

The first half was followed by the keynote presentation by Deepak Advani, Vice President, Business Analytics, Products and Solutions, IBM. He gave a very good overview of the many use cases where analytics and IBM technology produce good results, from IBM Watson technology now being applied to improve health care diagnoses, to The Oscar Senti-meter, which provides sentiment analysis of Twitter messages about Oscar nominations. Click Here to view Deepak's presentation.

The second half of the event was focused on case studies of 4 start-ups and their learning experiences and challenges, with Andy Palmer, Startup Specialist, serving as the moderator.

Kicking off the panel was Bill Simmons, CTO, DataXu. He talked about how their company tracks ad performance by using anonymous cookies. They build models to predict which ad impressions will lead to purchasing activity, and this is hard because for a million impressions there may be only hundreds of purchases (a very unbalanced class distribution). However, the ad cost is low and the economics work, as their model is 2-3 times more accurate than random ads. Their software stack includes Hadoop, Hive, Postgres, HBase, and Greenplum.

Alan Hoffman, Founder & President, Cloudant, talked about his experience as a physicist, where he dealt with GB/sec of particle data. His company provides a NoSQL data layer service and uses CouchDB. He suggested there is no big magic solution to Big Data, but lots of small useful solutions.

Puneet Batra, Chief Data Scientist, Kyruus, also has a physics background. Kyruus wants to provide rating information for providers - see Kyruus Aims to Become the Bloomberg for Hospitals.

George Radford, Field CTO, EMC Greenplum, talked about adjusting to the EMC acquisition of Greenplum. He commented that the last thing you want to do with big data is move it.

Andy Palmer wrapped up the session by emphasizing the need to think about a continuous upgrade path. Since design patterns in the system change rapidly, he argued for an MPP shared-nothing architecture, which scales well.

I asked the panel if they thought the potential of Big Data was overhyped.

Bill Simmons (DataXu) said that their method works and is able to improve accuracy by a factor of 2 to 3. However, the cost also needs to be considered - what works in the US, where media is expensive, would not be cost-effective in China, where media is cheap. Puneet Batra (Kyruus) suggested that one of the results of big data would be exposing bad decisions made without it, which may bring more rationality into business decisions.

Another question was whether, as a result of changes in privacy practices driven by Facebook, we will see changes in medical privacy as more medical data becomes available for sharing. The panel felt that a more likely driver of personal medical data sharing was the Quantified Self movement.

The meeting had a lot of energy and showcased the depth of the Big Data industry in Massachusetts. New ideas will likely percolate to new start-ups!

Guest Blogger Gregory Piatetsky, Editor, KDnuggets (twitter: @kdnuggets)


erwicker said...

Brilliant! Thanks so much

Lena Harris said...

In a data industry, core business runs through document archiving. It is important that firms involved in this kind of venture have the most up-to-date resources.