Thursday, May 22, 2014

How to select a Hadoop distro - stop thinking about Hadoop

Scoop, Flume, PIG, Zookeeper.  Do these mean anything to you?  If they do then the odds are you are looking at Hadoop.  The thing is that while that was cool a few years ago it really is time to face it that HDFS is a commodity, Map Reduce is interesting but not feasible for most users and the real question is how we turn all that raw data in HDFS into something we can actually use.

That means three things

  1. Performant and ANSI compliant SQL matters - if you aren't able to run traditional reporting package then you are making people change for no reason.  If you don't have an alternative then you aren't offering an answer
  2. Predictive analytics, statistical, machine learning and whatever else they want - this is the stuff that will actually be new to most people
  3. Reacting in real-time - and I mean FAST, not BI fast but ACTUALLY fast
The last one is about how you ingest data and then perform real time analytics which are able to incorporate forecasting information from Hadoop into real-time feedback that can be integrated into source systems.

So Hadoop and HDFS are actually the least important in your future, its critical but not important.  I've seen people spend ages looking at the innards rather than just getting on and actually solving problems.  Do you care what your mobile phone network looks like internally?  Do you care what the wiring back to the power station looks like?  HDFS is that for Data, its the critical substrate, something that needs to be there.  But where you should concentrate your efforts is on how it supports the business use cases above.

How does it support ANSI compliant SQL, how does it support your standard reporting packages.  How will you add new types of analytics, does it support the advanced analytics tools your business already successfully uses?  How does it enable real-time analytics and integration?  

Then of course its about how it works within your enterprise, so how does it work with data management tools, how does its monitoring fit in with your existing tools.  Basically is it a grown-up or a child in the information sand-pit.

Now this means its not really about the Hadoop or HDFS piece itself, its about the ecosystem of technologies into which it needs to integrate.  Otherwise its going to just be another silo of technologies that don't work well with everything else and ultimately doesn't deliver the value you need.

No comments: