Friday, July 8, 2016

Kudu or not Kudu

Apache Kudu is still in incubation, yet it is already getting a lot of traction in the Hadoop community. Kudu is intended to “complete” Hadoop's storage layer to enable fast analytics on fast data. Let’s look at the 3 W’s of Kudu.

Why Kudu?
Our Hadoop community has used HDFS extensively over the last decade to solve big data’s “volume” challenge. It gives us fast batch processing and high throughput, BUT shows high latency for random access.
Then we tried HBase to solve big data’s “velocity” challenge. That gives us random read/write and low latency, BUT relatively low throughput and no batch processing.
A solution was needed to bridge this gap… and that's what Kudu is expected to do.

What is Kudu?
You can imagine Kudu as HBase with underlying Parquet-like storage. Though Kudu can work independently, without MapReduce or YARN, it was designed to fit in with the Hadoop ecosystem. For the end user, Kudu is a storage system for tables of structured data. Like HDFS, it can also be easily integrated with MapReduce jobs. The design goals that Kudu aimed to address were:
·        Strong performance for both scan and random access to help customers simplify complex hybrid architectures
·        High CPU efficiency in order to maximize the return on investment that our customers are making in modern processors
·        High IO efficiency in order to leverage modern persistent storage
·        The ability to update data in place, to avoid extraneous processing and data movement
·        The ability to support active-active replicated clusters that span multiple data centers in geographically distant locations


When to use Kudu?
·       For HBase-like storage of data that must be available in real time.
·       For Parquet-like analytic workloads on real-time data.
·       For query combinations that work across large historical data and small real-time data.
·       When your tables have a SQL-like schema.
·       When you want to use a NoSQL-like API (Insert, Delete, Scan, etc.).
HDFS and Kudu can co-exist on your Hadoop cluster at the same time. The best practice is to use them side by side as needed.
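
To make the last two bullets above concrete, here is a minimal Python sketch of what "a SQL-like schema plus a NoSQL-like API" looks like from the user's side. It is a toy in-memory model and NOT Kudu's client API: the ToyTable class and its insert, update, delete and scan methods are hypothetical names made up for this illustration, and nothing here talks to a real cluster.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a typed table with a primary key.

    Illustration only: this is not the Kudu client API.
    """
    schema: Dict[str, type]   # column name -> Python type (the "SQL-like schema")
    key: str                  # name of the primary-key column
    rows: Dict[Any, Dict[str, Any]] = field(default_factory=dict)

    def _check(self, row: Dict[str, Any]) -> None:
        # Every row must carry exactly the declared columns, with the declared types.
        if set(row) != set(self.schema):
            raise ValueError("row columns do not match the schema")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"column {col!r} expects {typ.__name__}")

    def insert(self, row: Dict[str, Any]) -> None:
        self._check(row)
        if row[self.key] in self.rows:
            raise KeyError("duplicate primary key")
        self.rows[row[self.key]] = dict(row)

    def update(self, key_value: Any, **changes: Any) -> None:
        # Update in place: look the row up by key and overwrite the given columns.
        if self.key in changes:
            raise ValueError("the primary key cannot be changed")
        row = self.rows[key_value]
        self._check({**row, **changes})
        row.update(changes)

    def delete(self, key_value: Any) -> None:
        del self.rows[key_value]

    def scan(
        self,
        columns: Optional[List[str]] = None,
        predicate: Optional[Callable[[Dict[str, Any]], bool]] = None,
    ) -> Iterator[Tuple[Any, ...]]:
        # Scan in primary-key order, keeping only the requested columns
        # and only the rows that satisfy the optional predicate.
        cols = columns or list(self.schema)
        for k in sorted(self.rows):
            row = self.rows[k]
            if predicate is None or predicate(row):
                yield tuple(row[c] for c in cols)


# Usage: a small "metrics" table keyed by host name.
metrics = ToyTable(schema={"host": str, "cpu": float}, key="host")
metrics.insert({"host": "a", "cpu": 0.25})
metrics.insert({"host": "b", "cpu": 0.90})
metrics.update("a", cpu=0.75)   # random-access update in place
metrics.delete("b")             # random-access delete
remaining = list(metrics.scan(columns=["host", "cpu"]))
```

The point of the sketch is the shape of the interface: rows are validated against a declared schema, as in a SQL table, but they are read and written through key-based operations and scans rather than through SQL statements.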

How to start with Kudu?
Since its first public release on Sep 28, 2015, Kudu has been available for use with CDH 5.4.7 or later.

The full press release is here. A blog post on Kudu is here. A dedicated Kudu website is here. An academic paper on it is here. Also, a public beta of Kudu is live now on GitHub.