Apache Kudu is still in incubation, but it is already getting a lot of traction in the Hadoop community. Kudu is intended to “complete” Hadoop's storage layer to enable fast analytics on fast data. Let’s look at the 3 W’s of Kudu.
Why Kudu?
Our Hadoop community has used HDFS extensively over the last decade to solve big data’s “volume” challenge. HDFS gives us fast batch processing and high scan throughput, BUT shows high latency for random access.
Then we tried HBase to solve big data’s “velocity” challenge. HBase gives us random read/write and low latency, BUT relatively low scan throughput, which makes it a poor fit for batch processing.
A solution was required to bridge this gap, and that's what Kudu is expected to do.
What is Kudu?
You can think of Kudu as HBase with underlying Parquet-like storage. Though Kudu can work independently, without MapReduce or YARN, it was designed to fit in with the Hadoop ecosystem. For the end user, Kudu is a storage system for tables of structured data. Like HDFS, it can also be easily integrated with MapReduce jobs. The design goals that Kudu aimed to address were:
· Strong performance for both scan and random access to help customers simplify complex hybrid architectures
· High CPU efficiency in order to maximize the return on investment that our customers are making in modern processors
· High IO efficiency in order to leverage modern persistent storage
· The ability to update data in place, to avoid extraneous processing and data movement
· The ability to support active-active replicated clusters that span multiple data centers in geographically distant locations
When to use Kudu?
· For HBase-like data storage that must be processed in real time.
· For Parquet-like workloads on real-time data.
· For combinations of queries that work on both large historical data and small real-time data.
· When your tables have a SQL-like schema.
· When you want to use a NoSQL-like API (Insert, Delete, Scan, etc.).
HDFS and Kudu can co-exist on your Hadoop cluster at the same time. The best practice is to use them side by side as needed.
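To make the "SQL-like schema plus NoSQL-like API" idea above more concrete, here is a minimal in-memory sketch in Python. It is NOT the Kudu client API: the class ToyTable and all of its methods are hypothetical names made up for illustration. It only models a table with typed columns and a primary key, supporting insert, update in place, delete, and a scan with column projection and an optional row filter. Returning rows in primary-key order is a choice made for this toy.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, Optional, Sequence, Tuple

Row = Dict[str, Any]


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table with a schema and a primary key."""
    schema: Dict[str, type]                      # column name -> Python type
    key: str                                     # name of the primary-key column
    rows: Dict[Any, Row] = field(default_factory=dict)

    def _check(self, row: Row) -> None:
        # Enforce the SQL-like part: every row must match the declared schema.
        if set(row) != set(self.schema):
            raise ValueError("row columns do not match the schema")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"column {col!r} expects {typ.__name__}")

    def insert(self, row: Row) -> None:
        self._check(row)
        pk = row[self.key]
        if pk in self.rows:
            raise KeyError(f"duplicate primary key: {pk!r}")
        self.rows[pk] = dict(row)

    def update(self, pk: Any, **changes: Any) -> None:
        # Update in place: only the changed columns are overwritten.
        if pk not in self.rows:
            raise KeyError(f"no row with primary key: {pk!r}")
        new_row = {**self.rows[pk], **changes}
        self._check(new_row)
        if new_row[self.key] != pk:
            raise ValueError("the primary key cannot be changed")
        self.rows[pk] = new_row

    def delete(self, pk: Any) -> None:
        del self.rows[pk]

    def scan(
        self,
        columns: Optional[Sequence[str]] = None,
        predicate: Optional[Callable[[Row], bool]] = None,
    ) -> Iterator[Tuple[Any, ...]]:
        # Project only the requested columns and filter rows with the predicate.
        cols = list(columns) if columns is not None else list(self.schema)
        for pk in sorted(self.rows):
            row = self.rows[pk]
            if predicate is None or predicate(row):
                yield tuple(row[c] for c in cols)
```

A short usage example under the same assumptions:

```python
metrics = ToyTable(schema={"host": str, "cpu": float}, key="host")
metrics.insert({"host": "b", "cpu": 0.9})
metrics.insert({"host": "a", "cpu": 0.2})
metrics.update("a", cpu=0.4)                      # random-access update in place
busy = list(metrics.scan(columns=["host"], predicate=lambda r: r["cpu"] > 0.5))
```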
How to start with Kudu?
Since its first public release on Sep 28, 2015, Kudu has been available to use with CDH 5.4.7 or later.
The full press release is here. A blog post on Kudu is here. A dedicated Kudu website is here. An academic paper on it is here.
Also, a public beta of Kudu is live now on GitHub.