Wednesday, December 30, 2015

Apache Flink: new competition for Spark?

        On November 27th 2015, Apache released a new version of its big data processing engine, Flink. With broadly similar capabilities, Flink offers some advantages over Apache Spark that may make it a serious competitor in the near future.

What’s common?
  • Both are general-purpose data processing platforms and top level projects of Apache.
  • Common Libraries :
    • SQL queries (Spark: Spark SQL, Flink: MRQL),
    • Graph processing (Spark: GraphX, Flink: Spargel and Gelly),
    • Machine learning (Spark: MLlib, Flink: Flink ML)
    • Stream processing (Spark: Streaming, Flink: Streaming).
  • Both are capable of running in standalone mode, yet many users run them on top of Hadoop (YARN, HDFS) or Mesos.
  • Both provide high performance thanks to their in-memory operations.
  • Both target similar applications and use cases in the big data area.

Advantages over Spark:
  • Spark is not a pure stream-processing engine: it works by micro-batching, running fast batch operations over the small slice of data that arrived during each unit of time. Flink, by contrast, processes each event as it arrives. For low-latency big data applications every millisecond matters, and this is where Flink takes the lead.
  • In other words, Spark is a batch processing framework that can approximate stream processing, whereas Flink is primarily a stream processing framework that can also look like a batch processor.
  • Flink comes with an aggressive optimization engine. Similar to a SQL database's query planner, the Flink optimizer analyzes the program submitted to the cluster and produces what it considers the best pipeline for running on that particular setup.
  • Flink can run existing MapReduce jobs directly on its execution engine, providing an incremental upgrade path, and it can let YARN manage the cluster resources.
  • Flink allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently. With a little reworking of your code to give the optimizer some hints, it can increase performance even further by performing delta iterations only on the parts of your data set that are changing.
  • Flink manages its own memory with a dedicated memory management system, separate from Java’s garbage collector, which helps it avoid long GC pauses.
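
The latency difference between micro-batching and per-event processing can be illustrated with a toy Python simulation (this is not Spark's or Flink's actual API, just a sketch of the two models with made-up numbers):

```python
import statistics

# Toy model: one event arrives per millisecond; we measure how long each
# event waits before it is processed under the two models.
events = list(range(100))  # event i arrives at t = i milliseconds

def micro_batch_latencies(batch_interval_ms):
    """Spark-style micro-batching: an event waits until its batch window closes."""
    latencies = []
    for t in events:
        batch_end = ((t // batch_interval_ms) + 1) * batch_interval_ms
        latencies.append(batch_end - t)
    return latencies

def per_event_latencies(processing_cost_ms=1):
    """Flink-style streaming: each event is handled as soon as it arrives."""
    return [processing_cost_ms for _ in events]

print("micro-batch (500 ms) mean latency:",
      statistics.mean(micro_batch_latencies(500)), "ms")
print("per-event mean latency:",
      statistics.mean(per_event_latencies()), "ms")
```

With a 500 ms batch interval, the average event sits idle for hundreds of milliseconds waiting for its batch to close, while the per-event model's latency is just the processing cost itself.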
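
To give a feel for what a query-planner-style optimizer does, here is a minimal Python sketch of one classic rewrite, pushing a filter below a join so fewer rows reach the expensive operator. The plan representation and `optimize` function are invented for illustration; Flink's real optimizer works on a much richer plan graph:

```python
# Toy logical plan: a list of (operator, argument) steps, executed left to right.
# Note: a real optimizer would first check that the filter only references one
# side of the join before pushing it down; this sketch skips that check.

def optimize(plan):
    """Repeatedly move any 'filter' step that appears after a 'join' to before it."""
    optimized = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(optimized) - 1):
            if optimized[i][0] == "join" and optimized[i + 1][0] == "filter":
                optimized[i], optimized[i + 1] = optimized[i + 1], optimized[i]
                changed = True
    return optimized

plan = [("scan", "orders"), ("join", "customers"), ("filter", "country = 'DE'")]
print(optimize(plan))  # the filter now runs before the join
```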
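
The delta-iteration idea, reprocessing only the elements that changed in the previous round, can be sketched with a toy connected-components computation in Python (the graph and loop are illustrative, not Flink's actual delta-iteration API):

```python
# Label propagation for connected components. Instead of rescanning every
# vertex each round (a bulk iteration), we keep a "workset" of vertices whose
# label just changed and only visit their neighbours in the next round.

edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = {v: v for v in edges}   # each vertex starts in its own component
workset = set(edges)             # initially every vertex counts as "changed"

while workset:
    next_workset = set()
    for v in workset:
        for n in edges[v]:
            if labels[v] < labels[n]:    # propagate the smaller component id
                labels[n] = labels[v]
                next_workset.add(n)      # only changed vertices re-enter the workset
    workset = next_workset

# Components {1, 2, 3} and {4, 5} collapse to labels 1 and 4.
print(labels)
```

The workset shrinks as the computation converges, so later iterations touch only a fraction of the data set, which is exactly the saving the bullet above describes.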
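
The core trick behind such self-managed memory is keeping records serialized in preallocated byte buffers instead of as individual heap objects the garbage collector must track. A minimal Python sketch of the idea (the segment size and record layout are invented; Flink implements this in Java with its own memory segments):

```python
import struct

SEGMENT_SIZE = 1024
segment = bytearray(SEGMENT_SIZE)   # one preallocated "memory segment"
RECORD = struct.Struct("<qd")       # fixed layout: 64-bit int key, 64-bit float value

def put(index, key, value):
    """Serialize a record directly into the segment at a fixed offset."""
    RECORD.pack_into(segment, index * RECORD.size, key, value)

def get(index):
    """Deserialize a record from the segment on demand."""
    return RECORD.unpack_from(segment, index * RECORD.size)

put(0, 42, 3.5)
put(1, 7, 1.25)
print(get(0))   # (42, 3.5)
```

Because the records live inside one long-lived buffer, the runtime controls exactly how much memory they occupy and the garbage collector never has to trace them individually.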

In short:
With Spark, everything is a batch job. Even streaming is a micro-batch.
With Flink, everything is a stream. Even a batch job is a long-running streaming job.