Best data processing engine: Flink vs Spark

Flink started as a research project at the Technical University of Berlin in 2009 and recently graduated to become a top-level Apache project.
Spark was originally developed in the AMPLab at UC Berkeley in 2009 and became an Apache top-level project in February 2014.

Although Flink is less well known than Spark, especially outside Europe, it is a promising project.
One of the most striking differences between Flink and Spark is probably that Flink handles true real-time data, processing each record as it arrives,
whereas Spark (with Spark Streaming) processes streams as micro-batches and is therefore only near real time.
Flink is also claimed to process batch workloads just as efficiently as Spark.
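
To make the contrast concrete, here is a minimal word-count sketch in Scala for each engine; the localhost socket source, port 9999, and the one-second batch interval are arbitrary illustrative choices rather than anything prescribed by either project. With Flink's DataStream API, every record is processed as soon as it arrives:

```scala
// Flink DataStream API: each incoming record is processed immediately.
import org.apache.flink.streaming.api.scala._

object FlinkStreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)          // example source
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)                                      // key by the word itself
      .sum(1)                                        // running count per word
      .print()
    env.execute("Flink streaming word count")
  }
}
```

With Spark Streaming, the same logic runs on micro-batches, here one second long, so results only become available at batch boundaries:

```scala
// Spark Streaming: records are grouped into one-second micro-batches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark streaming word count").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // micro-batch interval
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // counts within each batch
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```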

Another great feature of Flink is that it allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently. For example, delta iterations can be run only on the parts of the data that actually change between iterations, as sketched below.
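
The following sketch, a minimal connected-components-style job written against Flink's Scala DataSet API, shows the shape of a delta iteration; the hard-coded vertices and edges, the field indices, and the bound of 10 iterations are illustrative assumptions, not a complete program. The solution set holds the current label of every vertex, while the workset holds only the vertices whose label changed in the previous round, so unchanged parts of the data are never revisited.

```scala
// Sketch of a delta iteration with Flink's DataSet API: propagate the minimum
// label along edges, revisiting only vertices whose label changed last round.
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object DeltaIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Tiny made-up graph: (vertexId, label) pairs and bidirectional edges.
    val vertices: DataSet[(Long, Long)] =
      env.fromElements((1L, 1L), (2L, 2L), (3L, 3L), (4L, 4L))
    val edges: DataSet[(Long, Long)] =
      env.fromElements((1L, 2L), (2L, 3L), (3L, 4L))
        .flatMap(e => Seq(e, (e._2, e._1)))

    // Solution set: current label per vertex. Workset: vertices changed last round.
    val components = vertices.iterateDelta(vertices, 10, Array(0)) {
      (solution, workset) =>
        // Offer each changed vertex's label to its neighbours, keep the smallest.
        val candidates = workset
          .join(edges).where(0).equalTo(0) { (vertex, edge) => (edge._2, vertex._2) }
          .groupBy(0).min(1)
        // Emit an update only when the candidate label is an improvement.
        val updates = candidates
          .join(solution).where(0).equalTo(0) {
            (candidate, current, out: Collector[(Long, Long)]) =>
              if (candidate._2 < current._2) out.collect(candidate)
          }
        // Updates both refresh the solution set and form the next workset.
        (updates, updates)
    }

    components.print()
  }
}
```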

Flink can also run existing MapReduce jobs and can use Tez as an execution runtime.
It also has its own memory management system, separate from Java’s garbage collector, and provides an HTML viewer for debugging purposes.

Flink comes with extra libraries for machine learning (FlinkML) and graph processing (Gelly), just as Spark does with MLlib and GraphX. Future compatibility with Python (the Python API is currently in beta) or R will probably facilitate adoption by data scientists.
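
As a rough taste of what these libraries provide, here is a minimal k-means clustering job using Spark's MLlib (FlinkML follows a similar fit/predict style); the input file name, the choice of three clusters, and the iteration limit are made-up values for illustration.

```scala
// Minimal MLlib example: cluster numeric feature vectors with k-means.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("KMeans example").setMaster("local[2]"))

    // Each line of the (made-up) input file holds space-separated numeric features.
    val points = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Group the points into 3 clusters, using at most 20 iterations.
    val model = KMeans.train(points, 3, 20)

    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```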

Flink is gaining interest from large companies such as Spotify and Huawei and will certainly soon be one of the big players in the big data world.

Written by Jean-Baptiste Poullet

Data analyst – consultant – freelancer
Expert in Bigdata
Founder of RBelgium – R community in Belgium
Owner of the company Stat’Rgy
Contact me at jeanbaptistepoullet@statrgy.com
