Real-time data processing: Storm vs Spark
Real-time data processing has been made possible by tools like Storm and Spark Streaming (both top-level Apache projects).
You may wonder which one to choose. There are several things you should consider:
- scale
- latency
- iterative processing
- use what you know
- code reuse
- other language compatibility
- maturity
Batch vs Streaming
Storm and Spark do not do the same thing:
- Storm is a stream processing framework that also does micro-batching (Trident)
- Spark is a batch processing framework that also does micro-batching (Spark Streaming)
Stream processing means handling records one at a time, whereas micro-batching means processing small batches of records, which is still not one at a time. For a fair comparison with Spark Streaming, one should therefore compare against Storm Trident. Apache Storm offers two streaming APIs:
- Core Storm (Spouts and Bolts)
  - one at a time
  - lower latency
  - operates on tuple streams
- Trident (Streams and Operations)
  - micro-batch
  - higher throughput
  - operates on streams of tuple batches and partitions
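To make the one-at-a-time model concrete, here is a minimal sketch of a Core Storm bolt. Class and field names are illustrative, and the package names assume Storm 1.x or later (older releases used backtype.storm):

```java
// A minimal sketch of a Core Storm bolt: each call to execute() handles exactly one tuple.
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // One tuple in, one tuple out -- this per-tuple processing is what keeps latency low.
        String word = input.getStringByField("word");
        collector.emit(new Values(word.toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}
```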
As an illustration, a benchmark of Storm on 5 AWS nodes (m1.large), with 1 ZooKeeper, 1 Nimbus and 3 Supervisors, gave ~150k msg/sec at ~80 ms latency with the Core API, and ~300k msg/sec at ~250 ms latency. Higher throughput is possible at the cost of higher latency, and better performance is possible with bigger hardware.
Storm provides a lower-level API than Spark, with no built-in concept of look-back aggregations.
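For contrast, here is a minimal sketch of what a built-in look-back (windowed) aggregation looks like in Spark Streaming. The socket source, batch interval and window sizes are illustrative, and the flatMap signature shown assumes the Spark 2.x Java API:

```java
// A minimal sketch of a windowed (look-back) aggregation in Spark Streaming.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class WindowedCounts {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowedCounts");
        // Records are grouped into 1-second micro-batches rather than processed one at a time.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> words = ssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Look back over the last 30 seconds of data, recomputed every 10 seconds.
        JavaPairDStream<String, Integer> windowedCounts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

        windowedCounts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```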
Sources of data
- Storm can work with an incredibly large variety of sources (from the Twitter Streaming API to Apache Kafka to everything in between).
- Spark can also work with numerous disparate sources including HDFS, Cassandra, HBase, and S3.
Other language compatibility
- Storm is mainly written in Clojure. Spouts (sources of streams in a computation, e.g. the Twitter API) and bolts (which process input streams and produce output streams) can be written in almost any language, including non-JVM languages like R or Python; see the sketch after this list. Note that Storm Trident is only compatible with Java, Clojure and Scala.
- Spark is written in Scala and provides API support only for Scala, Java and Python.
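To make the multi-language point concrete, here is a sketch of a Storm bolt that delegates its processing to an external Python process through Storm's multilang protocol. The script name splitsentence.py is illustrative and would have to ship in the topology's resources directory:

```java
// A sketch of Storm multilang: the bolt's logic runs in a separate Python process.
import java.util.Map;
import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

public class SplitSentenceBolt extends ShellBolt implements IRichBolt {
    public SplitSentenceBolt() {
        // The actual tuple processing happens in the external Python script.
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```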
How to pick one?
Choosing between Storm or Spark will probably depend on your use case.
If your processing needs involve substantial requirements for graph processing, SQL access or batch processing, Spark would be preferred. Spark comes with a series of modules: one for streaming (Spark Streaming), one for machine learning (MLlib), one for graphs (GraphX), and one for SQL queries on structured data (Spark SQL).
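As a small, hedged illustration of the SQL module (the SparkSession API shown assumes Spark 2.x; the file and view names are illustrative):

```java
// A sketch of querying structured data with Spark SQL.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("SqlSketch").getOrCreate();

        // Load a JSON file as a table-like Dataset and query it with plain SQL.
        Dataset<Row> people = spark.read().json("people.json");
        people.createOrReplaceTempView("people");
        spark.sql("SELECT name FROM people WHERE age > 30").show();

        spark.stop();
    }
}
```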
If you do iterative batch processing (machine learning, drill-down, historical comparisons, etc.), you are better off with Spark, at least as long as the amount of data to be processed fits in the RAM available on your cluster.
If you need to combine batch with streaming, Spark would spare you much effort compared to Storm.
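A minimal sketch of that reuse, with illustrative paths and a socket source: the same RDD function serves a batch job and each micro-batch of a stream.

```java
// A sketch of sharing one transformation between batch and streaming jobs in Spark.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SharedLogic {
    // The transformation is written once, against RDDs.
    static JavaRDD<String> errorLines(JavaRDD<String> lines) {
        return lines.filter(line -> line.contains("ERROR"));
    }

    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "SharedLogic");

        // Batch: apply it to historical files.
        errorLines(sc.textFile("logs/")).saveAsTextFile("batch-errors/");

        // Streaming: apply the very same function to each micro-batch via transform().
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(5));
        ssc.socketTextStream("localhost", 9999)
           .transform(rdd -> errorLines(rdd))
           .print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```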
If you are already working on YARN, Spark is a better fit. Storm on YARN is still in its infancy (see here). Note that both Storm and Spark run on Mesos.
If you need to leverage code written in R or any other language not natively supported by Spark, then Storm has an advantage.
If latency must be under 1 sec, you should consider Storm Core.
No need to choose one: Apache Flink does both quite efficiently. ;) It has a Storm-like runtime with a high-level, Spark-like API that allows both batch analytics and real-time analytics with flexible windowing.
Flink would be my first choice, though it is still new on the market. A year from now it may well rule both worlds.