What are the use cases for Apache Spark vs Hadoop



With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introductory documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark than with Hadoop.


Posted 2014-06-17T20:48:35.267

Reputation: 481



Hadoop means HDFS, YARN, MapReduce, and a lot of other things. Do you mean Spark vs MapReduce? Because Spark runs on/with Hadoop, which is rather the point.

The primary reason to use Spark is speed, and this comes from the fact that its execution can keep data in memory between stages rather than always persisting it back to HDFS after a Map or Reduce. This advantage is very pronounced for iterative computations, which have tens of stages, each of which touches the same data. This is where things might be "100x" faster. For the simple, one-pass, ETL-like jobs for which MapReduce was designed, it's not in general faster.
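To make the in-memory argument concrete, here is a rough back-of-the-envelope cost model in Python. It is only a sketch: the unit costs are made-up illustrative numbers (not measurements), and the function names are hypothetical. It just shows why the gap grows with the number of passes over the same data and vanishes for a one-pass job.

```python
def mapreduce_style_cost(passes, read_cost, compute_cost, write_cost):
    """Every pass reads its input from disk and writes its output back."""
    return passes * (read_cost + compute_cost + write_cost)


def in_memory_style_cost(passes, read_cost, compute_cost, write_cost):
    """Read once, keep the working set in memory, write the result once."""
    return read_cost + passes * compute_cost + write_cost


# Assumed, illustrative unit costs in which disk I/O dominates compute.
READ, COMPUTE, WRITE = 50.0, 1.0, 50.0

for passes in (1, 10, 100):
    disk = mapreduce_style_cost(passes, READ, COMPUTE, WRITE)
    mem = in_memory_style_cost(passes, READ, COMPUTE, WRITE)
    print(f"{passes:>3} passes: speedup ~{disk / mem:.1f}x")
```

Under these assumptions a single pass costs the same either way, and the ratio only becomes large once many passes reuse the same data. Real jobs also pay for shuffles, scheduling, and memory pressure, which this model ignores.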

Another reason to use Spark is its nicer high-level language compared to MapReduce. It provides a functional programming-like view that mimics Scala, which is far nicer than writing MapReduce code (although you have to either use Scala or adopt the slightly-less-developed Java or Python APIs for Spark). Crunch and Cascading already provide a similar abstraction on top of MapReduce, but this is still an area where Spark is nice.
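As a rough illustration of that functional style, here is a word count written as a chain of transformations. This is plain Python, not Spark: `MiniDataset` is a hypothetical toy class that runs locally on a list, with method names loosely modeled on Spark's `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce


class MiniDataset:
    """Hypothetical toy stand-in for a distributed dataset; runs on a local list."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting iterables.
        return MiniDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        return MiniDataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return MiniDataset((key, reduce(fn, values)) for key, values in groups.items())

    def collect(self):
        return self.items


lines = ["spark runs on hadoop", "hadoop runs mapreduce"]

counts = (
    MiniDataset(lines)
    .flat_map(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)
print(sorted(counts))
```

The whole job reads as one pipeline of small functions, whereas a hand-written MapReduce job typically needs separate mapper and reducer classes plus driver code.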

Finally, Spark has as-yet-young but promising subprojects for ML, graph analysis, and streaming, which expose a similar, coherent API. With MapReduce, you would have to turn to several other projects for this (Mahout, Giraph, Storm). It's nice to have it all in one package, albeit not yet 'baked'.

Why would you not use Spark? Paraphrasing myself:

  • Spark is primarily Scala, with ported Java APIs; MapReduce might be friendlier and more native for Java-based developers
  • There is more MapReduce expertise out there now than Spark
  • For the data-parallel, one-pass, ETL-like jobs MapReduce was designed for, MapReduce is lighter-weight compared to the Spark equivalent
  • Spark is fairly mature, and so is YARN now, but Spark-on-YARN is still pretty new, and the two may not be optimally integrated yet. For example, until recently I don't think Spark could ask YARN for allocations based on number of cores. That is, MapReduce might be easier to understand, manage, and tune

Sean Owen

Posted 2014-06-17T20:48:35.267

Reputation: 5 987

Thanks for the clarification. Keeping data in memory sounds like it has some interesting implications. I'll read up on Spark's Resilient Distributed Dataset concept a bit more. – idclark – 2014-06-18T10:30:55.527

+1 for a really clear and useful answer for a lot of people who had this question, like me. – vefthym – 2014-06-20T09:20:30.793

Keep in mind that Sean Owen is a co-author of the new O'Reilly book on Spark. :-) – sheldonkreger – 2014-12-29T17:05:50.890


Not sure about YARN, but I think that Spark makes a real difference compared to Hadoop (advertised as 100 times faster) if the data fits comfortably in the memory of the compute nodes, simply because it avoids hard disk access. If the data doesn't fit in memory, there's still some gain because of buffering.


Posted 2014-06-17T20:48:35.267

Reputation: 599


It would be fair to compare Spark with MapReduce - Hadoop's processing framework. In the majority of cases, Spark may outperform MapReduce. The former enables in-memory data processing, which makes it possible to process data up to 100 times faster. For this reason, Spark is a preferred option if you need insights quickly, for example, if you need to

  • run customer analytics, e.g. compare the behavior of a customer with the behavior patterns of a particular customer segment and trigger certain actions;
  • manage risks and forecast various possible scenarios;
  • detect fraud in real-time;
  • run industrial big data analytics and predict anomalies and machine failures.

However, MapReduce is good at processing really huge datasets (if you are fine with the time required for processing). It's also a more economical solution, since MapReduce reads from and writes to disk, and disks are generally cheaper than memory.


Posted 2014-06-17T20:48:35.267

Reputation: 1


Good info, @Sean Owen. I would like to add one more point. Spark may help to build unified data pipelines in a Lambda architecture, addressing both the batch and streaming layers with the ability to write to a common serving layer. It is a huge advantage to reuse the logic between batch and streaming. Also, the streaming K-Means algorithm in Spark 1.3 is an added plus for ML, apart from the excellent job monitoring and process visualizations in 1.4.
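A minimal sketch of what reusing logic between the batch and streaming layers can look like, in plain Python rather than Spark. All function names and sample records here are hypothetical and invented for illustration; the point is only that one transformation function serves both paths.

```python
from collections import Counter


def count_events_by_user(events):
    """Shared business logic: count events per user in any iterable of records."""
    return Counter(event["user"] for event in events)


def batch_layer(all_events):
    """Batch path: one pass over the full historical dataset."""
    return count_events_by_user(all_events)


def streaming_layer(micro_batches):
    """Streaming path: apply the same logic to each micro-batch and merge the counts."""
    running = Counter()
    for batch in micro_batches:
        running.update(count_events_by_user(batch))
        yield dict(running)


# Hypothetical sample records, invented for illustration.
events = [
    {"user": "a", "action": "click"},
    {"user": "b", "action": "view"},
    {"user": "a", "action": "view"},
]

batch_view = batch_layer(events)
stream_views = list(streaming_layer([events[:2], events[2:]]))
print(dict(batch_view), stream_views[-1])
```

Once the stream has caught up, both paths produce the same counts, so either can feed a common serving layer without the logic being written twice.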

Srini Vemula

Posted 2014-06-17T20:48:35.267

Reputation: 39


Machine learning is a good example of a problem type where Spark-based solutions are light-years ahead of MapReduce-based solutions, despite the young age of Spark-on-YARN.

Max Gibiansky

Posted 2014-06-17T20:48:35.267

Reputation: 309

I don't think this is true, but I think I know what you're getting at: in-memory works a lot faster for iterative computation and a lot of ML is iterative. – Sean Owen – 2015-01-16T17:18:15.257