What are the use cases for Apache Spark vs Hadoop

27

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introductory documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark than with Hadoop.

idclark

Posted 2014-06-17T20:48:35.267

Reputation: 283

Answers

36

Hadoop means HDFS, YARN, MapReduce, and a lot of other things. Do you mean Spark vs MapReduce? Because Spark runs on/with Hadoop, which is rather the point.

The primary reason to use Spark is speed, and this comes from the fact that its execution can keep data in memory between stages rather than always persisting back to HDFS after a Map or Reduce. This advantage is very pronounced for iterative computations, which have tens of stages, each of which touches the same data. This is where things might be "100x" faster. For the simple, one-pass ETL-like jobs MapReduce was designed for, it's not in general faster.

Another reason to use Spark is its nicer high-level language compared to MapReduce. It provides a functional programming-like view modeled on Scala, which is far nicer than writing MapReduce code (although you have to either use Scala or adopt the slightly less developed Java or Python APIs for Spark). Crunch and Cascading already provide a similar abstraction on top of MapReduce, but this is still an area where Spark is nice.

Finally Spark has as-yet-young but promising subprojects for ML, graph analysis, and streaming, which expose a similar, coherent API. With MapReduce, you would have to turn to several different other projects for this (Mahout, Giraph, Storm). It's nice to have it in one package, albeit not yet 'baked'.

Why would you not use Spark? Paraphrasing myself:

  • Spark is primarily Scala, with ported Java APIs; MapReduce might be friendlier and more native for Java-based developers
  • There is more MapReduce expertise out there now than Spark
  • For the data-parallel, one-pass, ETL-like jobs MapReduce was designed for, MapReduce is lighter-weight compared to the Spark equivalent
  • Spark is fairly mature, and so is YARN now, but Spark-on-YARN is still pretty new. The two may not be optimally integrated yet. For example, until recently I don't think Spark could ask YARN for allocations based on number of cores. That is: MapReduce might be easier to understand, manage, and tune

Sean Owen

Posted 2014-06-17T20:48:35.267

Reputation: 3 640

thanks for the clarification. Keeping data in memory sounds like it has some interesting implications - I'll read up on Spark's Resilient Distributed Dataset concept a bit more. – idclark 2014-06-18T10:30:55.527

3 – +1 for a really clear and useful answer for a lot of people who had this question, like me. – vefthym 2014-06-20T09:20:30.793

2 – Keep in mind that Sean Owen is a co-author of the new O'Reilly book on Spark. :-) – sheldonkreger 2014-12-29T17:05:50.890

1

Not sure about YARN, but I think Spark makes a real difference compared to Hadoop (advertised as 100 times faster) if the data fits nicely in the memory of the compute nodes, simply because it avoids hard-disk access. If the data doesn't fit in memory, there's still some gain because of buffering.

iliasfl

Posted 2014-06-17T20:48:35.267

Reputation: 526

0

Machine learning is a good example of a problem type where Spark-based solutions are light-years ahead of MapReduce-based solutions, despite the young age of Spark-on-YARN.

Max Gibiansky

Posted 2014-06-17T20:48:35.267

Reputation: 301

1 – I don't think this is true, but I think I know what you're getting at: in-memory works a lot faster for iterative computation, and a lot of ML is iterative. – Sean Owen 2015-01-16T17:18:15.257

0

Good info @Sean Owen. I'd like to add one additional point: Spark can help build unified data pipelines in a Lambda architecture, addressing both the batch and streaming layers with the ability to write to a common serving layer. It is a huge advantage to reuse the logic between batch and streaming. Also, the Streaming K-Means algorithm in Spark 1.3 is an added plus for ML, apart from the excellent job monitoring and process visualizations in 1.4.

Srini Vemula

Posted 2014-06-17T20:48:35.267

Reputation: 41