Spark with Couchbase - Speaker Deck

Slide 1

Slide 1 text

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING Michael Nitschinger, Couchbase @daschl

Slide 2

Slide 2 text

What is Spark?

Slide 3

Slide 3 text

Slide 4

Slide 4 text

©2015 Couchbase Inc.

More Facts
- Over 450 contributors; a very active Apache Big Data project.
- Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

How does it work?
- Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
- RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…) builds the DAG.
- DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready. It is agnostic to operators.
- TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks. It doesn't know about stages.
- Worker: executes tasks, stores and serves blocks (threads, block manager).
- Hand-offs: RDD Objects pass a DAG to the DAGScheduler, which passes a TaskSet to the TaskScheduler, which sends each Task to a Worker; a failed stage is reported back to the DAGScheduler.
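The scheduling idea on this slide can be sketched in a few lines of plain Python. This is a toy model, not Spark code: `ToyRDD` and `split_into_stages` are hypothetical names, and the rule "every shuffle operation starts a new stage" is a simplification of what a real DAG scheduler does. It only illustrates that chained transformations record a graph without computing anything, and that the graph is later cut into stages.

```python
from dataclasses import dataclass, field
from typing import List


# Toy stand-in for an RDD: it only records how it was derived (its lineage).
# ToyRDD and split_into_stages are hypothetical names, not Spark classes.
@dataclass
class ToyRDD:
    op: str                                   # name of the transformation
    parents: List["ToyRDD"] = field(default_factory=list)
    shuffle: bool = False                     # True if the op needs a shuffle

    def filter(self) -> "ToyRDD":
        return ToyRDD("filter", [self])

    def group_by(self) -> "ToyRDD":
        return ToyRDD("groupBy", [self], shuffle=True)

    def join(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD("join", [self, other], shuffle=True)


def split_into_stages(final: ToyRDD) -> List[List[str]]:
    """Cut the graph at shuffle boundaries; return stages in execution order."""
    stages: List[List[str]] = []

    def visit(node: ToyRDD) -> List[str]:
        # Returns the still-open stage that ends with `node`.
        if node.shuffle:
            # Every parent pipeline must be complete before the shuffle.
            for parent in node.parents:
                stages.append(visit(parent))
            return [node.op]
        current: List[str] = []
        for parent in node.parents:
            current = visit(parent)           # non-shuffle ops have one parent here
        return current + [node.op]

    stages.append(visit(final))
    return stages


rdd1 = ToyRDD("source1")
rdd2 = ToyRDD("source2")
# Nothing is computed here; the calls only build the graph.
result = rdd1.join(rdd2).group_by().filter()
stages = split_into_stages(result)
print(stages)
```

In this model the non-shuffle `filter` is pipelined into the same stage as the `groupBy` that precedes it, while each input of the `join` forms its own earlier stage.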

Slide 12

Slide 12 text

Why should you care?

Slide 13

Slide 13 text

Spark Benefits
- Linearly scalable to 1000+ worker nodes
- Simpler to use than Hadoop MR
- Only partial recompute on failure
- For developers and data scientists
  - Machine learning
  - R integration
- Tight but not mandatory Hadoop integration
  - Sources, sinks
  - Scheduler
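The "only partial recompute on failure" point can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not Spark's implementation: the partition layout, the `lineage` list, and the helpers `compute_partition` and `get_partition` are all hypothetical. The idea shown is that a lost partition is rebuilt by replaying the recorded transformations on that partition's source data only.

```python
from typing import Callable, Dict, List

# Hypothetical toy model: source data split into partitions, plus the list of
# per-element functions (the lineage) that produced the cached results.
source_partitions: List[List[int]] = [[1, 2], [3, 4], [5, 6]]
lineage: List[Callable[[int], int]] = [lambda x: x * 10, lambda x: x + 1]

recomputed: List[int] = []  # records which partitions were (re)computed


def compute_partition(index: int) -> List[int]:
    """Replay the lineage for a single partition of the source data."""
    recomputed.append(index)
    data = source_partitions[index]
    for fn in lineage:
        data = [fn(x) for x in data]
    return data


# Initial run: every partition is computed once and cached.
cache: Dict[int, List[int]] = {i: compute_partition(i) for i in range(3)}
recomputed.clear()

# Simulate a worker failure that loses the cached partition 1.
del cache[1]


def get_partition(index: int) -> List[int]:
    """Serve from the cache, rebuilding only a missing partition."""
    if index not in cache:
        cache[index] = compute_partition(index)
    return cache[index]


result = [get_partition(i) for i in range(3)]
print(result, recomputed)
```

Only partition 1 is recomputed after the simulated failure; the other two are served from the cache.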

Slide 14

Slide 14 text

Spark vs Hadoop
- Spark is mainly RAM-bound, while Hadoop is mainly HDFS (disk) bound
- Fully compatible with Hadoop Input/Output
- Easier to develop against thanks to functional composition
- Hadoop is certainly more mature, but the Spark ecosystem is growing fast
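As a rough illustration of what "functional composition" means here, the following plain-Python sketch writes a word count as a chain of small functions. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins named after the style of operation; they run locally on in-memory lists and are not a Spark or Hadoop API.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, Tuple


def flat_map(fn: Callable[[str], Iterable[str]], items: Iterable[str]) -> Iterator[str]:
    """Apply fn to each item and flatten the results into one stream."""
    return chain.from_iterable(map(fn, items))


def reduce_by_key(fn: Callable[[int, int], int],
                  pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Combine all values that share a key using fn."""
    acc: Dict[str, int] = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


lines = ["spark with couchbase", "spark with hadoop"]

# The whole job is one pipeline of composed steps.
words = flat_map(str.split, lines)
pairs = map(lambda word: (word, 1), words)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)
```

Each step takes the previous step's output as its input, so adding or swapping a step does not require restructuring the rest of the job.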

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

The Couchbase Spark Connector

Slide 18

Slide 18 text

Couchbase Connector
- Spark Core
  - Automatic Cluster and Resource Management
  - Creating and Persisting RDDs
  - Java APIs in addition to Scala (planned before GA)
- Spark SQL
  - Easy JSON handling and querying
  - Tight N1QL Integration (partially in dp2, fully planned before GA)
- Spark Streaming
  - Persisting DStreams
  - DCP source (partially in dp2, fully planned before GA)

Slide 19

Slide 19 text

Facts
- Current Version: 1.0.0-dp2
- Beta in July, GA in Q3 (tentative)
- Code: https://github.com/couchbaselabs/couchbase-spark-connector
- Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki