Spark with Couchbase

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING
Michael Nitschinger, Couchbase (@daschl)

What is Spark?

©2015 Couchbase Inc.

Introduction
§ Apache Spark is a fast and general engine for large-scale data processing.

More Facts
§ Over 450 contributors, very active Apache Big Data project.
§ Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Community
§ Ecosystem growing fast
  § Hadoop
  § RDBMS
  § NoSQL
§ Package Repository
  § http://spark-packages.org/
  § Connectors
  § Utility Libraries

Components: Spark Core
§ Resilient Distributed Datasets
§ Clustering
§ Execution

Components: Spark SQL
§ Structured Data Frames
§ Distributed querying with SQL

Components: Spark Streaming
§ Fault-tolerant streaming applications

Components: Spark MLlib
§ Built-in machine learning algorithms

Components: Spark GraphX
§ Graph processing and graph-parallel computations

How does it work?
§ Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
§ Execution pipeline (from the slide diagram):
  § RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…) builds the DAG
  § DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready; agnostic to operators
  § TaskScheduler: receives each stage as a TaskSet, launches tasks via the cluster manager, and retries failed or straggling tasks; doesn't know about stages
  § Worker: executes tasks on threads; its block manager stores and serves blocks
  § A failed stage is reported back to the DAGScheduler
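The stage-splitting step can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark itself: `ToyRDD`, `WIDE_OPS`, and `split_into_stages` are hypothetical names, and the only assumption is that `join` and `groupBy` need a shuffle while `filter` does not. Chaining calls only records the graph; stages are cut wherever a shuffle is required.

```python
from dataclasses import dataclass, field
from typing import List

# Toy model only: these names are hypothetical and are not Spark's API.
WIDE_OPS = {"join", "groupBy"}  # assumed to need a shuffle between stages


@dataclass
class ToyRDD:
    op: str
    parents: List["ToyRDD"] = field(default_factory=list)

    # Each transformation only records a new node; nothing is computed here.
    def join(self, other):
        return ToyRDD("join", [self, other])

    def groupBy(self):
        return ToyRDD("groupBy", [self])

    def filter(self):
        return ToyRDD("filter", [self])


def split_into_stages(final):
    """Return stages in submission order; each stage is a list of op names."""
    stages = []

    def pipeline(node):
        # Ops of the stage that ends at `node`. Finished parent stages are
        # appended to `stages` before the stage that depends on them.
        if node.op in WIDE_OPS:
            # Shuffle boundary: every parent closes its own stage first.
            for parent in node.parents:
                stages.append(pipeline(parent))
            return [node.op]
        ops = []
        for parent in node.parents:
            ops.extend(pipeline(parent))
        return ops + [node.op]

    stages.append(pipeline(final))
    return stages


rdd1, rdd2 = ToyRDD("source1"), ToyRDD("source2")
result = rdd1.join(rdd2).groupBy().filter()
for number, ops in enumerate(split_into_stages(result)):
    print(f"stage {number}: {' -> '.join(ops)}")
# stage 0: source1
# stage 1: source2
# stage 2: join
# stage 3: groupBy -> filter
```

The narrow `filter` is pipelined into the same stage as the `groupBy` before it, while each shuffle starts a new stage. Real Spark draws the boundaries in a more refined way, so treat the output as an illustration of the idea only.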

Why should you care?

Spark Benefits
§ Linearly scalable to 1000+ worker nodes
§ Simpler to use than Hadoop MR
§ Only partial recompute on failure
§ For developers and data scientists
  § Machine learning
  § R integration
§ Tight but not mandatory Hadoop integration
  § Sources, Sinks
  § Scheduler
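"Only partial recompute on failure" can be sketched with a toy model. This is plain Python under stated assumptions, not Spark's API: `ToyPartitionedData` is a hypothetical class that remembers how each partition is derived from durable input (its lineage), so a lost partition can be rebuilt on its own.

```python
from typing import Callable, Dict, List


class ToyPartitionedData:
    """Hypothetical toy class for illustration; not Spark's API."""

    def __init__(self, source: List[List[int]], fn: Callable[[int], int]):
        self.source = source                    # durable input, one list per partition
        self.fn = fn                            # lineage: how to derive each element
        self.cache: Dict[int, List[int]] = {}   # computed partitions held in memory
        self.computed: List[int] = []           # log of partition indices computed

    def partition(self, index: int) -> List[int]:
        # Compute a partition only if it is not already in memory.
        if index not in self.cache:
            self.computed.append(index)
            self.cache[index] = [self.fn(x) for x in self.source[index]]
        return self.cache[index]

    def collect(self) -> List[int]:
        return [x for i in range(len(self.source)) for x in self.partition(i)]

    def lose(self, index: int) -> None:
        # Simulate a worker failure that drops one in-memory partition.
        self.cache.pop(index, None)


data = ToyPartitionedData([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
data.collect()        # computes partitions 0, 1 and 2
data.lose(1)          # partition 1 is lost
data.collect()        # only partition 1 is recomputed
print(data.computed)  # [0, 1, 2, 1]
```

The second `collect()` touches only the lost partition because the other two are still in memory and the lineage says how to rebuild the missing one.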

Spark vs Hadoop
§ Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound
§ Fully compatible with Hadoop Input/Output
§ Easier to develop against thanks to functional composition
§ Hadoop is certainly more mature, but the Spark ecosystem is growing fast

The Couchbase Spark Connector

Couchbase Connector
§ Spark Core
  § Automatic Cluster and Resource Management
  § Creating and Persisting RDDs
  § Java APIs in addition to Scala (planned before GA)
§ Spark SQL
  § Easy JSON handling and querying
  § Tight N1QL Integration (partially in dp2, fully planned before GA)
§ Spark Streaming
  § Persisting DStreams
  § DCP source (partially in dp2, fully planned before GA)

Facts
§ Current Version: 1.0.0-dp2
§ Beta in July, GA in Q3 (tentative)
§ Code: https://github.com/couchbaselabs/couchbase-spark-connector
§ Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki

Questions?

Thank you.
