Slide 1

Slide 1 text

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING
Michael Nitschinger, Couchbase
@daschl

Slide 2

Slide 2 text

What is Spark?

Slide 3

Slide 3 text

©2015 Couchbase Inc.
Introduction
§ Apache Spark is a fast and general engine for large-scale data processing.

Slide 4

Slide 4 text

More Facts
§ Over 450 contributors, very active Apache Big Data project.
§ Huge public interest:
Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Slide 5

Slide 5 text

Community
§ Ecosystem growing fast
  § Hadoop
  § RDBMS
  § NoSQL
§ Package Repository
  § http://spark-packages.org/
  § Connectors
  § Utility Libraries

Slide 6

Slide 6 text

Components: Spark Core
Resilient Distributed Datasets
Clustering
Execution

Slide 7

Slide 7 text

Components: Spark SQL
Structured Data Frames
Distributed querying with SQL

Slide 8

Slide 8 text

Components: Spark Streaming
Fault-tolerant streaming applications

Slide 9

Slide 9 text

Components: Spark MLlib
Built-in machine learning algorithms

Slide 10

Slide 10 text

Components: Spark GraphX
Graph processing and graph-parallel computations

Slide 11

Slide 11 text

How does it work?
§ Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

rdd1.join(rdd2).groupBy(…).filter(…)

Scheduling pipeline (diagram):
§ RDD Objects: build DAG
§ DAGScheduler (receives the DAG): split graph into stages of tasks; submit each stage as ready; agnostic to operators!
§ TaskScheduler (receives a TaskSet): launch tasks via cluster manager; retry failed or straggling tasks; doesn't know about stages; reports "stage failed" back to the DAGScheduler
§ Worker (receives Tasks): execute tasks in threads; store and serve blocks via the block manager
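
The stage-splitting step on this slide can be sketched in plain Python. This is a toy model, not Spark code: ToyRDD, split_into_stages, and the rule that every join and groupBy needs a shuffle are assumptions made for illustration (real Spark can skip a shuffle when the data is already partitioned suitably).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records lineage, nothing is computed."""
    name: str
    parents: Tuple["ToyRDD", ...] = ()
    wide: bool = False  # assumption: True means this operator needs a shuffle

    def filter(self) -> "ToyRDD":
        return ToyRDD("filter", (self,), wide=False)

    def group_by(self) -> "ToyRDD":
        return ToyRDD("groupBy", (self,), wide=True)

    def join(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD("join", (self, other), wide=True)


def split_into_stages(final: ToyRDD) -> List[List[str]]:
    """Cut the lineage graph at shuffle boundaries.

    Operators linked by narrow dependencies share a stage; the parents of a
    wide operator end earlier stages. Stages are returned in an order in
    which they could be submitted (parents before children).
    """
    stages: List[List[str]] = []

    def build(stage_end: ToyRDD) -> None:
        members: List[str] = []
        earlier_stage_ends: List[ToyRDD] = []
        pending = [stage_end]
        while pending:
            node = pending.pop()
            members.append(node.name)
            for parent in node.parents:
                if node.wide:
                    earlier_stage_ends.append(parent)  # shuffle boundary
                else:
                    pending.append(parent)  # pipelined in the same stage
        for parent in earlier_stage_ends:
            build(parent)
        stages.append(list(reversed(members)))

    build(final)
    return stages


rdd1 = ToyRDD("rdd1")
rdd2 = ToyRDD("rdd2")
result = rdd1.join(rdd2).group_by().filter()
stages = split_into_stages(result)
for number, stage in enumerate(stages):
    print(number, " -> ".join(stage))
```

For the chain from the slide, the sketch yields four stages: rdd1, rdd2, join, and groupBy pipelined with filter. The toy does not deduplicate an RDD that is reachable by more than one path.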

Slide 12

Slide 12 text

Why should you care?

Slide 13

Slide 13 text

Spark Benefits
§ Linearly scalable to 1000+ worker nodes
§ Simpler to use than Hadoop MR
§ Only partial recompute on failure
§ For developers and data scientists
  § machine learning
  § R integration
§ Tight but not mandatory Hadoop integration
  § Sources, Sinks
  § Scheduler
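
The "only partial recompute on failure" point can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not Spark's implementation: ToyPartitionedData, lose_partition, and compute_count are hypothetical names, and a "failure" is simulated by dropping one cached partition.

```python
from typing import Callable, Dict, List


class ToyPartitionedData:
    """Hypothetical partitioned dataset that remembers how to rebuild itself."""

    def __init__(self, source: List[List[int]], transform: Callable[[int], int]):
        self.source = source            # input partitions (assumed to be durable)
        self.transform = transform      # the recorded lineage: one map step
        self.cache: Dict[int, List[int]] = {}
        self.compute_count = 0          # how many partitions were (re)computed

    def partition(self, index: int) -> List[int]:
        # Compute a partition only if no cached copy of it exists.
        if index not in self.cache:
            self.compute_count += 1
            self.cache[index] = [self.transform(x) for x in self.source[index]]
        return self.cache[index]

    def collect(self) -> List[int]:
        return [x for i in range(len(self.source)) for x in self.partition(i)]

    def lose_partition(self, index: int) -> None:
        # Simulate a worker failure that drops one cached partition.
        self.cache.pop(index, None)


data = ToyPartitionedData([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
first = data.collect()             # computes all three partitions
computed_before_failure = data.compute_count
data.lose_partition(1)             # one partition is lost
second = data.collect()            # only the lost partition is rebuilt
recomputed = data.compute_count - computed_before_failure
print(first == second, recomputed)
```

After the simulated failure, the second collect returns the same values while recomputing one partition instead of three.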

Slide 14

Slide 14 text

Spark vs Hadoop
§ Spark works mainly in RAM, while Hadoop is mainly HDFS (disk) bound
§ Fully compatible with Hadoop Input/Output
§ Easier to develop against thanks to functional composition
§ Hadoop is certainly more mature, but the Spark ecosystem is growing fast

Slide 15

Slide 15 text

Ecosystem Flexibility
Diagram labels: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data

Slide 16

Slide 16 text

Infrastructure Consolidation

Slide 17

Slide 17 text

The Couchbase Spark Connector

Slide 18

Slide 18 text

Couchbase Connector
§ Spark Core
  § Automatic Cluster and Resource Management
  § Creating and Persisting RDDs
  § Java APIs in addition to Scala (planned before GA)
§ Spark SQL
  § Easy JSON handling and querying
  § Tight N1QL Integration (partially in dp2, fully planned before GA)
§ Spark Streaming
  § Persisting DStreams
  § DCP source (partially in dp2, fully planned before GA)

Slide 19

Slide 19 text

Facts
§ Current Version: 1.0.0-dp2
§ Beta in July, GA in Q3 (tentative)
§ Code: https://github.com/couchbaselabs/couchbase-spark-connector
§ Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki

Slide 20

Slide 20 text

Connection Management

Slide 21

Slide 21 text

Connection Management

Slide 22

Slide 22 text

Creating RDDs

Slide 23

Slide 23 text

Persisting RDDs

Slide 24

Slide 24 text

Spark SQL Integration

Slide 25

Slide 25 text

Spark Streaming with DCP

Slide 26

Slide 26 text

Questions?

Slide 27

Slide 27 text

Thank you.