Spark with Couchbase

This talk covers the current state of the Couchbase Spark Connector and was given at Couchbase Connect 2015 in Santa Clara. Check out the Couchbase website for the recording!

Michael Nitschinger

June 04, 2015


  1. Introduction (©2015 Couchbase Inc.)
     - Apache Spark is a fast and general engine for large-scale data processing.
  2. More Facts
     - Over 450 contributors, very active Apache Big Data project.
     - Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
  3. Community
     - Ecosystem growing fast
       - Hadoop
       - RDBMS
       - NoSQL
     - Package Repository
       - http://spark-packages.org/
       - Connectors
       - Utility Libraries
  4. Components: Spark Core
     - Resilient Distributed Datasets
     - Clustering
     - Execution
  5. Components: Spark SQL
     - Structured Data Frames
     - Distributed querying with SQL
  6. Components: Spark Streaming
     - Fault-tolerant streaming applications
  7. Components: Spark MLlib
     - Built-In Machine Learning Algorithms
  8. Components: Spark GraphX
     - Graph processing and graph-parallel computations
  9. How does it work?
     - Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
     - [Diagram: scheduling pipeline, from user code to workers]
       - RDD Objects: build the DAG of operators, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
       - DAGScheduler: receives the DAG, splits the graph into stages of tasks, submits each stage as ready; agnostic to operators
       - TaskScheduler: receives a TaskSet, launches tasks via the cluster manager, retries failed or straggling tasks, reports "stage failed" back; doesn't know about stages
       - Worker: threads execute tasks; the block manager stores and serves blocks
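The scheduling flow on the "How does it work?" slide can be illustrated with a toy model. The following is a minimal sketch in plain Python, not Spark's actual scheduler: `ToyRDD`, `WIDE_OPS`, and `split_into_stages` are hypothetical names, and treating only `join` and `groupBy` as stage boundaries is a simplifying assumption. It shows that chaining operators only records lineage, and that a scheduler can later cut that lineage into stages.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption for this sketch: these operators need a shuffle (wide dependency).
WIDE_OPS = {"join", "groupBy"}


@dataclass
class ToyRDD:
    """One node of the lineage graph. Creating it computes nothing."""
    op: str
    parents: List["ToyRDD"] = field(default_factory=list)

    # Each transformation only records a new node that points at its parents.
    def join(self, other):
        return ToyRDD("join", [self, other])

    def groupBy(self, key_fn=None):
        return ToyRDD("groupBy", [self])

    def filter(self, predicate=None):
        return ToyRDD("filter", [self])


def _build_stage(rdd, stages):
    """Collect the ops pipelined into the stage ending at rdd."""
    ops = []
    frontier = [rdd]
    while frontier:
        node = frontier.pop()
        ops.append(node.op)
        for parent in node.parents:
            if node.op in WIDE_OPS:
                # Wide dependency: the parent belongs to an earlier stage.
                _build_stage(parent, stages)
            else:
                # Narrow dependency: the parent is pipelined into this stage.
                frontier.append(parent)
    # Parent stages were appended during the walk, so this stage comes after them.
    stages.append(list(reversed(ops)))


def split_into_stages(final_rdd):
    """Return stages in execution order; each stage is a list of op names."""
    stages = []
    _build_stage(final_rdd, stages)
    return stages


rdd1 = ToyRDD("source:rdd1")
rdd2 = ToyRDD("source:rdd2")
result = rdd1.join(rdd2).groupBy().filter()

for number, stage in enumerate(split_into_stages(result)):
    print(number, stage)
```

In this toy model the chain from the slide yields four stages: one per source, one for the join, and one in which `groupBy` and `filter` are pipelined together. A lineage node shared by several children would be visited more than once here, which a real scheduler would avoid.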
  10. Spark Benefits
      - Linearly scalable to 1000+ worker nodes
      - Simpler to use than Hadoop MR
      - Only partial recompute on failure
      - For developers and data scientists
        - machine learning
        - R integration
      - Tight but not mandatory Hadoop integration
        - Sources, Sinks
        - Scheduler
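The "only partial recompute on failure" point can be made concrete with a small model. This is a minimal sketch in plain Python under stated assumptions, not Spark code: `ToyDataset` and its methods are hypothetical, the source data is assumed to be durable, and the lineage is assumed to be a list of per-record functions. It shows that when one cached partition is lost, only that partition is rebuilt from the source by replaying the lineage.

```python
class ToyDataset:
    """Partitioned data whose results can be rebuilt from source plus lineage."""

    def __init__(self, source_partitions, lineage):
        self.source = source_partitions  # durable input, one list per partition
        self.lineage = lineage           # per-record functions, applied in order
        self.cache = {}                  # partition index -> computed records
        self.computed = []               # log of partition indexes computed

    def _compute(self, index):
        # Replay the whole lineage, but only for the requested partition.
        records = self.source[index]
        for fn in self.lineage:
            records = [fn(record) for record in records]
        self.computed.append(index)
        return records

    def partition(self, index):
        if index not in self.cache:
            self.cache[index] = self._compute(index)
        return self.cache[index]

    def collect(self):
        return [r for i in range(len(self.source)) for r in self.partition(i)]

    def lose_partition(self, index):
        """Simulate a worker failure that drops one cached partition."""
        self.cache.pop(index, None)


dataset = ToyDataset(
    source_partitions=[[1, 2], [3, 4], [5, 6]],
    lineage=[lambda x: x * 10, lambda x: x + 1],
)

before = dataset.collect()   # computes partitions 0, 1 and 2
dataset.lose_partition(1)    # one worker "fails"
after = dataset.collect()    # recomputes partition 1 only

print(before == after, dataset.computed)
```

The `computed` log records partitions 0, 1, 2 for the first pass and only partition 1 after the simulated failure, while the collected result stays the same.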
  11. Spark vs Hadoop
      - Spark is RAM bound, while Hadoop is mainly HDFS (disk) bound
      - Fully compatible with Hadoop Input/Output
      - Easier to develop against thanks to functional composition
      - Hadoop certainly more mature, but Spark ecosystem growing fast
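To illustrate what "functional composition" means in practice, here is a minimal sketch of a word count written as one chained expression. It runs locally in plain Python and is not the Spark API: `LocalPipeline` and its snake_case methods are hypothetical stand-ins, loosely modeled on the style of chaining small transformations, and everything is evaluated eagerly in memory.

```python
from collections import defaultdict
from functools import reduce


class LocalPipeline:
    """A tiny in-memory collection with chainable transformations."""

    def __init__(self, records):
        self.records = list(records)

    def flat_map(self, fn):
        # Apply fn to each record and flatten the resulting sequences.
        return LocalPipeline(out for r in self.records for out in fn(r))

    def map(self, fn):
        return LocalPipeline(fn(r) for r in self.records)

    def reduce_by_key(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.records:
            groups[key].append(value)
        return LocalPipeline((k, reduce(fn, vs)) for k, vs in groups.items())

    def collect(self):
        return self.records


lines = ["spark with couchbase", "spark streaming"]

counts = (
    LocalPipeline(lines)
    .flat_map(str.split)               # lines -> words
    .map(lambda word: (word, 1))       # words -> (word, 1) pairs
    .reduce_by_key(lambda a, b: a + b) # sum the ones per word
    .collect()
)

print(dict(counts))
```

Each step is a small function passed to a generic operator, so the whole job reads top to bottom as a single expression instead of separate mapper and reducer classes.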
  12. Ecosystem Flexibility
      - [Diagram labels: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data]
  13. Couchbase Connector
      - Spark Core
        - Automatic Cluster and Resource Management
        - Creating and Persisting RDDs
        - Java APIs in addition to Scala (planned before GA)
      - Spark SQL
        - Easy JSON handling and querying
        - Tight N1QL Integration (partially in dp2, fully planned before GA)
      - Spark Streaming
        - Persisting DStreams
        - DCP source (partially in dp2, fully planned before GA)
  14. Facts
      - Current Version: 1.0.0-dp2
      - Beta in July, GA in Q3 (tentative)
      - Code: https://github.com/couchbaselabs/couchbase-spark-connector
      - Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki