Spark with Couchbase

This talk covers the current state of the Couchbase Spark Connector and was given at Couchbase Connect 2015 in Santa Clara. Check out the Couchbase website for the recording!

Michael Nitschinger

June 04, 2015

Transcript

  1. SPARK WITH COUCHBASE
    TO ELECTRIFY YOUR DATA PROCESSING
    Michael Nitschinger, Couchbase
    @daschl

  2. What is Spark?

  3. ©2015 Couchbase Inc.
    Introduction
    - Apache Spark is a fast and general engine for large-scale data processing.

  4. More Facts
    - Over 450 contributors, very active Apache Big Data project.
    - Huge public interest:
      Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

  5. Community
    - Ecosystem growing fast
      - Hadoop
      - RDBMS
      - NoSQL
    - Package Repository
      - http://spark-packages.org/
      - Connectors
      - Utility Libraries

  6. Components: Spark Core
    Resilient Distributed Datasets
    Clustering
    Execution

  7. Components: Spark SQL
    Structured Data Frames
    Distributed querying with SQL

  8. Components: Spark Streaming
    Fault-tolerant streaming applications

  9. Components: Spark MLlib
    Built-in machine learning algorithms

  10. Components: Spark GraphX
    Graph processing and graph-parallel computations

  11. How does it work?
    - Resilient Distributed Datasets paper:
      https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
    rdd1.join(rdd2)
      .groupBy(…)
      .filter(…)
    Diagram: job scheduling pipeline
    - RDD Objects: build DAG
    - DAGScheduler (receives the DAG): split graph into stages of tasks; submit each stage as ready; agnostic to operators
    - TaskScheduler (receives a TaskSet): launch tasks via cluster manager; retry failed or straggling tasks; doesn't know about stages; reports "stage failed" back to the DAGScheduler
    - Worker (receives Tasks): execute tasks on threads; store and serve blocks via the block manager
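
The stage-splitting step on this slide can be illustrated with a small toy model. The sketch below is plain Python and not Spark's API: `Node`, its `shuffle` flag, and `split_into_stages` are hypothetical names introduced here. It assumes that narrow operations (such as filter) are pipelined into their parent's stage, while wide operations (such as join and groupBy) start a new stage, which is roughly where a DAG scheduler cuts the lineage graph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical lineage node: one operation plus the nodes it reads from."""
    op: str
    parents: List["Node"] = field(default_factory=list)
    shuffle: bool = False  # True for wide operations such as join or groupBy


def split_into_stages(final: Node) -> List[List[str]]:
    """Cut the lineage of `final` into stages at every shuffle boundary.

    Stages are returned so that a stage always appears after the stages
    it depends on. Shared parents are not deduplicated in this toy model.
    """
    stages: List[List[str]] = []

    def build(node: Node) -> List[str]:
        ops: List[str] = []
        for parent in node.parents:
            parent_ops = build(parent)
            if node.shuffle:
                # Wide dependency: the parent's stage ends here.
                stages.append(parent_ops)
            else:
                # Narrow dependency: pipeline the parent into this stage.
                ops.extend(parent_ops)
        ops.append(node.op)
        return ops

    stages.append(build(final))
    return stages


# Lineage for rdd1.join(rdd2).groupBy(...).filter(...)
rdd1 = Node("rdd1")
rdd2 = Node("rdd2")
joined = Node("join", [rdd1, rdd2], shuffle=True)
grouped = Node("groupBy", [joined], shuffle=True)
result = Node("filter", [grouped])

stages = split_into_stages(result)
for number, stage in enumerate(stages):
    print(number, stage)
```

In this model the filter ends up in the same stage as the groupBy, because it needs no data from other partitions.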

  12. Why should you care?

  13. Spark Benefits
    - Linearly scalable to 1000+ worker nodes
    - Simpler to use than Hadoop MR
    - Only partial recompute on failure
    - For developers and data scientists
      - machine learning
      - R integration
    - Tight but not mandatory Hadoop integration
      - Sources, Sinks
      - Scheduler
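
The "only partial recompute on failure" point can be sketched with a toy model. This is plain Python, not Spark's API: `ToyRDD`, its `cache`, and its `computed` log are hypothetical names introduced here. It assumes each partition can be rebuilt independently from its source data and a recorded function (its lineage), so losing one partition means recomputing only that one.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned source data plus the
    function (lineage) needed to rebuild any single partition."""

    def __init__(self, source: List[List[int]], fn: Callable[[int], int]):
        self.source = source
        self.fn = fn
        self.cache: Dict[int, List[int]] = {}  # computed partitions by index
        self.computed: List[int] = []          # log of partitions (re)computed

    def partition(self, index: int) -> List[int]:
        # Recompute from lineage only if this partition is not available.
        if index not in self.cache:
            self.computed.append(index)
            self.cache[index] = [self.fn(x) for x in self.source[index]]
        return self.cache[index]

    def collect(self) -> List[int]:
        return [x for i in range(len(self.source)) for x in self.partition(i)]


rdd = ToyRDD([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
first = rdd.collect()   # computes all three partitions

del rdd.cache[1]        # simulate losing the worker that held partition 1
rdd.computed.clear()
second = rdd.collect()  # rebuilds only the lost partition

print(second, rdd.computed)
```

The second `collect()` returns the same data as the first while touching only the partition that was lost.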

  14. Spark vs Hadoop
    - Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound
    - Fully compatible with Hadoop Input/Output
    - Easier to develop against thanks to functional composition
    - Hadoop is certainly more mature, but the Spark ecosystem is growing fast

  15. Ecosystem Flexibility
    Diagram labels: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data

  16. Infrastructure Consolidation

  17. The Couchbase Spark Connector

  18. Couchbase Connector
    - Spark Core
      - Automatic Cluster and Resource Management
      - Creating and Persisting RDDs
      - Java APIs in addition to Scala (planned before GA)
    - Spark SQL
      - Easy JSON handling and querying
      - Tight N1QL Integration (partially in dp2, fully planned before GA)
    - Spark Streaming
      - Persisting DStreams
      - DCP source (partially in dp2, fully planned before GA)

  19. Facts
    - Current Version: 1.0.0-dp2
    - Beta in July, GA in Q3 (tentative)
    - Code: https://github.com/couchbaselabs/couchbase-spark-connector
    - Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki

  20. Connection Management

  21. Connection Management

  22. Creating RDDs

  23. Persisting RDDs

  24. Spark SQL Integration

  25. Spark Streaming with DCP

  26. Questions?

  27. Thank you.