This talk covers the current state of the Spark Couchbase connector; it was given at Couchbase Connect 2015 in Santa Clara. Check out the Couchbase website for the recording!
Over 450 contributors make Spark a very active Apache Big Data project. § Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
§ Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf § [Diagram: Spark scheduling pipeline] A job such as rdd1.join(rdd2).groupBy(…).filter(…) builds a DAG of RDD objects; this layer is agnostic to the operators and knows nothing about stages. The DAGScheduler splits the graph into stages of tasks and submits each stage as it becomes ready. The TaskScheduler launches the resulting TaskSets via the cluster manager and retries failed or straggling tasks. Workers execute the tasks in threads and store and serve blocks through the Block manager.
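The build-then-execute flow above hinges on lazy evaluation: transformations only record operations in a lineage (the DAG), and nothing runs until an action is called. A minimal sketch in plain Python (the class and method names here are illustrative stand-ins, not Spark's actual API):

```python
# Sketch of Spark-style lazy evaluation. FakeRDD is a hypothetical
# stand-in for an RDD: transformations record operations in a lineage
# tuple; only the collect() action executes the recorded pipeline.
class FakeRDD:
    def __init__(self, data, ops=()):
        self.data = data          # source data (a plain iterable here)
        self.ops = ops            # recorded transformations, i.e. the lineage

    def map(self, f):
        # Lazy: returns a new FakeRDD with the op appended, runs nothing.
        return FakeRDD(self.data, self.ops + (("map", f),))

    def filter(self, p):
        return FakeRDD(self.data, self.ops + (("filter", p),))

    def collect(self):
        # The action: only now is the recorded pipeline executed.
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = FakeRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Because the full lineage is known before execution, a scheduler can group operations into stages and recompute only the lost pieces after a failure, which is exactly what the DAGScheduler/TaskScheduler split in the diagram is about.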
Linearly scalable to 1000+ worker nodes § Simpler to use than Hadoop MapReduce § On failure, only lost partitions are recomputed (lineage-based recovery) § For developers and data scientists: machine learning, R integration § Tight but not mandatory Hadoop integration: sources, sinks, scheduler
§ Spark is RAM-oriented while Hadoop MR is mainly HDFS (disk) bound § Fully compatible with Hadoop input/output formats § Easier to develop against thanks to functional composition § Hadoop is certainly more mature, but the Spark ecosystem is growing fast
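The "functional composition" point can be illustrated with a word count written as a chain of small steps, the style Spark encourages. This is plain Python standing in for Spark's API; flat_map and the Counter-based aggregation are illustrative helpers, not connector or Spark calls:

```python
# Word count as a pipeline of small functional steps (Spark-style),
# using only the standard library. flat_map mimics Spark's flatMap;
# Counter plays the role of a reduceByKey-style aggregation.
from collections import Counter
from itertools import chain

def flat_map(f, xs):
    # Apply f to each element and flatten the results by one level.
    return chain.from_iterable(f(x) for x in xs)

def word_count(lines):
    words = flat_map(str.split, lines)   # lines -> words
    lowered = map(str.lower, words)      # normalize case
    return Counter(lowered)              # aggregate counts per word

counts = word_count(["Spark is fast", "Hadoop is mature", "spark scales"])
print(counts["spark"])  # 2
```

Compare this with the boilerplate of a classic Hadoop MapReduce job (Mapper and Reducer classes, a driver, serialization glue): the composed pipeline reads top to bottom as the data flow itself.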
Version: 1.0.0-dp2 § Beta in July, GA in Q3 (tentative) § Code: https://github.com/couchbaselabs/couchbase-spark-connector § Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
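For trying the developer preview, the dependency can be pulled into an sbt build roughly like this (a config sketch; the exact coordinates and resolver for the dp2 preview should be checked against the repo wiki above):

```scala
// build.sbt fragment -- coordinates are assumed, verify on the project wiki
libraryDependencies += "com.couchbase.client" %% "spark-connector" % "1.0.0-dp2"
```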