Big Data Analytics with Couchbase, Hadoop, Kafka, Spark and More

By Matt Ingenthron, Sr. Director SDK Engineering and Developer Advocacy, Couchbase

Datageeks Paris

March 25, 2015

Transcript

  1. Big Data Analytics with Couchbase, Hadoop, Kafka, Spark and More

    Matt Ingenthron, Sr. Director SDK Engineering and Developer Advocacy
  2. About Me Matt Ingenthron Worked on large-site scalability problems

    at a previous company… memcached contributor. Joined Couchbase very early and helped define key parts of the system. @ingenthr 2
  3. None
  4. Lambda Architecture (diagram): (1) DATA is dispatched to both the

    (2) BATCH and (3) SPEED layers, which feed the (4) SERVE layer that answers (5) QUERY requests
  5. Lambda Architecture (diagram): the same DATA → BATCH / SPEED →

    SERVE → QUERY flow applied to interactive and real-time applications, with HADOOP as the batch layer, STORM as the speed layer and COUCHBASE serving, fed by Kafka producers through a broker cluster with a spout per topic and ordered subscriptions
  6. (diagram) Batch track and real-time track: complex event processing,

    real-time repository, perpetual store, analytical DB, business intelligence, monitoring, chat/voice system and dashboard
  7. (diagram) Tracking and collection feeding analysis and visualization via REST, filters and metrics

  8. Integration at Scale

  9. 9 Requirements for data streaming in modern systems… •  Must

    support high throughput and low latency •  Need to handle failures •  Pick up where you left off •  Be efficient about resource usage
  10. Data Sync is the Heart of Any Big Data System

    A fundamental piece of the architecture -  Data sync maintains data redundancy for High Availability (HA) & Disaster Recovery (DR) -  Protects against failures – node, rack, region, etc. -  Data sync maintains indexes -  Indexing is key to building faster access paths to query data -  Spatial, full-text. Next: DCP and Couchbase Server Architecture
  11. What is DCP? DCP is an innovative protocol that drives

    data sync for Couchbase Server •  Increases data sync efficiency with massive data footprints •  Removes slower disk I/O from the data sync path •  Improves latencies – replication for data durability •  In the future, will provide a programmable data sync protocol for external stores outside Couchbase Server. DCP powers many critical components 11
  12. Database Change Protocol: DCP

  13. Couchbase Server Architecture – Data Sync with Replication (diagram):

    app servers use the Couchbase client library and cluster map to read/write/update active shards spread across the servers, while DCP replicates data among nodes to keep each shard's replicas in sync
  14. Couchbase Server Architecture (diagram): under topology changes – here

    servers 4 and 5 joining the cluster – DCP drives rebuilding of replicas, redistributing active and replica shards while reads/writes/updates continue
  15. Design Goals. Ordering: to build interesting features, a streaming protocol needs

    a concept of when operations happened. Couchbase operation ordering at the node level: §  Each mutation is assigned a sequence number §  Sequence numbers increase monotonically §  Sequence numbers are assigned on a per-vBucket basis. Restart-ability: handle failures gracefully, in particular by being efficient about the amount of data moved around. Consistency points: points in time for incremental backup and query consistency. Performance: recognize that durability in a distributed system may have different definitions. 15
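
    A rough sketch of the ordering and restart-ability ideas above, in Scala. This is only a toy model, not the real DCP wire protocol or any Couchbase API (the VBucketStream type and its methods are invented for illustration): each mutation gets a monotonically increasing sequence number per vBucket, so a consumer that remembers the last sequence number it processed can resume from that point instead of starting over.

      // Toy, single-node model of one vBucket's mutation stream (illustrative only).
      case class Mutation(vBucket: Int, seqno: Long, key: String, value: String)

      class VBucketStream(vBucket: Int) {
        private var lastSeqno = 0L
        private val log = scala.collection.mutable.ArrayBuffer.empty[Mutation]

        // Every mutation receives the next monotonically increasing sequence number.
        def append(key: String, value: String): Mutation = {
          lastSeqno += 1
          val m = Mutation(vBucket, lastSeqno, key, value)
          log += m
          m
        }

        // A consumer that persisted its last seen seqno can pick up where it left off.
        def readFrom(afterSeqno: Long): Seq[Mutation] =
          log.filter(_.seqno > afterSeqno).toSeq
      }

      object OrderingDemo {
        def main(args: Array[String]): Unit = {
          val vb = new VBucketStream(vBucket = 0)
          vb.append("beer::1", """{"abv": 5.4}""")
          vb.append("beer::2", """{"abv": 7.1}""")
          // The consumer processed up to seqno 1 before failing; on restart it
          // only re-reads seqno 2 rather than the whole stream.
          vb.readFrom(afterSeqno = 1L).foreach(println)
        }
      }
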
  16. Demo

  17. 17 (diagram) Shopper tracking (click stream) and other data sources feed HDFS.

    Lightweight analytics: •  Department shopped •  Tech platform •  Click tracks by income. Heavier analytics develop profiles
  18. 18 And at scale… (diagram) many Kafka producers feed a broker cluster,

    and a Kafka consumer or Camus lands the data into HDFS
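
    As a minimal illustration of the producer side of the diagram above (not code from the deck), this is roughly what publishing click-stream events to a Kafka topic looks like with the Kafka producer API from Scala; the topic name, key and broker address are made-up placeholders.

      import java.util.Properties
      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

      object ClickProducer {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          // Placeholder broker address; point this at the real broker cluster.
          props.put("bootstrap.servers", "broker1:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

          val producer = new KafkaProducer[String, String](props)

          // Key by shopper id so events for one shopper stay ordered within a partition.
          val record = new ProducerRecord[String, String](
            "shopper-clicks", "shopper-42", """{"page": "/beer/ipa", "ts": 1427241600}""")
          producer.send(record)
          producer.close()
        }
      }

    On the other side of the broker cluster, a Kafka consumer or a Camus job drains the topic into HDFS, as the slide shows.
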
  19. Couchbase & Apache Spark Introduction & Integration

  20. What is Spark? Apache Spark is a fast and general engine

    for large-scale data processing.
  21. Spark Components Spark Core: RDDs, Clustering, Execution, Fault Management
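
    A minimal Spark Core sketch (illustrative, not from the slides) of the RDD basics this component provides: parallelize a collection into an RDD, transform it lazily, then run an action. The app name and local master are arbitrary.

      import org.apache.spark.{SparkConf, SparkContext}

      object CoreExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]")
          val sc = new SparkContext(conf)

          // Transformations (map) are lazy; the action (reduce) triggers execution.
          val squares = sc.parallelize(1 to 1000).map(x => x * x)
          println(squares.reduce(_ + _))

          sc.stop()
        }
      }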

  22. Spark Components Spark SQL: Work with structured data, distributed SQL

    querying
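
    A small Spark SQL sketch (illustrative, not from the slides) of distributed SQL over structured data, using the DataFrame API from the 1.3.0 release mentioned later in the deck; the Beer case class and table name are assumptions for the example.

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      case class Beer(name: String, abv: Double)

      object SqlExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("sql-basics").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          // Turn an RDD of case classes into a DataFrame and query it with SQL.
          val beers = sc.parallelize(
            Seq(Beer("21A IPA", 7.2), Beer("Genuine Pilsner", 4.7))).toDF()
          beers.registerTempTable("beers")
          sqlContext.sql("SELECT name FROM beers WHERE abv > 5.0")
            .collect().foreach(println)

          sc.stop()
        }
      }
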
  23. Spark Components Spark Streaming: Build fault-tolerant streaming applications
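
    A minimal Spark Streaming sketch (illustrative, not from the slides): a DStream over a socket source with a running word count; host, port and batch interval are placeholders.

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object StreamingExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("streaming-basics").setMaster("local[2]")
          // One-second micro-batches; lost partitions are recomputed from lineage.
          val ssc = new StreamingContext(conf, Seconds(1))

          val lines = ssc.socketTextStream("localhost", 9999)
          val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
          counts.print()

          ssc.start()
          ssc.awaitTermination()
        }
      }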

  24. Spark Components MLlib: Machine Learning built in
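
    A quick MLlib sketch (illustrative, not from the slides) of the built-in machine learning: k-means clustering over a handful of made-up points.

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      object MLlibExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("mllib-basics").setMaster("local[2]"))

          // A tiny, made-up dataset of 2D points forming two obvious clusters.
          val points = sc.parallelize(Seq(
            Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
            Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)))

          // Cluster into k = 2 groups with at most 20 iterations.
          val model = KMeans.train(points, 2, 20)
          model.clusterCenters.foreach(println)

          sc.stop()
        }
      }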

  25. Spark Components GraphX: Graph processing and graph-parallel computations
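
    And a short GraphX sketch (illustrative, not from the slides): build a small graph from an edge list and run PageRank over it.

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.graphx.{Edge, Graph}

      object GraphXExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("graphx-basics").setMaster("local[2]"))

          // A tiny directed cycle: 1 -> 2 -> 3 -> 1, with unit edge weights.
          val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(3L, 1L, 1.0)))
          val graph = Graph.fromEdges(edges, 0.0)  // 0.0 is the default vertex attribute

          // Run PageRank until ranks change by less than the tolerance.
          graph.pageRank(0.001).vertices.collect().foreach(println)

          sc.stop()
        }
      }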

  26. Spark Benefits •  Linear Scalability •  Ease of Use • 

    Fault Tolerance •  For developers and data scientists •  Tight but not mandatory Hadoop integration
  27. Spark Facts •  Current Release: 1.3.0 •  Over 450 contributors,

    most active Apache Big Data project. •  Huge public interest: Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
  28. Daytona GraySort Performance – Hadoop MR record vs. Spark record:

    Data Size 102.5 TB vs. 100 TB; Elapsed Time 72 mins vs. 23 mins; # Nodes 2100 vs. 206; # Cores 50400 physical vs. 6592 virtual; Cluster Disk Throughput 3150 GB/s vs. 618 GB/s; Network dedicated DC, 10Gbps vs. EC2, 10Gbps; Sort Rate 1.42 TB/min vs. 4.27 TB/min; Sort Rate/Node 0.67 GB/min vs. 20.7 GB/min. Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html Benchmark: http://sortbenchmark.org/
  29. How does it work? Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf RDD

    Creation → DAG Scheduling → Task Execution
  30. How does it work? Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf RDD

    Creation → DAG Scheduling → Task Execution
  31. How does it work? Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf RDD

    Creation → DAG Scheduling → Task Execution
  32. Spark vs Hadoop •  Spark is RAM-bound while Hadoop is

    HDFS (disk) bound •  The API is easier to reason about and develop against •  Fully compatible with Hadoop Input/Output formats •  Hadoop is more mature; the Spark ecosystem is growing fast
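
    To illustrate the Hadoop Input/Output compatibility point above (example not from the deck), Spark can read from and write back to HDFS directly; the paths below are placeholders.

      import org.apache.spark.{SparkConf, SparkContext}

      object HdfsExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("hdfs-io"))

          // textFile uses Hadoop's TextInputFormat under the hood; saveAsTextFile
          // writes back through a Hadoop OutputFormat.
          sc.textFile("hdfs://namenode:8020/clickstream/2015-03-25/*")
            .filter(_.contains("beer"))
            .saveAsTextFile("hdfs://namenode:8020/output/beer-clicks")

          sc.stop()
        }
      }
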
  33. Ecosystem Flexibility (diagram): data from RDBMSs, streams, web APIs,

    archived data and OLTP systems flows in and out through DCP, KV, N1QL, views and batching
  34. Infrastructure Consolidation (diagram): streams, web APIs and user interaction

  35. What does this look like? 35 Read a few documents into an

    RDD and apply a few transformations…
    sc.couchbaseGet[JsonDocument](Seq(
        "21st_amendment_brewery_cafe-21a_ipa",
        "aass_brewery-genuine_pilsner"))
      .map(doc => doc.content().getString("name"))
      .collect()
      .foreach(println)
  36. Reading a Couchbase Secondary Index… 36 Query a Couchbase View,

    subfilter, and then map over the results…
    // Read the view rows and load their full documents
    val beers = sc.couchbaseView(ViewQuery.from("beer", "brewery_beers"))
      .map(_.id)
      .couchbaseGet[JsonDocument]()
      .filter(doc => doc.content().getString("type") == "beer")
      .cache()

    // Calculate the mean ABV across all beers
    println(beers
      .map(doc => doc.content().getDouble("abv").asInstanceOf[Double])
      .mean())
  37. Couchbase Connector. Spark Core: §  Automatic Cluster and Resource Management

    §  Creating and Persisting RDDs §  Java APIs in addition to Scala (planned before GA). Spark SQL: §  Easy JSON handling and querying §  Tight N1QL Integration (dp2). Spark Streaming: §  Persisting DStreams §  DCP source (planned before GA)
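
    To complement the read examples on the earlier slides, here is a sketch of persisting documents back to Couchbase with the connector's saveToCouchbase() helper. This is based on the 1.0 developer-preview API, so configuration keys and details may differ in the GA release; the bucket name and document contents are made up.

      import com.couchbase.client.java.document.JsonDocument
      import com.couchbase.client.java.document.json.JsonObject
      import com.couchbase.spark._
      import org.apache.spark.{SparkConf, SparkContext}

      object WriteExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("couchbase-write")
            .set("com.couchbase.nodes", "127.0.0.1")          // assumed dev-preview config keys
            .set("com.couchbase.bucket.beer-sample", "")      // bucket name -> password
          val sc = new SparkContext(conf)

          // Build a JSON document and persist it through the connector.
          val doc = JsonDocument.create("mybrewery-session_ipa",
            JsonObject.create().put("name", "Session IPA").put("type", "beer"))
          sc.parallelize(Seq(doc)).saveToCouchbase()

          sc.stop()
        }
      }
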
  38. Connector Facts •  Current Version: 1.0.0-dp •  DP2 upcoming • 

    GA planned for Q3 Code: https://github.com/couchbaselabs/couchbase-spark-connector Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
  39. Links •  Subscribe to the Couchbase newsletter: http://info.couchbase.com/Community-Newsletter-Signup.html

    •  Couchbase is hiring a Solution Engineer in Paris: http://www.couchbase.com/careers •  Join the Paris Couchbase Meetup: http://www.meetup.com/Couchbase-France/
  40. Questions

  41. Matt Ingenthron @ingenthr Michael Nitschinger @daschl Thanks