Slide 1

Big Data Analytics with Couchbase, Hadoop, Kafka, Spark and More
Matt Ingenthron, Sr. Director, SDK Engineering and Developer Advocacy

Slide 2

About Me
Matt Ingenthron
•  Worked on large-site scalability problems at a previous company… memcached contributor
•  Joined Couchbase very early and helped define key parts of the system
@ingenthr

Slide 3

No content

Slide 4

Lambda Architecture
Diagram of the five stages: 1 DATA, 2 BATCH, 3 SPEED, 4 SERVE, 5 QUERY

Slide 5

Lambda Architecture: Interactive and Real-Time Applications
Diagram of the same five stages (DATA, BATCH, SPEED, SERVE, QUERY) implemented with Hadoop, Storm, and Couchbase, fed by a Kafka broker cluster with producers, a spout per topic, and ordered subscriptions

Slide 6

Diagram: a batch track and a real-time track feeding a dashboard; components include complex event processing, a real-time repository, a perpetual store, an analytical DB, business intelligence, monitoring, and a chat/voice system

Slide 7

Diagram: tracking and collection feeding analysis and visualization, with REST, filter, and metrics components

Slide 8

Integration at Scale

Slide 9

Requirements for data streaming in modern systems…
•  Must support high throughput and low latency
•  Need to handle failures
•  Pick up where you left off
•  Be efficient about resource usage

Slide 10

Data Sync Is the Heart of Any Big Data System
A fundamental piece of the architecture:
-  Data sync maintains data redundancy for High Availability (HA) and Disaster Recovery (DR)
   -  Protects against failures: node, rack, region, etc.
-  Data sync maintains indexes
   -  Indexing is key to building faster access paths to query data
   -  Spatial, full-text
DCP and Couchbase Server Architecture

Slide 11

What is DCP?
DCP is an innovative protocol that drives data sync for Couchbase Server:
•  Increases data sync efficiency with massive data footprints
•  Removes slower disk I/O from the data sync path
•  Improves latencies: replication for data durability
•  In the future, will provide a programmable data sync protocol for external stores outside Couchbase Server
DCP powers many critical components.

Slide 12

Database Change Protocol: DCP

Slide 13

Couchbase Server Architecture: Data Sync with Replication
Diagram: app servers read/write/update through the Couchbase client library and cluster map against active shards (shards 1 through 9) spread across Servers 1, 2, and 3, while DCP replicates data among the nodes to keep each server's replica shards up to date

Slide 14

Couchbase Server Architecture
Diagram: under topology changes (Servers 4 and 5 joining the cluster), app servers keep reading, writing, and updating through the cluster map while DCP drives the rebuilding and redistribution of active and replica shards across all five servers

Slide 15

Design Goals
Ordering: To build interesting features, a streaming protocol needs a concept of when operations happened. Couchbase orders operations at the node level:
§  Each mutation is assigned a sequence number
§  Sequence numbers increase monotonically
§  Sequence numbers are assigned on a per-vBucket basis
Restart-ability: Handle failures with grace, in particular by being efficient about the amount of data moved around.
Consistency points: Points in time for incremental backup and query consistency.
Performance: Recognize that durability in a distributed system may have different definitions.
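
To make the ordering and restart-ability goals concrete, here is a minimal sketch (not the actual DCP client API, and not part of the deck) of how a consumer might track per-vBucket sequence numbers so it can pick up where it left off:

    // Minimal sketch only: models per-vBucket sequence tracking, not the real DCP wire protocol.
    case class Mutation(vbucket: Int, seqno: Long, key: String, value: Array[Byte])

    class StreamCheckpoint {
      // Highest sequence number seen per vBucket; sequence numbers only ever grow.
      private val lastSeqno = scala.collection.mutable.Map[Int, Long]().withDefaultValue(0L)

      def record(m: Mutation): Unit =
        lastSeqno(m.vbucket) = math.max(lastSeqno(m.vbucket), m.seqno)

      // After a restart, request each vBucket's stream starting from this point.
      def resumeFrom(vbucket: Int): Long = lastSeqno(vbucket)
    }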

Slide 16

Demo

Slide 17

Diagram: shopper tracking (click stream) drives lightweight analytics (department shopped, tech platform, click tracks by income), while other data sources and HDFS support heavier analytics and developing profiles

Slide 18

And at scale…
Diagram: many producers feed a Kafka broker cluster; a Kafka consumer (or Camus) moves the data into HDFS
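
As a rough illustration (not from the deck), a producer pushing click-stream events into Kafka with the standard producer API might look like this; the broker addresses and the "clicks" topic are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Placeholder brokers and topic; keys and values are plain strings here.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("clicks", "user-123", """{"page": "/beer/ipa"}"""))
    producer.close()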

Slide 19

Couchbase & Apache Spark Introduction & Integration

Slide 20

What is Spark? Apache Spark is a fast and general engine for large-scale data processing.

Slide 21

Spark Components Spark Core: RDDs, Clustering, Execution, Fault Management
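
A minimal Spark Core sketch (not from the deck): create a context, build an RDD, and run a couple of transformations followed by an action. The app name and local master are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder configuration; on a real cluster the master URL would differ.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("couchbase spark", "kafka spark streaming"))
    val counts = lines
      .flatMap(_.split(" "))   // tokenize each line
      .map(word => (word, 1))  // pair every word with a count of one
      .reduceByKey(_ + _)      // sum the counts per word
    counts.collect().foreach(println)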

Slide 22

Spark Components Spark SQL: Work with structured data, distributed SQL querying
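
A small Spark SQL sketch using the Spark 1.3-era API (not from the deck); it assumes an existing SparkContext sc and a placeholder people.json file:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Infer a schema from JSON, register it as a table, and query it with SQL.
    val people = sqlContext.jsonFile("people.json")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age >= 21").show()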

Slide 23

Spark Components Spark Streaming: Build fault-tolerant streaming applications
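
A streaming job in miniature (not from the deck): word counts over 5-second micro-batches from a socket source. The host and port are placeholders, and an existing SparkContext sc is assumed:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Count words within each 5-second batch and print the result.
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()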

Slide 24

Spark Components MLlib: Machine Learning built in
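
A minimal MLlib sketch (not from the deck): k-means over a handful of 2-D points, assuming an existing SparkContext sc:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Two obvious clusters around (0, 0) and (9, 9).
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1), Vectors.dense(8.9, 9.3)))

    val model = KMeans.train(points, 2, 20)  // k = 2 clusters, 20 iterations
    model.clusterCenters.foreach(println)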

Slide 25

Spark Components GraphX: Graph processing and graph-parallel computations
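
A GraphX sketch (not from the deck): a tiny follower graph and its PageRank scores, assuming an existing SparkContext sc:

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices are (id, name) pairs; edges carry a "follows" label.
    val vertices = sc.parallelize(Seq((1L, "matt"), (2L, "michael"), (3L, "couchbase")))
    val edges = sc.parallelize(Seq(Edge(1L, 3L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)
    graph.pageRank(0.001).vertices.collect().foreach(println)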

Slide 26

Spark Benefits
•  Linear Scalability
•  Ease of Use
•  Fault Tolerance
•  For developers and data scientists
•  Tight but not mandatory Hadoop integration

Slide 27

Spark Facts
•  Current Release: 1.3.0
•  Over 450 contributors, most active Apache Big Data project
•  Huge public interest
Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Slide 28

Daytona GraySort Performance

                             Hadoop MR Record          Spark Record
    Data Size                102.5 TB                  100 TB
    Elapsed Time             72 mins                   23 mins
    # Nodes                  2100                      206
    # Cores                  50400 physical            6592 virtual
    Cluster Disk Throughput  3150 GB/s                 618 GB/s
    Network                  Dedicated DC, 10 Gbps     EC2, 10 Gbps
    Sort Rate                1.42 TB/min               4.27 TB/min
    Sort Rate/Node           0.67 GB/min               20.7 GB/min

Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Benchmark: http://sortbenchmark.org/

Slide 29

How does it work?
Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Creation → Scheduling (DAG) → Task Execution
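
A sketch of that lifecycle (not from the deck): transformations only describe the lineage, and the DAG is scheduled and executed when an action runs. The HDFS path is a placeholder and an existing SparkContext sc is assumed:

    val clicks = sc.textFile("hdfs:///data/clicks")       // RDD creation (lazy)
    val beerClicks = clicks.filter(_.contains("beer"))    // transformation (lazy)
    val pairs = beerClicks.map(line => (line, 1))         // transformation (lazy)

    // Only the action below triggers DAG scheduling and task execution on the executors.
    println(pairs.count())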

Slide 30

How does it work?
Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Creation → Scheduling (DAG) → Task Execution

Slide 31

How does it work?
Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Creation → Scheduling (DAG) → Task Execution

Slide 32

Spark vs. Hadoop
•  Spark is RAM-bound while Hadoop is HDFS (disk) bound
•  Spark's API is easier to reason about and to develop against
•  Fully compatible with Hadoop Input/Output formats
•  Hadoop is more mature; the Spark ecosystem is growing fast
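
For instance, because Spark accepts Hadoop input formats, existing HDFS data can be read directly; the paths below are placeholders and this sketch is not from the deck:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Plain text files via the convenience wrapper over TextInputFormat.
    val logs = sc.textFile("hdfs://namenode:8020/logs/2015/03/*.log")
    println(logs.filter(_.contains("ERROR")).count())

    // Or use any Hadoop InputFormat explicitly.
    val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://namenode:8020/logs/2015/03/part-*")
    println(records.count())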

Slide 33

Ecosystem Flexibility
Diagram: RDBMS, streams, web APIs, batching, archived data, and OLTP systems connected through Couchbase interfaces (DCP, KV, N1QL, views)

Slide 34

Infrastructure Consolidation
Diagram: streams, web APIs, and user interaction

Slide 35

What does this look like?
Read a sequence of documents into an RDD and apply a few transformations…

    sc.couchbaseGet[JsonDocument](Seq(
        "21st_amendment_brewery_cafe-21a_ipa",
        "aass_brewery-genuine_pilsner"))
      .map(doc => doc.content().getString("name"))
      .collect()
      .foreach(println)

Slide 36

Reading a Couchbase Secondary Index…
Query a Couchbase View, subfilter, and then map over the results…

    // Read the view rows and load their full documents
    val beers = sc.couchbaseView(ViewQuery.from("beer", "brewery_beers"))
      .map(_.id)
      .couchbaseGet[JsonDocument]()
      .filter(doc => doc.content().getString("type") == "beer")
      .cache()

    // Calculate the mean ABV across all beers
    println(beers
      .map(doc => doc.content().getDouble("abv").asInstanceOf[Double])
      .mean())

Slide 37

Couchbase Connector
Spark Core
§  Automatic Cluster and Resource Management
§  Creating and Persisting RDDs
§  Java APIs in addition to Scala (planned before GA)
Spark SQL
§  Easy JSON handling and querying
§  Tight N1QL Integration (DP2)
Spark Streaming
§  Persisting DStreams
§  DCP source (planned before GA)
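
As a hedged sketch of persisting an RDD through the connector (the exact save API may differ between the developer-preview versions; the document IDs and bucket setup are placeholders, not from the deck):

    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.document.json.JsonObject
    import com.couchbase.spark._

    // Build JsonDocuments in an RDD and write them to the configured bucket.
    val users = sc.parallelize(Seq("user::1", "user::2")).map { id =>
      JsonDocument.create(id, JsonObject.create().put("type", "user"))
    }
    users.saveToCouchbase()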

Slide 38

Connector Facts
•  Current Version: 1.0.0-dp
•  DP2 upcoming
•  GA planned for Q3
Code: https://github.com/couchbaselabs/couchbase-spark-connector
Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki

Slide 39

Links
•  Subscribe to the Couchbase newsletter: http://info.couchbase.com/Community-Newsletter-Signup.html
•  Couchbase is hiring a Solution Engineer in Paris: http://www.couchbase.com/careers
•  Join the Paris Couchbase Meetup: http://www.meetup.com/Couchbase-France/

Slide 40

Questions

Slide 41

Thanks
Matt Ingenthron, @ingenthr
Michael Nitschinger, @daschl