Your Data and the World Beyond MapReduce

Slide 1

Slide 1 text

#gcpug | #googlecloud Your Data and the World Beyond MapReduce

Slide 2

Slide 2 text

Kaz Sato (+Kazunori Sato, @kazunori_279)
Developer Advocate, Cloud Platform, Google Inc.
● Cloud community advocacy
● Cloud product launch support

Slide 3

Slide 3 text

The World Beyond MapReduce
[Timeline figure, axis 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel, with BigQuery and Cloud Dataflow shown as the corresponding cloud services]

Slide 4

Slide 4 text

Google BigQuery

Slide 5

Slide 5 text

What is Google BigQuery?
Google BigQuery is:
● A fully managed query service
● Not a database with transactions

Slide 6

Slide 6 text

BigQuery Web UI

Slide 7

Slide 7 text

At Google, we don’t use MapReduce for simple data analytics. We use Dremel = Google BigQuery.

    SELECT top(appId, 20) AS app, count(*) AS count
    FROM installlog.2012
    ORDER BY count DESC

It scans 100B rows in ~20 sec, with no index used.

Slide 8

Slide 8 text

BigQuery Demo

Slide 9

Slide 9 text

Example: RegEx + GROUP BY on 100B rows

    SELECT language, SUM(views) AS views
    FROM (
      SELECT title, language, MAX(views) AS views
      FROM [helixdata2:benchmark.Wiki100B]
      WHERE REGEXP_MATCH(title, "G.*o.*")
      GROUP EACH BY title, language
    )
    GROUP EACH BY language
    ORDER BY views DESC

Slide 10

Slide 10 text

Example: RegEx + GROUP BY on 100B rows
Execution time: ~30 sec*
● 4 TB of data read
● 100 billion regular expressions run
● 278 GB shuffled
With a single server (estimated numbers):
● 11.6 hours to read 4 TB from disk
● 27 hours to run 100 billion regexps
● 37 minutes to shuffle 278 GB
*These figures are an example of Google BigQuery performance in a specific case and do not represent a performance guarantee.
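The single-server estimates above imply particular per-machine rates. The following is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10**12 bytes) and assuming the three phases run one after another on the single server; neither assumption is stated on the slide.

```python
# Back-of-the-envelope check of the single-server estimates on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
TB = 10**12
GB = 10**9
MB = 10**6

read_seconds = 11.6 * 3600    # 11.6 hours to read 4 TB from disk
regex_seconds = 27 * 3600     # 27 hours to run 100 billion regexps
shuffle_seconds = 37 * 60     # 37 minutes to shuffle 278 GB

# Rates a single server would need to match those durations.
disk_mb_per_s = 4 * TB / read_seconds / MB
regex_per_s = 100e9 / regex_seconds
network_mb_per_s = 278 * GB / shuffle_seconds / MB

# Assumption: the phases do not overlap, so the durations simply add up.
total_seconds = read_seconds + regex_seconds + shuffle_seconds
speedup = total_seconds / 30  # compared with the ~30 sec quoted above

print(f"disk read: {disk_mb_per_s:.1f} MB/s")
print(f"regex:     {regex_per_s:,.0f} matches/s")
print(f"shuffle:   {network_mb_per_s:.1f} MB/s")
print(f"speedup:   ~{speedup:,.0f}x")
```

Under these assumptions the estimates correspond to a disk reading a little under 100 MB/s, about one million regular expression matches per second, and a network link of roughly 1 Gbps.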

Slide 11

Slide 11 text

Column Oriented Storage
[Figure: record-oriented storage compared with column-oriented storage]
Less bandwidth, more compression
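The "less bandwidth, more compression" claim can be illustrated with a toy table. This is only a sketch: the rows, the field names, and the run-length encoding are illustrative assumptions and say nothing about BigQuery's actual on-disk format.

```python
from itertools import groupby

# Hypothetical toy table with three fields per record.
rows = [
    {"title": "Google", "language": "en", "views": 120},
    {"title": "Go", "language": "en", "views": 30},
    {"title": "Gopher", "language": "en", "views": 50},
    {"title": "Gohan", "language": "ja", "views": 10},
]

# Record-oriented layout: whole records stored one after another.
record_store = [(r["title"], r["language"], r["views"]) for r in rows]

# Column-oriented layout: one contiguous list per column.
column_store = {
    name: [r[name] for r in rows] for name in ("title", "language", "views")
}

def sum_views_record(store):
    """Sum views; every field of every record has to be read."""
    values_touched = sum(len(record) for record in store)
    return sum(record[2] for record in store), values_touched

def sum_views_column(store):
    """Sum views; only the 'views' column is read."""
    column = store["views"]
    return sum(column), len(column)

def run_length_encode(column):
    """Compress runs of equal neighbouring values in one column."""
    return [(value, len(list(run))) for value, run in groupby(column)]

print(sum_views_record(record_store))
print(sum_views_column(column_store))
print(run_length_encode(column_store["language"]))
```

Both layouts return the same total, but the column layout touches one value per row instead of three, and values within a single column tend to repeat, which is what makes them compress well.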

Slide 12

Slide 12 text

Massively Parallel Processing

    SELECT top(title), count(*)
    FROM [publicdata:samples.wikipedia]

● Scanning 1 TB in 1 sec takes 5,000 disks
● Each query runs on thousands of servers
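The "1 TB in 1 sec takes 5,000 disks" figure implies a per-disk scan rate. A small sketch of the arithmetic, assuming decimal units and perfectly even parallelism (the helper name `disks_needed` is hypothetical):

```python
TB = 10**12
MB = 10**6

def disks_needed(bytes_to_scan, seconds, disk_bytes_per_s):
    """Smallest number of disks that can scan the data in the given time."""
    per_disk_bytes = seconds * disk_bytes_per_s
    return -(-bytes_to_scan // per_disk_bytes)  # integer ceiling division

# Per-disk rate implied by the slide: 1 TB spread across 5,000 disks in 1 sec.
implied_mb_per_s = TB // 5000 // MB

print(implied_mb_per_s)
print(disks_needed(TB, 1, implied_mb_per_s * MB))
```

So the slide's figure corresponds to each disk streaming about 200 MB per second, which is why a fast full scan needs thousands of disks working in parallel.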

Slide 13

Slide 13 text

Fast Aggregation by Tree Structure
[Figure: tree with Mixer 0 at the root, two Mixer 1 nodes below it, and four Shards reading ColumnIO on Colossus]
Query fragments shown alongside the tree:
● SELECT state, year
● COUNT(*) GROUP BY state
● WHERE year >= 1980 and year < 1990
● ORDER BY count_babies DESC LIMIT 10
● COUNT(*) GROUP BY state
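The tree structure can be sketched in plain Python. Which node runs which part of the query is an assumption here (shards filter and count, intermediate mixers merge partial counts, the root sorts and limits), and the sample rows are made up for illustration.

```python
from collections import Counter

def shard_scan(rows):
    """Leaf: apply the WHERE filter and count matching rows per state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)

def mixer_merge(partials):
    """Intermediate mixer: add up the partial counts from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def root_mixer(partials, limit=10):
    """Root mixer: merge, order by count descending, apply the limit."""
    merged = mixer_merge(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical (state, year) rows spread over four shards.
shards = [
    [("CA", 1981), ("TX", 1985), ("CA", 1979)],
    [("CA", 1989), ("NY", 1982)],
    [("TX", 1990), ("NY", 1983), ("CA", 1984)],
    [("TX", 1986), ("TX", 1987)],
]

leaf_results = [shard_scan(rows) for rows in shards]
level_one = [mixer_merge(leaf_results[:2]), mixer_merge(leaf_results[2:])]
result = root_mixer(level_one)
print(result)
```

Because counts can be added in any grouping, each level only passes small partial results upward instead of raw rows.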

Slide 14

Slide 14 text

How to use BigQuery?
[Figure: BigQuery as an analytic service in the cloud, with Import, Analyze, and Export stages. Labels: Google Analytics, ETL tools, Connectors, Google Cloud; BI tools and Visualization; Google Cloud, Spreadsheets, R, Hadoop]

Slide 15

Slide 15 text

BI Tools + BigQuery

Slide 16

Slide 16 text

IoT Example: RasPi > Fluentd > BigQuery > Google Spreadsheet

Slide 17

Slide 17 text

Customer Case: 7&i Net Media 7NSEC

Slide 18

Slide 18 text

Benefits of BigQuery
● Blazingly Fast: capable of scanning 100B rows in ~20 sec
● Low Cost: storage at $0.020 per GB per month, queries at $5 per TB
● Fully Managed: use thousands of servers with zero-ops
● SQL: simple and intuitive SQL
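The two prices above can be turned into a quick estimate. This sketch assumes the prices quoted on the slide, linear billing per decimal TB scanned and per GB stored, and no free tier or minimum charge; the function names are hypothetical and current pricing may differ.

```python
# Prices as quoted on the slide; treat them as a snapshot, not current pricing.
STORAGE_USD_PER_GB_MONTH = 0.020
QUERY_USD_PER_TB = 5.0

def monthly_storage_cost(gigabytes):
    """Storage cost for one month, assuming simple linear pricing."""
    return gigabytes * STORAGE_USD_PER_GB_MONTH

def query_cost(terabytes_scanned):
    """Cost of one query, assuming billing by the amount of data scanned."""
    return terabytes_scanned * QUERY_USD_PER_TB

# Example: a query that reads 4 TB, and keeping 4 TB (4,000 GB) stored.
print(query_cost(4))
print(monthly_storage_cost(4000))
```

Under these assumptions, a query that reads 4 TB would cost about $20, and storing that 4 TB would cost about $80 per month.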

Slide 19

Slide 19 text

Google Cloud Dataflow

Slide 20

Slide 20 text

[Timeline figure, axis 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel, and BigQuery, with Cloud Dataflow highlighted alongside Flume and MillWheel]

Slide 21

Slide 21 text

Cloud Dataflow Use Cases
ETL, Filtering, Enrichment, Shaping, Batch, Streaming, Composition, Orchestration

Slide 22

Slide 22 text

Cloud Dataflow in Google Cloud Platform
[Figure: stream and batch data paths through Cloud Dataflow. Labels: Cloud Pub/Sub, Cloud Logging, Cloud Dataflow, BigQuery, Cloud Storage, Bigtable, Google Cloud Storage]

Slide 23

Slide 23 text

Cloud Dataflow Features
● Functional programming model
● Unified batch & stream processing
● Open source SDKs
● Fully Managed Runners

Slide 24

Slide 24 text

Cloud Dataflow SDK for Java and Python

Slide 25

Slide 25 text

Cloud Dataflow: Concepts

Slide 26

Slide 26 text

Pipeline
A Pipeline is a Directed Acyclic Graph (DAG) of data flow, just like:
● MapReduce/Hadoop jobs
● Apache Spark jobs
Multiple inputs and outputs:
● Cloud Storage
● BigQuery tables
● Cloud Datastore
● Cloud Pub/Sub
● Cloud Bigtable
● … and any external systems

Slide 27

Slide 27 text

A collection of data in a Pipeline (a PCollection)
● Represents any size of data
● Bounded or unbounded (stream)
● Or key-value pairs
{Seahawks, NFC, Champions, Seattle, ...} {KV

Slide 28

Slide 28 text

Transforms
[Figure: MapReduce-style diagram with stages labelled M and R. Labels: GroupByKey, ParDo, Combine.GroupedValues]
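The three transform names on this slide can be mimicked with plain Python functions to show how they line up with map and reduce stages. This is a conceptual sketch over in-memory lists, not the Dataflow SDK; the helper names `par_do`, `group_by_key`, and `combine_grouped_values` are stand-ins chosen for illustration.

```python
from collections import defaultdict

def par_do(fn, elements):
    """Element-wise step: fn may emit zero or more outputs per element."""
    return [out for element in elements for out in fn(element)]

def group_by_key(pairs):
    """Shuffle step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def combine_grouped_values(fn, grouped):
    """Reduce step: collapse each key's list of values into one value."""
    return {key: fn(values) for key, values in grouped.items()}

# Word count, the classic MapReduce example.
lines = ["a b a", "b c"]
pairs = par_do(lambda line: [(word, 1) for word in line.split()], lines)
grouped = group_by_key(pairs)
counts = combine_grouped_values(sum, grouped)
print(counts)
```

Here `par_do` plays the role of the map stage, while `group_by_key` followed by `combine_grouped_values` plays the role of shuffle and reduce.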

Slide 29

Slide 29 text

The Auto Complete Example
[Figure: Tweets > read > ExtractTags > Count > ExpandPrefixes > Top(3) > write > Predictions]
● read: #argentina scores, my #art project, watching #armenia vs #argentina
● ExtractTags: #argentina #art #armenia #argentina
● Count: (argentina, 5M) (art, 9M) (armenia, 2M)
● ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
● Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]

    Pipeline p = Pipeline.create();
    p.begin()
        .apply(TextIO.Read.from(...))
        .apply(ParDo.of(new ExtractTags()))
        .apply(Count.create())
        .apply(ParDo.of(new ExpandPrefixes()))
        .apply(Top.largestPerKey(3))
        .apply(TextIO.Write.to(...));
    p.run();
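The stages of this example can be traced with a small in-memory sketch. It is plain Python rather than the Dataflow SDK, the function names are hypothetical stand-ins for ExtractTags, Count, ExpandPrefixes, and Top, and ties are broken alphabetically, which is an assumption the slide does not specify.

```python
import re
from collections import Counter, defaultdict

def extract_tags(tweet):
    """ExtractTags: pull the hashtags out of one tweet, without the '#'."""
    return re.findall(r"#(\w+)", tweet.lower())

def expand_prefixes(tag, count):
    """ExpandPrefixes: emit (prefix, (tag, count)) for every prefix of the tag."""
    for end in range(1, len(tag) + 1):
        yield tag[:end], (tag, count)

def autocomplete(tweets, n=3):
    # Count: number of occurrences of each tag.
    counts = Counter(tag for tweet in tweets for tag in extract_tags(tweet))
    # ExpandPrefixes, grouped by prefix.
    by_prefix = defaultdict(list)
    for tag, count in counts.items():
        for prefix, candidate in expand_prefixes(tag, count):
            by_prefix[prefix].append(candidate)
    # Top(n): keep the n most frequent tags for each prefix.
    return {
        prefix: [tag for tag, _ in sorted(cands, key=lambda c: (-c[1], c[0]))[:n]]
        for prefix, cands in by_prefix.items()
    }

tweets = [
    "#argentina scores",
    "my #art project",
    "watching #armenia vs #argentina",
]
predictions = autocomplete(tweets)
print(predictions["ar"])
print(predictions["arg"])
```

With only the three sample tweets, the counts are far smaller than the millions shown on the slide, so the ranking differs from the slide's output; the shape of the computation is the same.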

Slide 30

Slide 30 text

From Batch To Stream
[Figure: rank of #ar* hashtags over time, with the events "game begins" and "armenia wins!" and the tags #argyle, #armeniarocks, and #argentinagoal. Caption: Age out old data]

Slide 31

Slide 31 text

Windows
A Window is a time slice of a PCollection:
● Fixed: hourly, daily, …
● Sliding: last one min, ...
● Sessions: each session
[Figure: timeline labelled Nighttime, Mid-Day, Nighttime]
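The three window types can be sketched as functions that assign timestamps to time slices. This is a simplified model with integer timestamps in seconds and half-open intervals; the function names and the exact boundary rules are assumptions, not the SDK's definitions.

```python
def fixed_window(ts, size):
    """Fixed: each timestamp belongs to exactly one window of the given size."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Sliding: windows of the given size start every `period` seconds,
    so one timestamp can belong to several overlapping windows."""
    start = ts - ts % period
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

def sessions(timestamps, gap):
    """Sessions: a new window starts after `gap` seconds of inactivity."""
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][-1] < gap:
            result[-1].append(ts)
        else:
            result.append([ts])
    return result

print(fixed_window(3725, 3600))
print(sliding_windows(125, 60, 30))
print(sessions([0, 10, 20, 100, 105, 300], 30))
```

Fixed windows partition time, sliding windows overlap, and session windows depend on the data itself rather than on the clock.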

Slide 32

Slide 32 text

Streaming with Cloud PubSub

    Pipeline p = Pipeline.create(new PipelineOptions());
    p.begin()
        .apply(PubsubIO.Read.topic("input_topic"))
        .apply(Window.into(SlidingWindows.of(
            Duration.standardMinutes(60))))
        .apply(ParDo.of(new ExtractTags()))
        .apply(Count.perElement())
        .apply(ParDo.of(new ExpandPrefixes()))
        .apply(Top.largestPerKey(3))
        .apply(PubsubIO.Write.topic("output_topic"));
    p.run();

Slide 33

Slide 33 text

Cloud Dataflow: Fully Managed Runners

Slide 34

Slide 34 text

Life of a Pipeline
[Figure labels: User Code & SDK, Google Cloud Dataflow, Optimize, Schedule, GCS, GCS, Monitoring UI]

Slide 35

Slide 35 text

Execution Graph
[Figure: graph from inputs IN 1 to IN 4 to outputs OUT 1 and OUT 2, passing through ParallelDo steps A, B, C, D, E, and F and the operations join, flatten, and count]

Slide 36

Slide 36 text

Optimised Execution Graph
[Figure: the same graph from IN 1 to IN 4 to OUT 1 and OUT 2 after optimisation. Fused nodes: C+D+J 1, B+D+J 1, A+J 1, E+J 1, J 2 +F, Cnt. Legend: a node symbol for ParallelDo, GBK = GroupByKey, + = CombineValues]
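Fused nodes such as C+D+J 1 suggest that consecutive ParallelDo steps are merged into a single step. The sketch below illustrates that idea with plain functions; it is not the Dataflow optimizer, and the step names and the `fuse` helper are hypothetical.

```python
def run_unfused(steps, elements):
    """Run each step as its own stage, materialising every intermediate list."""
    intermediates = 0
    for step in steps:
        elements = [out for element in elements for out in step(element)]
        intermediates += 1
    return elements, intermediates

def fuse(*steps):
    """Combine several element-wise steps into one function per input element."""
    def fused(element):
        outputs = [element]
        for step in steps:
            outputs = [out for item in outputs for out in step(item)]
        return outputs
    return fused

# Three element-wise steps, each returning a list of outputs.
def tokenize(line):
    return line.split()

def lowercase(word):
    return [word.lower()]

def only_tags(word):
    return [word] if word.startswith("#") else []

lines = ["My #Art project", "watching #Armenia"]
unfused_result, stages = run_unfused([tokenize, lowercase, only_tags], lines)
fused_result, fused_stages = run_unfused([fuse(tokenize, lowercase, only_tags)], lines)
print(unfused_result, stages)
print(fused_result, fused_stages)
```

Both versions produce the same output, but the fused version runs as a single stage, so no intermediate collection has to be written out between steps.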

Slide 37

Slide 37 text

Real-time Monitoring UI

Slide 38

Slide 38 text

Worker Scaling
[Figure: number of workers at loads of 800 RPS, 1,200 RPS, 5,000 RPS, and 50 RPS]

Slide 39

Slide 39 text

Benefits of Cloud Dataflow
● Fully Managed: zero-ops for hundreds of servers, with scalability, fault-tolerance and optimization
● Batch + Streaming: integrates the two paradigms with a single logic
● Easy Pipeline Design: use Java/Python to define an abstract pipeline

Slide 40

Slide 40 text


Slide 41

Slide 41 text

Getting Started
● cloud.google.com/bigquery
● cloud.google.com/dataflow

Slide 42

Slide 42 text

Thank You!