
Your Data and the World Beyond MapReduce


Kazunori Sato

July 04, 2015

Transcript

  1. #gcpug | #googlecloud. +Kazunori Sato (@kazunori_279), Kaz Sato, Developer Advocate, Cloud Platform, Google Inc. Cloud community advocacy; cloud product launch support.
  2. The World Beyond MapReduce: a timeline from 2002 to 2013 (GFS, MapReduce, Dremel, Flume, MillWheel) leading to today's BigQuery and Cloud Dataflow.
  3. What is Google BigQuery? Google BigQuery is: a fully managed query service; not a database with transactions.
  4. At Google, we don't use MapReduce for simple data analytics; we use Dremel (= Google BigQuery).
     SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC
     It scans 100B rows in ~20 sec, with no index used.
  5. Example: RegEx + GROUP BY on 100B rows
     SELECT language, SUM(views) AS views FROM (
       SELECT title, language, MAX(views) AS views
       FROM [helixdata2:benchmark.Wiki100B]
       WHERE REGEXP_MATCH(title, "G.*o.*")
       GROUP EACH BY title, language
     ) GROUP EACH BY language ORDER BY views DESC
  6. Example: RegEx + GROUP BY on 100B rows. Execution time: ~30 sec*. 4 TB of data read; 100 billion regular expressions run; 278 GB shuffled. With a single server (estimated numbers): 11.6 hours to read 4 TB from disk; 27 hours to run 100 billion regexps; 37 minutes to shuffle 278 GB. *These figures are an example of Google BigQuery performance in a specific case, and do not represent a performance guarantee.
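The single-server estimates above fall out of simple throughput arithmetic. A minimal sketch, assuming ~100 MB/s sequential disk reads, ~1M regexp evaluations per second, and ~125 MB/s (roughly gigabit) network for the shuffle; these rates are hypothetical round numbers, not figures from the deck:

```python
# Back-of-the-envelope check of the single-server estimates.
# Assumed throughputs (hypothetical round numbers):
DISK_MB_S = 100        # sequential disk read, MB/s
REGEX_PER_S = 1e6      # regexp evaluations per second
NET_MB_S = 125         # ~1 Gbit/s network, in MB/s

read_hours = (4e6 / DISK_MB_S) / 3600          # 4 TB = 4e6 MB
regex_hours = (100e9 / REGEX_PER_S) / 3600     # 100 billion regexps
shuffle_min = (278e3 / NET_MB_S) / 60          # 278 GB = 278e3 MB

print(f"read: {read_hours:.1f} h, regex: {regex_hours:.1f} h, shuffle: {shuffle_min:.0f} min")
```

Under these assumptions the numbers land close to the slide's 11.6 h / 27 h / 37 min; the point is that BigQuery does the same work in ~30 seconds by spreading it across thousands of disks and cores.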
  7. Record-Oriented Storage vs. Column-Oriented Storage: the column-oriented layout needs less bandwidth and allows more compression.
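The compression claim is easy to demonstrate: a columnar layout groups similar values together, so a generic compressor finds far more repetition than in interleaved rows. A toy sketch (data and serialization format invented for illustration):

```python
import zlib

# Toy records: (title, language, views). In row-oriented storage the fields
# interleave; in column-oriented storage each column's values sit together,
# so runs of repeated values compress far better.
rows = [("page%d" % (i % 50), "en", i % 3) for i in range(10_000)]

row_major = "".join(f"{t},{l},{v};" for t, l, v in rows).encode()
col_major = (
    "".join(t for t, _, _ in rows)
    + "".join(l for _, l, _ in rows)      # "enenen..." -> highly compressible
    + "".join(str(v) for _, _, v in rows)
).encode()

print(len(zlib.compress(row_major)), len(zlib.compress(col_major)))
```

The bandwidth half of the claim is structural: a query that touches only `language` reads only that column's bytes, not every record.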
  8. Massively Parallel Processing: select top(title), count(*) from publicdata:samples.wikipedia. Scanning 1 TB in 1 sec takes 5,000 disks; each query runs on thousands of servers.
  9. Fast Aggregation by Tree Structure: Mixer 0 at the root, Mixer 1 nodes below it, and leaf shards reading ColumnIO on Colossus. For a query like
     SELECT state, COUNT(*) AS count_babies WHERE year >= 1980 AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10
     each shard applies COUNT(*) GROUP BY state to its slice, and the mixers aggregate the partial results up the tree.
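The mixer/shard tree can be sketched in a few lines: leaf shards produce partial GROUP BY counts over their slice of the data, and mixers merge partials on the way up. Function names and the 4-shard/2-level layout are illustrative, not the actual Dremel topology:

```python
from collections import Counter

def shard_scan(rows):
    """Leaf shard: scan its slice, apply the filter and a partial COUNT(*) GROUP BY state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)

def mixer(partials):
    """Mixer node: merge partial aggregates coming up from its children."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

rows = [{"state": s, "year": y} for s in ("CA", "TX", "NY") for y in range(1975, 1995)]
shards = [rows[i::4] for i in range(4)]                 # data striped across 4 shards
mixer1a = mixer(shard_scan(s) for s in shards[:2])      # first-level mixers
mixer1b = mixer(shard_scan(s) for s in shards[2:])
mixer0 = mixer([mixer1a, mixer1b])                      # root mixer: final ORDER BY / LIMIT
print(mixer0.most_common(10))
```

Because COUNT is decomposable, every level only ships small partial aggregates, never raw rows; that is what makes the tree fast.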
  10. How to use BigQuery? BigQuery is an analytic service in the cloud: Import (Google Analytics, ETL tools, connectors, Google Cloud) → Analyze in BigQuery → Export (BI tools and visualization, spreadsheets, R, Hadoop, Google Cloud).
  11. Benefits of BigQuery: blazingly fast (capable of scanning 100B rows in ~20 sec); low cost (storage: $0.020 per GB per month; queries: $5 per TB); fully managed (use thousands of servers with zero ops); simple and intuitive SQL.
  12. The same timeline, 2002 to 2013 (GFS, MapReduce, Dremel, Flume, MillWheel), now highlighting Cloud Dataflow alongside BigQuery.
  13. Cloud Dataflow in Google Cloud Platform: stream sources (Cloud Pub/Sub, Cloud Logging) and batch sources (Google Cloud Storage) feed Cloud Dataflow, which writes to BigQuery, Cloud Storage, and Bigtable.
  14. Cloud Dataflow features: functional programming model; unified batch & stream processing; open source SDKs; fully managed runners.
  15. Pipeline: a Directed Acyclic Graph (DAG) of data flow, just like MapReduce/Hadoop jobs or Apache Spark jobs. Multiple inputs and outputs: Cloud Storage, BigQuery tables, Cloud Datastore, Cloud Pub/Sub, Cloud Bigtable … and any external systems.
  16. PCollections: a collection of data in a Pipeline. Represents any size of data, bounded or unbounded (stream), or key-value pairs. {Seahawks, NFC, Champions, Seattle, ...} or {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champion, …}>}
  17. The Auto Complete Example: read Tweets → ExtractTags → Count → ExpandPrefixes → Top(3) → write Predictions.
     Input: "#argentina scores, my #art project, watching #armenia vs #argentina"
     ExtractTags: #argentina #art #armenia #argentina
     Count: (argentina, 5M) (art, 9M) (armenia, 2M)
     ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
     Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]

     Pipeline p = Pipeline.create();
     p.begin()
      .apply(TextIO.Read.from(...))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.create())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(TextIO.Write.to(...));
     p.run();
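The per-stage transforms of the auto-complete pipeline can be sketched in plain Python, no Dataflow SDK required. The stage names come from the slide; the implementations (regex tag extraction, dict-based prefix index) are assumptions for illustration:

```python
import re
from collections import Counter, defaultdict

def extract_tags(tweet):
    # ExtractTags: pull hashtag words out of a tweet.
    return re.findall(r"#(\w+)", tweet)

tweets = ["#argentina scores", "my #art project", "watching #armenia vs #argentina"]

# Count: tag -> occurrence count
counts = Counter(tag for t in tweets for tag in extract_tags(t))

# ExpandPrefixes: emit (prefix, (tag, count)) for every prefix of every tag
prefixes = defaultdict(list)
for tag, n in counts.items():
    for i in range(1, len(tag) + 1):
        prefixes[tag[:i]].append((tag, n))

# Top(3): keep the 3 highest-count tags per prefix
top3 = {p: [t for t, _ in sorted(tags, key=lambda x: -x[1])[:3]]
        for p, tags in prefixes.items()}
print(top3["ar"])
```

In the real pipeline each stage is a parallel transform over a PCollection; this sketch just makes the data movement between stages concrete.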
  18. From Batch to Stream: as time passes, the top #ar* ranking shifts (game begins, #argentinagoal, armenia wins!, #argyle, #armeniarocks) and old data is aged out.
  19. Windows: a Window is a time slice of a PCollection. Fixed: hourly, daily, …; sliding: last one min, …; sessions: each session (e.g., nighttime / mid-day / nighttime activity).
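Window assignment itself is simple arithmetic: fixed windows bucket a timestamp by flooring it to the window size, and sliding windows assign it to every overlapping window. A minimal sketch with hypothetical helper names (integer-second timestamps assumed):

```python
def fixed_window(ts, size):
    """Assign a timestamp to its fixed window [start, start + size)."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """All sliding windows of length `size`, starting every `period`, that contain ts."""
    last_start = ts - ts % period
    return [(s, s + size)
            for s in range(last_start - size + period, last_start + 1, period)
            if s <= ts < s + size]

print(fixed_window(3725, 3600))      # hourly window containing t=3725 -> (3600, 7200)
print(sliding_windows(130, 60, 30))  # 60s windows every 30s -> [(90, 150), (120, 180)]
```

Note that with sliding windows one element lands in size/period windows at once, which is why sliding aggregates cost more than fixed ones.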
  20. Streaming with Cloud Pub/Sub:
     Pipeline p = Pipeline.create(new PipelineOptions());
     p.begin()
      .apply(PubsubIO.Read.topic("input_topic"))
      .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.perElement())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(PubsubIO.Write.topic("output_topic"));
     p.run();
  21. Life of a Pipeline: user code & SDK → optimize → schedule → execution on Google Cloud Dataflow, reading from and writing to GCS, with a monitoring UI.
  22. Execution Graph: inputs IN 1–IN 4 flow through ParallelDo steps (A, B, C, D, E, F), a count, a join, and a flatten, producing OUT 1 and OUT 2.
  23. Optimised Execution Graph: adjacent steps are fused (A+J1, B+D+J1, C+D+J1, E+J1, +F+Cnt) around GroupByKey operations. Legend: GBK = GroupByKey; + = CombineValues; other boxes = ParallelDo.
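The core optimization shown here, producer-consumer fusion, amounts to composing adjacent per-element steps into one function so the intermediate collections between them are never materialized. A toy sketch (step names echo the slide; the actual optimizer is far more involved):

```python
def fuse(*fns):
    """Fuse a chain of per-element steps into one pass, like the
    optimizer's producer-consumer fusion (e.g. C+D+J1): intermediate
    collections between the fused steps are never written out."""
    def fused(x):
        for f in fns:
            x = f(x)
        return x
    return fused

C = lambda x: x * 2        # hypothetical ParallelDo steps
D = lambda x: x + 1
J1 = lambda x: (x, "joined")

pipeline = fuse(C, D, J1)  # one pass per element instead of three
print(pipeline(10))        # (21, 'joined')
```

Fusion can only cross element-wise steps; a GroupByKey forces a materialization barrier, which is why the fused groups in the slide stop at the GBK nodes.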
  24. Benefits of Cloud Dataflow: fully managed (zero ops for hundreds of servers; scalability, fault tolerance, and optimization); batch + streaming (integrates the two paradigms with a single logic); easy pipeline design (use Java/Python to define an abstract pipeline).