Your Data and the World Beyond MapReduce

Kazunori Sato

July 04, 2015

Transcript

  1. #gcpug | #googlecloud Your Data and the World Beyond MapReduce

  2. #gcpug | #googlecloud +Kazunori Sato, @kazunori_279

    Kaz Sato, Developer Advocate, Cloud Platform, Google Inc. Cloud community advocacy. Cloud product launch support.
  3. #gcpug | #googlecloud The World Beyond MapReduce

    [Timeline, 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel; products: BigQuery, Cloud Dataflow]
  4. #gcpug | #googlecloud Google BigQuery

  5. #gcpug | #googlecloud What is Google BigQuery?

    Google BigQuery is:
    • A fully managed query service
    • Not a database with transactions
  6. #gcpug | #googlecloud BigQuery Web UI

  7. #gcpug | #googlecloud At Google, we don't use MapReduce for simple data analytics. We use Dremel (= Google BigQuery).

    SELECT top(appId, 20) AS app, count(*) AS count
    FROM installlog.2012
    ORDER BY count DESC

    It scans 100B rows in ~20 sec, with no index used.
  8. #gcpug | #googlecloud BigQuery Demo

  9. #gcpug | #googlecloud Example: RegEx + GROUP BY on 100B rows

    SELECT language, SUM(views) AS views
    FROM (
      SELECT title, language, MAX(views) AS views
      FROM [helixdata2:benchmark.Wiki100B]
      WHERE REGEXP_MATCH(title, "G.*o.*")
      GROUP EACH BY title, language
    )
    GROUP EACH BY language
    ORDER BY views DESC
  10. #gcpug | #googlecloud Example: RegEx + GROUP BY on 100B rows

    Execution time: ~30 sec*
    • 4 TB of data read
    • 100 Billion regular expressions run
    • 278 GB shuffled

    With a single server (estimated numbers):
    • 11.6 hours to read 4 TB from disk
    • 27 hours to run 100 Billion regexps
    • 37 minutes to shuffle 278 GB

    *These figures are an example of Google BigQuery performance in a specific case, but do not represent a performance guarantee.
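
The single-server estimates on this slide can be sanity-checked with a little arithmetic. The sketch below derives the throughput each figure implies. It assumes decimal units (1 TB = 1e12 bytes) and that the three phases run one after another on the single server; neither assumption is stated on the slide, and the class and method names are illustrative only.

```java
// Back-of-the-envelope check of the single-server estimates quoted on the slide.
// Assumptions (not from the slide): decimal units, phases executed serially.
class SingleServerEstimate {
    static final double TB = 1e12, GB = 1e9, MB = 1e6;

    // Implied sequential read rate in MB/s: 4 TB read in 11.6 hours.
    static double readRateMBps() {
        return 4 * TB / (11.6 * 3600) / MB;
    }

    // Implied regexp throughput per second: 100 billion matches in 27 hours.
    static double regexpsPerSecond() {
        return 100e9 / (27 * 3600.0);
    }

    // Implied shuffle rate in MB/s: 278 GB moved in 37 minutes.
    static double shuffleRateMBps() {
        return 278 * GB / (37 * 60) / MB;
    }

    // Total single-server time in seconds if the three phases run back to back.
    static double totalSeconds() {
        return 11.6 * 3600 + 27 * 3600 + 37 * 60;
    }

    public static void main(String[] args) {
        System.out.printf("read: %.1f MB/s%n", readRateMBps());
        System.out.printf("regexps: %.0f per second%n", regexpsPerSecond());
        System.out.printf("shuffle: %.1f MB/s%n", shuffleRateMBps());
        System.out.printf("serial total: %.1f hours%n", totalSeconds() / 3600);
    }
}
```

Under these assumptions the estimates correspond to roughly 96 MB/s of disk read, about one million regexps per second, and about 125 MB/s of shuffle bandwidth. The serial total is about 39 hours, against ~30 sec on BigQuery.
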
  11. #gcpug | #googlecloud Column Oriented Storage

    [Diagram: Record Oriented Storage vs. Column Oriented Storage]
    Less bandwidth, more compression
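
To make "less bandwidth, more compression" concrete, here is a toy sketch in plain Java. The table, its field names, and its values are hypothetical, and it says nothing about how BigQuery actually lays data out on disk. It only counts how many fields a scan touches in each layout and how many runs of repeated values a column contains.

```java
import java.util.Arrays;
import java.util.List;

// Toy comparison of record-oriented and column-oriented layouts (hypothetical data).
class ColumnVsRecord {
    // Record-oriented: each row keeps all of its fields together.
    static final List<String[]> ROWS = Arrays.asList(
        new String[] {"Google", "en", "120"},
        new String[] {"Golang", "en", "45"},
        new String[] {"Tokyo", "ja", "80"});

    // Column-oriented: each field is stored as its own contiguous array.
    static final String[] LANGUAGE = {"en", "en", "ja"};
    static final int[] VIEWS = {120, 45, 80};

    // Reading views from records drags every field of every row along.
    static int fieldsTouchedByRecordScan() {
        int touched = 0;
        for (String[] row : ROWS) touched += row.length;
        return touched;
    }

    // Reading views from the column touches only that column.
    static int fieldsTouchedByColumnScan() {
        return VIEWS.length;
    }

    // Number of runs of equal adjacent values; fewer runs means run-length
    // encoding of the column would compress better.
    static int runs(String[] column) {
        int count = 0;
        for (int i = 0; i < column.length; i++)
            if (i == 0 || !column[i].equals(column[i - 1])) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println("record scan touches " + fieldsTouchedByRecordScan() + " fields");
        System.out.println("column scan touches " + fieldsTouchedByColumnScan() + " fields");
        System.out.println("language column runs: " + runs(LANGUAGE));
    }
}
```
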
  12. #gcpug | #googlecloud Massively Parallel Processing

    select top(title), count(*) from publicdata:samples.wikipedia

    • Scanning 1 TB in 1 sec takes 5,000 disks
    • Each query runs on thousands of servers
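
The 5,000-disk figure follows from dividing the scan across disks. A minimal sketch of that arithmetic is below. The per-disk throughput of 200 MB/s is not on the slide; it is the value implied by 1 TB / 5,000 disks / 1 sec, assuming decimal units. The class and method names are illustrative.

```java
// How many disks must be read in parallel to scan a data set within a time budget.
class ScanParallelism {
    // Ceiling of bytesToScan / (seconds * bytesPerSecondPerDisk).
    static long disksNeeded(long bytesToScan, long seconds, long bytesPerSecondPerDisk) {
        long bytesPerDisk = seconds * bytesPerSecondPerDisk;
        return (bytesToScan + bytesPerDisk - 1) / bytesPerDisk;
    }

    public static void main(String[] args) {
        long oneTerabyte = 1_000_000_000_000L;   // decimal TB (assumption)
        long perDiskRate = 200_000_000L;         // 200 MB/s, implied by the slide's numbers
        System.out.println(disksNeeded(oneTerabyte, 1, perDiskRate) + " disks");
    }
}
```
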
  13. #gcpug | #googlecloud Fast Aggregation by Tree Structure

    [Diagram: Mixer 0 at the root, Mixer 1 nodes below it, Shards at the leaves reading ColumnIO on Colossus. Query fragments shown on the diagram: SELECT state, year / COUNT(*) GROUP BY state / WHERE year >= 1980 and year < 1990 / ORDER BY count_babies DESC LIMIT 10]
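
The tree-structured aggregation on this slide can be imitated in a few lines of plain Java. This is only a sketch of the idea, not Dremel's implementation: the `Birth` rows, the shard contents, and the method names are hypothetical. Each shard filters and counts its own rows, the mixers merge partial counts, and the root applies the final ordering and limit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of tree aggregation: shards -> mixers -> root (hypothetical data).
class TreeAggregation {
    static final class Birth {
        final String state;
        final int year;
        Birth(String state, int year) { this.state = state; this.year = year; }
    }

    // Leaf: COUNT(*) GROUP BY state WHERE year >= 1980 AND year < 1990, on one shard.
    static Map<String, Long> shardCount(List<Birth> rows) {
        Map<String, Long> counts = new HashMap<>();
        for (Birth b : rows)
            if (b.year >= 1980 && b.year < 1990) counts.merge(b.state, 1L, Long::sum);
        return counts;
    }

    // Mixer: merge partial counts coming from the level below.
    static Map<String, Long> mix(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partial : partials)
            partial.forEach((state, n) -> merged.merge(state, n, Long::sum));
        return merged;
    }

    // Root: order by count descending (ties by state name) and keep the first `limit`.
    static List<Map.Entry<String, Long>> root(Map<String, Long> merged, int limit) {
        return merged.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    // Four shards, two mid-level mixers, one root mixer.
    static List<Map.Entry<String, Long>> example() {
        Map<String, Long> s1 = shardCount(Arrays.asList(
            new Birth("CA", 1985), new Birth("TX", 1981), new Birth("CA", 1979)));
        Map<String, Long> s2 = shardCount(Arrays.asList(
            new Birth("CA", 1989), new Birth("NY", 1984)));
        Map<String, Long> s3 = shardCount(Arrays.asList(
            new Birth("TX", 1990), new Birth("TX", 1986)));
        Map<String, Long> s4 = shardCount(Arrays.asList(
            new Birth("CA", 1980), new Birth("NY", 1995)));
        Map<String, Long> left = mix(Arrays.asList(s1, s2));
        Map<String, Long> right = mix(Arrays.asList(s3, s4));
        return root(mix(Arrays.asList(left, right)), 10);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Because counts are merged with a plain sum, the mixers can be stacked to any depth without changing the result, which is what lets the work fan out over many shards.
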
  14. #gcpug | #googlecloud BigQuery: Analytic Service in the Cloud

    How to use BigQuery?
    [Diagram: Import, Analyze, and Export stages around BigQuery. Labels: Google Analytics, ETL tools, Connectors, Google Cloud; BI tools and Visualization; Google Cloud, Spreadsheets, R, Hadoop]
  15. #gcpug | #googlecloud BI Tools + BigQuery

  16. #gcpug | #googlecloud IoT Example: RasPi > Fluentd > BigQuery > Google Spreadsheet
  17. #gcpug | #googlecloud Customer Case: 7&i Net Media 7NSEC

  18. #gcpug | #googlecloud Benefits of BigQuery

    • Blazingly Fast: capable of scanning 100B rows in ~20 sec
    • Low Cost: storage $0.020 per GB per month; queries $5 per TB
    • Fully Managed: use thousands of servers with zero-ops
    • SQL: simple and intuitive SQL
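
The two prices on this slide are enough for a rough cost estimate. The sketch below uses only those list prices as quoted in this 2015 talk and ignores any free tier, minimum charge, or later price change; the class and method names are illustrative.

```java
// Rough BigQuery cost estimate from the prices quoted on the slide (2015 figures).
class BigQueryCost {
    static final double STORAGE_USD_PER_GB_MONTH = 0.020;
    static final double QUERY_USD_PER_TB = 5.0;

    // Cost of a query that scans the given number of terabytes.
    static double queryCostUsd(double terabytesScanned) {
        return terabytesScanned * QUERY_USD_PER_TB;
    }

    // Cost of keeping the given number of gigabytes stored for some months.
    static double storageCostUsd(double gigabytes, double months) {
        return gigabytes * months * STORAGE_USD_PER_GB_MONTH;
    }

    public static void main(String[] args) {
        // The 4 TB regexp example from earlier in the deck.
        System.out.printf("query over 4 TB: $%.2f%n", queryCostUsd(4));
        System.out.printf("storing 4,000 GB for one month: $%.2f%n", storageCostUsd(4000, 1));
    }
}
```

At these prices a query that scans 4 TB costs $20, and storing 4,000 GB for a month costs $80.
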
  19. #gcpug | #googlecloud Google Cloud Dataflow

  20. #gcpug | #googlecloud Cloud Dataflow

    [Timeline, 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel; products: BigQuery, Cloud Dataflow]
  21. #gcpug | #googlecloud Cloud Dataflow Use Cases

    • ETL: filtering, enrichment, shaping
    • Batch and streaming
    • Composition and orchestration
  22. #gcpug | #googlecloud Cloud Dataflow in Google Cloud Platform

    [Diagram with a Stream path and a Batch path. Labels: Cloud Pub/Sub, Cloud Logging, Cloud Dataflow, BigQuery, Cloud Storage, Cloud Dataflow, Bigtable, Google Cloud Storage]
  23. #gcpug | #googlecloud Cloud Dataflow Features

    • Functional programming model
    • Unified batch & stream processing
    • Open source SDKs
    • Fully Managed Runners
  24. #gcpug | #googlecloud Cloud Dataflow SDK for Java and Python

  25. #gcpug | #googlecloud Cloud Dataflow: Concepts

  26. #gcpug | #googlecloud Pipeline

    A Pipeline is a Directed Acyclic Graph (DAG) of data flow, just like:
    • MapReduce/Hadoop jobs
    • Apache Spark jobs

    Multiple Inputs and Outputs:
    • Cloud Storage
    • BigQuery tables
    • Cloud Datastore
    • Cloud Pub/Sub
    • Cloud Bigtable
    • … and any external systems
  27. #gcpug | #googlecloud PCollections

    • A collection of data in a Pipeline
    • Represents any size of data
    • Bounded or Unbounded (Stream)
    • Or Key-Value Pairs

    {Seahawks, NFC, Champions, Seattle, ...}
    {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champion, …}>}
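
The key-value example on this slide groups words by their first letter. As a rough illustration of that shape, here is the same grouping with plain Java collections. It is not the Dataflow SDK and not a real PCollection; the class and method names are made up for the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogue of turning a collection into key-value groups.
class KeyValueGrouping {
    // Key each word by its first character and collect the words per key.
    static Map<Character, List<String>> byFirstLetter(List<String> words) {
        return words.stream().collect(
            Collectors.groupingBy(w -> w.charAt(0), TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Seahawks", "NFC", "Champions", "Seattle");
        System.out.println(byFirstLetter(words));
    }
}
```
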
  28. #gcpug | #googlecloud Transforms

    [Diagram: nodes M, M, M and R, R. Labels: ParDo, GroupByKey, Combine.GroupedValues]
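
One way to read this slide is that ParDo plays the role of the map step, GroupByKey the shuffle, and Combine.GroupedValues the reduce step. The sketch below mimics the three transforms with plain Java collections under that reading. The method names echo the slide, but the signatures are invented for illustration and are not the Dataflow SDK's.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java analogues of ParDo, GroupByKey and Combine.GroupedValues.
class TransformsSketch {
    // ParDo analogue: run a function on each element; it may emit zero or more outputs.
    static <I, O> List<O> parDo(List<I> input, Function<I, List<O>> fn) {
        List<O> out = new ArrayList<>();
        for (I element : input) out.addAll(fn.apply(element));
        return out;
    }

    // GroupByKey analogue: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> kv : pairs)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return grouped;
    }

    // Combine.GroupedValues analogue: reduce each key's (non-empty) value list to one value.
    static <K, V> Map<K, V> combineGroupedValues(Map<K, List<V>> grouped, BinaryOperator<V> op) {
        Map<K, V> combined = new TreeMap<>();
        grouped.forEach((k, vs) -> combined.put(k, vs.stream().reduce(op).get()));
        return combined;
    }

    // Word count expressed with the three steps above.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = parDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\s+")) out.add(new AbstractMap.SimpleEntry<>(word, 1));
            return out;
        });
        return combineGroupedValues(groupByKey(pairs), Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```
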
  29. #gcpug | #googlecloud The Auto Complete Example

    Tweets -> read -> ExtractTags -> Count -> ExpandPrefixes -> Top(3) -> write -> Predictions

    read: #argentina scores, my #art project, watching #armenia vs #argentina
    ExtractTags: #argentina #art #armenia #argentina
    Count: (argentina, 5M) (art, 9M) (armenia, 2M)
    ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
    Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]

    Pipeline p = Pipeline.create();
    p.begin()
      .apply(TextIO.Read.from(...))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.create())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(TextIO.Write.to(...));
    p.run();
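
To see what each stage of the auto-complete pipeline computes, the stages can be simulated on the slide's three sample tweets with plain Java collections. This is a sketch, not the Dataflow SDK: the method names mirror the stage names on the slide, but their bodies are assumptions. Counts come from the three sample tweets rather than the millions shown on the slide, and ties are broken alphabetically, which the slide does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Plain-Java simulation of ExtractTags -> Count -> ExpandPrefixes -> Top(n).
class AutoCompleteSketch {
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // ExtractTags: pull every hashtag (without the '#') out of each tweet.
    static List<String> extractTags(List<String> tweets) {
        List<String> tags = new ArrayList<>();
        for (String tweet : tweets) {
            Matcher m = TAG.matcher(tweet);
            while (m.find()) tags.add(m.group(1));
        }
        return tags;
    }

    // Count: number of occurrences of each tag.
    static Map<String, Long> count(List<String> tags) {
        return tags.stream().collect(
            Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    // ExpandPrefixes + Top(n): for every prefix of every tag, keep the n most frequent tags.
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet())
            for (int i = 1; i <= tag.length(); i++)
                byPrefix.computeIfAbsent(tag.substring(0, i), p -> new ArrayList<>()).add(tag);

        // Most frequent first; alphabetical order breaks ties (an assumption).
        Comparator<String> ranking = Comparator.comparing((String t) -> counts.get(t))
            .reversed()
            .thenComparing(Comparator.<String>naturalOrder());
        for (Map.Entry<String, List<String>> e : byPrefix.entrySet()) {
            List<String> ranked = e.getValue();
            ranked.sort(ranking);
            e.setValue(new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size()))));
        }
        return byPrefix;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "#argentina scores", "my #art project", "watching #armenia vs #argentina");
        Map<String, List<String>> predictions = topPerPrefix(count(extractTags(tweets)), 3);
        System.out.println("ar  -> " + predictions.get("ar"));
        System.out.println("arg -> " + predictions.get("arg"));
    }
}
```
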
  30. #gcpug | #googlecloud From Batch To Stream

    [Diagram: rank of #ar* hashtags over time. Labels: game begins, armenia wins!, #argyle, #armeniarocks, #argentinagoal, Age out old data]
  31. #gcpug | #googlecloud Windows

    A Window is a time slice of a PCollection:
    • Fixed: hourly, daily, …
    • Sliding: last one min, ...
    • Sessions: each session

    [Diagram labels: Nighttime, Mid-Day, Nighttime]
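
The three window types differ in how a timestamp is assigned to a window. The sketch below shows one plausible assignment rule for each, in plain Java with millisecond timestamps. It is not the Dataflow SDK's windowing code. The gap-based definition of a session, meaning a new session starts after a pause longer than `gapMs`, is an assumption, since the slide only says "each session".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of window assignment for fixed, sliding and session windows.
class WindowAssignment {
    // Fixed: each timestamp belongs to exactly one window [start, start + size).
    static long fixedWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Sliding: windows of length sizeMs start every periodMs; a timestamp belongs to
    // every window whose start lies in (timestamp - size, timestamp]. Newest first.
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMs - Math.floorMod(timestampMs, periodMs);
        for (long start = latest; start > timestampMs - sizeMs; start -= periodMs)
            starts.add(start);
        return starts;
    }

    // Sessions: sort the timestamps and start a new session after a gap longer than gapMs.
    static List<List<Long>> sessions(List<Long> timestampsMs, long gapMs) {
        List<Long> sorted = new ArrayList<>(timestampsMs);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        for (long t : sorted) {
            if (current == null || t - current.get(current.size() - 1) > gapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        System.out.println(fixedWindowStart(125_000L, minute));
        System.out.println(slidingWindowStarts(125_000L, 3 * minute, minute));
        System.out.println(sessions(Arrays.asList(0L, 1_000L, 2_000L, 100_000L, 101_000L), 30_000L));
    }
}
```

A timestamp lands in one fixed window but in size / period sliding windows, which is why sliding windows multiply the amount of data downstream steps see.
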
  32. #gcpug | #googlecloud Streaming with Cloud PubSub

    Pipeline p = Pipeline.create(new PipelineOptions());
    p.begin()
      .apply(PubsubIO.Read.topic("input_topic"))
      .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.perElement())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(PubsubIO.Write.topic("output_topic"));
    p.run();
  33. #gcpug | #googlecloud Cloud Dataflow: Fully Managed Runners

  34. #gcpug | #googlecloud Life of a Pipeline

    [Diagram labels: User Code & SDK, Google Cloud Dataflow (Optimize, Schedule), GCS, GCS, Monitoring UI]
  35. #gcpug | #googlecloud Execution Graph

    [Diagram: graph with inputs IN 1 to IN 4, ParallelDo steps A to F, the operations flatten, count, and join, and outputs OUT 1 and OUT 2. Legend: ParallelDo]
  36. #gcpug | #googlecloud Optimised Execution Graph

    [Diagram: the optimised graph with inputs IN 1 to IN 4 and outputs OUT 1 and OUT 2. Node labels: A+J1, B+D+J1, C+D+J1, E+J1, J2+F, Cnt, GBK. Legend: ParallelDo, GBK = GroupByKey, + = CombineValues]
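
One optimisation the graph hints at is fusing consecutive ParallelDo steps into a single stage. Reading a label such as A followed by J1 that way is an interpretation of the diagram, not something the slide states. Under that reading, fusion amounts to function composition, as the plain-Java sketch below shows. The functions standing in for A and J1 are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: fusing two element-wise steps into one pass via function composition.
class FusionSketch {
    // Hypothetical stand-ins for the steps labelled A and J1 on the slide.
    static final Function<Integer, Integer> A = x -> x * 2;
    static final Function<Integer, String> J1 = x -> "v" + x;

    // Element-wise step: apply fn to every element and materialise the result.
    static <T, R> List<R> parallelDo(List<T> input, Function<T, R> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    // Unfused: two passes with an intermediate collection between them.
    static List<String> unfused(List<Integer> input) {
        return parallelDo(parallelDo(input, A), J1);
    }

    // Fused: one pass over the input, no intermediate collection.
    static List<String> fused(List<Integer> input) {
        return parallelDo(input, A.andThen(J1));
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        System.out.println(unfused(input).equals(fused(input)));
    }
}
```
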
  37. #gcpug | #googlecloud Real-time Monitoring UI

  38. #gcpug | #googlecloud Worker Scaling

    [Diagram: load levels labeled 800 RPS, 1,200 RPS, 5,000 RPS, and 50 RPS]
  39. #gcpug | #googlecloud Benefits of Cloud Dataflow

    • Fully Managed: zero-ops for hundreds of servers; scalability, fault-tolerance and optimization
    • Batch + Streaming: integrates the two paradigms with a single logic
    • Easy Pipeline Design: use Java/Python to define an abstract pipeline
  40. #gcpug | #googlecloud

  41. #gcpug | #googlecloud Getting Started: cloud.google.com/bigquery, cloud.google.com/dataflow

  42. #gcpug | #googlecloud Thank You!