
Your Data and the World Beyond MapReduce


Kazunori Sato

July 04, 2015

Transcript

  1. #gcpug | #googlecloud. +Kazunori Sato (@kazunori_279), Kaz Sato, Developer Advocate, Cloud Platform, Google Inc. Cloud community advocacy; cloud product launch support.
  2. The World Beyond MapReduce: a timeline from 2002 to 2013 (GFS, MapReduce, Dremel, Flume, MillWheel) leading to today's BigQuery and Cloud Dataflow.
  3. What is Google BigQuery? Google BigQuery is: a fully managed query service; not a database with transactions.
  4. At Google, we don't use MapReduce for simple data analytics; we use Dremel (= Google BigQuery).
     SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC
     It scans 100B rows in ~20 sec, with no index used.
  5. Example: RegEx + GROUP BY on 100B rows
     SELECT language, SUM(views) AS views FROM (
       SELECT title, language, MAX(views) AS views
       FROM [helixdata2:benchmark.Wiki100B]
       WHERE REGEXP_MATCH(title, "G.*o.*")
       GROUP EACH BY title, language
     ) GROUP EACH BY language ORDER BY views DESC
  6. Example: RegEx + GROUP BY on 100B rows. Execution time: ~30 sec*. 4 TB of data read; 100 billion regular expressions run; 278 GB shuffled. With a single server (estimated numbers): 11.6 hours to read 4 TB from disk; 27 hours to run 100 billion regexps; 37 minutes to shuffle 278 GB. *These figures are an example of Google BigQuery performance in a specific case, and do not represent a performance guarantee.
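The single-server estimates above fall out of simple throughput arithmetic. A minimal sketch, assuming ~100 MB/s sequential disk reads, ~1M regexp evaluations per second, and ~125 MB/s (roughly gigabit) network for the shuffle; these rates are hypothetical round numbers, not figures from the deck:

```python
# Back-of-the-envelope check of the single-server estimates.
# Assumed throughputs (hypothetical round numbers):
DISK_MB_S = 100        # sequential disk read, MB/s
REGEX_PER_S = 1e6      # regexp evaluations per second
NET_MB_S = 125         # ~1 Gbit/s network, in MB/s

read_hours = (4e6 / DISK_MB_S) / 3600          # 4 TB = 4e6 MB
regex_hours = (100e9 / REGEX_PER_S) / 3600     # 100 billion regexps
shuffle_min = (278e3 / NET_MB_S) / 60          # 278 GB = 278e3 MB

print(f"read: {read_hours:.1f} h, regex: {regex_hours:.1f} h, shuffle: {shuffle_min:.0f} min")
```

Under these assumptions the numbers land close to the slide's 11.6 h / 27 h / 37 min; the point is that BigQuery does the same work in ~30 seconds by spreading it across thousands of disks and cores.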
  7. Record-Oriented Storage vs. Column-Oriented Storage: the column-oriented layout needs less bandwidth and allows more compression.
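The compression claim is easy to demonstrate: a columnar layout groups similar values together, so a generic compressor finds far more repetition than in interleaved rows. A toy sketch (data and serialization format invented for illustration):

```python
import zlib

# Toy records: (title, language, views). In row-oriented storage the fields
# interleave; in column-oriented storage each column's values sit together,
# so runs of repeated values compress far better.
rows = [("page%d" % (i % 50), "en", i % 3) for i in range(10_000)]

row_major = "".join(f"{t},{l},{v};" for t, l, v in rows).encode()
col_major = (
    "".join(t for t, _, _ in rows)
    + "".join(l for _, l, _ in rows)      # "enenen..." -> highly compressible
    + "".join(str(v) for _, _, v in rows)
).encode()

print(len(zlib.compress(row_major)), len(zlib.compress(col_major)))
```

The bandwidth half of the claim is structural: a query that touches only `language` reads only that column's bytes, not every record.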
  8. Massively Parallel Processing: select top(title), count(*) from publicdata:samples.wikipedia. Scanning 1 TB in 1 sec takes 5,000 disks; each query runs on thousands of servers.
  9. Fast Aggregation by Tree Structure: Mixer 0 at the root, Mixer 1 nodes below it, and leaf shards reading ColumnIO on Colossus. For a query like
     SELECT state, COUNT(*) AS count_babies WHERE year >= 1980 AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10
     each shard applies COUNT(*) GROUP BY state to its slice, and the mixers aggregate the partial results up the tree.
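The mixer/shard tree can be sketched in a few lines: leaf shards produce partial GROUP BY counts over their slice of the data, and mixers merge partials on the way up. Function names and the 4-shard/2-level layout are illustrative, not the actual Dremel topology:

```python
from collections import Counter

def shard_scan(rows):
    """Leaf shard: scan its slice, apply the filter and a partial COUNT(*) GROUP BY state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)

def mixer(partials):
    """Mixer node: merge partial aggregates coming up from its children."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

rows = [{"state": s, "year": y} for s in ("CA", "TX", "NY") for y in range(1975, 1995)]
shards = [rows[i::4] for i in range(4)]                 # data striped across 4 shards
mixer1a = mixer(shard_scan(s) for s in shards[:2])      # first-level mixers
mixer1b = mixer(shard_scan(s) for s in shards[2:])
mixer0 = mixer([mixer1a, mixer1b])                      # root mixer: final ORDER BY / LIMIT
print(mixer0.most_common(10))
```

Because COUNT is decomposable, every level only ships small partial aggregates, never raw rows; that is what makes the tree fast.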
  10. How to use BigQuery? BigQuery is an analytic service in the cloud: Import (Google Analytics, ETL tools, connectors, Google Cloud) → Analyze in BigQuery → Export (BI tools and visualization, spreadsheets, R, Hadoop, Google Cloud).
  11. Benefits of BigQuery: blazingly fast (capable of scanning 100B rows in ~20 sec); low cost (storage: $0.020 per GB per month; queries: $5 per TB); fully managed (use thousands of servers with zero ops); simple and intuitive SQL.
  12. The same timeline, 2002 to 2013 (GFS, MapReduce, Dremel, Flume, MillWheel), now highlighting Cloud Dataflow alongside BigQuery.
  13. Cloud Dataflow in Google Cloud Platform: stream sources (Cloud Pub/Sub, Cloud Logging) and batch sources (Google Cloud Storage) feed Cloud Dataflow, which writes to BigQuery, Cloud Storage, and Bigtable.
  14. Cloud Dataflow features: functional programming model; unified batch & stream processing; open source SDKs; fully managed runners.
  15. Pipeline: a Directed Acyclic Graph (DAG) of data flow, just like MapReduce/Hadoop jobs or Apache Spark jobs. Multiple inputs and outputs: Cloud Storage, BigQuery tables, Cloud Datastore, Cloud Pub/Sub, Cloud Bigtable … and any external systems.
  16. PCollections: a collection of data in a Pipeline. Represents any size of data, bounded or unbounded (stream), or key-value pairs. {Seahawks, NFC, Champions, Seattle, ...} or {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champion, …}>}
  17. The Auto Complete Example: read Tweets → ExtractTags → Count → ExpandPrefixes → Top(3) → write Predictions.
     Input: "#argentina scores, my #art project, watching #armenia vs #argentina"
     ExtractTags: #argentina #art #armenia #argentina
     Count: (argentina, 5M) (art, 9M) (armenia, 2M)
     ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
     Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]

     Pipeline p = Pipeline.create();
     p.begin()
      .apply(TextIO.Read.from(...))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.create())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(TextIO.Write.to(...));
     p.run();
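The per-stage transforms of the auto-complete pipeline can be sketched in plain Python, no Dataflow SDK required. The stage names come from the slide; the implementations (regex tag extraction, dict-based prefix index) are assumptions for illustration:

```python
import re
from collections import Counter, defaultdict

def extract_tags(tweet):
    # ExtractTags: pull hashtag words out of a tweet.
    return re.findall(r"#(\w+)", tweet)

tweets = ["#argentina scores", "my #art project", "watching #armenia vs #argentina"]

# Count: tag -> occurrence count
counts = Counter(tag for t in tweets for tag in extract_tags(t))

# ExpandPrefixes: emit (prefix, (tag, count)) for every prefix of every tag
prefixes = defaultdict(list)
for tag, n in counts.items():
    for i in range(1, len(tag) + 1):
        prefixes[tag[:i]].append((tag, n))

# Top(3): keep the 3 highest-count tags per prefix
top3 = {p: [t for t, _ in sorted(tags, key=lambda x: -x[1])[:3]]
        for p, tags in prefixes.items()}
print(top3["ar"])
```

In the real pipeline each stage is a parallel transform over a PCollection; this sketch just makes the data movement between stages concrete.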
  18. From Batch to Stream: as time passes, the top #ar* ranking shifts (game begins, #argentinagoal, armenia wins!, #argyle, #armeniarocks) and old data is aged out.
  19. Windows: a Window is a time slice of a PCollection. Fixed: hourly, daily, …; sliding: last one min, …; sessions: each session (e.g., nighttime / mid-day / nighttime activity).
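Window assignment itself is simple arithmetic: fixed windows bucket a timestamp by flooring it to the window size, and sliding windows assign it to every overlapping window. A minimal sketch with hypothetical helper names (integer-second timestamps assumed):

```python
def fixed_window(ts, size):
    """Assign a timestamp to its fixed window [start, start + size)."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """All sliding windows of length `size`, starting every `period`, that contain ts."""
    last_start = ts - ts % period
    return [(s, s + size)
            for s in range(last_start - size + period, last_start + 1, period)
            if s <= ts < s + size]

print(fixed_window(3725, 3600))      # hourly window containing t=3725 -> (3600, 7200)
print(sliding_windows(130, 60, 30))  # 60s windows every 30s -> [(90, 150), (120, 180)]
```

Note that with sliding windows one element lands in size/period windows at once, which is why sliding aggregates cost more than fixed ones.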
  20. Streaming with Cloud Pub/Sub:
     Pipeline p = Pipeline.create(new PipelineOptions());
     p.begin()
      .apply(PubsubIO.Read.topic("input_topic"))
      .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.perElement())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(PubsubIO.Write.topic("output_topic"));
     p.run();
  21. Life of a Pipeline: user code & SDK → optimize → schedule → execution on Google Cloud Dataflow, reading from and writing to GCS, with a monitoring UI.
  22. Execution Graph: inputs IN 1–IN 4 flow through ParallelDo steps (A, B, C, D, E, F), a count, a join, and a flatten, producing OUT 1 and OUT 2.
  23. Optimised Execution Graph: adjacent steps are fused (A+J1, B+D+J1, C+D+J1, E+J1, +F+Cnt) around GroupByKey operations. Legend: GBK = GroupByKey; + = CombineValues; other boxes = ParallelDo.
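The core optimization shown here, producer-consumer fusion, amounts to composing adjacent per-element steps into one function so the intermediate collections between them are never materialized. A toy sketch (step names echo the slide; the actual optimizer is far more involved):

```python
def fuse(*fns):
    """Fuse a chain of per-element steps into one pass, like the
    optimizer's producer-consumer fusion (e.g. C+D+J1): intermediate
    collections between the fused steps are never written out."""
    def fused(x):
        for f in fns:
            x = f(x)
        return x
    return fused

C = lambda x: x * 2        # hypothetical ParallelDo steps
D = lambda x: x + 1
J1 = lambda x: (x, "joined")

pipeline = fuse(C, D, J1)  # one pass per element instead of three
print(pipeline(10))        # (21, 'joined')
```

Fusion can only cross element-wise steps; a GroupByKey forces a materialization barrier, which is why the fused groups in the slide stop at the GBK nodes.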
  24. Benefits of Cloud Dataflow: fully managed (zero ops for hundreds of servers; scalability, fault tolerance, and optimization); batch + streaming (integrates the two paradigms with a single logic); easy pipeline design (use Java/Python to define an abstract pipeline).