Your Data and the World Beyond MapReduce

Kazunori Sato

July 04, 2015

Transcript

  1. #gcpug | #googlecloud
    Your Data and the World Beyond MapReduce

  2. Kaz Sato
    +Kazunori Sato, @kazunori_279
    Developer Advocate, Cloud Platform, Google Inc.
    ● Cloud community advocacy
    ● Cloud product launch support

  3. The World Beyond MapReduce
    [Timeline figure, 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel, BigQuery, Cloud Dataflow]

  4. Google BigQuery

  5. What is Google BigQuery?
    Google BigQuery is:
    ● A fully managed query service
    ● Not a database with transactions

  6. BigQuery Web UI

  7. At Google, we don’t use MapReduce for simple data analytics.
    We use Dremel = Google BigQuery
      SELECT
        top(appId, 20) AS app,
        count(*) AS count
      FROM installlog.2012
      ORDER BY
        count DESC
    It scans 100B rows in ~20 sec, with no index used.

  8. BigQuery Demo

  9. Example: RegEx + GROUP BY on 100B rows
      SELECT language, SUM(views) AS views
      FROM (
        SELECT title, language, MAX(views) AS views
        FROM [helixdata2:benchmark.Wiki100B]
        WHERE REGEXP_MATCH(title, "G.*o.*")
        GROUP EACH BY title, language
      )
      GROUP EACH BY language
      ORDER BY views DESC

  10. Example: RegEx + GROUP BY on 100B rows
    Execution time: ~30 sec*
    ● 4 TB of data read
    ● 100 billion regular expressions run
    ● 278 GB shuffled
    With a single server (estimated numbers):
    ● 11.6 hours to read 4 TB from disk
    ● 27 hours to run 100 billion regexps
    ● 37 minutes to shuffle 278 GB
    *These figures are an example of Google BigQuery performance in a specific case, but do not represent a performance guarantee.
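
The single-server estimates above can be roughly reproduced with simple arithmetic. The sketch below is not from the talk: the three rates (100 MiB/s sequential disk read, 1 microsecond per regular expression, a 1 Gbps network link) are assumptions chosen because they land close to the slide's figures, at about 11.65 hours, 27.8 hours, and 37.1 minutes.

```java
// Back-of-envelope check of the single-server estimates.
// All three rates are assumptions, not figures stated in the talk.
class SingleServerEstimate {
    static final double DISK_BYTES_PER_SEC = 100.0 * 1024 * 1024; // assumed 100 MiB/s sequential read
    static final double SECONDS_PER_REGEX = 1e-6;                  // assumed 1 microsecond per match
    static final double NET_BYTES_PER_SEC = 1e9 / 8;               // assumed 1 Gbps link

    static double readHours(double bytes) {
        return bytes / DISK_BYTES_PER_SEC / 3600;
    }

    static double regexHours(double count) {
        return count * SECONDS_PER_REGEX / 3600;
    }

    static double shuffleMinutes(double bytes) {
        return bytes / NET_BYTES_PER_SEC / 60;
    }

    public static void main(String[] args) {
        double fourTiB = 4.0 * 1024 * 1024 * 1024 * 1024; // 4 TB read as binary terabytes
        System.out.printf("read:    %.2f hours%n", readHours(fourTiB));
        System.out.printf("regex:   %.2f hours%n", regexHours(100e9));
        System.out.printf("shuffle: %.2f minutes%n", shuffleMinutes(278e9));
    }
}
```

Under these assumptions the three steps total well over a day on one machine, against the ~30 seconds reported for the distributed query.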

  11. Column Oriented Storage
    [Figure: Record Oriented Storage vs. Column Oriented Storage]
    Less bandwidth, more compression
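
To illustrate the "less bandwidth" point, here is a minimal sketch that compares how many bytes a scan of a single numeric column touches under each layout. The `Row` type and its three fields are hypothetical, loosely modeled on the Wikipedia query earlier in the deck, and compression is not modeled at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ColumnVsRecord {
    // Hypothetical three-field record; not a real BigQuery schema.
    static final class Row {
        final String title;
        final String language;
        final long views;

        Row(String title, String language, long views) {
            this.title = title;
            this.language = language;
            this.views = views;
        }
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Record-oriented: fields of a row are stored together, so summing
    // `views` still reads the title and language bytes of every row.
    static long bytesReadRecordOriented(List<Row> rows) {
        long bytes = 0;
        for (Row r : rows) {
            bytes += utf8Length(r.title) + utf8Length(r.language) + Long.BYTES;
        }
        return bytes;
    }

    // Column-oriented: the `views` values are stored contiguously,
    // so the same scan reads only that column.
    static long bytesReadColumnOriented(List<Row> rows) {
        return (long) rows.size() * Long.BYTES;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("Google", "en", 10L),
                new Row("Go", "ja", 5L));
        System.out.println("record-oriented bytes: " + bytesReadRecordOriented(rows));
        System.out.println("column-oriented bytes: " + bytesReadColumnOriented(rows));
    }
}
```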

  12. Massively Parallel Processing
      select top(title), count(*)
      from publicdata:samples.wikipedia
    Scanning 1 TB in 1 sec takes 5,000 disks
    Each query runs on thousands of servers
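
The 5,000-disk figure follows from dividing the data size by what one disk can deliver in the time budget. The sketch below assumes 200 MB/s per disk, which is inferred from the slide's numbers rather than stated in the talk.

```java
class ScanParallelism {
    // Number of disks needed to scan `bytes` within `seconds`,
    // given the throughput of a single disk.
    static long disksNeeded(double bytes, double seconds, double perDiskBytesPerSec) {
        return (long) Math.ceil(bytes / (seconds * perDiskBytesPerSec));
    }

    public static void main(String[] args) {
        double oneTB = 1e12;
        double assumedDiskRate = 200e6; // assumption: 200 MB/s per disk
        System.out.println(disksNeeded(oneTB, 1.0, assumedDiskRate));
    }
}
```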

  13. Fast Aggregation by Tree Structure
    [Figure: tree of Mixer 0, two Mixer 1 nodes, and four Shards reading ColumnIO on Colossus]
    Query fragments shown on the tree:
      SELECT state, year
      COUNT(*)
      GROUP BY state
      WHERE year >= 1980 and year < 1990
      ORDER BY count_babies DESC
      LIMIT 10
      COUNT(*)
      GROUP BY state
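
The tree-structured aggregation can be sketched in plain Java: each shard counts rows for its own slice, each mixer merges the partial counts of its children, and the root applies the final ordering and limit. This is an illustration of the general idea only; the method names are hypothetical and it does not reflect how Dremel or BigQuery is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TreeAggregation {
    // Leaf level: a shard counts rows per state for its own slice of the data.
    static Map<String, Long> shardCount(List<String> states) {
        Map<String, Long> counts = new HashMap<>();
        for (String s : states) {
            counts.merge(s, 1L, Long::sum);
        }
        return counts;
    }

    // Mixer level: merge the partial counts produced by child nodes.
    static Map<String, Long> mix(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((state, n) -> merged.merge(state, n, Long::sum));
        }
        return merged;
    }

    // Root: order by count descending (ties broken by key) and apply the limit.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Long> left = mix(Arrays.asList(
                shardCount(Arrays.asList("CA", "TX", "CA")),
                shardCount(Arrays.asList("TX", "NY"))));
        Map<String, Long> right = mix(Arrays.asList(
                shardCount(Arrays.asList("CA")),
                shardCount(Arrays.asList("NY", "CA"))));
        System.out.println(topN(mix(Arrays.asList(left, right)), 2));
    }
}
```

Because counts are associative, each level only passes small partial results upward instead of raw rows.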

  14. How to use BigQuery?
    BigQuery Analytic Service in the Cloud
    [Figure: Import, Analyze, and Export paths around BigQuery]
    Labels in the figure: Google Analytics; ETL tools; Connectors; Google Cloud; BI tools and Visualization; Google Cloud; Spreadsheets, R, Hadoop

  15. BI Tools + BigQuery

  16. IoT Example: RasPi > Fluentd > BigQuery > Google Spreadsheet

  17. Customer Case: 7&i Net Media
    7NSEC

  18. Benefits of BigQuery
    Blazingly Fast
    ● Capable of scanning 100B rows in ~20 sec
    Low Cost
    ● Storage: $0.020 per GB per month
    ● Queries: $5 per TB
    Fully Managed
    ● Use thousands of servers with zero-ops
    SQL
    ● Simple and intuitive SQL
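
As a worked example of the pricing on this slide, the sketch below computes the cost of a query and of a month of storage. The two rates are the ones quoted on the slide (as of this 2015 talk); the conversion of 1 TB to 1,000 GB and the sample sizes are assumptions for illustration.

```java
class BigQueryCost {
    static final double STORAGE_USD_PER_GB_MONTH = 0.020; // rate quoted on the slide
    static final double QUERY_USD_PER_TB = 5.0;           // rate quoted on the slide

    static double monthlyStorageUsd(double gigabytes) {
        return gigabytes * STORAGE_USD_PER_GB_MONTH;
    }

    static double queryUsd(double terabytesScanned) {
        return terabytesScanned * QUERY_USD_PER_TB;
    }

    public static void main(String[] args) {
        // A query that reads 4 TB, like the 100B-row regex example.
        System.out.printf("query:   $%.2f%n", queryUsd(4.0));
        // Storing the same 4 TB for a month, assuming 1 TB = 1,000 GB.
        System.out.printf("storage: $%.2f per month%n", monthlyStorageUsd(4000.0));
    }
}
```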

  19. Google Cloud Dataflow

  20. Cloud Dataflow
    [Timeline figure, 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel, BigQuery, Cloud Dataflow]

  21. Cloud Dataflow Use Cases
    ● ETL
    ● Filtering
    ● Enrichment
    ● Shaping
    ● Batch
    ● Streaming
    ● Composition
    ● Orchestration

  22. Cloud Dataflow in Google Cloud Platform
    [Figure: Stream and Batch data paths]
    Labels in the figure: Cloud Pub/Sub; Cloud Logging; Cloud Dataflow; BigQuery; Cloud Storage; Cloud Dataflow; Bigtable; Google Cloud Storage

  23. Cloud Dataflow Features
    ● Functional programming model
    ● Unified batch & stream processing
    ● Open source SDKs
    ● Fully managed runners

  24. Cloud Dataflow SDK for Java and Python

  25. Cloud Dataflow: Concepts

  26. Pipeline
    A Pipeline is a Directed Acyclic Graph (DAG) of data flow, just like:
    ● MapReduce/Hadoop jobs
    ● Apache Spark jobs
    Multiple inputs and outputs:
    ● Cloud Storage
    ● BigQuery tables
    ● Cloud Datastore
    ● Cloud Pub/Sub
    ● Cloud Bigtable
    ● … and any external systems

  27. PCollections
    A collection of data in a Pipeline
    ● Represents any size of data
    ● Bounded or Unbounded (Stream)
    ● Or Key-Value Pairs
    Examples: {Seahawks, NFC, Champions, Seattle, ...} and {KV, KV, KV}

  28. Transforms
    ● ParDo
    ● GroupByKey
    ● Combine.GroupedValues
    [Figure: nodes M M M and R R]
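
To make the three transforms concrete, here is a plain-Java analogy over in-memory lists. It does not use the Dataflow SDK: `parDo`, `groupByKey`, and `combineSum` are hypothetical helpers that only mimic the roles of ParDo, GroupByKey, and Combine.GroupedValues.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TransformsSketch {
    // ParDo analogy: run fn on every element; each call may emit zero or more outputs.
    static <I, O> List<O> parDo(List<I> input, Function<I, List<O>> fn) {
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    // GroupByKey analogy: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // Combine.GroupedValues analogy: reduce each key's values to one value (here, a sum).
    static <K> Map<K, Long> combineSum(Map<K, List<Long>> grouped) {
        Map<K, Long> combined = new LinkedHashMap<>();
        grouped.forEach((k, values) -> {
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            combined.put(k, sum);
        });
        return combined;
    }

    // The three steps chained into a word count.
    static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = parDo(lines, line -> {
            List<Map.Entry<String, Long>> kvs = new ArrayList<>();
            for (String word : line.split(" ")) {
                kvs.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
            return kvs;
        });
        return combineSum(groupByKey(pairs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```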

  29. The Auto Complete Example
    Tweets to Predictions, step by step:
    ● read: #argentina scores, my #art project, watching #armenia vs #argentina
    ● ExtractTags: #argentina #art #armenia #argentina
    ● Count: (argentina, 5M) (art, 9M) (armenia, 2M)
    ● ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
    ● Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]
    ● write
    Pipeline code:
      Pipeline p = Pipeline.create();
      p.begin()
       .apply(TextIO.Read.from(...))              // read tweets
       .apply(ParDo.of(new ExtractTags()))        // tweet -> hashtags
       .apply(Count.create())                     // hashtag -> count
       .apply(ParDo.of(new ExpandPrefixes()))     // prefix -> (hashtag, count)
       .apply(Top.largestPerKey(3))               // top 3 hashtags per prefix
       .apply(TextIO.Write.to(...));              // write predictions
      p.run();
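
The same four steps can be traced on a small in-memory input without the Dataflow SDK. The sketch below is a plain-Java stand-in, not the `ExtractTags` or `ExpandPrefixes` classes from the talk (their source is not shown in the deck); all method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AutoCompleteSketch {
    static final Pattern TAG = Pattern.compile("#(\\w+)");

    // ExtractTags: pull every hashtag out of a tweet.
    static List<String> extractTags(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(tweet);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase());
        }
        return tags;
    }

    // Count: number of occurrences per hashtag.
    static Map<String, Long> count(List<String> tags) {
        Map<String, Long> counts = new HashMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1L, Long::sum);
        }
        return counts;
    }

    // ExpandPrefixes + Top(n): for every prefix of every tag, keep the n most frequent tags.
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String tag = e.getKey();
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), p -> new ArrayList<>()).add(e);
            }
        }
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<String, Long>>> e : byPrefix.entrySet()) {
            List<Map.Entry<String, Long>> candidates = e.getValue();
            // Highest count first; ties broken alphabetically.
            candidates.sort((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> best = new ArrayList<>();
            for (Map.Entry<String, Long> c : candidates.subList(0, Math.min(n, candidates.size()))) {
                best.add(c.getKey());
            }
            top.put(e.getKey(), best);
        }
        return top;
    }

    static Map<String, List<String>> suggest(List<String> tweets, int n) {
        List<String> tags = new ArrayList<>();
        for (String tweet : tweets) {
            tags.addAll(extractTags(tweet));
        }
        return topPerPrefix(count(tags), n);
    }

    public static void main(String[] args) {
        Map<String, List<String>> predictions = suggest(Arrays.asList(
                "read #argentina scores",
                "my #art project",
                "watching #armenia vs #argentina",
                "#art #art"), 3);
        System.out.println("ar  -> " + predictions.get("ar"));
        System.out.println("arg -> " + predictions.get("arg"));
    }
}
```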

  30. From Batch To Stream
    [Figure: rank of #ar* hashtags over time. Events: game begins, armenia wins! Hashtags: #argentinagoal, #armeniarocks, #argyle. Annotation: Age out old data]

  31. Windows
    A Window is a time slice of a PCollection:
    ● Fixed: hourly, daily, …
    ● Sliding: last one min, ...
    ● Sessions: each session
    [Figure labels: Nighttime, Mid-Day, Nighttime]
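
The three window kinds differ in how an element's timestamp maps to windows. The following plain-Java sketch shows one common way to compute that mapping; it is not the Dataflow SDK's implementation, and the half-open interval convention and the gap rule for sessions are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WindowSketch {
    // Fixed windows: each timestamp falls into exactly one window [start, start + size).
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding windows: a window of length `size` starts every `period`,
    // so one timestamp can belong to several overlapping windows.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sessions: consecutive timestamps stay in one session until
    // the distance to the previous element reaches `gap`.
    static List<List<Long>> sessions(List<Long> sortedTimestamps, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= gap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(3725, 3600));
        System.out.println(slidingWindowStarts(125, 60, 30));
        System.out.println(sessions(Arrays.asList(1L, 2L, 10L, 11L, 30L), 5));
    }
}
```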

  32. Streaming with Cloud PubSub
      Pipeline p = Pipeline.create(new PipelineOptions());
      p.begin()
       .apply(PubsubIO.Read.topic("input_topic"))
       .apply(Window.into(SlidingWindows.of(
           Duration.standardMinutes(60))))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.perElement())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(PubsubIO.Write.topic("output_topic"));
      p.run();

  33. Cloud Dataflow: Fully Managed Runners

  34. Life of a Pipeline
    [Figure labels: User Code & SDK; Google Cloud Dataflow (Optimize, Schedule); GCS, GCS; Monitoring UI]

  35. Execution Graph
    [Figure: inputs IN 1 to IN 4, outputs OUT 1 and OUT 2; nodes A, B, C, D, E, F; operations flatten, count, join; the legend marks the ParallelDo symbol]

  36. Optimised Execution Graph
    [Figure: inputs IN 1 to IN 4, outputs OUT 1 and OUT 2; fused nodes C+D+J1, B+D+J1, A+J1, E+J1, J2+F, and Cnt, with GBK and + stages between them]
    Legend: GBK = GroupByKey, + = CombineValues; the remaining symbol marks ParallelDo
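
Node labels such as C+D in the optimised graph suggest that adjacent ParallelDo steps are fused into one. As a minimal sketch of that general idea, assuming simple element-wise functions, the code below contrasts running two steps with an intermediate collection against running their composition in a single pass. It is not the Dataflow optimizer, whose details are not covered in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class FusionSketch {
    // Unfused: step C materializes an intermediate list that step D then re-reads.
    static List<Integer> unfused(List<String> input,
                                 Function<String, String> c,
                                 Function<String, Integer> d) {
        List<String> intermediate = new ArrayList<>();
        for (String s : input) {
            intermediate.add(c.apply(s));
        }
        List<Integer> out = new ArrayList<>();
        for (String s : intermediate) {
            out.add(d.apply(s));
        }
        return out;
    }

    // Fused: C and D are composed, so each element passes through both in one step
    // and no intermediate collection is built.
    static List<Integer> fused(List<String> input,
                               Function<String, String> c,
                               Function<String, Integer> d) {
        Function<String, Integer> cThenD = c.andThen(d);
        List<Integer> out = new ArrayList<>();
        for (String s : input) {
            out.add(cThenD.apply(s));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" art ", "argentina", " armenia");
        Function<String, String> trim = String::trim;
        Function<String, Integer> length = String::length;
        System.out.println(unfused(input, trim, length));
        System.out.println(fused(input, trim, length));
    }
}
```

Both methods return the same result; the fused form simply avoids writing and re-reading the intermediate data.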

  37. Real-time Monitoring UI

  38. Worker Scaling
    [Figure labels: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS]

  39. Benefits of Cloud Dataflow
    Fully Managed
    ● Zero-ops for hundreds of servers
    ● Scalability, fault-tolerance and optimization
    Batch + Streaming
    ● Integrates the two paradigms with a single logic
    Easy Pipeline Design
    ● Use Java/Python to define an abstract pipeline

  40. [No text on this slide]

  41. Getting Started
    cloud.google.com/bigquery
    cloud.google.com/dataflow

  42. Thank You!