
Alex Osterloh (Google) - Big Data the Cloud Way

Big Data the Cloud Way

MunichDataGeeks

February 03, 2015

Transcript

  1. Big Data - The Cloud Way. Alex Osterloh, Google for Work, [email protected], linkedin.com/in/aosterloh (Photo source: istockphoto.com)
  2. I dare you say “Big Data” one more time
  3. Agenda: 1 minute at Google scale, Big Data @Google, Big Data in the Cloud
  4. But first … 1 minute at Google scale
  5. Disruption: We often treat new technologies like those we already know before we realize their true potential.
  6. One minute at Google scale: 3 million searches, 1000 new devices, 100 hours, 1 billion users, 100 million gigabytes, and also... 1 billion activated devices
  7. Google’s innovation around dealing with big data (timeline, 2002-2013): GFS, MapReduce, Big Table, Colossus, Dremel, Flume, Spanner, MillWheel
  8. Many Dremel use cases @Google:
     • Analysis of crawled web documents
     • Crash reporting for Google products
     • OCR results from Google Books
     • Spam analysis
     • Debugging of map tiles on Google Maps
     • Results of tests run on Google’s distributed build system
     • Disk I/O statistics for hundreds of thousands of disks
     • Resource monitoring for jobs run in Google’s data centers
  9. Dremel use @Google 1: Google Play. How can a business analyst find the Top 20 apps in a matter of seconds?
     SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC
     Scan of ~1 billion records, result in under 20 seconds.
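     The installlog table is Google-internal, so the query above is illustrative only, but the same pattern works against any BigQuery table you can access. A minimal sketch (not from the deck) of submitting such a legacy-SQL query from Java, assuming the google-cloud-bigquery client library and default application credentials:

     import com.google.cloud.bigquery.BigQuery;
     import com.google.cloud.bigquery.BigQueryOptions;
     import com.google.cloud.bigquery.FieldValueList;
     import com.google.cloud.bigquery.QueryJobConfiguration;

     public class TopApps {
       public static void main(String[] args) throws InterruptedException {
         BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
         // The slide's query is BigQuery legacy SQL; substitute a table you can actually query.
         QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                 "SELECT top(appId, 20) AS app, count(*) AS count "
                     + "FROM installlog.2012 ORDER BY count DESC")
             .setUseLegacySql(true)
             .build();
         for (FieldValueList row : bigquery.query(query).iterateAll()) {
           System.out.println(row.get("app").getStringValue() + ": " + row.get("count").getLongValue());
         }
       }
     }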
  10. Dremel use @Google 2: DoubleClick support. What is the root cause of why an ad was or was not delivered in the last 30 days?
     select date, rejection_reason, count(*) from line_item_table.last30days where line_item_id=56781234 group by date, rejection_reason
     Scan of ~1.2B records, result in under 5 seconds.
  11. Dremel architecture: a serving tree of mixers (Mixer 0 → Mixer 1 → …) fanning out to leaf servers on top of distributed storage.
     • Columnar storage
     • Long-lived shared serving tree
     • Partial reduction
     • Diskless data flow
  12. Example: Using BigQuery to analyse 173M taxi trips. What is the best month for getting tips as a NYC taxi cab driver? (Click Taxi to get to Live Demo)
     SELECT INTEGER(AVG(tip_amount)*100)/100 avg_tip, REGEXP_EXTRACT(pickup_datetime, "2013-([0-9]*)") month FROM [833682135931:nyctaxi.trip_fare] WHERE payment_type='CRD' GROUP BY 2 ORDER BY 1 desc
  15. Example: Github (290M entries). How many uploads to Github by language, 2012 vs. 2015? (Click Icon to get to Live Demo)
     SELECT repository_language, count(repository_language) as pushes FROM [githubarchive:github.timeline] where type="PushEvent" and repository_pushed_at contains "2012" group by repository_language order by pushes desc limit 10
  16. Example: Github (290M entries). How many uploads to Github by language, 2012 vs. 2015? (Click Icon to get to Live Demo)
     SELECT repository_language, count(repository_language) as pushes FROM [githubarchive:github.timeline] where type="PushEvent" and repository_pushed_at contains "2012" group by repository_language order by pushes desc limit 10
     Results shown for 2012 and 2015. More examples at http://bigqueri.es/
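     The githubarchive timeline table is public, so the 2012-vs-2015 comparison above can be scripted by running the slide's query once per year. A sketch (not from the deck), assuming the google-cloud-bigquery Java client:

     import com.google.cloud.bigquery.*;

     public class GithubPushesByLanguage {
       public static void main(String[] args) throws InterruptedException {
         BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
         for (String year : new String[] {"2012", "2015"}) {
           // Same legacy-SQL query as on the slide, with the year parameterized.
           String sql = "SELECT repository_language, count(repository_language) AS pushes "
               + "FROM [githubarchive:github.timeline] "
               + "WHERE type = 'PushEvent' AND repository_pushed_at CONTAINS '" + year + "' "
               + "GROUP BY repository_language ORDER BY pushes DESC LIMIT 10";
           TableResult result = bigquery.query(
               QueryJobConfiguration.newBuilder(sql).setUseLegacySql(true).build());
           System.out.println("Top languages by pushes in " + year + ":");
           for (FieldValueList row : result.iterateAll()) {
             String language = row.get("repository_language").isNull()
                 ? "(none)" : row.get("repository_language").getStringValue();
             System.out.println("  " + language + ": " + row.get("pushes").getLongValue());
           }
         }
       }
     }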
  17. BigQuery vs. MapReduce
     • MapReduce: flexible batch processing, multiple MR stages; high overall throughput; high latency
     • BigQuery: optimized for SQL queries; very low latency, almost real-time; great for trial-and-error and ad hoc queries
  18. Google Cloud Platform. Storage: Cloud Storage, Cloud SQL, Cloud Datastore. Compute: Compute Engine, App Engine. App Services: BigQuery, Cloud Dataflow. Heavy lifting: Google / Heavy lifting: You.
  19. But analysing Big Data can be hard. Image source: http://blog.mikiobraun.de/2014/10/parts-bug-no-car-big-data-infrastructure.html
  20. Big Data publications at Google (2003-2013): GFS, MapReduce, Big Table, Paxos impl., Dremel, FlumeJava, MillWheel, Spanner/F1; available in the cloud as BigQuery (Dremel), Cloud Datastore (Big Table) and Cloud Dataflow (MapReduce, FlumeJava, MillWheel).
  21. Example: Auto-completing #hashtags (g.co/io14videos). Prefix suggestions: #ar → #argentina, #arugularocks, #argylesocks; #arg → #argentina, #argylesocks, #argonauts; #arge → #argentina, #argentum, #argentine
  22. Pipeline: Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write
     Tweets: {#argentina scores!, watching #armenia vs #argentina, my #art project, …}
     ExtractTags: {argentina, armenia, argentina, art, ...}
     Count: {argentina->5M, armenia->2M, art->90M, ...}
     ExpandPrefixes: {a->(argentina, 5M), a->(armenia, 2M), …, ar->(argentina, 5M), ar->(armenia, 2M), ...}
     Top(3) → Predictions: {a->[apple, art, argentina], ar->[art, argentina, armenia], ...}
  23. Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write (Tweets → Predictions)
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from(“gs://…”))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to(“gs://…”));
     p.run();

     class ExpandPrefixes … {
       ...
       public void processElement(ProcessContext c) {
         // Emit every prefix of the hashtag, keyed by prefix.
         String word = c.element().getKey();
         for (int i = 1; i <= word.length(); i++) {
           String prefix = word.substring(0, i);
           c.output(KV.of(prefix, c.element()));
         }
       }
     }
  24. Google Cloud Dataflow
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from(“gs://…”))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to(“gs://…”));
     p.run();
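     The slide's snippet omits configuration. A hedged sketch of how a pipeline like this was pointed at the managed service with the pre-Beam Dataflow SDK of that era; the project id, staging bucket and runner choice below are placeholders and assumptions, not taken from the deck:

     DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
         .withValidation()
         .as(DataflowPipelineOptions.class);
     options.setProject("my-gcp-project");                      // placeholder project id
     options.setStagingLocation("gs://my-bucket/staging");      // placeholder staging bucket
     options.setRunner(BlockingDataflowPipelineRunner.class);   // run on the service and block until done

     Pipeline p = Pipeline.create(options);
     // ... same .apply(...) chain as above ...
     p.run();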
  25. Let’s stream it … Google Cloud Pub/Sub for reads and writes. (Chart: rank of #ar* suggestions over time as the game begins and armenia wins; #argentinagoal, #armeniarocks, #argyle; age out old data.)
  26. Pipeline p = Pipeline.create();
     p.begin()
       .apply(PubsubIO.Read.from(“input_topic”))
       .apply(Bucket.by(SlidingWindows.of(60, MINUTES)))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(PubsubIO.Write.to(“output_topic”));
     p.run();
  27. Cloud Dataflow: a simple, yet powerful SDK for building highly parallelized & flexible data processing pipelines, and a fully managed service which optimizes, schedules and executes data processing pipelines.
  28. Cloud Dataflow: Data arrives continuously, so why wait to process it? Dataflow makes stream processing the new default. Batch is available (with no code change) for when it’s appropriate (e.g. historical data reprocessing).
  29. Back to the NY Taxis … now with Cloud Dataflow. Problem: mapping 340M pickup and drop-off locations to NYC neighborhoods by frequency.
  30. Back to the NY Taxis … now with Cloud Dataflow. Problem: mapping 340M pickup and drop-off locations to NYC neighborhoods by frequency.
     • Map 340M geo coordinates to NYC neighborhoods (342 polygons)
     • Used BigQuery as input and output
     • Dataflow optimizes code by collapsing multiple logical passes into a single execution pass
     • Used Dataflow with 5 workers to run the pipeline
     • 25K records per second
     Interactive map at: http://nyctaximap.appspot.com/ Blog post: http://bit.ly/nycdataflow
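     A minimal sketch of what such a pipeline could look like, written in the style of the deck's pre-Beam Dataflow SDK snippets; the table names, field names and the NeighborhoodLookup point-in-polygon helper are assumptions, not taken from the slide or the blog post:

     // Imports and pipeline options omitted, as on the other slides.
     TableSchema schema = new TableSchema().setFields(Arrays.asList(
         new TableFieldSchema().setName("neighborhood").setType("STRING"),
         new TableFieldSchema().setName("trips").setType("INTEGER")));

     Pipeline p = Pipeline.create(options);
     p.apply(BigQueryIO.Read.from("nyctaxi.trip_data"))          // ~340M pickup/drop-off rows (table name assumed)
      .apply(ParDo.of(new DoFn<TableRow, String>() {
        @Override
        public void processElement(ProcessContext c) {
          TableRow row = c.element();
          double lat = Double.parseDouble(String.valueOf(row.get("pickup_latitude")));
          double lon = Double.parseDouble(String.valueOf(row.get("pickup_longitude")));
          // NeighborhoodLookup is a hypothetical point-in-polygon check over the 342 polygons.
          c.output(NeighborhoodLookup.find(lat, lon));
        }
      }))
      .apply(Count.<String>perElement())                         // trips per neighborhood
      .apply(ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          c.output(new TableRow()
              .set("neighborhood", c.element().getKey())
              .set("trips", c.element().getValue()));
        }
      }))
      .apply(BigQueryIO.Write.to("nyctaxi.neighborhood_counts")  // output table name assumed
          .withSchema(schema));
     p.run();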