
Alex Osterloh (Google) - Big Data the Cloud Way

Big Data the Cloud Way

MunichDataGeeks

February 03, 2015

Transcript

  1. Big Data - The Cloud Way. Alex Osterloh, Google for Work, [email protected], linkedin.com/in/aosterloh (Photo source: istockphoto.com)
  2. I dare you say “Big Data” one more time
  3. Agenda: 1 minute at Google scale, Big Data @Google, Big Data in the Cloud
  4. But first … 1 minute at Google scale
  5. Disruption: We often treat new technologies like those we already know before we realize their true potential.
  6. One minute at Google scale: 3 million searches, 1000 new devices, 100 hours, 1 billion users, 100 million gigabytes, and also... 1 billion activated devices
  7. Google’s innovation around dealing with big data (timeline, 2002-2013): GFS, MapReduce, Big Table, Colossus, Dremel, Flume, Spanner, MillWheel
  8. Many Dremel use cases @Google:
     • Analysis of crawled web documents
     • Crash reporting for Google products
     • OCR results from Google Books
     • Spam analysis
     • Debugging of map tiles on Google Maps
     • Results of tests run on Google’s distributed build system
     • Disk I/O statistics for hundreds of thousands of disks
     • Resource monitoring for jobs run in Google’s data centers
  9. Dremel use @Google 1: Google Play. How can a business analyst find the Top 20 apps in a matter of seconds?
     SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC
     Scan of ~1 billion records, result in under 20 seconds.
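     The installlog table is Google-internal, so the query above is illustrative only, but the same pattern works against any BigQuery table you can access. A minimal sketch (not from the deck) of submitting such a legacy-SQL query from Java, assuming the google-cloud-bigquery client library and default application credentials:

     import com.google.cloud.bigquery.BigQuery;
     import com.google.cloud.bigquery.BigQueryOptions;
     import com.google.cloud.bigquery.FieldValueList;
     import com.google.cloud.bigquery.QueryJobConfiguration;

     public class TopApps {
       public static void main(String[] args) throws InterruptedException {
         BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
         // The slide's query is BigQuery legacy SQL; substitute a table you can actually query.
         QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                 "SELECT top(appId, 20) AS app, count(*) AS count "
                     + "FROM installlog.2012 ORDER BY count DESC")
             .setUseLegacySql(true)
             .build();
         for (FieldValueList row : bigquery.query(query).iterateAll()) {
           System.out.println(row.get("app").getStringValue() + ": " + row.get("count").getLongValue());
         }
       }
     }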
  10. Dremel use @Google 2: DoubleClick support. What is the root cause of why an ad was or was not delivered in the last 30 days?
     select date, rejection_reason, count(*) from line_item_table.last30days where line_item_id=56781234 group by date, rejection_reason
     Scan of ~1.2B records, result in under 5 seconds.
  11. Dremel architecture: a serving tree of mixers (Mixer 0 → Mixer 1 → …) fanning out to leaf servers on top of distributed storage.
     • Columnar storage
     • Long-lived shared serving tree
     • Partial reduction
     • Diskless data flow
  12. Example: Using BigQuery to analyse 173M taxi trips. What is the best month for getting tips as a NYC taxi cab driver? (Click Taxi to get to Live Demo)
     SELECT INTEGER(AVG(tip_amount)*100)/100 avg_tip, REGEXP_EXTRACT(pickup_datetime, "2013-([0-9]*)") month FROM [833682135931:nyctaxi.trip_fare] WHERE payment_type='CRD' GROUP BY 2 ORDER BY 1 desc
  15. Example: Github (290M entries). How many uploads to Github by language, 2012 vs. 2015? (Click Icon to get to Live Demo)
     SELECT repository_language, count(repository_language) as pushes FROM [githubarchive:github.timeline] where type="PushEvent" and repository_pushed_at contains "2012" group by repository_language order by pushes desc limit 10
  16. Example: Github (290M entries). How many uploads to Github by language, 2012 vs. 2015? (Click Icon to get to Live Demo)
     SELECT repository_language, count(repository_language) as pushes FROM [githubarchive:github.timeline] where type="PushEvent" and repository_pushed_at contains "2012" group by repository_language order by pushes desc limit 10
     Results shown for 2012 and 2015. More examples at http://bigqueri.es/
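     The githubarchive timeline table is public, so the 2012-vs-2015 comparison above can be scripted by running the slide's query once per year. A sketch (not from the deck), assuming the google-cloud-bigquery Java client:

     import com.google.cloud.bigquery.*;

     public class GithubPushesByLanguage {
       public static void main(String[] args) throws InterruptedException {
         BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
         for (String year : new String[] {"2012", "2015"}) {
           // Same legacy-SQL query as on the slide, with the year parameterized.
           String sql = "SELECT repository_language, count(repository_language) AS pushes "
               + "FROM [githubarchive:github.timeline] "
               + "WHERE type = 'PushEvent' AND repository_pushed_at CONTAINS '" + year + "' "
               + "GROUP BY repository_language ORDER BY pushes DESC LIMIT 10";
           TableResult result = bigquery.query(
               QueryJobConfiguration.newBuilder(sql).setUseLegacySql(true).build());
           System.out.println("Top languages by pushes in " + year + ":");
           for (FieldValueList row : result.iterateAll()) {
             String language = row.get("repository_language").isNull()
                 ? "(none)" : row.get("repository_language").getStringValue();
             System.out.println("  " + language + ": " + row.get("pushes").getLongValue());
           }
         }
       }
     }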
  17. BigQuery vs. MapReduce
     • MapReduce: flexible batch processing, multiple MR stages; high overall throughput; high latency
     • BigQuery: optimized for SQL queries; very low latency, almost real-time; great for trial-and-error and ad hoc queries
  18. Google Cloud Platform. Storage: Cloud Storage, Cloud SQL, Cloud Datastore. Compute: Compute Engine, App Engine. App Services: BigQuery, Cloud Dataflow. Heavy lifting: Google / Heavy lifting: You.
  19. But analysing Big Data can be hard. Image source: http://blog.mikiobraun.de/2014/10/parts-bug-no-car-big-data-infrastructure.html
  20. Big Data publications at Google (2003-2013): GFS, MapReduce, Big Table, Paxos impl., Dremel, FlumeJava, MillWheel, Spanner/F1; available in the cloud as BigQuery (Dremel), Cloud Datastore (Big Table) and Cloud Dataflow (MapReduce, FlumeJava, MillWheel).
  21. Example: Auto-completing #hashtags (g.co/io14videos). Prefix suggestions: #ar → #argentina, #arugularocks, #argylesocks; #arg → #argentina, #argylesocks, #argonauts; #arge → #argentina, #argentum, #argentine
  22. Pipeline: Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write
     Tweets: {#argentina scores!, watching #armenia vs #argentina, my #art project, …}
     ExtractTags: {argentina, armenia, argentina, art, ...}
     Count: {argentina->5M, armenia->2M, art->90M, ...}
     ExpandPrefixes: {a->(argentina, 5M), a->(armenia, 2M), …, ar->(argentina, 5M), ar->(armenia, 2M), ...}
     Top(3) → Predictions: {a->[apple, art, argentina], ar->[art, argentina, armenia], ...}
  23. Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write (Tweets → Predictions)
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from(“gs://…”))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to(“gs://…”));
     p.run();

     class ExpandPrefixes … {
       ...
       public void processElement(ProcessContext c) {
         // Emit every prefix of the hashtag, keyed by prefix.
         String word = c.element().getKey();
         for (int i = 1; i <= word.length(); i++) {
           String prefix = word.substring(0, i);
           c.output(KV.of(prefix, c.element()));
         }
       }
     }
  24. Google Cloud Dataflow
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from(“gs://…”))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to(“gs://…”));
     p.run();
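     The slide's snippet omits configuration. A hedged sketch of how a pipeline like this was pointed at the managed service with the pre-Beam Dataflow SDK of that era; the project id, staging bucket and runner choice below are placeholders and assumptions, not taken from the deck:

     DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
         .withValidation()
         .as(DataflowPipelineOptions.class);
     options.setProject("my-gcp-project");                      // placeholder project id
     options.setStagingLocation("gs://my-bucket/staging");      // placeholder staging bucket
     options.setRunner(BlockingDataflowPipelineRunner.class);   // run on the service and block until done

     Pipeline p = Pipeline.create(options);
     // ... same .apply(...) chain as above ...
     p.run();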
  25. Let’s stream it … Google Cloud Pub/Sub for reads and writes. (Chart: rank of #ar* suggestions over time as the game begins and armenia wins; #argentinagoal, #armeniarocks, #argyle; age out old data.)
  26. Pipeline p = Pipeline.create();
     p.begin()
       .apply(PubsubIO.Read.from(“input_topic”))
       .apply(Bucket.by(SlidingWindows.of(60, MINUTES)))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.create())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(PubsubIO.Write.to(“output_topic”));
     p.run();
  27. Cloud Dataflow: a simple, yet powerful SDK for building highly parallelized & flexible data processing pipelines, and a fully managed service which optimizes, schedules and executes data processing pipelines.
  28. Cloud Dataflow: Data arrives continuously, so why wait to process it? Dataflow makes stream processing the new default. Batch is available (with no code change) for when it’s appropriate (e.g. historical data reprocessing).
  29. Back to the NY Taxis … now with Cloud Dataflow. Problem: mapping 340M pickup and drop-off locations to NYC neighborhoods by frequency.
  30. Back to the NY Taxis … now with Cloud Dataflow. Problem: mapping 340M pickup and drop-off locations to NYC neighborhoods by frequency.
     • Map 340M geo coordinates to NYC neighborhoods (342 polygons)
     • Used BigQuery as input and output
     • Dataflow optimizes code by collapsing multiple logical passes into a single execution pass
     • Used Dataflow with 5 workers to run the pipeline
     • 25K records per second
     Interactive map at: http://nyctaximap.appspot.com/ Blog post: http://bit.ly/nycdataflow
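     A minimal sketch of what such a pipeline could look like, written in the style of the deck's pre-Beam Dataflow SDK snippets; the table names, field names and the NeighborhoodLookup point-in-polygon helper are assumptions, not taken from the slide or the blog post:

     // Imports and pipeline options omitted, as on the other slides.
     TableSchema schema = new TableSchema().setFields(Arrays.asList(
         new TableFieldSchema().setName("neighborhood").setType("STRING"),
         new TableFieldSchema().setName("trips").setType("INTEGER")));

     Pipeline p = Pipeline.create(options);
     p.apply(BigQueryIO.Read.from("nyctaxi.trip_data"))          // ~340M pickup/drop-off rows (table name assumed)
      .apply(ParDo.of(new DoFn<TableRow, String>() {
        @Override
        public void processElement(ProcessContext c) {
          TableRow row = c.element();
          double lat = Double.parseDouble(String.valueOf(row.get("pickup_latitude")));
          double lon = Double.parseDouble(String.valueOf(row.get("pickup_longitude")));
          // NeighborhoodLookup is a hypothetical point-in-polygon check over the 342 polygons.
          c.output(NeighborhoodLookup.find(lat, lon));
        }
      }))
      .apply(Count.<String>perElement())                         // trips per neighborhood
      .apply(ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          c.output(new TableRow()
              .set("neighborhood", c.element().getKey())
              .set("trips", c.element().getValue()));
        }
      }))
      .apply(BigQueryIO.Write.to("nyctaxi.neighborhood_counts")  // output table name assumed
          .withSchema(schema));
     p.run();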