Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani



Transcript

  1. +-------------+----------------+
     | strategy    | awesomeness    |
     +-------------+----------------+
     | "Forty-two" |     1000000.01 |
     +-------------+----------------+
     1 row in result set (10.2 s)
     Scanned 100GB
  2. MapReduce is Flexible but Heavy [diagram: Master, Mappers and Reducers over Distributed Storage]
     • Master constructs the plan and begins spinning up workers
     • Mappers read and write to distributed storage
     • Map => Shuffle => Reduce
     • Reducers read and write to distributed storage
  3. MapReduce is Flexible but Heavy [diagram: two chained MapReduce stages, Stage 1 and Stage 2, each with Master, Mappers and Reducers passing data through Distributed Storage]
  4. Dremel vs MapReduce
     • MapReduce
       o Flexible batch processing
       o High overall throughput
       o High latency
     • Dremel
       o Optimized for interactive SQL queries
       o Very low latency
  5. Dremel Architecture [diagram: Mixer 0 → Mixer 1 nodes → Leaves → Distributed Storage]
     • Columnar Storage
     • Long-lived shared serving tree
     • Partial Reduction
     • Diskless data flow
  6. Simple Query
     SELECT state, COUNT(*) count_babies
     FROM [publicdata:samples.natality]
     WHERE year >= 1980 AND year < 1990
     GROUP BY state
     ORDER BY count_babies DESC
     LIMIT 10
  7. Simple Query Execution [diagram: Mixer 0 → Mixer 1 nodes → Leaves → Distributed Storage]
     • Leaves scan O(Rows ~140M): SELECT state, year with WHERE year >= 1980 AND year < 1990, then COUNT(*) GROUP BY state emits O(50 states)
     • Mixer 1 nodes merge the partial COUNT(*) GROUP BY state results, still O(50 states)
     • Mixer 0 applies ORDER BY count_babies DESC LIMIT 10
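The partial-reduction pattern in the serving tree above can be sketched in a few lines of Python. This is a toy model, not BigQuery internals, and the state rows are invented:

```python
from collections import Counter

# Toy model of the serving tree: each leaf computes a partial
# COUNT(*) GROUP BY state over its own rows, and the mixers merge the
# partials, so only O(50 states) flows up the tree instead of O(~140M rows).
leaf_rows = [
    ["CA", "TX", "CA"],   # rows scanned by leaf 0 (hypothetical data)
    ["TX", "NY"],         # rows scanned by leaf 1
    ["CA", "NY", "NY"],   # rows scanned by leaf 2
]

partials = [Counter(rows) for rows in leaf_rows]  # leaf-level aggregation

merged = Counter()                                # mixer-level merge
for partial in partials:
    merged += partial

# Mixer 0 applies the equivalent of ORDER BY count DESC LIMIT 2.
top = merged.most_common(2)
print(top)
```

The key point the diagram makes is that the WHERE filter and the first round of counting happen at the leaves, next to the data.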
  8. Example: Daily Weather Station Data (weather_station_data)
     station  lat    long    mean_temp  humidity  timestamp   year  month  day
     9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
     2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
     3475     40.77  73.98   68         .35       1351015930  2011  04     19
     etc...
  9. Example: Daily Weather Station Data (CSV)
     station, lat, long, mean_temp, year, mon, day
     999999, 36.624, -116.023, 63.6, 2009, 10, 9
     911904, 20.963, -156.675, 83.4, 2009, 10, 9
     916890, -18.133, 178.433, 76.9, 2009, 10, 9
     943320, -20.678, 139.488, 73.8, 2009, 10, 9
  10. Modeling Event Data: Social Music Store (logs.oct_24_2012_song_activities)
      USERNAME  ACTIVITY  COST  SONG           ARTIST      TIMESTAMP
      Michael   LISTEN          Too Close      Alex Clare  1351065562
      Michael   LISTEN          Gangnam Style  PSY         1351105150
      Jim       LISTEN          Complications  Deadmau5    1351075720
      Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962
  11. Users Who Listened to More than 10 Songs/Day
      SELECT UserId, COUNT(*) AS ListenActivities
      FROM [logs.oct_24_2012_song_activities]
      GROUP EACH BY UserId
      HAVING ListenActivities > 10
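The aggregate-then-filter shape of this query can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in for BigQuery (GROUP EACH BY is a BigQuery-specific hint, so plain GROUP BY is used, and the sample rows are invented):

```python
import sqlite3

# In-memory stand-in for logs.oct_24_2012_song_activities with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song_activities (UserId TEXT, Song TEXT)")
rows = [("Michael", "Song %d" % i) for i in range(12)]  # 12 listens
rows += [("Jim", "Song %d" % i) for i in range(3)]      # 3 listens
conn.executemany("INSERT INTO song_activities VALUES (?, ?)", rows)

# Same shape as the slide's query: aggregate per user, then filter the groups.
heavy_listeners = conn.execute(
    "SELECT UserId, COUNT(*) AS ListenActivities "
    "FROM song_activities "
    "GROUP BY UserId "
    "HAVING ListenActivities > 10"
).fetchall()
print(heavy_listeners)  # [('Michael', 12)]
```

HAVING runs after grouping, which is why it can reference the ListenActivities alias while WHERE cannot.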
  12. How Many Songs Were Listened to by Listeners of PSY?
      SELECT UserId, COUNT(*) AS ListenActivities
      FROM [logs.oct_24_2012_song_activities]
      WHERE UserId IN (
        SELECT UserId FROM [logs.oct_24_2012_song_activities]
        WHERE artist = 'PSY')
      GROUP EACH BY UserId
      HAVING ListenActivities > 10
  13. Modeling Event Data: Nested and Repeated Values (JSON)
      {"UserID": "Michael",
       "Listens": [
         {"TrackId":1234, "Title":"Gangnam Style", "Artist":"PSY", "Timestamp":1351075700},
         {"TrackId":1234, "Title":"Too Close", "Artist":"Alex Clare", "Timestamp":1351075700}],
       "Purchases": [
         {"Track":2345, "Title":"Gangnam Style", "Artist":"PSY", "Timestamp":1351075700, "Cost":0.99}]}
  14. Which Users Have Listened to Beyonce?
      SELECT UserID, COUNT(ListenActivities.artist) WITHIN RECORD AS song_count
      FROM [logs.oct_24_2012_songactivities]
      WHERE UserID IN (
        SELECT UserID FROM [logs.oct_24_2012_songactivities]
        WHERE ListenActivities.artist = 'Beyonce');
  15. What Position are PSY Songs in our Users' Daily Playlists?
      SELECT UserID, POSITION(ListenActivities.artist)
      FROM [sample_music_logs.oct_24_2012_songactivities]
      WHERE ListenActivities.artist = 'PSY';
  16. Summary: Choosing a BigQuery Data Model
      • "Shard" your data using multiple tables
      • Source data files
        o CSV format
        o Newline-delimited JSON
      • Using nested and repeated records
        o Simplifies some types of queries
        o Often matches document database models
  17. Load your Data into BigQuery
      POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
      "jobReference":{ "projectId":"605902584318" },
      "configuration":{
        "load":{
          "destinationTable":{
            "projectId":"605902584318",
            "datasetId":"my_dataset",
            "tableId":"widget_sales" },
          "sourceUris":[ "gs://widget-sales-data/2012080100.csv" ],
          "schema":{
            "fields":[ { "name":"widget", "type":"string" }, ...
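In Python, the same load-job body can be assembled before posting it. This is a sketch only: the IDs and GCS URI are the slide's example values, the schema is truncated to the one field the slide shows, and the OAuth Authorization header a real request needs is omitted:

```python
import json

project_id = "605902584318"  # example project ID from the slide

# Body of the load-job request shown on the slide.
job_body = {
    "jobReference": {"projectId": project_id},
    "configuration": {
        "load": {
            "destinationTable": {
                "projectId": project_id,
                "datasetId": "my_dataset",
                "tableId": "widget_sales",
            },
            "sourceUris": ["gs://widget-sales-data/2012080100.csv"],
            "schema": {"fields": [{"name": "widget", "type": "string"}]},
        }
    },
}

url = "https://www.googleapis.com/bigquery/v2/projects/%s/jobs" % project_id
payload = json.dumps(job_body)
print("POST", url)
print(payload)
```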
  18. Query Away!
      POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
      "jobReference":{ "projectId":"605902584318" },
      "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
      "maxResults":100,
      "apiVersion":"v2"
  19. Libraries
      • Python • Java • .NET • Ruby • JavaScript • Go • PHP • Objective-C
  20. Libraries - Example JavaScript Query
      var request = gapi.client.bigquery.jobs.query({
        'projectId': project_id,
        'timeoutMs': '30000',
        'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;'
      });
      request.execute(function(response) {
        console.log(response);
        $.each(response.result.rows, function(i, item) { ...
  21. BigQuery - Aggregate Big Data Analysis in Seconds
      • Full table scans FAST
      • Aggregate queries on massive datasets
      • Supports flat and nested/repeated data models
      • It's an API
      Get started now: http://developers.google.com/bigquery/
  22. Tools to prepare your data
      • App Engine MapReduce
      • Commercial ETL tools
        o Pervasive
        o Informatica
        o Talend
      • UNIX command-line
  23. Schema definition - sharding
      One table per year: birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016
      Each with the same columns: mother_race, mother_age, mother_cigarette_use, mother_state, father_race, father_age, father_cigarette_use, father_state, plurality, is_male, race, weight
  24. "If you do a table scan over a 1TB table, you're going to have a bad time."
      Anonymous 16th century Italian Philosopher-Monk
  25. Goal: Perform a 1 TB table scan in 1 second
      • Reading 1 TB/second from disk: 10k+ disks
      • Processing 1 TB/second: 5k processors
      Parallelize, Parallelize, Parallelize!
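The slide's disk and CPU counts fall out of simple division. The per-disk and per-core rates below are assumptions (roughly 2012-era hardware), not figures from the talk:

```python
TB = 10**12  # bytes
MB = 10**6

disk_read_rate = 100 * MB  # assumed sequential read rate per disk
cpu_scan_rate = 200 * MB   # assumed scan throughput per processor core

# To move and process 1 TB in one second, divide the target by per-unit rates.
disks_needed = TB / disk_read_rate
cpus_needed = TB / cpu_scan_rate
print(disks_needed, cpus_needed)  # 10000.0 5000.0
```

No single machine gets close to either number, which is the argument for the massively parallel architecture on the next slide.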
  26. BigQuery Architecture [diagram: Mixer 0 → Mixer 1 nodes serving shard ranges 0-8, 9-16 and 17-24 → Shards → Distributed Storage (e.g. GFS)]
  27. BigQuery SQL Example: Complex Processing
      SELECT ... FROM ...
      WHERE REGEXP_MATCH(url, "\.com$") AND user CONTAINS 'test'
  28. BigQuery SQL Example: Nested SELECT
      SELECT COUNT(*) FROM (SELECT foo ...) GROUP BY foo
  29. BigQuery SQL Example: Small JOIN
      SELECT huge_table.foo FROM huge_table
      JOIN small_table ON small_table.foo = huge_table.foo
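The small-join strategy amounts to broadcasting the small table to every shard as an in-memory lookup, so the huge table is streamed once with no shuffle. A toy sketch with invented values:

```python
# Values standing in for small_table.foo, broadcast to every shard.
small_foo = {"widget-a", "widget-b"}

# One shard's slice of huge_table.foo, streamed once from storage.
shard_rows = ["widget-a", "widget-x", "widget-b", "widget-a"]

# Each shard filters its own slice against the broadcast set,
# which is why this only works when the joined table is small.
joined = [foo for foo in shard_rows if foo in small_foo]
print(joined)  # ['widget-a', 'widget-b', 'widget-a']
```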
  30. BigQuery Architecture: Small Join [diagram: Mixer 0 → Mixer 1 nodes (Shard 0-8, Shard 17-24) → Shards 0, 20, 24 over Distributed Storage (e.g. GFS)]
  31. That's it
      • API
      • Column-based datastore
      • Full table scans FAST
      • Aggregates
      • Commercial tool support
      • Use cases
  32. A Little Later ...
      Row  wp_namespace  Revs
      1    0             53697002
      2    1             6151228
      3    3             5519859
      4    4             4184389
      5    2             3108562
      6    10            1052044
      7    6             877417
      8    14            838940
      9    5             651749
      10   11            192534
      11   100           148135
      Underlying table: Wikipedia page revision records, 314 million rows, 35.7 GB
      Query stats: scanned 7 GB of data, <5 seconds, ~100M rows scanned/second
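The quoted stats are mutually consistent; a quick arithmetic check (the ~3-second elapsed time and the effective byte rate are derived here, not stated in the deck):

```python
rows = 314_000_000          # rows in the Wikipedia revisions table
rows_per_sec = 100_000_000  # quoted scan rate
scanned_gb = 7              # columnar storage reads only the referenced
                            # columns, 7 GB of the 35.7 GB table

elapsed = rows / rows_per_sec
print(elapsed)               # 3.14 seconds, consistent with "<5 seconds"
print(scanned_gb / elapsed)  # ~2.2 GB/s effective read rate for this query
```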
  33. Wikipedia Query Execution [diagram: Mixer 0 → Mixer 1 nodes → Leaves reading Distributed Storage at 10 GB/s]
      • Leaves: SELECT wp_namespace, revision_id with WHERE timestamp > CUTOFF, then COUNT(revision_id) GROUP BY wp_namespace
      • Mixer 1 nodes merge the partial COUNT(revision_id) GROUP BY wp_namespace results
      • Mixer 0 applies ORDER BY Revs DESC
  34. "Multi-stage" Query
      SELECT LogEdits, COUNT(contributor_id) Contributors
      FROM (
        SELECT contributor_id, INTEGER(LOG10(COUNT(revision_id))) LogEdits
        FROM [publicdata:samples.wikipedia]
        GROUP EACH BY contributor_id)
      GROUP BY LogEdits
      ORDER BY LogEdits DESC
  35. Multi-stage Execution [diagram: Mixer 0 → Mixer 1 nodes → Leaves and Shufflers over Distributed Storage]
      • Stage 1 leaves: SELECT contributor_id, then COUNT(*) GROUP BY contributor_id
      • Shufflers repartition the intermediate rows by contributor_id
      • Stage 2: COUNT(contributor_id) GROUP BY LogEdits at the Mixer 1 nodes
      • Mixer 0 applies ORDER BY LogEdits DESC
  36. When to use EACH
      • Shuffle definitely adds some overhead
      • Poor query performance if used incorrectly
      • GROUP BY
        o Use when Groups << Rows (shuffling so few groups gives an unbalanced load)
        o Example: GROUP BY state
      • GROUP EACH BY
        o Use when Groups ~ Rows
        o Example: GROUP EACH BY user_id
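Why EACH on a low-cardinality key hurts can be simulated: shuffled rows are routed to workers by hashing the group key, so 50 states can never spread across more than 50 workers. A toy model with invented row counts, using crc32 as a stand-in for the real partitioner:

```python
import zlib

def max_worker_load(keys, workers=100):
    """Route each row to hash(key) % workers; return the busiest worker's row count."""
    load = [0] * workers
    for key in keys:
        load[zlib.crc32(key.encode()) % workers] += 1
    return max(load)

rows = 100_000
state_keys = ["state_%d" % (i % 50) for i in range(rows)]    # 50 groups
user_keys = ["user_%d" % (i % 50_000) for i in range(rows)]  # groups ~ rows

# With 50 groups, at most 50 of the 100 workers get any data, and each
# loaded worker receives thousands of rows; with ~50k groups the rows
# spread close to the ideal rows/workers.
print(max_worker_load(state_keys))
print(max_worker_load(user_keys))
```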