Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Session presented at the Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicación, UPM, Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani

Transcript

  1. +-------------+-------------+
     | strategy    | awesomeness |
     +-------------+-------------+
     | "Forty-two" |  1000000.01 |
     +-------------+-------------+
     1 row in result set (10.2 s) Scanned 100 GB
  2. MapReduce is Flexible but Heavy
     [Diagram: Master, Mappers, and Reducers reading from and writing to Distributed Storage]
     • Master constructs the plan and begins spinning up workers
     • Mappers read and write to distributed storage
     • Map => Shuffle => Reduce
     • Reducers read and write to distributed storage
  3. MapReduce is Flexible but Heavy
     [Diagram: a two-stage MapReduce pipeline. Stage 1 and Stage 2 each have a Master, Mappers, and a Reducer, with Distributed Storage between the stages]
  4. Dremel vs MapReduce
     • MapReduce
       o Flexible batch processing
       o High overall throughput
       o High latency
     • Dremel
       o Optimized for interactive SQL queries
       o Very low latency
  5. Dremel Architecture
     [Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, Leaf nodes reading from Distributed Storage]
     • Columnar storage
     • Long-lived shared serving tree
     • Partial reduction
     • Diskless data flow
  6. Simple Query
     SELECT state, COUNT(*) count_babies
     FROM [publicdata:samples.natality]
     WHERE year >= 1980 AND year < 1990
     GROUP BY state
     ORDER BY count_babies DESC
     LIMIT 10
  7. [Diagram: execution plan for the Simple Query, flowing through the serving tree]
     • Leaves read from Distributed Storage: SELECT state, year WHERE year >= 1980 AND year < 1990 (O(Rows ~140M))
     • Leaves pre-aggregate: COUNT(*) GROUP BY state (O(50 states))
     • Mixer 1 nodes merge: COUNT(*) GROUP BY state (O(50 states))
     • Mixer 0 finishes: ORDER BY count_babies DESC LIMIT 10
  8. Example: Daily Weather Station Data
     weather_station_data
     station  lat    long    mean_temp  humidity  timestamp   year  month  day
     9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
     2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
     3475     40.77  73.98   68         .35       1351015930  2011  04     19
     etc...
  9. Example: Daily Weather Station Data (CSV)
     station, lat, long, mean_temp, year, mon, day
     999999, 36.624, -116.023, 63.6, 2009, 10, 9
     911904, 20.963, -156.675, 83.4, 2009, 10, 9
     916890, -18.133, 178.433, 76.9, 2009, 10, 9
     943320, -20.678, 139.488, 73.8, 2009, 10, 9
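     Once loaded, the flat CSV rows are queried like any other table. A minimal sketch, assuming the data above was loaded into a hypothetical table weather.station_data:

       SELECT station, AVG(mean_temp) AS avg_temp
       FROM [weather.station_data]
       WHERE year = 2009 AND mon = 10
       GROUP BY station
       ORDER BY avg_temp DESC
       LIMIT 10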
  10. Modeling Event Data: Social Music Store
     logs.oct_24_2012_song_activities
     USERNAME  ACTIVITY  COST  SONG           ARTIST      TIMESTAMP
     Michael   LISTEN          Too Close      Alex Clare  1351065562
     Michael   LISTEN          Gangnam Style  PSY         1351105150
     Jim       LISTEN          Complications  Deadmau5    1351075720
     Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962
  11. Users Who Listened to More than 10 Songs/Day
     SELECT UserId, COUNT(*) AS ListenActivities
     FROM [logs.oct_24_2012_song_activities]
     GROUP EACH BY UserId
     HAVING ListenActivities > 10
  12. How Many Songs Were Listened to by Listeners of PSY?
     SELECT UserId, COUNT(*) AS ListenActivities
     FROM [logs.oct_24_2012_song_activities]
     WHERE UserId IN (
       SELECT UserId
       FROM [logs.oct_24_2012_song_activities]
       WHERE artist = 'PSY')
     GROUP EACH BY UserId
     HAVING ListenActivities > 10
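     Note that the query above still returns one row per user; if a single grand total across all PSY listeners is wanted, a minimal sketch against the same table:

       SELECT COUNT(*) AS TotalListens
       FROM [logs.oct_24_2012_song_activities]
       WHERE UserId IN (
         SELECT UserId
         FROM [logs.oct_24_2012_song_activities]
         WHERE artist = 'PSY')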
  13. Modeling Event Data: Nested and Repeated Values (JSON)
     {"UserID": "Michael",
      "Listens": [
        {"TrackId":1234, "Title":"Gangnam Style", "Artist":"PSY", "Timestamp":1351075700},
        {"TrackId":1234, "Title":"Too Close", "Artist":"Alex Clare", "Timestamp":1351075700}
      ],
      "Purchases": [
        {"Track":2345, "Title":"Gangnam Style", "Artist":"PSY", "Timestamp":1351075700, "Cost":0.99}
      ]}
  14. Which Users Have Listened to Beyonce?
     SELECT UserID, COUNT(ListenActivities.artist) WITHIN RECORD AS song_count
     FROM [logs.oct_24_2012_songactivities]
     WHERE UserID IN (
       SELECT UserID
       FROM [logs.oct_24_2012_songactivities]
       WHERE ListenActivities.artist = 'Beyonce');
  15. What Position are PSY Songs in our Users' Daily Playlists?
     SELECT UserID, POSITION(ListenActivities.artist)
     FROM [sample_music_logs.oct_24_2012_songactivities]
     WHERE ListenActivities.artist = 'PSY';
  16. Summary: Choosing a BigQuery Data Model
     • "Shard" your data using multiple tables
     • Source data files
       o CSV format
       o Newline-delimited JSON
     • Nested and repeated records
       o Simplify some types of queries
       o Often match document database models
  17. Load your Data into BigQuery
     POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
     "jobReference": {"projectId": "605902584318"},
     "configuration": {
       "load": {
         "destinationTable": {
           "projectId": "605902584318",
           "datasetId": "my_dataset",
           "tableId": "widget_sales"},
         "sourceUris": ["gs://widget-sales-data/2012080100.csv"],
         "schema": {
           "fields": [{
             "name": "widget",
             "type": "string"}, ...
  18. Query Away!
     POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
     "jobReference": {
       "projectId": "605902584318",
       "query": "SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
       "maxResults": 100,
       "apiVersion": "v2"
     }
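     In BigQuery SQL, TOP(field, n) paired with COUNT(*) is shorthand for grouping by the field and keeping the n most frequent values; the query in the job above is roughly equivalent to this sketch:

       SELECT widget, COUNT(*) AS sale_count
       FROM widget_sales
       GROUP BY widget
       ORDER BY sale_count DESC
       LIMIT 50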
  19. Libraries
     • Python • Java • .NET • Ruby • JavaScript • Go • PHP • Objective-C
  20. Libraries - Example JavaScript Query
     var request = gapi.client.bigquery.jobs.query({
       'projectId': project_id,
       'timeoutMs': '30000',
       'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;'
     });
     request.execute(function(response) {
       console.log(response);
       $.each(response.result.rows, function(i, item) {
         ...
  21. BigQuery - Aggregate Big Data Analysis in Seconds
     • Full table scans FAST
     • Aggregate queries on massive datasets
     • Supports flat and nested/repeated data models
     • It's an API
     Get started now: http://developers.google.com/bigquery/
  22. Tools to prepare your data
     • App Engine MapReduce
     • Commercial ETL tools: Pervasive, Informatica, Talend
     • UNIX command-line
  23. Schema definition - sharding
     One table per year, all sharing the same schema:
     birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016
     Schema: mother_race, mother_age, mother_cigarette_use, mother_state, father_race, father_age, father_cigarette_use, father_state, plurality, is_male, race, weight
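     Sharded tables can be queried together by listing them in the FROM clause; in BigQuery SQL a comma between tables is a table union, not a join. A sketch, assuming the shards live in a hypothetical dataset named my_dataset:

       SELECT mother_state, COUNT(*) AS births
       FROM [my_dataset.birth_record_2011], [my_dataset.birth_record_2012]
       GROUP BY mother_state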
  24. "If you do a table scan over a 1TB table, you're going to have a bad time."
     Anonymous 16th century Italian Philosopher-Monk
  25. Goal: Perform a 1 TB table scan in 1 second
     • Reading 1 TB/second from disk: 10k+ disks
     • Processing 1 TB/second: 5k processors
     Parallelize, Parallelize, Parallelize!
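     The arithmetic behind those figures, assuming round numbers for 2012-era hardware: one disk sustains roughly 100 MB/s of sequential reads, so 1 TB/s ÷ 100 MB/s ≈ 10,000 disks; and if one processor core can filter and aggregate roughly 200 MB/s, then 1 TB/s ÷ 200 MB/s ≈ 5,000 cores.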
  26. BigQuery Architecture
     [Diagram: Mixer 0 at the root; Mixer 1 nodes owning shard ranges 0-8, 9-16, and 17-24; shards (e.g. Shard 0, Shard 10, Shard 12, Shard 20, Shard 24) reading from Distributed Storage (e.g. GFS)]
  27. BigQuery SQL Example: Complex Processing
     SELECT ... FROM ...
     WHERE REGEXP_MATCH(url, "\.com$")
       AND user CONTAINS 'test'
  28. BigQuery SQL Example: Nested SELECT
     SELECT COUNT(*)
     FROM (SELECT foo ...)
     GROUP BY foo
  29. BigQuery SQL Example: Small JOIN
     SELECT huge_table.foo
     FROM huge_table
     JOIN small_table ON small_table.foo = huge_table.foo
  30. BigQuery Architecture: Small Join
     [Diagram: Mixer 0; Mixer 1 nodes for shard ranges 0-8 and 17-24; Shard 0, Shard 20, and Shard 24 reading from Distributed Storage (e.g. GFS). The small table is copied to every shard so the join runs locally]
  31. That's it
     • API
     • Column-based datastore
     • Full table scans FAST
     • Aggregates
     • Commercial tool support
     • Use cases
  32. A Little Later ...
     Row  wp_namespace  Revs
     1    0             53697002
     2    1             6151228
     3    3             5519859
     4    4             4184389
     5    2             3108562
     6    10            1052044
     7    6             877417
     8    14            838940
     9    5             651749
     10   11            192534
     11   100           148135
     Underlying table:
     • Wikipedia page revision records
     • Rows: 314 million
     • Byte size: 35.7 GB
     Query stats:
     • Scanned 7 GB of data
     • <5 seconds
     • ~100M rows scanned / second
  33. [Diagram: execution plan for the Wikipedia query]
     • Leaves read from Distributed Storage at 10 GB/s: SELECT wp_namespace, revision_id WHERE timestamp > CUTOFF
     • Leaves pre-aggregate: COUNT(revision_id) GROUP BY wp_namespace
     • Mixer 1 nodes merge: COUNT(revision_id) GROUP BY wp_namespace
     • Mixer 0 finishes: ORDER BY Revs DESC
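     Pieced together from the operator fragments in the plan, the query being executed is roughly the following sketch (CUTOFF stands in for a timestamp literal, as on the slide):

       SELECT wp_namespace, COUNT(revision_id) AS Revs
       FROM [publicdata:samples.wikipedia]
       WHERE timestamp > CUTOFF
       GROUP BY wp_namespace
       ORDER BY Revs DESC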
  34. "Multi-stage" Query SELECT contributor_id, INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] SELECT contributor_id,

    INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) SELECT LogEdits, COUNT(contributor_id) Contributors FROM ( SELECT contributor_id, INTEGER(LOG10(COUNT(*))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) GROUP BY LogEdits ORDER BY LogEdits DESC
  35. [Diagram: execution of the multi-stage query, with Shufflers between the Leaves and the Mixers.
     Stage 1 (inner query): Leaves read contributor_id from Distributed Storage, shuffle by contributor_id, then compute COUNT(*) GROUP BY contributor_id.
     Stage 2 (outer query): COUNT(contributor_id) GROUP BY LogEdits at the Leaves and Mixer 1 nodes; Mixer 0 applies ORDER BY LogEdits DESC. Shuffle bandwidth annotated N^2 GB/s]
  36. When to use EACH
     • Shuffle definitely adds some overhead
     • Poor query performance if used incorrectly
     • GROUP BY: use when Groups << Rows (shuffling on a low-cardinality key would give an unbalanced load)
       o Example: GROUP BY state
     • GROUP EACH BY: use when Groups ~ Rows
       o Example: GROUP EACH BY user_id
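     A side-by-side sketch of the two forms; the first query reuses the natality example from earlier in the deck, while the second uses a hypothetical table and field:

       -- Few distinct groups: plain GROUP BY, no shuffle needed
       SELECT state, COUNT(*) FROM [publicdata:samples.natality] GROUP BY state

       -- Roughly one group per row: GROUP EACH BY shuffles by the key first
       SELECT user_id, COUNT(*) FROM [logs.activity] GROUP EACH BY user_id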