Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani

Cb6e6da05b5b943d2691ceefa3381cad?s=128

Big Data Spain

November 16, 2012
Tweet

Transcript

  1. None
  2. Crunching Data with BigQuery Fast analysis of Big Data Jordan

    Tigani, Software Engineer
  3. 01000001011011100111001101110111011001010111001 00010000001110100011011110010000001110100011010 00011001010010000001010101011011000111010001101 00101101101011000010111010001100101001000000101 00010111010101100101011100110111010001101001011 01111011011100010000001101111011001100010000001 00110001101001011001100110010100101100001000000 11101000110100001100101001000000101010101101110 01101001011101100110010101110010011100110110010 10010110000100000011000010110111001100100001000

    00010001010111011001100101011100100111100101110 100101110011001000000011010000110010...........
  4. Big Data at Google 72 hours 100 million gigabytes

  5. SELECT kick_ass_product_plan AS strategy, AVG(kicking_factor) AS awesomeness FROM lots_of_data GROUP

    BY strategy
  6. +-------------+----------------+ | strategy | awesomeness | +-------------+----------------+ | "Forty-two" |

    1000000.01 | +-------------+----------------+ 1 row in result set (10.2 s) Scanned 100GB
  7. None
  8. None
  9. Regular expressions on 13 billion rows...

  10. 13 Billion rows 1 TB of data in 4 tables

    FAST! AST
  11. Google's Internal Technology: Dremel

  12. MapReduce is Flexible but Heavy Master Mapper Mapper • Master

    constructs the plan and begins spinning up workers Distributed Storage • Mappers read and write to distributed storage • Map => Shuffle => Reduce Reducer • Reducers read and write to distributed storage
  13. Master Reducer Mapper Mapper Stage 2 MapReduce is Flexible but

    Heavy Stage 1 Master Mapper Mapper Distributed Storage Reducer
  14. Dremel vs MapReduce • MapReduce o Flexible batch processing o

    High overall throughput o High latency • Dremel o Optimized for interactive SQL queries o Very low latency
  15. Mixer 0 Mixer 1 Mixer 1 Leaf Leaf Leaf Leaf

    Distributed Storage Dremel Architecture • Columnar Storage • Long lived shared serving tree • Partial Reduction • Diskless data flow
  16. SELECT state, COUNT(*) count_babies FROM [publicdata:samples.natality] WHERE year >= 1980

    AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10 Simple Query
  17. Mixer 0 Mixer 1 Mixer 1 Leaf Leaf Leaf Leaf

    Distributed Storage SELECT state, year O(Rows ~140M) COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990 O(50 states) LIMIT 10 ORDER BY count_babies DESC COUNT(*) GROUP BY state COUNT(*) GROUP BY state O(50 states) O(50 states)
  18. Modeling Data

  19. Example: Daily Weather Station Data weather_station_data station lat long mean_temp

    humidity timestamp year month day 9384 33.57 86.75 89.3 .35 1351005129 2011 04 19 2857 36.77 119.72 78.5 .24 1351005135 2011 04 19 3475 40.77 73.98 68 .35 1351015930 2011 04 19 etc...
  20. Example: Daily Weather Station Data station, lat, long, mean_temp, year,

    mon, day 999999, 36.624, -116.023, 63.6, 2009, 10, 9 911904, 20.963, -156.675, 83.4, 2009, 10, 9 916890, -18133, 178433, 76.9, 2009, 10, 9 943320, -20678, 139488, 73.8, 2009, 10, 9 CSV
  21. Organizing BigQuery Tables Your Source Data October 22 October 23

    October 24
  22. None
  23. Modeling Event Data: Social Music Store logs.oct_24_2012_song_activities USERNAME ACTIVITY Cost

    SONG ARTIST TIMESTAMP Michael LISTEN Too Close Alex Clare 1351065562 Michael LISTEN Gangnam Style PSY 1351105150 Jim LISTEN Complications Deadmau5 1351075720 Michael PURCHASE 0.99 Gangnam Style PSY 1351115962 logs.oct_24_2012_song_activities USERNAME ACTIVITY Cost SONG ARTIST TIMESTAMP Michael LISTEN Too Close Alex Clare 1351065562 Michael LISTEN Gangnam Style PSY 1351105150 Jim LISTEN Complications Deadmau5 1351075720 Michael PURCHASE 0.99 Gangnam Style PSY 1351115962
  24. Users Who Listened to More than 10 Songs/Day SELECT UserId,

    COUNT(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] GROUP EACH BY UserId HAVING ListenActivites > 10
  25. How Many Songs Listened to Total by Listeners of PSY?

    SELECT UserId, count(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] WHERE UserId IN ( SELECT UserId FROM [logs.oct_24_2012_song_activities] WHERE artist = 'PSY') GROUP EACH BY UserId HAVING ListenActivites > 10
  26. Modeling Event Data: Nested and Repeated Values {"UserID" : "Michael",

    "Listens": [ {"TrackId":1234,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700}, {"TrackId":1234,"Title":"Alex Clare", "Artist":"Alex Clare",'Timestamp":1351075700} ] "Purchases": [ {"Track":2345,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700,"Cost":0.99} ]} JSON {"UserID" : "Michael", "Listens": [ {"TrackId":1234,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700}, {"TrackId":1234,"Title":"Alex Clare", "Artist":"Alex Clare",'Timestamp":1351075700} ] "Purchases": [ {"Track":2345,"Title":"Gangam Style", "Artist":"PSY","Timestamp":1351075700,"Cost":0.99} ]} {"UserID" : "Michael", "Listens": [ {"TrackId":1234,"Title":"Gangnam Style", "Artist":"PSY","Timestamp":1351075700}, {"TrackId":1234,"Title":"Alex Clare", "Artist":"Alex Clare",'Timestamp":1351075700} ] "Purchases": [ {"Track":2345,"Title":"Gangnam Style", "Artist":"PSY","Timestamp":1351075700,"Cost":0.99} ]}
  27. Which Users Have Listened to Beyonce? SELECT UserID, COUNT(ListenActivities.artist) WITHIN

    RECORD AS song_count FROM [logs.oct_24_2012_songactivities] WHERE UserID IN (SELECT UserID, FROM [logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'Beyonce');
  28. What Position are PSY songs in our Users' Daily Playlists?

    SELECT UserID, POSITION(ListenActivities.artist) FROM [sample_music_logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'PSY';
  29. SELECT AVG(POSITION(ListenActivities.artist)) FROM [sample_music_logs.oct_24_2012_songactivities], [sample_music_logs.oct_23_2012_songactivities], /* etc... */ WHERE ListenActivities.artist

    = 'PSY'; Average Position of Songs by PSY in All Daily Playlists?
  30. Summary: Choosing a BigQuery Data Model • "Shard" your Data

    Using Multiple Tables • Source Data Files • CSV format • Newline-delimited JSON • Using Nested and Repeated Records • Simplify Some Types of Queries • Often Matches Document Database Models
  31. Developing with BigQuery

  32. None
  33. Google Cloud Storage Upload Your Data BigQuery

  34. Load your Data into BigQuery "jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{

    "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ... POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs "jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ...
  35. Query Away! "jobReference":{ "projectId":"605902584318", "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count

    FROM widget_sales", "maxResults":100, "apiVersion":"v2" } POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs "jobReference":{ "projectId":"605902584318", "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2" }
  36. Libraries • Python • Java • .NET • Ruby •

    JavaScript • Go • PHP • Objective-C
  37. Libraries - Example JavaScript Query var request = gapi.client.bigquery.jobs.query({ 'projectId':

    project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;' }); request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...
  38. Custom Code and the Google Chart Tools API

  39. Google Spreadsheets

  40. Commercial Visualization Tools

  41. Demo: Using BigQuery on BigQuery

  42. • Full table scans FAST • Aggregate Queries on Massive

    Datasets • Supports Flat and Nested/Repeated Data Models • It's an API BigQuery - Aggregate Big Data Analysis in Seconds Get started now: http://developers.google.com/bigquery/
  43. SELECT questions FROM audience SELECT 'Thank You!' FROM jordan http://developers.google.com/bigquery

  44. None
  45. Schema definition birth_record parent_id_mother parent_id_father plurality is_male race weight parents

    id race age cigarette_use state
  46. Schema definition birth_record mother_race mother_age mother_cigarette_use mother_state father_race father_age father_cigarette_use

    father_state plurality is_male race weight
  47. Tools to prepare your data • App Engine MapReduce •

    Commercial ETL tools • Pervasive • Informatica • Talend • UNIX command-line
  48. Schema definition - sharding birth_record_2011 mother_race mother_age mother_cigarette_use mother_state father_race

    father_age father_cigarette_use father_state plurality is_male race weight birth_record_2012 mother_race mother_age mother_cigarette_use mother_state father_race father_age father_cigarette_use father_state plurality is_male race weight birth_record_2013 birth_record_2014 birth_record_2015 birth_record_2016
  49. Visualizing your Data

  50. BigQuery architecture

  51. “ If you do a table scan over a 1TB

    table, you're going to have a bad time. ” Anonymous 16th century Italian Philosopher-Monk
  52. • • Reading 1 TB/ second from disk: • 10k+

    disks • Processing 1 TB / sec: • 5k processors Goal: Perform a 1 TB table scan in 1 second Parallelize Parallelize Parallelize!
  53. Data access: Column Store Record Oriented Storage Column Oriented Storage

  54. Distributed Storage (e.g. GFS) BigQuery Architecture Mixer 0 Mixer 1

    Shard 0-8 Mixer 1 Shard 17-24 Mixer 1 Shard 9-16 Shard 0 Shard 10 Shard 12 Shard 24 Shard 20
  55. Running your Queries

  56. SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ... BigQuery SQL Example: Simple

    aggregates
  57. SELECT ... FROM .... WHERE REGEXP_MATCH(url, "\.com$") AND user CONTAINS

    'test' BigQuery SQL Example: Complex Processing
  58. SELECT COUNT(*) FROM (SELECT foo ..... ) GROUP BY foo

    BigQuery SQL Example: Nested SELECT
  59. BigQuery SQL Example: Small JOIN SELECT huge_table.foo FROM huge_table JOIN

    small_table ON small_table.foo = huge_table.foo
  60. Distributed Storage (e.g. GFS) BigQuery Architecture: Small Join Mixer 0

    Mixer 1 Shard 0-8 Mixer 1 Shard 17-24 Shard 0 Shard 24 Shard 20
  61. Other new features!

  62. Batch queries! • Don't need interactive queries for some jobs?

    • priority: "BATCH"
  63. • API • Column-based datastore • Full table scans FAST

    • Aggregates • Commercial tool support • Use cases That's it
  64. SELECT questions FROM audience SELECT 'Thank You!' FROM ryan http://developers.google.com/bigquery

    @ryguyrg http://profiles.google.com/ryan.boyd
  65. None
  66. Data access: Column Store Record Oriented Storage Column Oriented Storage

  67. A Little Later ... Row wp_namespace Revs 1 0 53697002

    2 1 6151228 3 3 5519859 4 4 4184389 5 2 3108562 6 10 1052044 7 6 877417 8 14 838940 9 5 651749 10 11 192534 11 100 148135 Underlying table: • Wikipedia page revision records • Rows: 314 million • Byte size: 35.7 GB Query Stats: • Scanned 7G of data • <5 seconds • ~ 100M rows scanned / second
  68. Mixer 0 Mixer 1 Mixer 1 Leaf Leaf Leaf Leaf

    Distributed Storage SELECT wp_namespace, revision_id 10 GB / s COUNT (revision_id) GROUP BY wp_namespace WHERE timestamp > CUTOFF ORDER BY Revs DESC COUNT (revision_id) GROUP BY wp_namespace COUNT (revision_id) GROUP BY wp_namespace
  69. "Multi-stage" Query SELECT contributor_id, INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] SELECT contributor_id,

    INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) SELECT LogEdits, COUNT(contributor_id) Contributors FROM ( SELECT contributor_id, INTEGER(LOG10(COUNT(*))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) GROUP BY LogEdits ORDER BY LogEdits DESC
  70. Mixer 0 Mixer 1 Mixer 1 Leaf Leaf Shuffler Shuffler

    Distributed Storage SELECT contributor_id ORDER BY LogEdits DESC COUNT(contributor_id) GROUP BY LogEdits COUNT(contributor_id) GROUP BY LogEdits COUNT(contributor_id) GROUP BY LogEdits SELECT LE, Id COUNT(*) GROUP BY contributor_id Shuffle by contributor_id N^2 GB/s
  71. When to use EACH • Shuffle definitely adds some overhead

    • Poor query performance if used incorrectly • GROUP BY o Groups << Rows => Unbalanced load o Example: GROUP BY state • GROUP EACH BY o Groups ~ Rows o Example: GROUP BY user_id