Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.

Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org

Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions

Big Data Spain

December 23, 2014
Tweet

More Decks by Big Data Spain

Other Decks in Technology

Transcript

  1. Hands-on with BigQuery JavaScript User-Defined Functions Thomas Park Software Engineer

    - Google Felipe Hoffa @felipehoffa Developer Advocate - Google
  2. BigQuery: Big Data Analytics in the Cloud Unrivaled Performance and

    Scale • Scan multiple TB’s in seconds • Interactive query performance • No limits on amount of data Ease of Use and Adoption • No administration / provisioning • Convenience of SQL • Open interfaces (REST, WebUI, ODBC) • First 1 TB of data processed per month is free Advanced “Big Data” Storage • Familiar database structure • Easy data management and ACL’s • Fast, atomic imports
  3. Google confidential │ Do not distribute How many pageviews does

    Wikipedia have in a month? SELECT COUNT(*)FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
  4. Google confidential │ Do not distribute $500 in Cloud Platform

    credit to launch your idea! Build. Store. Analyze. On the same infrastructure that powers Google Start building Click ‘Apply Now’ and complete the application with promo code: bigdata-spain Starter Pack Offer Description 1 2 3 Go to cloud.google.com/developers/starterpack
  5. Images by Connie Zhou Scenario: Door access records from a

    Very Well-Secured Lab where users must badge in to enter or leave Image by Tod Kurt
  6. user_id timestamp Beep!! 9h: arrive @ lab thomas 2014.07.15 09:00

    Beep!! 10h: leave to pick up prototype thomas 2014.07.15 10:00 Beep!! 10h15: return with prototype thomas 2014.07.15 10:15 Beep!! 12h: out for lunch thomas 2014.07.15 12:00
  7. How can we find out how much time each user

    spent in the lab? ...where each scan of the user’s access card is represented as a discrete row?
  8. rownum user_id timestamp 1 thomas 2014.07.15 09:00 2 thomas 2014.07.15

    10:00 3 thomas 2014.07.15 10:15 4 thomas 2014.07.15 12:00 60 minutes 105 minutes Our analysis with data in this format via SQL is horrid and painful A BigQuery + JS friendly format: data for each user in separate rows user_id timestamps thomas [ 09:00, 10:00, 10:15, 12:00, ... ] hoffa [ 08:10, 11:30, 12:00, 12:15, ... ] SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id; Producing this format is trivial in BigQuery...
  9. // This function will be called once for each user,

    // and receive an array of timestamps. function(record, emit) { var total_time = 0; // Order of records built by NEST are not guaranteed! // Sort to guarantee ascending timestamps. var ts = record.ts.sort( function (a, b) {return a > b;}); // Loop over timestamp pairs, calculate interval. for (var i = 0; i < ts.length - 1; i += 2) { total_time += (ts[i+1] - ts[i]); } // Emit total time for this user. emit({user: record.user_id, total_time: total_time}); } JS: Total time for each user
  10. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  11. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") The JS function
  12. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input schema (column names only!)
  13. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Output schema (full declaration)
  14. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input table or subquery
  15. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  16. How BigQuery works Get data from lower levels, filter /

    join / transform, send rows up Tree Structured Query Dispatch and Aggregation Distributed Storage SELECT title, requests Leaf Leaf Leaf Leaf SUM(requests) GROUP BY title WHERE REGEX_MATCH(title, 'pat.*rn') Mixer 1 Mixer 1 SUM(requests) GROUP BY title Mixer 0 LIMIT 10 ORDER BY c DESC SUM(requests) GROUP BY title
  17. Data for each row is calculated and streamed through a

    “Row Iterator” Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2
  18. Can insert JavaScript Functions wherever we have a Row Iterator

    Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  19. Join order item info with web hits info SELECT item

    FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  20. http://www.store.com/?q=7%2e1+Speakers SELECT item FROM orders SELECT query string FROM hits

    JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  21. http://www.store.com/?q=7%2e1+Speakers Extract and decode query term => “7.1 Speakers” SELECT

    item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0