BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.

Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org

Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions

Cb6e6da05b5b943d2691ceefa3381cad?s=128

Big Data Spain

December 23, 2014
Tweet

Transcript

  1. HANDS ON WITH BIGQUERY JAVASCRIPT UDFS THOMAS PARK SOFTWARE ENGINEER

    - GOOGLE
  2. Hands-on with BigQuery JavaScript User-Defined Functions Thomas Park Software Engineer

    - Google Felipe Hoffa @felipehoffa Developer Advocate - Google
  3. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking

    I. II. III. IV.
  4. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking

    I. II. III. IV.
  5. What is BigQuery?

  6. BigQuery: Big Data Analytics in the Cloud Unrivaled Performance and

    Scale • Scan multiple TB’s in seconds • Interactive query performance • No limits on amount of data Ease of Use and Adoption • No administration / provisioning • Convenience of SQL • Open interfaces (REST, WebUI, ODBC) • First 1 TB of data processed per month is free Advanced “Big Data” Storage • Familiar database structure • Easy data management and ACL’s • Fast, atomic imports
  7. Google confidential │ Do not distribute How many pageviews does

    Wikipedia have in a month? SELECT COUNT(*)FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
  8. Google confidential │ Do not distribute $500 in Cloud Platform

    credit to launch your idea! Build. Store. Analyze. On the same infrastructure that powers Google Start building Click ‘Apply Now’ and complete the application with promo code: bigdata-spain Starter Pack Offer Description 1 2 3 Go to cloud.google.com/developers/starterpack
  9. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking

    I. II. III. IV.
  10. Images by Connie Zhou Scenario: Door access records from a

    Very Well-Secured Lab where users must badge in to enter or leave Image by Tod Kurt
  11. Images by Connie Zhou Example: Time-series analysis from discrete user

    action data Image by Tod Kurt
  12. user_id timestamp Beep!! 9h: arrive @ lab thomas 2014.07.15 09:00

    Beep!! 10h: leave to pick up prototype thomas 2014.07.15 10:00 Beep!! 10h15: return with prototype thomas 2014.07.15 10:15 Beep!! 12h: out for lunch thomas 2014.07.15 12:00
  13. How can we find out how much time each user

    spent in the lab? ...where each scan of the user’s access card is represented as a discrete row?
  14. rownum user_id timestamp 1 thomas 2014.07.15 09:00 2 thomas 2014.07.15

    10:00 3 thomas 2014.07.15 10:15 4 thomas 2014.07.15 12:00 60 minutes 105 minutes Our analysis with data in this format via SQL is horrid and painful A BigQuery + JS friendly format: data for each user in separate rows user_id timestamps thomas [ 09:00, 10:00, 10:15, 12:00, ... ] hoffa [ 08:10, 11:30, 12:00, 12:15, ... ] SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id; Producing this format is trivial in BigQuery...
  15. // This function will be called once for each user,

    // and receive an array of timestamps. function(record, emit) { var total_time = 0; // Order of records built by NEST are not guaranteed! // Sort to guarantee ascending timestamps. var ts = record.ts.sort( function (a, b) {return a > b;}); // Loop over timestamp pairs, calculate interval. for (var i = 0; i < ts.length - 1; i += 2) { total_time += (ts[i+1] - ts[i]); } // Emit total time for this user. emit({user: record.user_id, total_time: total_time}); } JS: Total time for each user
  16. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  17. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") The JS function
  18. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input schema (column names only!)
  19. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Output schema (full declaration)
  20. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input table or subquery
  21. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411]

    // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  22. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking

    I. II. III. IV.
  23. How BigQuery works Get data from lower levels, filter /

    join / transform, send rows up Tree Structured Query Dispatch and Aggregation Distributed Storage SELECT title, requests Leaf Leaf Leaf Leaf SUM(requests) GROUP BY title WHERE REGEX_MATCH(title, 'pat.*rn') Mixer 1 Mixer 1 SUM(requests) GROUP BY title Mixer 0 LIMIT 10 ORDER BY c DESC SUM(requests) GROUP BY title
  24. Data for each row is calculated and streamed through a

    “Row Iterator” Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2
  25. Can insert JavaScript Functions wherever we have a Row Iterator

    Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  26. Join order item info with web hits info SELECT item

    FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  27. http://www.store.com/?q=7%2e1+Speakers SELECT item FROM orders SELECT query string FROM hits

    JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  28. http://www.store.com/?q=7%2e1+Speakers Extract and decode query term => “7.1 Speakers” SELECT

    item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  29. UDF execution Subquery0 Subquery1 JOIN UDF1 Process boundary UDF0 UDF0

    User Code
  30. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking

    I. II. III. IV.
  31. Demos: Ñ

  32. Image: El Hormiguero (Flickr CC)

  33. http://jsfiddle.net/fhoffa/y4pt9s23/ Image: TheVanCats (Flickr CC)

  34. Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com Share: bigqueri.es Thomas Park Felipe

    Hoffa @felipehoffa +FelipeHoffa Rate us? http://goo.gl/k3bzdw
  35. Backup slides / screenshots

  36. None
  37. None
  38. None
  39. None
  40. None
  41. None
  42. 17TH ~ 18th NOV 2014 MADRID (SPAIN)