Slide 1

Slide 1 text

HANDS ON WITH BIGQUERY JAVASCRIPT UDFS THOMAS PARK SOFTWARE ENGINEER - GOOGLE

Slide 2

Slide 2 text

Hands-on with BigQuery JavaScript User-Defined Functions Thomas Park Software Engineer - Google Felipe Hoffa @felipehoffa Developer Advocate - Google

Slide 3

Slide 3 text

Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.

Slide 4

Slide 4 text

Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.

Slide 5

Slide 5 text

What is BigQuery?

Slide 6

Slide 6 text

BigQuery: Big Data Analytics in the Cloud Unrivaled Performance and Scale ● Scan multiple TB’s in seconds ● Interactive query performance ● No limits on amount of data Ease of Use and Adoption ● No administration / provisioning ● Convenience of SQL ● Open interfaces (REST, WebUI, ODBC) ● First 1 TB of data processed per month is free Advanced “Big Data” Storage ● Familiar database structure ● Easy data management and ACL’s ● Fast, atomic imports

Slide 7

Slide 7 text

Google confidential │ Do not distribute How many pageviews does Wikipedia have in a month? SELECT COUNT(*)FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18

Slide 8

Slide 8 text

Google confidential │ Do not distribute $500 in Cloud Platform credit to launch your idea! Build. Store. Analyze. On the same infrastructure that powers Google Start building Click ‘Apply Now’ and complete the application with promo code: bigdata-spain Starter Pack Offer Description 1 2 3 Go to cloud.google.com/developers/starterpack

Slide 9

Slide 9 text

Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.

Slide 10

Slide 10 text

Images by Connie Zhou Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave Image by Tod Kurt

Slide 11

Slide 11 text

Images by Connie Zhou Example: Time-series analysis from discrete user action data Image by Tod Kurt

Slide 12

Slide 12 text

user_id timestamp Beep!! 9h: arrive @ lab thomas 2014.07.15 09:00 Beep!! 10h: leave to pick up prototype thomas 2014.07.15 10:00 Beep!! 10h15: return with prototype thomas 2014.07.15 10:15 Beep!! 12h: out for lunch thomas 2014.07.15 12:00

Slide 13

Slide 13 text

How can we find out how much time each user spent in the lab? ...where each scan of the user’s access card is represented as a discrete row?

Slide 14

Slide 14 text

rownum user_id timestamp 1 thomas 2014.07.15 09:00 2 thomas 2014.07.15 10:00 3 thomas 2014.07.15 10:15 4 thomas 2014.07.15 12:00 60 minutes 105 minutes Our analysis with data in this format via SQL is horrid and painful A BigQuery + JS friendly format: data for each user in separate rows user_id timestamps thomas [ 09:00, 10:00, 10:15, 12:00, ... ] hoffa [ 08:10, 11:30, 12:00, 12:15, ... ] SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id; Producing this format is trivial in BigQuery...

Slide 15

Slide 15 text

// This function will be called once for each user, // and receive an array of timestamps. function(record, emit) { var total_time = 0; // Order of records built by NEST are not guaranteed! // Sort to guarantee ascending timestamps. var ts = record.ts.sort( function (a, b) {return a > b;}); // Loop over timestamp pairs, calculate interval. for (var i = 0; i < ts.length - 1; i += 2) { total_time += (ts[i+1] - ts[i]); } // Emit total time for this user. emit({user: record.user_id, total_time: total_time}); } JS: Total time for each user

Slide 16

Slide 16 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")

Slide 17

Slide 17 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") The JS function

Slide 18

Slide 18 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input schema (column names only!)

Slide 19

Slide 19 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Output schema (full declaration)

Slide 20

Slide 20 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input table or subquery

Slide 21

Slide 21 text

SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")

Slide 22

Slide 22 text

Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.

Slide 23

Slide 23 text

How BigQuery works Get data from lower levels, filter / join / transform, send rows up Tree Structured Query Dispatch and Aggregation Distributed Storage SELECT title, requests Leaf Leaf Leaf Leaf SUM(requests) GROUP BY title WHERE REGEX_MATCH(title, 'pat.*rn') Mixer 1 Mixer 1 SUM(requests) GROUP BY title Mixer 0 LIMIT 10 ORDER BY c DESC SUM(requests) GROUP BY title

Slide 24

Slide 24 text

Data for each row is calculated and streamed through a “Row Iterator” Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2

Slide 25

Slide 25 text

Can insert JavaScript Functions wherever we have a Row Iterator Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0

Slide 26

Slide 26 text

Join order item info with web hits info SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0

Slide 27

Slide 27 text

http://www.store.com/?q=7%2e1+Speakers SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0

Slide 28

Slide 28 text

http://www.store.com/?q=7%2e1+Speakers Extract and decode query term => “7.1 Speakers” SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0

Slide 29

Slide 29 text

UDF execution Subquery0 Subquery1 JOIN UDF1 Process boundary UDF0 UDF0 User Code

Slide 30

Slide 30 text

Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.

Slide 31

Slide 31 text

Demos: Ñ

Slide 32

Slide 32 text

Image: El Hormiguero (Flickr CC)

Slide 33

Slide 33 text

http://jsfiddle.net/fhoffa/y4pt9s23/ Image: TheVanCats (Flickr CC)

Slide 34

Slide 34 text

Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com Share: bigqueri.es Thomas Park Felipe Hoffa @felipehoffa +FelipeHoffa Rate us? http://goo.gl/k3bzdw

Slide 35

Slide 35 text

Backup slides / screenshots

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

17TH ~ 18th NOV 2014 MADRID (SPAIN)