Analyzing 33 million bike trips with BigQuery (GDG Brisbane)

Analyzing 33 million bike trips with BigQuery Sara Robinson @SRobTweets

2 @SRobTweets @bretmcg 2

3 @SRobTweets So many bikes, so many questions • How
many stations are there? • How many bikes are in circulation at a given time? • What are the most popular routes? • Do members have different riding habits than non-members? • Do men and women take the same routes? • What else???!???

4 @SRobTweets All Citibike data is publicly available!

Google Cloud Platform 7 How can we analyze it?

8 @SRobTweets BigQuery Managed big data analytics warehouse • Fast:
terabytes in seconds • Simple: SQL • Interoperable: Java, Python, Tableau, R… • Instant data sharing • Free monthly quota - 1 TB queries, 10 GB storage

9 @SRobTweets Analyzing our bike data in BigQuery 1. Import
the data 2. Analyze 3. Visualize

10 @SRobTweets Importing data into BigQuery Step 1: get the
CSV files from Cloud Storage

11 @SRobTweets Importing data into BigQuery Step 2: define a
BQ schema and import the files

12 @SRobTweets Other ways to import data into BigQuery •
Web UI • Rest API through client libraries • Cloud Dataflow • Firebase

Google Cloud Platform 13 Let’s analyze it!

14 @SRobTweets Visualizing our data

Google Cloud Platform 15 Other public datasets in BigQuery

16 @SRobTweets All of GitHub’s code in BigQuery

Let’s find all of the JavaScript files on GitHub SELECT
COUNT(*) FROM [bigquery-public-data:github_repos.files] WHERE RIGHT (path, 3) = ".js"

What about most popular NPM packages? SELECT COUNT(*) as cnt,
package FROM JS( (SELECT content FROM [bigquery-public-data:github_repos.contents] WHERE id IN ( SELECT id FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json" )), content, "[{ name: 'package', type: 'string'}]", "function(row, emit) { try { x = JSON.parse(row.content); if (x.dependencies) { Object.keys(x.dependencies).forEach(function(dep) { emit({ package: dep }); }); } } catch (e) {} }" ) GROUP BY package ORDER BY cnt DESC

What about most popular NPM packages?

Google Cloud Platform 20 One more thing...

21 @SRobTweets What about Australian data?

22 @SRobTweets Weather data in BigQuery NOAA dataset: bit.ly/bq-noaa-dataset

23 @SRobTweets Find all the weather stations in Australia NOAA
dataset: bit.ly/bq-noaa-dataset SELECT * FROM `bigquery-public-data.noaa_gsod.stations` WHERE country = "AS"

24 @SRobTweets Mapping Australian weather trends NOAA dataset: bit.ly/bq-noaa-dataset SELECT
stn, b.lat, b.lon, round(avg(temp), 2) as avg_temp, round(avg(prcp),2) as avg_rainfall, round(avg(cast (wdsp as float64)), 2) as avg_windspeed FROM `bigquery-public-data.noaa_gsod.gsod*` a JOIN ( SELECT lat, lon, usaf FROM `bigquery-public-data.noaa_gsod.stations` ) b ON a.stn = b.usaf WHERE stn in ( SELECT usaf as stn FROM `bigquery-public-data.noaa_gsod.stations` where country = "AS" ) GROUP BY 1,2,3

25 @SRobTweets Australia subreddits Reddit comments dataset: bit.ly/bq-reddit-comments (thanks @felipehoffa!)
SELECT body, subreddit, score FROM `fh-bigquery.reddit_comments.2017_06` WHERE lower(subreddit) like '%australia%'

Google Cloud Platform 26 Let’s add some ML to our
big data

27 @SRobTweets Natural Language Processing with the Cloud NL API
Reddit comments NL API BigQuery Data visualization

28 @SRobTweets Natural Language Processing with the Cloud NL API

29 @SRobTweets Step 1: store the NL API response in
BigQuery

30 @SRobTweets Step 1: store the NL API response in
BigQuery But how will we parse these giant JSON strings?!

31 @SRobTweets Step 2: write some queries with UDFs SELECT
COUNT(*) as adjective, adj_count FROM JS( (SELECT tokens FROM [sara-bigquery:australia.reddit_comments] ), tokens, "[{ name:'adjective', type: 'string'}]", "function(row, emit) { try { x = JSON.parse(row.tokens); x.forEach(function(token) { if (token.partOfSpeech.tag === 'ADJ') { emit({ adjective: token.lemma.toLowerCase() }); } }); } catch (e) {} }" ) GROUP BY adjective ORDER BY adj_count DESC LIMIT 100

32 @SRobTweets Step 3: visualize the results Most commented adjectives

33 @SRobTweets Step 3: visualize the results Most commented subjects

34 @SRobTweets Resources • BigQuery: cloud.google.com/bigquery • Public NYC bike
data: bit.ly/bq-bike-table • Bike data blog post: bit.ly/nyc-bike-blog • NOAA dataset: bit.ly/bq-noaa-dataset • Reddit dataset: bit.ly/bq-reddit-comments

Thank You Sara Robinson @SRobTweets

Analyzing 33 million bike trips with BigQuery (...

Analyzing 33 million bike trips with BigQuery (GDG Brisbane)

More Decks by sararob

Other Decks in Programming

Featured

Transcript