
Analyzing 33 million bike trips with BigQuery (GDG Brisbane)

With so many people using NYC’s bike share program, I wanted to see if I could spot any trends in usage. But how did I go from observing bikers on the street to creating visualizations of popular routes based on different biker demographics?
Enter BigQuery, Google’s managed big data analytics warehouse. Using BigQuery, I ran queries on 33 million bike trips and integrated my findings with a few visualization tools. I’ll explain my process, do some live queries, and show you how to do the same analysis with your own data.

sararob

August 11, 2017


Transcript

  1. So many bikes, so many questions
     • How many stations are there?
     • How many bikes are in circulation at a given time?
     • What are the most popular routes?
     • Do members have different riding habits than non-members?
     • Do men and women take the same routes?
     • What else?!
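     Two of these questions (popular routes, and whether men and women ride the same ones) can be answered with a single aggregation. A minimal sketch, assuming the public Citi Bike table bigquery-public-data.new_york_citibike.citibike_trips and its start_station_name, end_station_name, and gender columns; the exact table used in the talk is linked under Resources:

     -- Sketch: most popular routes, broken down by rider gender
     -- (verify column names against the table schema before running)
     SELECT
       start_station_name,
       end_station_name,
       gender,
       COUNT(*) AS trips
     FROM `bigquery-public-data.new_york_citibike.citibike_trips`
     GROUP BY start_station_name, end_station_name, gender
     ORDER BY trips DESC
     LIMIT 20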
  2. BigQuery: managed big data analytics warehouse
     • Fast: terabytes in seconds
     • Simple: SQL
     • Interoperable: Java, Python, Tableau, R…
     • Instant data sharing
     • Free monthly quota: 1 TB of queries, 10 GB of storage
  3. Other ways to import data into BigQuery
     • Web UI
     • REST API through client libraries
     • Cloud Dataflow
     • Firebase
  4. Let’s find all of the JavaScript files on GitHub

     -- Legacy SQL: count every file whose path ends in ".js"
     SELECT COUNT(*)
     FROM [bigquery-public-data:github_repos.files]
     WHERE RIGHT(path, 3) = ".js"
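     BigQuery also supports standard SQL, where the same count can use ENDS_WITH instead of RIGHT; a minimal sketch against the same public table:

     SELECT COUNT(*)
     FROM `bigquery-public-data.github_repos.files`
     WHERE ENDS_WITH(path, ".js")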
  5. What about the most popular NPM packages?

     -- Legacy SQL with an inline JavaScript UDF (JS): parse every
     -- package.json on GitHub and emit one row per dependency.
     SELECT COUNT(*) AS cnt, package
     FROM JS(
       (SELECT content
        FROM [bigquery-public-data:github_repos.contents]
        WHERE id IN (
          SELECT id
          FROM [bigquery-public-data:github_repos.files]
          WHERE RIGHT(path, 12) = "package.json"
        )),
       content,
       "[{ name: 'package', type: 'string'}]",
       "function(row, emit) {
          try {
            x = JSON.parse(row.content);
            if (x.dependencies) {
              Object.keys(x.dependencies).forEach(function(dep) {
                emit({ package: dep });
              });
            }
          } catch (e) {}
        }"
     )
     GROUP BY package
     ORDER BY cnt DESC
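     In standard SQL, the inline JS(...) wrapper becomes a temporary JavaScript UDF. A rough equivalent, sketched assuming the same public GitHub tables (the UDF name extractDeps is made up for illustration):

     CREATE TEMP FUNCTION extractDeps(content STRING)
     RETURNS ARRAY<STRING>
     LANGUAGE js AS """
       // Parse a package.json blob and return its dependency names
       try {
         var pkg = JSON.parse(content);
         return pkg.dependencies ? Object.keys(pkg.dependencies) : [];
       } catch (e) {
         return [];
       }
     """;

     SELECT dep AS package, COUNT(*) AS cnt
     FROM `bigquery-public-data.github_repos.contents` c
     CROSS JOIN UNNEST(extractDeps(c.content)) AS dep
     WHERE c.id IN (
       SELECT id
       FROM `bigquery-public-data.github_repos.files`
       WHERE ENDS_WITH(path, "package.json")
     )
     GROUP BY package
     ORDER BY cnt DESC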
  6. Find all the weather stations in Australia (NOAA dataset: bit.ly/bq-noaa-dataset)

     -- Standard SQL: the stations table uses country code "AS" for Australia
     SELECT *
     FROM `bigquery-public-data.noaa_gsod.stations`
     WHERE country = "AS"
  7. Mapping Australian weather trends (NOAA dataset: bit.ly/bq-noaa-dataset)

     -- Average temperature, rainfall, and wind speed per Australian station,
     -- joining the yearly GSOD tables to station coordinates.
     SELECT
       stn,
       b.lat,
       b.lon,
       ROUND(AVG(temp), 2) AS avg_temp,
       ROUND(AVG(prcp), 2) AS avg_rainfall,
       ROUND(AVG(CAST(wdsp AS FLOAT64)), 2) AS avg_windspeed
     FROM `bigquery-public-data.noaa_gsod.gsod*` a
     JOIN (
       SELECT lat, lon, usaf
       FROM `bigquery-public-data.noaa_gsod.stations`
     ) b
     ON a.stn = b.usaf
     WHERE stn IN (
       SELECT usaf AS stn
       FROM `bigquery-public-data.noaa_gsod.stations`
       WHERE country = "AS"
     )
     GROUP BY 1, 2, 3
  8. Australia subreddits (Reddit comments dataset: bit.ly/bq-reddit-comments, thanks @felipehoffa!)

     -- Comments from any subreddit with "australia" in its name (June 2017)
     SELECT body, subreddit, score
     FROM `fh-bigquery.reddit_comments.2017_06`
     WHERE LOWER(subreddit) LIKE '%australia%'
  9. Natural Language Processing with the Cloud NL API

     Pipeline: Reddit comments → NL API → BigQuery → data visualization
  10. Step 1: store the NL API response in BigQuery. But how will we parse these giant JSON strings?!
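     For a single field, BigQuery's built-in JSON functions are enough; it is iterating over every token in the array that calls for the UDF on the next slide. A small sketch, assuming the tokens column of the talk's sara-bigquery:australia.reddit_comments table holds the NL API's JSON token array:

     -- Pull one scalar value out of the JSON string per row
     SELECT JSON_EXTRACT_SCALAR(tokens, '$[0].lemma') AS first_lemma
     FROM `sara-bigquery.australia.reddit_comments`
     LIMIT 10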
  11. Step 2: write some queries with UDFs

     -- Legacy SQL with an inline JavaScript UDF: parse the NL API's token
     -- JSON and count the most common adjective lemmas.
     SELECT adjective, COUNT(*) AS adj_count
     FROM JS(
       (SELECT tokens FROM [sara-bigquery:australia.reddit_comments]),
       tokens,
       "[{ name: 'adjective', type: 'string'}]",
       "function(row, emit) {
          try {
            x = JSON.parse(row.tokens);
            x.forEach(function(token) {
              if (token.partOfSpeech.tag === 'ADJ') {
                emit({ adjective: token.lemma.toLowerCase() });
              }
            });
          } catch (e) {}
        }"
     )
     GROUP BY adjective
     ORDER BY adj_count DESC
     LIMIT 100
  12. Resources
     • BigQuery: cloud.google.com/bigquery
     • Public NYC bike data: bit.ly/bq-bike-table
     • Bike data blog post: bit.ly/nyc-bike-blog
     • NOAA dataset: bit.ly/bq-noaa-dataset
     • Reddit dataset: bit.ly/bq-reddit-comments