Analyzing 33 million bike trips with BigQuery (GDG Brisbane)

778425a9498f00198e57896c7b2a95d3?s=47 sararob
August 11, 2017

Analyzing 33 million bike trips with BigQuery (GDG Brisbane)

With so many people using NYC’s bike share program, I wanted to see if I could spot any trends in usage. But how did I go from observing bikers on the street to creating visualizations of popular routes based on different biker demographics?
Enter Google’s managed big data analytics warehouse tool - BigQuery! Using BigQuery, I ran queries on 33 million bike trips and integrated my findings with a few visualization tools. I’ll explain my process, do some live queries, and show you how to do the same analysis with your own data.

778425a9498f00198e57896c7b2a95d3?s=128

sararob

August 11, 2017
Tweet

Transcript

  1. Analyzing 33 million bike trips with BigQuery Sara Robinson @SRobTweets

  2. 2 @SRobTweets @bretmcg 2

  3. 3 @SRobTweets So many bikes, so many questions • How

    many stations are there? • How many bikes are in circulation at a given time? • What are the most popular routes? • Do members have different riding habits than non-members? • Do men and women take the same routes? • What else???!???
  4. 4 @SRobTweets All Citibike data is publicly available!

  5. 5 @SRobTweets All Citibike data is publicly available!

  6. 6 @SRobTweets All Citibike data is publicly available!

  7. Google Cloud Platform 7 How can we analyze it?

  8. 8 @SRobTweets BigQuery Managed big data analytics warehouse • Fast:

    terabytes in seconds • Simple: SQL • Interoperable: Java, Python, Tableau, R… • Instant data sharing • Free monthly quota - 1 TB queries, 10 GB storage
  9. 9 @SRobTweets Analyzing our bike data in BigQuery 1. Import

    the data 2. Analyze 3. Visualize
  10. 10 @SRobTweets Importing data into BigQuery Step 1: get the

    CSV files from Cloud Storage
  11. 11 @SRobTweets Importing data into BigQuery Step 2: define a

    BQ schema and import the files
  12. 12 @SRobTweets Other ways to import data into BigQuery •

    Web UI • Rest API through client libraries • Cloud Dataflow • Firebase
  13. Google Cloud Platform 13 Let’s analyze it!

  14. 14 @SRobTweets Visualizing our data

  15. Google Cloud Platform 15 Other public datasets in BigQuery

  16. 16 @SRobTweets All of GitHub’s code in BigQuery

  17. Let’s find all of the JavaScript files on GitHub SELECT

    COUNT(*) FROM [bigquery-public-data:github_repos.files] WHERE RIGHT (path, 3) = ".js"
  18. What about most popular NPM packages? SELECT COUNT(*) as cnt,

    package FROM JS( (SELECT content FROM [bigquery-public-data:github_repos.contents] WHERE id IN ( SELECT id FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json" )), content, "[{ name: 'package', type: 'string'}]", "function(row, emit) { try { x = JSON.parse(row.content); if (x.dependencies) { Object.keys(x.dependencies).forEach(function(dep) { emit({ package: dep }); }); } } catch (e) {} }" ) GROUP BY package ORDER BY cnt DESC
  19. What about most popular NPM packages?

  20. Google Cloud Platform 20 One more thing...

  21. 21 @SRobTweets What about Australian data?

  22. 22 @SRobTweets Weather data in BigQuery NOAA dataset: bit.ly/bq-noaa-dataset

  23. 23 @SRobTweets Find all the weather stations in Australia NOAA

    dataset: bit.ly/bq-noaa-dataset SELECT * FROM `bigquery-public-data.noaa_gsod.stations` WHERE country = "AS"
  24. 24 @SRobTweets Mapping Australian weather trends NOAA dataset: bit.ly/bq-noaa-dataset SELECT

    stn, b.lat, b.lon, round(avg(temp), 2) as avg_temp, round(avg(prcp),2) as avg_rainfall, round(avg(cast (wdsp as float64)), 2) as avg_windspeed FROM `bigquery-public-data.noaa_gsod.gsod*` a JOIN ( SELECT lat, lon, usaf FROM `bigquery-public-data.noaa_gsod.stations` ) b ON a.stn = b.usaf WHERE stn in ( SELECT usaf as stn FROM `bigquery-public-data.noaa_gsod.stations` where country = "AS" ) GROUP BY 1,2,3
  25. 25 @SRobTweets Australia subreddits Reddit comments dataset: bit.ly/bq-reddit-comments (thanks @felipehoffa!)

    SELECT body, subreddit, score FROM `fh-bigquery.reddit_comments.2017_06` WHERE lower(subreddit) like '%australia%'
  26. Google Cloud Platform 26 Let’s add some ML to our

    big data
  27. 27 @SRobTweets Natural Language Processing with the Cloud NL API

    Reddit comments NL API BigQuery Data visualization
  28. 28 @SRobTweets Natural Language Processing with the Cloud NL API

  29. 29 @SRobTweets Step 1: store the NL API response in

    BigQuery
  30. 30 @SRobTweets Step 1: store the NL API response in

    BigQuery But how will we parse these giant JSON strings?!
  31. 31 @SRobTweets Step 2: write some queries with UDFs SELECT

    COUNT(*) as adjective, adj_count FROM JS( (SELECT tokens FROM [sara-bigquery:australia.reddit_comments] ), tokens, "[{ name:'adjective', type: 'string'}]", "function(row, emit) { try { x = JSON.parse(row.tokens); x.forEach(function(token) { if (token.partOfSpeech.tag === 'ADJ') { emit({ adjective: token.lemma.toLowerCase() }); } }); } catch (e) {} }" ) GROUP BY adjective ORDER BY adj_count DESC LIMIT 100
  32. 32 @SRobTweets Step 3: visualize the results Most commented adjectives

  33. 33 @SRobTweets Step 3: visualize the results Most commented subjects

  34. 34 @SRobTweets Resources • BigQuery: cloud.google.com/bigquery • Public NYC bike

    data: bit.ly/bq-bike-table • Bike data blog post: bit.ly/nyc-bike-blog • NOAA dataset: bit.ly/bq-noaa-dataset • Reddit dataset: bit.ly/bq-reddit-comments
  35. Thank You Sara Robinson @SRobTweets