Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters

Talk given at PyDiff, Cardiff, 15 Mar 2016.

Matt J Williams

March 15, 2016

Transcript

  1. PyDiff, Cardiff, 15 March 2016

    Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters
    Matt J. Williams, [email protected], http://www.mattjw.net, @voxmjw
  3. Mapping digital tech clusters: Tech Nation 2016

    • http://www.nesta.org.uk/publications/tech-nation-2016
    • http://www.techcityuk.com/technation/
    • Nesta-led project to map the state of the UK digital tech economy
    • Awesome report:
      – 27 digital tech clusters (regions)
      – Many datasets: big data, interviews, govt. stats
      – Many domains: firms, communities, skills demand, etc.
    • I contributed data collection and analytics for Github and Meetup.com
    • Data in this talk are drawn from my work on that project
    • Disclaimer: I’m not at Nesta! The project was led by the awesome folks in Nesta’s Creative and Digital Economy team.
  4. What can we learn?

    • Locations of programmers
    • Interest in programming languages
    • Impact of projects
    • Locations of communities
    • Community interests
    • Local social network
  5. This talk

    • Python → From data collection to analysis
    • Mining data from Github and Meetup
    • About 80% Github, 20% Meetup
    • Profiling digital tech clusters
    • Python tools used
  6. Mining Github

    • London = 8k+ users
    • Manchester = 2k+ users
    • Cambridge = 2k+ users
    • ...
    • How does search choose the 1,000? (Github search returns at most 1,000 results per query.)
  7. Mining Github: Sketch

    • Pipeline stages shown: User Discovery, User Profile, Repos, Forward Geocode & Filter, Repo Langs
    • Data sources shown: Github Archive, MapQuest, Github API
    • Aggregate into Geographic Regions (Digital Tech Clusters) for Analysis
  8. GithubArchive.org

    • Public archive of the public Github timeline
    • Every public event
    • Event types: push, pull req, comment, add contrib, fork [+15 more]
    • Use this to obtain a list of all Github users
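
A minimal sketch of the user discovery step, assuming the archive is read as hourly gzip files of newline-delimited JSON in which each event carries a `type` and an `actor` object with a `login` (the layout of 2015-era archive events). The function names, the file path argument, and the `CODING_EVENTS` set are illustrative assumptions, not taken from the talk.

```python
import gzip
import json

# Hypothetical choice of "coding-related" event types; the talk does not list its set.
CODING_EVENTS = {"PushEvent", "PullRequestEvent", "ForkEvent", "CreateEvent"}


def discover_users(lines, event_types=CODING_EVENTS):
    """Return the set of actor logins seen in an iterable of JSON event lines."""
    users = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than abort a long run
        if event.get("type") not in event_types:
            continue
        actor = event.get("actor")
        # Assumes the 2015+ layout where "actor" is an object with a "login".
        if isinstance(actor, dict) and actor.get("login"):
            users.add(actor["login"])
    return users


def discover_users_in_file(path):
    """Read one gzip-compressed hourly archive file (hypothetical local path)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return discover_users(handle)


# Made-up sample events for illustration.
sample = [
    '{"type": "PushEvent", "actor": {"login": "alice"}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}}',
    '{"type": "ForkEvent", "actor": {"login": "carol"}}',
    "not json",
]
print(sorted(discover_users(sample)))  # ['alice', 'carol']
```

Taking the union of the per-file sets over six months of hourly files would give the global user list that the later steps filter down.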
  12. Pop Quiz: How many users discovered for Jan-June 2015?

    • One hour of events: ~100MB uncompressed
    • Six months of events: ~0.5 TB
    • Num coding-related events: 57,788,996
    • Answer: 1.7 million users
    • ...we’ll need to obtain their profile data and forward geocode their locations!
  17. Forward Geocoding

    • “Cardiff, UK” → lat, long; structured address; confidence
    • pypi: geocoder
    • 1.7M users: global users
    • 350k users: ...with location field
    • 320k users: ...and geocoded
    • 21k users: ...and in UK
    • 18,470 users: ...and at city granularity
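
The filtering funnel above can be sketched as a single pass over user profiles. This is an illustration under assumptions: the `geocode` callable and the `country` and `granularity` keys it returns are a hypothetical interface, and the lookup table is made up. A real run would wrap a geocoding service behind the same callable.

```python
def location_funnel(profiles, geocode):
    """Count how many profiles survive each filtering stage.

    profiles: iterable of dicts with an optional "location" string.
    geocode:  callable taking a location string and returning None (no match)
              or a dict with "country" and "granularity" keys (hypothetical).
    """
    counts = {"all": 0, "with_location": 0, "geocoded": 0, "in_uk": 0, "city_level": 0}
    for profile in profiles:
        counts["all"] += 1
        location = (profile.get("location") or "").strip()
        if not location:
            continue
        counts["with_location"] += 1
        result = geocode(location)
        if result is None:
            continue
        counts["geocoded"] += 1
        if result.get("country") != "GB":
            continue
        counts["in_uk"] += 1
        if result.get("granularity") == "city":
            counts["city_level"] += 1
    return counts


# Stub geocoder for illustration only (made-up lookup table).
FAKE_GAZETTEER = {
    "Cardiff, UK": {"country": "GB", "granularity": "city"},
    "United Kingdom": {"country": "GB", "granularity": "country"},
    "Berlin": {"country": "DE", "granularity": "city"},
}

profiles = [
    {"login": "a", "location": "Cardiff, UK"},
    {"login": "b", "location": "United Kingdom"},
    {"login": "c", "location": "Berlin"},
    {"login": "d", "location": "the internet"},
    {"login": "e", "location": None},
]
funnel = location_funnel(profiles, FAKE_GAZETTEER.get)
print(funnel)
```

Passing the geocoder in as a callable keeps the counting logic testable without network access or an API key.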
  18. What’s a (geographic) tech cluster? Travel To Work Areas (TTWAs)

    • Defined and managed by the ONS
    • Inferred from UK census data: where do you live and where do you work?
    • Each TTWA represents a labour basin
    • TTWA 2001 boundaries chosen for the Tech Nation Report
  19. What’s a (geographic) tech cluster? Point in polygon lookup

    • Stockport → Manchester TTWA
    • Manchester → Manchester TTWA
    • Macclesfield → Manchester TTWA
    • Pure Python libraries for efficient boundary lookup... ...kinda suck [someone build this!]
    • pypi: pysal
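
For readers unfamiliar with point in polygon lookup, here is a minimal pure-Python sketch using the standard ray-casting test with a bounding-box pre-check. It is not the pysal-based approach from the talk, and the rectangular regions are made-up stand-ins, not real TTWA boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon?

    polygon: list of (x, y) vertices in order; the ring is closed implicitly.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lookup_region(lon, lat, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        # Cheap bounding-box rejection before the full test.
        xs = [p[0] for p in polygon]
        ys = [p[1] for p in polygon]
        if not (min(xs) <= lon <= max(xs) and min(ys) <= lat <= max(ys)):
            continue
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Made-up rectangular "boundaries", not real TTWA geometry.
regions = {
    "Region A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "Region B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}
print(lookup_region(1.0, 1.0, regions))  # Region A
print(lookup_region(3.0, 0.5, regions))  # Region B
print(lookup_region(5.0, 5.0, regions))  # None
```

The linear scan over regions is fine for around 100 boundaries; for many points, precomputing the bounding boxes or using a spatial index would cut the work per lookup.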
  20. Final dataset

    • 18,470 users
    • 89,363 repositories
    • 270 programming languages
    • 100 TTWAs
    • Limitations:
      – Bias towards open source
      – Bias in location sharing
      – Public repos only
      – Geocoder misclassification (<5% error)
  21. Top Cardiff Github users

    • Rank 1: 28 repos
    • Rank =2: 21 repos
    • Rank =2: 21 repos
    • ...
    • Rank 4: 19 repos
    • ...
    • Rank 8: 12 repos
    • Users shown: TOMCRICK, TOMASBASHAM, FLYINGSPARX, MARTINJC, ENCIMA
  26. Top Python users?

    • Rank 1: 8 repos
    • Rank 2: 7 repos
    • Rank 3: 6 repos
    • Rank =4: 5 repos (four users)
    • Users shown: MARTINJC, EVILDMP, BADDERS, FLYINGSPARX, ZEMOGLE, CHRISNORTH, GIRISHKUMARKH
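
One way such a leaderboard could be produced is to count repos per owner for a language and apply competition ranking, marking ties with "=" as the slides do. This is a sketch under assumptions: the `(owner, language)` input shape and the sample data are made up, and the talk does not say how its rankings were computed.

```python
from collections import Counter


def rank_users_by_repos(repos, language=None):
    """Rank users by repo count, marking tied ranks with a leading '='.

    repos: iterable of (owner, language) pairs (hypothetical input shape).
    Returns a list of (rank_label, owner, count), highest count first.
    """
    counts = Counter(
        owner for owner, lang in repos if language is None or lang == language
    )
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    tally = Counter(counts.values())  # how many users share each repo count
    rows = []
    for owner, count in ordered:
        # Competition ranking: 1 + number of users with strictly more repos.
        rank = 1 + sum(1 for c in counts.values() if c > count)
        label = ("=" if tally[count] > 1 else "") + str(rank)
        rows.append((label, owner, count))
    return rows


# Made-up repositories for illustration.
repos = (
    [("ann", "Python")] * 3
    + [("bob", "Python")] * 2
    + [("cat", "Python")] * 2
    + [("dan", "Python")]
    + [("ann", "Go")]
)
for row in rank_users_by_repos(repos, "Python"):
    print(row)
```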
  27. Meetup.com: groups, members, events, attendees

    • Generalise low-level topics with dimensionality reduction and clustering* (scikit-learn)
    • E.g., Data Analytics, Game Dev, etc.
    • [*Not explained here. A ‘wizard’ did it.]
  28. City-City Co-visitor Network

    • Two network figures: City-City Co-visitor Network and City-City Co-visitor Network (Normalised)
    • Cities shown: London, Brighton, Bristol, Edinburgh, Leeds, Liverpool, Oxford, Leicester, Norwich, Glasgow, Cambridge, Nottingham, Cardiff, Birmingham, Sheffield, Tyneside, Belfast
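
A minimal sketch of how a city-city co-visitor network could be built, assuming an edge weight counts the members who attended events in both cities. The Jaccard normalisation is one plausible choice and is an assumption: the talk does not state how its normalised network was computed. The attendance records are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def covisitor_network(attendances):
    """Build city-city edge weights from (member_id, city) attendance pairs.

    Weight of (a, b) = number of members who attended events in both cities.
    Returns (edges, visitors), where visitors[city] is the set of members seen there.
    """
    cities_by_member = defaultdict(set)
    visitors = defaultdict(set)
    for member, city in attendances:
        cities_by_member[member].add(city)
        visitors[city].add(member)
    edges = Counter()
    for cities in cities_by_member.values():
        # Sorting gives each undirected edge a single canonical key.
        for a, b in combinations(sorted(cities), 2):
            edges[(a, b)] += 1
    return edges, visitors


def normalise(edges, visitors):
    """Jaccard normalisation: shared visitors / visitors to either city (assumed scheme)."""
    return {
        (a, b): weight / len(visitors[a] | visitors[b])
        for (a, b), weight in edges.items()
    }


# Made-up attendance records for illustration.
attendances = [
    (1, "London"), (1, "Cardiff"),
    (2, "London"), (2, "Cardiff"), (2, "Bristol"),
    (3, "London"),
    (4, "Bristol"),
]
edges, visitors = covisitor_network(attendances)
weights = normalise(edges, visitors)
print(dict(edges))
print(weights)
```

Normalising matters because a large city such as London shares visitors with almost every other city simply by having more attendees overall.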
  29. What’s in the box?

    • Categories on the slide: Python, Software, Utils
    • Tools listed: pandas, basemap, pysal, ratelim, geocoding, rtree, matplotlib, ipython notebook, fiona, rsync, jq, twofishes, pymongo, geopy, requests, scikit-learn
  35. What does this all mean?!

    • Mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • Example of full pipeline: collection to analysis
    • Lots of untapped potential (but also lots of effort to collect data!)
    • Python is awesome for both data collection and analysis
    • Pointers for relevant Python packages
    • Vince needs to set his Github location field
  36. Thanks for listening! Any questions?

    • PyDiff, Cardiff, 15 March 2016
    • Matt J. Williams, [email protected], http://www.mattjw.net, @voxmjw
    • Special thanks to Hasan Bakhshi and Juan Mateos-Garcia at Nesta