Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters

Talk given at PyDiff. Cardiff. 15 Mar 2016.

Matt J Williams

March 15, 2016

Transcript

  1. PyDiff, Cardiff, 15 March 2016
    Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters
    Matt J. Williams, mattjw@mattjw.net, http://www.mattjw.net, @voxmjw
  3. Who the hell? Beer! A&E Game! ...Uri Geller!

  4. Who the hell?

  5. Mapping digital tech clusters
    http://www.nesta.org.uk/publications/tech-nation-2016
    http://www.techcityuk.com/technation/
    • Tech Nation 2016: Transforming UK Industries
    • Nesta-led project to map the state of the UK digital tech economy
    • Awesome report:
      – 27 digital tech clusters (regions)
      – Many datasets: big data, interviews, govt. stats
      – Many domains: firms, communities, skills demand, etc.
    • I contributed data collection and analytics for Github and Meetup.com
    • Data in this talk are drawn from my work on that project
    • Disclaimer: I’m not at Nesta! The project was led by the awesome folks in Nesta’s Creative and Digital Economy team.
  6. None
  7. What can we learn?

  8. What can we learn?
    • Locations of programmers
    • Interest in programming languages
    • Impact of projects
    • Locations of communities
    • Community interests
    • Local social network
  9. This talk
    • Python → From data collection to analysis
    • Mining data from Github and Meetup
    • About 80% Github, 20% Meetup
    • Profiling digital tech clusters
    • Python tools used
  10. Mining Github

  11. Mining Github

  12. Mining Github

  13. Mining Github

  14. Mining Github: Job Done?

  15. Mining Github
    London = 8k+ users
    Manchester = 2k+ users
    Cambridge = 2k+ users
    ...
    How does search choose the 1,000?
  16. Mining Github: Sketch
    Stages: User Discovery, User Profile, Repos, Forward Geocode & Filter, Repo Langs
    Data sources: Github Archive, MapQuest, Github API
    Output: Aggregate into Geographic Regions (Digital Tech Clusters) for Analysis
  17. GithubArchive.org
    • Public archive of the public Github timeline
    • Every public event. Event types: push, pull req, comment, add contrib, fork [+15 more]
    • Use this to obtain a list of all Github users
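
A minimal sketch of the user-discovery step, assuming each archive file is gzip-compressed, newline-delimited JSON in which every event carries an `actor` object with a `login` field (the Events API layout the archive has used since January 2015). The function name `discover_users` and the sample events are illustrative and not taken from the talk.

```python
import gzip
import io
import json


def discover_users(fileobj):
    """Return the set of actor logins seen in one gzipped JSON-lines archive."""
    users = set()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            actor = event.get("actor")
            # Assumption: "actor" is an object with a "login" key.
            if isinstance(actor, dict) and actor.get("login"):
                users.add(actor["login"])
    return users


def make_sample_archive():
    """Build a tiny in-memory archive with made-up events for demonstration."""
    events = [
        {"type": "PushEvent", "actor": {"login": "alice"}},
        {"type": "ForkEvent", "actor": {"login": "bob"}},
        {"type": "PushEvent", "actor": {"login": "alice"}},
    ]
    raw = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return io.BytesIO(gzip.compress(raw))


print(sorted(discover_users(make_sample_archive())))
```

Taking the union of the sets returned for each hourly file would give the full list of users for a period.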
  18. Pop Quiz: How many users discovered for Jan-June 2015?
    ...we’ll need to obtain their profile data and forward geocode their locations!
  19. Pop Quiz: How many users discovered for Jan-June 2015?
    One hour of events: ~100MB uncompressed
    Six months of events: ~0.5 TB
    Num coding-related events: 57,788,996
    ...we’ll need to obtain their profile data and forward geocode their locations!
  20. Pop Quiz: How many users discovered for Jan-June 2015?
    One hour of events: ~100MB uncompressed
    Six months of events: ~0.5 TB
    Num coding-related events: 57,788,996
    Answer: 1.7 million users
    ...we’ll need to obtain their profile data and forward geocode their locations!
  22. Forward Geocoding: “Cardiff, UK” → lat, long; structured address; confidence (pypi: geocoder)
    Global users: 1.7M → ...with location field → ...and geocoded → ...and in UK → ...and at city granularity
  23. Forward Geocoding: “Cardiff, UK” → lat, long; structured address; confidence (pypi: geocoder)
    Global users: 1.7M → ...with location field: 350k → ...and geocoded → ...and in UK → ...and at city granularity
  24. Forward Geocoding: “Cardiff, UK” → lat, long; structured address; confidence (pypi: geocoder)
    Global users: 1.7M → ...with location field: 350k → ...and geocoded: 320k → ...and in UK → ...and at city granularity
  25. Forward Geocoding: “Cardiff, UK” → lat, long; structured address; confidence (pypi: geocoder)
    Global users: 1.7M → ...with location field: 350k → ...and geocoded: 320k → ...and in UK: 21k → ...and at city granularity
  26. Forward Geocoding: “Cardiff, UK” → lat, long; structured address; confidence (pypi: geocoder)
    Global users: 1.7M → ...with location field: 350k → ...and geocoded: 320k → ...and in UK: 21k → ...and at city granularity: 18,470
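
The filtering funnel on these slides could be sketched as below. The record layout is an assumption made for illustration: each user is a dict with a free-text `location` and a `geocoded` result holding hypothetical `country` and `granularity` keys. It stands in for whatever the geocoder actually returned and is not the `geocoder` package's own data model.

```python
def geocode_funnel(users):
    """Count how many users survive each successive filter."""
    stages = [
        ("global users", lambda u: True),
        ("with location field", lambda u: bool(u.get("location"))),
        ("geocoded", lambda u: u.get("geocoded") is not None),
        ("in UK", lambda u: u["geocoded"]["country"] == "GB"),
        ("city granularity", lambda u: u["geocoded"]["granularity"] == "city"),
    ]
    counts = []
    remaining = list(users)
    for name, keep in stages:
        # Each filter only sees users that passed all earlier filters.
        remaining = [u for u in remaining if keep(u)]
        counts.append((name, len(remaining)))
    return counts


# Made-up records, one dropping out at each stage.
SAMPLE_USERS = [
    {"login": "a", "location": "Cardiff, UK",
     "geocoded": {"country": "GB", "granularity": "city"}},
    {"login": "b", "location": "UK",
     "geocoded": {"country": "GB", "granularity": "country"}},
    {"login": "c", "location": "Paris",
     "geocoded": {"country": "FR", "granularity": "city"}},
    {"login": "d", "location": "the internet", "geocoded": None},
    {"login": "e", "location": ""},
]

for stage, count in geocode_funnel(SAMPLE_USERS):
    print(f"{stage}: {count}")
```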
  27. Profile and Repo Data: loads of queries (time!) needed
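
A back-of-envelope sketch of why this takes time, assuming one profile request per discovered user and GitHub's documented limit of 5,000 authenticated API requests per hour per token. The talk does not say how many tokens or requests per user were used, so both are parameters here, and the result is only a lower bound.

```python
def hours_needed(num_requests, requests_per_hour=5000, tokens=1):
    """Lower bound on wall-clock hours to issue num_requests under a rate limit."""
    return num_requests / (requests_per_hour * tokens)


# Roughly 1.7 million users were discovered; one profile request each.
profile_hours = hours_needed(1_700_000)
print(f"{profile_hours:.0f} hours, about {profile_hours / 24:.0f} days")
```

Repository and language queries would add to this figure.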

  28. Profile and Repo Data

  29. What’s a (geographic) tech cluster? Travel To Work Areas (TTWAs)
    • Defined and managed by the ONS
    • Inferred from UK census data: where do you live and where do you work?
    • Each TTWA represents a labour basin
    • TTWA 2001 boundaries chosen for the Tech Nation Report
  30. What’s a (geographic) tech cluster?

  31. What’s a (geographic) tech cluster? Point in polygon lookup
    Stockport → Manchester TTWA
    Manchester → Manchester TTWA
    Macclesfield → Manchester TTWA
    Pure Python libraries for efficient boundary lookup... ...kinda suck [someone build this!]
    pypi: pysal
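
For illustration, a pure-Python point-in-polygon lookup can be written with the standard ray-casting test. This is a simple linear scan, not the efficient indexed lookup the slide asks for and not the approach `pysal` takes. The regions below are made-up rectangles and do not correspond to real TTWA boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lookup_region(x, y, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Hypothetical toy regions (x could be longitude, y latitude).
TOY_REGIONS = {
    "Region A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Region B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}

print(lookup_region(1, 1, TOY_REGIONS))
print(lookup_region(5, 2, TOY_REGIONS))
```

With many regions and many points, a spatial index over bounding boxes (such as the `rtree` package listed later in the talk) would avoid testing every polygon for every point.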
  32. Final dataset
    • 18,470 users
    • 89,363 repositories
    • 270 programming languages
    • 100 TTWAs
    • Limitations:
      – Bias towards open source
      – Bias in location sharing
      – Public repos only
      – Geocoder misclassification (<5% error)
  33. None
  34. None
  35. None
  36. *we only count authors of original (non-forked) repos

  37. Top Cardiff Github users
    Rank (# Repos): 1 (28), =2 (21), =2 (21), ..., 4 (19), ..., 8 (12)
    Users named: TOMCRICK, TOMASBASHAM, FLYINGSPARX, MARTINJC, ENCIMA
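
A ranking of this kind, counting only original (non-forked) repos and marking ties with `=`, might be computed as in the sketch below. The simplified repo records with `owner` and `fork` keys, and the user names, are hypothetical. Real Github API responses are richer.

```python
from collections import Counter


def rank_users(repos):
    """Rank owners by number of original (non-forked) repos, marking ties with '='."""
    counts = Counter(r["owner"] for r in repos if not r["fork"])
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    tally = Counter(counts.values())  # how many users share each repo count
    table = []
    for owner, n in ordered:
        # Competition ranking: 1 + number of users with strictly more repos.
        rank = 1 + sum(1 for other in counts.values() if other > n)
        label = f"={rank}" if tally[n] > 1 else str(rank)
        table.append((label, owner, n))
    return table


# Made-up data: user_a has three original repos and one fork.
SAMPLE_REPOS = (
    [{"owner": "user_a", "fork": False}] * 3
    + [{"owner": "user_a", "fork": True}]
    + [{"owner": "user_b", "fork": False}] * 2
    + [{"owner": "user_c", "fork": False}] * 2
    + [{"owner": "user_d", "fork": False}]
)

for label, owner, n in rank_users(SAMPLE_REPOS):
    print(label, owner, n)
```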
  38. Top Python users?
    Rank (# Repos): 1 (8), 2 (7), 3 (6), =4 (5), =4 (5), =4 (5), =4 (5)
  39. Top Python users?
    Rank (# Repos): 1 (8), 2 (7), 3 (6), =4 (5), =4 (5), =4 (5), =4 (5)
    Users named: FLYINGSPARX, ZEMOGLE, CHRISNORTH, GIRISHKUMARKH
  40. Top Python users?
    Rank (# Repos): 1 (8), 2 (7), 3 (6), =4 (5), =4 (5), =4 (5), =4 (5)
    Users named: BADDERS, FLYINGSPARX, ZEMOGLE, CHRISNORTH, GIRISHKUMARKH
  41. Top Python users?
    Rank (# Repos): 1 (8), 2 (7), 3 (6), =4 (5), =4 (5), =4 (5), =4 (5)
    Users named: EVILDMP, BADDERS, FLYINGSPARX, ZEMOGLE, CHRISNORTH, GIRISHKUMARKH
  42. Top Python users?
    Rank (# Repos): 1 (8), 2 (7), 3 (6), =4 (5), =4 (5), =4 (5), =4 (5)
    Users named: MARTINJC, EVILDMP, BADDERS, FLYINGSPARX, ZEMOGLE, CHRISNORTH, GIRISHKUMARKH
  43. None
  44. None
  45. groups members events attendees

  46. groups members events attendees

  47. groups members events attendees

  48. groups members events attendees

  49. groups members events attendees

  50. groups members events attendees
    Generalise low-level topics with dimensionality reduction and clustering* (scikit-learn)
    E.g., Data Analytics, Game Dev, etc.
    [*Not explained here. A ‘wizard’ did it.]
  51. pypi: basemap

  52. None
  53. City-City Co-visitor Network; City-City Co-visitor Network (Normalised)
    Cities: London, Brighton, Bristol, Edinburgh, Leeds, Liverpool, Oxford, Leicester, Norwich, Glasgow, Cambridge, Nottingham, Cardiff, Birmingham, Sheffield, Tyneside, Belfast
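
One way such a network could be built is sketched below, under two assumptions that the slides do not confirm: an edge between two cities is weighted by the number of Meetup members seen at events in both, and the normalised version divides by the number of members seen in either city (the Jaccard index). All names and data are made up.

```python
from collections import Counter
from itertools import combinations


def covisitor_network(attendance):
    """attendance: iterable of (member, city) pairs.

    Returns (raw, normalised) dicts keyed by alphabetically ordered city pairs.
    """
    cities_by_member = {}
    for member, city in attendance:
        cities_by_member.setdefault(member, set()).add(city)

    visitors = Counter()  # city -> number of distinct visitors
    shared = Counter()    # (city_a, city_b) -> members seen in both
    for cities in cities_by_member.values():
        visitors.update(cities)
        for pair in combinations(sorted(cities), 2):
            shared[pair] += 1

    # Assumed normalisation: Jaccard index of the two visitor sets.
    normalised = {
        (a, b): n / (visitors[a] + visitors[b] - n)
        for (a, b), n in shared.items()
    }
    return dict(shared), normalised


SAMPLE_ATTENDANCE = [
    ("m1", "Cardiff"), ("m1", "Bristol"),
    ("m2", "Cardiff"), ("m2", "Bristol"), ("m2", "London"),
    ("m3", "London"),
    ("m4", "Cardiff"),
]

raw, norm = covisitor_network(SAMPLE_ATTENDANCE)
for pair in sorted(raw):
    print(pair, raw[pair], round(norm[pair], 3))
```

Normalising matters because a large city such as London shares many visitors with every other city simply by having more members overall.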
  54. What’s in the box? (Python, Software, Utils)
    pandas, basemap, pysal, ratelim, geocoding, rtree, matplotlib, ipython notebook, fiona, rsync, jq, twofishes, pymongo, geopy, requests, scikit-learn
  55. What does this all mean?!

  56. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
  57. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • example of full pipeline: collection to analysis
  58. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • lots of untapped potential (but also lots of effort to collect data!)
    • example of full pipeline: collection to analysis
  59. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • python is awesome for both data collection and analysis
    • lots of untapped potential (but also lots of effort to collect data!)
    • example of full pipeline: collection to analysis
  60. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • pointers for relevant Python packages
    • python is awesome for both data collection and analysis
    • lots of untapped potential (but also lots of effort to collect data!)
    • example of full pipeline: collection to analysis
  61. What does this all mean?!
    • mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    • pointers for relevant Python packages
    • python is awesome for both data collection and analysis
    • Vince needs to set his Github location field
    • lots of untapped potential (but also lots of effort to collect data!)
    • example of full pipeline: collection to analysis
  62. PyDiff, Cardiff, 15 March 2016. Thanks for listening! Any questions?
    Matt J. Williams, mattjw@mattjw.net, http://www.mattjw.net, @voxmjw
    Special thanks to Hasan Bakhshi and Juan Mateos-Garcia at Nesta