PyDiff, Cardiff 15 March 2016 Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters Matt J. Williams [email protected] http://www.mattjw.net @voxmjw
Mapping digital tech clusters • Tech Nation 2016 (http://www.techcityuk.com/technation/, http://www.nesta.org.uk/publications/tech-nation-2016) • Nesta-led project to map the state of the UK digital tech economy • Awesome report: – 27 digital tech clusters (regions) – Many datasets: big data, interviews, govt. stats – Many domains: firms, communities, skills demand, etc. • I contributed data collection and analytics for Github and Meetup.com • Data in this talk are drawn from my work on that project • Disclaimer: I’m not at Nesta! The project was led by the awesome folks in Nesta’s Creative and Digital Economy team.
What can we learn? • Locations of programmers • Interest in programming languages • Impact of projects • Locations of communities • Community interests • Local social network
This talk • Python → From data collection to analysis • Mining data from Github and Meetup • About 80% Github, 20% Meetup • Profiling digital tech clusters • Python tools used
Mining Github: Sketch • Steps: User Discovery, User Profile, Repos, Forward Geocode & Filter, Repo Langs • Data sources: Github Archive, MapQuest, Github API • Finally: Aggregate into Geographic Regions (Digital Tech Clusters) for Analysis
GithubArchive.org • Public archive of the public Github timeline • Every public event Event types: push, pull req, comment, add contrib, fork [+15 more] • Use this to obtain a list of all Github users
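A minimal sketch of the user-discovery step, assuming the hourly archive files are gzip-compressed, newline-delimited JSON and that each event stores its user under `actor.login` (the 2015-era layout). The set of "coding-related" event types is not given in the talk, so `CODING_EVENT_TYPES` is illustrative only, and `discover_users` and `make_sample_archive` are hypothetical names.

```python
import gzip
import io
import json

# Assumption: the talk does not list which event types count as
# "coding-related"; this set is illustrative only.
CODING_EVENT_TYPES = {"PushEvent", "PullRequestEvent", "ForkEvent", "CreateEvent"}


def discover_users(fileobj, event_types=CODING_EVENT_TYPES):
    """Return the set of actor logins found in one hourly archive file.

    `fileobj` is a binary file-like object holding gzip-compressed,
    newline-delimited JSON (one event per line).
    """
    users = set()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") not in event_types:
                continue
            # Assumed layout: {"actor": {"login": "..."}}.
            actor = event.get("actor")
            if isinstance(actor, dict) and actor.get("login"):
                users.add(actor["login"])
    return users


def make_sample_archive():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    events = [
        {"type": "PushEvent", "actor": {"login": "alice"}},
        {"type": "WatchEvent", "actor": {"login": "bob"}},
        {"type": "ForkEvent", "actor": {"login": "carol"}},
        {"type": "PushEvent", "actor": {"login": "alice"}},
    ]
    raw = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return io.BytesIO(gzip.compress(raw))


if __name__ == "__main__":
    print(sorted(discover_users(make_sample_archive())))
```

Running the same function over every hourly file and taking the union of the returned sets gives the full user list; streaming line by line keeps memory use flat even when an hour is ~100MB uncompressed.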
One hour of events: ~100MB uncompressed • Six months of events: ~0.5 TB • Num coding-related events: 57,788,996 • Pop Quiz: How many users discovered for Jan-June 2015? 1.7 million users ...we’ll need to obtain their profile data and forward geocode their locations!
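To give a feel for what fetching 1.7 million profiles involves, here is a rough sketch. It assumes one request per user to the Github API's user endpoint and the standard authenticated limit of 5,000 requests per hour. The talk does not say how the crawl was actually scheduled, and the helper names `profile_url`, `extract_location` and `crawl_hours` are hypothetical.

```python
import math

GITHUB_USER_URL = "https://api.github.com/users/{login}"


def profile_url(login):
    """URL of the public profile document for one user."""
    return GITHUB_USER_URL.format(login=login)


def extract_location(profile):
    """Return the free-text location from a profile dict, or None if unset."""
    location = (profile.get("location") or "").strip()
    return location or None


def crawl_hours(n_users, requests_per_hour=5000):
    """Whole hours needed at one request per user under a fixed hourly limit."""
    return math.ceil(n_users / requests_per_hour)


if __name__ == "__main__":
    hours = crawl_hours(1_700_000)
    print(profile_url("octocat"))
    print(hours, "hours, about", round(hours / 24, 1), "days")
```

Under these assumptions the profile crawl alone takes about two weeks with a single token, which is part of why the data collection effort is substantial.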
Forward Geocoding • “Cardiff, UK” → lat, long; structured address; confidence • Funnel: 1.7M global users → 350k users with location field → 320k users geocoded → 21k users in UK → ...and at city granularity • pypi: geocoder
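The filtering funnel above can be sketched as a single pass over the profiles. This is an illustration, not the talk's code: the geocoder is passed in as a plain callable so the sketch runs offline, and the `GeoResult` record, its field names, the `"GB"` country code and the `CITY_LEVEL` quality labels are all assumptions. In the real pipeline the callable would wrap a service such as the one behind the `geocoder` package.

```python
from collections import namedtuple

# Hypothetical result record; real geocoding services return richer objects.
GeoResult = namedtuple("GeoResult", "lat lng country quality")

# Assumed quality labels that count as "city granularity" or finer.
CITY_LEVEL = {"CITY", "NEIGHBORHOOD", "STREET", "ADDRESS"}


def geocode_funnel(profiles, geocode):
    """Filter users down to those geocoded to a UK city (or finer).

    profiles: dict mapping login -> free-text location (or None).
    geocode:  callable taking a location string, returning GeoResult or None.
    Returns (kept, counts): login -> (lat, lng), and users left at each stage.
    """
    counts = {"all": 0, "with_location": 0, "geocoded": 0,
              "in_uk": 0, "city_level": 0}
    kept = {}
    for login, location in profiles.items():
        counts["all"] += 1
        if not location:
            continue
        counts["with_location"] += 1
        result = geocode(location)
        if result is None:
            continue
        counts["geocoded"] += 1
        if result.country != "GB":
            continue
        counts["in_uk"] += 1
        if result.quality not in CITY_LEVEL:
            continue
        counts["city_level"] += 1
        kept[login] = (result.lat, result.lng)
    return kept, counts


def stub_geocode(location):
    """Offline stand-in for a geocoding service (made-up lookup table)."""
    table = {
        "Cardiff, UK": GeoResult(51.48, -3.18, "GB", "CITY"),
        "UK": GeoResult(54.0, -2.0, "GB", "COUNTRY"),
        "Berlin": GeoResult(52.52, 13.40, "DE", "CITY"),
    }
    return table.get(location)


if __name__ == "__main__":
    sample = {"a": "Cardiff, UK", "b": "UK", "c": "Berlin",
              "d": None, "e": "the internet"}
    print(geocode_funnel(sample, stub_geocode))
```

Keeping a count at every stage makes it easy to report how many users each filter removes, as the funnel on the slide does.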
What’s a (geographic) tech cluster? • Travel To Work Areas (TTWAs) • Defined and managed by the ONS • Inferred from UK census data – where do you live and where do you work? • Each TTWA represents a labour basin • TTWA 2001 boundaries chosen for the Tech Nation Report
What’s a (geographic) tech cluster? • Point in polygon lookup • Stockport → Manchester TTWA • Manchester → Manchester TTWA • Macclesfield → Manchester TTWA • Pure Python libraries for efficient boundary lookup... ...kinda suck [someone build this!] • pypi: pysal
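For readers unfamiliar with point-in-polygon lookup, here is a minimal pure-Python sketch using ray casting with a bounding-box pre-check. It is not the approach used in the talk (which points to pysal), it treats coordinates as planar (x = longitude, y = latitude), and the two square regions are made-up stand-ins for real TTWA boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: cast a ray in the +x direction and count edge crossings.

    `polygon` is a list of (x, y) vertices; the last vertex is joined
    back to the first. An odd number of crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def lookup_region(x, y, regions):
    """Return the name of the first region containing (x, y), else None.

    `regions` maps region name -> list of (x, y) vertices.
    """
    for name, polygon in regions.items():
        min_x, min_y, max_x, max_y = bounding_box(polygon)
        # Cheap rejection before the full crossing test.
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Made-up square regions, for illustration only.
TOY_REGIONS = {
    "west": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "east": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

if __name__ == "__main__":
    print(lookup_region(1, 1, TOY_REGIONS), lookup_region(5, 1, TOY_REGIONS))
```

This scans every region per lookup and recomputes each bounding box, which is fine for a demo but slow for tens of thousands of users against detailed boundaries; precomputed boxes or a spatial index would be the usual next step.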
Final dataset • 18,470 users • 89,363 repositories • 270 programming languages • 100 TTWAs • Limitations: Bias towards open source Bias in location sharing Public repos only Geocoder misclassification (<5% error)
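Once each user has a TTWA, per-cluster figures such as user, repository and language counts come from a simple aggregation. The sketch below assumes one primary language per repository and uses hypothetical input structures; the talk does not say how languages were counted, and the sample data are invented.

```python
from collections import Counter, defaultdict


def aggregate_by_region(user_region, user_repo_langs):
    """Summarise users, repos and languages per region.

    user_region:     dict login -> region name (e.g. a TTWA).
    user_repo_langs: dict login -> list with one primary language per repo
                     (None where no language was detected).
    Returns dict region -> {"users": int, "repos": int, "languages": Counter}.
    """
    summary = defaultdict(
        lambda: {"users": 0, "repos": 0, "languages": Counter()}
    )
    for login, region in user_region.items():
        entry = summary[region]
        entry["users"] += 1
        langs = user_repo_langs.get(login, [])
        entry["repos"] += len(langs)
        # Repos without a detected language count as repos but not languages.
        entry["languages"].update(lang for lang in langs if lang)
    return dict(summary)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    regions = {"a": "Cardiff", "b": "Cardiff", "c": "Manchester"}
    langs = {"a": ["Python", "Python", None], "b": ["R"], "c": ["JavaScript"]}
    for name, stats in aggregate_by_region(regions, langs).items():
        print(name, stats["users"], stats["repos"],
              stats["languages"].most_common(1))
```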
Meetup data: groups, members, events, attendees • Generalise low-level topics with dimensionality reduction and clustering* • E.g., Data Analytics, Game Dev, etc. [*Not explained here. A ‘wizard’ did it.] • scikit-learn
What does this all mean?! • mining platforms for real-time view of tech hubs? [see Tech Nation report!] • pointers for relevant Python packages • python is awesome for both data collection and analysis • Vince needs to set his Github location field • lots of untapped potential (but also lots of effort to collect data!) • example of full pipeline: collection to analysis
PyDiff, Cardiff 15 March 2016 Thanks for listening! Matt J. Williams [email protected] http://www.mattjw.net @voxmjw Special thanks to Hasan Bakhshi and Juan Mateos-Garcia at Nesta Any questions?