Slide 1

Slide 1 text

PyDiff, Cardiff 15 March 2016
Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters
Matt J. Williams
http://www.mattjw.net
@voxmjw

Slide 3

Slide 3 text

Who the hell? Beer! A&E Game! ...Uri Geller!

Slide 4

Slide 4 text

Who the hell?

Slide 5

Slide 5 text

Mapping digital tech clusters
http://www.nesta.org.uk/publications/tech-nation-2016
http://www.techcityuk.com/technation/
TRANSFORMING UK INDUSTRIES
• Tech Nation 2016
• Nesta-led project to map the state of the UK digital tech economy
• Awesome report:
  – 27 digital tech clusters (regions)
  – Many datasets: big data, interviews, govt. stats
  – Many domains: firms, communities, skills demand, etc.
• I contributed data collection and analytics for Github and Meetup.com
• Data in this talk are drawn from my work on that project
• Disclaimer: I’m not at Nesta! The project was led by the awesome folks in Nesta’s Creative and Digital Economy team.

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

What can we learn?

Slide 8

Slide 8 text

What can we learn?
• Locations of programmers
• Interest in programming languages
• Impact of projects
• Locations of communities
• Community interests
• Local social network

Slide 9

Slide 9 text

This talk
• Python → From data collection to analysis
• Mining data from Github and Meetup
• About 80% Github, 20% Meetup
• Profiling digital tech clusters
• Python tools used

Slide 10

Slide 10 text

Mining Github

Slide 11

Slide 11 text

Mining Github

Slide 12

Slide 12 text

Mining Github

Slide 13

Slide 13 text

Mining Github

Slide 14

Slide 14 text

Mining Github Job Done?

Slide 15

Slide 15 text

Mining Github
London = 8k+ users
Manchester = 2k+ users
Cambridge = 2k+ users
...
How does search choose the 1,000?

Slide 16

Slide 16 text

Mining Github: Sketch
Stages: User Discovery; User Profile; Repos; Forward Geocode & Filter; Repo Langs
Data sources: Github Archive; MapQuest; Github API
Aggregate into Geographic Regions (Digital Tech Clusters) for Analysis

Slide 17

Slide 17 text

GithubArchive.org
• Public archive of the public Github timeline
• Every public event
  Event types: push, pull req, comment, add contrib, fork [+15 more]
• Use this to obtain a list of all Github users
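The user-discovery step can be sketched as follows. This is a minimal illustration, not the code used for the talk: it assumes each hourly archive file is gzip-compressed with one JSON event per line, that each event carries a `type` string and an `actor` object with a `login` field, and that the set of event types counted as "coding-related" is the hypothetical `CODING_EVENTS` below.

```python
import gzip
import json

# Assumed subset of event type names treated as "coding-related" (illustrative only).
CODING_EVENTS = {"PushEvent", "PullRequestEvent", "ForkEvent", "CreateEvent"}


def collect_logins(lines, event_types=CODING_EVENTS):
    """Return the set of actor logins seen in an iterable of JSON event lines."""
    logins = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than abort the whole file
        if not isinstance(event, dict) or event.get("type") not in event_types:
            continue
        actor = event.get("actor")
        # Assumes "actor" is an object with a "login" field.
        if isinstance(actor, dict) and actor.get("login"):
            logins.add(actor["login"])
    return logins


def collect_logins_from_file(path, event_types=CODING_EVENTS):
    """Read one gzip-compressed hourly dump (one JSON event per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return collect_logins(handle, event_types)


# Toy events standing in for one hourly file.
sample = [
    '{"type": "PushEvent", "actor": {"login": "alice"}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}}',
    '{"type": "ForkEvent", "actor": {"login": "carol"}}',
    '{"type": "PushEvent", "actor": {"login": "alice"}}',
    "not json",
]
print(sorted(collect_logins(sample)))  # ['alice', 'carol']
```

Taking the union of the per-file sets across all hourly files gives the deduplicated user list.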

Slide 18

Slide 18 text

Pop Quiz
How many users discovered for Jan-June 2015?
...we’ll need to obtain their profile data and forward geocode their locations!

Slide 19

Slide 19 text

Pop Quiz
How many users discovered for Jan-June 2015?
One hour of events: ~100MB uncompressed
Six months of events: ~0.5 TB
Num coding-related events: 57,788,996
...we’ll need to obtain their profile data and forward geocode their locations!

Slide 20

Slide 20 text

Pop Quiz
How many users discovered for Jan-June 2015?
One hour of events: ~100MB uncompressed
Six months of events: ~0.5 TB
Num coding-related events: 57,788,996
1.7 million users
...we’ll need to obtain their profile data and forward geocode their locations!

Slide 22

Slide 22 text

Forward Geocoding
“Cardiff, UK” → lat, long; structured address; confidence
global users: 1.7M users
...with location field
...and geocoded
...and in UK
...and at city granularity
pypi: geocoder

Slide 23

Slide 23 text

Forward Geocoding
“Cardiff, UK” → lat, long; structured address; confidence
global users: 1.7M users
...with location field: 350k users
...and geocoded
...and in UK
...and at city granularity
pypi: geocoder

Slide 24

Slide 24 text

Forward Geocoding
“Cardiff, UK” → lat, long; structured address; confidence
global users: 1.7M users
...with location field: 350k users
...and geocoded: 320k users
...and in UK
...and at city granularity
pypi: geocoder

Slide 25

Slide 25 text

Forward Geocoding
“Cardiff, UK” → lat, long; structured address; confidence
global users: 1.7M users
...with location field: 350k users
...and geocoded: 320k users
...and in UK: 21k users
...and at city granularity
pypi: geocoder

Slide 26

Slide 26 text

Forward Geocoding
“Cardiff, UK” → lat, long; structured address; confidence
global users: 1.7M users
...with location field: 350k users
...and geocoded: 320k users
...and in UK: 21k users
...and at city granularity: 18,470 users
pypi: geocoder
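The filtering funnel on this slide can be sketched as a chain of checks with a counter at each stage. Everything here is illustrative: the `geocode` callable, the `country` and `granularity` keys, and the toy gazetteer are hypothetical stand-ins. In a real pipeline the callable would wrap a geocoding service (the talk names the `geocoder` package and MapQuest), whose result fields may differ.

```python
def filter_uk_city_users(users, geocode):
    """Apply the funnel: has location -> geocoded -> in UK -> city granularity.

    `users` is an iterable of dicts with "login" and an optional "location".
    `geocode` maps a location string to a dict with "country" and
    "granularity" keys, or None on failure (hypothetical result shape).
    """
    counts = {"all": 0, "with_location": 0, "geocoded": 0, "uk": 0, "city": 0}
    kept = []
    for user in users:
        counts["all"] += 1
        location = (user.get("location") or "").strip()
        if not location:
            continue
        counts["with_location"] += 1
        result = geocode(location)
        if result is None:
            continue
        counts["geocoded"] += 1
        if result.get("country") != "GB":
            continue
        counts["uk"] += 1
        if result.get("granularity") != "city":
            continue
        counts["city"] += 1
        kept.append({**user, "geo": result})
    return kept, counts


# Toy gazetteer standing in for a real geocoding service.
FAKE_GAZETTEER = {
    "Cardiff, UK": {"country": "GB", "granularity": "city"},
    "United Kingdom": {"country": "GB", "granularity": "country"},
    "Paris": {"country": "FR", "granularity": "city"},
}

toy_users = [
    {"login": "a", "location": "Cardiff, UK"},
    {"login": "b", "location": None},
    {"login": "c", "location": "United Kingdom"},
    {"login": "d", "location": "Paris"},
    {"login": "e", "location": "the moon"},
]

kept_users, funnel = filter_uk_city_users(toy_users, FAKE_GAZETTEER.get)
print([u["login"] for u in kept_users])  # ['a']
print(funnel)
```

Keeping a count per stage makes it easy to report how many users each filter discards, as the slide does.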

Slide 27

Slide 27 text

Profile and Repo Data loads of queries (time!) needed

Slide 28

Slide 28 text

Profile and Repo Data
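To make the time cost concrete, here is a minimal sketch of a request throttle plus a back-of-the-envelope estimate. The quota of 5,000 requests per hour and the figure of one request per user are assumptions for illustration; the `Throttle` class is a hypothetical helper, not the `ratelim` package listed among the tools.

```python
import time


class Throttle:
    """Space out calls so that at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def estimated_hours(num_requests, max_calls_per_hour):
    """Lower bound on wall-clock hours needed at a fixed hourly quota."""
    return num_requests / max_calls_per_hour


# One profile request for each of 1.7 million users at an assumed 5,000 requests/hour.
print(estimated_hours(1_700_000, 5000))  # 340.0
```

Calling `throttle.wait()` before each API request keeps a long-running crawl under the quota; at the assumed rate, profile lookups alone take roughly two weeks.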

Slide 29

Slide 29 text

What’s a (geographic) tech cluster?
Travel To Work Areas (TTWAs)
Defined and managed by the ONS
Inferred from UK census data – where do you live and where do you work?
Each TTWA represents a labour basin
TTWA 2001 boundaries chosen for the Tech Nation Report

Slide 30

Slide 30 text

What’s a (geographic) tech cluster?

Slide 31

Slide 31 text

What’s a (geographic) tech cluster?
Point in polygon lookup
Stockport → Manchester TTWA
Manchester → Manchester TTWA
Macclesfield → Manchester TTWA
Pure Python libraries for efficient boundary lookup... ...kinda suck [someone build this!]
pypi: pysal
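A point-in-polygon lookup can be sketched in pure Python with the ray-casting test. This is a generic illustration rather than the pysal-based code from the talk; the two square "regions" are made-up toy shapes, not real TTWA boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lookup_region(x, y, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Toy regions: two non-overlapping squares (hypothetical, not TTWA data).
toy_regions = {
    "Region A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Region B": [(5, 0), (9, 0), (9, 4), (5, 4)],
}

print(lookup_region(1, 1, toy_regions))    # Region A
print(lookup_region(6, 2, toy_regions))    # Region B
print(lookup_region(10, 10, toy_regions))  # None
```

The linear scan over regions is what makes a pure-Python lookup slow for many points. A spatial index over the polygon bounding boxes (for example rtree, which appears in the tool list) would narrow the candidates before running the exact test.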

Slide 32

Slide 32 text

Final dataset
• 18,470 users
• 89,363 repositories
• 270 programming languages
• 100 TTWAs
• Limitations:
  – Bias towards open source
  – Bias in location sharing
  – Public repos only
  – Geocoder misclassification (<5% error)

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

*we only count authors of original (non-forked) repos

Slide 37

Slide 37 text

Top Cardiff Github users
Rank | # Repos
1 | 28
=2 | 21
=2 | 21
... | ...
4 | 19
... | ...
8 | 12
Users shown: TOMCRICK, TOMASBASHAM, FLYINGSPARX, MARTINJC, ENCIMA
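A ranking of this kind, counting only original (non-forked) repos and showing ties as "=2", can be sketched as below. The repo records are a simplified, hypothetical shape (an `owner` string and a `fork` flag) and the user names are invented; this is not the talk's actual code or data.

```python
from collections import Counter


def count_original_repos(repos):
    """Count non-forked repos per owner."""
    counts = Counter()
    for repo in repos:
        if not repo.get("fork", False):
            counts[repo["owner"]] += 1
    return counts


def rank_with_ties(counts):
    """Competition ranking (1, =2, =2, 4, ...) ordered by descending count."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    tally = Counter(counts.values())  # how many users share each count
    rows = []
    rank = 0
    previous = None
    for position, (user, n) in enumerate(ordered, start=1):
        if n != previous:
            rank = position  # a new count starts at its list position
            previous = n
        label = f"={rank}" if tally[n] > 1 else str(rank)
        rows.append((label, user, n))
    return rows


# Toy data: user_a has 3 originals and 1 fork, user_b and user_c 2 each, user_d 1.
toy_repos = (
    [{"owner": "user_a", "fork": False}] * 3
    + [{"owner": "user_a", "fork": True}]
    + [{"owner": "user_b", "fork": False}] * 2
    + [{"owner": "user_c", "fork": False}] * 2
    + [{"owner": "user_d", "fork": False}]
)

for row in rank_with_ties(count_original_repos(toy_repos)):
    print(row)
```

With the toy data, the two users tied on two repos share rank "=2" and the next user is ranked 4, the same convention as the table on the slide.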

Slide 38

Slide 38 text

Top Python users? Rank # Repos 1 8 2 7 3 6 =4 5 =4 5 =4 5 =4 5

Slide 39

Slide 39 text

Top Python users? Rank # Repos 1 8 2 7 3 6 =4 5 =4 5 =4 5 =4 5 FLYINGSPARX ZEMOGLE CHRISNORTH GIRISHKUMARKH

Slide 40

Slide 40 text

Top Python users? Rank # Repos 1 8 2 7 3 6 =4 5 =4 5 =4 5 =4 5 BADDERS FLYINGSPARX ZEMOGLE CHRISNORTH GIRISHKUMARKH

Slide 41

Slide 41 text

Top Python users? Rank # Repos 1 8 2 7 3 6 =4 5 =4 5 =4 5 =4 5 EVILDMP BADDERS FLYINGSPARX ZEMOGLE CHRISNORTH GIRISHKUMARKH

Slide 42

Slide 42 text

Top Python users? Rank # Repos 1 8 2 7 3 6 =4 5 =4 5 =4 5 =4 5 MARTINJC EVILDMP BADDERS FLYINGSPARX ZEMOGLE CHRISNORTH GIRISHKUMARKH

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

groups members events attendees

Slide 46

Slide 46 text

groups members events attendees

Slide 47

Slide 47 text

groups members events attendees

Slide 48

Slide 48 text

groups members events attendees

Slide 49

Slide 49 text

groups members events attendees

Slide 50

Slide 50 text

groups members events attendees
Generalise low-level topics with dimensionality reduction and clustering* (e.g., Data Analytics, Game Dev, etc.)
[*Not explained here. A ‘wizard’ did it.]
scikit-learn

Slide 51

Slide 51 text

pypi: basemap

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

London Brighton Bristol Edinburgh Leeds Liverpool Oxford Leicester Norwich Glasgow Cambridge Nottingham Cardiff Birmingham Sheffield Tyneside Belfast City-City Co-visitor Network City-City Co-visitor Network (Normalised)
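The slides do not define how the co-visitor network is built, so the following is only one plausible reading: two cities are linked by the number of Meetup members who attended events in both, and the normalised version divides that count by the number of members who visited either city (a Jaccard-style ratio). Both the definition and the normalisation are assumptions, and the attendance pairs are toy data.

```python
from collections import defaultdict
from itertools import combinations


def covisitor_network(attendance):
    """Build co-visitor edge weights from (member_id, city) attendance pairs.

    Returns (weights, visitor_counts): weights maps a sorted city pair to the
    number of members seen in both cities; visitor_counts maps each city to
    its number of distinct visitors.
    """
    cities_by_member = defaultdict(set)
    for member, city in attendance:
        cities_by_member[member].add(city)

    visitors = defaultdict(set)
    weights = defaultdict(int)
    for member, cities in cities_by_member.items():
        for city in cities:
            visitors[city].add(member)
        for a, b in combinations(sorted(cities), 2):
            weights[(a, b)] += 1
    return dict(weights), {city: len(m) for city, m in visitors.items()}


def normalise(weights, visitor_counts):
    """Jaccard-style weight: shared visitors / visitors of either city (assumed)."""
    return {
        (a, b): w / (visitor_counts[a] + visitor_counts[b] - w)
        for (a, b), w in weights.items()
    }


# Toy attendance records (invented for illustration).
toy_attendance = [
    ("m1", "Cardiff"), ("m1", "Bristol"),
    ("m2", "Cardiff"), ("m2", "Bristol"), ("m2", "London"),
    ("m3", "London"),
    ("m4", "Cardiff"),
]

edge_weights, city_visitors = covisitor_network(toy_attendance)
print(edge_weights)
print(normalise(edge_weights, city_visitors))
```

Raw counts favour cities with many attendees, so some normalisation is needed before comparing links between large and small cities; other choices than the one assumed here are equally possible.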

Slide 54

Slide 54 text

What’s in the box?
Groups on slide: Python, Software, Utils
pandas, basemap, pysal, ratelim, geocoding, rtree, matplotlib, ipython notebook, fiona, rsync, jq, twofishes, pymongo, geopy, requests, scikit-learn

Slide 55

Slide 55 text

What does this all mean?!

Slide 56

Slide 56 text

What does this all mean?! mining platforms for real-time view of tech hubs? [see Tech Nation report!]

Slide 57

Slide 57 text

What does this all mean?! mining platforms for real-time view of tech hubs? [see Tech Nation report!] example of full pipeline: collection to analysis

Slide 58

Slide 58 text

What does this all mean?! mining platforms for real-time view of tech hubs? [see Tech Nation report!] lots of untapped potential (but also lots of effort to collect data!) example of full pipeline: collection to analysis

Slide 59

Slide 59 text

What does this all mean?! mining platforms for real-time view of tech hubs? [see Tech Nation report!] python is awesome for both data collection and analysis lots of untapped potential (but also lots of effort to collect data!) example of full pipeline: collection to analysis

Slide 60

Slide 60 text

What does this all mean?! mining platforms for real-time view of tech hubs? [see Tech Nation report!] pointers for relevant Python packages python is awesome for both data collection and analysis lots of untapped potential (but also lots of effort to collect data!) example of full pipeline: collection to analysis

Slide 61

Slide 61 text

What does this all mean?!
mining platforms for real-time view of tech hubs? [see Tech Nation report!]
pointers for relevant Python packages
python is awesome for both data collection and analysis
Vince needs to set his Github location field
lots of untapped potential (but also lots of effort to collect data!)
example of full pipeline: collection to analysis

Slide 62

Slide 62 text

PyDiff, Cardiff 15 March 2016
Thanks for listening!
Matt J. Williams
http://www.mattjw.net
@voxmjw
Special thanks to Hasan Bakhshi and Juan Mateos-Garcia at Nesta
Any questions?