Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters

Talk given at PyDiff, Cardiff, 15 March 2016.

Matt J Williams

March 15, 2016

Transcript

  1. PyDiff, Cardiff
    15 March 2016
    Mining Github and Meetup.com
    to Explore the UK’s Digital
    Tech Clusters
    Matt J. Williams
    [email protected]
    http://www.mattjw.net
    @voxmjw

  3. Who the hell?
    Beer!
    A&E Game!
    ...Uri Geller!

  5. Mapping digital tech clusters
    http://www.nesta.org.uk/publications/tech-nation-2016
    http://www.techcityuk.com/technation/
    TRANSFORMING UK INDUSTRIES
    • Tech Nation 2016
    • Nesta-led project to map the state of the UK digital tech economy
    • Awesome report:
    – 27 digital tech clusters (regions)
    – Many datasets: big data, interviews, govt. stats
    – Many domains: firms, communities, skills demand, etc.
    • I contributed data collection and analytics for Github and Meetup.com
    • Data in this talk are drawn from my work on that project
    • Disclaimer: I’m not at Nesta! The project was led by the awesome folks in Nesta’s Creative and Digital Economy team.

  8. What can we learn?
    • Locations of programmers
    • Interest in programming languages
    • Impact of projects
    • Locations of communities
    • Community interests
    • Local social network

  9. This talk
    • Python → From data collection to analysis
    • Mining data from Github and Meetup
    • About 80% Github, 20% Meetup
    • Profiling digital tech clusters
    • Python tools used

  14. Mining Github
    Job Done?

  15. Mining Github
    London = 8k+ users
    Manchester = 2k+ users
    Cambridge = 2k+ users
    ...
    How does search choose the 1,000? (Github search returns at most 1,000 results per query.)

  16. Mining Github: Sketch
    Stages: User Discovery; User Profile; Repos; Forward Geocode & Filter; Repo Langs
    Data sources: Github Archive; MapQuest; Github API
    Aggregate into Geographic Regions (Digital Tech Clusters) for Analysis

  17. GithubArchive.org
    • Public archive of the public Github timeline
    • Every public event
    – Event types: push, pull req, comment, add contrib, fork [+15 more]
    • Use this to obtain a list of all Github users
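
The user-discovery step can be sketched roughly as follows. This is a minimal sketch, not the talk's actual code: it assumes hourly archive dumps are gzip-compressed, newline-delimited JSON in which each event has a `type` and an `actor` carrying a `login`. The names `discover_users` and `discover_users_in_file`, and the choice of which event types count as "coding-related", are illustrative assumptions.

```python
import gzip
import json

# Assumption: the talk does not say which event types were treated as coding-related.
CODING_EVENTS = {"PushEvent", "PullRequestEvent", "ForkEvent"}


def discover_users(lines, event_types=CODING_EVENTS):
    """Collect the logins of actors who triggered any of the given event types."""
    users = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") not in event_types:
            continue
        actor = event.get("actor")
        # Accept either an actor object with a "login" key or a plain-string actor.
        login = actor.get("login") if isinstance(actor, dict) else actor
        if login:
            users.add(login)
    return users


def discover_users_in_file(path):
    """Read one locally stored hourly dump (a .json.gz file) and return its user logins."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return discover_users(fh)
```

Taking the union of the per-file sets over every hour in the study window gives the full list of users to look up.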

  20. Pop Quiz
    How many users discovered for Jan-June 2015?
    One hour of events: ~100MB uncompressed
    Six months of events: ~0.5 TB
    Num coding-related events: 57,788,996
    1.7 million users
    ...we’ll need to obtain their profile data and forward geocode their locations!
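
As a rough sanity check on the data volume quoted on this slide, the per-hour figure can be scaled up to the January to June 2015 window. This is only back-of-the-envelope arithmetic: it assumes every hour is about 100 MB and uses decimal units; the helper name `hours_in_months` is made up for illustration.

```python
import calendar

MB_PER_HOUR = 100  # per-hour figure quoted on the slide (uncompressed)


def hours_in_months(year, months):
    """Number of hours covered by the given calendar months of one year."""
    days = sum(calendar.monthrange(year, m)[1] for m in months)
    return days * 24


hours = hours_in_months(2015, range(1, 7))    # January to June 2015
total_tb = hours * MB_PER_HOUR / 1_000_000    # decimal units: 1 TB = 1,000,000 MB
print(f"{hours} hourly files, about {total_tb:.2f} TB uncompressed")
```

The result lands a little under half a terabyte, which is in line with the "~0.5 TB" on the slide.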

  26. Forward Geocoding
    “Cardiff, UK” → lat, long; structured address; confidence
    1.7M users: global users
    350k users: ...with location field
    320k users: ...and geocoded
    21k users: ...and in UK
    18,470 users: ...and at city granularity
    pypi: geocoder
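
The successive filters on this slide can be sketched as a small funnel. This is a minimal sketch under stated assumptions: the geocoder is passed in as a plain callable so the example runs offline, and `GeoResult`, its field names, the `"GB"` country code and the `"city"` granularity label are all hypothetical stand-ins for whatever a real geocoding library returns.

```python
from collections import Counter, namedtuple

# Hypothetical result type; real geocoding libraries expose similar fields under their own names.
GeoResult = namedtuple("GeoResult", "lat lng country granularity")


def filter_funnel(profiles, geocode):
    """Apply the slide's successive filters and count survivors at each stage.

    profiles: iterable of dicts with 'login' and an optional 'location' string.
    geocode:  callable mapping a location string to a GeoResult, or None on failure.
    """
    counts = Counter()
    kept = []
    for profile in profiles:
        counts["all users"] += 1
        location = (profile.get("location") or "").strip()
        if not location:
            continue
        counts["with location field"] += 1
        result = geocode(location)
        if result is None:
            continue
        counts["geocoded"] += 1
        if result.country != "GB":
            continue
        counts["in UK"] += 1
        if result.granularity != "city":
            continue
        counts["city granularity"] += 1
        kept.append((profile["login"], result.lat, result.lng))
    return kept, counts


# Usage with a stub geocoder (a dict lookup) and made-up profiles.
LOOKUP = {
    "Cardiff, UK": GeoResult(51.48, -3.18, "GB", "city"),
    "United Kingdom": GeoResult(54.0, -2.0, "GB", "country"),
    "Berlin": GeoResult(52.52, 13.40, "DE", "city"),
}
profiles = [
    {"login": "a", "location": "Cardiff, UK"},
    {"login": "b", "location": "United Kingdom"},
    {"login": "c", "location": "Berlin"},
    {"login": "d", "location": "the internet"},
    {"login": "e", "location": None},
]
kept, counts = filter_funnel(profiles, LOOKUP.get)
```

Passing the geocoder in as a parameter also makes it easy to wrap a real service with caching or rate limiting without touching the filter logic.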

  27. Profile and Repo Data
    loads of queries (time!) needed
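
To put a rough number on "time!", here is a small estimate. It assumes one profile request per discovered user, a single authenticated token, the GitHub REST API's limit of 5,000 requests per hour for such a token, and the rounded 1.7 million users from the earlier slide. The function name `hours_needed` is illustrative.

```python
import math

REQUESTS_PER_HOUR = 5000  # GitHub REST API limit for one authenticated token


def hours_needed(n_requests, per_hour=REQUESTS_PER_HOUR):
    """Whole hours needed to issue n_requests without exceeding the hourly limit."""
    return math.ceil(n_requests / per_hour)


profile_hours = hours_needed(1_700_000)   # one profile lookup per discovered user
print(profile_hours, "hours, about", round(profile_hours / 24, 1), "days")
```

Repository and language lookups for the users who survive filtering add further requests on top of this.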

  29. What’s a (geographic) tech cluster?
    Travel To Work Areas (TTWAs)
    Defined and managed by the ONS
    Inferred from UK census data – where do you live and where do you work?
    Each TTWA represents a labour basin
    TTWA 2001 boundaries chosen for the Tech Nation Report

  31. What’s a (geographic) tech cluster?
    Point in polygon lookup
    Stockport → Manchester TTWA
    Manchester → Manchester TTWA
    Macclesfield → Manchester TTWA
    Pure Python libraries for efficient boundary lookup... ...kinda suck
    [someone build this!]
    pypi: pysal
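
For readers unfamiliar with point-in-polygon lookup, here is a minimal pure-Python sketch of the idea: a bounding-box pre-check followed by a ray-casting test. It is not the approach used for the talk's analysis; it treats coordinates as planar, ignores holes and multi-part boundaries, and the class name `RegionLookup` and the toy regions are invented for illustration.

```python
def bbox(polygon):
    """Axis-aligned bounding box of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's height can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class RegionLookup:
    """Map a point to the first region whose polygon contains it."""

    def __init__(self, regions):
        # regions: dict of name -> list of (x, y) vertices
        self._regions = [(name, bbox(poly), poly) for name, poly in regions.items()]

    def find(self, x, y):
        for name, (minx, miny, maxx, maxy), poly in self._regions:
            # Cheap bounding-box rejection before the full polygon test.
            if minx <= x <= maxx and miny <= y <= maxy and point_in_polygon(x, y, poly):
                return name
        return None


# Toy regions (not real TTWA boundaries).
lookup = RegionLookup({
    "Region A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Region B": [(5, 0), (9, 0), (5, 4)],
})
```

This linear scan is fine for around a hundred regions; a spatial index over the bounding boxes is the usual next step when there are many more polygons or points.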

  32. Final dataset
    • 18,470 users
    • 89,363 repositories
    • 270 programming languages
    • 100 TTWAs
    • Limitations:
    – Bias towards open source
    – Bias in location sharing
    – Public repos only
    – Geocoder misclassification (<5% error)

  36. *we only count authors of original (non-forked) repos
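
The counting rule in this footnote, together with the tied ranks ("=2") used in the league tables, can be sketched as below. This is an illustration rather than the talk's code: it assumes each repo is a flat dict with `owner`, `fork` and `language` keys (a simplification of the repository records returned by the GitHub API), and the function names are made up.

```python
from collections import Counter


def count_original_repos(repos, language=None):
    """Count non-forked repos per owner, optionally restricted to one language."""
    counts = Counter()
    for repo in repos:
        if repo.get("fork"):
            continue
        if language is not None and repo.get("language") != language:
            continue
        counts[repo["owner"]] += 1
    return counts


def rank_with_ties(counts):
    """Competition ranking (1, =2, =2, 4, ...), largest count first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tally = Counter(counts.values())  # how many owners share each count
    rows = []
    for position, (owner, n) in enumerate(ordered, start=1):
        # Owners with the same count share the rank of the first of them.
        rank = rows[-1][3] if rows and rows[-1][2] == n else position
        label = f"={rank}" if tally[n] > 1 else str(rank)
        rows.append((label, owner, n, rank))
    return [(label, owner, n) for label, owner, n, _ in rows]


# Made-up repositories for illustration.
repos = [
    {"owner": "ann", "fork": False, "language": "Python"},
    {"owner": "ann", "fork": False, "language": "Python"},
    {"owner": "ann", "fork": True, "language": "Python"},
    {"owner": "bob", "fork": False, "language": "Python"},
    {"owner": "cat", "fork": False, "language": "Go"},
    {"owner": "dan", "fork": False, "language": "Python"},
]
overall = rank_with_ties(count_original_repos(repos))
python_only = rank_with_ties(count_original_repos(repos, language="Python"))
```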

  37. Top Cardiff Github users
    Rank  User         # Repos
    1     TOMCRICK     28
    =2    TOMASBASHAM  21
    =2    FLYINGSPARX  21
    ...   ...          ...
    4     MARTINJC     19
    ...   ...          ...
    8     ENCIMA       12

  42. Top Python users?
    Rank  User           # Repos
    1     MARTINJC       8
    2     EVILDMP        7
    3     BADDERS        6
    =4    FLYINGSPARX    5
    =4    ZEMOGLE        5
    =4    CHRISNORTH     5
    =4    GIRISHKUMARKH  5

  50. groups, members, events, attendees
    Generalise low-level topics with dimensionality reduction and clustering*
    E.g., Data Analytics, Game Dev, etc.
    [*Not explained here. A ‘wizard’ did it.]
    scikit-learn

  51. pypi: basemap

  53. [Figure, two panels: “City-City Co-visitor Network” and “City-City Co-visitor Network (Normalised)”]
    Cities shown: London, Brighton, Bristol, Edinburgh, Leeds, Liverpool, Oxford, Leicester, Norwich, Glasgow, Cambridge, Nottingham, Cardiff, Birmingham, Sheffield, Tyneside, Belfast
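
One way a city-city co-visitor network could be built is sketched below. The slide does not define "co-visitor" or the normalisation, so both are assumptions here: two cities are linked once for every member who attended events in both, and the normalised weight is the Jaccard index (shared visitors divided by visitors to either city). The function name `covisitor_network` and the sample data are invented.

```python
from collections import defaultdict
from itertools import combinations


def covisitor_network(visits):
    """visits: iterable of (member_id, city). Returns (raw, normalised) edge weights."""
    cities_by_member = defaultdict(set)
    members_by_city = defaultdict(set)
    for member, city in visits:
        cities_by_member[member].add(city)
        members_by_city[city].add(member)

    # Raw weight: number of members seen in both cities of a pair.
    raw = defaultdict(int)
    for cities in cities_by_member.values():
        for a, b in combinations(sorted(cities), 2):
            raw[(a, b)] += 1

    # Assumed normalisation: Jaccard index of the two cities' visitor sets.
    normalised = {}
    for (a, b), shared in raw.items():
        either = len(members_by_city[a] | members_by_city[b])
        normalised[(a, b)] = shared / either
    return dict(raw), normalised


# Made-up attendance records.
visits = [
    ("m1", "Cardiff"), ("m1", "Bristol"),
    ("m2", "Cardiff"), ("m2", "Bristol"), ("m2", "London"),
    ("m3", "London"), ("m4", "London"),
]
raw, normalised = covisitor_network(visits)
```

Normalising in some way matters because a large city such as London would otherwise dominate every raw count simply by having more attendees.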

  54. What’s in the box?
    Categories: Python, Utils, Software
    Items: pandas, basemap, pysal, ratelim, geocoding, rtree, matplotlib, ipython notebook, fiona, rsync, jq, twofishes, pymongo, geopy, requests, scikit-learn

  61. What does this all mean?!
    mining platforms for real-time view of tech hubs? [see Tech Nation report!]
    example of full pipeline: collection to analysis
    lots of untapped potential (but also lots of effort to collect data!)
    python is awesome for both data collection and analysis
    pointers for relevant Python packages
    Vince needs to set his Github location field

  62. PyDiff, Cardiff
    15 March 2016
    Thanks for listening!
    Matt J. Williams
    [email protected]
    http://www.mattjw.net
    @voxmjw
    Special thanks to Hasan Bakhshi and Juan Mateos-Garcia at Nesta
    Any questions?
