Mining Github and Meetup.com to Explore the UK’s Digital Tech Clusters

Talk given at PyDiff. Cardiff. 15 Mar 2016.

Matt J Williams

March 15, 2016

Transcript

  1. PyDiff, Cardiff
    15 March 2016
    Mining Github and Meetup.com
    to Explore the UK’s Digital
    Tech Clusters
    Matt J. Williams
    http://www.mattjw.net
    @voxmjw

  3. Who the hell?
    Beer!
    A&E Game!
    ...Uri Geller!

  5. Mapping digital tech clusters
    http://www.nesta.org.uk/publications/tech-nation-2016
    http://www.techcityuk.com/technation/
    TRANSFORMING UK INDUSTRIES
    • Tech Nation 2016
    • Nesta-led project to map the state of the UK digital tech
    economy
    • Awesome report:
    – 27 digital tech clusters (regions)
    – Many datasets: big data, interviews, govt. stats
    – Many domains: firms, communities, skills demand, etc.
    • I contributed data collection and analytics for Github and
    Meetup.com
    • Data in this talk are drawn from my work on that project
    • Disclaimer: I’m not at Nesta! The project was led by the
    awesome folks in Nesta’s Creative and Digital Economy team.

  7. What can we learn?
    • Locations of programmers
    • Interest in programming
    languages
    • Impact of projects
    • Locations of communities
    • Community interests
    • Local social network

  8. This talk
    • Python → From data collection to analysis
    • Mining data from Github and Meetup
    • About 80% Github, 20% Meetup
    • Profiling digital tech clusters
    • Python tools used

  9. Mining Github

  13. Mining Github
    Job Done?

  14. Mining Github
    London = 8k+ users
    Manchester = 2k+ users
    Cambridge = 2k+ users
    ...
    How does search choose the 1,000?

  15. Mining Github: Sketch
    User Discovery
    User Profile Repos
    Forward Geocode & Filter
    Repo Langs
    Github Archive
    MapQuest
    Github API
    Aggregate into Geographic Regions
    (Digital Tech Clusters) for Analysis

  16. GithubArchive.org
    • Public archive of the public Github timeline
    • Every public event
    Event types: push, pull req, comment, add contrib, fork [+15 more]
    • Use this to obtain a list of all Github users
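
A minimal sketch of the user-discovery step, assuming each hourly archive is gzip-compressed, newline-delimited JSON in which every event carries a `type` and an `actor.login` field. The function names and the set of "coding-related" event types are illustrative assumptions, not the ones used in the project.

```python
import gzip
import io
import json

# Assumption: which event types count as "coding-related" is illustrative only.
CODING_EVENTS = {"PushEvent", "PullRequestEvent", "CreateEvent", "ForkEvent"}


def discover_users(lines, event_types=CODING_EVENTS):
    """Return the set of actor logins seen in an iterable of JSON event lines."""
    users = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") not in event_types:
            continue
        login = (event.get("actor") or {}).get("login")
        if login:
            users.add(login)
    return users


def discover_users_in_file(path):
    """Read one hourly .json.gz archive from disk and collect its users."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return discover_users(handle)


# Tiny in-memory example standing in for an hourly archive file.
sample = "\n".join([
    json.dumps({"type": "PushEvent", "actor": {"login": "alice"}}),
    json.dumps({"type": "WatchEvent", "actor": {"login": "bob"}}),
    json.dumps({"type": "ForkEvent", "actor": {"login": "carol"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "alice"}}),
])
found = discover_users(io.StringIO(sample))
print(sorted(found))
```

Taking the union of the per-file sets over six months of hourly archives gives the full list of users to look up.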

  20. Pop Quiz
    How many users discovered for Jan-June 2015?
    • One hour of events: ~100MB uncompressed
    • Six months of events: ~0.5 TB
    • Num coding-related events: 57,788,996
    Answer: 1.7 million users
    ...we’ll need to obtain their profile data and
    forward geocode their locations!

  25. Forward Geocoding

    “Cardiff, UK” → lat, long; structured address; confidence
    pypi: geocoder
    Filtering the discovered users:
    • 1.7M users: global users
    • 350k users: ...with location field
    • 320k users: ...and geocoded
    • 21k users: ...and in UK
    • 18,470 users: ...and at city granularity
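
The filtering funnel on this slide can be sketched as a chain of predicates applied in order. This is only an illustration: the record layout (`location`, `geo`, `country`, `granularity`) is a hypothetical stand-in for whatever the geocoder returns, and the sample users are made up.

```python
def geocode_funnel(users):
    """Apply the slide's filters in order; return (stage counts, survivors)."""
    stages = [
        ("global users", lambda u: True),
        ("with location field", lambda u: bool(u.get("location"))),
        ("geocoded", lambda u: u.get("geo") is not None),
        # Safe to index u["geo"] here: the previous stage dropped missing results.
        ("in UK", lambda u: u["geo"]["country"] == "GB"),
        ("city granularity", lambda u: u["geo"]["granularity"] == "city"),
    ]
    counts = []
    remaining = list(users)
    for name, keep in stages:
        remaining = [u for u in remaining if keep(u)]
        counts.append((name, len(remaining)))
    return counts, remaining


# Hypothetical records: "geo" is None when geocoding failed or was not attempted.
users = [
    {"login": "a", "location": "Cardiff, UK",
     "geo": {"country": "GB", "granularity": "city"}},
    {"login": "b", "location": "UK",
     "geo": {"country": "GB", "granularity": "country"}},
    {"login": "c", "location": "Paris",
     "geo": {"country": "FR", "granularity": "city"}},
    {"login": "d", "location": "the internet", "geo": None},
    {"login": "e", "location": None, "geo": None},
]
counts, survivors = geocode_funnel(users)
for name, n in counts:
    print(name, n)
```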

  26. Profile and Repo Data
    loads of queries (time!) needed
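
A back-of-the-envelope sketch of why the queries take so long, assuming one API request per user profile and the standard authenticated GitHub API limit of 5,000 requests per hour. Repo and language listings would add further requests per user; the function name is illustrative.

```python
def crawl_hours(n_requests, requests_per_hour=5000, n_tokens=1):
    """Hours needed to issue n_requests under a per-token hourly rate limit."""
    return n_requests / (requests_per_hour * n_tokens)


# One profile lookup for each of the 1.7 million discovered users.
profile_hours = crawl_hours(1_700_000)
print(profile_hours, "hours, or about", round(profile_hours / 24, 1), "days")
```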

  28. What’s a (geographic) tech
    cluster?
    Travel To Work Areas (TTWAs)
    Defined and managed by the
    ONS
    Inferred from UK census data –
    where do you live and where
    do you work?
    Each TTWA represents a
    labour basin
    TTWA 2001 boundaries chosen
    for the Tech Nation Report

  30. What’s a (geographic) tech
    cluster?
    Point in polygon lookup
    Stockport → Manchester TTWA
    Manchester → Manchester TTWA
    Macclesfield → Manchester TTWA
    Pure Python libraries for
    efficient boundary lookup...
    ...kinda suck
    [someone build this!]
    pypi: pysal
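
For illustration, a pure-Python point-in-polygon lookup using ray casting with a bounding-box pre-check. This is a sketch of the general technique, not the pysal-based code used in the project; the rectangles below are made-up stand-ins for real TTWA boundaries, and a spatial index would replace the linear scan at scale.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def bounding_box(polygon):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def build_index(regions):
    """Precompute bounding boxes so most regions are rejected cheaply."""
    return [(name, bounding_box(poly), poly) for name, poly in regions.items()]


def lookup(index, lon, lat):
    """Return the name of the first region containing the point, else None."""
    for name, (min_x, min_y, max_x, max_y), poly in index:
        if min_x <= lon <= max_x and min_y <= lat <= max_y:
            if point_in_polygon(lon, lat, poly):
                return name
    return None


# Made-up rectangular "boundaries" as (lon, lat) vertices, for illustration only.
regions = {
    "Manchester TTWA": [(-2.6, 53.2), (-1.9, 53.2), (-1.9, 53.7), (-2.6, 53.7)],
    "Cardiff TTWA": [(-3.5, 51.3), (-3.0, 51.3), (-3.0, 51.7), (-3.5, 51.7)],
}
index = build_index(regions)
print(lookup(index, -2.16, 53.41))  # approximate coordinates of Stockport
```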

  31. Final dataset
    • 18,470 users
    • 89,363 repositories
    • 270 programming languages
    • 100 TTWAs
    • Limitations:
    – Bias towards open source
    – Bias in location sharing
    – Public repos only
    – Geocoder misclassification (<5% error)

  32. *we only count authors of
    original (non-forked) repos
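
A small sketch of the counting rule above: skip forked repos, optionally filter by language, and rank owners with ties marked as "=N". The flat record layout (`owner`, `fork`, `language`) is a simplified assumption modelled on fields in GitHub's repo data, and the sample repos are invented.

```python
from collections import Counter


def rank_authors(repos, language=None):
    """Count original (non-forked) repos per owner, optionally for one language."""
    counts = Counter()
    for repo in repos:
        if repo.get("fork"):
            continue
        if language is not None and repo.get("language") != language:
            continue
        counts[repo["owner"]] += 1
    return counts.most_common()


def with_tie_ranks(ranked):
    """Attach competition ranks, prefixing tied ranks with '='."""
    totals = Counter(n for _, n in ranked)
    rows = []
    for owner, n in ranked:
        rank = 1 + sum(1 for _, m in ranked if m > n)
        label = "={}".format(rank) if totals[n] > 1 else str(rank)
        rows.append((label, owner, n))
    return rows


# Invented sample data.
repos = [
    {"owner": "alice", "fork": False, "language": "Python"},
    {"owner": "alice", "fork": False, "language": "Python"},
    {"owner": "alice", "fork": False, "language": "Go"},
    {"owner": "alice", "fork": True, "language": "Python"},
    {"owner": "bob", "fork": False, "language": "Python"},
    {"owner": "bob", "fork": False, "language": "Python"},
    {"owner": "carol", "fork": False, "language": "Python"},
    {"owner": "carol", "fork": False, "language": "JavaScript"},
]
rows = with_tie_ranks(rank_authors(repos))
for row in rows:
    print(*row)
```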

  33. Top Cardiff Github users
    Rank  User         # Repos
    1     TOMCRICK     28
    =2    TOMASBASHAM  21
    =2    FLYINGSPARX  21
    ...   ...          ...
    4     MARTINJC     19
    ...   ...          ...
    8     ENCIMA       12

  38. Top Python users?
    Rank  User           # Repos
    1     MARTINJC       8
    2     EVILDMP        7
    3     BADDERS        6
    =4    FLYINGSPARX    5
    =4    ZEMOGLE        5
    =4    CHRISNORTH     5
    =4    GIRISHKUMARKH  5

  44. groups members
    events attendees
    Generalise low-level topics with
    dimensionality reduction and
    clustering*
    E.g., Data Analytics, Game Dev, etc.
    [*Not explained here. A ‘wizard’ did it.]
    scikit-learn

  45. pypi: basemap

  46. City-City Co-visitor Network
    Panels: City-City Co-visitor Network; City-City Co-visitor Network (Normalised)
    Cities shown: London, Brighton, Bristol, Edinburgh, Leeds, Liverpool,
    Oxford, Leicester, Norwich, Glasgow, Cambridge, Nottingham, Cardiff,
    Birmingham, Sheffield, Tyneside, Belfast
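
One way a city-city co-visitor network could be built is sketched below. The slide does not define the terms, so this rests on two assumptions: a "co-visitor" is a member who attended events in both cities, and the normalised weight is the Jaccard ratio (shared visitors over all visitors to either city). The attendance records are invented.

```python
from collections import defaultdict
from itertools import combinations


def covisitor_network(attendance):
    """Build raw and normalised edge weights from (member_id, city) pairs."""
    visitors = defaultdict(set)
    for member, city in attendance:
        visitors[city].add(member)

    raw, normalised = {}, {}
    for a, b in combinations(sorted(visitors), 2):
        shared = visitors[a] & visitors[b]
        if not shared:
            continue  # no edge between cities with no common visitors
        raw[(a, b)] = len(shared)
        # Assumed normalisation: Jaccard similarity of the two visitor sets.
        normalised[(a, b)] = len(shared) / len(visitors[a] | visitors[b])
    return raw, normalised


# Invented attendance records.
attendance = [
    ("m1", "London"), ("m1", "Cardiff"),
    ("m2", "London"), ("m2", "Bristol"),
    ("m3", "Cardiff"), ("m3", "Bristol"), ("m3", "London"),
    ("m4", "London"),
]
raw, normalised = covisitor_network(attendance)
for edge in sorted(raw):
    print(edge, raw[edge], round(normalised[edge], 2))
```

Normalising matters because a large city such as London shares visitors with almost everywhere on raw counts alone.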

  47. What’s in the box?
    Categories: Python, Software, Utils
    pandas, basemap, pysal, ratelim, geocoding, rtree, matplotlib,
    ipython notebook, fiona, rsync, jq, twofishes, pymongo, geopy,
    requests, scikit-learn

  54. What does this all mean?!
    • mining platforms for real-time view of tech hubs?
    [see Tech Nation report!]
    • example of full pipeline: collection to analysis
    • lots of untapped potential
    (but also lots of effort to collect data!)
    • python is awesome for both data collection and analysis
    • pointers for relevant Python packages
    • Vince needs to set his Github location field

  55. PyDiff, Cardiff
    15 March 2016
    Thanks for listening!
    Matt J. Williams
    http://www.mattjw.net
    @voxmjw
    Special thanks to Hasan Bakhshi and
    Juan Mateos-Garcia at Nesta
    Any questions?
