
Visualizing Github

Idan Gazit
March 15, 2013


A talk given at PyCon 2013 by Dana Bauer and Idan Gazit. See http://lanyrd.com/2013/pycon/scdywb/ and http://lanyrd.com/2013/pycon/scdywh/.



Transcript

  1. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit

  2. JavaScript
    Ruby
    Java
    Python
    Shell
    PHP
    C
    C++
    Perl
    Obj-C
    mapping
    meaning
    onto
    data

  3. oh hello there!

  4. ?
    data
    questions

  5. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  6. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  7. Dana Bauer
    @geography76
    Part I
    Data to Information
    Idan Gazit
    @idangazit

  8. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
    Part II
    Information to Meaning

  9. Acquiring the
    data
    STEP 1

  10. !."million
    users

  11. #.$million
    repos

  12. http://flic.kr/p/aZ4Z54

  13. http://flic.kr/p/4pdeJz

  14. a meaningful subset of

  15. github.com/languages

  16. API Whitelist
    Repo Cloning

  17. cloud
    amazon web services

  18. ????GB
    on disk

  19. '&&GB
    on disk
    #&
    %#

  20. &.mb
    min
    '(.)mb
    median
    "(.*mb
    average
    7,)'*mb
    max
    100gb of repositories
    (webkit/WebKit)
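
The four summary figures on this slide (min, median, average, max) can be computed from a list of per-repository sizes. A minimal sketch, assuming the sizes have already been measured in megabytes; `summarize_sizes` and the sample values are illustrative and not taken from the talk:

```python
import statistics


def summarize_sizes(sizes_mb):
    """Return min, median, average and max for a list of repo sizes in MB."""
    if not sizes_mb:
        raise ValueError("need at least one size")
    return {
        "min": min(sizes_mb),
        "median": statistics.median(sizes_mb),
        "average": statistics.mean(sizes_mb),
        "max": max(sizes_mb),
    }


# Hypothetical sizes: one huge repo pulls the average far above the median.
sample = [1, 2, 4, 8, 985]
summary = summarize_sizes(sample)
```

When a few repositories are very large, the average sits well above the median, which is why reporting both is useful.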

  21. Connection reset by peer

  22. EC2 instance size
    RAM & CPU

  23. 100–
    200ms
    from us
    2ms
    from EC2
    yay internet.

  24. Tooling
    ~100gb EBS volume
    bootstrapping instances
    running commands

  25. user-data scripts
    bare metal → ready to work
    http://alestic.com/2009/06/ec2-user-data-scripts

  26. # setup python and pip requirements
    curl -O http://python-distribute.org/distribut
    python distribute_setup.py
    easy_install pip
    pip install ipython tornado pyzmq
    pip install -r requirements.txt

  27. fabric
    http://fabric.readthedocs.org

  28. run “git clone” two thousand times
    make sure the directories exist
    load the list of top repos

  29. from fabric.api import lcd, local  # Fabric 1.x helpers

    # REPOS_PATH is defined elsewhere; its .child() calls suggest a unipath.Path.
    def clone_lang(lang, repos):
        """ Clone the most-watched repos for a given language """
        print('*** Cloning {} repositories...'.format(lang))
        langpath = REPOS_PATH.child(lang)
        local('mkdir -p {}'.format(langpath))
        for r in repos:
            # each entry looks like "user/reponame"
            user, reponame = r.split('/')
            userpath = langpath.child(user)
            repopath = userpath.child(reponame)
            if repopath.exists():
                # already cloned on an earlier run
                print('Skipping {}...'.format(r))
                continue
            print('Cloning {}'.format(r))
            local('mkdir -p {}'.format(userpath))
            with lcd(userpath):
                local('git clone https://github.com/{}.git'.format(r))
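
The skip-if-present check is what makes the clone loop safe to rerun after a failure such as a dropped connection. A standard-library sketch of the same idea without Fabric; `plan_clones` is a hypothetical helper, and the `root/lang/user/repo` directory layout is assumed from the slide:

```python
from pathlib import Path


def plan_clones(root, lang, repos):
    """Return (parent_dir, clone_url) pairs for repos not yet on disk.

    `repos` holds "user/reponame" strings, as in the slide's loop.
    """
    todo = []
    for r in repos:
        user, reponame = r.split("/")
        userpath = Path(root) / lang / user
        if (userpath / reponame).exists():
            continue  # already cloned on an earlier run
        todo.append((userpath, "https://github.com/{}.git".format(r)))
    return todo
```

Each pair could then be handed to something like `subprocess.run(["git", "clone", url], cwd=parent, check=True)` once the parent directory has been created.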

  30. Parsing and
    Filtering
    STEP 2

  31. Say Cheese!
    data → snapshot

  32. snap-!b"#$%#%
    thank you for shopping at
    Dana & Idan’s data emporium!
    We appreciate your business.

  33. Data
    snapshot → EBS volume

  34. working with the data
    Playtime!
    collaboration

  35. fab launch_ec2

  36. exposed to the outside
    IPython notebook
    http://:8000

  37. “oops I closed the browser tab”
    long-running stuff
    “oops I shut my laptop lid”

  38. Dear IPython devs
    we hear you are flush.
    can haz a pony?
    — kthxbai.

  39. terminal multiplexer
    tmux

  40. s h e l l s i n s i d e y o u r s h e l l s
    INCEPTION

  41. http://flic.kr/p/5FYT2j
    immortal
    python
    shell

  42. see output
    reattach to the
    tmux session
    pick up where we left off

  43. ghetto pair programming in the cloud
    Double your REPL,
    Double your fun
    attach to the same tmux session

  44. storing results

  45. (haters gonna hate.)

  46. no JOINs
    no problem!
    no schema

  47. Cool story, bro sister!
    Tell me more about
    your octocats.

  48. git != github
    network
    authors != github users

  49. an asynchronous task queue
    celery
    with nifty features

  50. @celery.task(rate_limit='5000/h')
    rate limiting
    gotchas!
    pay attention to X-RateLimit-Remaining
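
One way to act on `X-RateLimit-Remaining` is to pause once the remaining quota is nearly used up. A sketch under assumptions: `seconds_to_wait` and its `reserve` threshold are hypothetical, and the companion `X-RateLimit-Reset` header (a Unix timestamp in GitHub's API) is not mentioned on the slide:

```python
import time


def seconds_to_wait(headers, reserve=10, now=None):
    """Return how long to sleep before the next API call.

    `headers` is a mapping of response headers. When fewer than
    `reserve` calls remain, wait until the reset time; otherwise 0.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", reserve))
    if remaining >= reserve:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

With a plain dict the header names must match exactly; HTTP client libraries often expose case-insensitive header mappings, which removes that concern.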

  51. Heroku’s first dyno is free
    celery in the cloud
    nobody says it has to be a web dyno…

  52. redis for broker, result store
    batteries included*
    “heroku run bash” to get a shell

  53. There’s no storage on Heroku, by design
    so why not...?

  54. Mining for
    Understanding
    STEP 3

  55. #,)!$,*%*
    commits

  56. *!,'"*
    authors

  57. "",!)*
    contributors
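
Commit and author counts of this kind can be derived from `git log` output. A sketch, assuming one author email per line as produced by `git log --format=%ae`; `count_commits_and_authors` is a hypothetical helper, and treating a lowercased email as an identity is a simplification, since one person may commit under several addresses and a git author is not necessarily a GitHub user:

```python
from collections import Counter


def count_commits_and_authors(log_lines):
    """Count commits and distinct author emails from `git log --format=%ae` lines."""
    emails = [line.strip().lower() for line in log_lines if line.strip()]
    per_author = Counter(emails)
    return len(emails), len(per_author), per_author
```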

  58. http://en.wikipedia.org/wiki/File:Jamtlands_Flyg_EC120B_Colibri.JPG
    wat?

  59. yeah good luck with that.
    everybody please
    stand still?

  60. Do Repeat Yourself
    Idempotency
    (without shooting your own feet)

  61. Idempotency
    help others replicate your results
    peace of mind
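
An idempotent step can be rerun without redoing or corrupting finished work. A minimal sketch using marker files; `run_once` and the marker-directory layout are hypothetical and not from the talk:

```python
from pathlib import Path


def run_once(name, action, marker_dir):
    """Run `action()` unless a marker file shows it already succeeded.

    Returns True if the action ran, False if it was skipped.
    """
    marker = Path(marker_dir) / (name + ".done")
    if marker.exists():
        return False
    action()  # an exception here leaves no marker, so a rerun retries
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Because the marker is written only after the action succeeds, a crashed run is retried while a completed one is skipped.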

  62. Break time!
    sudo go have a sandwich.
    Part II: Information to Meaning
    up next
    Idan Gazit, 3:15p

  63. Part II: Information to Meaning
    up now
    Idan Gazit, 3:15p
    Hi.

  64. Dana Bauer
    @geography76
    Part I
    Data to Information
    Idan Gazit
    @idangazit

  65. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
    Part II
    Information to Meaning

  66. design is
    Constraints

  67. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  68. http://flic.kr/p/9mz5hj http://flic.kr/p/FNgEL

  69. Given the complexity of data, using it to
    provide a meaningful solution requires
    insights from diverse fields: statistics,
    data mining, graphic design, and
    information visualization.
    Ben Fry
    from “Visualizing Data”

  70. Choosing a Visual
    Representation
    STEP 5

  71. http://flic.kr/p/dxyTt1

  72. http://flic.kr/p/7FH2Re

  73. http://flic.kr/p/7oYTTS

  74. http://flic.kr/p/5hiBsz

  75. Meaning
    requires context

  76. JavaScript
    Ruby
    Java
    Python
    Shell
    PHP
    C
    C++
    Perl
    Obj-C
    873y40817 234 098 14092 309812 39182742 48714209 81239 84127498023 873y408

  77. CHOOSING A
    Medium
    CHOOSING AN
    Audience

  78. One size fits nobody.

  79. know about github
    passing familiarity with
    code and development
    curious about our
    community

  80. Know thine audience.....

  81. The Eye Candy Trap
    ..............
    beautiful noise is still just noise

  82. on the
    Web
    For
    Geeks

  83. on the
    Web
    For
    Geeks
    using
    ?

  84. Logo for the modern computer.
    see also: processing.js

  85. A data visualization toolkit for
    the web.
    D3 is…

  86. mapping
    data
    onto
    meaningful
    visuals

  87. mapping
    data
    onto
    the
    DOM

  88. scales
    linear
    logarithmic
    quantile
    ordinal
    time

  89. // linear scale
    var x = d3.scale.linear()
        .domain([0, 1.0])
        .range([0, 255]);
    x(0); // == 0
    x(0.5); // == 127.5
    x(1); // == 255
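
The same mapping written out by hand, as a Python sketch of what a linear scale computes. This is not D3's implementation, and `linear_scale` is a hypothetical name:

```python
def linear_scale(domain, range_):
    """Return a function mapping values in `domain` linearly onto `range_`."""
    d0, d1 = domain
    r0, r1 = range_
    if d0 == d1:
        raise ValueError("domain must span a non-zero interval")

    def scale(value):
        t = (value - d0) / (d1 - d0)  # position within the domain, 0..1
        return r0 + t * (r1 - r0)

    return scale


x = linear_scale((0, 1.0), (0, 255))
```

Calling `x(0)`, `x(0.5)` and `x(1)` gives the same three values as the comments on the slide.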

  90. interpolation
    dimension
    position
    color
    orientation
    ... everything

  91. 0
    25
    50
    75
    100
    2007 2008 2009 2010
    Apples Bananas

  92. HTTP://NYTI.MS/WR1DHZ

  93. SVG
    not just an image

  94. Intermediate
    Representations

  95. JSON
    easy to serialize to
    structured
    preserves datatypes
    bloated for tabular data

  96. CSV
    mostly easy to serialize to*
    structured
    comes out the other side like a dict
    everything comes back as a string
    *unicode, meh
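
The difference in type preservation between JSON and CSV is easy to see in a round trip. A sketch with a made-up row; the field names echo the column list in this deck, and the values are purely illustrative:

```python
import csv
import io
import json

row = {"name": "example", "rank": 3, "watchers": 1200}

# JSON keeps ints as ints.
from_json = json.loads(json.dumps(row))

# CSV comes back as a dict per row, but every value is a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buf.seek(0)
from_csv = next(csv.DictReader(buf))
```

Anything read back from CSV therefore needs an explicit conversion step (for example `int(from_csv["rank"])`) before it can be used as a number.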

  97. name
    user
    rank
    watchers
    commits
    size
    earliest commit

  98. [
      {
        "type": "string",
        "name": "name"
      },
      {
        "type": "int",
        "name": "rank",
        "extents": [
          0,
          199
        ]
      },
      ...
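
A descriptor list in this shape can be generated from the rows themselves. A sketch, assuming each column holds either only ints or only strings; `describe_columns` is a hypothetical helper, and `extents` is taken to mean the [min, max] of a numeric column:

```python
def describe_columns(rows, columns):
    """Build a list of {"type", "name"[, "extents"]} dicts for `columns`."""
    schema = []
    for name in columns:
        values = [row[name] for row in rows]
        if values and all(isinstance(v, int) for v in values):
            schema.append({
                "type": "int",
                "name": name,
                "extents": [min(values), max(values)],
            })
        else:
            schema.append({"type": "string", "name": name})
    return schema
```

Shipping the extents alongside the data means the client can set up its scales without first scanning every row.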

  99. on the
    Web
    For
    Geeks
    using
    D3.js

    View full-size slide

  100. What do we want to know?
    um, we don’t know that yet.

  101. Frustration.
    “Plan to throw one away.”
    some

  102. People
    Languages
    Repositories

  103. Letting the Data
    Speak for Itself

  104. Responsive
    Visualizations

  105. People
    Languages
    Repositories

  106. visualization
    localization
    internationalization
    v11n
    l10n
    i18n
    totally a thing
    from now on

  107. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
