Visualizing Github

Idan Gazit
March 15, 2013

A talk given at PyCon 2013 by Dana Bauer and Idan Gazit. See http://lanyrd.com/2013/pycon/scdywb/ and http://lanyrd.com/2013/pycon/scdywh/.

Transcript

  1. [slide without text]

  2. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit

  3. [slide without text]

  4. JavaScript
    Ruby
    Java
    Python
    Shell
    PHP
    C
    C++
    Perl
    Obj-C
    mapping
    meaning
    onto
    data

  5. oh hello there!

  6. [slide without text]

  7. [slide without text]

  8. ?
    data
    questions

  9. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  10. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  11. Dana Bauer
    @geography76
    Part I
    Data to Information
    Idan Gazit
    @idangazit

  12. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
    Part II
    Information to Meaning

  13. Acquiring the
    data
    STEP 1

  14. !."million
    users

  15. #.$million
    repos

  16. http://flic.kr/p/aZ4Z54

  17. http://flic.kr/p/4pdeJz

  18. a meaningful subset of

  19. github.com/languages

  20. API Whitelist
    Repo Cloning

  21. cloud
    amazon web services

  22. flexibility

  23. ????GB
    on disk

  24. %&GB
    on disk

  25. '&&GB
    on disk
    #&
    %#

  26. &.mb
    min
    '(.)mb
    median
    "(.*mb
    average
    7,)'*mb
    max
    100gb of repositories
    (webkit/WebKit)
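
The four summary figures on this slide (min, median, average, max) can be produced with the Python standard library. This is a minimal sketch, assuming the repository sizes have already been measured in megabytes; the function name `summarize_sizes` and the sample values are hypothetical and are not the talk's data.

```python
import statistics


def summarize_sizes(sizes_mb):
    """Return min, median, average and max for a list of repo sizes in MB."""
    if not sizes_mb:
        raise ValueError("need at least one size")
    return {
        "min": min(sizes_mb),
        "median": statistics.median(sizes_mb),
        "average": statistics.mean(sizes_mb),
        "max": max(sizes_mb),
    }


# Made-up sizes, purely for illustration.
sample_sizes = [0.5, 12.0, 30.0, 45.5, 7000.0]
summary = summarize_sizes(sample_sizes)
print(summary)
```

A single very large repository pulls the average far above the median, which is why reporting both is useful for skewed data like this.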

  27. stability

  28. Connection reset by peer

  29. EC2 instance size
    RAM & CPU

  30. speed

  31. 100ms
    200ms

  32. 100–200ms
    from us
    2ms
    from EC2
    yay internet.

  33. Tooling
    ~100gb EBS volume
    bootstrapping instances
    running commands

  34. user-data scripts
    bare metal → ready to work
    http://alestic.com/2009/06/ec2-user-data-scripts

  35. # setup python and pip requirements
    curl -O http://python-distribute.org/distribut
    python distribute_setup.py
    easy_install pip
    pip install ipython tornado pyzmq
    pip install -r requirements.txt

  36. fabric
    http://fabric.readthedocs.org

  37. run “git clone” two thousand times
    make sure the directories exist
    load the list of top repos

  38. def clone_lang(lang, repos):
          """ Clone the most-watched repos for a given language """
          print('*** Cloning {} repositories...'.format(lang))
          langpath = REPOS_PATH.child(lang)
          local('mkdir -p {}'.format(langpath))
          for r in repos:
              user, reponame = r.split('/')
              userpath = langpath.child(user)
              repopath = userpath.child(reponame)
              if repopath.exists():
                  print('Skipping {}...'.format(r))
                  continue
              print('Cloning {}'.format(r))
              local('mkdir -p {}'.format(userpath))
              with lcd(userpath):
                  local('git clone https://github.com/{}.git'.format(r))

  39. fab clone

  40. Parsing and
    Filtering
    STEP 2

  41. Say Cheese!
    data → snapshot

  42. snap-!b"#$%#%
    thank you for shopping at
    Dana & Idan’s data emporium!
    We appreciate your business.

  43. Data
    snapshot → EBS volume

  44. working with the data
    Playtime!
    collaboration

  45. ~7k miles

  46. fab launch_ec2

  47. IPython notebook
    exposed to the outside
    http://:8000

  48. long-running stuff
    “oops I closed the browser tab”
    “oops I shut my laptop lid”

  49. we hear you are flush.
    can haz a pony?
    — kthxbai.
    Dear IPython devs

  50. terminal multiplexer
    tmux

  51. s h e l l s i n s i d e y o u r s h e l l s
    INCEPTION

  52. http://flic.kr/p/5FYT2j
    immortal
    python
    shell

  53. reattach to the tmux session
    see output
    pick up where we left off

  54. ghetto pair programming in the cloud
    Double your REPL,
    Double your fun
    attach to the same tmux session

  55. storing results

  56. (haters gonna hate.)

  57. no schema
    no JOINs
    no problem!

  58. Tell me more about
    your octocats.
    Cool story, bro sister!

  59. git != github
    network
    authors != github users
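
A commit records only an author name and email address, not a GitHub login, so counting authors means working from git metadata. The sketch below tallies commits per author from the text that `git log --format='%an <%ae>'` prints. Keying on the lowercased email is an assumption made for illustration, and `count_authors` is a hypothetical helper, not code from the talk.

```python
from collections import Counter


def count_authors(log_output):
    """Count commits per author, keyed by lowercased email address.

    Expects one 'Name <email>' entry per line, for example the output of
    `git log --format='%an <%ae>'`.
    """
    counts = Counter()
    for line in log_output.splitlines():
        line = line.strip()
        if not line or "<" not in line or not line.endswith(">"):
            continue  # skip blank or malformed lines
        email = line[line.rindex("<") + 1:-1].strip().lower()
        counts[email] += 1
    return counts


# Made-up log lines: the same person appears under two spellings.
example_log = (
    "Ada <ada@example.com>\n"
    "A. Lovelace <ADA@example.com>\n"
    "Bob <bob@example.com>\n"
)
print(count_authors(example_log))
```

One person can still commit under several email addresses, so a count like this overstates the number of distinct people, and none of the addresses is guaranteed to map to a GitHub account.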

  60. celery
    an asynchronous task queue
    with nifty features

  61. rate limiting
    @celery.task(rate_limit='5000/h')
    gotchas!
    pay attention to X-RateLimit-Remaining
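
A client-side rate limit alone does not guarantee the API quota is respected, so it helps to read the quota headers on each response. The sketch below decides how long to pause based on `X-RateLimit-Remaining` and `X-RateLimit-Reset` (an epoch timestamp), both of which GitHub sends. The function name `seconds_to_wait` and the `floor` parameter are hypothetical, introduced only for this illustration.

```python
import time


def seconds_to_wait(headers, now=None, floor=1):
    """Return how many seconds to pause before the next API call.

    `headers` is a mapping of response headers. A plain dict is
    case-sensitive, so the keys must be spelled exactly as below.
    `floor` is the number of remaining calls at which we start waiting.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", floor + 1))
    if remaining > floor:
        return 0.0  # plenty of quota left
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


# Made-up header values for illustration.
print(seconds_to_wait({"X-RateLimit-Remaining": "0",
                       "X-RateLimit-Reset": "1060"}, now=1000))
```

Inside a task, the returned value could be passed to `time.sleep` or used as a retry delay; which fits better depends on how the workers are set up.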

  62. celery in the cloud
    Heroku’s first dyno is free
    nobody says it has to be a web dyno…

  63. batteries included*
    redis for broker, result store
    “heroku run bash” to get a shell

  64. There’s no storage on Heroku, by design
    so why not...?

  65. Mining for
    Understanding
    STEP 3

  66. #,)!$,*%*
    commits

  67. *!,'"*
    authors

  68. "",!)*
    contributors

  69. http://en.wikipedia.org/wiki/File:Jamtlands_Flyg_EC120B_Colibri.JPG
    wat?

  70. everybody please stand still?
    yeah good luck with that.

  71. Idempotency
    Do Repeat Yourself
    (without shooting your own feet)

  72. Idempotency
    help others replicate your results
    peace of mind
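
One common way to make a processing step idempotent is to skip it when its output already exists and to write that output atomically. This is a minimal sketch under those assumptions; `run_once` and the JSON file layout are hypothetical and not taken from the talk.

```python
import json
import os
import tempfile


def run_once(output_path, compute):
    """Run `compute` and save its JSON result, unless it was saved before."""
    if os.path.exists(output_path):
        with open(output_path) as fh:
            return json.load(fh)  # reuse the earlier result
    result = compute()
    # Write to a temp file first so an interrupted run leaves no partial output.
    directory = os.path.dirname(output_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(result, fh)
    os.replace(tmp_path, output_path)
    return result
```

Calling `run_once` twice with the same path runs the computation only the first time, so a crashed or repeated job can simply be started again.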

  73. Break time!
    sudo go have a sandwich.
    up next
    Part II: Information to Meaning
    Idan Gazit, 3:15p

  74. Hi.
    up now
    Part II: Information to Meaning
    Idan Gazit, 3:15p

  75. co i oo !

  76. [slide without text]

  77. Dana Bauer
    @geography76
    Part I
    Data to Information
    Idan Gazit
    @idangazit

  78. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
    Part II
    Information to Meaning

  79. design is
    Constraints

  80. acquire
    parse
    filter
    mine
    represent
    refine
    interact

  81. http://flic.kr/p/9mz5hj http://flic.kr/p/FNgEL

  82. Given the complexity of data, using it to
    provide a meaningful solution requires
    insights from diverse fields: statistics,
    data mining, graphic design, and
    information visualization.
    Ben Fry
    from “Visualizing Data”

  83. Choosing a Visual
    Representation
    STEP 5

  84. http://flic.kr/p/dxyTt1

  85. [slide without text]

  86. http://flic.kr/p/7FH2Re

  87. http://flic.kr/p/7oYTTS

  88. http://flic.kr/p/5hiBsz

  89. [slide without text]

  90. Meaning

  91. Meaning
    requires context

  92. JavaScript
    Ruby
    Java
    Python
    Shell
    PHP
    C
    C++
    Perl
    Obj-C
    873y40817 234 098 14092 309812 39182742 48714209 81239 84127498023 873y408

  93. CHOOSING A
    Medium
    CHOOSING AN
    Audience

  94. One size fits nobody.

  95. know about github
    passing familiarity with
    code and development
    curious about our
    community

  96. Know thine audience.

  97. The Eye Candy Trap
    beautiful noise is still just noise

  98. on the
    Web
    For
    Geeks

  99. on the
    Web
    For
    Geeks
    using
    ?

  100. Logo for the modern computer.
    see also: processing.js

  101. D3 is…
    A data visualization toolkit for
    the web.

  102. mapping
    data
    onto
    meaningful
    visuals

  103. mapping
    data
    onto
    the
    DOM

  104. scales
    linear
    logarithmic
    quantile
    ordinal
    time
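
Of the scale types listed here, a logarithmic scale is the usual choice when values span several orders of magnitude, as repository sizes or commit counts often do. The following is a plain-Python sketch of the idea, not D3's implementation; `log_scale` is a hypothetical name.

```python
import math


def log_scale(domain, output_range):
    """Return a function mapping positive values onto `output_range`
    in proportion to their base-10 logarithm."""
    d0, d1 = domain
    r0, r1 = output_range
    if d0 <= 0 or d1 <= 0 or d0 == d1:
        raise ValueError("domain must be two distinct positive numbers")
    lo, hi = math.log10(d0), math.log10(d1)

    def scale(value):
        t = (math.log10(value) - lo) / (hi - lo)  # 0 at d0, 1 at d1
        return r0 + t * (r1 - r0)

    return scale


size_to_pixels = log_scale((1, 100), (0, 200))
print(size_to_pixels(10))
```

Each tenfold increase in the input moves the output by the same distance, so small and large values stay distinguishable on one axis.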

  105. // linear scale
    x = d3.scale.linear()
    .domain([0, 1.0])
    .range([0, 255]);
    x(0); // == 0
    x(0.5); // == 127.5
    x(1); // == 255
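
For readers more at home in Python, the same linear mapping can be written as a closure. This is a sketch of the arithmetic behind a linear scale, not a port of D3; `linear_scale` is a hypothetical name.

```python
def linear_scale(domain, output_range):
    """Return a function mapping `domain` linearly onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range
    if d0 == d1:
        raise ValueError("domain must span a nonzero interval")

    def scale(value):
        t = (value - d0) / (d1 - d0)  # fraction of the way through the domain
        return r0 + t * (r1 - r0)

    return scale


x = linear_scale((0, 1.0), (0, 255))
print(x(0), x(0.5), x(1))
```

The three calls mirror the ones on the slide and give the same values, returned as floats.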

  106. interpolation
    dimension
    position
    color
    orientation
    ... everything
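
Interpolation works the same way whatever the output is: pick a fraction `t` between 0 and 1 and blend two end values. As an illustration, here is a sketch that blends two RGB colors channel by channel. The function name is hypothetical, and blending in RGB space is an assumption chosen for simplicity.

```python
def interpolate_rgb(start, end, t):
    """Blend two (r, g, b) tuples; t=0 gives `start`, t=1 gives `end`."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


# Halfway between black and an arbitrary example color.
print(interpolate_rgb((0, 0, 0), (200, 100, 50), 0.5))
```

Swapping the tuples for positions, sizes or angles gives interpolation over the other properties the slide lists.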

  107. [Example chart: Apples and Bananas series, x-axis 2007 to 2010, y-axis 0 to 100]

  108. HTTP://NYTI.MS/WR1DHZ

  109. SVG
    not just an image

  110. JSON
    and
    CSV

  111. Intermediate
    Representations

  112. JSON
    easy to serialize to
    structured
    preserves datatypes
    bloated for tabular data

  113. CSV
    mostly easy to serialize to*
    structured
    comes out the other side like a dict
    everything comes back as a string
    *unicode, meh
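
The datatype difference between the two formats is easy to see by round-tripping one record through each with the standard library. The record below is made up for illustration.

```python
import csv
import io
import json

row = {"name": "django", "rank": 1, "watchers": 9000}  # made-up values

# JSON round trip: numbers stay numbers.
from_json = json.loads(json.dumps(row))

# CSV round trip: the reader yields a dict-like row of strings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_json["rank"]).__name__, type(from_csv["rank"]).__name__)
```

Anything read back from CSV therefore needs an explicit conversion step, which is one reason to keep a separate description of each column's type.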

  114. name
    user
    rank
    watchers
    commits
    size
    earliest commit

  115. [
      {
        "type": "string",
        "name": "name"
      },
      {
        "type": "int",
        "name": "rank",
        "extents": [
          0,
          199
        ]
      },
      ...
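
A column description like this one can be derived from the data itself. The sketch below builds the same kind of list from a sequence of dict rows, recording extents (minimum and maximum) for integer columns. `describe_columns` is a hypothetical helper, and treating every non-integer column as a string is a simplifying assumption.

```python
def describe_columns(rows):
    """Build {'type', 'name'[, 'extents']} descriptors from dict rows."""
    if not rows:
        return []
    descriptors = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            descriptors.append(
                {"type": "int", "name": name,
                 "extents": [min(values), max(values)]}
            )
        else:
            descriptors.append({"type": "string", "name": name})
    return descriptors


# Made-up rows for illustration.
example_rows = [{"name": "a", "rank": 0}, {"name": "b", "rank": 199}]
print(describe_columns(example_rows))
```

Precomputed extents let the browser set up scale domains without first scanning the whole dataset.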

  116. Caching

  117. on the
    Web
    For
    Geeks
    using
    D3.js

  118. What do we want to know?
    um, we don’t know that yet.

  119. Frustration.
    “Plan to throw one away.”
    some

  120. People
    Languages
    Repositories

  121. Live Demo

  122. Letting the Data
    Speak for Itself

  123. Responsive
    Visualizations

  124. People
    Languages
    Repositories

  125. visualization
    localization
    internationalization
    v11n
    l10n
    i18n
    totally a thing
    from now on

  126. Dana Bauer
    @geography76
    Idan Gazit
    @idangazit
