
Visualizing Github

Idan Gazit
March 15, 2013


A talk given at PyCon 2013 by Dana Bauer and Idan Gazit. See http://lanyrd.com/2013/pycon/scdywb/ and http://lanyrd.com/2013/pycon/scdywh/.




  1. None
  2. Dana Bauer @geography76 Idan Gazit @idangazit

  3. None
  4. JavaScript Ruby Java Python Shell PHP C C++ Perl Obj-C

    mapping meaning onto data
  5. oh hello there!

  6. None
  7. None
  8. ? data questions

  9. acquire parse filter mine represent refine interact

  10. acquire parse filter mine represent refine interact

  11. Dana Bauer @geography76 Part I @idangazit Data to Information Idan

  12. Dana Bauer @geography76 Idan Gazit @idangazit Part II Information to

  13. Acquiring the data STEP 1

  14. !."million users

  15. #.$million repos

  16. http://flic.kr/p/aZ4Z54

  17. http://flic.kr/p/4pdeJz

  18. a meaningful subset of

  19. github.com/languages

  20. API Whitelist Repo Cloning

  21. cloud amazon web services

  22. flexibility

  23. ????GB on disk

  24. %&GB on disk

  25. '&&GB on disk #& %#

  26. &.&#mb min '(.)mb median "(.*mb average 7,)'*mb max 100gb of

    repositories (webkit/WebKit)
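The min / median / average / max figures on this slide are the kind of summary that can be computed directly from the cloned directories. A minimal sketch, assuming each repository is a directory on disk; the function names `dir_size_mb` and `summarize` and the sample sizes are hypothetical, not taken from the talk:

```python
import os
import statistics


def dir_size_mb(path):
    """Total size of the regular files under path, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / (1024 * 1024)


def summarize(sizes_mb):
    """Return the four summary figures shown on the slide."""
    return {
        "min": min(sizes_mb),
        "median": statistics.median(sizes_mb),
        "average": statistics.mean(sizes_mb),
        "max": max(sizes_mb),
    }


# Made-up sizes in MB, purely for illustration (not the talk's numbers).
example = summarize([0.5, 2.0, 3.5, 10.0])
print(example)
```

A single very large repository pulls the average well above the median, which is why the slide reports both.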
  27. stability

  28. Connection reset by peer

  29. EC2 instance size RAM & CPU

  30. speed

  31. 100ms 200ms

  32. 100–200ms from us, 2ms from EC2. yay internet.

  33. Tooling ~100gb EBS volume bootstrapping instances running commands

  34. user-data scripts bare metal → ready to work http://alestic.com/2009/06/ec2-user-data-scripts

  35. # setup python and pip requirements
      curl -O http://python-distribute.org/distribut
      python distribute_setup.py
      easy_install pip
      pip install ipython tornado pyzmq
      pip install -r requirements.txt
  36. fabric http://fabric.readthedocs.org

  37. run “git clone” two thousand times make sure the directories

    exist load the list of top repos
  38. from fabric.api import local, lcd  # Fabric 1.x

      # REPOS_PATH is defined elsewhere; .child() suggests a unipath.Path

      def clone_lang(lang, repos):
          """Clone the most-watched repos for a given language."""
          print('*** Cloning {} repositories...'.format(lang))
          langpath = REPOS_PATH.child(lang)
          local('mkdir -p {}'.format(langpath))
          for r in repos:
              user, reponame = r.split('/')
              userpath = langpath.child(user)
              repopath = userpath.child(reponame)
              # Skip repos that are already on disk, so reruns are safe
              if repopath.exists():
                  print('Skipping {}...'.format(r))
                  continue
              print('Cloning {}'.format(r))
              local('mkdir -p {}'.format(userpath))
              with lcd(userpath):
                  local('git clone https://github.com/{}.git'.format(r))
  39. fab clone <enter>

  40. Parsing and Filtering STEP 2

  41. Say Cheese! data → snapshot

  42. snap-!b"#$%#% thank you for shopping at Dana & Idan’s data

    emporium! We appreciate your business.
  43. Data snapshot → EBS volume

  44. working with the data Playtime! collaboration

  45. ~7k miles

  46. fab launch_ec2 <enter>

  47. IPython notebook exposed to the outside: http://<ec2 address>:8000

  48. “oops I closed the browser tab” long-running stuff “oops I

    shut my laptop lid”
  49. Dear IPython devs: we hear you are flush. can haz a pony? kthxbai.

  50. terminal multiplexer tmux

  51. shells inside your shells INCEPTION

  52. http://flic.kr/p/5FYT2j immortal python shell

  53. see output reattach to the tmux session pick up where

    we left off
  54. ghetto pair programming in the cloud Double your REPL, Double

    your fun attach to the same tmux session
  55. storing results

  56. (haters gonna hate.)

  57. no JOINs no problem! no schema

  58. Tell me more about your octocats. Cool story, bro sister!

  59. git != github network authors != github users
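One way to see the gap between git authors and GitHub users is to count the distinct author emails in a repository's history. A minimal sketch, assuming the input is the text produced by `git log --format=%ae` (one author email per line); the function name `distinct_authors` and the sample addresses are hypothetical:

```python
def distinct_authors(log_output):
    """Return the set of distinct author emails, compared case-insensitively.

    log_output is assumed to be the text of `git log --format=%ae`.
    """
    return {
        line.strip().lower()
        for line in log_output.splitlines()
        if line.strip()
    }


# Made-up log output for illustration only.
sample_log = "Ada@example.com\nada@example.com\n\nbob@example.com\n"
authors = distinct_authors(sample_log)
print(len(authors))
```

The count is only an approximation: one person may commit under several emails, and an email in the history need not belong to any GitHub account.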

  60. an asynchronous task queue celery with nifty features

  61. rate limiting: @celery.task(rate_limit='5000/h') gotchas! pay attention to X-RateLimit-Remaining
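Paying attention to `X-RateLimit-Remaining` can be as simple as checking the header after every response and pausing once the budget is nearly spent. A minimal sketch, assuming the response headers are available as a mapping and that the API also sends `X-RateLimit-Reset` as a UTC epoch timestamp (as the GitHub API does today); the function name `seconds_to_wait` and the `floor` parameter are hypothetical:

```python
import time


def seconds_to_wait(headers, now=None, floor=1):
    """Return how long to pause before the next API call.

    headers: mapping of response header names to string values.
    floor:   stop calling once this many requests (or fewer) remain.
    """
    # If the header is missing, assume there is still budget left.
    remaining = int(headers.get("X-RateLimit-Remaining", floor + 1))
    if remaining > floor:
        return 0.0
    now = time.time() if now is None else now
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    # Never return a negative wait if the reset time already passed.
    return max(0.0, reset_at - now)


# Made-up header values for illustration only.
wait = seconds_to_wait(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1100"}, now=1000
)
print(wait)
```

A plain `dict` lookup is case-sensitive, so the header names must match exactly; HTTP client libraries often provide a case-insensitive mapping instead.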

  62. Heroku’s first dyno is free celery in the cloud nobody

    says it has to be a web dyno…
  63. redis for broker, result store batteries included* “heroku run bash”

    to get a shell
  64. There’s no storage on Heroku, by design so why not...?

  65. Mining for Understanding STEP 3

  66. #,)!$,*%* commits

  67. *!,'"* authors

  68. "",!)* contributors

  69. http://en.wikipedia.org/wiki/File:Jamtlands_Flyg_EC120B_Colibri.JPG wat?

  70. yeah good luck with that. everybody please stand still?

  71. Do Repeat Yourself Idempotency (without shooting your own feet)

  72. Idempotency help others replicate your results peace of mind
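The cloning task earlier in the deck is idempotent because it skips repositories that already exist on disk. The same idea can be sketched for any step by writing a marker file once the step succeeds. This is an illustration under stated assumptions, not the speakers' code; `run_once` and the marker file name are hypothetical:

```python
import os
import tempfile


def run_once(marker_path, action):
    """Run action() only if marker_path does not exist yet.

    Returns True if the action ran, False if it was skipped.
    """
    if os.path.exists(marker_path):
        return False
    action()
    # Write the marker only after the action finished without raising,
    # so a failed run is retried next time.
    with open(marker_path, "w") as marker_file:
        marker_file.write("done\n")
    return True


calls = []
with tempfile.TemporaryDirectory() as workdir:
    marker = os.path.join(workdir, "clone.done")
    first = run_once(marker, lambda: calls.append("cloned"))
    second = run_once(marker, lambda: calls.append("cloned"))
print(first, second, calls)
```

Running the whole pipeline twice then costs almost nothing, which is what makes "Do Repeat Yourself" safe.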

  73. Break time! sudo go have a sandwich. Part II: Information

    to Meaning up next Idan Gazit, 3:15p
  74. Part II: Information to Meaning up now Idan Gazit, 3:15p

  75. co i oo !

  76. None
  77. Dana Bauer @geography76 Part I @idangazit Data to Information Idan

  78. Dana Bauer @geography76 Idan Gazit @idangazit Part II Information to

  79. design is Constraints

  80. acquire parse filter mine represent refine interact

  81. http://flic.kr/p/9mz5hj http://flic.kr/p/FNgEL

  82. Given the complexity of data, using it to provide a

    meaningful solution requires insights from diverse fields: statistics, data mining, graphic design, and information visualization. Ben Fry from “Visualizing Data”
  83. Choosing a Visual Representation STEP 5

  84. http://flic.kr/p/dxyTt1

  85. None
  86. http://flic.kr/p/7FH2Re

  87. http://flic.kr/p/7oYTTS

  88. http://flic.kr/p/5hiBsz

  89. None
  90. Meaning

  91. Meaning requires context

  92. JavaScript Ruby Java Python Shell PHP C C++ Perl Obj-C

    873y40817 234 098 14092 309812 39182742 48714209 81239 84127498023 873y408
  93. CHOOSING A Medium CHOOSING AN Audience

  94. One size fits nobody.

  95. know about github passing familiarity with code and development curious

    about our community
  96. Know thine audience.

  97. The Eye Candy Trap: beautiful noise is still just

  98. on the Web For Geeks

  99. on the Web For Geeks using ?

  100. Logo for the modern computer. see also: processing.js

  101. A data visualization toolkit for the web. D3 is…

  102. data mapping onto meaningful visuals

  103. data mapping onto the DOM

  104. scales linear logarithmic quantile ordinal time

  105. // linear scale
       x = d3.scale.linear()
           .domain([0, 1.0])
           .range([0, 255]);
       x(0);   // == 0
       x(0.5); // == 127.5
       x(1);   // == 255
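The arithmetic behind a linear scale is plain linear interpolation from a domain onto a range. A minimal sketch of that mapping in Python, for readers who want to see the computation itself; `linear_scale` is a hypothetical helper, not part of D3:

```python
def linear_scale(domain, output_range):
    """Return a function mapping values in domain linearly onto output_range."""
    d0, d1 = domain
    r0, r1 = output_range
    if d0 == d1:
        raise ValueError("domain must span a non-zero interval")

    def scale(value):
        # Fraction of the way through the domain, then the same
        # fraction of the way through the output range.
        t = (value - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


x = linear_scale((0, 1.0), (0, 255))
print(x(0), x(0.5), x(1))
```

With the same domain and range as the slide, the three calls give 0, 127.5 and 255.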
  106. interpolation dimension position color orientation ... everything

  107. 0 25 50 75 100 2007 2008 2009 2010 Apples


  109. SVG not just an image

  110. JSON CSV and

  111. Intermediate Representations

  112. JSON easy to serialize to structured preserves datatypes bloated for

    tabular data
  113. CSV mostly easy to serialize to* structured comes out the

    other side like a dict everything comes back as a string *unicode, meh
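The difference between the two formats shows up in a round trip: JSON hands the integers back as integers, while CSV hands everything back as a string. A minimal sketch using only the standard library; the row values are made up for illustration:

```python
import csv
import io
import json

# Hypothetical row; the field names echo the columns listed in the deck.
row = {"name": "django", "rank": 3, "watchers": 1200}

# JSON round trip: datatypes are preserved.
json_back = json.loads(json.dumps(row))

# CSV round trip: the row comes back like a dict, but every value is a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_back = next(csv.DictReader(buffer))

print(json_back)
print(csv_back)
```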
  114. name user rank watchers commits size earliest commit

  115. [ { "type": "string", "name": "name" },
         { "type": "int", "name": "rank", "extents": [ 0, 199 ] },
         ...
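A column description like this one is enough to turn the all-strings rows of a CSV back into typed values and to compute each numeric column's extents. A minimal sketch, assuming only the two type names visible on the slide; `CASTS`, `apply_schema` and `extents` are hypothetical names, and the rows are made up:

```python
# Map the schema's type names to Python casts (only the two types shown).
CASTS = {"string": str, "int": int}


def apply_schema(rows, schema):
    """Cast each string value in rows according to the column schema."""
    return [
        {col["name"]: CASTS[col["type"]](row[col["name"]]) for col in schema}
        for row in rows
    ]


def extents(rows, name):
    """Return [min, max] of one column, as in the schema's "extents" field."""
    values = [row[name] for row in rows]
    return [min(values), max(values)]


schema = [
    {"type": "string", "name": "name"},
    {"type": "int", "name": "rank"},
]
# Made-up rows, as a CSV reader would return them (all strings).
raw_rows = [{"name": "a", "rank": "0"}, {"name": "b", "rank": "199"}]
typed_rows = apply_schema(raw_rows, schema)
print(extents(typed_rows, "rank"))
```

Shipping the extents alongside the data means the browser can set up its scales without scanning every row first.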
  116. Caching

  117. on the Web For Geeks using D3.js

  118. What do we want to know? um, we don’t know

    that yet.
  119. some Frustration. “Plan to throw one away.”

  120. People Languages Repositories

  121. Live Demo

  122. Letting the Data Speak for Itself

  123. Responsive Visualizations

  124. People Languages Repositories

  125. localization l10n, internationalization i18n, visualization v11n: totally a thing from now on

  126. Dana Bauer @geography76 Idan Gazit @idangazit