Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Dana Bauer @geography76 Idan Gazit @idangazit

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

JavaScript Ruby Java Python Shell PHP C C++ Perl Obj-C mapping meaning onto data

Slide 5

Slide 5 text

oh hello there!

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

? data questions

Slide 9

Slide 9 text

acquire parse filter mine represent refine interact

Slide 10

Slide 10 text

acquire parse filter mine represent refine interact

Slide 11

Slide 11 text

Dana Bauer @geography76 Idan Gazit @idangazit Part I Data to Information

Slide 12

Slide 12 text

Dana Bauer @geography76 Idan Gazit @idangazit Part II Information to Meaning

Slide 13

Slide 13 text

Acquiring the data STEP 1

Slide 14

Slide 14 text

!."million users

Slide 15

Slide 15 text

#.$million repos

Slide 16

Slide 16 text

http://flic.kr/p/aZ4Z54

Slide 17

Slide 17 text

http://flic.kr/p/4pdeJz

Slide 18

Slide 18 text

a meaningful subset of

Slide 19

Slide 19 text

github.com/languages

Slide 20

Slide 20 text

API Whitelist Repo Cloning

Slide 21

Slide 21 text

cloud amazon web services

Slide 22

Slide 22 text

flexibility

Slide 23

Slide 23 text

????GB on disk

Slide 24

Slide 24 text

%&GB on disk

Slide 25

Slide 25 text

'&&GB on disk #& %#

Slide 26

Slide 26 text

&.mb min '(.)mb median "(.*mb average 7,)'*mb max 100gb of repositories (webkit/WebKit)

Slide 27

Slide 27 text

stability

Slide 28

Slide 28 text

Connection reset by peer

Slide 29

Slide 29 text

EC2 instance size RAM & CPU

Slide 30

Slide 30 text

speed

Slide 31

Slide 31 text

100ms 200ms

Slide 32

Slide 32 text

100–200ms from us, 2ms from EC2. yay internet.

Slide 33

Slide 33 text

Tooling: ~100gb EBS volume, bootstrapping instances, running commands

Slide 34

Slide 34 text

user-data scripts bare metal → ready to work http://alestic.com/2009/06/ec2-user-data-scripts

Slide 35

Slide 35 text

# setup python and pip requirements
curl -O http://python-distribute.org/distribut
python distribute_setup.py
easy_install pip
pip install ipython tornado pyzmq
pip install -r requirements.txt

Slide 36

Slide 36 text

fabric http://fabric.readthedocs.org

Slide 37

Slide 37 text

load the list of top repos, make sure the directories exist, run “git clone” two thousand times

Slide 38

Slide 38 text

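from fabric.api import lcd, local
# REPOS_PATH is assumed to be defined elsewhere in the fabfile as a path
# object with .child() and .exists() (for example a unipath Path).
def clone_lang(lang, repos):
    """
    Clone the most-watched repos for a given language
    """
    print('*** Cloning {} repositories...'.format(lang))
    langpath = REPOS_PATH.child(lang)
    local('mkdir -p {}'.format(langpath))
    for r in repos:
        # Each entry looks like "user/reponame".
        user, reponame = r.split('/')
        userpath = langpath.child(user)
        repopath = userpath.child(reponame)
        if repopath.exists():
            # Already cloned on a previous run, so re-running is safe.
            print('Skipping {}...'.format(r))
            continue
        print('Cloning {}'.format(r))
        local('mkdir -p {}'.format(userpath))
        with lcd(userpath):
            local('git clone https://github.com/{}.git'.format(r))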

Slide 39

Slide 39 text

fab clone

Slide 40

Slide 40 text

Parsing and Filtering STEP 2

Slide 41

Slide 41 text

Say Cheese! data → snapshot

Slide 42

Slide 42 text

snap-!b"#$%#% thank you for shopping at Dana & Idan’s data emporium! We appreciate your business.

Slide 43

Slide 43 text

Data snapshot → EBS volume

Slide 44

Slide 44 text

working with the data Playtime! collaboration

Slide 45

Slide 45 text

~7k miles

Slide 46

Slide 46 text

fab launch_ec2

Slide 47

Slide 47 text

IPython notebook exposed to the outside http://:8000

Slide 48

Slide 48 text

“oops I closed the browser tab” long-running stuff “oops I shut my laptop lid”

Slide 49

Slide 49 text

Dear IPython devs: we hear you are flush. can haz a pony? kthxbai.

Slide 50

Slide 50 text

tmux: terminal multiplexer

Slide 51

Slide 51 text

INCEPTION: shells inside your shells

Slide 52

Slide 52 text

http://flic.kr/p/5FYT2j immortal python shell

Slide 53

Slide 53 text

reattach to the tmux session, see output, pick up where we left off

Slide 54

Slide 54 text

ghetto pair programming in the cloud Double your REPL, Double your fun attach to the same tmux session

Slide 55

Slide 55 text

storing results

Slide 56

Slide 56 text

(haters gonna hate.)

Slide 57

Slide 57 text

no schema, no JOINs, no problem!

Slide 58

Slide 58 text

Tell me more about your octocats. Cool story, bro sister!

Slide 59

Slide 59 text

git != github network authors != github users

Slide 60

Slide 60 text

celery: an asynchronous task queue with nifty features

Slide 61

Slide 61 text

rate limiting gotchas! @celery.task(rate_limit='5000/h') pay attention to X-RateLimit-Remaining
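The advice to watch X-RateLimit-Remaining can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code used in the talk: the helper name `seconds_to_wait` and the `floor` threshold are hypothetical. The header names `X-RateLimit-Remaining` and `X-RateLimit-Reset` are the ones the GitHub API returns, with the reset time given in seconds since the epoch.

```python
import time


def seconds_to_wait(headers, now=None, floor=10):
    """Return how long to pause before making the next API call.

    `headers` is a dict of response headers, keyed exactly as shown below.
    `floor` (a hypothetical knob) is the number of remaining requests at or
    below which we stop and wait for the rate-limit window to reset.
    """
    now = time.time() if now is None else now
    # If the header is missing, assume there is still budget left.
    remaining = int(headers.get('X-RateLimit-Remaining', floor + 1))
    if remaining > floor:
        return 0.0
    # The reset header is an epoch timestamp; never return a negative wait.
    reset_at = float(headers.get('X-RateLimit-Reset', now))
    return max(0.0, reset_at - now)
```

A task could call this after each response and sleep (or retry later) for the returned number of seconds, on top of whatever client-side `rate_limit` is configured.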

Slide 62

Slide 62 text

celery in the cloud: Heroku’s first dyno is free, and nobody says it has to be a web dyno…

Slide 63

Slide 63 text

redis for broker, result store batteries included* “heroku run bash” to get a shell

Slide 64

Slide 64 text

There’s no storage on Heroku, by design so why not...?

Slide 65

Slide 65 text

Mining for Understanding STEP 3

Slide 66

Slide 66 text

#,)!$,*%* commits

Slide 67

Slide 67 text

*!,'"* authors

Slide 68

Slide 68 text

"",!)* contributors

Slide 69

Slide 69 text

http://en.wikipedia.org/wiki/File:Jamtlands_Flyg_EC120B_Colibri.JPG wat?

Slide 70

Slide 70 text

everybody please stand still? yeah good luck with that.

Slide 71

Slide 71 text

Idempotency: Do Repeat Yourself (without shooting your own feet)

Slide 72

Slide 72 text

Idempotency: help others replicate your results, peace of mind
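One way to get this kind of idempotency is to record that a step has finished and skip it on later runs, in the same spirit as the repopath.exists() check in clone_lang. The following is a minimal sketch, not the talk's code: the helper `run_once` and the marker-file convention are hypothetical.

```python
import os
import tempfile


def run_once(marker_path, step):
    """Run `step()` only if `marker_path` does not exist yet.

    Returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(marker_path):
        return False
    step()
    # Record success only after the step finished without raising, so a
    # failed step is retried on the next run.
    with open(marker_path, 'w') as fh:
        fh.write('done\n')
    return True


# Demo: the second call is skipped because the marker already exists.
workdir = tempfile.mkdtemp()
marker = os.path.join(workdir, 'clone.done')
calls = []
first = run_once(marker, lambda: calls.append('cloned'))
second = run_once(marker, lambda: calls.append('cloned'))
```

Running the whole pipeline again then costs little and cannot duplicate work, which is what makes "do repeat yourself" safe.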

Slide 73

Slide 73 text

Break time! sudo go have a sandwich. Part II: Information to Meaning up next Idan Gazit, 3:15p

Slide 74

Slide 74 text

Part II: Information to Meaning up now Idan Gazit, 3:15p Hi.

Slide 75

Slide 75 text

co i oo !

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Dana Bauer @geography76 Idan Gazit @idangazit Part I Data to Information

Slide 78

Slide 78 text

Dana Bauer @geography76 Idan Gazit @idangazit Part II Information to Meaning

Slide 79

Slide 79 text

design is Constraints

Slide 80

Slide 80 text

acquire parse filter mine represent refine interact

Slide 81

Slide 81 text

http://flic.kr/p/9mz5hj http://flic.kr/p/FNgEL

Slide 82

Slide 82 text

Given the complexity of data, using it to provide a meaningful solution requires insights from diverse fields: statistics, data mining, graphic design, and information visualization. Ben Fry from “Visualizing Data”

Slide 83

Slide 83 text

Choosing a Visual Representation STEP 5

Slide 84

Slide 84 text

http://flic.kr/p/dxyTt1

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

http://flic.kr/p/7FH2Re

Slide 87

Slide 87 text

http://flic.kr/p/7oYTTS

Slide 88

Slide 88 text

http://flic.kr/p/5hiBsz

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

Meaning

Slide 91

Slide 91 text

Meaning requires context

Slide 92

Slide 92 text

JavaScript Ruby Java Python Shell PHP C C++ Perl Obj-C 873y40817 234 098 14092 309812 39182742 48714209 81239 84127498023 873y408

Slide 93

Slide 93 text

CHOOSING A Medium CHOOSING AN Audience

Slide 94

Slide 94 text

One size fits nobody.

Slide 95

Slide 95 text

know about github; passing familiarity with code and development; curious about our community

Slide 96

Slide 96 text

Know thine audience.

Slide 97

Slide 97 text

The Eye Candy Trap: beautiful noise is still just noise

Slide 98

Slide 98 text

on the Web For Geeks

Slide 99

Slide 99 text

on the Web For Geeks using ?

Slide 100

Slide 100 text

Logo for the modern computer. see also: processing.js

Slide 101

Slide 101 text

D3 is… A data visualization toolkit for the web.

Slide 102

Slide 102 text

mapping data onto meaningful visuals

Slide 103

Slide 103 text

mapping data onto the DOM

Slide 104

Slide 104 text

scales: linear, logarithmic, quantile, ordinal, time

Slide 105

Slide 105 text

// linear scale
x = d3.scale.linear()
    .domain([0, 1.0])
    .range([0, 255]);
x(0);   // == 0
x(0.5); // == 127.5
x(1);   // == 255
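To show what a linear scale computes, here is a minimal Python sketch of the same mapping. It is an illustration only and not D3's implementation: the function name `linear_scale` is hypothetical, and it covers only a two-point domain and range.

```python
def linear_scale(domain, rng):
    """Return a function mapping values from `domain` onto `rng` linearly."""
    d0, d1 = domain
    r0, r1 = rng
    if d0 == d1:
        raise ValueError('domain must span a non-zero interval')

    def scale(value):
        # Normalize the value to 0..1 within the domain, then stretch that
        # fraction across the output range.
        t = (value - d0) / float(d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


# Same domain and range as the slide: 0..1.0 mapped onto 0..255.
x = linear_scale((0, 1.0), (0, 255))
```

With this sketch, `x(0)`, `x(0.5)` and `x(1)` give 0.0, 127.5 and 255.0, matching the values on the slide. Inputs outside the domain are extrapolated rather than clamped.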

Slide 106

Slide 106 text

interpolation: dimension, position, color, orientation, ... everything

Slide 107

Slide 107 text

[Chart: Apples and Bananas, 2007 to 2010, value axis 0 to 100]

Slide 108

Slide 108 text

HTTP://NYTI.MS/WR1DHZ

Slide 109

Slide 109 text

SVG not just an image

Slide 110

Slide 110 text

JSON and CSV

Slide 111

Slide 111 text

Intermediate Representations

Slide 112

Slide 112 text

JSON: easy to serialize to; structured; preserves datatypes; bloated for tabular data

Slide 113

Slide 113 text

CSV: mostly easy to serialize to*; structured; comes out the other side like a dict; everything comes back as a string (*unicode, meh)

Slide 114

Slide 114 text

name user rank watchers commits size earliest commit

Slide 115

Slide 115 text

[
  { "type": "string", "name": "name" },
  { "type": "int", "name": "rank", "extents": [ 0, 199 ] },
  ...
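A schema like this can repair the "everything comes back as a string" problem with CSV. The sketch below assumes the schema is stored as a JSON sidecar in the shape shown on the slide. The helper `typed_rows`, the `CASTS` table and the two sample rows are hypothetical and made up for illustration, and only the "string" and "int" types seen on the slide are handled.

```python
import csv
import io
import json

# Hypothetical sidecar schema, in the shape shown on the slide.
SCHEMA = json.loads('''
[
  {"type": "string", "name": "name"},
  {"type": "int", "name": "rank", "extents": [0, 199]}
]
''')

# Map the schema's type names to Python casts (only the two seen above).
CASTS = {'string': str, 'int': int}


def typed_rows(csv_text, schema):
    """Yield CSV rows as dicts, casting each field as the schema says."""
    casts = {col['name']: CASTS[col['type']] for col in schema}
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {name: cast(row[name]) for name, cast in casts.items()}


# Made-up sample data: two rows with a string column and an int column.
rows = list(typed_rows('name,rank\nrepo-a,3\nrepo-b,12\n', SCHEMA))
```

Here `rank` comes back as an integer instead of a string. The "extents" entry presumably records the column's minimum and maximum, which would be convenient as the domain of a scale, though the slide does not say so explicitly.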

Slide 116

Slide 116 text

Caching

Slide 117

Slide 117 text

on the Web For Geeks using D3.js

Slide 118

Slide 118 text

What do we want to know? um, we don’t know that yet.

Slide 119

Slide 119 text

Some frustration. “Plan to throw one away.”

Slide 120

Slide 120 text

People Languages Repositories

Slide 121

Slide 121 text

Live Demo

Slide 122

Slide 122 text

Letting the Data Speak for Itself

Slide 123

Slide 123 text

Responsive Visualizations

Slide 124

Slide 124 text

People Languages Repositories

Slide 125

Slide 125 text

internationalization: i18n, localization: l10n, visualization: v11n. totally a thing from now on

Slide 126

Slide 126 text

Dana Bauer @geography76 Idan Gazit @idangazit