Slide 1

Slide 1 text

Overcoming Big Data Apathy Matthew Kirk - Modulus 7

Slide 2

Slide 2 text

Next on Data Hoarding

Slide 3

Slide 3 text

Compulsive Hoarding: the excessive acquisition of, and inability or unwillingness to discard, large quantities of objects

Slide 4

Slide 4 text

Big Data

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

80/20 Principle: 80% of the outcome is determined by 20% of the input

Slide 7

Slide 7 text

The Whirlwind Tour • Backwards induction • Visualizing > 3 dimensions • Determining variable importance using random forests

Slide 8

Slide 8 text

Many Roads to Travel

Slide 9

Slide 9 text

Backwards Induction

Slide 10

Slide 10 text

AARRR! Acquisition, Activation, Retention, Referral, Revenue (Dave McClure)

Slide 11

Slide 11 text

Acquisition Tree

Slide 12

Slide 12 text

Map the Relationship

Slide 13

Slide 13 text

Correlation != Causation: If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?

Slide 14

Slide 14 text

Visualizing more than 3d

Slide 15

Slide 15 text

Many solutions • Color • Tables • Glyph plots • Scatterplot Matrices

Slide 16

Slide 16 text

Start with Tables

Slide 17

Slide 17 text

Color

Slide 18

Slide 18 text

Scatterplot Matrices

Slide 19

Slide 19 text

Chernoff Faces

Slide 20

Slide 20 text

Shoes.rb Chernoff Faces Follow along at http://github.com/hexgnu/chernoff
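The drawing code lives in the linked repo; the heart of a Chernoff face is just a mapping from data columns onto facial features. A minimal Ruby sketch of that mapping (the feature names and scaling are my own illustration, not taken from the repo):

```ruby
# Map each data column onto one facial feature, scaled into [0, 1]
# against the column's min/max. These feature names are hypothetical.
FEATURES = [:face_width, :eye_size, :mouth_curve, :nose_length]

def chernoff_params(row, mins, maxs)
  FEATURES.each_with_index.to_h do |feature, i|
    range = maxs[i] - mins[i]
    # Guard against constant columns; otherwise normalize the value.
    [feature, range.zero? ? 0.5 : (row[i] - mins[i]).fdiv(range)]
  end
end

params = chernoff_params([5, 2, 8, 1], [0, 0, 0, 0], [10, 4, 8, 2])
# mouth_curve comes out at 1.0 (its value sits at the column maximum),
# so this row's face gets the biggest possible smile.
```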

Slide 21

Slide 21 text

How can we apply 80/20 to Variables?

Slide 22

Slide 22 text

Classification And Regression Trees
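The core mechanic of a CART node, picking the split that best separates the classes, fits in a few lines of Ruby. A toy sketch (data and threshold search are illustrative; real CART implementations do much more, e.g. pruning and multi-feature search):

```ruby
# Gini impurity: 0.0 means the group is pure (one class only).
def gini(labels)
  return 0.0 if labels.empty?
  counts = labels.tally.values
  1.0 - counts.sum { |c| (c.to_f / labels.size)**2 }
end

# rows: pairs of [feature_value, class_label].
# Try each value as a threshold; keep the split with the lowest
# weighted Gini impurity.
def best_split(rows)
  best = nil
  rows.map(&:first).uniq.sort.each do |threshold|
    left, right = rows.partition { |v, _| v <= threshold }
    score = (left.size * gini(left.map(&:last)) +
             right.size * gini(right.map(&:last))) / rows.size.to_f
    best = [threshold, score] if best.nil? || score < best[1]
  end
  best
end

data = [[1, 0], [2, 0], [3, 1], [4, 1]]
threshold, impurity = best_split(data)
# Splitting at <= 2 separates the classes perfectly (weighted Gini 0.0).
```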

Slide 23

Slide 23 text

Random Forests • Pick a random subset of the data • Grow a tree that splits it into groups and subgroups • Build many trees this way • Use the held-out data (not in the subset) to determine the best classification • The tree with the most predictive power wins
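The "pick a subset, test on the rest" step above is bootstrap sampling with out-of-bag evaluation. A minimal Ruby sketch of just that sampling step (names are mine):

```ruby
# Draw a bootstrap sample: same size as the data, sampled WITH
# replacement. Rows never drawn are "out of bag" (OOB) and serve
# as free test data for the tree trained on the sample.
def bootstrap_sample(rows, rng)
  sample = Array.new(rows.size) { rows[rng.rand(rows.size)] }
  oob = rows - sample
  [sample, oob]
end

rng = Random.new(42)
rows = (1..10).to_a
sample, oob = bootstrap_sample(rows, rng)
# On average about a third of the rows land out of bag.
```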

Slide 24

Slide 24 text

Nimbus • nimbusgem.org • Designed for bioinformatics and genetics

Slide 25

Slide 25 text

Example using Nimbus • What makes Matt retweet something? • ~1800 tweets • A retweet is classified as 1 and everything else as 0 • Code is up at http://github.com/hexgnu/tweet-randomforest

Slide 26

Slide 26 text

Determining Variable Importance • importance = (permuted_snp_errors / oob_size) - generalization_error • permuted_snp_errors = sum of classifications the tree misses when that variable's values are permuted • oob_size (out-of-bag size) = the number of data points not used to build the tree • generalization_error = sum of errors for the entire forest
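The formula above is easy to sanity-check with toy numbers: if permuting a variable makes the tree miss many more out-of-bag cases than the forest's baseline error, that variable carries real signal. A sketch (the numbers are made up for illustration):

```ruby
# importance = (permuted_errors / oob_size) - generalization_error
def variable_importance(permuted_errors, oob_size, generalization_error)
  permuted_errors.fdiv(oob_size) - generalization_error
end

# Permuting a strong variable breaks 30 of 100 OOB predictions,
# against a 0.10 baseline error rate: importance is roughly 0.2.
strong = variable_importance(30, 100, 0.10)

# Permuting a weak variable barely moves the error: near zero.
weak = variable_importance(11, 100, 0.10)
```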

Slide 27

Slide 27 text

Minimum Depth • Can be used in a survival diagram where the endpoint determines the classification • Rank variables by their depth in the tree
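The ranking idea can be sketched with toy trees: a variable that first appears near the root is more important. Here each node is a nested array [variable, left, right] with nil leaves (this representation is my own illustration, not Nimbus's):

```ruby
# Return the depth at which `variable` first splits the tree,
# or nil if it never appears.
def first_depth(tree, variable, depth = 0)
  return nil if tree.nil?
  var, left, right = tree
  return depth if var == variable
  [first_depth(left, variable, depth + 1),
   first_depth(right, variable, depth + 1)].compact.min
end

# Hypothetical tweet features: follower count splits at the root,
# link presence only two levels down.
tree = [:followers, [:has_mention, nil, nil],
                    [:hour, [:has_link, nil, nil], nil]]
first_depth(tree, :followers)  # depth 0 -- most important here
first_depth(tree, :has_link)   # depth 2
```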

Slide 28

Slide 28 text

Results from running data set • According to both Variable Importance and Minimum Depth I am more likely to retweet tweets with: • @bentlygtcspeed • @sophthewiseone (this is my wife) • @avdi • @mattmight

Slide 29

Slide 29 text

Example tweets I retweeted • BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?" • Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...

Slide 30

Slide 30 text

More example tweets • Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!! • Mattmight: "More data" does little good unless it becomes "more information."

Slide 31

Slide 31 text

Now that you know: rinse and repeat

Slide 32

Slide 32 text

Conclusion • Start with the end in mind • Visualize the important stuff • Use classification trees to pick variables

Slide 33

Slide 33 text

Matt Kirk @mjkirk modulus7.com

Slide 34

Slide 34 text

Photo Credits
http://www.flickr.com/photos/theophileescargot/5950985345/
http://rickmanelius.com/article/do-you-dread-emails
http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-cut-the-bullshit/
http://upload.wikimedia.org/wikipedia/commons/7/7d/LogisticMap_BifurcationDiagram.png
http://www.ewp.rpi.edu/hartford/~stoddj/BE/Image29.gif
http://www.flickr.com/photos/earlg/160807760/
http://www.flickr.com/photos/aigle_dore/5626287267/
http://www.flickr.com/photos/rubenholthuijsen/7430874638/