Matt Kirk
February 28, 2013

# Overcoming Big Data Apathy

## Transcript

1. Overcoming Big Data Apathy
Matthew Kirk - Modulus 7

2. Next on Data Hoarding

3. Compulsive Hoarding
Excessive acquisition of, and inability or unwillingness to discard, large quantities of objects

4. Big Data

5. 80/20 Principle
80% of the outcome is determined by 20% of the input
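
As a rough illustration of the principle, the sketch below finds the smallest share of inputs that accounts for a target share of the total outcome. The function name `smallest_share_covering` and the revenue figures are invented for this example and are not from the talk.

```python
def smallest_share_covering(contributions, target=0.8):
    """Return the fraction of inputs needed to reach `target` of the total."""
    ordered = sorted(contributions, reverse=True)  # biggest contributors first
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= target * total:
            return count / len(ordered)
    return 1.0

# Hypothetical revenue per customer: two large accounts and eight small ones.
revenue = [520, 310, 40, 30, 25, 25, 20, 15, 10, 5]
share = smallest_share_covering(revenue)
print(f"{share:.0%} of customers produce at least 80% of revenue")
```

With these made-up numbers, two of the ten customers already cover more than 80% of the total.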

6. The Whirlwind Tour
• Backwards induction
• Visualizing > 3 dimensions
• Determining variable importance using random forests

8. Backwards Induction

9. AARRR!
Acquisition
Activation
Retention
Referral
Revenue
Dave McClure

10. Acquisition Tree

11. Map the Relationship

12. Correlation != Causation
If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?
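
A small simulation can make the coffee example concrete. In this sketch, a hidden variable (whether it is morning) drives both coffee drinking and sunshine. The probabilities and variable names are assumptions chosen for illustration only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
# Hidden common cause: 1 if the observation was taken in the morning.
morning = [rng.randint(0, 1) for _ in range(1000)]
# Coffee and sunshine each depend on the hour, never on each other.
coffee = [1 if m and rng.random() < 0.9 else 0 for m in morning]
sunshine = [1 if m and rng.random() < 0.8 else 0 for m in morning]

overall = pearson(coffee, sunshine)

# Hold the common cause fixed by looking at mornings only.
pairs = [(c, s) for m, c, s in zip(morning, coffee, sunshine) if m]
within = pearson([c for c, _ in pairs], [s for _, s in pairs])

print(f"overall: {overall:.2f}, mornings only: {within:.2f}")
```

Coffee and sunshine are strongly correlated overall, but once the common cause is held fixed the correlation largely disappears.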

13. Visualizing more than 3D

14. Many solutions
• Color
• Tables
• Glyph plots
• Scatterplot Matrices

15. Tables

16. Color

17. Scatterplot Matrices

18. Chernoff Faces

19. Shoes.rb Chernoff Faces

20. How can we apply 80/20 to Variables?

21. Classification And Regression Trees

22. Random Forests
• Pick subset of population data
• Cluster the data in groups and subgroups
• Make multiple trees this way
• Use rest of data (not subset) to determine best classification
• Tree with most predictive power wins
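
The steps above can be sketched in a few lines. This is a simplified stand-in, not the Nimbus implementation: each "tree" is a single split (a decision stump), each one is trained on a bootstrap sample, and the rows left out of that sample (the out-of-bag rows) are used to score it. A full random forest would also grow deeper trees and sample a random subset of variables at each split, both omitted here. The sketch combines trees by majority vote, the usual approach, rather than picking a single winning tree. All function names and the toy data are made up.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Fit a one-split tree: pick the binary variable with the fewest errors."""
    best = None
    for var in range(len(rows[0])):
        # Each side of the split predicts the majority label that landed there.
        leaf = {}
        for value in (0, 1):
            side = [lab for row, lab in zip(rows, labels) if row[var] == value]
            leaf[value] = Counter(side).most_common(1)[0][0] if side else 0
        errors = sum(leaf[row[var]] != lab for row, lab in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, var, leaf)
    return best[1], best[2]

def predict(stump, row):
    var, leaf = stump
    return leaf[row[var]]

def grow_forest(rows, labels, n_trees=15, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]  # rows this tree never saw
        stump = train_stump([rows[i] for i in bag], [labels[i] for i in bag])
        hits = sum(predict(stump, rows[i]) == labels[i] for i in oob)
        oob_accuracy = hits / len(oob) if oob else None
        forest.append((stump, oob_accuracy))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict(stump, row) for stump, _ in forest)
    return votes.most_common(1)[0][0]

# Made-up data: [from_friend, has_link, is_long]; label 1 means "retweeted".
rows = [[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(40)]
labels = [row[0] for row in rows]
forest = grow_forest(rows, labels)
print(forest_predict(forest, [1, 0, 1]))
```

In this toy data the label is fully determined by the first variable, so every stump should split on it and score perfectly on its out-of-bag rows.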

23. Nimbus
• nimbusgem.org
• Designed for bioinformatics and genetics

24. Example using Nimbus
• What makes Matt retweet something?
• ~1800 tweets
• Retweet is classified as a 1 and everything else a 0
• Code is up at http://github.com/hexgnu/tweet-randomforest

25. Determining Variable Importance
• ((permuted_snp_errors / oob_size) - generalization_error)
• permuted_snp_errors = sum of missed classifications inside of tree by each variable
• oob_size (out of bag size) means all the data points not used
• generalization_error = sum of errors for entire forest
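
A minimal sketch of the formula above, under one assumption: `generalization_error` is treated as an error rate (errors divided by number of rows) so that both terms are on the same scale. The idea is to shuffle one variable's values among the held-out rows, count how many predictions now go wrong, and compare against the unshuffled error. The function `permutation_importance`, the toy classifier, and the data are hypothetical and do not reproduce Nimbus internals.

```python
import random

def permutation_importance(classify, rows, labels, var, seed=0):
    """(errors after shuffling `var` / number of rows) - baseline error rate."""
    rng = random.Random(seed)
    oob_size = len(rows)

    baseline_errors = sum(classify(r) != lab for r, lab in zip(rows, labels))
    generalization_error = baseline_errors / oob_size

    # Shuffle one column so it no longer carries information about the label.
    column = [row[var] for row in rows]
    rng.shuffle(column)
    permuted = [row[:var] + [v] + row[var + 1:] for row, v in zip(rows, column)]
    permuted_errors = sum(classify(r) != lab for r, lab in zip(permuted, labels))

    return permuted_errors / oob_size - generalization_error

# Toy classifier that only looks at variable 0.
def classify(row):
    return row[0]

rows = [[i % 2, (i // 2) % 2] for i in range(40)]
labels = [row[0] for row in rows]

important = permutation_importance(classify, rows, labels, var=0)
irrelevant = permutation_importance(classify, rows, labels, var=1)
print(important, irrelevant)
```

Shuffling a variable the classifier depends on raises the error, giving a positive score. Shuffling one it ignores changes nothing, giving a score of zero.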

26. Minimum Depth
• Can be used in survival diagram where the end point determines the classification.
• Rank by depth in tree
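
Ranking by depth can be sketched as follows: for each variable, record the shallowest level at which it splits the tree, then sort variables by that depth. Variables that split near the root affect the most data, so they rank first. The nested-dict tree format and variable names are assumptions made for this example. In a forest, these depths would typically be averaged over all trees.

```python
def minimum_depths(tree, depth=0, found=None):
    """Map each variable to the shallowest depth at which it splits the tree."""
    if found is None:
        found = {}
    if not isinstance(tree, dict):  # a leaf holds a class label, not a split
        return found
    var = tree["var"]
    found[var] = min(depth, found.get(var, depth))
    for child in tree["children"]:
        minimum_depths(child, depth + 1, found)
    return found

# Hypothetical tree: inner nodes split on a named variable, leaves are labels.
tree = {
    "var": "from_friend",
    "children": [
        {"var": "has_link",
         "children": [0, {"var": "is_long", "children": [0, 1]}]},
        {"var": "is_long", "children": [1, 1]},
    ],
}

depths = minimum_depths(tree)
ranking = sorted(depths, key=depths.get)  # shallowest (most important) first
print(ranking)
```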

27. Results from running data set
• According to both Variable Importance and Minimum Depth I am more likely to retweet tweets with:
• @bentlygtcspeed
• @sophthewiseone (this is my wife)
• @avdi
• @mattmight

28. Example tweets I retweeted
• BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?"
• Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...

29. More example tweets
• Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!!
• Mattmight: "More data" does little good

30. Now that you know
Rinse and repeat

31. Conclusion
Visualize the important stuff
Use Classification Trees to pick Variables

32. Matt Kirk
@mjkirk
modulus7.com

33. http://www.flickr.com/photos/theophileescargot/5950985345/