Overcoming Big Data Apathy

Matt Kirk
February 28, 2013


Transcript

  1. Overcoming Big Data Apathy Matthew Kirk - Modulus 7

  2. Next on Data Hoarding

  3. Compulsive Hoarding: excessive acquisition of, and inability or unwillingness to discard, large quantities of objects
  4. Big Data

  5. None
  6. 80/20 Principle: 80% of the outcome is determined by 20% of the input
  7. The Whirlwind Tour • Backwards induction • Visualizing > 3 dimensions • Determining variable importance using random forests
  8. Many Roads to Travel

  9. Backwards Induction

  10. AARRR! Acquisition Activation Retention Referral Revenue (Dave McClure)

  11. Acquisition Tree

  12. Map the Relationship

  13. Correlation != Causation: If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?
  14. Visualizing more than 3d

  15. Many solutions • Color • Tables • Glyph plots • Scatterplot Matrices
  16. Start with Tables

  17. Color

  18. Scatterplot Matrices

  19. Chernoff Faces

  20. Shoes.rb Chernoff Faces Follow along at http://github.com/hexgnu/chernoff

  21. How can we apply 80/20 to Variables?

  22. Classification And Regression Trees

  23. Random Forests • Pick a subset of the population data • Cluster the data into groups and subgroups • Make multiple trees this way • Use the rest of the data (not in the subset) to determine the best classification • The tree with the most predictive power wins
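The steps above can be sketched as a toy in Ruby. This is not Nimbus or the talk's code, just an illustration of the pattern: train many simple classifiers on bootstrap subsets of the data, then classify new points by majority vote. Each "tree" here is a one-feature decision stump, and the data set and names are invented.

```ruby
# A minimal random-forest-style sketch: each "tree" is a decision stump
# that splits one randomly chosen feature at the mean of a bootstrap sample.
Stump = Struct.new(:feature, :threshold, :label_above, :label_below) do
  def classify(row)
    row[feature] > threshold ? label_above : label_below
  end
end

# Build one stump from a bootstrap sample (random subset with replacement).
def build_stump(rows, labels, rng)
  feature = rng.rand(rows.first.size)
  sample_idx = Array.new(rows.size) { rng.rand(rows.size) }
  values = sample_idx.map { |i| rows[i][feature] }
  threshold = values.sum.to_f / values.size
  majority = ->(ls) { ls.empty? ? 0 : ls.group_by(&:itself).max_by { |_, v| v.size }.first }
  above = sample_idx.select { |i| rows[i][feature] > threshold }.map { |i| labels[i] }
  below = sample_idx.select { |i| rows[i][feature] <= threshold }.map { |i| labels[i] }
  Stump.new(feature, threshold, majority.(above), majority.(below))
end

# The forest's answer is a majority vote across all stumps.
def forest_classify(stumps, row)
  stumps.map { |s| s.classify(row) }.group_by(&:itself).max_by { |_, v| v.size }.first
end

rng    = Random.new(42)
rows   = [[1, 0], [2, 0], [8, 1], [9, 1]]  # two features per row
labels = [0, 0, 1, 1]
forest = Array.new(25) { build_stump(rows, labels, rng) }
forest_classify(forest, [1.5, 0])  # majority vote over 25 stumps
```

Any single stump can be wrong (its bootstrap sample may miss half the data), but the vote across many stumps is far more stable — which is the point of the "make multiple trees this way" step.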
  24. Nimbus • nimbusgem.org • Designed for bioinformatics and genetics

  25. Example using Nimbus • What makes Matt retweet something? • ~1800 tweets • A retweet is classified as 1 and everything else as 0 • Code is up at http://github.com/hexgnu/tweet-randomforest
  26. Determining Variable Importance • ((permuted_snp_errors / oob_size) - generalization_error) • permuted_snp_errors = sum of missed classifications inside the tree for each permuted variable • oob_size (out-of-bag size) = the data points not used in the subset • generalization_error = sum of errors for the entire forest
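Plugging numbers into that formula shows how it reads; the values below are made up for illustration and are not from the talk's data set.

```ruby
# Importance of one variable, per the slide's formula:
# (permuted errors / out-of-bag size) - forest's overall error rate.
def variable_importance(permuted_errors, oob_size, generalization_error)
  (permuted_errors.to_f / oob_size) - generalization_error
end

# Hypothetical numbers: permuting the variable causes 40 misses across
# 200 out-of-bag points, while the whole forest's error rate is 0.12.
importance = variable_importance(40, 200, 0.12)
# 40/200 - 0.12 = 0.08: scrambling this variable hurts accuracy,
# so it carries real predictive information.
```

A variable whose permutation barely moves the error rate scores near zero and can be dropped — which is how the 80/20 cut gets made.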
  27. Minimum Depth • Can be used in a survival diagram where the end point determines the classification • Rank variables by their depth in the tree
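One way to read "rank by depth in tree": for each variable, take the shallowest depth at which any tree in the forest splits on it, then sort ascending (shallower splits matter more). This is a sketch of that idea only; the split records below are invented, not output from Nimbus.

```ruby
# splits_per_tree: one { variable => split depth } hash per tree.
# Returns [variable, minimum depth] pairs, most important first.
def minimum_depth_ranking(splits_per_tree)
  min_depths = Hash.new { |h, k| h[k] = Float::INFINITY }
  splits_per_tree.each do |splits|
    splits.each { |var, depth| min_depths[var] = [min_depths[var], depth].min }
  end
  min_depths.sort_by { |_, depth| depth }  # shallower = more important
end

# Two hypothetical trees and the depths at which they split on each handle:
trees = [
  { "@avdi" => 0, "@mattmight" => 2 },
  { "@sophthewiseone" => 1, "@avdi" => 1 },
]
minimum_depth_ranking(trees)
# => [["@avdi", 0], ["@sophthewiseone", 1], ["@mattmight", 2]]
```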
  28. Results from running the data set • According to both Variable Importance and Minimum Depth, I am more likely to retweet tweets with: • @bentlygtcspeed • @sophthewiseone (this is my wife) • @avdi • @mattmight
  29. Example tweets I retweeted • BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?" • Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...
  30. More example tweets • Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!! • Mattmight: "More data" does little good unless it becomes "more information."
  31. Now that you know: rinse and repeat

  32. Conclusion • Start with the end in mind • Visualize the important stuff • Use classification trees to pick variables
  33. Matt Kirk @mjkirk modulus7.com

  34. http://www.flickr.com/photos/theophileescargot/5950985345/ http://rickmanelius.com/article/do-you-dread-emails http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-cut-the-bullshit/ http://upload.wikimedia.org/wikipedia/commons/7/7d/LogisticMap_BifurcationDiagram.png http://www.ewp.rpi.edu/hartford/~stoddj/BE/Image29.gif http://www.flickr.com/photos/earlg/160807760/ http://www.flickr.com/photos/aigle_dore/5626287267/ http://www.flickr.com/photos/rubenholthuijsen/7430874638/ Photo Credits