Overcoming Big Data Apathy

Matt Kirk
February 28, 2013

Transcript

  1. Overcoming Big Data Apathy
    Matthew Kirk - Modulus 7

  2. Next on Data Hoarding

  3. Compulsive Hoarding
    Excessive acquisition of, and inability or unwillingness to discard, large quantities of objects

  4. Big Data

  6. 80/20 Principle
    80% of the outcome is determined by 20% of the input
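One way to check whether a data set follows the 80/20 pattern is to sort the inputs by contribution and count how many are needed to reach 80% of the total. The sketch below assumes made-up channel names and numbers; `vital_few` is a hypothetical helper, not something from the talk.

```ruby
# Hypothetical contribution of each input to an outcome (illustrative numbers).
CONTRIBUTIONS = {
  "search" => 500, "email" => 300, "social" => 50, "ads" => 40, "events" => 30,
  "print" => 25, "radio" => 20, "partners" => 15, "forums" => 12, "other" => 8
}.freeze

# Returns the fewest inputs (largest first) whose combined contribution
# reaches `target_percent` of the total outcome.
def vital_few(contributions, target_percent = 80)
  total = contributions.values.sum
  running = 0
  sorted = contributions.sort_by { |_, value| -value }
  chosen = sorted.take_while do |_, value|
    reached = running * 100 >= target_percent * total
    running += value
    !reached
  end
  chosen.map(&:first)
end

VITAL = vital_few(CONTRIBUTIONS)
puts "#{VITAL.size} of #{CONTRIBUTIONS.size} inputs: #{VITAL.join(', ')}"
```

Integer arithmetic is used for the threshold comparison so that the 80% cut-off is not affected by floating-point rounding.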

  7. The Whirlwind Tour
    • Backwards induction
    • Visualizing > 3 dimensions
    • Determining variable importance using random forests

  8. Many Roads to Travel

  9. Backwards Induction

  10. AARRR!
    Acquisition
    Activation
    Retention
    Referral
    Revenue
    Dave McClure

  11. Acquisition Tree

  12. Map the Relationship

  13. Correlation != Causation
    If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?

  14. Visualizing more than 3D

  15. Many solutions
    • Color
    • Tables
    • Glyph plots
    • Scatterplot Matrices

  16. Start with Tables

  17. Color

  18. Scatterplot Matrices

  19. Chernoff Faces

  20. Shoes.rb Chernoff Faces
    Follow along at http://github.com/hexgnu/chernoff

  21. How can we apply 80/20 to Variables?

  22. Classification And Regression Trees

  23. Random Forests
    • Pick a subset of the population data
    • Cluster the data into groups and subgroups
    • Make multiple trees this way
    • Use the rest of the data (not the subset) to determine the best classification
    • Tree with most predictive power wins
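The steps above can be sketched in plain Ruby. This is a minimal illustration under several assumptions, not the Nimbus implementation: the data set is made up, each "tree" is cut down to a single split (a stump) to keep the code short, and the trees are combined by majority vote, which is the usual random-forest convention, rather than by picking a single winning tree. All method names are hypothetical.

```ruby
# Made-up rows: [features, label]. Feature 0 separates the classes; feature 1 is noise.
DATA = [
  [[1.0, 5.0], 0], [[2.0, 1.0], 0], [[3.0, 4.0], 0], [[4.0, 2.0], 0],
  [[6.0, 5.0], 1], [[7.0, 1.0], 1], [[8.0, 4.0], 1], [[9.0, 2.0], 1]
].freeze

# A one-split tree: predict 1 when the chosen feature exceeds the threshold.
Stump = Struct.new(:feature, :threshold) do
  def predict(features)
    features[feature] > threshold ? 1 : 0
  end
end

# Number of rows the stump misclassifies.
def error_count(stump, rows)
  rows.count { |features, label| stump.predict(features) != label }
end

# Fit a stump by trying every feature/value pair seen in `rows` as a split.
def fit_stump(rows)
  candidates = rows.flat_map do |features, _|
    features.each_with_index.map { |value, index| Stump.new(index, value) }
  end
  candidates.min_by { |stump| error_count(stump, rows) }
end

# Grow `tree_count` stumps, each on a bootstrap sample drawn with replacement.
# Rows never drawn for a tree are kept as its out-of-bag (OOB) rows.
def grow_forest(rows, tree_count, rng)
  Array.new(tree_count) do
    in_bag = Array.new(rows.size) { rows[rng.rand(rows.size)] }
    { stump: fit_stump(in_bag), oob: rows - in_bag }
  end
end

# Score every tree on the rows it never saw.
def oob_error_rate(forest)
  errors = forest.sum { |tree| error_count(tree[:stump], tree[:oob]) }
  total = forest.sum { |tree| tree[:oob].size }
  total.zero? ? 0.0 : errors.to_f / total
end

# Combine the trees by majority vote.
def forest_predict(forest, features)
  votes = forest.map { |tree| tree[:stump].predict(features) }
  votes.count(1) * 2 > votes.size ? 1 : 0
end

FOREST = grow_forest(DATA, 25, Random.new(42))
puts "OOB error rate: #{oob_error_rate(FOREST).round(3)}"
puts "Prediction for [7.5, 3.0]: #{forest_predict(FOREST, [7.5, 3.0])}"
```

The out-of-bag rows act as a free validation set: every tree is scored only on data it was not fitted to.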

  24. Nimbus
    • nimbusgem.org
    • Designed for bioinformatics and genetics

  25. Example using Nimbus
    • What makes Matt retweet something?
    • ~1800 tweets
    • Retweet is classified as a 1 and everything else a 0
    • Code is up at http://github.com/hexgnu/tweet-randomforest

  26. Determining Variable Importance
    • ((permuted_snp_errors / oob_size) - generalization_error)
    • permuted_snp_errors = sum of missed classifications inside of tree by each variable
    • oob_size (out-of-bag size) means all the data points not used to build the tree
    • generalization_error = sum of errors for entire forest
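The formula above can be sketched as a small Ruby function. This is an illustration only, not Nimbus's code: the rows and classifier are made up, the helper names are hypothetical, and `generalization_error` is assumed to be an error rate on the same 0..1 scale as `permuted_errors / oob_size`, since the formula subtracts one from the other.

```ruby
# Made-up out-of-bag rows: [features, label]. Only column 0 determines the label.
ROWS = [[[1, 7], 0], [[2, 3], 0], [[8, 5], 1], [[9, 1], 1]].freeze

# Stand-in for a fitted tree: classifies using column 0 only.
CLASSIFIER = ->(features) { features[0] > 5 ? 1 : 0 }

# Shuffle one column across the rows, then count the misclassifications.
def permuted_error_count(rows, column, rng, classifier)
  shuffled = rows.map { |features, _| features[column] }.shuffle(random: rng)
  rows.each_with_index.count do |(features, label), i|
    permuted = features.dup
    permuted[column] = shuffled[i]
    classifier.call(permuted) != label
  end
end

# (permuted errors / out-of-bag size) - generalization error
def variable_importance(permuted_errors:, oob_size:, generalization_error:)
  raise ArgumentError, "oob_size must be positive" unless oob_size.positive?
  (permuted_errors.to_f / oob_size) - generalization_error
end

# The stand-in classifier makes no mistakes on ROWS, so its baseline error is 0.0.
BASELINE = 0.0
rng = Random.new(1)

SIGNAL_IMPORTANCE = variable_importance(
  permuted_errors: permuted_error_count(ROWS, 0, rng, CLASSIFIER),
  oob_size: ROWS.size, generalization_error: BASELINE
)
NOISE_IMPORTANCE = variable_importance(
  permuted_errors: permuted_error_count(ROWS, 1, rng, CLASSIFIER),
  oob_size: ROWS.size, generalization_error: BASELINE
)
puts "column 0: #{SIGNAL_IMPORTANCE}, column 1: #{NOISE_IMPORTANCE}"
```

Shuffling a column the classifier ignores cannot change any prediction, so that column's importance stays at the baseline. Shuffling a column the classifier depends on can only add errors.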

  27. Minimum Depth
    • Can be used in a survival diagram where the end point determines the classification.
    • Rank by depth in tree
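Ranking by depth can be sketched as follows. The trees, variable names, and helper methods are all hypothetical, and averaging each variable's shallowest depth over the trees in which it appears is an assumption of this sketch, not necessarily how the talk's code does it. Variables that split closer to the root rank as more important.

```ruby
# Hypothetical trees: each node names the variable it splits on; leaves are nil.
TREES = [
  { var: "mentions_avdi",
    left: { var: "has_link", left: nil, right: nil },
    right: { var: "length", left: nil, right: nil } },
  { var: "mentions_avdi",
    left: nil,
    right: { var: "has_link", left: nil, right: nil } },
  { var: "has_link",
    left: { var: "mentions_avdi", left: nil, right: nil },
    right: nil }
].freeze

# Shallowest depth at which each variable splits within one tree (root is 0).
def minimum_depths(node, depth = 0, depths = {})
  return depths if node.nil?

  var = node[:var]
  depths[var] = depth if !depths.key?(var) || depth < depths[var]
  minimum_depths(node[:left], depth + 1, depths)
  minimum_depths(node[:right], depth + 1, depths)
end

# Average each variable's minimum depth over the trees that use it,
# then rank from shallowest (most important) to deepest.
def rank_by_minimum_depth(trees)
  per_tree = trees.map { |tree| minimum_depths(tree) }
  variables = per_tree.flat_map(&:keys).uniq
  averages = variables.to_h do |var|
    found = per_tree.map { |depths| depths[var] }.compact
    [var, found.sum.to_f / found.size]
  end
  averages.sort_by { |_, average| average }.map(&:first)
end

RANKING = rank_by_minimum_depth(TREES)
puts RANKING.join(" > ")
```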

  28. Results from running the data set
    • According to both Variable Importance and Minimum Depth, I am more likely to retweet tweets with:
    • @bentlygtcspeed
    • @sophthewiseone (this is my wife)
    • @avdi
    • @mattmight

  29. Example tweets I retweeted
    • BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?"
    • Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...

  30. More example tweets
    • Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!!
    • Mattmight: "More data" does little good unless it becomes "more information."

  31. Now that you know
    Rinse and repeat

  32. Conclusion
    Start with the End in Mind
    Visualize the important stuff
    Use Classification Trees to pick Variables

  33. Matt Kirk
    @mjkirk
    modulus7.com

  34. Photo Credits
    http://www.flickr.com/photos/theophileescargot/5950985345/
    http://rickmanelius.com/article/do-you-dread-emails
    http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-cut-the-bullshit/
    http://upload.wikimedia.org/wikipedia/commons/7/7d/LogisticMap_BifurcationDiagram.png
    http://www.ewp.rpi.edu/hartford/~stoddj/BE/Image29.gif
    http://www.flickr.com/photos/earlg/160807760/
    http://www.flickr.com/photos/aigle_dore/5626287267/
    http://www.flickr.com/photos/rubenholthuijsen/7430874638/
