
Overcoming Big Data Apathy

Matt Kirk
February 28, 2013



Transcript

  1. The Whirlwind Tour
     • Backwards induction
     • Visualizing > 3 dimensions
     • Determining variable importance using random forests
  2. Correlation != Causation
     If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?
  3. Random Forests
     • Pick subset of population data
     • Cluster the data in groups and subgroups
     • Make multiple trees this way
     • Use rest of data (not subset) to determine best classification
     • Tree with most predictive power wins
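
The steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the Nimbus implementation: it assumes 0/1 features, uses one-level trees (stumps) instead of full trees, and skips the per-split feature sampling that real random forests do. It also combines the trees by majority vote, which is the usual random-forest aggregation, rather than picking a single winning tree. All function names here are hypothetical.

```python
import random
from collections import Counter

def majority(labels, default=0):
    """Most common label, or `default` when `labels` is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def bootstrap_split(n_rows, rng):
    """Sample row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def fit_stump(rows, labels):
    """One-level tree on 0/1 features: keep the split with the fewest training errors."""
    overall = majority(labels)
    best = None
    for feature in range(len(rows[0])):
        sides = {0: [], 1: []}
        for row, label in zip(rows, labels):
            sides[row[feature]].append(label)
        predict = {value: majority(side, overall) for value, side in sides.items()}
        errors = sum(predict[row[feature]] != label for row, label in zip(rows, labels))
        if best is None or errors < best["errors"]:
            best = {"feature": feature, "predict": predict, "errors": errors}
    return best

def predict_stump(stump, row):
    """Look up the stump's prediction for the row's value of the split feature."""
    return stump["predict"][row[stump["feature"]]]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Grow each tree on its own bootstrap sample and score it on its out-of-bag rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_split(len(rows), rng)
        stump = fit_stump([rows[i] for i in in_bag], [labels[i] for i in in_bag])
        oob_errors = sum(predict_stump(stump, rows[i]) != labels[i] for i in oob)
        forest.append({"stump": stump, "oob_size": len(oob), "oob_errors": oob_errors})
    return forest

def predict_forest(forest, row):
    """Majority vote across all trees."""
    return majority([predict_stump(tree["stump"], row) for tree in forest])
```

The `oob_errors` and `oob_size` stored per tree correspond to the "use rest of data" step: each tree is judged only on rows it never saw while being grown.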
  4. Example using Nimbus
     • What makes Matt retweet something?
     • ~1800 tweets
     • Retweet is classified as a 1 and everything else a 0
     • Code is up at http://github.com/hexgnu/tweet-randomforest
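
To make the 1/0 labelling concrete, here is a hedged sketch of how tweets might be turned into rows a classifier can use. It is not the code from the repository above (which targets Nimbus); the dictionary keys `text` and `retweeted`, the choice of @handles as features, and the function names are all assumptions made for illustration.

```python
import re

def extract_handles(text):
    """Return the lowercase @handles that appear in a tweet's text."""
    return {handle.lower() for handle in re.findall(r"@\w+", text)}

def encode(tweets):
    """Build one 0/1 column per handle, plus a 0/1 label (1 = retweet)."""
    vocabulary = sorted(set().union(*(extract_handles(t["text"]) for t in tweets)))
    rows = [[int(h in extract_handles(t["text"])) for h in vocabulary] for t in tweets]
    labels = [int(t["retweeted"]) for t in tweets]
    return vocabulary, rows, labels

# Hypothetical toy input, not real tweets from the data set.
sample_tweets = [
    {"text": "RT @alice: ship it", "retweeted": True},
    {"text": "lunch with @bob and @alice", "retweeted": False},
    {"text": "no mentions here", "retweeted": False},
]
vocabulary, rows, labels = encode(sample_tweets)
```

Each row then says which handles a tweet contains, and the label says whether it was a retweet.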
  5. Determining Variable Importance
     • ((permuted_snp_errors / oob_size) - generalization_error)
     • permuted_snp_errors = sum of missed classifications inside of tree by each variable
     • oob_size (out of bag size) means all the data points not used
     • generalization_error = sum of errors for entire forest
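
The formula above can be read as "how much worse does the tree get on its out-of-bag rows once one variable is scrambled?". The sketch below follows that reading. It assumes `generalization_error` is expressed as an error rate so the two terms are comparable, and it takes any `predict(row)` callable in place of a real tree; the helper names are hypothetical, not Nimbus functions.

```python
import random

def error_count(predict, rows, labels):
    """Number of rows the predictor gets wrong."""
    return sum(predict(row) != label for row, label in zip(rows, labels))

def permuted_errors(predict, rows, labels, feature, rng):
    """Misclassifications after shuffling one feature's column across the rows."""
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    permuted_rows = []
    for row, value in zip(rows, column):
        new_row = list(row)
        new_row[feature] = value
        permuted_rows.append(tuple(new_row))
    return error_count(predict, permuted_rows, labels)

def variable_importance(permuted_snp_errors, oob_size, generalization_error):
    """The slide's formula: permuted error rate minus the baseline error rate."""
    return (permuted_snp_errors / oob_size) - generalization_error
```

A variable the predictor never looks at scores 0, because shuffling it cannot change any prediction; a variable the predictor relies on scores above 0 whenever the shuffle causes new mistakes.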
  6. Minimum Depth
     • Can be used in survival diagram where the end point determines the classification.
     • Rank by depth in tree
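
"Rank by depth in tree" can be illustrated with a small sketch: a variable that is split on close to the root separates more of the data, so a smaller minimum depth suggests a more important variable. The nested-dictionary tree layout (`feature`, `left`, `right`, `label`) and the function names below are assumptions made for this example.

```python
def minimum_depths(tree, depth=0, found=None):
    """Map each variable to the shallowest depth at which the tree splits on it (root = 0)."""
    found = {} if found is None else found
    if "feature" in tree:  # internal node; leaves only carry a "label"
        name = tree["feature"]
        found[name] = min(depth, found.get(name, depth))
        minimum_depths(tree["left"], depth + 1, found)
        minimum_depths(tree["right"], depth + 1, found)
    return found

def rank_by_minimum_depth(tree):
    """Variables ordered from shallowest to deepest first split, ties broken by name."""
    depths = minimum_depths(tree)
    return sorted(depths, key=lambda name: (depths[name], name))

# Hypothetical tree: splits on "a" at the root, then on "b" and "c" one level down.
example_tree = {
    "feature": "a",
    "left": {"feature": "b", "left": {"label": 0}, "right": {"label": 1}},
    "right": {
        "feature": "c",
        "left": {"label": 0},
        "right": {"feature": "b", "left": {"label": 0}, "right": {"label": 1}},
    },
}
```

For a whole forest, one option is to average each variable's minimum depth over the trees; how to treat trees that never split on a variable is a design choice this sketch leaves open.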
  7. Results from running data set
     • According to both Variable Importance and Minimum Depth I am more likely to retweet tweets with:
       • @bentlygtcspeed
       • @sophthewiseone (this is my wife)
       • @avdi
       • @mattmight
  8. Example tweets I retweeted
     • BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?"
     • Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...
  9. More example tweets
     • Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!!
     • Mattmight: "More data" does little good unless it becomes "more information."
  10. Conclusion
     • Start with the End in Mind
     • Visualize the important stuff
     • Use Classification Trees to pick Variables