Matthew Kirk - Modulus 7
Next on Data Hoarding
acquisition of and
quantities of objects
80% of the outcome is determined by 20% of the input
The Whirlwind Tour
• Backwards induction
• Visualizing > 3 dimensions
• Determining variable importance using
Many Roads to Travel
If I wake up in the morning and drink coffee, does that
mean coffee implies sunshine?
more than 3d
• Glyph plots
• Scatterplot Matrices
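A glyph plot works by letting each variable drive one visual feature of a small shape, so a row with more than three variables still fits in one picture. The sketch below shows only the data-preparation step: scaling every column to the range 0 to 1 and pairing it with a feature name. The helper names, the feature names (a face-style glyph is assumed here), and the sample rows are all hypothetical and are not taken from the talk's repository.

```python
# Hypothetical helper: scale each column to [0, 1] so it can drive one glyph feature.
def normalize_columns(rows):
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information; pin it to the midpoint.
        scaled_columns.append([0.5 if span == 0 else (v - lo) / span for v in col])
    # Transpose back so each inner list is one observation again.
    return [list(r) for r in zip(*scaled_columns)]

# Assumed feature names for a face-style glyph; the order decides
# which variable drives which feature.
FEATURES = ["face_width", "eye_size", "mouth_curve", "brow_slant", "nose_length"]

def to_glyphs(rows):
    """Return one dict of feature -> scaled value per observation."""
    return [dict(zip(FEATURES, r)) for r in normalize_columns(rows)]

# Made-up five-variable observations, for illustration only.
rows = [
    [1.0, 10.0, 3.0, 0.0, 5.0],
    [2.0, 20.0, 3.0, 1.0, 7.0],
    [3.0, 40.0, 3.0, 0.5, 9.0],
]
glyphs = to_glyphs(rows)
```

Each dict in glyphs can then be handed to whatever drawing routine renders the glyph. Scaling per column keeps one large-valued variable from dominating the picture.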
Follow along at http://github.com/hexgnu/chernoff
How can we apply
80/20 to Variables?
• Pick subset of population data
• Cluster the data into groups and subgroups
• Make multiple trees this way
• Use rest of data (not subset) to determine
• Tree with most predictive power wins
• Designed for bioinformatics and genetics
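The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: the subset is drawn by sampling with replacement, each "tree" is a depth-one stump over binary features, and the winner is chosen by error rate on the rows left out of the subset, following the bullets. Standard random forests usually combine the votes of all trees rather than keeping one. All function names and the toy data are hypothetical.

```python
import random

def train_stump(rows, labels):
    """Pick the single binary feature (and polarity) with the fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for polarity in (0, 1):  # polarity 1 flips the prediction
            errors = sum((r[j] ^ polarity) != y for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, j, polarity)
    _, feature, polarity = best
    return feature, polarity

def predict(stump, row):
    feature, polarity = stump
    return row[feature] ^ polarity

def best_tree(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps on random subsets; keep the one with the lowest out-of-bag error."""
    rng = random.Random(seed)
    n = len(rows)
    winner = None
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]        # subset, with replacement
        out_of_bag = sorted(set(range(n)) - set(in_bag))     # rest of the data
        if not out_of_bag:
            continue  # nothing left to score this tree on
        stump = train_stump([rows[i] for i in in_bag], [labels[i] for i in in_bag])
        oob_errors = sum(predict(stump, rows[i]) != labels[i] for i in out_of_bag)
        oob_error_rate = oob_errors / len(out_of_bag)
        if winner is None or oob_error_rate < winner[0]:
            winner = (oob_error_rate, stump)
    return winner

# Toy data: feature 0 equals the label, feature 1 is unrelated noise.
labels = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
rows = [[y, i % 2] for i, y in enumerate(labels)]
error_rate, stump = best_tree(rows, labels)
```

Scoring on the left-out rows gives each tree an honest test set without holding back a separate validation split.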
Example using Nimbus
• What makes Matt retweet something?
• ~1800 tweets
• A retweet is classified as 1 and everything
else as 0
• Code is up at http://github.com/hexgnu/
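The labeling above can be sketched as follows. The slides do not say which variables were extracted from each tweet, so this sketch assumes one binary variable per @mention; the function name, the input shape, and the sample tweets are hypothetical.

```python
import re

MENTION = re.compile(r"@\w+")

def encode(tweets):
    """tweets: list of (text, was_retweeted) pairs (an assumed input shape).

    Returns the mention vocabulary, one 0/1 row per tweet, and 0/1 labels.
    """
    mentions = [{m.lower() for m in MENTION.findall(text)} for text, _ in tweets]
    vocabulary = sorted(set().union(*mentions)) if mentions else []
    rows = [[1 if name in found else 0 for name in vocabulary] for found in mentions]
    # A retweet becomes 1, everything else 0.
    labels = [1 if was_retweeted else 0 for _, was_retweeted in tweets]
    return vocabulary, rows, labels

# Made-up tweets, for illustration only.
tweets = [
    ("@alice shipped a new release", True),
    ("lunch with @bob and @alice", False),
    ("no mentions here", False),
]
vocabulary, rows, labels = encode(tweets)
```

The resulting rows and labels have the shape a classification tree expects: one binary column per candidate variable and one binary outcome.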
• ((permuted_snp_errors / oob_size) -
• permuted_snp_errors = sum of missed
classifications inside of tree by each variable
• oob_size (out of bag size) means all the
data points not used
• generalization_error = sum of errors for
• Can be used in survival diagram where the
end point determines the classification.
• Rank by depth in tree
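The variable-importance bullets above describe a permutation test: shuffle one variable across the out-of-bag rows and see how much the error grows. The sketch below assumes the term subtracted in the formula is generalization_error, taken here to be the unpermuted out-of-bag error rate; the slide text is cut off at that point, so this reading is an assumption. The function names and toy data are hypothetical.

```python
import random

def error_count(predict, rows, labels):
    """Number of rows the classifier gets wrong."""
    return sum(predict(r) != y for r, y in zip(rows, labels))

def permutation_importance(predict, oob_rows, oob_labels, feature, seed=0):
    """(permuted_errors / oob_size) - generalization_error for one variable."""
    rng = random.Random(seed)
    oob_size = len(oob_rows)
    # Assumed meaning: error rate on the untouched out-of-bag rows.
    generalization_error = error_count(predict, oob_rows, oob_labels) / oob_size
    # Shuffle only the chosen variable, leaving every other column in place.
    shuffled = [r[feature] for r in oob_rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:feature] + [v] + r[feature + 1:]
                     for r, v in zip(oob_rows, shuffled)]
    permuted_errors = error_count(predict, permuted_rows, oob_labels)
    return (permuted_errors / oob_size) - generalization_error

# Toy classifier that only looks at variable 0; variable 1 is ignored.
def toy_predict(row):
    return row[0]

oob_labels = [0, 1, 0, 1, 1, 0, 1, 0]
oob_rows = [[y, i % 2] for i, y in enumerate(oob_labels)]
importance_used = permutation_importance(toy_predict, oob_rows, oob_labels, feature=0)
importance_ignored = permutation_importance(toy_predict, oob_rows, oob_labels, feature=1)
```

A variable the classifier never consults scores exactly zero, because shuffling it cannot change any prediction. A variable the classifier relies on scores above zero whenever the shuffle breaks its link to the label.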
Results from running
• According to both Variable Importance and
Minimum Depth I am more likely to
retweet tweets with:
• @sophthewiseone (this is my wife)
Example tweets I
• BentleyGTCSpeed: "I'm going to do a
training." "Really? On the way home will
you do a driving?"
• Sophthewiseone: TIL: Hitler was a
micromanager. "Hitler was constantly
interfering in the decisions of his
subordinates." Explains a lot...
More example tweets
• Avdi: HOLY CRAP ALL-CAPS RECRUITER
DO YOU HAVE A JOB LEAD FOR ME?!!!
• Mattmight: "More data" does little good
unless it becomes "more information."
Now that you know
Rinse and repeat
Start with the End in Mind
Visualize the important stuff
Use Classification Trees to pick Variables