Overcoming Big Data Apathy

Matt Kirk
February 28, 2013


Transcript

  1. Overcoming Big Data Apathy Matthew Kirk - Modulus 7

  2. Next on Data Hoarding

  3. Compulsive Hoarding: excessive acquisition of, and inability or unwillingness to discard, large quantities of objects
  4. Big Data

  5. None
  6. 80/20 Principle: 80% of the outcome is determined by 20% of the input
  7. The Whirlwind Tour • Backwards induction • Visualizing > 3 dimensions • Determining variable importance using random forests
  8. Many Roads to Travel

  9. Backwards Induction

  10. AARRR! Acquisition Activation Retention Referral Revenue (Dave McClure)

  11. Acquisition Tree

  12. Map the Relationship

  13. Correlation != Causation: If I wake up in the morning and drink coffee, does that mean coffee implies sunshine?
  14. Visualizing more than 3d

  15. Many solutions • Color • Tables • Glyph plots • Scatterplot Matrices
  16. Start with Tables

  17. Color

  18. Scatterplot Matrices

  19. Chernoff Faces

  20. Shoes.rb Chernoff Faces Follow along at http://github.com/hexgnu/chernoff

  21. How can we apply 80/20 to Variables?

  22. Classification And Regression Trees

  23. Random Forests • Pick a subset of the population data • Cluster the data into groups and subgroups • Make multiple trees this way • Use the rest of the data (not in the subset) to determine the best classification • The tree with the most predictive power wins
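The steps above can be sketched as a toy in Ruby. This is not Nimbus or the talk's code, just an illustration of the pattern: train many simple classifiers on bootstrap subsets of the data, then classify new points by majority vote. Each "tree" here is a one-feature decision stump, and the data set and names are invented.

```ruby
# A minimal random-forest-style sketch: each "tree" is a decision stump
# that splits one randomly chosen feature at the mean of a bootstrap sample.
Stump = Struct.new(:feature, :threshold, :label_above, :label_below) do
  def classify(row)
    row[feature] > threshold ? label_above : label_below
  end
end

# Build one stump from a bootstrap sample (random subset with replacement).
def build_stump(rows, labels, rng)
  feature = rng.rand(rows.first.size)
  sample_idx = Array.new(rows.size) { rng.rand(rows.size) }
  values = sample_idx.map { |i| rows[i][feature] }
  threshold = values.sum.to_f / values.size
  majority = ->(ls) { ls.empty? ? 0 : ls.group_by(&:itself).max_by { |_, v| v.size }.first }
  above = sample_idx.select { |i| rows[i][feature] > threshold }.map { |i| labels[i] }
  below = sample_idx.select { |i| rows[i][feature] <= threshold }.map { |i| labels[i] }
  Stump.new(feature, threshold, majority.(above), majority.(below))
end

# The forest's answer is a majority vote across all stumps.
def forest_classify(stumps, row)
  stumps.map { |s| s.classify(row) }.group_by(&:itself).max_by { |_, v| v.size }.first
end

rng    = Random.new(42)
rows   = [[1, 0], [2, 0], [8, 1], [9, 1]]  # two features per row
labels = [0, 0, 1, 1]
forest = Array.new(25) { build_stump(rows, labels, rng) }
forest_classify(forest, [1.5, 0])  # majority vote over 25 stumps
```

Any single stump can be wrong (its bootstrap sample may miss half the data), but the vote across many stumps is far more stable — which is the point of the "make multiple trees this way" step.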
  24. Nimbus • nimbusgem.org • Designed for bioinformatics and genetics

  25. Example using Nimbus • What makes Matt retweet something? • ~1800 tweets • A retweet is classified as 1 and everything else as 0 • Code is up at http://github.com/hexgnu/tweet-randomforest
  26. Determining Variable Importance • ((permuted_snp_errors / oob_size) - generalization_error) • permuted_snp_errors = sum of missed classifications inside the tree for each permuted variable • oob_size (out-of-bag size) = the data points not used in the subset • generalization_error = sum of errors for the entire forest
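Plugging numbers into that formula shows how it reads; the values below are made up for illustration and are not from the talk's data set.

```ruby
# Importance of one variable, per the slide's formula:
# (permuted errors / out-of-bag size) - forest's overall error rate.
def variable_importance(permuted_errors, oob_size, generalization_error)
  (permuted_errors.to_f / oob_size) - generalization_error
end

# Hypothetical numbers: permuting the variable causes 40 misses across
# 200 out-of-bag points, while the whole forest's error rate is 0.12.
importance = variable_importance(40, 200, 0.12)
# 40/200 - 0.12 = 0.08: scrambling this variable hurts accuracy,
# so it carries real predictive information.
```

A variable whose permutation barely moves the error rate scores near zero and can be dropped — which is how the 80/20 cut gets made.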
  27. Minimum Depth • Can be used in a survival diagram where the end point determines the classification • Rank variables by their depth in the tree
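One way to read "rank by depth in tree": for each variable, take the shallowest depth at which any tree in the forest splits on it, then sort ascending (shallower splits matter more). This is a sketch of that idea only; the split records below are invented, not output from Nimbus.

```ruby
# splits_per_tree: one { variable => split depth } hash per tree.
# Returns [variable, minimum depth] pairs, most important first.
def minimum_depth_ranking(splits_per_tree)
  min_depths = Hash.new { |h, k| h[k] = Float::INFINITY }
  splits_per_tree.each do |splits|
    splits.each { |var, depth| min_depths[var] = [min_depths[var], depth].min }
  end
  min_depths.sort_by { |_, depth| depth }  # shallower = more important
end

# Two hypothetical trees and the depths at which they split on each handle:
trees = [
  { "@avdi" => 0, "@mattmight" => 2 },
  { "@sophthewiseone" => 1, "@avdi" => 1 },
]
minimum_depth_ranking(trees)
# => [["@avdi", 0], ["@sophthewiseone", 1], ["@mattmight", 2]]
```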
  28. Results from running the data set • According to both Variable Importance and Minimum Depth, I am more likely to retweet tweets with: • @bentlygtcspeed • @sophthewiseone (this is my wife) • @avdi • @mattmight
  29. Example tweets I retweeted • BentleyGTCSpeed: "I'm going to do a training." "Really? On the way home will you do a driving?" • Sophthewiseone: TIL: Hitler was a micromanager. "Hitler was constantly interfering in the decisions of his subordinates." Explains a lot...
  30. More example tweets • Avdi: HOLY CRAP ALL-CAPS RECRUITER DO YOU HAVE A JOB LEAD FOR ME?!!! • Mattmight: "More data" does little good unless it becomes "more information."
  31. Now that you know: rinse and repeat

  32. Conclusion • Start with the end in mind • Visualize the important stuff • Use classification trees to pick variables
  33. Matt Kirk @mjkirk modulus7.com

  34. http://www.flickr.com/photos/theophileescargot/5950985345/ http://rickmanelius.com/article/do-you-dread-emails http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-cut-the-bullshit/ http://upload.wikimedia.org/wikipedia/commons/7/7d/LogisticMap_BifurcationDiagram.png http://www.ewp.rpi.edu/hartford/~stoddj/BE/Image29.gif http://www.flickr.com/photos/earlg/160807760/ http://www.flickr.com/photos/aigle_dore/5626287267/ http://www.flickr.com/photos/rubenholthuijsen/7430874638/ Photo Credits