Tech Talk at Lookout (24 January 2012)

External Introduction to R (And Machine Learning) Tech Talk, 24
January 2012

External Wat
What is R?
What makes a good model?
Example: Modeling flowers
Example: Modeling users

External Who’s heard of R? Internal

External Who’s used R? Internal

External Who hates R?

External History of R in 60 seconds
Started life as “S”, the FORTRAN-based “statistical programming language” (1975).
R began as an implementation of S with semantics (lexical scoping) inspired by Scheme (1993).
Core mainly written in C and Fortran.

External Why it sucks
Legacy stack
Written by non-engineers (like me): spaghetti code, difficult to maintain, just, just awful
Slow for data-intensive I/O tasks
Concurrency and parallelized tasks are akin to skinning alpacas

External Why it’s awesome
Great for exploratory analysis of raw data
Rich library of statistical packages
Hooks into “regular” languages easily: RPy, rsruby gem, Rtalk

External Download R Windows R OS X Native RStudio ESS (Emacs Speaks Statistics) Every flavor of Linux R-Vim

External What makes a good model?

External Good models are

External Good models are Precise

External Good models are Accurate Precise

External Good models are Accurate Precise Generalized

External

External Example: Modeling TV

External All the characters on Friends are kitty cats. (Not accurate or precise)

External All the characters on Friends are white-collar felines between the ages of 25 and 30. (Precise, but not accurate)

External All the characters on Friends are humans. (Accurate, but not precise)

External All the characters on Friends are 20-something raging hedonists who surely live in a rent-controlled apartment in lower Manhattan. (Accurate and precise)

External You see what’s going on here...
A model that’s 100% accurate, precise and generalized doesn’t exist!
There will always be tradeoffs among all three.
Your ideal model will depend on your risk tolerance for each:
Does it need to be fast enough for an MVP? (e.g. real-time referrer analysis)
Does it need to be absolutely accurate? (e.g. malware classification)
Does it need to be ultra-granular? (e.g. customer segmentation)

External Example: Modeling Flowers

External Iris dataset (R example)
The data looks like this:
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2     setosa
3          4.7         3.2          1.3         0.2 versicolor
4          4.6         3.1          1.5         0.2  virginica
5          5.0         3.6          1.4         0.2 versicolor
6          5.4         3.9          1.7         0.4     setosa
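To look at the same data yourself, a minimal sketch using the copy of iris that ships with base R (the datasets package). Note that in the bundled copy the first six rows are all setosa, so the mixed species labels on the slide appear to be illustrative rather than a literal printout.

```r
# Load the iris data frame bundled with base R.
data(iris)

# First six rows: four numeric measurements plus the Species label.
head(iris)

# Column types: four numeric features and one factor with three levels.
str(iris)

# How many flowers of each species are in the data set.
species_counts <- table(iris$Species)
species_counts
```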

External Iris dataset (R example) Can we predict which species
from the four features given? Yes, with the following framework:

External Iris dataset (R example)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2     setosa
3          4.7         3.2          1.3         0.2 versicolor
4          4.6         3.1          1.5         0.2  virginica
5          5.0         3.6          1.4         0.2 versicolor
6         15.4        13.9         21.7         7.4     setosa
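The slide does not say why row 6 changes here. One reading, and it is only an assumption, is that it stands in for an observation far outside anything the model has seen. Under that assumption, a quick sketch of checking a new row against the ranges in the bundled iris data (the name `suspect` is made up for illustration):

```r
data(iris)
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Row 6 as it appears on the slide (hypothetical new observation).
suspect <- c(Sepal.Length = 15.4, Sepal.Width = 13.9,
             Petal.Length = 21.7, Petal.Width = 7.4)

# Largest value observed for each feature in the bundled data set.
observed_max <- sapply(iris[features], max)

# TRUE where the new observation exceeds everything seen so far.
out_of_range <- suspect > observed_max
out_of_range
```

A range check like this is crude, but it is often enough to catch a row that no model trained on the original data should be trusted to classify.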

External Poor person’s k-means clustering
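The slide itself is a picture, so what follows is not the speaker's code. It is a minimal sketch of the standard k-means loop (Lloyd's algorithm): pick k starting centers, assign each row to its nearest center, move each center to the mean of its rows, and repeat until assignments stop changing. The function name `simple_kmeans` and its arguments are hypothetical.

```r
# Minimal Lloyd's-algorithm sketch; simple_kmeans is a hypothetical helper.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k rows chosen at random.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every center (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Index of the nearest center for each row.
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no row changes cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each center to the mean of its rows; leave it alone if it has none.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
data(iris)
fit <- simple_kmeans(iris[, 1:4], k = 3)

# Compare the unsupervised clusters with the known species labels.
table(fit$cluster, iris$Species)
```

The result depends on the random starting rows, which is one reason the built-in `kmeans()` offers an `nstart` argument for multiple restarts.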

External

External Where K-means fails
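The failure cases on the slide are not recoverable from the transcript. One well-known case, offered here as an assumed example, is clusters that are not compact blobs. With two concentric rings, k-means with k = 2 can only split the plane with a straight line, so it cannot separate the inner ring from the outer one.

```r
set.seed(1)
n <- 200

# Two noisy concentric rings: radius about 1 and radius about 5.
theta <- runif(2 * n, 0, 2 * pi)
radius <- rep(c(1, 5), each = n) + rnorm(2 * n, sd = 0.1)
rings <- cbind(x = radius * cos(theta), y = radius * sin(theta))
truth <- rep(1:2, each = n)

fit <- kmeans(rings, centers = 2, nstart = 10)

# Fraction of points whose cluster matches their ring,
# taking the better of the two possible label assignments.
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster != truth))
agreement

# Plotting makes the problem obvious: each cluster is half of both rings.
# plot(rings, col = fit$cluster, asp = 1)
```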

External K-means in R (three steps)
data(iris)
features <- iris[, names(iris) != "Species"]  # drop the label column
kmeans(features, centers = 3)
(Optimization left as an exercise)
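A runnable variant of the slide's call, as a sketch. `kmeans()` lives in the stats package that loads with R, so no extra library is needed. The seed, the `nstart` value, and the range of k tried below are arbitrary choices, not something the talk specifies.

```r
data(iris)
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# k-means starts from random centers, so fix the seed and use several restarts.
set.seed(2012)
fit <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate clusters against the species labels the algorithm never saw.
confusion <- table(cluster = fit$cluster, species = iris$Species)
confusion

# One simple take on the "optimization" exercise: total within-cluster
# sum of squares for several values of k, to look for an elbow.
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
wss
```

Cluster numbers are arbitrary labels, so read the table by which species dominates each row rather than expecting cluster 1 to mean any particular species.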

External Questions?


More Decks by Thomson Nguyen
