Tech Talk at Lookout (24 January 2012)

Thomson Nguyen
February 16, 2012

This is a slide deck (for public release) from a tech talk I gave at Lookout on R and a light introduction to unsupervised learning through k-means clustering.

Transcript

  1. External Introduction to R (And Machine Learning) Tech Talk, 24 January 2012
  2. Wat
    What is R?
    What makes a good model?
    Example: Modeling flowers
    Example: Modeling users
  3. Who’s heard of R?

  4. Who’s heard of R?

  5. Who’s used R?

  6. Who hates R?

  7. History of R in 60 seconds
    Started life as “S”, the FORTRAN-based “statistical programming language”. (1975)
    R began as an implementation of S with lexical scoping semantics inspired by Scheme. (1993)
    Core mainly written in C and Fortran.
  8. Why it sucks
    Legacy stack
    Written by non-engineers (like me): spaghetti code, difficult to maintain, just, just awful
    Slow for data-intensive I/O tasks; concurrency and parallelized tasks are akin to skinning alpacas
  9. Why it’s awesome
    Great for exploratory analysis of raw data
    Rich statistical library of packages
    Hooks into “regular” languages easily: RPy, rsruby gem, Rtalk
  10. Download R Windows R OS X Native RStudio ESS (Emacs Speaks Statistics) Every flavor of Linux R-Vim
  11. What makes a good model?

  12. Good models are

  13. Good models are Precise

  14. Good models are Accurate, Precise

  15. Good models are Accurate, Precise, Generalized

  16.

  17. Example: Modeling TV

  18. All the characters on Friends are kitty cats. (Not accurate or precise)

  19. All the characters on Friends are white-collar felines between the ages of 25 and 30. (Precise, but not accurate)

  20. All the characters on Friends are humans. (Accurate, but not precise)

  21. All the characters on Friends are 20-something raging hedonists who surely live in a rent-controlled apartment in lower Manhattan. (Accurate and precise)
  22. You see what’s going on here...
    A model that’s 100% accurate, precise and generalized doesn’t exist!
    There will always be tradeoffs between all three.
    Your ideal model will surely depend on your risk tolerance for all three:
    Does it need to be fast enough for an MVP? (e.g. real-time referrer analysis)
    Does it need to be absolutely accurate? (e.g. malware classification)
    Does it need to be ultra-granular? (e.g. customer segmentation)
  23. Example: Modeling Flowers

  24. Iris dataset (R example)
    The data looks like this:

      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          5.1         3.5          1.4         0.2     setosa
    2          4.9         3.0          1.4         0.2     setosa
    3          4.7         3.2          1.3         0.2 versicolor
    4          4.6         3.1          1.5         0.2  virginica
    5          5.0         3.6          1.4         0.2 versicolor
    6          5.4         3.9          1.7         0.4     setosa
  25. Iris dataset (R example)
    Can we predict which species from the four features given? Yes, with the following framework:
  26. Iris dataset (R example)

      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          5.1         3.5          1.4         0.2     setosa
    2          4.9         3.0          1.4         0.2     setosa
    3          4.7         3.2          1.3         0.2 versicolor
    4          4.6         3.1          1.5         0.2  virginica
    5          5.0         3.6          1.4         0.2 versicolor
    6         15.4        13.9         21.7         7.4     setosa
  27. Poor person’s k-means clustering

  28.

  29.

  30.

  31.

  32. Where K-means fails

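  33. K-means in R (three steps)
    library(caret)  # optional: kmeans() itself ships with base R's stats package
    data(iris)
    kmeans(iris[, names(iris) != "Species"], centers = 3)  # labels are not passed; k-means is unsupervised
    (Optimization left as an exercise)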
  34. Questions?
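
As a rough sketch of the "K-means in R" slide in runnable form, the following clusters the four iris measurements with base R's `kmeans()` and cross-tabulates the resulting clusters against the known species. The seed, the `nstart = 25` setting, and the variable names (`features`, `fit`, `confusion`) are assumptions chosen for illustration and are not part of the original talk.

```r
# Minimal sketch: k-means on the four numeric iris measurements.
data(iris)

# Drop the Species label; k-means is unsupervised and only sees the features.
features <- iris[, names(iris) != "Species"]

# k-means starts from random centers, so fix the seed for repeatability.
# nstart = 25 (an assumed value) reruns the algorithm from 25 random starts
# and keeps the fit with the lowest total within-cluster sum of squares.
set.seed(42)
fit <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate cluster assignments against the true species.
# Cluster numbers are arbitrary, so compare rows rather than label values.
confusion <- table(cluster = fit$cluster, species = iris$Species)
print(confusion)

# Total within-cluster sum of squares: the quantity k-means tries to minimize.
print(fit$tot.withinss)
```

On this dataset setosa usually ends up in a cluster of its own, while versicolor and virginica overlap somewhat, which gives a small taste of where k-means struggles.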