Tech Talk at Lookout (24 January 2012)

Thomson Nguyen
February 16, 2012

This is a slide deck (for public release) from a tech talk I gave at Lookout on R and a light introduction to unsupervised learning through k-means clustering.

Transcript

  1. Wat

    What is R?
    What makes a good model?
    Example: Modeling flowers
    Example: Modeling users
  2. History of R in 60 seconds

    Started life as “S”, the FORTRAN-based “statistical programming language”. (1975)
    R began as an implementation of S with semantics (lexical scoping) inspired by Scheme. (1993)
    Core mainly written in C and Fortran.
  3. Why it sucks

    Legacy stack
    Written by non-engineers (like me): spaghetti code, difficult to maintain, just, just awful
    Slow for data-intensive I/O tasks; concurrency and parallelized tasks are akin to skinning alpacas
  4. Why it’s awesome

    Great for exploratory analysis of raw data
    Rich statistical library of packages
    Hooks into “regular” languages easily: RPy, rsruby gem, Rtalk
  5. Download R

    Windows R OS X Native RStudio ESS (Emacs Speaks Statistics) Every flavor of Linux R-Vim
  6. All the characters on Friends are

    white-collar felines between the ages of 25 and 30. (Precise, but not accurate)
  7. All the characters on Friends are

    20-something raging hedonists who surely live in a rent-controlled apartment in lower Manhattan. (Accurate and precise)
  8. You see what’s going on here...

    A model that’s 100% accurate, precise and generalized doesn’t exist!
    There will always be tradeoffs between all three.
    Your ideal model will surely depend on your risk tolerance for all three:
    Does it need to be fast enough for an MVP? (e.g. real-time referrer analysis)
    Does it need to be absolutely accurate? (e.g. malware classification)
    Does it need to be ultra-granular? (e.g. customer segmentation)
  9. Iris dataset (R example)

    The data looks like this:

      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          5.1         3.5          1.4         0.2     setosa
    2          4.9         3.0          1.4         0.2     setosa
    3          4.7         3.2          1.3         0.2 versicolor
    4          4.6         3.1          1.5         0.2  virginica
    5          5.0         3.6          1.4         0.2 versicolor
    6          5.4         3.9          1.7         0.4     setosa
  10. Iris dataset (R example)

    Can we predict which species from the four features given?
    Yes, with the following framework:
  11. Iris dataset (R example)

      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          5.1         3.5          1.4         0.2     setosa
    2          4.9         3.0          1.4         0.2     setosa
    3          4.7         3.2          1.3         0.2 versicolor
    4          4.6         3.1          1.5         0.2  virginica
    5          5.0         3.6          1.4         0.2 versicolor
    6         15.4        13.9         21.7         7.4     setosa
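
A minimal sketch of the k-means step that the deck description mentions, applied to the iris data from the slides above. The "framework" referred to on slide 10 is not preserved in this transcript, so this is an assumption about one reasonable approach, not the code shown in the talk. It uses R's built-in `iris` data frame and `kmeans` from the `stats` package; the choice of three centers, the seed, and `nstart = 20` are illustrative.

```r
# Built-in iris data: 150 flowers, four numeric features plus Species.
features <- iris[, 1:4]

# k-means starts from random centers, so fix the seed for repeatability.
set.seed(42)

# Assumption: three clusters, one per species. nstart reruns the algorithm
# from several random starts and keeps the best fit.
fit <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the species labels.
# k-means never sees Species, so the cluster numbers themselves are arbitrary.
comparison <- table(iris$Species, fit$cluster)
print(comparison)
```

Because k-means is unsupervised, the table shows how well the discovered clusters line up with the species labels rather than a prediction accuracy in the supervised sense.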