Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm

PyCon 2014
April 13, 2014


This presentation will give a brief overview of machine learning, the k-nearest neighbor algorithm, and scikit-learn. Sometimes developers need to make decisions even when they don't have all of the required information. Machine learning attempts to solve this problem by using known data (a training sample) to make predictions about the unknown. For example, a user usually doesn't tell Amazon explicitly what type of book they want to read, but based on the user's purchase history and demographics, Amazon can infer what they might like to read.

Scikit-learn implements the k-nearest neighbor algorithm and lets developers make such predictions. Using training data, one could infer what type of food, TV show, or music a user prefers. In this presentation we will introduce the k-nearest neighbor algorithm and discuss when one might use it.



Transcript

  1. Know Thy Neighbor: An Introduction to Scikit-Learn and k-NN
     Portia Burton, PLB Analytics, www.github.com/pkafei
  2. About Me:
     • Organizer of the Portland Data Science group
     • Volunteer at HackOregon
     • Founder of PLB Analytics
  3. What We Will Cover Today:
     1. Brief intro to machine learning
     2. Overview of scikit-learn
     3. The k-nearest neighbor algorithm
     4. Demo of scikit-learn and k-NN
  4. Machine Learning

  5. Machine Learning
     • The algorithm learns from the data
  6. What Is Machine Learning? Algorithms use data to…
     • Create predictive models
     • Classify unknown entities
     • Discover patterns
  7. Basic Workflow of Machine Learning

  8. 70%: Clean and standardize data
     20%: Preprocess, train, and validate
     10%: Analyze and visualize
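Not from the talk, but a minimal sketch of that workflow in scikit-learn (the iris dataset here is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# "Clean and standardize": scale each feature to zero mean, unit variance
X = StandardScaler().fit_transform(iris.data)

# "Preprocess, train, validate": hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)
clf = KNeighborsClassifier().fit(X_train, y_train)

# "Analyze and visualize": here, just report held-out accuracy
print(clf.score(X_test, y_test))
```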
  9. Scikit-Learn

  10. What is scikit-learn?
      • Python machine learning package
      • Great documentation
      • Has built-in datasets (e.g. the Boston housing data)
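As a quick illustration (my addition; I use the iris dataset that appears later in the demo), the built-in datasets load with a single call:

```python
from sklearn.datasets import load_iris

iris = load_iris()          # ships with scikit-learn, no download needed
print(iris.data.shape)      # (150, 4): 150 samples, 4 features each
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```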
  11. (image slide)

  12. Many companies use Scikit-Learn

  13. (image slide)
  14. Are You a Recipe? Yum.
      • Distinguishes 'recipe' notes from 'work' notes
      • Suggesting notebooks is a classification problem
      • Implements the naïve Bayes classification algorithm
  15. Naïve Bayes Classification: the "naive" assumption of independence between every pair of features
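A minimal sketch (my addition, using the Gaussian variant) of what that looks like in scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Each feature is treated as independent of every other feature,
# conditioned on the class label
model = GaussianNB().fit(iris.data, iris.target)
print(model.predict(iris.data[:3]))   # class labels for the first 3 samples
```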
  16. Supervised vs. Unsupervised Learning

  17. Unsupervised Learning: data points are not labeled with outcomes; patterns are found by the algorithm.
  18. Supervised Learning: when your samples are labeled
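To make the contrast concrete, a small sketch (my addition): a classifier is handed the labels, a clustering algorithm is not:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()

# Supervised: the outcome labels (iris.target) are part of the input
clf = KNeighborsClassifier().fit(iris.data, iris.target)

# Unsupervised: only the observations are given; the algorithm must
# find structure (here, three clusters) on its own
km = KMeans(n_clusters=3, random_state=0).fit(iris.data)
print(km.labels_[:10])   # cluster assignments discovered without labels
```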

  19. Theoretical Data Model for Supervised Learning (diagram labeled "Observations")
  20. Remember to… keep your sample size high

  21. …and don't forget to keep your feature set low

  22. Examples of Supervised Learning

  23. Handwriting Analysis

  24. Spam Filters

  25. k-NN: the k-Nearest Neighbor algorithm
      – Often called the simplest machine learning algorithm
      – A "lazy" algorithm: it doesn't run computations on the dataset until you give it a new data point to test (see the sketch below)
      – Our example uses k-NN for supervised learning
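A minimal sketch of k-NN in scikit-learn (toy data of my own, not the talk's): note that fit() is cheap and predict() does the neighbor search, which is what makes the algorithm "lazy":

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per sample, two classes
X_train = [[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# "Lazy" learning: fit() essentially just stores the training data;
# the real computation happens at predict() time
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.predict([[1, 0], [5, 4]]))   # -> [0 1]
```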
  26. Mystery Fruit (image: an unknown fruit marked "?")

  27. Majority Vote
      • Equal weight: each k-NN neighbor's vote counts equally
      • Distance weight: each k-NN neighbor's vote is weighted by its distance
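Both voting schemes are exposed through the `weights` parameter of KNeighborsClassifier; here is a small example of my own where the two schemes disagree:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

# Equal weight: each of the k nearest neighbors gets one vote
uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)

# Distance weight: closer neighbors count for more (votes scale as 1/distance)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# The 3 nearest neighbors of 1.8 are 2 (class 1), 1 and 0 (both class 0):
# the plain majority picks class 0, but the much closer neighbor wins
# once votes are weighted by distance.
print(uniform.predict([[1.8]]))    # -> [0]
print(distance.predict([[1.8]]))   # -> [1]
```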
  28. How k-NN Works

  29. Downsides of k-NN
      • Since there is minimal training, there is a high computational cost when testing new data
      • Correlation can be falsely high (individual data points can be given too much weight)
  30. Live demo time!

  31. Our Data Set:
      • Typical!
      • Multivariate data set created in 1936
      • Analyzed by Sir Ronald Fisher
      • Collected by Edgar Anderson
  32. Live coding demo: the data set (images of Iris virginica, Iris versicolor, and Iris setosa, with the petal labeled)
  33. The plot from the use case (scatter plot of sepal length (cm) vs. sepal width (cm), showing training data and test data)
  34. Example data points for each iris species:

      Sepal length (x-axis) | Sepal width (y-axis) | Species
      5.1                   | 3.5                  | I. setosa
      5.5                   | 2.3                  | I. versicolor
      6.7                   | 2.5                  | I. virginica
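A hedged reconstruction of the live demo (my sketch; the speaker's actual demo code is not in the transcript), classifying the table's example points from sepal measurements alone:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]   # sepal length and sepal width (the plot's two axes)
y = iris.target

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Classify the example points from the table above
samples = [[5.1, 3.5], [5.5, 2.3], [6.7, 2.5]]
for label in clf.predict(samples):
    print(iris.target_names[label])   # predicted species for each point
```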
  35. References:
      http://www.solver.com/xlminer/help/k-nearest-neighbors-prediction-example
      http://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
      http://scikit-learn.org/stable/modules/neighbors.html
      http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html
      http://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
      http://stackoverflow.com/questions/2620343/what-is-machine-learning
  36. References:
      http://blog.evernote.com/tech/2013/01/22/stay-classified/
      http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
      http://en.wikipedia.org/wiki/Iris_flower_data_set
      http://en.wikipedia.org/wiki/Support_vector_machine
  37. Extra Slides

  38. Theoretical data model for unsupervised learning (diagram):
      • The "outcomes" are our observations; this is what is given to the algorithm
      • The underlying variables are unknown to us
      • Output of the algorithm: relationships among the "outcomes", e.g. clusters of data points