Slide 1

Slide 1 text

Know Thy Neighbor: An Introduction to Scikit-Learn and k-NN
Portia Burton
PLB Analytics
www.github.com/pkafei

Slide 2

Slide 2 text

About Me:
• Organizer of the Portland Data Science group
• Volunteer at HackOregon
• Founder of PLB Analytics

Slide 3

Slide 3 text

What We Will Cover Today
1. Brief intro to machine learning
2. Overview of scikit-learn
3. Explanation of the k-Nearest Neighbors algorithm
4. Demo of scikit-learn and k-NN

Slide 4

Slide 4 text

Machine Learning

Slide 5

Slide 5 text

Machine Learning
• The algorithm learns from the data

Slide 6

Slide 6 text

What Is Machine Learning?
Algorithms use data to…
• Create predictive models
• Classify unknown entities
• Discover patterns

Slide 7

Slide 7 text

Basic Workflow of Machine Learning

Slide 8

Slide 8 text

70% • Clean and standardize data
20% • Preprocess, train, validate
10% • Analyze and visualize

Slide 9

Slide 9 text

Scikit-Learn

Slide 10

Slide 10 text

What is scikit-learn?
• Python machine learning package
• Great documentation
• Has built-in datasets (e.g., Boston housing market)
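A minimal sketch of loading one of those built-in datasets. (Note: the Boston housing dataset mentioned above was removed in scikit-learn 1.2, so this sketch loads the iris dataset used later in this talk instead.)

```python
# Load a built-in scikit-learn dataset and inspect its shape.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target_names)  # the three iris species labels
```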

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Many companies use Scikit-Learn

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Are You a Recipe? Yum.
• Distinguishes 'recipe' notes from 'work' notes
• Suggesting notebooks is a classification problem
• Implements the naive Bayes classification algorithm

Slide 15

Slide 15 text

Naïve Bayes Classification
The "naive" assumption: independence between every pair of features.
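Evernote's classifier isn't public, so here is a hedged toy sketch of naive Bayes classification in scikit-learn, using made-up two-feature data in place of note text:

```python
# Fit a Gaussian naive Bayes classifier on two well-separated classes,
# then classify two unseen points.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # class labels, e.g. 'work' vs 'recipe'

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.2, 1.9], [5.5, 8.5]]))  # -> [0 1]
```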

Slide 16

Slide 16 text

Supervised vs. Unsupervised Learning

Slide 17

Slide 17 text

Unsupervised Learning
Data points are not labeled with outcomes; patterns are found by the algorithm.
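A small illustrative sketch (not from the slides): k-means clustering is one such algorithm; it is given no labels at all and discovers the groups itself.

```python
# k-means finds two clusters in unlabeled data; which cluster gets
# which id (0 or 1) is arbitrary, since no labels were provided.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # first two points share a cluster, last two share the other
```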

Slide 18

Slide 18 text

Supervised Learning
When your samples are labeled

Slide 19

Slide 19 text

Theoretical Data Model for Supervised Learning
(Diagram: observations)

Slide 20

Slide 20 text

Remember to…
Keep your sample size high

Slide 21

Slide 21 text

And don't forget to…
Keep your feature set low

Slide 22

Slide 22 text

Examples of Supervised Learning

Slide 23

Slide 23 text

Handwriting Analysis

Slide 24

Slide 24 text

Spam Filters

Slide 25

Slide 25 text

k-NN
• k-Nearest Neighbors algorithm
– One of the simplest machine learning algorithms
– A lazy algorithm: it doesn't run computations on the dataset until you give it a new data point to test
– Our example uses k-NN for supervised learning
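The lazy behavior described above can be sketched with scikit-learn's `KNeighborsClassifier` on toy one-dimensional data (this toy data is mine, not the talk's):

```python
# fit() essentially just stores the training data; the real work
# (finding neighbors, voting) happens at predict() time.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# For the query 1.1, the 3 nearest points are 1, 2, 0 with labels
# 0, 1, 0 -- majority vote gives class 0.
print(knn.predict([[1.1]]))  # -> [0]
```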

Slide 26

Slide 26 text

Mystery  Fruit   ?

Slide 27

Slide 27 text

Majority Vote
• Equal weight: each k-NN neighbor's vote has equal weight
• Distance weight: each k-NN neighbor's vote is weighted by its distance to the query point
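In scikit-learn the two voting schemes correspond to the `weights` parameter of `KNeighborsClassifier`; a small sketch (toy data assumed) where they disagree:

```python
# Same data, same k, different voting: uniform counts votes equally,
# distance weights each vote by 1/distance to the query point.
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# Query 2.5: the 3 nearest neighbors are 2 (label 1), 1 (label 0),
# 0 (label 0).  Uniform voting: two 0s beat one 1.  Distance voting:
# point 2 is much closer (weight 2.0 vs 0.67 + 0.4), so label 1 wins.
print(uniform.predict([[2.5]]))   # -> [0]
print(distance.predict([[2.5]]))  # -> [1]
```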

Slide 28

Slide 28 text

How k-NN Works

Slide 29

Slide 29 text

Downsides of k-NN
• Since there is minimal training, there is a high computational cost when testing new data
• Correlation can appear falsely high (individual data points can be given too much weight)

Slide 30

Slide 30 text

Live demo time!

Slide 31

Slide 31 text

Our Data Set:
• Typical!
• Multivariate data set created in 1936
• Analyzed by Sir Ronald Fisher
• Collected by Edgar Anderson

Slide 32

Slide 32 text

Live coding demo: the data set
(Image: petals of Iris virginica, Iris versicolor, and Iris setosa)
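The slides don't include the demo code itself, but the workflow it walks through can be sketched as: load iris, split into training and test data, fit k-NN, and score it (split ratio, `k`, and `random_state` are my assumptions, not the talk's):

```python
# End-to-end k-NN on the iris data set: split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"test accuracy: {knn.score(X_test, y_test):.2f}")
```

k-NN typically scores very well on iris because the three species form fairly compact clusters in feature space.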

Slide 33

Slide 33 text

The plot from the use case
(Plot: sepal length (cm) vs. sepal width (cm), showing training data and test data)

Slide 34

Slide 34 text

Example data points for each iris species

Sepal length (x-axis) | Sepal width (y-axis) | Species
5.1                   | 3.5                  | I. setosa
5.5                   | 2.3                  | I. versicolor
6.7                   | 2.5                  | I. virginica

Slide 35

Slide 35 text

References:
http://www.solver.com/xlminer/help/k-nearest-neighbors-prediction-example
http://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
http://scikit-learn.org/stable/modules/neighbors.html
http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html
http://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
http://stackoverflow.com/questions/2620343/what-is-machine-learning

Slide 36

Slide 36 text

References:
http://blog.evernote.com/tech/2013/01/22/stay-classified/
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
http://en.wikipedia.org/wiki/Iris_flower_data_set
http://en.wikipedia.org/wiki/Support_vector_machine

Slide 37

Slide 37 text

Extra Slides

Slide 38

Slide 38 text

Theoretical data model for unsupervised learning
• The "outcomes" are our observations; this is what is given to the algorithm.
• The underlying variables are unknown to us.
• Output of the algorithm: relationships among the "outcomes", e.g. clusters of data points.