Slide 1

Slide 1 text

Know Thy Neighbor: An Introduction to Scikit-Learn and k-NN
Portia Burton
PLB Analytics
www.github.com/pkafei

Slide 2

Slide 2 text

About Me:
• Organizer of the Portland Data Science group
• Volunteer at HackOregon
• Founder of PLB Analytics

Slide 3

Slide 3 text

What We Will Cover Today
1. Brief intro to machine learning
2. Overview of scikit-learn
3. Explanation of the k-Nearest Neighbors algorithm
4. Demo of scikit-learn and k-NN

Slide 4

Slide 4 text

Machine Learning

Slide 5

Slide 5 text

Machine Learning
• The algorithm learns from the data

Slide 6

Slide 6 text

What Is Machine Learning?
Algorithms use data to…
• Create predictive models
• Classify unknown entities
• Discover patterns

Slide 7

Slide 7 text

Basic Workflow of Machine Learning

Slide 8

Slide 8 text

70% • Clean and standardize data
20% • Preprocess, train, validate
10% • Analyze and visualize

Slide 9

Slide 9 text

Scikit-Learn

Slide 10

Slide 10 text

What is scikit-learn?
• Python machine learning package
• Great documentation
• Has built-in datasets (e.g., Boston housing market)
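A minimal sketch of loading one of those built-in datasets. (Note: the Boston housing dataset mentioned above was removed in scikit-learn 1.2, so this sketch loads the iris dataset used later in this talk instead.)

```python
# Load a built-in scikit-learn dataset and inspect its shape.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target_names)  # the three iris species labels
```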

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Many companies use Scikit-Learn

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Are You a Recipe? Yum.
• Distinguishes 'recipe' notes from 'work' notes
• Suggesting notebooks is a classification problem
• Implements the naive Bayes classification algorithm

Slide 15

Slide 15 text

Naïve Bayes Classification
The "naive" assumption: independence between every pair of features.
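Evernote's classifier isn't public, so here is a hedged toy sketch of naive Bayes classification in scikit-learn, using made-up two-feature data in place of note text:

```python
# Fit a Gaussian naive Bayes classifier on two well-separated classes,
# then classify two unseen points.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # class labels, e.g. 'work' vs 'recipe'

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.2, 1.9], [5.5, 8.5]]))  # -> [0 1]
```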

Slide 16

Slide 16 text

Supervised vs. Unsupervised Learning

Slide 17

Slide 17 text

Unsupervised Learning
Data points are not labeled with outcomes; patterns are found by the algorithm.
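A small illustrative sketch (not from the slides): k-means clustering is one such algorithm; it is given no labels at all and discovers the groups itself.

```python
# k-means finds two clusters in unlabeled data; which cluster gets
# which id (0 or 1) is arbitrary, since no labels were provided.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # first two points share a cluster, last two share the other
```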

Slide 18

Slide 18 text

Supervised Learning
When your samples are labeled

Slide 19

Slide 19 text

Theoretical Data Model for Supervised Learning
(Diagram: observations)

Slide 20

Slide 20 text

Remember to…
Keep your sample size high

Slide 21

Slide 21 text

And don't forget to…
Keep your feature set low

Slide 22

Slide 22 text

Examples of Supervised Learning

Slide 23

Slide 23 text

Handwriting Analysis

Slide 24

Slide 24 text

Spam Filters

Slide 25

Slide 25 text

k-NN
• k-Nearest Neighbors algorithm
– One of the simplest machine learning algorithms
– A lazy algorithm: it doesn't run computations on the dataset until you give it a new data point to test
– Our example uses k-NN for supervised learning
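The lazy behavior described above can be sketched with scikit-learn's `KNeighborsClassifier` on toy one-dimensional data (this toy data is mine, not the talk's):

```python
# fit() essentially just stores the training data; the real work
# (finding neighbors, voting) happens at predict() time.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# For the query 1.1, the 3 nearest points are 1, 2, 0 with labels
# 0, 1, 0 -- majority vote gives class 0.
print(knn.predict([[1.1]]))  # -> [0]
```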

Slide 26

Slide 26 text

Mystery  Fruit   ?

Slide 27

Slide 27 text

Majority Vote
• Equal weight: each k-NN neighbor's vote has equal weight
• Distance weight: each k-NN neighbor's vote is weighted by its distance to the query point
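In scikit-learn the two voting schemes correspond to the `weights` parameter of `KNeighborsClassifier`; a small sketch (toy data assumed) where they disagree:

```python
# Same data, same k, different voting: uniform counts votes equally,
# distance weights each vote by 1/distance to the query point.
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# Query 2.5: the 3 nearest neighbors are 2 (label 1), 1 (label 0),
# 0 (label 0).  Uniform voting: two 0s beat one 1.  Distance voting:
# point 2 is much closer (weight 2.0 vs 0.67 + 0.4), so label 1 wins.
print(uniform.predict([[2.5]]))   # -> [0]
print(distance.predict([[2.5]]))  # -> [1]
```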

Slide 28

Slide 28 text

How k-NN Works

Slide 29

Slide 29 text

Downsides of k-NN
• Since there is minimal training, there is a high computational cost when testing new data
• Correlation can appear falsely high (individual data points can be given too much weight)

Slide 30

Slide 30 text

Live demo time!

Slide 31

Slide 31 text

Our Data Set:
• Typical!
• Multivariate data set created in 1936
• Analyzed by Sir Ronald Fisher
• Collected by Edgar Anderson

Slide 32

Slide 32 text

Live coding demo: the data set
(Image: petals of Iris virginica, Iris versicolor, and Iris setosa)
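The slides don't include the demo code itself, but the workflow it walks through can be sketched as: load iris, split into training and test data, fit k-NN, and score it (split ratio, `k`, and `random_state` are my assumptions, not the talk's):

```python
# End-to-end k-NN on the iris data set: split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"test accuracy: {knn.score(X_test, y_test):.2f}")
```

k-NN typically scores very well on iris because the three species form fairly compact clusters in feature space.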

Slide 33

Slide 33 text

The plot from the use case
(Plot: sepal length (cm) vs. sepal width (cm), showing training data and test data)

Slide 34

Slide 34 text

Example data points for each iris species

Sepal length (x-axis) | Sepal width (y-axis) | Species
5.1                   | 3.5                  | I. setosa
5.5                   | 2.3                  | I. versicolor
6.7                   | 2.5                  | I. virginica

Slide 35

Slide 35 text

References:
http://www.solver.com/xlminer/help/k-nearest-neighbors-prediction-example
http://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
http://scikit-learn.org/stable/modules/neighbors.html
http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html
http://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
http://stackoverflow.com/questions/2620343/what-is-machine-learning

Slide 36

Slide 36 text

References:
http://blog.evernote.com/tech/2013/01/22/stay-classified/
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
http://en.wikipedia.org/wiki/Iris_flower_data_set
http://en.wikipedia.org/wiki/Support_vector_machine

Slide 37

Slide 37 text

Extra Slides

Slide 38

Slide 38 text

Theoretical data model for unsupervised learning
• The "outcomes" are our observations; this is what is given to the algorithm.
• The underlying variables are unknown to us.
• Output of the algorithm: relationships among the "outcomes", e.g. clusters of data points.