
# Introduction to Classification

Lecture for the 95865 Unstructured Data Analysis course in Fall 2017.

November 14, 2017

## Transcript


8. ### Image Tagging as Classification
Each input image x1, x2, …, xn is represented by its raw pixel values (e.g. 200, 255, 101, 205, 100, …).
Image sources: https://static.pexels.com/photos/126407/pexels-photo-126407.jpeg, https://static.pexels.com/photos/54632/cat-animal-eyes-grey-54632.jpeg, https://c1.staticflickr.com/4/3645/3523440998_474d43ddc6_b.jpg
9. ### Image Tagging as Classification
Each image xi is paired with a label: y1 = cat, y2 = cat, …, yn = spiderman.

12. ### Classification Essentials
Labeled data: { (x1, y1), (x2, y2), … }
Types of labels:
• Binary: yi ∈ { 0, 1 }
• Multi-class: yi ∈ { cat, spiderman, … }
• Multi-label: yi ∈ { {cat, feline}, {…} }
13. ### Classification Essentials
Labeled data { (x1, y1), (x2, y2), … } → Classification Model
14. ### Classification Essentials
Labeled data { (x1, y1), (x2, y2), … } → Classification Model → labels. The model may be:
• Linear function
• Tree-based
• Nearest-neighbor
• …
17. ### A Simple Classifier — kNN
Training Data { (x1, y1), (x2, y2), … }
[Scatter plot with an unlabeled test point marked “?”]
19. ### A Simple Classifier — kNN
k = 1 — majority vote among the k nearest neighbors. Test point: xi, yi = ?
20. ### A Simple Classifier — kNN
k = 2 — majority vote; break ties randomly. Test point: xi, yi = ?
21. ### A Simple Classifier — kNN
k = 3 — majority vote; break ties randomly. Test point: xi, yi = ?
22. ### A Simple Classifier — kNN
Classifier decision boundary. If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of Kk/K. Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27. In Figure 2.28, we show the results of applying the K-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.
[Figure 2.28: Plot of 200 data points from the oil data set showing values of x6 plotted against x7, where the red, green, and blue points correspond to the ‘laminar’, ‘annular’, and ‘homogeneous’ classes, respectively. Also shown are the classifications of the input space given by the K-nearest-neighbour algorithm for K = 1, 3, and 31.]
23. ### A Simple Classifier — kNN
Classifier decision boundary (same excerpt and figure as the previous slide) — note the non-linear decision boundary.
24. ### A Simple Classifier — kNN
If the optimal classifier has error e, then the 1-NN classifier (k = 1) has asymptotic error less than 2e (Cover and Hart, 1967).
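The majority-vote rule from the preceding slides can be sketched in a few lines of NumPy. The function name, the toy data, and the tie-breaking seed are choices of this sketch, not from the lecture.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3, rng=None):
    """Classify x_test by majority vote among its k nearest training points.

    Ties are broken at random, as on the slides.
    """
    rng = rng or np.random.default_rng(0)
    # Euclidean distance from the test point to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    neighbor_labels = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(neighbor_labels, return_counts=True)
    winners = labels[counts == counts.max()]
    return rng.choice(winners)  # random tie-break when votes are equal

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.9, 2.0]), k=3))  # 1
```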

26. ### A Linear Classifier — SVM
Assumption: the data is linearly separable.
{ (x1, t1), (x2, t2), … },  ti ∈ { +1, -1 }
[Scatter plot over features w1, w2 with a linear separator]
31. ### A Linear Classifier — SVM
Margin: the perpendicular distance between the separator and the nearest data point.
32. ### A Linear Classifier — SVM
Goal: find the maximum-margin linear separator.

33. ### A Linear Classifier — SVM
Goal: find the maximum-margin linear separator. [Figure: the maximum-margin separator]
34. ### A Linear Classifier — SVM
How do we handle non-linearly separable data?
[Map the data through a non-linear feature mapping ϕ(·), illustrated in 2-D]

42. ### Kernels
Problem: the non-linear mapping ϕ can be extremely large.
Solution: the kernel trick. (oneweirdkerneltrick.com)

44. ### The Kernel Trick
A valid kernel function K(x, y) implicitly defines a feature mapping ϕ. It is much easier to define K(x, y) directly than to construct ϕ.
45. ### Examples of Valid Kernels
Radial basis function: K(x, x′) = exp(−‖x − x′‖² / (2σ²))
Polynomial: K(x, x′) = (xᵀx′ + c)ᵈ
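Both kernels are one-liners, and the degree-2 polynomial kernel with c = 0 (an assumption made only for this check) lets us verify the claim that a valid kernel implicitly computes an inner product in some feature space — here ϕ(x) = (x1², √2·x1·x2, x2²), which we never need in practice. A standalone sketch:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Radial basis function: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def poly_kernel(x, z, c=1.0, d=2):
    # Polynomial: K(x, x') = (x^T x' + c)^d
    return (np.dot(x, z) + c) ** d

def phi(x):
    # Explicit feature map matching the degree-2 polynomial kernel, c = 0
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(poly_kernel(x, z, c=0.0, d=2))  # 121.0
print(np.dot(phi(x), phi(z)))         # same value, up to float rounding
```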

47. ### Generative Models
Model the unknown data-generating process. Learn the parameters of the model from the data.
48. ### Example: Finding Progression Stages
Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.
Data — beer ratings by users on RateBeer.com
49. ### Example: Finding Progression Stages
Question — how do beer drinkers “progress” over time?
50. ### Example: Finding Progression Stages
A generative model — maximize the data likelihood:
x_ij ∼ Multinomial(Θ(c_i, s_j)),  Θ(c_i, s_j) ∼ Dirichlet(·)
51. ### Example: Finding Progression Stages
[Figure; axis label: “vocabulary”]

59. ### Naive Bayes
Generating a document xi:
1. Pick a label yi from {ham, spam} with probability θ = P(spam)
60. ### Naive Bayes
Generating a document xi:
1. Pick a label yi from {ham, spam} with probability θ = P(spam)
[Word-probability tables P(wj | ham) and P(wj | spam) shown]

62. ### Naive Bayes
Generating a document xi:
1. Pick a label yi from {ham, spam} with probability θ = P(spam)
2. For each possible word w1 … wM, include it in the document with probability P(wj | yi)
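The two-step generative process above can be simulated directly. The vocabulary and all probabilities below are invented for this sketch; they are not parameters from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters
theta = 0.4                                   # P(spam)
vocab = ["winner", "million", "meeting", "lunch"]
p_word = {                                    # P(word included | label)
    "spam": np.array([0.8, 0.7, 0.1, 0.05]),
    "ham":  np.array([0.05, 0.1, 0.6, 0.5]),
}

def generate_document():
    # Step 1: pick the label y_i with probability theta = P(spam)
    label = "spam" if rng.random() < theta else "ham"
    # Step 2: include each word w_j independently with P(w_j | y_i)
    present = rng.random(len(vocab)) < p_word[label]
    words = [w for w, keep in zip(vocab, present) if keep]
    return label, words

label, words = generate_document()
print(label, words)
```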
63. ### Naive Bayes
How many coins? One for θ = P(spam), plus one per word for P(wj | ham) and one per word for P(wj | spam).

65. ### Naive Bayes — Important Assumption
P(w1, w2, …, wM | label) = P(w1 | label) … P(wM | label)
66. ### Naive Bayes — Important Assumption
P(w1, w2, …, wM | label) = P(w1 | label) … P(wM | label)
The presence of each word within a document is conditionally independent of the other words, given the label.

68. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
I observe the word “million” in the document.

69. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
I observe the word “million” in the document. How do the chances of observing the word “winner” change?
70. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
Knowing that “million” is observed changes our degree of uncertainty about observing “winner”.

71. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
Knowing that “million” is observed changes our degree of uncertainty about observing “winner”. The events are not independent.

73. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
I know that the document is not spam. I observe the word “million” in the document.
74. ### Conditional Independence
[Graph: SPAM → “MILLION”, SPAM → “WINNER”]
I know that the document is not spam. I observe the word “million” in the document. Does that affect the chances of observing “winner”?
75. ### Naive Bayes — Fitting Parameters
P(yi = spam) = θ = (number of spam documents) / (total number of documents)
P(wj | spam) = (number of times word wj occurs in spam) / (total number of words labeled spam)
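Both estimates are simple counts. A standalone sketch on a tiny invented corpus (the documents and words below are illustrative only):

```python
from collections import Counter

# Tiny labeled corpus, invented for illustration
docs = [
    ("spam", ["winner", "million", "million"]),
    ("spam", ["million", "prize"]),
    ("ham",  ["lunch", "meeting"]),
    ("ham",  ["meeting", "notes", "lunch"]),
]

# theta = (# spam documents) / (total # documents)
theta = sum(1 for y, _ in docs if y == "spam") / len(docs)

# P(w | spam) = (# times w occurs in spam) / (total words labeled spam)
spam_words = [w for y, ws in docs for w in ws if y == "spam"]
counts = Counter(spam_words)
p_million_given_spam = counts["million"] / len(spam_words)

print(theta)                 # 0.5  (2 spam docs out of 4)
print(p_million_given_spam)  # 0.6  (3 of the 5 spam words)
```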
76. ### Naive Bayes — Prediction
“My father was a very wealthy cocoa merchant in Abidjan, the economic capital of Ivory Coast before he was poisoned to death by his business associates on one of their outing to discus on a business deal.” (http://www.hoax-slayer.net/wumi-abdul-advance-fee-scam/)
P(y = spam) ∝ P(“my” | spam) … P(“deal” | spam) P(spam)
P(y = ham) ∝ P(“my” | ham) … P(“deal” | ham) P(ham)
77. ### Naive Bayes — Smoothing
What if we observe a word in a document that we never saw during training?
P(“Ivory” | spam) = 0.0 ⇒ P(y = spam) ∝ P(“my” | spam) … P(“Ivory” | spam) P(spam) = 0.0
78. ### Naive Bayes — Smoothing
Smooth the word counts:
P(wj | spam) = (number of times word wj occurs in spam + 1) / (total number of words labeled spam + |V|)
where |V| is the number of unique words in the training data.
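Add-one smoothing guarantees every word gets nonzero probability, so one unseen word no longer zeroes out the whole product. A standalone sketch (the word lists are invented for illustration):

```python
from collections import Counter

spam_words = ["winner", "million", "million", "prize"]
vocab = {"winner", "million", "prize", "lunch", "meeting", "ivory"}

counts = Counter(spam_words)

def p_word_given_spam(word):
    # Add-one (Laplace) smoothing:
    # (count of w in spam + 1) / (total spam words + |V|)
    return (counts[word] + 1) / (len(spam_words) + len(vocab))

print(p_word_given_spam("million"))  # (2+1)/(4+6) = 0.3
print(p_word_given_spam("ivory"))    # (0+1)/(4+6) = 0.1, no longer zero
```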

82. ### Evaluation Goals
Model selection — finding the best-performing model (e.g. kNN vs. NB)
“Hyperparameter” selection — finding the best hyperparameters for a given model (e.g. k = 2 vs. k = 3 for kNN)
83. ### Evaluation Goals
Goal: minimize the error on future, unobserved data.
Model selection — finding the best-performing model
“Hyperparameter” selection — finding the best hyperparameters for a given model
84. ### Method I — Train, Validation, Test
• Split data randomly into train, validation, and test sets
• The split can be stratified based on the label
• See sklearn.model_selection.train_test_split
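A minimal sketch of such a split using sklearn's `train_test_split`, as the slide suggests. The toy data, the 60/20/20 proportions, and the random seed are this sketch's choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# Carve off a held-out test set first, then split the rest into
# train/validation; stratify=y keeps the label proportions equal.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```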
85. ### Method I — Train and Test
Can we do this without wasting valuable training data?
86. ### Method II — k-Fold Cross-Validation
• Split data randomly into train and test sets
• Split the train data randomly into k equal “folds”
• Train on k − 1 folds, validate on the remaining fold
• Average the k metrics, one from each fold
• See sklearn.model_selection.KFold
87. ### Method II — k-Fold Cross-Validation
[Diagram: the (shuffled) training data divided into Folds 1–5; each of the five rounds holds out a different fold for validation, with the test data kept separate]
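The fold rotation above can be sketched with sklearn's `KFold`; the toy data and seed are this sketch's choices, and a real run would fit a model inside the loop and average the five validation metrics:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data standing in for the (shuffled) training set
X = np.arange(10).reshape(10, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, val_idx in kf.split(X):
    # Each round trains on 4 folds (8 points) and validates on the
    # held-out fold (2 points).
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```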
88. ### Hyperparameter Selection
• Grid search: search a “well-spaced grid” of hyperparameter values
• Performance metrics are averaged over the k validation folds
• Select the hyperparameters with the best performance
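These three bullets are exactly what sklearn's `GridSearchCV` automates. A sketch for kNN; the iris data and the particular grid of k values are stand-ins chosen for this example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A "well-spaced grid" of k; scores are averaged over the 5
# validation folds and the best k is selected automatically.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # the k with the best cross-validated score
```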
89. ### Performance Metrics
Factors driving the choice of a performance metric:
• Data — balanced vs. skewed
• Task — ranking, classification, clustering
• Real-world use case
91. ### Performance Metrics
Skewed data — Dangerous: +1, +1, …, +1, -1, …, -1
92. ### Performance Metrics
Skewed data — 90% of the labels are +1, 10% are -1.
93. ### Performance Metrics
Skewed data — the accuracy of “always guess +1” is 90%!
94. ### Performance Metrics — Confusion Matrix

|             | Actual P       | Actual F       |
|-------------|----------------|----------------|
| Predicted P | True Positive  | False Positive |
| Predicted F | False Negative | True Negative  |
95. ### Performance Metrics — TPR, FPR
True Positive Rate = TP / (TP + FN). Example: the percentage of dangerous objects correctly identified as such.
False Positive Rate = FP / (FP + TN). Example: the percentage of safe objects incorrectly identified as dangerous.
96. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1
[ROC plot axes: TPR vs. FPR, each from 0.0 to 1.0]
97. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8
[ROC plot axes: TPR vs. FPR]
98. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8
Table columns: Thresh. | Pred. | FPR / TPR
99. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8
Thresh. 0.8 → Pred. -1, -1, -1, +1 → FPR / TPR = 0.00 / 0.50
100. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8
Thresh. 0.8 → Pred. -1, -1, -1, +1 → FPR / TPR = 0.00 / 0.50
Thresh. 0.4 → Pred. -1, -1, +1, +1 → FPR / TPR = 0.50 / 0.50
101. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8
Thresh. 0.8 → Pred. -1, -1, -1, +1 → FPR / TPR = 0.00 / 0.50
Thresh. 0.4 → Pred. -1, -1, +1, +1 → FPR / TPR = 0.50 / 0.50
Thresh. 0.3 → Pred. -1, +1, +1, +1 → FPR / TPR = 0.50 / 1.00
102. ### Performance Metrics — Thresholds
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8

| Thresh. | Pred.          | FPR / TPR   |
|---------|----------------|-------------|
| 0.8     | -1, -1, -1, +1 | 0.00 / 0.50 |
| 0.4     | -1, -1, +1, +1 | 0.50 / 0.50 |
| 0.3     | -1, +1, +1, +1 | 0.50 / 1.00 |
| 0.1     | +1, +1, +1, +1 | 1.00 / 1.00 |
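This threshold sweep is easy to check by hand or in code. The standalone sketch below predicts +1 whenever the probability is at least the threshold; note that at threshold 0.8 the single positive prediction hits an actual +1, so the TPR there is 0.50 (that point's recall of 0.50 also appears on the precision/recall slide):

```python
import numpy as np

actual = np.array([-1, +1, -1, +1])
probs  = np.array([0.1, 0.3, 0.4, 0.8])

points = []
for thresh in sorted(probs, reverse=True):
    pred = np.where(probs >= thresh, 1, -1)  # +1 iff prob >= threshold
    tp = np.sum((pred == 1) & (actual == 1))
    fp = np.sum((pred == 1) & (actual == -1))
    fn = np.sum((pred == -1) & (actual == 1))
    tn = np.sum((pred == -1) & (actual == -1))
    fpr, tpr = fp / (fp + tn), tp / (tp + fn)
    points.append((thresh, fpr, tpr))
    print(f"thresh={thresh}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Sweeping the threshold from high to low traces the ROC curve from (0, 0) toward (1, 1).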
104. ### Performance Metrics — Precision/Recall
Precision = TP / (TP + FP). Example: the percentage of objects identified as dangerous that were actually dangerous.
Recall = TP / (TP + FN). Example: the percentage of dangerous objects correctly identified as such.
F1-Score = 2PR / (P + R)
105. ### Performance Metrics — Precision/Recall
Actual: -1, +1, -1, +1 · Prob.: 0.1, 0.3, 0.4, 0.8

| Thresh. | Pred.          | P / R       |
|---------|----------------|-------------|
| 0.8     | -1, -1, -1, +1 | 1.00 / 0.50 |
| 0.4     | -1, -1, +1, +1 | 0.50 / 0.50 |
| 0.3     | -1, +1, +1, +1 | 0.66 / 1.00 |
| 0.1     | +1, +1, +1, +1 | 0.50 / 1.00 |
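The same four-point example can be re-scored with precision, recall, and F1 = 2PR / (P + R). A standalone sketch, again predicting +1 when the probability is at least the threshold:

```python
import numpy as np

actual = np.array([-1, +1, -1, +1])
probs  = np.array([0.1, 0.3, 0.4, 0.8])

rows = []
for thresh in sorted(probs, reverse=True):
    pred = np.where(probs >= thresh, 1, -1)
    tp = np.sum((pred == 1) & (actual == 1))
    fp = np.sum((pred == 1) & (actual == -1))
    fn = np.sum((pred == -1) & (actual == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    rows.append((round(float(precision), 2), round(float(recall), 2)))
    print(f"thresh={thresh}: P={precision:.2f}, R={recall:.2f}, F1={f1:.2f}")
```

Unlike TPR/FPR, precision ignores the true negatives entirely, which is why precision/recall is the preferred view on heavily skewed data.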