
# Introduction to Classification

Lecture for the 95865 Unstructured Data Analysis course in Fall 2017.

November 14, 2017

## Transcript

6. ### Image Tagging as Classification

[Figure: three example images (https://static.pexels.com/photos/126407/pexels-photo-126407.jpeg, https://static.pexels.com/photos/54632/cat-animal-eyes-grey-54632.jpeg, https://c1.staticflickr.com/4/3645/3523440998_474d43ddc6_b.jpg), each represented by its raw pixel values (200, 255, 101, 205, 100, …) as a feature vector; the images become x1, x2, …, xn.]

7. ### Image Tagging as Classification

[Same figure, now with labels: y1 = cat, y2 = cat, …, yn = spiderman.]

10. ### Classification Essentials

Labeled data { (x1, y1), (x2, y2), … }

Types of labels:
• Binary: yi ∈ { 0, 1 }
• Multi-class: yi ∈ { cat, spiderman, … }
• Multi-label: yi ∈ { {cat, feline}, … }

11. ### Classification Essentials

Labeled data { (x1, y1), (x2, y2), … } is fed to a Classification Model.

12. ### Classification Essentials

Labeled data { (x1, y1), (x2, y2), … } is fed to a Classification Model:
• Linear function
• Tree-based
• Nearest-neighbor
• …

15. ### A Simple Classifier — kNN

Training Data { (x1, y1), (x2, y2), … }

[Figure: the training points plotted in feature space, with a new test point marked “?”.]
17. ### A Simple Classifier — kNN

k = 1: classify the test point (xi, yi = ?) by a majority vote among its k nearest training points.

18. ### A Simple Classifier — kNN

k = 2: majority vote among the 2 nearest training points; break ties randomly.

19. ### A Simple Classifier — kNN

k = 3: majority vote among the 3 nearest training points; break ties randomly.
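To make the vote and the tie-breaking concrete, here is a minimal kNN sketch in plain numpy; the function, toy data, and the Euclidean-distance choice are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3, seed=0):
    """Classify x_test by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)    # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    winners = labels[counts == counts.max()]            # labels tied for the most votes
    return np.random.default_rng(seed).choice(winners)  # break ties randomly

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array(["cat", "cat", "spiderman", "spiderman"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> cat
```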
20. ### A Simple Classifier — kNN

Classifier decision boundary. From Bishop, Pattern Recognition and Machine Learning: “If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of K_k/K. Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27. In Figure 2.28, we show the results of applying the K-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.”

[Figure 2.28: 200 points from the oil data set, x6 plotted against x7, with red, green, and blue points for the ‘laminar’, ‘annular’, and ‘homogeneous’ classes; panels show the K-nearest-neighbour classification of the input space for K = 1, 3, and 31.]

21. ### A Simple Classifier — kNN

Classifier decision boundary (same figure as above). Note that kNN gives a non-linear decision boundary.
22. ### A Simple Classifier — kNN

If the optimal classifier has error e, then 1-NN (k = 1) has error less than 2e asymptotically (Cover and Hart, 1967).

24. ### A Linear Classifier — SVM

Assumption: the data { (x1, t1), (x2, t2), … }, with ti ∈ { +1, -1 }, is linearly separable.

[Figure: the two classes plotted in 2-D with a linear separator between them; the separator is parameterized by weights w1, w2.]
29. ### A Linear Classifier — SVM

Margin: the perpendicular distance between the separator and the nearest data point.

30. ### A Linear Classifier — SVM

Goal: find the maximum-margin linear separator.

31. ### A Linear Classifier — SVM

Goal: find the maximum-margin linear separator. [Figure: the maximum-margin separator shown on the data.]
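As a hedged sketch of this goal in code: scikit-learn's SVC with a linear kernel fits a maximum-margin separator. The lecture points to sklearn elsewhere but does not name this class; the toy data and the large-C hard-margin approximation are assumptions here.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
t = np.array([-1, -1, +1, +1])      # ti in {+1, -1}, linearly separable

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, t)
print(clf.coef_, clf.intercept_)    # the separator: w1*x1 + w2*x2 + b = 0
print(clf.support_vectors_)         # the points that determine the margin
```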
32. ### A Linear Classifier — SVM

How do we handle non-linearly separable data?

[Figure: data that is not linearly separable in 2-D becomes linearly separable after a feature mapping ϕ(·) into a higher-dimensional space.]

40. ### Kernels

Problem: the non-linear mapping ϕ can be extremely large.

Solution: the Kernel Trick. oneweirdkerneltrick.com

42. ### The Kernel Trick

A valid kernel function K(x, x′) implicitly defines a feature mapping ϕ. It is much easier to define K(x, x′) directly than to construct ϕ.
43. ### Examples of Valid Kernels

Radial basis function: K(x, x′) = exp(−‖x − x′‖² / 2σ²)

Polynomial: K(x, x′) = (xᵀx′ + c)ᵈ
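Both kernels can be written out directly in a few lines of numpy; the parameter values σ, c, and d below are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def poly_kernel(x, xp, c=1.0, d=3):
    """K(x, x') = (x^T x' + c)^d"""
    return (x @ xp + c) ** d

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x, xp))   # exp(-1) ~ 0.368
print(poly_kernel(x, xp))  # (0 + 1)^3 = 1.0
```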

45. ### Generative Models

Model the unknown data-generating process; learn the parameters of the model from the data.
46. ### Example: Finding Progression Stages

Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.

Data — Beer ratings by users on RateBeer.com

47. ### Example: Finding Progression Stages

Question — How do beer drinkers “progress” over time?

48. ### Example: Finding Progression Stages

A generative model — maximize the data likelihood:

xij ∼ Multinomial(Θ(ci, sj))
Θ(ci, sj) ∼ Dirichlet(·)

49. ### Example: Finding Progression Stages

[Figure from Yang et al., WWW 2014.]

52. ### Naive Bayes

Documents x1, x2, …, xn, where each document is a set of words: xi = { w1, w2, …, wm }.

[Figure: word cloud of the vocabulary of a spam corpus — “money”, “bank”, “account”, “deposit”, “million”, “USD”, “transfer”, and so on.]

53. ### Naive Bayes

Documents x1, x2, …, xn with labels y1, y2, …, yn ∈ { ham, spam }; as before, xi = { w1, w2, …, wm }.

[Same word-cloud figure.]

57. ### Naive Bayes

Generating a document xi:
1. Pick a label yi from { ham, spam }: yi = spam with probability θ = P(spam).

58. ### Naive Bayes

Generating a document xi:
1. Pick a label yi from { ham, spam }: yi = spam with probability θ = P(spam).
(The model's other parameters are the per-word probabilities P(wj | ham) and P(wj | spam).)

60. ### Naive Bayes

Generating a document xi:
1. Pick a label yi from { ham, spam }: yi = spam with probability θ = P(spam).
2. For each possible word wj among w1 — wM, include it in the document with probability P(wj | yi).

61. ### Naive Bayes

How many coins? One for θ = P(spam), plus one per vocabulary word per class for P(wj | ham) and P(wj | spam).
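A toy simulation of this generative story: one biased coin for the label and one coin per word per class, 1 + 2M coins in total. The vocabulary and probabilities below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["million", "winner", "meeting", "lunch"]    # toy vocabulary, M = 4
theta = 0.4                                          # theta = P(spam)
p_word = {"spam": np.array([0.8, 0.7, 0.1, 0.1]),    # P(wj | spam)
          "ham":  np.array([0.1, 0.1, 0.6, 0.5])}    # P(wj | ham)

def generate_document():
    label = "spam" if rng.random() < theta else "ham"   # step 1: flip the label coin
    include = rng.random(len(vocab)) < p_word[label]    # step 2: one coin per word
    return label, [w for w, keep in zip(vocab, include) if keep]

print(generate_document())
```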

63. ### Naive Bayes — Important Assumption

P(w1, w2, …, wM | label) = P(w1 | label) ⋯ P(wM | label)

64. ### Naive Bayes — Important Assumption

P(w1, w2, …, wM | label) = P(w1 | label) ⋯ P(wM | label)

The presence of each word within a document is conditionally independent of the other words, given the label.

66. ### Conditional Independence

[Diagram: label SPAM with arrows to the words “MILLION” and “WINNER”.]

I observe the word “million” in the document.

67. ### Conditional Independence

I observe the word “million” in the document. How do the chances of observing the word “winner” change?

68. ### Conditional Independence

Knowing that “million” is observed changes our degree of uncertainty about observing “winner”.

69. ### Conditional Independence

Knowing that “million” is observed changes our degree of uncertainty about observing “winner”: the events are not independent.

71. ### Conditional Independence

I know that the document is not spam, and I observe the word “million” in the document.

72. ### Conditional Independence

I know that the document is not spam, and I observe the word “million” in the document. Does that affect the chances of observing “winner”?
73. ### Naive Bayes — Fitting Parameters

P(yi = spam) = θ = (number of spam documents) / (total number of documents)

P(wj | spam) = (number of times word wj occurs in spam) / (total number of words labeled spam)
74. ### Naive Bayes — Prediction

“My father was a very wealthy cocoa merchant in Abidjan, the economic capital of Ivory Coast before he was poisoned to death by his business associates on one of their outing to discus on a business deal.” http://www.hoax-slayer.net/wumi-abdul-advance-fee-scam/

P(y = spam | x) ∝ P(“my” | spam) ⋯ P(“deal” | spam) P(spam)
P(y = ham | x) ∝ P(“my” | ham) ⋯ P(“deal” | ham) P(ham)
75. ### Naive Bayes — Smoothing

What if a test document contains a word that we never saw during training?

P(“Ivory” | spam) = 0.0
P(y = spam | x) ∝ P(“my” | spam) ⋯ P(“Ivory” | spam) P(spam) = 0.0
76. ### Naive Bayes — Smoothing

Smooth the word counts:

P(wj | spam) = (number of times word wj occurs in spam + 1) / (total number of words labeled spam + |V|)

where |V| is the number of unique words in the training data.
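A minimal sketch that puts the count-based estimates, the +1 smoothing, and prediction together. The toy corpus is invented, and the prediction multiplies probabilities in log space to avoid numerical underflow (standard practice, though not mentioned on the slides).

```python
import math
from collections import Counter

train = [(["million", "winner", "bank"], "spam"),
         (["meeting", "lunch"], "ham"),
         (["million", "deposit"], "spam")]        # toy labeled corpus

docs_per_label = Counter(label for _, label in train)
word_counts = {"spam": Counter(), "ham": Counter()}
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}  # |V| unique training words

def log_posterior(words, label):
    """log P(label) + sum_j log P(wj | label), with +1 smoothing."""
    total = sum(word_counts[label].values())
    lp = math.log(docs_per_label[label] / len(train))
    for w in words:
        lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return lp

test = ["million", "deposit", "lunch"]
print(max(("spam", "ham"), key=lambda lab: log_posterior(test, lab)))  # -> spam
```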

80. ### Evaluation Goals

Model selection — finding the best-performing model (e.g., kNN vs. NB)

“Hyperparameter” selection — finding the best hyperparameters for a given model (e.g., k = 2 vs. k = 3 for kNN)

81. ### Evaluation Goals

Goal: minimize the error on future, unobserved data.
82. ### Method I — Train, Validation, Test

• Split the data randomly into train, validation, and test sets
• The split can be stratified based on the label
• See sklearn.model_selection.train_test_split (sketch below)

[Figure: the data bar partitioned into Train, Val, and Test segments.]
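A runnable sketch of this two-stage split using the train_test_split function named above; the toy data, split sizes, and random seeds are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy examples, 2 features each
y = np.array([0, 1] * 5)           # toy binary labels

# Carve out the test set first, then split the rest into train and validation;
# stratify keeps the label proportions the same in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```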
83. ### Method I — Train, Validation, Test

Can we do this without wasting valuable training data?
84. ### Method II — k-Fold Cross-Validation

• Split the data randomly into train and test sets
• Split the train data randomly into k equal “folds”
• Train on k−1 folds, validate on the remaining fold
• Average the k metrics, one from each fold
• See sklearn.model_selection.KFold (sketch after the figure below)
85. ### Method II — k-Fold Cross-Validation

[Figure: the shuffled training data divided into 5 folds; in each of 5 rounds a different fold is held out for validation, with the test data kept separate throughout.]
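A sketch of the procedure with sklearn.model_selection.KFold as named on the slide; the synthetic data, the kNN model, and the choice of 5 folds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.random.default_rng(0).normal(size=(50, 2))   # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # synthetic labels

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # accuracy on the held-out fold
print(np.mean(scores))                                  # average of the 5 metrics
```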
86. ### Hyperparameter Selection

• Grid search: try a “well-spaced grid” of hyperparameter values
• Average the performance metrics over the k validation folds
• Select the hyperparameters with the best performance
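As a hedged sketch: sklearn's GridSearchCV, which the lecture does not name, wraps exactly this grid-plus-cross-validation loop; the grid of k values and the synthetic data are assumptions here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.random.default_rng(1).normal(size=(60, 2))
y = (X[:, 0] > 0).astype(int)

grid = {"n_neighbors": [1, 3, 5, 11, 31]}                  # a well-spaced grid for k
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)  # k-fold CV per setting
search.fit(X, y)
print(search.best_params_, search.best_score_)
```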
87. ### Performance Metrics

Factors driving the choice of a performance metric:
• Data — balanced vs. skewed
• Task — ranking, classification, clustering
• The real-world use case
88. ### Performance Metrics

Skewed Data: Dangerous = +1, +1, …, +1, -1, …, -1

89. ### Performance Metrics

Skewed Data: Dangerous = +1, +1, …, +1 (90%), -1, …, -1 (10%)

90. ### Performance Metrics

Skewed Data: with 90% of the labels being +1, the accuracy of “always guess +1” is 90%!
91. ### Performance Metrics — Confusion Matrix

|             | Actual P       | Actual N       |
|-------------|----------------|----------------|
| Predicted P | True Positive  | False Positive |
| Predicted N | False Negative | True Negative  |
92. ### Performance Metrics — TPR, FPR

True Positive Rate = TP / (TP + FN). Example: the percentage of dangerous objects correctly identified as such.

False Positive Rate = FP / (FP + TN). Example: the percentage of safe objects incorrectly identified as dangerous.
93. ### Performance Metrics — Thresholds

Actual labels: -1, +1, -1, +1. [Figure: empty plot of TPR (y-axis) vs. FPR (x-axis), each running from 0.0 to 1.0.]

94. ### Performance Metrics — Thresholds

Actual: -1, +1, -1, +1; predicted probabilities: 0.1, 0.3, 0.4, 0.8.

95.–99. ### Performance Metrics — Thresholds

Sweep the decision threshold, predicting +1 whenever the probability is at or above it:

| Thresh. | Pred.          | FPR / TPR   |
|---------|----------------|-------------|
| 0.8     | -1, -1, -1, +1 | 0.00 / 0.50 |
| 0.4     | -1, -1, +1, +1 | 0.50 / 0.50 |
| 0.3     | -1, +1, +1, +1 | 0.50 / 1.00 |
| 0.1     | +1, +1, +1, +1 | 1.00 / 1.00 |

100. ### Performance Metrics — Thresholds

[Figure: the (FPR, TPR) pairs from the table plotted on the TPR-vs-FPR axes.]
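The sweep is easy to reproduce in numpy; this sketch uses the four points from the slides and predicts +1 when the probability is at or above the threshold.

```python
import numpy as np

actual = np.array([-1, +1, -1, +1])
prob   = np.array([0.1, 0.3, 0.4, 0.8])

for thresh in [0.8, 0.4, 0.3, 0.1]:
    pred = np.where(prob >= thresh, +1, -1)
    tp = np.sum((pred == +1) & (actual == +1))
    fn = np.sum((pred == -1) & (actual == +1))
    fp = np.sum((pred == +1) & (actual == -1))
    tn = np.sum((pred == -1) & (actual == -1))
    print(f"thresh={thresh}: FPR={fp / (fp + tn):.2f}, TPR={tp / (tp + fn):.2f}")
```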
101. ### Performance Metrics — Precision/Recall

Precision = TP / (TP + FP). Example: the percentage of objects identified as dangerous that were actually dangerous.

Recall = TP / (TP + FN). Example: the percentage of dangerous objects correctly identified as such.

F1-Score = 2 (P × R) / (P + R)
102. ### Performance Metrics — Precision/Recall

Actual: -1, +1, -1, +1; predicted probabilities: 0.1, 0.3, 0.4, 0.8.

| Thresh. | Pred.          | P / R       |
|---------|----------------|-------------|
| 0.8     | -1, -1, -1, +1 | 1.00 / 0.50 |
| 0.4     | -1, -1, +1, +1 | 0.50 / 0.50 |
| 0.3     | -1, +1, +1, +1 | 0.66 / 1.00 |
| 0.1     | +1, +1, +1, +1 | 0.50 / 1.00 |
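And the same sweep for precision and recall over the four points from the slides, with F1 computed as 2PR / (P + R).

```python
import numpy as np

actual = np.array([-1, +1, -1, +1])
prob   = np.array([0.1, 0.3, 0.4, 0.8])

for thresh in [0.8, 0.4, 0.3, 0.1]:
    pred = np.where(prob >= thresh, +1, -1)
    tp = np.sum((pred == +1) & (actual == +1))
    fp = np.sum((pred == +1) & (actual == -1))
    fn = np.sum((pred == -1) & (actual == +1))
    p, r = tp / (tp + fp), tp / (tp + fn)
    print(f"thresh={thresh}: P={p:.2f}, R={r:.2f}, F1={2 * p * r / (p + r):.2f}")
```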