Introduction to Classification

Emaad Manzoor
November 14, 2017

Lecture for the 95865 Unstructured Data Analysis course in Fall 2017.

Demo notebook: https://gist.github.com/emaadmanzoor/0ba78a2920ea0858b54942eff8b08820

Transcript

  1. Classification Essentials

    Labeled data { (x1, y1), (x2, y2), … }
    Types of labels:
    • Binary: yi ∈ { 0, 1 }
    • Multi-class: yi ∈ { cat, spiderman, … }
    • Multi-label: yi ∈ { {cat, feline}, {…} }
  2. Classification Essentials

    Labeled data { (x1, y1), (x2, y2), … }
    Classification Model
    • Linear function
    • Tree-based
    • Nearest-neighbor
    • …
  3. A Simple Classifier — kNN

    k = 1
    k-NN majority vote
    Test point xi, yi = ?
  4. A Simple Classifier — kNN

    k = 2
    k-NN majority vote
    Break ties randomly
    Test point xi, yi = ?
  5. A Simple Classifier — kNN

    k = 3
    k-NN majority vote
    Break ties randomly
    Test point xi, yi = ?
  6. A Simple Classifier — kNN

    Classifier decision boundary

    If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of Kk/K. Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27. In Figure 2.28, we show the results of applying the K-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.

    [Figure 2.28, three panels (K = 1, K = 3, K = 31) with axes x6 and x7: Plot of 200 data points from the oil data set showing values of x6 plotted against x7, where the red, green, and blue points correspond to the ‘laminar’, ‘annular’, and ‘homogeneous’ classes, respectively. Also shown are the classifications of the input space given by the K-nearest-neighbour algorithm for various values of K.]
  7. A Simple Classifier — kNN

    Classifier decision boundary
    Non-linear decision boundary
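
The voting rule described on the kNN slides can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `knn_predict` and the toy data are hypothetical and do not come from the lecture or its demo notebook.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k, rng=random):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are tuples of floats.
    """
    # Keep the k training points closest to the query (Euclidean distance).
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    # Break ties randomly among the labels with the most votes.
    tied = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(tied)

# Hypothetical toy data: two red points near the origin, three blue points near (1, 1).
train = [((0.0, 0.0), "red"), ((0.1, 0.2), "red"),
         ((1.0, 1.0), "blue"), ((0.9, 1.1), "blue"), ((1.2, 0.8), "blue")]
print(knn_predict(train, (0.2, 0.1), k=1))  # the single nearest point is red
print(knn_predict(train, (0.8, 0.9), k=3))  # the three nearest points are all blue
```

Sorting every training point for each query is the simplest approach, not the fastest one; it is used here only to keep the sketch short.
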
  8. A Simple Classifier — kNN

    If the optimal classifier has error e, kNN with k = 1 has error at most 2e as the number of training points grows (Cover and Hart, 1967)
  9. A Linear Classifier — SVM

    Assumption: Data is linearly separable
    { (X1, t1), (X2, t2), … }
    ti ∈ { +1, -1 }
  10. A Linear Classifier — SVM

    [Figure with axes w1 and w2]
    Margin: Perpendicular distance between the separator and the nearest data point
  11. A Linear Classifier — SVM

    [Figure with axes w1 and w2]
    Goal: Find the maximum margin linear separator
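
To make the margin definition concrete, the sketch below computes it for a given separator of the form w·x + b = 0. The function name `margin`, the data points, and the two candidate separators are all hypothetical; the code only evaluates margins and does not solve the SVM optimization problem.

```python
import math

def margin(w, b, points):
    """Smallest perpendicular distance from any point to the separator w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x in points)

# Hypothetical data: the first two points have label +1, the last two have label -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]

# Two separators that both classify these points correctly.
wide = margin((1.0, 1.0), -2.0, points)    # the line x1 + x2 = 2
narrow = margin((1.0, 1.0), -3.0, points)  # the line x1 + x2 = 3
print(wide, narrow)
```

Both lines separate the data, but the first keeps the nearest point farther away, which is the property the maximum margin objective prefers.
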
  12. Kernels

    Problem: The non-linear mapping ϕ can be extremely large
    Solution: The Kernel Trick
    oneweirdkerneltrick.com
  13. The Kernel Trick

    A valid kernel function K(x, y) implicitly defines a feature mapping
    Much easier to define K(x, y)
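
A small worked case may help show what "implicitly defines a feature mapping" means. The sketch below assumes the kernel K(x, y) = (x·y)² on 2-D inputs, for which an explicit mapping ϕ is known; the choice of kernel and the sample vectors are illustrative assumptions, not taken from the slides.

```python
import math

def phi(x):
    """Explicit feature map matching K(x, y) = (x . y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """K(x, y) = (x . y)^2, computed without ever building phi(x)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, kernel(x, y))  # the two values agree up to rounding
```

Here ϕ has only three components, but for higher degrees and input dimensions the explicit mapping grows quickly while the kernel evaluation stays a single dot product.
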
  14. Examples of Valid Kernels

    Radial basis function: $K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$
    Polynomial: $K(x, x') = (x^T x' + c)^d$
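
The two kernels named on this slide are short enough to write out directly. This is a plain-Python sketch; the function names and the default values for `sigma`, `c`, and `d` are arbitrary choices made for illustration.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel: (x . y + c)^d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))         # identical points give 1.0
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0)))  # (3 + 8 + 1)^2 = 144.0
```
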
  15. Example: Finding Progression Stages

    Data: Beer ratings by users on RateBeer.com
    Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.
  16. Example: Finding Progression Stages

    Question: How do beer drinkers “progress” over time?
    Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.
  17. Example: Finding Progression Stages

    A generative model:
    $x_{ij} \sim \mathrm{Multinomial}(\Theta(c_i, s_j))$
    $\Theta(c_i, s_j) \sim \mathrm{Dirichlet}(\ )$
    Maximize the data likelihood
    Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.
  18. Example: Finding Progression Stages

    Yang, J., McAuley, J., Leskovec, J., LePendu, P., and Shah, N. Finding progression stages in time-evolving event sequences. WWW 2014.
  19. Naive Bayes

    Documents: x1, x2, …, xn
    Words: xi = { w1, w2, …, wm }
    [Word cloud of document words, including “money”, “bank”, “account”, “deposit” and “funds”]
  20. Naive Bayes

    Documents: x1, x2, …, xn
    Labels ∈ { ham, spam }: y1, y2, …, yn
    Words: xi = { w1, w2, …, wm }
    [Word cloud of document words, including “money”, “bank”, “account”, “deposit” and “funds”]
  21. Naive Bayes

    Generating a document xi:
    1. Pick a label yi from {ham, spam}, choosing spam with probability θ
    θ = P(spam)
  22. Naive Bayes

    Generating a document xi:
    1. Pick a label yi from {ham, spam}, choosing spam with probability θ
    θ = P(spam)
    P(wj | ham), P(wj | spam)
  23. Naive Bayes

    Generating a document xi:
    1. Pick a label yi from {ham, spam}, choosing spam with probability θ
    θ = P(spam)
    P(wj | ham), P(wj | spam)
    [Two word clouds, one per label: spam words such as “money”, “bank” and “deposit”; ham words such as “algorithm”, “layout” and “pixel”]
  24. Naive Bayes

    Generating a document xi:
    1. Pick a label yi from {ham, spam}, choosing spam with probability θ
    2. For each possible word w1, …, wM, include it in the document with probability P(wj | yi)
    θ = P(spam)
    P(wj | ham), P(wj | spam)
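
The two-step generative story can be simulated directly. The sketch below is illustrative only: the function name `generate_document`, the value of θ, and the per-word probabilities are made-up assumptions rather than values estimated from any data set.

```python
import random

def generate_document(theta, word_probs, rng):
    """Sample (label, words) by picking a label, then including each word independently."""
    # Step 1: pick spam with probability theta, otherwise ham.
    label = "spam" if rng.random() < theta else "ham"
    # Step 2: include each candidate word with probability P(word | label).
    words = [w for w, p in word_probs[label].items() if rng.random() < p]
    return label, words

# Hypothetical parameters.
theta = 0.4
word_probs = {
    "spam": {"million": 0.8, "winner": 0.7, "meeting": 0.1},
    "ham": {"million": 0.05, "winner": 0.05, "meeting": 0.6},
}
rng = random.Random(0)
for _ in range(3):
    print(generate_document(theta, word_probs, rng))
```
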
  25. Naive Bayes — Important Assumption

    P(w1, w2, …, wM | label) = P(w1 | label) … P(wM | label)
  26. Naive Bayes — Important Assumption

    P(w1, w2, …, wM | label) = P(w1 | label) … P(wM | label)
    The presence of each word within a document is conditionally independent of the other words, given the label
  27. Conditional Independence

    [Diagram with nodes SPAM, “MILLION”, “WINNER”]
    I observe the word “million” in the document
    How do the chances of observing the word “winner” change?
  28. Conditional Independence

    [Diagram with nodes SPAM, “MILLION”, “WINNER”]
    Knowing that “million” is observed changes our degree of uncertainty about observing “winner”
  29. Conditional Independence

    [Diagram with nodes SPAM, “MILLION”, “WINNER”]
    Knowing that “million” is observed changes our degree of uncertainty about observing “winner”
    The events are not independent
  30. Conditional Independence

    [Diagram with nodes SPAM, “MILLION”, “WINNER”]
    I know that the document is not spam
    I observe the word “million” in the document
  31. Conditional Independence

    [Diagram with nodes SPAM, “MILLION”, “WINNER”]
    I know that the document is not spam
    I observe the word “million” in the document
    Does that affect the chances of observing “winner”?
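
The difference between independence and conditional independence can be checked numerically. The sketch below builds the joint distribution implied by the Naive Bayes assumption from made-up probabilities (all numbers and names here are hypothetical) and compares the chance of seeing “winner” before and after observing “million”, with and without knowing the label.

```python
# Hypothetical model parameters.
p_label = {"spam": 0.5, "ham": 0.5}
p_word = {"spam": {"million": 0.8, "winner": 0.6},
          "ham": {"million": 0.1, "winner": 0.1}}

def joint(label, million, winner):
    """P(label, million, winner), with the two words independent given the label."""
    pm = p_word[label]["million"] if million else 1 - p_word[label]["million"]
    pw = p_word[label]["winner"] if winner else 1 - p_word[label]["winner"]
    return p_label[label] * pm * pw

def prob(event):
    """Total probability of the outcomes for which event(label, million, winner) is true."""
    return sum(joint(lab, m, w)
               for lab in ("spam", "ham") for m in (True, False) for w in (True, False)
               if event(lab, m, w))

p_winner = prob(lambda lab, m, w: w)
p_winner_given_million = prob(lambda lab, m, w: w and m) / prob(lambda lab, m, w: m)
p_winner_given_ham = (prob(lambda lab, m, w: w and lab == "ham")
                      / prob(lambda lab, m, w: lab == "ham"))
p_winner_given_ham_million = (prob(lambda lab, m, w: w and m and lab == "ham")
                              / prob(lambda lab, m, w: m and lab == "ham"))

print(p_winner, p_winner_given_million)                # these differ: not independent
print(p_winner_given_ham, p_winner_given_ham_million)  # these match: independent given the label
```

Without the label, seeing “million” raises the chance of “winner” because it makes spam more likely. Once the label is known to be ham, “million” carries no further information about “winner”.
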
  32. Naive Bayes — Fitting Parameters

    $P(y_i = \text{spam}) = \theta = \dfrac{\text{number of spam documents}}{\text{total number of documents}}$
    $P(w_j \mid \text{spam}) = \dfrac{\text{number of times word } w_j \text{ occurs in spam}}{\text{total number of words labeled spam}}$
  33. Naive Bayes — Prediction

    My father was a very wealthy cocoa merchant in Abidjan, the economic capital of Ivory Coast before he was poisoned to death by his business associates on one of their outing to discus on a business deal.
    http://www.hoax-slayer.net/wumi-abdul-advance-fee-scam/
    P(y = spam) ∝ P(“my” | spam)…P(“deal” | spam)P(spam)
    P(y = ham) ∝ P(“my” | ham)…P(“deal” | ham)P(ham)
  34. Naive Bayes — Smoothing

    What if we observe a word in a document that we never saw during training?
    P(“Ivory” | spam) = 0.0
    P(y = spam) ∝ P(“my” | spam)…P(“Ivory” | spam)P(spam) = 0.0
  35. Naive Bayes — Smoothing

    Smooth word counts:
    $P(w_j \mid \text{spam}) = \dfrac{\text{number of times word } w_j \text{ occurs in spam} + 1}{\text{total number of words labeled spam} + |V|}$
    |V| = number of unique words in the training data
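
Fitting, smoothing, and prediction fit together as in the sketch below. It follows the count-based estimates and add-one smoothing from the slides; the function names, the tiny training set, and the use of log-probabilities (a common way to avoid numerical underflow, not something the slides mention) are assumptions made for illustration.

```python
import math
from collections import Counter

def fit(docs, labels):
    """Estimate label priors and per-label word counts from tokenized documents."""
    n_docs = Counter(labels)
    word_counts = {label: Counter() for label in n_docs}
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {label: n / len(labels) for label, n in n_docs.items()}
    return priors, word_counts, vocab

def word_prob(word, label, word_counts, vocab):
    """Add-one smoothed estimate of P(word | label)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))

def predict(words, priors, word_counts, vocab):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    scores = {}
    for label in priors:
        scores[label] = math.log(priors[label]) + sum(
            math.log(word_prob(w, label, word_counts, vocab)) for w in words)
    return max(scores, key=scores.get)

# Hypothetical training data.
docs = [["million", "dollars", "deposit"], ["claim", "million", "fund"],
        ["meeting", "at", "noon"], ["lunch", "meeting", "today"]]
labels = ["spam", "spam", "ham", "ham"]
model = fit(docs, labels)
print(predict(["million", "fund", "ivory"], *model))  # "ivory" was never seen in training
print(predict(["meeting", "today"], *model))
```

Because of the added 1 in the numerator, the unseen word “ivory” gets a small non-zero probability under both labels, so it no longer forces either score to zero.
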
  36. Evaluation Goals

    Model selection: finding the best-performing model (kNN, NB)
    “Hyperparameter” selection: finding the best hyperparameters for a given model (k=2, k=3)
  37. Evaluation Goals

    Model selection: finding the best-performing model (kNN, NB)
    “Hyperparameter” selection: finding the best hyperparameters for a given model (k=2, k=3)
    Goal: Minimize the error on future, unobserved data
  38. Method I — Train, Validation, Test

    • Split data randomly into train, validation and test
    • Split can be stratified based on the label
    • See sklearn.model_selection.train_test_split
    [Diagram: Data split into Train, Val and Test]
  39. Method I — Train and Test

    [Diagram: Data]
    Can we do this without wasting valuable training data?
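
A stratified three-way split can be sketched with the standard library alone. This is not how `sklearn.model_selection.train_test_split` is implemented; the function name `stratified_split`, the 60/20/20 fractions, and the toy labels are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists with similar label proportions in each."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, val, test = [], [], []
    for indices in by_label.values():
        # Shuffle within each label, then cut the shuffled list into three pieces.
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Hypothetical skewed labels: 10 spam and 40 ham documents.
labels = ["spam"] * 10 + ["ham"] * 40
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each label separately keeps the rare class represented in all three parts, which a purely random split does not guarantee on small or skewed data.
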
  40. Method II — k-Fold Cross-Validation

    • Split data randomly into train and test
    • Split train data randomly into k equal “folds”
    • Train on k-1 folds, validate on the remaining
    • Average the k metrics from each fold
    • See sklearn.model_selection.KFold
  41. Method II — k-Fold Cross-Validation

    [Diagram: (Shuffled) Training Data divided into Fold 1 to Fold 5, shown in five rows; Test Data kept separate]
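
The fold bookkeeping can be sketched without any library. The code below is a standard-library illustration, not the implementation behind `sklearn.model_selection.KFold`; the names `k_fold_indices` and `cross_validate`, and the `evaluate` callback that trains and scores a model, are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds of n shuffled items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        # Train on every fold except fold i, validate on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def cross_validate(n, k, evaluate):
    """Average evaluate(train_idx, val_idx) over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

# Stand-in metric: the fraction of the data held out in each fold.
print(cross_validate(10, 5, lambda tr, va: len(va) / 10))
```

In practice `evaluate` would fit a model on the training indices and return its metric on the validation indices; the held-out test set is not touched during this loop.
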
  42. Hyperparameter Selection

    • Grid-search: Search a “well-spaced grid” of hyperparameter values
    • Performance metrics averaged over the k validation folds
    • Select the hyperparameters with the best performance
  43. Performance Metrics

    Factors driving the choice of a performance metric:
    • Data: balanced vs. skewed
    • Task: ranking, classification, clustering
    • Real-world use-case
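
Grid search itself is a small loop over every combination of candidate values. In the sketch below, `grid_search` and `toy_score` are hypothetical names, and `cv_score` is assumed to return the metric already averaged over the k validation folds, with larger values being better.

```python
from itertools import product

def grid_search(grid, cv_score):
    """Return (best_params, best_score) over every combination of values in `grid`.

    `grid` maps each hyperparameter name to the list of values to try.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in scoring function that peaks at k = 5; a real one would run cross-validation.
def toy_score(params):
    return -abs(params["k"] - 5)

print(grid_search({"k": [1, 3, 5, 7, 9]}, toy_score))
```
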
  44. Performance Metrics

    Skewed Data
    Dangerous = +1
    Labels: +1, +1, …, +1 (90%) and -1, …, -1 (10%)
    Accuracy of “always guess +1” = 90%!
  45. Performance Metrics — Confusion Matrix

                  Actual P          Actual F
    Predicted P   True Positive     False Positive
    Predicted F   False Negative    True Negative
  46. Performance Metrics — TPR, FPR

    True Positive Rate = TP / (TP + FN)
    Example: Percentage of dangerous objects correctly identified as such.
    False Positive Rate = FP / (FP + TN)
    Example: Percentage of safe objects incorrectly identified as dangerous.
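
The confusion-matrix counts and the two rates can be computed as in the sketch below. The function names are hypothetical, and the example data mirrors the skewed case from the deck: 90 dangerous objects (+1), 10 safe ones (-1), and a classifier that always guesses +1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for two equal-length label sequences."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def tpr_fpr(actual, predicted, positive=1):
    """Return (TPR, FPR); a rate with an empty denominator is reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

actual = [1] * 90 + [-1] * 10
always_positive = [1] * 100
accuracy = sum(a == p for a, p in zip(actual, always_positive)) / len(actual)
print(accuracy, tpr_fpr(actual, always_positive))
```

The accuracy looks good at 0.9, but the false positive rate of 1.0 shows that every safe object is flagged as dangerous.
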
  47. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  48. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8
    Table columns: Thresh., Pred., FPR / TPR
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  49. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            FPR / TPR
    0.8      -1, -1, -1, +1   0.00 / 0.50
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  50. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            FPR / TPR
    0.8      -1, -1, -1, +1   0.00 / 0.50
    0.4      -1, -1, +1, +1   0.50 / 0.50
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  51. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            FPR / TPR
    0.8      -1, -1, -1, +1   0.00 / 0.50
    0.4      -1, -1, +1, +1   0.50 / 0.50
    0.3      -1, +1, +1, +1   0.50 / 1.00
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  52. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            FPR / TPR
    0.8      -1, -1, -1, +1   0.00 / 0.50
    0.4      -1, -1, +1, +1   0.50 / 0.50
    0.3      -1, +1, +1, +1   0.50 / 1.00
    0.1      +1, +1, +1, +1   1.00 / 1.00
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
  53. Performance Metrics — Thresholds

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            FPR / TPR
    0.8      -1, -1, -1, +1   0.00 / 0.50
    0.4      -1, -1, +1, +1   0.50 / 0.50
    0.3      -1, +1, +1, +1   0.50 / 1.00
    0.1      +1, +1, +1, +1   1.00 / 1.00
    [Plot: TPR against FPR, both axes from 0.0 to 1.0]
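
The threshold sweep on these slides can be recomputed from the four labelled scores. The sketch below assumes a point is predicted +1 when its probability is at least the threshold, which matches the prediction rows shown; the function name `roc_points` is hypothetical.

```python
def roc_points(actual, probs):
    """Return (threshold, FPR, TPR) for each distinct score, highest threshold first."""
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = len(actual) - n_pos
    points = []
    for t in sorted(set(probs), reverse=True):
        # Predict +1 for every point whose score reaches the threshold.
        pred = [1 if p >= t else -1 for p in probs]
        tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, pred) if a == -1 and p == 1)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

actual = [-1, 1, -1, 1]
probs = [0.1, 0.3, 0.4, 0.8]
for t, fpr, tpr in roc_points(actual, probs):
    print(f"thresh={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

At threshold 0.8 the only predicted positive is a true positive, so one of the two dangerous objects is found and the TPR is 0.50 while the FPR is 0.00.
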
  54. Performance Metrics — Precision/Recall

    Precision = TP / (TP + FP)
    Example: Percentage of objects identified as dangerous that were actually dangerous.
    Recall = TP / (TP + FN)
    Example: Percentage of dangerous objects correctly identified as such.
    F1-Score = 2 × (P × R) / (P + R)
  55. Performance Metrics — Precision/Recall

    Actual: -1, +1, -1, +1
    Prob.:  0.1, 0.3, 0.4, 0.8

    Thresh.  Pred.            P / R
    0.8      -1, -1, -1, +1   1.00 / 0.50
    0.4      -1, -1, +1, +1   0.50 / 0.50
    0.3      -1, +1, +1, +1   0.67 / 1.00
    0.1      +1, +1, +1, +1   0.50 / 1.00
    [Plot: P against R, both axes from 0.0 to 1.0]
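
Precision, recall, and F1 for the same four points can be recomputed as follows. The sketch uses the standard harmonic-mean definition F1 = 2PR / (P + R) and again assumes a point is predicted +1 when its probability is at least the threshold; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1); a ratio with an empty denominator is reported as 0.0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == -1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual = [-1, 1, -1, 1]
probs = [0.1, 0.3, 0.4, 0.8]
for t in (0.8, 0.4, 0.3, 0.1):
    pred = [1 if p >= t else -1 for p in probs]
    p, r, f1 = precision_recall_f1(actual, pred)
    print(f"thresh={t:.1f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```

Lowering the threshold raises recall but tends to lower precision, and F1 summarizes the two in one number.
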