Classification 1

pankajmore
September 21, 2012

Transcript

  1. CS685: Data Mining
    Classification
    Arnab Bhattacharya, arnabb@cse.iitk.ac.in
    Computer Science and Engineering, Indian Institute of Technology, Kanpur
    http://web.cse.iitk.ac.in/~cs685/
    1st semester, 2012-13
    Tue, Wed, Fri 0900-1000 at CS101
    Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classification 1 2012-13 1 / 32
  2. Outline
    1 Preliminaries
    2 Decision trees
    3 Rule-based classifiers
  5. Classification
    A dataset of n objects $O_i$, $i = 1, \dots, n$
    A total of k classes $C_j$, $j = 1, \dots, k$
    Each object belongs to a single class
      If object $O_i$ belongs to class $C_j$, then $C(O_i) = j$
    Given a new object $O_q$, classification is the problem of determining its class, i.e., $C(O_q)$, out of the k possible choices
    If, instead of k discrete classes, there is a continuum of values, the problem of determining the value $V(O_q)$ of a new object $O_q$ is called prediction
  10. General method of classification
    Total available data is divided randomly into two parts: training set and testing set (or validation set)
      Classification algorithm or model is built using only the training set
      Testing set should not be used at all while building the model
      Quality of method is measured using testing set
    Stratified: if the representation of each class in the training set is proportional to the overall ratios
    k-fold cross-validation
      Data is divided into k random parts
      k − 1 groups are used as training set and the kth group as testing set
      Training is repeated k times, each time with a new testing set
      Leave-one-out cross-validation (LOOCV): when k = n
    Stratified cross-validation: when the representation in each of the k random groups is proportional to the overall ratios
    Supervised learning: algorithm or model is “supervised” by the class information
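The k-fold and stratified ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the round-robin way of spreading each class over the folds, and the toy labels are not from the slides.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices; each class is dealt round-robin over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # random assignment within a class
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds

def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices); each fold is the testing set once."""
    folds = stratified_k_fold(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

# Hypothetical toy labels: 6 positives and 12 negatives.
labels = ["pos"] * 6 + ["neg"] * 12
splits = list(train_test_splits(labels, k=3))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] == "pos" for i in test))
```

Setting k equal to the number of objects gives leave-one-out cross-validation, where every testing set holds a single object.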
  12. Over-fitting and under-fitting
    Over-fitting
      Algorithm or model classifies the training set too well
      It is too complex or uses too many parameters
      Generally performs poorly with testing set
      Ends up modeling noise rather than data characteristics
    Under-fitting
      The opposite problem
      Algorithm or model does not classify the training set well at all
      It is too simple or uses too few parameters
      Generally performs poorly with testing set
      Ends up modeling overall data characteristics instead of per-class characteristics
  15. Errors
    Positives (P): objects that are “true” answers
    Negatives (N): objects that are not answers
      $N = D - P$
    For any classification algorithm,
      True Positives (TP): answers that have been found
      True Negatives (TN): non-answers that have not been found
      False Positives (FP): non-answers that have been found
      False Negatives (FN): answers that have not been found
      $P = TP \cup FN$
      $N = TN \cup FP$
    Errors
      Type I error: FP
      Type II error: FN
  23. Error parameters or performance metrics
    Recall or Sensitivity or True positive rate or Hit rate
      Interpretation: proportion of answers found
      Formula: $\frac{|TP|}{|TP \cup FN|} = \frac{|TP|}{|P|}$
    Precision
      Interpretation: proportion of answers in those found
      Formula: $\frac{|TP|}{|TP \cup FP|}$
    Specificity or True negative rate
      Interpretation: proportion of non-answers not found
      Formula: $\frac{|TN|}{|TN \cup FP|} = \frac{|TN|}{|N|}$
    False positive rate
      Interpretation: proportion of non-answers found as answers
      Formula: $\frac{|FP|}{|TN \cup FP|} = \frac{|FP|}{|N|}$
    Accuracy
      Interpretation: proportion of correctly found answers and non-answers
      Formula: $\frac{|TP \cup TN|}{|TP \cup TN \cup FP \cup FN|} = \frac{|TP \cup TN|}{|D|}$
    Error rate
      Interpretation: proportion of wrongly found answers and non-answers
      Formula: $\frac{|FP \cup FN|}{|TP \cup TN \cup FP \cup FN|} = \frac{|FP \cup FN|}{|D|}$
    False positive rate = 1 − specificity
    Error rate = 1 − accuracy
  26. F-score
    F-score or F-measure
      Single measure capturing both precision and recall
      Harmonic mean of precision and recall
      $\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
    Precision and recall can be weighted as well
      When recall is β times more important than precision
      $\text{F-score} = \frac{(1 + \beta) \times \text{Precision} \times \text{Recall}}{\beta \times \text{Precision} + \text{Recall}}$
      (the more common $F_\beta$ convention writes $\beta^2$ where this formula writes $\beta$)
    In terms of errors
      $\text{F-score} = \frac{(1 + \beta) \times |TP|}{(1 + \beta) \times |TP| + \beta \times |FN| + |FP|}$
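As a quick check that the precision/recall form and the error-count form of the weighted F-score agree, here is a minimal Python sketch. It follows the slide's parameterization (β rather than β²); the function names are illustrative, and the counts TP = 2, FP = 2, FN = 1 are taken from the deck's worked example.

```python
def weighted_f_score(precision, recall, beta=1.0):
    """Weighted F-score as written on the slide (uses beta, not beta squared)."""
    denom = beta * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta) * precision * recall / denom

def weighted_f_score_from_counts(tp, fp, fn, beta=1.0):
    """Same quantity expressed through TP, FP and FN counts."""
    denom = (1 + beta) * tp + beta * fn + fp
    if denom == 0:
        return 0.0
    return (1 + beta) * tp / denom

tp, fp, fn = 2, 2, 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = weighted_f_score(precision, recall)
f1_counts = weighted_f_score_from_counts(tp, fp, fn)
f_beta2 = weighted_f_score(precision, recall, beta=2.0)
f_beta2_counts = weighted_f_score_from_counts(tp, fp, fn, beta=2.0)
print(f1, f1_counts, f_beta2, f_beta2_counts)
```

With β = 1 both forms reduce to the plain harmonic mean; a larger β pulls the score towards recall.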
  41. Example
    $D = \{O_1, O_2, O_3, O_4, O_5, O_6, O_7, O_8\}$
    Correct answer set $= \{O_1, O_5, O_7\}$
    Algorithm returns $\{O_1, O_3, O_5, O_6\}$
    ∴
      $P = \{O_1, O_5, O_7\}$
      $N = \{O_2, O_3, O_4, O_6, O_8\}$
      $TP = \{O_1, O_5\}$
      $TN = \{O_2, O_4, O_8\}$
      $FP = \{O_3, O_6\}$
      $FN = \{O_7\}$
    ∴
      Recall = Sensitivity = 2/3 = 0.67
      Precision = 2/4 = 0.5
      Specificity = 3/5 = 0.6
      F-score = 4/7 = 0.571
      Accuracy = 5/8 = 0.625
      Error rate = 3/8 = 0.375
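The same example can be reproduced with Python sets, which mirror the set notation used in the deck. This is a sketch for checking the arithmetic; the variable names are chosen here for illustration.

```python
# Dataset, true answers and the algorithm's output, as given in the example.
D = {f"O{i}" for i in range(1, 9)}
actual = {"O1", "O5", "O7"}
returned = {"O1", "O3", "O5", "O6"}

P = actual
N = D - P
TP = returned & P          # answers that have been found
FP = returned & N          # non-answers that have been found
FN = P - returned          # answers that have not been found
TN = N - returned          # non-answers that have not been found

recall = len(TP) / len(P)
precision = len(TP) / len(TP | FP)
metrics = {
    "recall": recall,
    "precision": precision,
    "specificity": len(TN) / len(N),
    "false_positive_rate": len(FP) / len(N),
    "f_score": 2 * precision * recall / (precision + recall),
    "accuracy": len(TP | TN) / len(D),
    "error_rate": len(FP | FN) / len(D),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```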
  42. ROC curve
    ROC (Receiver Operating Characteristics) curve
      1 − specificity (x-axis) versus sensitivity (y-axis)
      False positive rate (x-axis) versus true positive rate (y-axis)
    A random guess algorithm is a 45° line
    Area under the curve measures accuracy (or discrimination)
      A perfect algorithm has an area of 1
      A random (or useless) algorithm has an area of 0.5
    Grading
      0.9-1.0: excellent; 0.8-0.9: good; 0.7-0.8: fair; 0.6-0.7: poor; <0.6: fail
    [Figure: ROC plot of false positive rate (x-axis, 0 to 1) versus true positive rate (y-axis, 0 to 1), with curves for Algorithm 1, Algorithm 2, a perfect algorithm and a random guess algorithm, and an annotation "Accuracy of Algorithm 2"]
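One common way to obtain such a curve, assumed here because the slide does not say how the points are generated, is to let the classifier output a score per object and sweep a threshold over those scores. The sketch below collects the (false positive rate, true positive rate) points and measures the area with the trapezoid rule; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points from lowering a threshold over the scores.

    scores: higher means "more likely an answer"; labels: True for actual positives.
    Assumes at least one positive and one negative object.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Objects tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

perfect = area_under_curve(roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))
useless = area_under_curve(roc_points([0.5, 0.5, 0.5, 0.5], [True, False, True, False]))
mixed = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
print(perfect, useless, mixed)
```

The first call ranks every positive above every negative and yields area 1; the second gives all objects the same score, which collapses to the 45° line with area 0.5.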
  44. Confusion matrix
    Confusion matrix
      Answers found by algorithm (predictions) on rows versus “true” answers (actuals) on columns
      Shows which classes are harder to identify

                                        True answers
                                        Positives P   Negatives N
      Found by algorithm   Positives P  TP            FP
                           Negatives N  FN            TN

    For our earlier example

                P   N   Total
      P         2   2   4
      N         1   3   4
      Total     3   5   8
  46. Decision trees
    A decision tree is a tree structure used for classification
    Each internal node represents a test on an attribute
    Each branch represents an outcome of the test
    Each leaf represents a class outcome
    For a test object, its attributes are tested and a particular path is followed to a leaf, which is deemed its class
    [Figure: example decision tree with internal nodes "Motivated?", "CPI >= 8?", "Advisor strongly recommends?" and "Advisor willing?", yes/no branches, and leaves "BTP" and "No BTP"]
  50. Constructing a decision tree
    If all objects are in same class, label the leaf node with that class
      The leaf is then pure
    Else, choose the “best” attribute to split
      Determine splitting criterion based on splitting attribute
      Indicates split point(s) or splitting subset(s)
      Different measures of impurity to split a node
    Separate objects into different branches according to split
    Recursively, build tree for each split
    Stop when either
      Leaf becomes pure
      No more attributes to split – assign class through majority voting
    Decision tree building is top-down and no backtracking is allowed
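The construction steps above can be sketched as a short recursive function. This is a simplified illustration, not the course's implementation: it assumes nominal attributes with multiway splits, takes the attribute-selection rule as a parameter, and uses a hypothetical `fewest_errors` criterion and toy data that loosely echo the BTP example.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, choose_attribute):
    """rows: list of dicts of nominal attributes. Returns a class label or a node dict."""
    if len(set(labels)) == 1:              # pure leaf
        return labels[0]
    if not attributes:                     # nothing left to split on: majority voting
        return majority(labels)
    best = choose_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}, "default": majority(labels)}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining, choose_attribute
        )
    return node                            # top-down, no backtracking

def classify(tree, row):
    """Follow one path from the root to a leaf; unseen values fall back to the majority."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

def fewest_errors(rows, labels, attributes):
    """Hypothetical criterion: attribute whose split leaves the fewest majority-vote errors."""
    def errors(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
    return min(attributes, key=errors)

rows = [
    {"motivated": "yes", "cpi_ge_8": "yes"},
    {"motivated": "yes", "cpi_ge_8": "no"},
    {"motivated": "no", "cpi_ge_8": "yes"},
    {"motivated": "no", "cpi_ge_8": "no"},
]
labels = ["BTP", "No BTP", "No BTP", "No BTP"]
tree = build_tree(rows, labels, ["motivated", "cpi_ge_8"], fewest_errors)
print(tree)
```

Any impurity-based criterion can be passed in place of `fewest_errors`; the recursion itself does not change.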
  51. Information gain
    Entropy impurity or information impurity
      $\mathrm{info}(D) = -\sum_{i=1}^{k} p_i \log_2 p_i$
      where $p_i$ is the fraction of objects in D that belong to class i
    For n partitions into $D_1, \dots, D_n$, denoted by S
      $\mathrm{info}_S(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \, \mathrm{info}(D_j)$
    Information gain is
      $\mathrm{gain}_S(D) = \mathrm{info}(D) - \mathrm{info}_S(D)$
    The higher the gain, the better the split
    Choose attribute and split point that maximizes gain
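A small Python sketch of these three formulas, with function names mirroring the slide's notation. The two candidate splits of an eight-object, two-class dataset are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy impurity: -sum_i p_i log2 p_i over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    """Weighted entropy of the partitions D_1..D_n (empty partitions contribute 0)."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions if p)

def gain(labels, partitions):
    return info(labels) - info_after_split(partitions)

labels = ["a"] * 4 + ["b"] * 4
perfect_split = [["a"] * 4, ["b"] * 4]
mixed_split = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]
gain_perfect = gain(labels, perfect_split)
gain_mixed = gain(labels, mixed_split)
print(gain_perfect, gain_mixed)
```

The split that separates the classes completely recovers the full one bit of entropy, while the mixed split gains far less and would lose the comparison.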
  52. Gini index
    Variance impurity for two classes
      $\mathrm{var}(D) = p_1 \cdot p_2$
    For k classes, generalized to Gini index or Gini impurity
      $\mathrm{gini}(D) = \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} p_i \cdot p_j = 1 - \sum_{i=1}^{k} p_i^2$
    For n partitions into $D_1, \dots, D_n$, denoted by S
      $\mathrm{gini}_S(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \, \mathrm{gini}(D_j)$
    The lower the gini index, the better the split
    Choose attribute and split point that minimizes gini index
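The step from the double sum to the closed form is not spelled out on the slide. It follows from the class fractions summing to one, as this short derivation shows:

```latex
\begin{align*}
\mathrm{gini}(D)
  &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} p_i \, p_j
   = \sum_{i=1}^{k} p_i \Bigl( \sum_{j=1}^{k} p_j - p_i \Bigr) \\
  &= \sum_{i=1}^{k} p_i \, (1 - p_i)
   = \sum_{i=1}^{k} p_i - \sum_{i=1}^{k} p_i^2
   = 1 - \sum_{i=1}^{k} p_i^2 .
\end{align*}
```

For k = 2 this gives $\mathrm{gini}(D) = 2 \, p_1 p_2$, i.e., twice the variance impurity, so both measures rank splits identically in the two-class case.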
  53. Classification error
    Classification error or misclassification index
      $\mathrm{class}(D) = 1 - \max_i p_i$
      This is the probability of misclassification when no more split is done and majority voting is used
    Find reduction in impurity by splitting
      $\mathrm{class}(D) - \mathrm{class}_S(D) = \mathrm{class}(D) - \sum_{j=1}^{n} \frac{|D_j|}{|D|} \, \mathrm{class}(D_j)$
    The greater the reduction in impurity, the better the split
    Choose attribute and split point that maximizes reduction
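A minimal sketch of the misclassification index and the reduction obtained by a split. The function names and the ten-object toy data are assumptions made for illustration.

```python
from collections import Counter

def class_error(labels):
    """1 - max_i p_i: error rate of majority voting at a node."""
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def error_reduction(labels, partitions):
    """class(D) minus the size-weighted class error of the partitions."""
    n = len(labels)
    after = sum(len(p) / n * class_error(p) for p in partitions if p)
    return class_error(labels) - after

labels = ["a"] * 6 + ["b"] * 4
partitions = [["a"] * 5 + ["b"], ["a"] + ["b"] * 3]
before = class_error(labels)
reduction = error_reduction(labels, partitions)
print(before, reduction)
```

Before the split, majority voting misclassifies 4 of 10 objects; after it, one object is misclassified in each partition, so the error halves.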
  54. Gain ratio
    Most impurity measures are biased towards multiway splits
      Higher chance that a node becomes purer
    Gain ratio counters it
    For n partitions into $D_1, \dots, D_n$, denoted by S, split information is defined as
      $\mathrm{splitinfo}_S(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$
      Similar to information measure, although it uses just the number of objects in each partition and not any class information
    This is used to normalize information gain
      $\mathrm{gainratio}_S(D) = \mathrm{gain}_S(D) / \mathrm{splitinfo}_S(D)$
    The higher the gain ratio, the better the split
    Choose attribute and split point that maximizes gain ratio
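To see the normalization at work, the following sketch compares a binary split and a four-way split that both produce pure partitions. The helper names and the four-object toy data are assumptions; returning 0 when the split information is 0 is a convention chosen here, not something the slide specifies.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy of a distribution given as raw counts (zero counts are skipped)."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(labels, partitions):
    n = len(labels)
    info_d = entropy(Counter(labels).values())
    info_s = sum(len(p) / n * entropy(Counter(p).values()) for p in partitions if p)
    split_info = entropy(len(p) for p in partitions)   # ignores class information
    if split_info == 0:
        return 0.0                                      # single partition: no real split
    return (info_d - info_s) / split_info

labels = ["a", "a", "b", "b"]
ratio_binary = gain_ratio(labels, [["a", "a"], ["b", "b"]])
ratio_four_way = gain_ratio(labels, [["a"], ["a"], ["b"], ["b"]])
print(ratio_binary, ratio_four_way)
```

Both splits have the same information gain of 1 bit, but the four-way split has twice the split information and therefore half the gain ratio.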
  61. Choosing a split point
    If attribute is nominal
      Each category denotes a new branch
      If binary split is required, use set membership testing
    If attribute is ordinal
      Each category denotes a new branch
      If binary split is required, use order information
    If attribute is numeric
      Sort all values and choose a (binary) split point
      If multiway split is required, choose multiple split points
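For the numeric case, a minimal sketch of "sort all values and choose a split point" follows. Several choices here are assumptions rather than the slide's method: candidates are midpoints between consecutive distinct values, the criterion is the weighted Gini index, and the CPI values are hypothetical.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the test `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                      # no threshold separates equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

cpi = [9.5, 6.0, 8.5, 7.0, 9.0, 7.5]
btp = ["yes", "no", "yes", "no", "yes", "no"]
threshold, impurity = best_binary_split(cpi, btp)
print(threshold, impurity)
```

A multiway split on a numeric attribute would repeat the search to pick several thresholds instead of one.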
  66. Discussion
    Over-fitting can happen
      CPI criterion (in example) over-fitted to CSE students
      Tree needs to be pruned
      Can use criteria such as chi-square test to stop splitting
      Can use criteria such as information gain to merge
    Under-fitting can also happen
      If nodes about advisor decisions (in example) are left out
    Some thresholds are always needed to control these
    Node decisions are based on single attribute – monothetic trees
    Why not polythetic trees where decisions are based on multiple attributes?
      Theoretically possible but practically too complex
  69. Variants of decision tree
    Three main variants
    ID3 (from Iterative Dichotomiser generation 3)
      Multiway split
      Uses information gain
    C4.5
      Evolved from ID3
      Multiway split
      Uses gain ratio
    CART (from Classification and Regression Trees)
      Binary split
      Uses gini index
  72. Rules
    Rules are of the form: if condition then class
      condition is a conjunct (i.e., logical AND) of tests on single attributes
      If the condition holds, then the object is said to be from class
      condition is called antecedent or precondition
      class is called consequent
      Example: if motivated = yes AND cpi ≥ 8 then btp
    Two important parameters of a rule
      Coverage: fraction of objects the rule applies to
        $\text{coverage} = |\text{covers}| / |D|$
      Accuracy: fraction of covered objects that are correctly classified when the rule is applied
        $\text{accuracy} = |\text{correct}| / |\text{covers}|$
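Coverage and accuracy of the example rule can be computed directly from their definitions. The sketch below represents the antecedent as a Python predicate; the five-row dataset is hypothetical and only serves to exercise the two formulas.

```python
def rule_stats(rows, condition, rule_class):
    """Return (coverage, accuracy) of the rule `if condition then rule_class`."""
    covered = [row for row in rows if condition(row)]
    correct = [row for row in covered if row["class"] == rule_class]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical data; "class" holds the true class of each object.
rows = [
    {"motivated": "yes", "cpi": 8.5, "class": "btp"},
    {"motivated": "yes", "cpi": 9.1, "class": "btp"},
    {"motivated": "yes", "cpi": 8.0, "class": "no btp"},
    {"motivated": "yes", "cpi": 7.2, "class": "no btp"},
    {"motivated": "no", "cpi": 9.0, "class": "no btp"},
]

# if motivated = yes AND cpi >= 8 then btp
coverage, accuracy = rule_stats(
    rows, lambda r: r["motivated"] == "yes" and r["cpi"] >= 8, "btp"
)
print(coverage, accuracy)
```

Here the rule covers three of the five objects and classifies two of those three correctly.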
  78. Triggering and firing of rules

    For every tuple, each rule whose antecedent the tuple satisfies is “triggered”
    If it is the only rule triggered for that tuple, then it is “fired”
    Otherwise, a conflict resolution strategy is needed
      Size-based ordering: the rule with the larger antecedent is invoked
        It is more stringent, i.e., tougher to satisfy
      Class-based ordering: two schemes
        Consequent class is more frequent, i.e., according to order of prevalence
        Consequent class has less misclassification
        Within the same class, the ordering is arbitrary
      Rule-based ordering: priority list according to some function based on coverage, accuracy and size
    For a query tuple, the first rule that it satisfies is invoked
    If there is no such rule, then a default rule is invoked: if () then class i
      Class i is the most abundant class
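One way to read the ordering schemes is as an ordered rule list with a fallback. The sketch below uses size-based ordering and a default rule; the rules, the `maybe_btp` class and the training labels are hypothetical.

```python
from collections import Counter

def satisfies(obj, antecedent):
    """True if the tuple passes every test in the antecedent."""
    return all(test(obj[attr]) for attr, test in antecedent.items())

def size_ordered(rules):
    """Size-based ordering: rules with larger antecedents come first."""
    return sorted(rules, key=lambda rule: len(rule[0]), reverse=True)

def classify(obj, ordered_rules, default_class):
    """Invoke the first rule the tuple satisfies, else the default rule."""
    for antecedent, consequent in ordered_rules:
        if satisfies(obj, antecedent):
            return consequent
    return default_class

# Default rule: the most abundant class among (hypothetical) training labels.
training_labels = ["no_btp", "btp", "no_btp", "no_btp"]
default_class = Counter(training_labels).most_common(1)[0][0]

rules = [
    ({"cpi": lambda v: v >= 8}, "maybe_btp"),
    ({"motivated": lambda v: v == "yes", "cpi": lambda v: v >= 8}, "btp"),
]
ordered = size_ordered(rules)

label_a = classify({"motivated": "yes", "cpi": 9.0}, ordered, default_class)
label_b = classify({"motivated": "no", "cpi": 9.0}, ordered, default_class)
label_c = classify({"motivated": "no", "cpi": 6.0}, ordered, default_class)
print(label_a, label_b, label_c)
```

The first tuple triggers both rules and the larger rule wins; the second triggers only the one-test rule; the third triggers nothing and falls through to the default class.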
  81. Learning rules from a decision tree

    Every root-to-leaf path is a rule
    As verbose or complex as the decision tree itself
    Rules are mutually exclusive and exhaustive
    No need to order the rules
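The path-to-rule correspondence can be sketched with a small recursive walk. The tree encoding used here (a leaf is a class label, an internal node is an `(attribute, children)` pair) and the toy tree are assumptions made for illustration.

```python
def tree_to_rules(node, path=()):
    """Return one (antecedent, class) rule per root-to-leaf path."""
    if not isinstance(node, tuple):          # leaf: the node is a class label
        return [(list(path), node)]
    attribute, children = node
    rules = []
    for value, subtree in children.items():
        # Extend the antecedent with the test taken on this branch.
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules

# Hypothetical tree: test motivated first, then whether cpi is at least 8.
tree = ("motivated", {
    "yes": ("cpi_at_least_8", {True: "btp", False: "no_btp"}),
    "no": "no_btp",
})

rules = tree_to_rules(tree)
for antecedent, label in rules:
    print(antecedent, "->", label)
```

Since every tuple follows exactly one path, exactly one of the extracted rules applies to it, which is why no ordering is needed.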
  88. Sequential covering algorithm

    Sequential covering algorithm learns rules sequentially
    Rules are learnt per class, one by one
    When a rule is learnt, all tuples covered by it are removed
    Given a set of tuples, how is a rule learnt?
    Greedy learn-one-rule method learns the “best” rule given the current set of tuples
    General-to-specific strategy
      Starts with an empty antecedent
      At each stage, every attribute (and every possible split) is considered
      If the new rule has better quality than the old rule, it is retained
        Decisions are thus greedy and are never backtracked
      Otherwise, the old rule is accepted
    The next rule is then learnt
    Rules are ordered according to their order of inception
    Variants are AQ, CN2 and RIPPER
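The loop described above can be sketched as follows. This is a simplified illustration, not AQ, CN2 or RIPPER: it assumes categorical attributes with equality tests only, uses FOIL gain as the quality measure, and stops learning for a class once no test improves on the empty rule. All names and data are hypothetical.

```python
from math import log2

def covers(antecedent, obj):
    """True if the tuple passes every (attribute, value) equality test."""
    return all(obj[attr] == value for attr, value in antecedent)

def counts(antecedent, target, data):
    """Return (p, D): positives covered and total tuples covered."""
    covered = [label for obj, label in data if covers(antecedent, obj)]
    return sum(label == target for label in covered), len(covered)

def foil_gain(p1, d1, p2, d2):
    """Gain of moving from rule 1 to rule 2; zero if either covers no positives."""
    if p1 == 0 or p2 == 0:
        return 0.0
    return p2 * (log2(p2 / d2) - log2(p1 / d1))

def learn_one_rule(target, data):
    """Greedy general-to-specific search starting from an empty antecedent."""
    antecedent = []
    while True:
        p1, d1 = counts(antecedent, target, data)
        used = {attr for attr, _ in antecedent}
        # Candidate tests come from tuples the current rule still covers.
        candidates = sorted({(a, v) for obj, _ in data if covers(antecedent, obj)
                             for a, v in obj.items() if a not in used})
        scored = [(foil_gain(p1, d1, *counts(antecedent + [c], target, data)), c)
                  for c in candidates]
        if not scored:
            return antecedent
        gain, test = max(scored)
        if gain <= 0:                 # new rule is no better: keep the old one
            return antecedent
        antecedent = antecedent + [test]   # greedy choice, never backtracked

def sequential_covering(data, classes):
    """Learn rules class by class, removing covered tuples after each rule."""
    rules, remaining = [], list(data)
    for target in classes:
        while any(label == target for _, label in remaining):
            antecedent = learn_one_rule(target, remaining)
            if not antecedent:
                break                 # leave the rest to a default rule
            rules.append((antecedent, target))
            remaining = [(o, l) for o, l in remaining if not covers(antecedent, o)]
    return rules

data = [
    ({"motivated": "yes", "cpi": "high"}, "btp"),
    ({"motivated": "yes", "cpi": "high"}, "btp"),
    ({"motivated": "yes", "cpi": "low"}, "no_btp"),
    ({"motivated": "no", "cpi": "high"}, "no_btp"),
    ({"motivated": "no", "cpi": "low"}, "no_btp"),
]
rules = sequential_covering(data, ["btp", "no_btp"])
print(rules)
```

On this toy data a single two-test rule for `btp` is learnt; the tuples left over all belong to `no_btp`, so no test improves on the empty rule and they would be handled by a default rule.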
  92. Rule quality

    Accuracy is the most vital concern
    A rule with 90% accuracy and 80% coverage is better than another rule with 95% accuracy and 10% coverage
    Coverage also needs to be considered
    Old rule R1 has a1 as antecedent
    New rule R2 has a2 as antecedent
    Let the number of tuples covered by rule Ri be denoted by Di
    For the particular class in question, pi is the number of covered tuples that are correctly classified, i.e., whose class is the consequent
    Correspondingly, ni is the number of negative tuples
      Di = pi + ni
    Four ways of measuring quality
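The comparison of the two rules becomes concrete once counts are attached. The sketch below assumes a dataset of 1000 tuples, a size chosen only for illustration, and derives the counts p and n from the stated coverage and accuracy figures.

```python
def rule_counts(total, covered, correct):
    """Return (coverage, accuracy, p, n) for a rule evaluated on `total` tuples."""
    p, n = correct, covered - correct      # D = p + n = covered
    return covered / total, correct / covered, p, n

# 90% accuracy with 80% coverage versus 95% accuracy with 10% coverage.
broad = rule_counts(total=1000, covered=800, correct=720)
narrow = rule_counts(total=1000, covered=100, correct=95)
print(broad)
print(narrow)
```

Under this assumption the broader rule classifies 720 tuples correctly while the narrower one classifies only 95, which is why accuracy alone is not a sufficient measure.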
  94. Rule quality measures

    FOIL Gain: measure proposed as part of the sequential covering algorithm First Order Inductive Learner (FOIL); used in RIPPER
      $\text{FOIL\_Gain}(R_1 \to R_2) = p_2 \times \left( \log_2 \frac{p_2}{D_2} - \log_2 \frac{p_1}{D_1} \right)$
      Considers both coverage and accuracy
    Statistical test using the likelihood ratio statistic
      $LR = 2 \sum_{i=1}^{m} f_i \log \frac{f_i}{e_i}$
      where m is the number of classes, and f_i and e_i are the observed and expected frequencies of tuples in each class
      LR statistic has a chi-square distribution with m − 1 degrees of freedom
      The larger the statistic, the more the rule deviates from a random rule, and thus, the better
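Both measures are straightforward to compute from counts. In the sketch below the natural logarithm is assumed for the likelihood ratio (the slide only writes "log"), and all counts are invented for illustration.

```python
from math import log, log2

def foil_gain(p1, d1, p2, d2):
    """FOIL gain of replacing rule R1 by R2; assumes p1 > 0 and p2 > 0."""
    return p2 * (log2(p2 / d2) - log2(p1 / d1))

def likelihood_ratio(observed, expected):
    """LR = 2 * sum f_i * log(f_i / e_i); classes with f_i = 0 contribute nothing."""
    return 2 * sum(f * log(f / e) for f, e in zip(observed, expected) if f > 0)

# Hypothetical counts: R1 covers 100 tuples (60 positive), R2 covers 40 (36 positive).
gain = foil_gain(p1=60, d1=100, p2=36, d2=40)

# The 40 tuples covered by R2 split 36/4 over two classes; a random rule on a
# balanced dataset would be expected to split them 20/20.
lr = likelihood_ratio(observed=[36, 4], expected=[20, 20])
lr_random = likelihood_ratio(observed=[20, 20], expected=[20, 20])
print(gain, lr, lr_random)
```

A rule whose covered tuples follow the expected class distribution gets a statistic of zero, while the more skewed rule gets a large positive value.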
  96. Rule quality measures (contd.)

    Entropy: rule with less entropy is better
    m-estimate measure considers the number of classes as well
      $\text{m-estimate} = \frac{p_i + m \cdot c_i}{D_i + m}$
      where m is the number of classes and c_i is the prior probability of class C_i
    If the prior probabilities are not known, replacing c_i by 1/m yields the Laplacian estimate
      $\text{Laplacian} = \frac{p_i + 1}{D_i + m}$
    The larger the estimate, the better the rule
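A short sketch of the two estimates, with made-up counts, shows how they penalise rules that cover very few tuples.

```python
def m_estimate(p, d, m, prior):
    """(p + m * prior) / (d + m), with m the number of classes."""
    return (p + m * prior) / (d + m)

def laplace(p, d, m):
    """m-estimate with the prior replaced by 1/m."""
    return (p + 1) / (d + m)

# Hypothetical two-class problem (m = 2).
broad = laplace(p=36, d=40, m=2)     # 90% accurate on 40 covered tuples
narrow = laplace(p=2, d=2, m=2)      # 100% accurate, but on only 2 tuples
print(broad, narrow)
```

Although the narrow rule is perfectly accurate on what it covers, its estimate (0.75) is lower than that of the broad rule (about 0.88), because the correction terms dominate when D_i is small.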
  98. Rule pruning

    General-to-specific strategy is susceptible to overfitting
    Specific-to-general strategy first learns the most specific rule and then prunes the antecedent
      This is rule pruning
    Each training instance starts as a rule
    From a rule R1, a conjunct of the antecedent is removed to yield rule R2
    Measure of rule quality is FOIL Prune
      $\text{FOIL\_Prune} = \frac{p_i - n_i}{D_i}$
    If this measure is higher for R2, then pruning is applied
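Pruning with this measure can be sketched as a greedy loop that drops one conjunct at a time while the score improves. The choice of dropping the best-scoring conjunct in each pass, the equality-test rule encoding and the data are assumptions for illustration.

```python
def covers(antecedent, obj):
    """True if the tuple passes every (attribute, value) equality test."""
    return all(obj[attr] == value for attr, value in antecedent)

def counts(antecedent, target, data):
    """Return (p, n): positive and negative tuples covered by the rule."""
    covered = [label for obj, label in data if covers(antecedent, obj)]
    p = sum(label == target for label in covered)
    return p, len(covered) - p

def foil_prune(p, n):
    """(p - n) / D with D = p + n; rules covering nothing score lowest."""
    return (p - n) / (p + n) if p + n > 0 else float("-inf")

def prune_rule(antecedent, target, data):
    """Repeatedly remove the conjunct whose removal raises FOIL Prune the most."""
    best = list(antecedent)
    best_score = foil_prune(*counts(best, target, data))
    while len(best) > 1:
        candidates = [best[:i] + best[i + 1:] for i in range(len(best))]
        score, candidate = max(
            ((foil_prune(*counts(c, target, data)), c) for c in candidates),
            key=lambda pair: pair[0],
        )
        if score <= best_score:       # pruning applies only if the measure is higher
            break
        best, best_score = candidate, score
    return best

# Hypothetical tuples; the hostel attribute is irrelevant to the class.
data = [
    ({"motivated": "yes", "cpi": "high", "hostel": "A"}, "btp"),
    ({"motivated": "yes", "cpi": "high", "hostel": "B"}, "btp"),
    ({"motivated": "yes", "cpi": "high", "hostel": "B"}, "btp"),
    ({"motivated": "yes", "cpi": "high", "hostel": "A"}, "no_btp"),
    ({"motivated": "yes", "cpi": "low", "hostel": "A"}, "no_btp"),
    ({"motivated": "no", "cpi": "high", "hostel": "B"}, "no_btp"),
]
specific = [("motivated", "yes"), ("cpi", "high"), ("hostel", "A")]
pruned = prune_rule(specific, "btp", data)
print(pruned)
```

Dropping the hostel test raises the score from 0 to 0.5; dropping either remaining test would lower it to 0.2, so pruning stops with the two-test rule.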
  101. Discussion

    Rules can be very verbose
    Simple rules can only learn rectilinear boundaries
    Rules have an interpretation and can lead to descriptive models
    Can handle imbalance in class distribution very well