# Classification 1

#### pankajmore

September 21, 2012

## Transcript

1. ### CS685: Data Mining: Classification 1

   Arnab Bhattacharya (arnabb@cse.iitk.ac.in), Computer Science and Engineering, Indian Institute of Technology, Kanpur. http://web.cse.iitk.ac.in/~cs685/. 1st semester, 2012-13. Tue, Wed, Fri 0900-1000 at CS101.
2. ### Outline

   1. Preliminaries
   2. Decision trees
   3. Rule-based classifiers
5. ### Classification

   - A dataset of n objects O_i, i = 1, ..., n
   - A total of k classes C_j, j = 1, ..., k
   - Each object belongs to a single class: if object O_i belongs to class C_j, then C(O_i) = j
   - Given a new object O_q, classification is the problem of determining its class C(O_q) out of the k possible choices
   - If, instead of k discrete classes, there is a continuum of values, the problem of determining the value V(O_q) of a new object O_q is called prediction
10. ### General method of classification

    - Total available data is divided randomly into two parts: a training set and a testing set (or validation set)
    - The classification algorithm or model is built using only the training set; the testing set should not be used at all during training
    - Quality of the method is measured using the testing set
    - Stratified: the representation of each class in the training set is proportional to the overall class ratios
    - k-fold cross-validation: data is divided into k random parts; k - 1 groups are used as the training set and the k-th group as the testing set; training is repeated k times with a new testing set each time
    - Leave-one-out cross-validation (LOOCV): the case k = n
    - Stratified cross-validation: the representation in each of the k random groups is proportional to the overall class ratios
    - Supervised learning: the algorithm or model is "supervised" by the class information
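The k-fold scheme above can be sketched in a few lines of Python (a minimal illustration; the function name and the toy data are mine, not from the lecture):

```python
import random

def k_fold_splits(objects, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation:
    shuffle once, cut into k roughly equal random parts, and let each
    part serve as the testing set exactly once."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for j in range(k):
        train = [i for jj, f in enumerate(folds) if jj != j for i in f]
        yield train, folds[j]

data = list("abcdefghij")              # 10 toy objects
splits = list(k_fold_splits(data, k=5))
# LOOCV is simply k = len(data); a stratified variant would shuffle and
# cut each class's indices separately before merging the folds
```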
12. ### Over-fitting and under-fitting

    - Over-fitting: the algorithm or model classifies the training set too well
      - It is too complex or uses too many parameters
      - Generally performs poorly on the testing set
      - Ends up modeling noise rather than data characteristics
    - Under-fitting: the opposite problem; the algorithm or model does not classify even the training set well
      - It is too simple or uses too few parameters
      - Generally performs poorly on the testing set
      - Ends up modeling overall data characteristics instead of per-class characteristics
15. ### Errors

    - Positives (P): objects that are "true" answers
    - Negatives (N): objects that are not answers; N = D - P
    - For any classification algorithm:
      - True Positives (TP): answers that have been found
      - True Negatives (TN): non-answers that have not been found
      - False Positives (FP): non-answers that have been found
      - False Negatives (FN): answers that have not been found
    - P = TP ∪ FN and N = TN ∪ FP
    - Type I error: FP; Type II error: FN

26. ### F-score

    - F-score or F-measure: a single measure capturing both precision and recall
    - Harmonic mean of precision and recall: F-score = (2 × Precision × Recall) / (Precision + Recall)
    - Precision and recall can be weighted as well; when recall is β times more important than precision: F-score = ((1 + β) × Precision × Recall) / (β × Precision + Recall)
    - In terms of errors: F-score = ((1 + β) × TP) / ((1 + β) × TP + β × FN + FP)
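As a quick check of the formula, here is the weighted F-score exactly as defined on the slide, in Python (function name mine; β = 1 reduces it to the harmonic mean):

```python
def f_score(precision, recall, beta=1.0):
    """Weighted F-score as on the slide:
    (1 + beta) * P * R / (beta * P + R); beta = 1 is the harmonic mean."""
    return (1 + beta) * precision * recall / (beta * precision + recall)

f_score(0.5, 2 / 3)   # -> 4/7 ≈ 0.571, the values from the worked example
```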
41. ### Example

    - D = {O1, O2, O3, O4, O5, O6, O7, O8}
    - Correct answer set = {O1, O5, O7}; algorithm returns {O1, O3, O5, O6}
    - ∴ P = {O1, O5, O7} and N = {O2, O3, O4, O6, O8}
    - TP = {O1, O5}, TN = {O2, O4, O8}, FP = {O3, O6}, FN = {O7}
    - ∴ Recall = Sensitivity = 2/3 ≈ 0.67
    - Precision = 2/4 = 0.5
    - Specificity = 3/5 = 0.6
    - F-score = 4/7 ≈ 0.571
    - Accuracy = 5/8 = 0.625
    - Error rate = 3/8 = 0.375
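The whole worked example can be reproduced with Python set operations (a sketch; the variable names are mine):

```python
D = {1, 2, 3, 4, 5, 6, 7, 8}        # objects O1..O8 by index
P = {1, 5, 7}                        # correct answer set
returned = {1, 3, 5, 6}              # what the algorithm returns

N = D - P
TP, FP = returned & P, returned - P
FN, TN = P - returned, N - returned

recall      = len(TP) / len(P)                  # 2/3
precision   = len(TP) / len(returned)           # 2/4 = 0.5
specificity = len(TN) / len(N)                  # 3/5 = 0.6
f_score     = 2 * precision * recall / (precision + recall)   # 4/7
accuracy    = (len(TP) + len(TN)) / len(D)      # 5/8 = 0.625
error_rate  = (len(FP) + len(FN)) / len(D)      # 3/8 = 0.375
```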
42. ### ROC curve

    - ROC (Receiver Operating Characteristics) curve: 1 - specificity (x-axis) versus sensitivity (y-axis), i.e., false positive rate versus true positive rate
    - A random-guess algorithm is a 45° line
    - The area under the curve measures accuracy (or discrimination): a perfect algorithm has an area of 1; a random (or useless) algorithm has an area of 0.5
    - Grading: 0.9-1.0 excellent; 0.8-0.9 good; 0.7-0.8 fair; 0.6-0.7 poor; below 0.6 fail

    [Figure: ROC curves of Algorithm 1, Algorithm 2, a perfect algorithm, and a random-guess algorithm; the shaded area under Algorithm 2's curve is its accuracy]
44. ### Confusion matrix

    - Answers found by the algorithm (predictions) on rows versus "true" answers (actuals) on columns
    - Shows which classes are harder to identify

    | Found by algorithm | True Positives P | True Negatives N |
    |--------------------|------------------|------------------|
    | Positives P        | TP               | FP               |
    | Negatives N        | FN               | TN               |

    For our earlier example (with row and column totals):

    |   | P | N |   |
    |---|---|---|---|
    | P | 2 | 2 | 4 |
    | N | 1 | 3 | 4 |
    |   | 3 | 5 | 8 |
45. ### Outline

    1. Preliminaries
    2. Decision trees
    3. Rule-based classifiers
46. ### Decision trees

    - A decision tree is a tree structure used for classification
    - Each internal node represents a test on an attribute
    - Each branch represents an outcome of the test
    - Each leaf represents a class outcome
    - For a test object, its attributes are tested and a particular path is followed to a leaf, which is deemed its class

    [Figure: example tree for deciding BTP, with yes/no tests "Motivated?", "CPI >= 8?", "Advisor recommends strongly?", and "Advisor willing?" leading to BTP or No BTP leaves]
50. ### Constructing a decision tree

    - If all objects are in the same class, label the leaf node with that class; the leaf is then pure
    - Else, choose the "best" attribute to split
      - Determine the splitting criterion based on the splitting attribute; it indicates split point(s) or splitting subset(s)
      - Different measures of impurity can be used to split a node
      - Separate objects into different branches according to the split
      - Recursively build the tree for each split
    - Stop when either the leaf becomes pure or there are no more attributes to split, in which case the class is assigned through majority voting
    - Decision tree building is top-down, and no backtracking is allowed
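The recursive procedure above can be sketched as plain Python (an illustration, not the course's code; `gini` is one possible impurity measure, and the toy data is hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, attrs, impurity):
    """Top-down induction: stop on a pure leaf or when attributes run out
    (majority vote); otherwise split on the attribute whose partitions
    have the least weighted impurity, and recurse. No backtracking."""
    if len(set(labels)) == 1:
        return labels[0]                                # pure leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]     # majority vote
    def split_impurity(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            part = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(part) / len(rows) * impurity(part)
        return total
    best = min(attrs, key=split_impurity)
    branches = {}
    for v in set(r[best] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        branches[v] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                 [a for a in attrs if a != best], impurity)
    return (best, branches)                             # internal node

rows = [{"motivated": "yes", "cpi": "high"}, {"motivated": "yes", "cpi": "low"},
        {"motivated": "no", "cpi": "high"}, {"motivated": "no", "cpi": "low"}]
labels = ["btp", "btp", "no-btp", "no-btp"]
tree = build_tree(rows, labels, ["motivated", "cpi"], gini)
# -> ("motivated", {"yes": "btp", "no": "no-btp"})
```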
51. ### Information gain

    - Entropy impurity or information impurity: info(D) = -Σ_{i=1}^{k} p_i log2(p_i)
    - For n partitions into D_1, ..., D_n, denoted by S: info_S(D) = Σ_{j=1}^{n} (|D_j|/|D|) info(D_j)
    - Information gain is gain_S(D) = info(D) - info_S(D)
    - The more the gain, the better the split; choose the attribute and split point that maximize the gain
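In Python, entropy impurity and information gain read directly off these formulas (a small sketch; the function names are mine):

```python
from math import log2
from collections import Counter

def info(labels):
    """Entropy impurity: -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, partitions):
    """gain_S(D) = info(D) - sum |D_j|/|D| * info(D_j)."""
    n = len(labels)
    return info(labels) - sum(len(p) / n * info(p) for p in partitions)

# Splitting a 50/50 node into two pure halves gains one full bit
info_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]])   # -> 1.0
```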
52. ### Gini index

    - Variance impurity for two classes: var(D) = p_1 · p_2
    - For k classes, generalized to the Gini index or Gini impurity: gini(D) = Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} p_i · p_j = 1 - Σ_{i=1}^{k} p_i²
    - For n partitions into D_1, ..., D_n, denoted by S: gini_S(D) = Σ_{j=1}^{n} (|D_j|/|D|) gini(D_j)
    - The less the Gini index, the better the split; choose the attribute and split point that minimize it
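The same kind of sketch for the Gini index (function names mine; a lower weighted Gini means a better split):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels, partitions):
    """gini_S(D) = sum |D_j|/|D| * gini(D_j); minimized over candidate splits."""
    n = len(labels)
    return sum(len(p) / n * gini(p) for p in partitions)

gini(["a", "a", "b", "b"])                                   # -> 0.5
gini_split(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]])   # -> 0.0
```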
53. ### Classification error

    - Classification error or misclassification index: class(D) = 1 - max_i p_i
    - This is the probability of misclassification when no more splitting is done and majority voting is used
    - Reduction in impurity by splitting: class(D) - class_S(D) = class(D) - Σ_{j=1}^{n} (|D_j|/|D|) class(D_j)
    - The more the reduction in impurity, the better the split; choose the attribute and split point that maximize the reduction
54. ### Gain ratio

    - Most impurity measures are biased towards multiway splits, since a node then has a higher chance of becoming purer; the gain ratio counters this
    - For n partitions into D_1, ..., D_n, denoted by S, the split information is defined as splitinfo_S(D) = -Σ_{j=1}^{n} (|D_j|/|D|) log2(|D_j|/|D|)
    - Similar to the information measure, although it uses only the number of objects in each partition and not any class information
    - It is used to normalize the information gain: gainratio_S(D) = gain_S(D) / splitinfo_S(D)
    - The higher the gain ratio, the better the split; choose the attribute and split point that maximize it
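Split information and the gain ratio, again as a direct transcription of the formulas (function names mine):

```python
from math import log2

def split_info(sizes):
    """splitinfo_S(D) = -sum |D_j|/|D| * log2(|D_j|/|D|); class-blind,
    it looks only at partition sizes."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Normalize an information gain by the split information."""
    return gain / split_info(sizes)

# An even binary split carries 1 bit of split information, so a gain of
# 1.0 keeps a ratio of 1.0; an even 4-way split is penalized (2 bits)
gain_ratio(1.0, [2, 2])         # -> 1.0
gain_ratio(1.0, [1, 1, 1, 1])   # -> 0.5
```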
61. ### Choosing a split point

    - If the attribute is nominal
      - Each category denotes a new branch
      - If a binary split is required, use set membership testing
    - If the attribute is ordinal
      - Each category denotes a new branch
      - If a binary split is required, use the order information
    - If the attribute is numeric
      - Sort all values and choose a (binary) split point
      - If a multiway split is required, choose multiple split points
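For a numeric attribute, one common concrete choice (an assumption here, not stated on the slide) is to try the midpoints between consecutive sorted values and keep the one with the lowest weighted impurity:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Sort the values, try the midpoint between each pair of consecutive
    distinct values, and return the threshold minimizing weighted Gini."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical CPI values: the threshold between the two classes is chosen
best_binary_split([6.0, 7.0, 8.5, 9.0], ["no-btp", "no-btp", "btp", "btp"])  # -> 7.75
```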
66. ### Discussion

    - Over-fitting can happen: the CPI criterion (in the example) is over-fitted to CSE students
      - The tree needs to be pruned
      - Criteria such as the chi-square test can be used to stop splitting
      - Criteria such as information gain can be used to merge
    - Under-fitting can also happen: e.g., if the nodes about advisor decisions (in the example) are left out
      - Some thresholds are always needed to control these
    - Node decisions are based on a single attribute: monothetic trees
    - Why not polythetic trees, where decisions are based on multiple attributes? Theoretically possible, but practically too complex
69. ### Variants of decision tree

    Three main variants:

    - ID3 (from Iterative Dichotomiser generation 3): multiway split; uses information gain
    - C4.5: evolved from ID3; multiway split; uses gain ratio
    - CART (from Classification and Regression Trees): binary split; uses Gini index
70. ### Outline

    1. Preliminaries
    2. Decision trees
    3. Rule-based classifiers
72. ### Rules

    - Rules are of the form: if condition then class
    - The condition is a conjunct (i.e., logical AND) of tests on single attributes
    - If the condition holds, then the object is said to be from class
    - The condition is called the antecedent or precondition; the class is called the consequent
    - Example: if motivated = yes AND cpi ≥ 8 then btp
    - Two important parameters of a rule:
      - Coverage: the fraction of objects the rule applies to; coverage = |covers| / |D|
      - Accuracy: the fraction of correctly classified objects when the rule is applied; accuracy = |correct| / |covers|
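Coverage and accuracy can be computed directly from their definitions; the mini-dataset and helper below are hypothetical illustrations built around the slide's example rule:

```python
def rule_stats(rule, data):
    """coverage = |covers| / |D|, accuracy = |correct| / |covers|, where
    `rule` is (condition, predicted_class) and each tuple in `data` is
    (attributes_dict, true_class)."""
    condition, predicted = rule
    covers = [(attrs, cls) for attrs, cls in data if condition(attrs)]
    correct = sum(1 for _, cls in covers if cls == predicted)
    coverage = len(covers) / len(data)
    accuracy = correct / len(covers) if covers else 0.0
    return coverage, accuracy

# Hypothetical mini-dataset for: if motivated = yes AND cpi >= 8 then btp
data = [({"motivated": "yes", "cpi": 8.5}, "btp"),
        ({"motivated": "yes", "cpi": 9.0}, "btp"),
        ({"motivated": "yes", "cpi": 7.0}, "no-btp"),
        ({"motivated": "no", "cpi": 8.2}, "no-btp")]
rule = (lambda a: a["motivated"] == "yes" and a["cpi"] >= 8, "btp")
rule_stats(rule, data)   # -> (0.5, 1.0)
```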
78. ### Triggering and firing of rules

    - For every tuple, a rule that satisfies it is "triggered"
    - If, for that tuple, it is the only such rule, then it is "fired"
    - Otherwise, a conflict resolution strategy is devised
      - Size-based ordering: the rule with the larger antecedent is invoked, as it is more stringent, i.e., tougher
      - Class-based ordering, with two schemes: the consequent class that is more frequent (i.e., according to order of prevalence), or the consequent class with less misclassification; within the same class, the ordering is arbitrary
      - Rule-based ordering: a priority list according to some function based on coverage, accuracy, and size
    - For a query tuple, the first rule that satisfies it is invoked
    - If no such rule exists, a default rule is invoked: if () then class i, where class i is the most abundant class
82. ### Learning rules from a decision tree

    - Every path is a rule
    - This is as verbose or complex as the decision tree itself
    - The rules are mutually exclusive and exhaustive
    - There is no need to order the rules
83. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
84. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
85. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Greedy learn-one-rule method learns the “best” rule given the current set of tuples General-to-speciﬁc strategy Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
86. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Greedy learn-one-rule method learns the “best” rule given the current set of tuples General-to-speciﬁc strategy Starts with an empty antecedent At each stage, every attribute (and every possible split) is considered If the new rule has better quality than the old rule, it is retained Decisions are thus greedy and are never backtracked Otherwise, the old rule is accepted Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
87. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Greedy learn-one-rule method learns the “best” rule given the current set of tuples General-to-speciﬁc strategy Starts with an empty antecedent At each stage, every attribute (and every possible split) is considered If the new rule has better quality than the old rule, it is retained Decisions are thus greedy and are never backtracked Otherwise, the old rule is accepted The next rule is then learnt Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
88. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Greedy learn-one-rule method learns the “best” rule given the current set of tuples General-to-speciﬁc strategy Starts with an empty antecedent At each stage, every attribute (and every possible split) is considered If the new rule has better quality than the old rule, it is retained Otherwise, the old rule is accepted Decisions are thus greedy and are never backtracked The next rule is then learnt Rules are ordered according to their order of inception Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
89. ### Sequential covering algorithm Sequential covering algorithm learns rules sequentially Rules

are learnt per class one-by-one When a rule is learnt, all tuples covered by it are removed Given a set of tuples, how is a rule learnt? Greedy learn-one-rule method learns the “best” rule given the current set of tuples General-to-speciﬁc strategy Starts with an empty antecedent At each stage, every attribute (and every possible split) is considered If the new rule has better quality than the old rule, it is retained Otherwise, the old rule is accepted Decisions are thus greedy and are never backtracked The next rule is then learnt Rules are ordered according to their order of inception Variants are AQ, CN2 and RIPPER Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 27 / 32
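The greedy loop described on this slide can be sketched as follows. This is a minimal illustration, not the canonical algorithm: the dataset-as-dicts format, the accuracy-based `quality` function, and all helper names (`covers`, `learn_one_rule`, `sequential_covering`) are my own assumptions.

```python
def covers(rule, tuple_):
    """A rule is a dict attribute -> required value (the antecedent)."""
    return all(tuple_.get(attr) == val for attr, val in rule.items())

def quality(rule, data, target_class):
    """Accuracy of the rule on the tuples it covers (0 if it covers none)."""
    covered = [t for t in data if covers(rule, t)]
    if not covered:
        return 0.0
    positives = sum(1 for t in covered if t["class"] == target_class)
    return positives / len(covered)

def learn_one_rule(data, target_class, attributes):
    """General-to-specific search: start from the empty antecedent and
    greedily add the single best attribute test; never backtrack."""
    rule = {}
    improved = True
    while improved:
        improved = False
        best_rule, best_q = rule, quality(rule, data, target_class)
        for attr in attributes:
            if attr in rule:
                continue
            for val in {t[attr] for t in data}:
                cand = {**rule, attr: val}
                q = quality(cand, data, target_class)
                if q > best_q:          # better quality: retain the new rule
                    best_rule, best_q, improved = cand, q, True
        rule = best_rule                # otherwise the old rule is kept
    return rule

def sequential_covering(data, target_class, attributes):
    """Learn rules for one class one-by-one, removing covered tuples."""
    rules, remaining = [], list(data)
    while any(t["class"] == target_class for t in remaining):
        rule = learn_one_rule(remaining, target_class, attributes)
        if not rule:                    # no discriminating test found
            break
        rules.append(rule)
        remaining = [t for t in remaining if not covers(rule, t)]
    return rules
```

On a toy weather table, for instance, the loop learns a single rule covering all positive tuples and stops once none remain uncovered.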
90. ### Rule quality Accuracy is the most vital concern Arnab Bhattacharya

(arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 28 / 32
91. ### Rule quality Accuracy is the most vital concern A rule

with 90% accuracy and 80% coverage is better than another rule with 95% accuracy and 10% coverage Coverage also needs to be considered Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 28 / 32
92. ### Rule quality Accuracy is the most vital concern A rule

with 90% accuracy and 80% coverage is better than another rule with 95% accuracy and 10% coverage Coverage also needs to be considered Old rule R1 has a1 as antecedent New rule R2 has a2 as antecedent Let the number of tuples covered by a rule be denoted by Di For the particular class in question, pi is the number of tuples correctly classiﬁed, i.e., the consequent is this class Correspondingly, ni is the number of negative tuples Di = pi + ni Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 28 / 32
93. ### Rule quality Accuracy is the most vital concern A rule

with 90% accuracy and 80% coverage is better than another rule with 95% accuracy and 10% coverage Coverage also needs to be considered Old rule R1 has a1 as antecedent New rule R2 has a2 as antecedent Let the number of tuples covered by a rule be denoted by Di For the particular class in question, pi is the number of tuples correctly classiﬁed, i.e., the consequent is this class Correspondingly, ni is the number of negative tuples Di = pi + ni Four ways of measuring quality Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 28 / 32
94. ### Rule quality measures FOIL Gain measure proposed as part of

the sequential covering algorithm First Order Inductive Learner (FOIL); used in RIPPER FOIL Gain(R1 → R2) = p2 × (log2(p2/D2) − log2(p1/D1)) Considers both coverage and accuracy Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 29 / 32
95. ### Rule quality measures FOIL Gain measure proposed as part of

the sequential covering algorithm First Order Inductive Learner (FOIL); used in RIPPER FOIL Gain(R1 → R2) = p2 × (log2(p2/D2) − log2(p1/D1)) Considers both coverage and accuracy Statistical test using the likelihood ratio statistic LR = 2 Σi=1..m fi log(fi/ei) where m is the number of classes, fi and ei are the observed and expected frequencies of tuples in each class LR statistic has a chi-square distribution with m − 1 degrees of freedom The larger the statistic, the more the rule deviates from a random rule, and thus, the better Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 29 / 32
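Both measures on this slide are short enough to state directly in code; a sketch, where the function names and the example counts in the test are illustrative assumptions:

```python
import math

def foil_gain(p1, d1, p2, d2):
    """FOIL_Gain(R1 -> R2) = p2 * (log2(p2/D2) - log2(p1/D1)).
    The leading factor p2 rewards rules that still cover many positive
    tuples, so the measure trades accuracy off against coverage."""
    return p2 * (math.log2(p2 / d2) - math.log2(p1 / d1))

def likelihood_ratio(observed, expected):
    """LR = 2 * sum_i f_i * log(f_i / e_i) over the m classes.
    Under the null hypothesis of a random rule, LR is chi-square
    distributed with m - 1 degrees of freedom; larger is better."""
    return 2 * sum(f * math.log(f / e)
                   for f, e in zip(observed, expected) if f > 0)
```

For example, a refinement that keeps 9 of the old rule's 10 positives while halving the covered set gets a positive gain, since its accuracy p2/D2 rose from 0.5 to 0.9.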
96. ### Rule quality measures (contd.) Entropy: rule with less entropy is

better Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 30 / 32
97. ### Rule quality measures (contd.) Entropy: rule with less entropy is

better m-estimate measure considers the number of classes as well m-estimate = (pi + m·ci) / (Di + m) where m is the number of classes and ci is the prior probability of class Ci If the prior probabilities are not known, replacing ci by 1/m yields the Laplacian estimate Laplacian = (pi + 1) / (Di + m) The larger the estimate, the better is the rule Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 30 / 32
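The two estimates are one-liners; a sketch with assumed function and parameter names:

```python
def m_estimate(p_i, D_i, m, c_i):
    """m-estimate = (p_i + m*c_i) / (D_i + m), where m is the number of
    classes and c_i is the prior probability of class C_i."""
    return (p_i + m * c_i) / (D_i + m)

def laplacian(p_i, D_i, m):
    """Laplacian = (p_i + 1) / (D_i + m): the m-estimate with the
    unknown prior c_i replaced by 1/m."""
    return (p_i + 1) / (D_i + m)
```

With two classes and a uniform prior ci = 1/2, the two estimates coincide, as the substitution ci = 1/m suggests.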
98. ### Rule pruning General-to-speciﬁc strategy is susceptible to overﬁtting Speciﬁc-to-general strategy

ﬁrst learns the most speciﬁc rule and then prunes the antecedent This is rule pruning Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 31 / 32
99. ### Rule pruning General-to-speciﬁc strategy is susceptible to overﬁtting Speciﬁc-to-general strategy

ﬁrst learns the most speciﬁc rule and then prunes the antecedent This is rule pruning Each training instance starts as a rule From a rule R1, an antecedent is removed to yield rule R2 Measure of rule quality is FOIL Prune FOIL Prune = (pi − ni) / Di If this measure is higher for R2, then pruning is applied Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 31 / 32
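The pruning step can be sketched as a greedy loop over antecedents. Everything here is an illustrative assumption: `evaluate(conds)` is a hypothetical callback returning the counts (p, n) for a rule on the pruning data, and the loop simply keeps dropping whichever condition raises FOIL Prune.

```python
def foil_prune(p, n):
    """FOIL_Prune = (p - n) / (p + n) = (p - n) / D; higher is better."""
    return (p - n) / (p + n)

def prune_rule(antecedents, evaluate):
    """Drop one antecedent at a time while doing so raises FOIL_Prune.
    `evaluate(conds)` (an assumed callback) returns (p, n) for the rule
    whose antecedent is the conjunction of `conds`."""
    current = list(antecedents)
    p, n = evaluate(current)
    score = foil_prune(p, n)
    improved = True
    while improved and current:
        improved = False
        for cond in list(current):
            trial = [c for c in current if c != cond]
            tp, tn = evaluate(trial)
            if tp + tn > 0 and foil_prune(tp, tn) > score:
                current, score, improved = trial, foil_prune(tp, tn), True
                break
    return current
```

If removing a condition raises the score (say from 0.6 to 0.8), the pruned rule R2 replaces R1; pruning stops once every removal would lower it.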
100. ### Discussion Rules can be very verbose Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685:

Classiﬁcation 1 2012-13 32 / 32
101. ### Discussion Rules can be very verbose Simple rules can only

learn rectilinear boundaries Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 32 / 32
102. ### Discussion Rules can be very verbose Simple rules can only

learn rectilinear boundaries Rules have an interpretation and can lead to descriptive models Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 32 / 32
103. ### Discussion Rules can be very verbose Simple rules can only

learn rectilinear boundaries Rules have an interpretation and can lead to descriptive models Can handle imbalance in class distribution very well Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classiﬁcation 1 2012-13 32 / 32