An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results
Enrique Amigó (UNED, Madrid, Spain, enrique@lsi.uned.es), Julio Gonzalo (UNED, Madrid, Spain, julio@lsi.uned.es), Stefano Mizzaro (University of Udine, Udine, Italy, mizzaro@uniud.it), Jorge Carrillo-de-Albornoz (UNED, Madrid, Spain, jcalbornoz@lsi.uned.es)

Abstract (excerpt): In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for in…) …

Introduction (excerpt): … classification measures ignore the ordering between classes, ranking metrics ignore category matching, and value prediction metrics are used by assuming (usually equal) numeric intervals between categories. In this paper we propose a metric designed to evaluate Ordinal Classification systems which re…
Figure 1 from the paper: two histograms of review scores (# papers per class).
Left:  reject 7, weak reject 105, undecided 193, weak accept 90, accept 7
Right: reject 180, weak reject 10, undecided 3, weak accept 10, accept 173

Figure 1 caption: In the left distribution, weak accept vs. weak reject would be a strong disagreement between reviewers (i.e., the classes are distant), because in practice these are almost the extreme cases of the scale (reviewers rarely go for accept or reject). In the right distribution the situation is the opposite: reviewers tend to take a clear stance, which makes weak accept and weak reject closer assessments than in the left case.

Figure cited from https://www.aclweb.org/anthology/2020.acl-main.363/

Slide note: confusing weak accept with weak reject is bad in the left case (the assessment changes greatly), but still tolerable in the right case (not much difference either way).
Notation: s = system output, g = gold standard, d ∈ D = data items.

• Ordinal Invariance: Eff(s, g) = Eff(f(s), f(g)), where f is a strictly increasing function.
• Ordinal Monotonicity: Eff(s′, g) > Eff(s, g) if ∃d. (s(d) ≠ s′(d)) ∧ ∀d. ((s(d) > s′(d) ≥ g(d)) ∨ (s(d) = s′(d))); i.e., every changed prediction in s′ moves closer to the gold class without overshooting it, and at least one prediction changes.
• Class Imbalance: Eff(g_{d1→c2}, g) > Eff(g_{d3→c2}, g) where n_{c1} > n_{c3}; here g_{d→c} denotes the gold standard with item d reassigned to class c, d1 belongs to the frequent class c1, and d3 to the infrequent class c3.

Figure cited from https://www.aclweb.org/anthology/2020.acl-main.363/

Excerpt: … Additionally, at interval scale, CEM_INT would be equivalent to a logarithmic version of MAE whenever items are uniformly distributed across classes. We leave a more detailed formal and empirical analysis of CEM at other scales for future work, as it is not the primary scope of this paper.

Theoretical Evidence: Using a methodology previously applied for classification (Sebastiani, 2015; Sokolova, 2006), clustering (Dom, 2001; Meila, 2003; Amigó et al., …), and document ranking tasks (Moffat, 2013; … et al., 2013b), here we define a formal framework for OC via desirable properties to be satisfied, which are illustrated in Figure 2 and introduced below.

Metric Properties: The first property states that an effectiveness metric Eff(s, g) should not assume predefined intervals between classes, i.e., it should be invariant under permissible transformation functions at the ordinal scale. … [if an output s′ is] strictly better, then the metric score of s′ must be higher. Finally, in order to manage the effect of imbalanced data sets, another desirable property is that an item classification error in a frequent class should have less effect than a classification error in an infrequent class.

Figure 2 caption: Illustration of desirable formal properties for Ordinal Classification. Each bin is a system output, where columns represent ordered classes assigned by the system, and colors represent the items' true classes, ordered from black to white. "=" means that both outputs should have the same quality, and ">" that the left output should receive a higher metric value than the right output.
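To make the Ordinal Invariance property concrete, here is a small sketch (my own illustration, not from the paper) showing that Accuracy is invariant under a strictly increasing relabeling of the classes, while MAE, which assumes numeric intervals, is not:

```python
def accuracy(s, g):
    # fraction of exact class matches: unaffected by relabeling
    return sum(si == gi for si, gi in zip(s, g)) / len(g)

def mae(s, g):
    # mean absolute error: assumes numeric intervals between classes
    return sum(abs(si - gi) for si, gi in zip(s, g)) / len(g)

s = [0, 1, 2, 2]  # system output (hypothetical example data)
g = [0, 2, 2, 1]  # gold standard

# f(c) = c*c is strictly increasing on {0, 1, 2}: a permissible ordinal transform
f = lambda c: c * c
fs, fg = [f(c) for c in s], [f(c) for c in g]

print(accuracy(s, g) == accuracy(fs, fg))  # True: Accuracy is ordinally invariant
print(mae(s, g), mae(fs, fg))              # 0.5 1.5: MAE changes under the transform
```

The same transform leaves the class ordering intact but changes the distances, which is exactly why metrics built on numeric intervals violate the invariance property.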
Informational closeness: classes a and b are defined as "close" when the probability of observing an item x between them is small.

P(x ⪯_b^ORD a): the probability that, compared with b, x is at least as close to a (i.e., x falls between a and b).

CIQ_ORD(a, b) := −log P(x ⪯_b^ORD a)
CIQ_ORD(s(d), g(d)) = −log P(x ⪯_{g(d)}^ORD s(d))
x ⪯_b^ORD a holds when a ≤ x ≤ b or b ≤ x ≤ a.

With n_i = number of items in class c_i, N = total number of items, and g(d) = c_i:

CIQ_ORD(c_i, c_j) = −log( (n_i/2 + Σ_{k=i+1}^{j} n_k) / N )

For c_i = c_j: CIQ_ORD(c_i, c_i) = −log( n_i / 2N )

• The 1/2 factor applied to n_i is empirical (according to the authors).
Excerpt: … the true classes in the gold standard. A key idea in our metric is to establish a notion of informational closeness that depends on how items are distributed in the rank of classes. The idea is that, given the classes assigned to item d ∈ D by the gold standard and the system output, CIQ_ORD(s(d), g(d)) measures the closeness between the assigned class and the gold standard class.

Worked example on the Figure 1 distributions (logs in base 2; N = 402 on the left, 376 on the right):

Left:  CIQ_ORD(weak accept, weak reject) = −log( (90/2 + 193 + 105) / 402 ) ≃ 0.23
Right: CIQ_ORD(weak accept, weak reject) = −log( (10/2 + 3 + 10) / 376 ) ≃ 4.38

Figure cited from https://www.aclweb.org/anthology/2020.acl-main.363/
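As a sanity check on the arithmetic above, here is a minimal sketch of the CIQ_ORD computation (function and variable names are mine, not the paper's):

```python
import math

def ciq_ord(counts, i, j):
    """Informational closeness CIQ_ORD(c_i, c_j) between ordered classes.

    counts[k] is the number of gold-standard items in class c_k.
    The probability mass between the two classes is accumulated with the
    (empirical) 1/2 factor applied to class c_i, then -log2 is taken.
    """
    n = float(sum(counts))
    if i == j:
        mass = counts[i] / 2
    elif i < j:
        mass = counts[i] / 2 + sum(counts[i + 1 : j + 1])
    else:
        mass = counts[i] / 2 + sum(counts[j:i])
    return -math.log2(mass / n)

# Figure 1 distributions: reject, weak reject, undecided, weak accept, accept
left = [7, 105, 193, 90, 7]
right = [180, 10, 3, 10, 173]

# weak accept (index 3) vs. weak reject (index 1)
print(round(ciq_ord(left, 3, 1), 2))   # 0.23: lots of mass in between, distant classes
print(round(ciq_ord(right, 3, 1), 2))  # 4.38: little mass in between, close classes
```

The same pair of labels thus gets a very different closeness under the two distributions, which is the point of Figure 1.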
Properties satisfied by existing metrics (✓ = satisfied):

Family             | Metric                             | Ord. Inv. | Ord. Mon. | Imb.
Classification     | Acc                                | ✓ | - | -
                   | Acc with n                         | ✓ | - | -
                   | Macro Avg Acc, Cohen's kappa       | ✓ | - | ✓
                   | F-measure avg. across classes      | ✓ | - | ✓
Value Prediction   | MAE, MSE                           | - | ✓ | -
                   | Macro Avg. MAE/MSE                 | - | ✓ | ✓
                   | Weighted kappa                     | - | ✓ | ✓
                   | Rennie & Srebro loss function      | - | ✓ | -
                   | Cosine similarity                  | - | ✓ | -
Correlation Coeff. | Linear correlation                 | - | - | -
                   | Ordinal: Kendall (Tau-b), Spearman | ✓ | - | ✓
                   | Kendall (Tau-a)                    | ✓ | - | -
Clustering         | Reliability and Sensitivity        | ✓ | - | ✓
                   | MI, Purity and Inv. Purity         | ✓ | - | ✓
Path based         | Ordinal Classification Index       | ✓ | - | -
CEM                | CEM_NOM                            | ✓ | - | ✓
                   | CEM_INT                            | - | ✓ | ✓
                   | CEM_ORD                            | ✓ | ✓ | ✓

Excerpt: The most popular Value Prediction metrics are … In Tau-a, only discordant pairs (g(d1) > g(d2) and s(d1) < s(d2)) … The most popular coefficient approach (Tau-b) and Spearman satisfy imbalance. The Pearson coefficient does not avoid the interval effect. Reliability and Sensitivity, which extend the clustering metrics, are essentially an ordinal correlation metric, invariant but failing in monotonicity, while satisfying imbalance due to the clustering notions. By definition, clustering metrics are ordinally invariant, because they are not affected by permutations of category descriptors. In addition, some of them, such as Mutual Information (MI) or Purity and Inverse Purity, satisfy imbalance. However, they are not ordinal monotonic, given that they do not consider any ordinal relationship between classes. Finally, we must include the approach of Cardoso and Sousa (2011), the Ordinal Classification Index, a path-based metric designed specifically for OC problems, which integrates aspects from the previous metric families, including two parameters …

→ CEM_ORD satisfies all three: Ordinal Invariance, Ordinal Monotonicity, Class Imbalance.

Figure cited from https://www.aclweb.org/anthology/2020.acl-main.363/
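For reference, a sketch of how CEM_ORD could aggregate CIQ_ORD over a data set. Normalizing the summed closeness by the gold standard's self-closeness (so a perfect system scores 1.0) is my reading of the paper's CEM definition, and the names are mine; check against the original before relying on it:

```python
import math

def ciq_ord(counts, i, j):
    # informational closeness between classes c_i and c_j (as on the previous slide)
    n = float(sum(counts))
    if i == j:
        mass = counts[i] / 2
    elif i < j:
        mass = counts[i] / 2 + sum(counts[i + 1 : j + 1])
    else:
        mass = counts[i] / 2 + sum(counts[j:i])
    return -math.log2(mass / n)

def cem_ord(system, gold, counts):
    # summed closeness of predictions to gold labels, normalized by the
    # maximum achievable closeness (predicting every gold label exactly)
    num = sum(ciq_ord(counts, s, g) for s, g in zip(system, gold))
    den = sum(ciq_ord(counts, g, g) for g in gold)
    return num / den

counts = [7, 105, 193, 90, 7]          # class distribution (Figure 1, left)
gold = [1, 2, 2, 3, 3]                 # hypothetical gold labels
print(cem_ord(gold, gold, counts))     # 1.0: perfect system
print(cem_ord([1, 2, 2, 1, 3], gold, counts) < 1.0)  # True: an error lowers the score
```

Because CIQ_ORD between distinct classes is always smaller than a class's self-closeness, the score stays in (0, 1].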
Cov_ℳ(m) = Spea( m(s) − m(s′), UIR_ℳ(s, s′) )

Spearman rank correlation between the score difference of a metric m and the Unanimous Improvement Ratio (UIR) over ℳ = {Acc, Kendall, MI}: the signed proportion of test cases on which Acc, Kendall, and MI all improve simultaneously.

Excerpt: … [In the exper]iments, in addition to robustness, we select three complementary metrics, each focused on one of these partial aspects, and we evaluate to what extent existing OC metrics are able to capture all these aspects simultaneously. The selected metrics are: (i) Accuracy, as a partial metric which captures class matching; (ii) Kendall's correlation coefficient Tau-a (without counting ties), in order to capture class ordering; and (iii) Mutual Information (MI), a clustering metric which reflects how much knowing the system output reduces uncertainty about the gold standard values. This metric accentuates the effect of small classes (imbalance property).

5.1 Meta-evaluation Metric

In order to quantify the ability of metrics to capture the aspects reflected by these three metrics, we use the Unanimous Improvement Ratio (UIR) (Amigó et al., 2011). While robustness focuses on consistency across data sets, UIR focuses on consistency across metrics. It essentially counts in how many test cases an improvement is observed for all metrics simultaneously. Being M a set of metrics, T a set of test cases, and s_t a system output for the test case t, the Unanimous Improvement Ratio UIR_M(s, s′) between two systems is defined as:

UIR_M(s, s′) = ( |{t ∈ T : s_t ⪰_M s′_t}| − |{t ∈ T : s′_t ⪰_M s_t}| ) / |T|

where s_t ⪰_M s′_t represents that system s improves system s′ on the test case t unanimously for every metric: s_t ⪰_M s′_t ≡ ∀m ∈ M. m(s_t) ≥ m(s′_t).
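The UIR definition above can be sketched in a few lines (names and example scores are mine; note that with the ⪰ definition, ties on every metric count in both directions and cancel out):

```python
def dominates(a, b):
    # a >=_M b : a scores at least as well as b on every metric in M
    return all(a[m] >= b[m] for m in a)

def uir(runs_a, runs_b):
    # runs_*[t] maps metric name -> score of that system on test case t
    wins   = sum(dominates(a, b) for a, b in zip(runs_a, runs_b))
    losses = sum(dominates(b, a) for a, b in zip(runs_a, runs_b))
    return (wins - losses) / len(runs_a)

# Two test cases: system A is unanimously better on the first,
# while the second has no unanimous winner (mixed metric outcomes)
a = [{"acc": 0.9, "tau": 0.6, "mi": 0.4}, {"acc": 0.7, "tau": 0.2, "mi": 0.3}]
b = [{"acc": 0.8, "tau": 0.5, "mi": 0.3}, {"acc": 0.6, "tau": 0.4, "mi": 0.3}]
print(uir(a, b))  # (1 - 0) / 2 = 0.5
```

A UIR near 1 means the improvement is robust to the choice of metric; near 0 it means the metrics disagree about which system is better.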