20200906_ACL2020_metric_for_ordinal_classification_YoheiKikuta

yoppe
September 06, 2020


Transcript

  1. 2020/09/06 @yohei_kikuta — An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results. Enrique Amigó (UNED, Madrid, Spain, enrique@lsi.uned.es), Julio Gonzalo (UNED, Madrid, Spain, julio@lsi.uned.es), Stefano Mizzaro (University of Udine, Udine, Italy, mizzaro@uniud.it), Jorge Carrillo-de-Albornoz (UNED, Madrid, Spain, jcalbornoz@lsi.uned.es).

     Abstract (excerpt): "In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information […] classification measures ignore the ordering between classes, ranking metrics ignore category matching, and value prediction metrics are used by assuming (usually equal) numeric intervals between categories. In this paper we propose a metric designed to evaluate Ordinal Classification systems […]"
  2. I'm Yohei Kikuta of Ubie, Inc.
     • Accounts
     • https://github.com/yoheikikuta
     • https://twitter.com/yohei_kikuta
     • https://yoheikikuta.github.io/
     • WE ARE HIRING!!!
     • https://herp.careers/v1/ubie
  3. Paper introduced: An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results
     • https://www.aclweb.org/anthology/2020.acl-main.363/
     • My notes: https://github.com/yoheikikuta/paper-reading/issues/54
     • Proposes an evaluation metric for ordinal classification tasks, derived from the properties such a metric should satisfy
     • The properties to consider are Ordinal Invariance, Ordinal Monotonicity, and Class Imbalance
     • Defines a metric, based on the class ordering and the class distribution, that satisfies all three
     • Experiments show it is more useful than the metrics conventionally used
  4. Which is "worse"? Consider peer review, where each paper is assigned one of {reject, weak reject, undecided, weak accept, accept}. When the paper's true assessment is accept, which mistake is worse?
     • Predicting weak reject
     • Predicting weak accept
  5. Which is "worse"? (same setup) When the true assessment is accept:
     • Predicting weak reject — worse (a category three steps down)
     • Predicting weak accept — not as bad (only one step down)
  6. (Part 2) Which is "worse"? When the evaluation confuses weak reject ⇔ weak accept, in which of the following cases is the mistake worse?

     [Figure 1: two distributions of papers over {reject, weak reject, undecided, weak accept, accept}. Left: 7 / 105 / 193 / 90 / 7 papers. Right: 180 / 10 / 3 / 10 / 173 papers.] "In the left distribution, weak accept vs. weak reject would be a strong disagreement between reviewers (i.e., the classes are distant), because in practice these are almost the extreme cases of the scale (reviewers rarely go for accept or reject). In the right distribution the situation is the opposite: reviewers tend to take a clear stance, which makes weak accept and weak reject closer assessments than in the left case." Figure from https://www.aclweb.org/anthology/2020.acl-main.363/
  7. (Part 2) Which is "worse"? When the evaluation confuses weak reject ⇔ weak accept (same Figure 1 as the previous slide):
     • Left distribution — worse (the assessment changes drastically)
     • Right distribution — not as bad (little difference either way)
  8. Ordinal classification is not ___
     • Not n-ary classification: when the truth is AC, predicting weak REJ is less appropriate than predicting weak AC, but plain classification treats both errors the same
     • Not ranking prediction: (AC, weak AC, undecided) ≠ (undecided, weak REJ, REJ)
     • Not value prediction: the gap between AC and weak AC and the gap between weak AC and weak REJ are not comparable
     • Not linear correlation: outputs can correlate highly while the predicted values still disagree
  9. Paper overview
     • Problem: ordinal classification appears all over NLP but is not evaluated properly
     • Goal: build an evaluation metric that respects the ordinal scale
     • Contents: define the properties an evaluation metric should have, propose a concrete construction, and show experimentally that it captures well the properties ordinal classification predictions should satisfy
     • The authors are the people behind http://evall.uned.es/
     • Metric computation code is provided in Java: https://github.com/EvALLTEAM/EvALLToolkit
  10. Properties the metric Eff should satisfy (s: system output, g: ground truth, d ∈ D: data)
     • Ordinal Invariance: Eff(s, g) = Eff(f(s), f(g)), where f is a strictly increasing function
     • Ordinal Monotonicity: Eff(s′, g) > Eff(s, g) if ∃d. (s(d) ≠ s′(d)) ∧ ∀d. ((s(d) > s′(d) ≥ g(d)) ∨ (s(d) = s′(d)))
       (s′ is at least as close to g as s everywhere, and strictly closer somewhere)
     • Class Imbalance: Eff(g_{d1→c2}, g) > Eff(g_{d3→c2}, g) where n_{c1} > n_{c3}
       (g_{d→c} is g with item d reassigned to class c; d1 belongs to the frequent class c1 and d3 to the infrequent class c3 — an error on an item from a frequent class should cost less)

     [Figure 2: "Illustration of desirable formal properties for Ordinal Classification. Each bin is a system output, where columns represent ordered classes assigned by the system, and colors represent the items' true classes, ordered from black to white. '=' means that both outputs should have the same quality, and '>' that the left output should receive a higher metric value than the right output."] Figure from https://www.aclweb.org/anthology/2020.acl-main.363/
  11. Closeness Information Quantity (CIQ)
     We want to define "closeness" from the data distribution. Following an information-theoretic idea: items a and b are "close" when the probability of observing an item x between them is small.
     • x ⪯_b a: comparing x with a, x is the one at least as close to b (for the ordinal scale, x lies between a and b)
     • CIQ_ORD(a, b) := − log P(x ⪯_b a)
     • For an item d: CIQ_ORD(s(d), g(d)) = − log P(x ⪯_{g(d)} s(d))
  12. Closeness Evaluation Measure (CEM)
     Normalize CIQ by its maximum value and sum it over all the data:

     CEM_ORD(s, g) = Σ_{d∈D} CIQ_ORD(s(d), g(d)) / Σ_{d∈D} CIQ_ORD(g(d), g(d))

     • The maximum is 1, attained when s(d) = g(d) for all d
     • The minimum depends on the distribution
     • A "far" mistake lowers CEM more than a "close" mistake
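The normalization structure can be sketched generically: sum the closeness of system labels to gold labels and divide by the gold-vs-gold total, so a perfect system scores exactly 1. A minimal sketch — `toy_ciq` is a made-up stand-in for the real distribution-based CIQ_ORD, used only to show the structure:

```python
def cem(system, gold, ciq):
    """CEM skeleton: total closeness of system vs. gold labels,
    normalized by the gold-vs-gold total (perfect system -> 1)."""
    numerator = sum(ciq(s, g) for s, g in zip(system, gold))
    denominator = sum(ciq(g, g) for g in gold)
    return numerator / denominator

def toy_ciq(a, b):
    # Stand-in closeness: maximal at an exact match, decaying with distance.
    return 1.0 / (1 + abs(a - b))

gold = [0, 1, 2, 3, 4]
print(cem(gold, gold, toy_ciq))             # 1.0 (perfect system)
print(cem([0, 1, 2, 3, 3], gold, toy_ciq))  # 0.9 ("close" mistake)
print(cem([0, 1, 2, 3, 0], gold, toy_ciq))  # 0.84 ("far" mistake hurts more)
```

The denominator is what pins the maximum at 1: each term of the numerator is bounded by the corresponding exact-match term.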
  13. Concrete form for ordinal classification
     Here x ⪯_b a means a ≤ x ≤ b or b ≤ x ≤ a. With n_i the number of items such that g(d) = c_i and N the total number of items:

     CIQ_ORD(c_i, c_j) = − log( (n_i/2 + Σ_{k=i+1}^{j} n_k) / N )

     • When c_i = c_j this reduces to CIQ_ORD(c_i, c_i) = − log(n_i / 2N)
     • The 1/2 factor on n_i is empirical (per the authors)
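The closed form is easy to transcribe. A sketch — the function name is mine, and I use log base 2, which is what reproduces the worked example on the next slide:

```python
import math

def ciq_ord(counts, i, j):
    """CIQ_ORD(c_i, c_j) for an ordered class distribution.
    counts[k] = n_k, the number of gold items in class k."""
    n_total = sum(counts)
    lo, hi = min(i, j), max(i, j)
    # n_i/2 plus the full counts of every class from i (exclusive) through j
    mass = sum(counts[lo:hi + 1]) - counts[i] / 2
    return -math.log2(mass / n_total)

counts = [7, 105, 193, 90, 7]  # reject ... accept (Figure 1, left)
# Exact match reduces to -log2(n_i / 2N), the per-item maximum:
print(math.isclose(ciq_ord(counts, 2, 2), -math.log2(193 / 804)))  # True
# CIQ shrinks as the predicted class moves away from the gold class:
print(ciq_ord(counts, 1, 1) > ciq_ord(counts, 2, 1) > ciq_ord(counts, 3, 1))  # True
```

Note the semantics: a *higher* CIQ means the two classes are *closer* (less probability mass lies between them).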
  14. Worked example
     Using the two distributions of Figure 1 (left: 7 / 105 / 193 / 90 / 7, N = 402; right: 180 / 10 / 3 / 10 / 173, N = 376; logs base 2):

     • Left: CIQ_ORD(weak accept, weak reject) = −log((90/2 + 193 + 105) / 402) ≈ 0.23
     • Right: CIQ_ORD(weak accept, weak reject) = −log((10/2 + 3 + 10) / 376) ≈ 4.38

     Figure from https://www.aclweb.org/anthology/2020.acl-main.363/
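Both numbers check out against the Figure 1 counts with base-2 logarithms:

```python
import math

# Left distribution: N = 7 + 105 + 193 + 90 + 7 = 402
left = -math.log2((90 / 2 + 193 + 105) / 402)
# Right distribution: N = 180 + 10 + 3 + 10 + 173 = 376
right = -math.log2((10 / 2 + 3 + 10) / 376)
print(round(left, 2), round(right, 2))  # 0.23 4.38
```

So on the right-hand distribution, where weak accept and weak reject carry little probability mass, the two classes are far more informationally close (larger CIQ) than on the left.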
  15. Generalization to other scales
     Closeness is defined per scale via x ⪯_b a ⟺ ∃f ∈ ℱ_T : |f(x) − f(b)| ≤ |f(a) − f(b)|:
     • Nominal scale — ℱ_NOM: one-to-one functions → (b = x) ∨ (b ≠ a)
     • Ordinal scale — ℱ_ORD: strictly increasing functions → (a ≤ x ≤ b) ∨ (b ≤ x ≤ a)
     • Interval scale — ℱ_INT: linear functions → |b − x| ≤ |b − a|
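The three resulting conditions are simple predicates over class values. A sketch (function names are mine):

```python
def close_nom(x, a, b):
    # Nominal: x counts as "at least as close to b as a" unless a == b and x != b.
    return (b == x) or (b != a)

def close_ord(x, a, b):
    # Ordinal: x lies between a and b (inclusive); distances are meaningless.
    return (a <= x <= b) or (b <= x <= a)

def close_int(x, a, b):
    # Interval: x is no farther from b than a is; distances are meaningful.
    return abs(b - x) <= abs(b - a)

# The scales genuinely differ: with a = 4, b = 0, x = -1,
# x is nearer to b in absolute distance but does not lie between a and b.
print(close_ord(-1, 4, 0), close_int(-1, 4, 0))  # False True
```

This is exactly why the ordinal version is invariant under strictly increasing relabelings while the interval version is not.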
  16. The proposed metric satisfies the required properties

     Table 2: Constraint-based Metric Analysis (✓ = satisfied; constraints: Ordinal Invariance / Ordinal Monotonicity / Class Imbalance). Table from https://www.aclweb.org/anthology/2020.acl-main.363/

     Metric family   | Metric                              | Inv. | Mon. | Imb.
     ----------------|-------------------------------------|------|------|-----
     Classification  | Acc                                 |  ✓   |  -   |  -
     metrics         | Acc with n                          |  ✓   |  -   |  -
                     | Macro Avg Acc, Cohen's κ            |  ✓   |  -   |  ✓
                     | F-measure avg. across classes       |  ✓   |  -   |  ✓
     Value           | MAE, MSE                            |  -   |  ✓   |  -
     prediction      | Macro Avg. MAE/MSE                  |  -   |  ✓   |  ✓
                     | Weighted κ                          |  -   |  ✓   |  ✓
                     | Rennie & Srebro loss function       |  -   |  ✓   |  -
                     | Cosine similarity                   |  -   |  ✓   |  -
     Correlation     | Linear correlation                  |  -   |  -   |  -
     coefficients    | Ordinal: Kendall (Tau-b), Spearman  |  ✓   |  -   |  ✓
                     | Kendall (Tau-a)                     |  ✓   |  -   |  -
                     | Reliability and Sensitivity         |  ✓   |  -   |  ✓
     Clustering      | MI, Purity and Inverse Purity       |  ✓   |  -   |  ✓
     Path based      | Ordinal Classification Index        |  ✓   |  -   |  -
     CEM             | CEM_NOM                             |  ✓   |  -   |  ✓
                     | CEM_INT                             |  -   |  ✓   |  ✓
                     | CEM_ORD                             |  ✓   |  ✓   |  ✓

     CEM_ORD satisfies all of Ordinal Invariance, Ordinal Monotonicity, and Class Imbalance — the only metric in the table to do so.
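Ordinal Invariance, in particular, can be checked numerically for CEM_ORD: since CIQ_ORD only looks at the gold counts lying between two class indices, any strictly increasing relabeling of the classes leaves the score untouched. A sketch, using a hypothetical map f that spreads five classes over ten slots (the inserted classes are simply empty):

```python
import math

def ciq_ord(counts, i, j):
    lo, hi = min(i, j), max(i, j)
    mass = sum(counts[lo:hi + 1]) - counts[i] / 2
    return -math.log2(mass / sum(counts))

def cem_ord(system, gold, counts):
    num = sum(ciq_ord(counts, s, g) for s, g in zip(system, gold))
    den = sum(ciq_ord(counts, g, g) for g in gold)
    return num / den

gold   = [2, 3, 1, 4, 0, 2, 3]
system = [2, 2, 1, 3, 1, 2, 4]
counts = [1, 1, 2, 2, 1]            # gold items per class, derived from `gold`

f = {0: 0, 1: 2, 2: 5, 3: 6, 4: 9}  # strictly increasing relabeling
counts_f = [0] * 10
for k, c in f.items():
    counts_f[c] = counts[k]

before = cem_ord(system, gold, counts)
after = cem_ord([f[s] for s in system], [f[g] for g in gold], counts_f)
print(abs(before - after) < 1e-12)  # True: Ordinal Invariance holds
```

The empty classes contribute zero mass to every between-class sum, so each CIQ term is identical before and after relabeling.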
  17. Experimental design
     Assume an ordinal classification metric should capture the following kinds of information:
     • Accuracy (acc): class matching
     • Kendall Tau-a (Kendall): class ordering
     • Mutual Information (MI): class imbalance
     When comparing the outputs of systems s and s′, measure whether the proposed metric also improves whenever all three metrics improve.
  18. Meta-evaluation metric
     Define the coverage of a metric m as follows (ℳ = {acc, Kendall, MI}), computed over multiple system outputs for a given dataset:

     Cov_ℳ(m) = Spea( m(s) − m(s′), UIR_ℳ(s, s′) )

     • Spea: Spearman rank correlation (only the ordering of the scores matters)
     • m(s) − m(s′): score difference under the metric being studied
     • UIR_ℳ(s, s′): Unanimous Improvement Ratio — the fraction of cases where acc, Kendall, and MI all improve

     From the paper: the three reference metrics are (i) Accuracy, a partial metric capturing class matching; (ii) Kendall's Tau-a (without counting ties), capturing class ordering; and (iii) Mutual Information (MI), a clustering metric reflecting how much knowing the system output reduces uncertainty about the gold standard (accentuating the effect of small classes). "While robustness focuses on consistence across data sets, UIR focuses on consistence across metrics. It essentially counts in how many test cases an improvement is observed for all metrics simultaneously." Being M a set of metrics, T a set of test cases, and s_t a system output for test case t, UIR_M(s, s′) = ( |{t ∈ T : s_t ⪰_M s′_t}| − |{t ∈ T : s′_t ⪰_M s_t}| ) / |T|, where s_t ⪰_M s′_t means that s improves s′ on test case t unanimously, i.e. for every m ∈ M.
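UIR itself is short to write down. A sketch of the definition excerpted above — the per-test-case scores are toy values, and I read the unanimous-improvement condition as ≥ on every metric:

```python
def uir(scores_s, scores_sp):
    """Unanimous Improvement Ratio over test cases.
    scores_s[t] and scores_sp[t] are dicts {metric_name: value}
    for systems s and s' on test case t."""
    wins = losses = 0
    for a, b in zip(scores_s, scores_sp):
        if all(a[m] >= b[m] for m in a):   # s unanimously >= s'
            wins += 1
        if all(b[m] >= a[m] for m in a):   # s' unanimously >= s
            losses += 1
    return (wins - losses) / len(scores_s)

# Three test cases scored by acc / Kendall / MI (toy values):
s  = [{"acc": .8, "kendall": .7, "mi": .5},
      {"acc": .6, "kendall": .6, "mi": .4},
      {"acc": .9, "kendall": .5, "mi": .6}]
sp = [{"acc": .7, "kendall": .6, "mi": .4},   # s unanimously better
      {"acc": .7, "kendall": .5, "mi": .5},   # mixed: counts for neither side
      {"acc": .9, "kendall": .6, "mi": .7}]   # s' unanimously better
print(uir(s, sp))  # (1 - 1) / 3 = 0.0
```

The mixed case is the whole point of UIR: a pair that trades off one metric against another contributes nothing, so only all-round improvements move the ratio.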
  19. Experimental data
     • Synthetic data
       • 20,000 items labeled with 11 classes, drawn with mean 4 and variance 1–3
       • System outputs that err with probability {0.1, 0.2, …, 1.0}, with five error patterns each (mean, random, category displacement, ordinal random, ordinal displacement), 500 outputs in total
     • Real data
       • Competition data with ground truth and multiple system outputs available
       • RepLab2013 (polarity analysis), SemEval2014 / SemEval2015 (sentiment analysis)
     RepLab2013 Dataset: http://nlp.uned.es/replab2013/
     SemEval2014, 2015: http://alt.qcri.org/semeval2014/ , http://alt.qcri.org/semeval2015/
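One way to read the synthetic setup — an assumption-laden sketch: I take "mean 4, variance 1–3, 11 classes" to mean Gaussian-drawn labels rounded and clipped to 11 ordered classes, and show only the uniform-random error pattern out of the five:

```python
import random

random.seed(0)
N_CLASSES, N_ITEMS = 11, 20_000

def draw_gold(mean=4.0, std=1.0):
    # Gold class: Gaussian sample rounded and clipped to [0, 10].
    return min(max(round(random.gauss(mean, std)), 0), N_CLASSES - 1)

gold = [draw_gold() for _ in range(N_ITEMS)]

def random_error_system(gold, p):
    # One reading of the "random" corruption pattern: with probability p,
    # replace the gold label with a uniformly random class.
    return [random.randrange(N_CLASSES) if random.random() < p else g
            for g in gold]

system = random_error_system(gold, p=0.3)
agreement = sum(s == g for s, g in zip(system, gold)) / N_ITEMS
print(round(agreement, 2))  # close to 1 - p + p/11, i.e. about 0.73
```

The other four corruption patterns differ only in how the erroneous label is chosen (e.g. shifting to an adjacent category instead of drawing uniformly).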
  20. The proposed metric has the best coverage

     Table 3: Metric Coverage — Spearman correlation between single metrics and the UIR combination of Mutual Information, Accuracy, and Kendall across system pairs, on both the synthetic and the real data sets. Table from https://www.aclweb.org/anthology/2020.acl-main.363/

     Synthetic columns: all systems / minus sRand / minus sprox / minus smaj / minus stDisp / minus soDisp. Real-data columns: Replab2013 / SEM-2014 T9-A / T9-B / SEM-2015 T10-A / T10-B / T10-C.

     Reference metrics in UIR:
       Accuracy      0.81 0.77 0.78 0.78 0.94 0.77 | 0.75 0.90 0.98 0.85 0.94 0.80
       Kendall       0.84 0.81 0.82 0.82 0.93 0.82 | 0.88 0.94 0.98 0.84 0.97 0.88
       MI            0.84 0.82 0.84 0.82 0.93 0.82 | 0.91 0.97 0.99 0.93 0.98 0.93
     Classification metrics:
       F-measure     0.83 0.80 0.82 0.81 0.93 0.81 | 0.66 0.90 0.98 0.91 0.98 0.92
       MAAC          0.83 0.81 0.82 0.79 0.91 0.81 | 0.84 0.86 0.97 0.84 0.95 0.82
       Kappa         0.81 0.78 0.79 0.77 0.94 0.77 | 0.44 0.95 0.99 0.93 0.98 0.97
       Acc with 1    0.79 0.75 0.77 0.80 0.85 0.79 | 0.23 0.82 0.60 0.31 0.35 -0.19
     Error minimization:
       MAE           0.84 0.82 0.83 0.87 0.86 0.84 | 0.81 0.96 0.95 0.95 0.87 0.56
       MAEm          0.74 0.73 0.74 0.80 0.76 0.73 | 0.73 0.95 0.88 0.91 0.74 0.30
       MSE           0.89 0.87 0.87 0.88 0.93 0.88 | 0.28 0.87 0.98 0.63 0.97 0.93
       MSEm          0.83 0.80 0.80 0.82 0.90 0.83 | 0.10 0.85 0.94 0.48 0.91 0.52
     Correlation coefficients:
       Pearson       0.77 0.79 0.74 0.73 0.83 0.79 | 0.91 0.97 0.98 0.96 0.97 0.79
       Spearman      0.72 0.67 0.69 0.77 0.76 0.70 | 0.07 0.96 0.98 0.97 0.98 0.80
     Measurement theory:
       CEM_ORD       0.91 0.89 0.90 0.90 0.95 0.89 | 0.94 0.96 0.99 0.98 0.99 0.96
       CEM_ORD flat  0.87 0.84 0.86 0.88 0.89 0.87 | 0.82 0.96 0.96 0.94 0.92 0.65

     • The proposed metric correlates highly with the cases where acc, Kendall, and MI all improve (it captures all the desired characteristics)
     • Without the log scaling (CEM_ORD flat) the results are worse — the log scaling matters
  21. Summary
     • Ordinal classification lacked an appropriate evaluation metric
     • The paper enumerates the properties a metric should satisfy given the ordinal scale
     • It proposes the Closeness Evaluation Measure (CEM), which satisfies those properties
     • Experiments show the proposed metric is superior: it correlates highly with the cases where acc, Kendall, and MI all improve
     • Extending it for use in model training would be an interesting direction: it needs a differentiable formulation and some assumption about the data distribution