20200906_ACL2020_metric_for_ordinal_classification_YoheiKikuta

yoppe
September 06, 2020

Transcript

  1. 2020/09/06
     @yohei_kikuta
     An Effectiveness Metric for Ordinal Classification:
     Formal Properties and Experimental Results
     Enrique Amigó (UNED, Madrid, Spain) [email protected]
     Julio Gonzalo (UNED, Madrid, Spain) [email protected]
     Stefano Mizzaro (University of Udine, Udine, Italy) [email protected]
     Jorge Carrillo-de-Albornoz (UNED, Madrid, Spain) [email protected]
     Abstract (excerpt): In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (…) classification measures ignore the ordering between classes, ranking metrics ignore category matching, and value prediction metrics are used by assuming (usually equal) numeric intervals between categories. In this paper we propose a metric designed to evaluate Ordinal Classification systems which (…)


  2. I'm Yohei Kikuta from Ubie, Inc.
     • Accounts
     • https://github.com/yoheikikuta
     • https://twitter.com/yohei_kikuta
     • https://yoheikikuta.github.io/
     • WE ARE HIRING!!!
     • https://herp.careers/v1/ubie


  3. Paper introduced
     An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results
     • https://www.aclweb.org/anthology/2020.acl-main.363/
     • Notes: https://github.com/yoheikikuta/paper-reading/issues/54
     • Proposes an evaluation metric for ordinal classification tasks, grounded in the properties such a metric should satisfy
     • The properties to consider are Ordinal Invariance, Ordinal Monotonicity, and Class Imbalance
     • Defines a metric, based on the class ordering and the class distribution, that satisfies all three
     • Experiments show it is more useful than the metrics conventionally used


  4. Which is "worse"?
     Consider assigning one of {reject, weak reject, undecided, weak accept, accept} in paper reviewing.
     When the paper's true assessment is accept, which of the following is worse?
     • Predicting weak reject
     • Predicting weak accept


  5. Which is "worse"?
     Consider assigning one of {reject, weak reject, undecided, weak accept, accept} in paper reviewing.
     When the paper's true assessment is accept, which of the following is worse?
     • Predicting weak reject → worse (a class three steps below the truth)
     • Predicting weak accept → less bad (a class only one step below the truth)


  6. (Part 2) Which is "worse"?
     When the evaluation confuses weak reject ↔ weak accept, under which of the following two distributions is the error worse?
     [Two histograms of # papers over {reject, weak reject, undecided, weak accept, accept}: left distribution (7, 105, 193, 90, 7), right distribution (180, 10, 3, 10, 173)]
     Figure 1: In the left distribution, weak accept vs. weak reject would be a strong disagreement between reviewers (i.e., the classes are distant), because in practice these are almost the extreme cases of the scale (reviewers rarely go for accept or reject). In the right distribution the situation is the opposite: reviewers tend to take a clear stance, which makes weak accept and weak reject closer assessments than in the left case.
     Figure taken from https://www.aclweb.org/anthology/2020.acl-main.363/


  7. (Part 2) Which is "worse"?
     When the evaluation confuses weak reject ↔ weak accept, under which of the following two distributions is the error worse?
     [The same two histograms: left distribution (7, 105, 193, 90, 7), right distribution (180, 10, 3, 10, 173); Figure 1 caption as on the previous slide]
     Figure taken from https://www.aclweb.org/anthology/2020.acl-main.363/
     • Left: worse (the assessment changes drastically)
     • Right: less bad (not much difference either way)


  8. Ordinal classification is not ___
     • Not n-ary classification
       When accept is the truth, predicting weak REJ is worse than predicting weak AC
     • Not ranking prediction
       (AC, weak AC, undecided) ≠ (undecided, weak REJ, REJ)
     • Not value prediction
       The gap between AC and weak AC is not comparable to the gap between weak AC and weak REJ
     • Not linear correlation
       Outputs can correlate highly with the truth and still not match its values


  9. Paper overview
     • Problem:
       Ordinal classification appears frequently in NLP but is not evaluated properly
     • Goal:
       Build an evaluation metric that respects the ordinal scale
     • Content:
       Defines the properties an evaluation metric should have and proposes a concrete construction;
       experiments show the construction captures well the properties ordinal classification predictions should satisfy
     The authors are the people behind http://evall.uned.es/
     Code for computing the metric is provided in Java: https://github.com/EvALLTEAM/EvALLToolkit


  10. Properties an ordinal classification metric Eff should satisfy
      s: system output, g: ground truth, d ∈ D: data
      • Ordinal Invariance
        Eff(s, g) = Eff(f(s), f(g)) where f is any strictly increasing function
      • Ordinal Monotonicity
        Eff(s′, g) > Eff(s, g) if ∃d. (s(d) ≠ s′(d)) ∧ ∀d. ((s(d) > s′(d) ≥ g(d)) ∨ (s(d) = s′(d)))
      • Class Imbalance
        Eff(g_{d1→c2}, g) > Eff(g_{d3→c2}, g) where n_{c1} > n_{c3}
        (d1 belongs to the frequent class c1 and d3 to the infrequent class c3; both are misassigned to c2, so an error in a frequent class should hurt less)
      Figure 2: Illustration of desirable formal properties for Ordinal Classification. Each bin is a system output, where columns represent ordered classes assigned by the system, and colors represent the items' true classes, ordered from black to white. "=" means that both outputs should have the same quality, and ">" that the left output should receive a higher metric value than the right output.
      Figure taken from https://www.aclweb.org/anthology/2020.acl-main.363/

  11. Closeness Information Quantity (CIQ)
      We want to define "closeness" based on the data distribution.
      (Following an information-theoretic idea) a and b are defined to be "close" when the probability of observing an item x between a and b is small.
      P(x ⪯_b^ORD a): the probability that, comparing x with a, x is at least as close to b
      CIQ_ORD(a, b) := −log P(x ⪯_b^ORD a)
      CIQ_ORD(s(d), g(d)) = −log P(x ⪯_{g(d)}^ORD s(d))
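      As a sketch of this definition (my reading; the function name is mine), P(x ⪯_b^ORD a) can be estimated empirically by counting how many gold items fall between the predicted class a and the gold class b, inclusive:

          import math

          def ciq_empirical(a, b, gold_labels):
              """-log of the empirical probability that a random gold item x lies
              between classes a and b inclusive (log base 2 is an assumption here).
              The paper's concrete formula (two slides later) additionally halves
              the count of a's own class; this plain estimate omits that tweak."""
              lo, hi = min(a, b), max(a, b)
              p = sum(lo <= x <= hi for x in gold_labels) / len(gold_labels)
              return -math.log2(p)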


  12. Closeness Evaluation Measure (CEM)
      Normalize CIQ by its maximum value and sum over all data:
      CEM_ORD(s, g) = Σ_{d∈D} CIQ_ORD(s(d), g(d)) / Σ_{d∈D} CIQ_ORD(g(d), g(d))
      • The maximum is 1, attained when ∀d. s(d) = g(d)
      • The minimum depends on the distribution
      • CEM is lower for "far" mistakes than for "close" mistakes


  13. Concrete expression for ordinal classification
      x ⪯_b^ORD a means a ≤ x ≤ b or b ≤ x ≤ a
      n_i: number of items with g(d) = c_i, N: total number of items
      CIQ_ORD(c_i, c_j) = −log((n_i/2 + Σ_{k=i+1}^{j} n_k) / N)
      • When c_i = c_j, this reduces to CIQ_ORD(c_i, c_i) = −log(n_i / 2N)
      • The 1/2 factor on n_i is empirical (according to the authors)
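      A minimal sketch of this concrete formula, plus the CEM normalization from the previous slide (log base 2 is an assumption; it reproduces the worked example on the next slide):

          import math

          def ciq_ord(i, j, counts, base=2):
              """CIQ_ORD(c_i, c_j) = -log((n_i/2 + sum_{k=i+1..j} n_k) / N).
              counts[k] is n_k, the number of gold items in class c_k, with
              classes indexed in their ordinal order; the direction i -> j may
              go either way. For i == j this reduces to -log(n_i / 2N)."""
              N = sum(counts)
              step = 1 if j >= i else -1
              mass = counts[i] / 2 + sum(counts[k] for k in range(i + step, j + step, step))
              return -math.log(mass / N, base)

          def cem_ord(system, gold, counts, base=2):
              """CEM_ORD(s, g) = sum_d CIQ(s(d), g(d)) / sum_d CIQ(g(d), g(d));
              equals 1 exactly when s(d) = g(d) for all d."""
              num = sum(ciq_ord(s, g, counts, base) for s, g in zip(system, gold))
              den = sum(ciq_ord(g, g, counts, base) for g in gold)
              return num / den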


  14. Calculation example
      [The two histograms of Figure 1 again: left distribution (7, 105, 193, 90, 7), right distribution (180, 10, 3, 10, 173)]
      Figure taken from https://www.aclweb.org/anthology/2020.acl-main.363/
      Left distribution (N = 402):
      CIQ_ORD(weak accept, weak reject) = −log((90/2 + 193 + 105) / 402) ≃ 0.23
      Right distribution (N = 376):
      CIQ_ORD(weak accept, weak reject) = −log((10/2 + 3 + 10) / 376) ≃ 4.38
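      Both values can be checked with the ciq_ord sketch above (class indices: reject = 0, …, accept = 4, so weak accept = 3 and weak reject = 1; log base 2 assumed):

          left  = [7, 105, 193, 90, 7]     # left distribution, N = 402
          right = [180, 10, 3, 10, 173]    # right distribution, N = 376

          print(round(ciq_ord(3, 1, left), 2))   # -log2((90/2 + 193 + 105)/402) -> 0.23
          print(round(ciq_ord(3, 1, right), 2))  # -log2((10/2 + 3 + 10)/376)    -> 4.38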


  15. Generalization to other scales
      Closeness is defined per measurement scale via
      x ⪯_b^T a ⟺ ∃f ∈ F_T (|f(x) − f(b)| ≤ |f(a) − f(b)|)
      • Nominal scale
        F_NOM: bijective functions → (b = x) ∨ (b ≠ a)
      • Ordinal scale
        F_ORD: strictly increasing functions → (a ≤ x ≤ b) ∨ (b ≤ x ≤ a)
      • Interval scale
        F_INT: linear functions → |b − x| ≤ |b − a|
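      The three induced closeness relations can be written down directly (a sketch with classes encoded as numbers; the function names are mine):

          def closer_nom(x, a, b):
              """Nominal scale (bijections): x is at least as close to b as a is."""
              return x == b or b != a

          def closer_ord(x, a, b):
              """Ordinal scale (strictly increasing functions)."""
              return a <= x <= b or b <= x <= a

          def closer_int(x, a, b):
              """Interval scale (linear functions)."""
              return abs(b - x) <= abs(b - a)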


  16. The proposed metric satisfies the desired properties
      Table 2: Constraint-based Metric Analysis (✓ = constraint satisfied)

      Metric family    Metrics                             Ord.Inv.  Ord.Mon.  Imb.
      Classification   Acc                                    ✓         -       -
      metrics          Acc with n                             ✓         -       -
                       Macro Avg Acc, Cohen's kappa           ✓         -       ✓
                       F-measure avg. across classes          ✓         -       ✓
      Value            MAE, MSE                               -         ✓       -
      prediction       Macro Avg. MAE/MSE                     -         ✓       ✓
                       Weighted kappa                         -         ✓       ✓
                       Rennie & Srebro loss function          -         ✓       -
                       Cosine similarity                      -         ✓       -
      Correlation      Linear correlation                     -         -       -
      coefficients     Ordinal: Kendall (tau-b), Spearman     ✓         -       ✓
                       Kendall (Tau-a)                        ✓         -       -
                       Reliability and Sensitivity            ✓         -       ✓
      Clustering       MI, Purity and Inv. Purity             ✓         -       ✓
      Path based       Ordinal Classification Index           ✓         -       -
      CEM              CEM_NOM                                ✓         -       ✓
                       CEM_INT                                -         ✓       ✓
                       CEM_ORD                                ✓         ✓       ✓

      Table taken from https://www.aclweb.org/anthology/2020.acl-main.363/
      CEM_ORD satisfies all three desired properties:
      • Ordinal Invariance
      • Ordinal Monotonicity
      • Class Imbalance


  17. Experimental design
      Assume that ordinal classification metrics should capture the following kinds of information:
      • Accuracy (acc): class-matching information
      • Kendall Tau-a (Kendall): ordering information
      • Mutual Information (MI): class-imbalance information
      When comparing the outputs of systems s and s′, measure whether the proposed metric also improves whenever all three metrics improve


  18. Meta-evaluation metric
      Define coverage as follows (m: the metric under study; M = {acc, Kendall, MI}), computed over multiple system outputs for a given data set (a code sketch follows this slide):

      Cov_M(m) = Spea(m(s) − m(s′), UIR_M(s, s′))

      • Spea: Spearman rank correlation coefficient (looks only at the ordering of scores)
      • m(s) − m(s′): score difference of the metric m under study
      • UIR_M(s, s′): Unanimous Improvement Ratio, the net fraction of test cases on which acc, Kendall, and MI all improve

      Excerpt from the paper: "The selected metrics are: (i) Accuracy, as a partial metric which captures class matching; (ii) Kendall's correlation coefficient Tau-a (without counting ties), in order to capture class ordering; and (iii) Mutual Information (MI), a clustering metric which reflects how much knowing the system output reduces uncertainty about the gold standard values. This metric accentuates the effect of small classes (imbalance property). [...] While robustness focuses on consistency across data sets, UIR focuses on consistency across metrics. It essentially counts in how many test cases an improvement is observed for all metrics simultaneously. Being M a set of metrics, T a set of test cases, and s_t a system output for the test case t, the Unanimous Improvement Ratio UIR_M(s, s′) between two systems is defined as

      UIR_M(s, s′) = (|{t ∈ T : s_t ⪰_M s′_t}| − |{t ∈ T : s′_t ⪰_M s_t}|) / |T|,

      where s_t ⪰_M s′_t represents that system s improves system s′ on the test case t unanimously for every metric: s_t ⪰_M s′_t ≡ ∀m ∈ M. m(s_t) ≥ m(s′_t)."
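      A sketch of UIR and coverage under the definitions above (the data layout is my own: per test case, each system's scores are a dict {metric: value}; the all-≥ convention for ⪰_M follows the excerpt):

          from scipy.stats import spearmanr

          def unanimous(a, b, metrics):
              """s_t >=_M s'_t: every metric in M scores a at least as well as b."""
              return all(a[m] >= b[m] for m in metrics)

          def uir(scores_s, scores_s2, metrics=("acc", "kendall", "mi")):
              """UIR_M(s, s') = (|{t: s_t >=_M s'_t}| - |{t: s'_t >=_M s_t}|) / |T|."""
              wins = sum(unanimous(a, b, metrics) for a, b in zip(scores_s, scores_s2))
              losses = sum(unanimous(b, a, metrics) for a, b in zip(scores_s, scores_s2))
              return (wins - losses) / len(scores_s)

          def coverage(metric_diffs, uir_values):
              """Cov_M(m): Spearman correlation between m(s) - m(s') and
              UIR_M(s, s') across system pairs."""
              return spearmanr(metric_diffs, uir_values).correlation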


  19. Experimental data
      • Synthetic data
        • 20,000 items assigned to 11 classes, drawn with mean 4 and variance between 1 and 3
        • System outputs that err with probability {0.1, 0.2, …, 1.0}, generated for each of the following 5 error patterns, 500 outputs in total (a generation sketch follows this slide)
        • mean (majority class), random, tag displacement, ordinal random, ordinal displacement
      • Real data
        • Shared-task data for which the ground truth and multiple system outputs are available
        • RepLab2013 for polarity analysis; SemEval2014 and 2015 for sentiment analysis
      RepLab2013 dataset: http://nlp.uned.es/replab2013/
      SemEval2014, 2015: http://alt.qcri.org/semeval2014/ , http://alt.qcri.org/semeval2015/
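      A rough sketch of how such synthetic data could be generated (my reading of the slide, not the paper's exact procedure; only the "random" corruption pattern is shown):

          import numpy as np

          rng = np.random.default_rng(0)

          # Gold standard: 20,000 items over 11 ordered classes (0..10), drawn
          # from a Gaussian with mean 4 and a variance in [1, 3] (here: 2),
          # rounded and clipped to the class range.
          gold = np.clip(np.rint(rng.normal(loc=4.0, scale=np.sqrt(2.0), size=20_000)),
                         0, 10).astype(int)

          def corrupt_random(gold, p, n_classes=11):
              """Replace each gold label with a uniformly random class w.p. p."""
              mask = rng.random(gold.size) < p
              return np.where(mask, rng.integers(0, n_classes, gold.size), gold)

          outputs = [corrupt_random(gold, p) for p in np.arange(0.1, 1.01, 0.1)]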


  20. The proposed metric's coverage is superior
      Table 3: Metric Coverage: Spearman Correlation between single metrics and the UIR combination of Mutual Information, Accuracy, and Kendall across system pairs in both the synthetic and real data sets.

                                     ---------------- Synthetic data ---------------  ----------------- Real data ----------------
                                     all     minus  minus  minus  minus   minus   Replab  SEM-2014      SEM-2015
      Family         Metric         systems  sRand  sprox  smaj   stDisp  soDisp  2013    T9-A   T9-B   T10-A  T10-B  T10-C
      Reference      Accuracy        0.81    0.77   0.78   0.78   0.94    0.77    0.75    0.90   0.98   0.85   0.94   0.80
      metrics in     Kendall         0.84    0.81   0.82   0.82   0.93    0.82    0.88    0.94   0.98   0.84   0.97   0.88
      UIR            MI              0.84    0.82   0.84   0.82   0.93    0.82    0.91    0.97   0.99   0.93   0.98   0.93
      Classification F-measure       0.83    0.80   0.82   0.81   0.93    0.81    0.66    0.90   0.98   0.91   0.98   0.92
      metrics        MAAC            0.83    0.81   0.82   0.79   0.91    0.81    0.84    0.86   0.97   0.84   0.95   0.82
                     Kappa           0.81    0.78   0.79   0.77   0.94    0.77    0.44    0.95   0.99   0.93   0.98   0.97
                     Acc with 1      0.79    0.75   0.77   0.80   0.85    0.79    0.23    0.82   0.60   0.31   0.35   -0.19
      Error          MAE             0.84    0.82   0.83   0.87   0.86    0.84    0.81    0.96   0.95   0.95   0.87   0.56
      minimization   MAE^m           0.74    0.73   0.74   0.80   0.76    0.73    0.73    0.95   0.88   0.91   0.74   0.30
                     MSE             0.89    0.87   0.87   0.88   0.93    0.88    0.28    0.87   0.98   0.63   0.97   0.93
                     MSE^m           0.83    0.80   0.80   0.82   0.90    0.83    0.10    0.85   0.94   0.48   0.91   0.52
      Correlation    Pearson         0.77    0.79   0.74   0.73   0.83    0.79    0.91    0.97   0.98   0.96   0.97   0.79
      coefficients   Spearman        0.72    0.67   0.69   0.77   0.76    0.70    0.07    0.96   0.98   0.97   0.98   0.80
      Measurement    CEM_ORD         0.91    0.89   0.90   0.90   0.95    0.89    0.94    0.96   0.99   0.98   0.99   0.96
      theory         CEM_ORD^flat    0.87    0.84   0.86   0.88   0.89    0.87    0.82    0.96   0.96   0.94   0.92   0.65

      Table taken from https://www.aclweb.org/anthology/2020.acl-main.363/
      • The proposed metric correlates most strongly with the cases where acc, Kendall, and MI all improve (it captures all the target characteristics)
      • Without log scaling (the CEM_ORD^flat row) the results degrade, so the log scaling is important


  21. Summary
      • Ordinal classification lacked an appropriate evaluation metric
      • The paper enumerates the properties a metric should satisfy, taking the ordinal scale into account
      • It proposes the Closeness Evaluation Measure (CEM), which satisfies those properties
      • Experiments show that the proposed metric is superior:
        it correlates highly with the cases where acc, Kendall, and MI all improve
      • Extending it so that it can be used for model training is an interesting direction:
        a differentiable formulation, some assumptions on the data distribution
