OpenTalks.AI - Elena Tutubalina, RuREBus-2020 Shared Task: Russian Relation Extraction for Business

February 05, 2021


1. RuREBus-2020 Shared Task:
     Russian Relation Extraction for Business
     Vitaly Ivanin, Ekaterina Artemova, Tatiana Batura,
     Vladimir Ivanov, Veronika Sarkisyan, Elena Tutubalina,
     and Ivan Smurov
     ABBYY, Russia
     Moscow Institute of Physics and Technology, Russia
     National Research University Higher School of Economics, Russia
     Novosibirsk State University, Russia
     Innopolis University, Russia
     Kazan Federal University, Russia
     Lomonosov Moscow State University


2. Our motivation
     ● Named entity recognition (NER) and relation extraction (RE) are established and
       well-researched NLP tasks
     ● Scores obtained by SOTA systems on standard academic corpora (such as CoNLL-03 and
       SemEval-2010 Task 8) are high and in many cases close to human performance
     ● Given these considerations, some representatives of academia claim that NER and RE
       are essentially “solved tasks”
     ● NER and RE are widely used in business, but their performance there is typically
       much lower than reported on academic corpora
     ● One can assume the reason is that academic corpora differ in major ways from
       industrial ones; it is therefore reasonable to create a business-oriented corpus
       and test modern methods on it


3. Academic corpora vs. industrial
     In our opinion, there are two key differences between academic and industrial
     corpora:
     ● Academic benchmarks typically consist of well-written news or biography texts,
       whereas business-case texts are usually domain-specific (e.g. legal) and can
       contain less-than-perfect language or other irregularities
     ● Entities in academic corpora are usually compact and well defined, whereas
       entities in industry are often much looser, spanning many words and with
       less-than-clear borders


4. Our goals
     ● Create a corpus of strategic planning documents with entity and relation
       annotation
     ● Organize a shared task on this corpus, thus establishing a reasonable baseline
       on it
     ● We intend our corpus to serve as a “lower bound estimate” for industrial
       applications
     (Figure: common NLP pipeline)


5. Business-like features
     We intended to make our shared task as close to a real-world scenario as possible.
     To do this, we allowed several practices that are often frowned upon in academia
     (though, to the best of our knowledge, they were ignored by our participants):
     ● We did not restrict participation to open-source systems exclusively
     ● We allowed participants to create additional markup, provided they reported it
       and sent us the markup they created
     ● We provided a large corpus of unmarked texts of the same domain


6. Strategic planning documents
     Each Russian federal and municipal subject publishes several strategic planning
     documents per year. The overall collection contains more than 30 thousand documents
     with the following features:
     ● uniformity of texts: documents have the same domain and purpose, and very similar
       style and size;
     ● shared scope: documents mention various types of economic and social entities and
       relations at different levels of management;
     ● fixed modalities: a fixed list of modalities in documents that cover the current
       state of the economy or society (problems), as well as plans for the future
       (actions, tasks, etc.)


7. Description of entities

     Entity               | Description                                         | Examples (eng)                               | Examples (rus)
     ACT (activity)       | event or specific activity                          | restoration work; drug prevention            | реставрационные работы; профилактика наркомании
     BIN (binary)         | one-time action / binary characteristic             | modernization; invest                        | модернизация; инвестировать
     CMP (compare)        | comparative characteristic                          | decrease of level; more ecological           | снижение уровня; более экологичный
     ECO (economics)      | economic entity / infrastructure object             | PJSC Sberbank; hospital complex              | ПАО Сбербанк; больничный комплекс
     INST (institutions)  | institutions, structures and organizations          | Youth Employment Center; city administration | центр занятости молодёжи; администрация города
     MET (metrics)        | numerical indicator / object with a comparison operation defined on it | unemployment rate; total length of roads | уровень безработицы; общая протяжённость дорог
     SOC (social)         | entity related to social rights or social amenities | leisure activities; historical heritage      | досуг населения; историческое наследие
     QUA (qualitative)    | quality characteristic                              | high quality; stable                         | высококачественный; стабильный


8. Description of relations

     Relation types: each label combines tense (Past/Present/Future) with polarity
     (Positive/Neutral/Negative):

                  | Positive (PS) | Neutral (NT) | Negative (NG)
     Past (P)     | PPS           | PNT          | PNG
     Present (N)  | NPS           | NNT          | NNG
     Future (F)   | FPS           | FNT          | FNG

     plus GOL (abstract goals) and TSK (tasks).

     Examples of annotation:

     Relation | Example (eng)                     | Example (rus)
     PNG      | landscaping work is not completed | работы по благоустройству не завершены
     FPS      | increase of life expectancy       | повышение средней продолжительности жизни
     NNT      | rate of incidence stabilized      | темп заболеваемости стабилизировался
     GOL      | decrease of unemployment level    | снижение уровня безработицы
     TSK      | capital kindergarten repair       | капитальный ремонт детского сада


9. Why are these entities and relations useful?
     ● Our main motivation was to create a showcase scenario for non-traditional
       entities and relations
     ● However, we believe that our markup can also be useful for the analysis of
       e-government documents
     ● See our article “So What's the Plan? Mining Strategic Planning Documents”
       at DTGS for details


10. Annotation pipeline: the brat interface
      (Figure: annotation interface)
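
      The annotations live in brat's standoff format. Below is a minimal reading sketch
      (not the official corpus tooling): entity lines start with T, relation lines with R;
      discontinuous spans are collapsed to their outer boundaries for simplicity.

      ```python
      # Sketch of a reader for brat standoff (.ann) files; not the official tooling.
      # Entity lines:   "T1<TAB>ACT 0 25<TAB>text"
      # Relation lines: "R1<TAB>TSK Arg1:T1 Arg2:T2"

      def parse_ann(path):
          """Return (entities, relations) parsed from one brat .ann file."""
          entities, relations = {}, []
          with open(path, encoding="utf-8") as f:
              for line in f:
                  parts = line.rstrip("\n").split("\t")
                  if line.startswith("T"):
                      tid, info, text = parts
                      fields = info.split()
                      # Discontinuous spans ("0 5;7 10") are collapsed to their
                      # outer boundaries here for simplicity.
                      entities[tid] = (fields[0], int(fields[1]), int(fields[-1]), text)
                  elif line.startswith("R"):
                      label, arg1, arg2 = parts[1].split()
                      relations.append((label,
                                        arg1.split(":")[1],   # head entity id
                                        arg2.split(":")[1]))  # tail entity id
          return entities, relations
      ```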


  11. Active learning
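
      The slide depicts the active-learning annotation loop. Below is a hypothetical
      uncertainty-sampling sketch of such a loop; `train`, `predict_proba`, and
      `annotate` are placeholder callables, not the actual RuREBus pipeline.

      ```python
      # Hypothetical uncertainty-sampling loop; `train`, `predict_proba`, and
      # `annotate` stand in for a tagger trainer, per-token class probabilities,
      # and human annotation in brat, respectively.

      def least_confidence(token_probs):
          # Document uncertainty = mean over tokens of (1 - max class probability).
          return sum(1.0 - max(p) for p in token_probs) / len(token_probs)

      def active_learning(labeled, unlabeled, train, predict_proba, annotate,
                          rounds=5, batch_size=20):
          for _ in range(rounds):
              model = train(labeled)
              # Send the documents the current model is least sure about to annotators.
              ranked = sorted(unlabeled,
                              key=lambda doc: least_confidence(predict_proba(model, doc)),
                              reverse=True)
              batch, unlabeled = ranked[:batch_size], ranked[batch_size:]
              labeled = labeled + [annotate(doc) for doc in batch]
          return labeled
      ```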


12. NE statistics

     Named entity | Total | Mean len (std)
     BIN          | 30201 | 1.05 (0.28)
     MET          | 14161 | 4.23 (3.50)
     QUA          |  7719 | 1.14 (0.52)
     CMP          |  9288 | 1.16 (0.78)
     SOC          | 10834 | 2.77 (2.31)
     INST         |  7903 | 3.69 (2.81)
     ECO          | 24853 | 2.78 (2.19)
     ACT          | 12274 | 4.74 (4.57)

     Dataset:
     - 188 train documents
     - 30 test documents
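
     For illustration, statistics like these can be recomputed from the parsed
     annotations; a sketch assuming the `parse_ann` helper above, and assuming the
     "mean len" column counts whitespace tokens.

     ```python
     # Recompute per-label counts and token-length statistics from entities
     # parsed with the parse_ann sketch above (aggregated over all documents).
     from collections import defaultdict
     from statistics import mean, pstdev

     def ne_stats(entities):
         lengths = defaultdict(list)
         for label, _start, _end, text in entities.values():
             lengths[label].append(len(text.split()))  # length in tokens
         return {label: (len(ls), round(mean(ls), 2), round(pstdev(ls), 2))
                 for label, ls in lengths.items()}

     # e.g. ne_stats(all_entities) -> {"BIN": (30201, 1.05, 0.28), ...}
     ```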


13. Relations by class

     Label | Count
     NNT   |   534
     NNG   |   844
     NPS   |   755
     PPS   |   528
     PNG   |    84
     GOL   |  3563
     TSK   |  4613
     PNT   |   190
     FPS   |  1167
     FNG   |   229
     FNT   |   141


14. Shared task
     1. Named Entity Recognition
        Given: raw text files
        Expected: character spans with labels
        Evaluation: micro span-based F1
     2. Relation Extraction with given Named Entities
        Given: NE spans
        Expected: relations in the format (class, head span idx, tail span idx)
        Evaluation: micro F1
     3. End-to-end Relation Extraction
        Given: raw text files
        Expected: NEs & relations in the format (class, head idx, tail idx)
        Evaluation: micro F1
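
     As a rough illustration of the scoring (not the official scorer), micro F1 over
     sets of exact-match tuples covers all three tracks; only the tuple contents
     change per task.

     ```python
     # Sketch of micro-F1 scoring over exact-match tuples (not the official scorer).
     # Task 1 items:      (doc_id, label, start, end)
     # Task 2/3 items:    (doc_id, relation_class, head_span, tail_span)

     def micro_f1(gold, pred):
         tp = len(gold & pred)                        # exact matches only
         precision = tp / len(pred) if pred else 0.0
         recall = tp / len(gold) if gold else 0.0
         if precision + recall == 0.0:
             return 0.0
         return 2 * precision * recall / (precision + recall)
     ```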


15. Task results (span-based F1)

     Team        | Task 1 (NER) | Task 2 (RE with NEs)
     davletov-aa | 0.561        | 0.394
     Sdernal     | 0.464        | 0.441
     ksmith      | 0.463        | 0.152
     viby        | 0.417        | 0.218
     dimsolo     | 0.400        | -
     bond005     | 0.338        | 0.045
     Student2020 | 0.253        | -


  16. Task 1


  17. Task 2


18. After the Task 1 and Task 3 deadline

     Team                | Task 1        | Task 3
     max before deadline | 0.561         | 0.062 (ksmith)
     davletov-aa         | 0.561         | 0.132
     bondarenko          | 0.498 (top 2) | -


19. Best architectures
     NER
     - top-1: Multilingual BERT + MLP
     - top-2: RuBERT + MLP
     RE
     - top-1: R-BERT
     - top-2: multi-task learning of NER + RE
     (Figure: R-BERT architecture)
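
     A minimal sketch of the "BERT + MLP" tagger family that topped the NER track
     (not the winning system's actual code): a pretrained encoder with a small MLP
     head over token states. The BIO tag scheme and hidden size are our assumptions;
     swapping the encoder name for "DeepPavlov/rubert-base-cased" gives the RuBERT
     variant.

     ```python
     # Sketch of a "BERT + MLP" token tagger; BIO tags over the 8 entity types.
     import torch.nn as nn
     from transformers import AutoModel, AutoTokenizer

     NUM_TAGS = 2 * 8 + 1  # B-/I- per entity type, plus O

     class BertMlpTagger(nn.Module):
         def __init__(self, name="bert-base-multilingual-cased", hidden=256):
             super().__init__()
             self.encoder = AutoModel.from_pretrained(name)
             dim = self.encoder.config.hidden_size
             # Small MLP head instead of a single linear layer.
             self.head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, NUM_TAGS))

         def forward(self, input_ids, attention_mask):
             states = self.encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
             return self.head(states)  # (batch, seq_len, NUM_TAGS) tag logits

     tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
     batch = tokenizer(["реставрационные работы"], return_tensors="pt")
     logits = BertMlpTagger()(batch["input_ids"], batch["attention_mask"])
     ```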


20. Long spans problem

     Difference between span-based F1 and char-based F1:

     Label         | ACT  | BIN  | CMP  | ECO  | INST | MET  | QUA  | SOC
     F1 diff       | 0.28 | 0.03 | 0.00 | 0.23 | 0.21 | 0.27 | 0.00 | 0.19
     Mean char len |   34 |   12 |   10 |   24 |   27 |   31 |   12 |   21
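
     Char-based F1 gives partial credit for partial overlap with long entities,
     which is why the gap grows with mean span length. A sketch, assuming spans as
     (label, start, end) character offsets:

     ```python
     # Sketch of char-based F1: expand each (label, start, end) span into
     # (label, char_position) pairs, so overlapping part of a long gold entity
     # earns partial credit instead of counting as a complete miss.

     def char_set(spans):
         return {(label, i) for label, start, end in spans for i in range(start, end)}

     def char_f1(gold_spans, pred_spans):
         gold, pred = char_set(gold_spans), char_set(pred_spans)
         tp = len(gold & pred)
         p = tp / len(pred) if pred else 0.0
         r = tp / len(gold) if gold else 0.0
         return 2 * p * r / (p + r) if p + r else 0.0
     ```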


21. SemEval-2020 Task 11
     ● Giovanni Da San Martino, Alberto Barrón-Cedeño, Henning Wachsmuth, Rostislav
       Petrov, and Preslav Nakov. SemEval-2020 Task 11: Detection of propaganda
       techniques in news articles. In Proceedings of the 14th International Workshop
       on Semantic Evaluation (SemEval 2020), Barcelona, Spain, September 2020.


22. Conclusions
     ● We presented a new Russian dataset for NER and RE
     ● We presented a large raw-text corpus for this domain
     ● The proposed dataset represents a worst-case industrial application scenario
     ● The shared-task results demonstrate that the dataset can serve as a testing
       ground for industrial applications of NER and RE


  23. Thanks for your attention!
GitHub: https://github.com/dialogue-evaluation/RuREBus
