
OpenTalks.AI - Natalia Loukachevitch, Automatic Taxonomy Extraction from Texts: Current Approaches and Evaluation

OpenTalks.AI
February 21, 2020


Transcript

  1. Automatic Taxonomy Extraction from Texts: Current Approaches and Evaluations
     Natalia Loukachevitch, Lomonosov Moscow State University, Moscow, Russia
     Developer of large Russian resources such as RuWordNet and RuThes; organizer of Russian NLP evaluations
  2. Lexical and Domain Knowledge in NLP
     • Lexical relations:
       – WordNet and wordnets for different languages
       – ImageNet was constructed over WordNet
     • Domain knowledge:
       – Medical ontologies and thesauri (UMLS, MeSH, Gene Ontology) are very influential in medical NLP and bioNLP
       – Industrial knowledge graphs
     • Necessity of large resources
  3. Lexical and Domain Knowledge in Applications
     • Types of lexical relations:
       – "pets are allowed" => "dogs are allowed" (hypernym)
       – restaurant in Japan => restaurant in Asia (holonym)
       – restaurant in Japan ≠ restaurant in China (co-hyponym)
       – good restaurant ≠ bad restaurant (antonym)
     • Question-answering systems:
       – When did Donald Trump visit Alabama?
         • Trump visited Huntsville on September 23
         • Trump visited Mississippi on June 21
     • Dialog state tracking (DST) – the first component of a dialog system
       – Asked for a "cheap pub in the east", the system should not recommend an "expensive restaurant in the pub"
  4. Word embeddings
     • Currently, vector representations of words (word embeddings) can be computed on large text collections using neural networks (word2vec, GloVe, fastText, ELMo, BERT, etc.); a minimal training sketch follows below
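     A minimal sketch of training such embeddings with word2vec, assuming the gensim library (gensim 4.x parameter names); the toy corpus and hyperparameters are placeholders, not values from the talk:

      # Train skip-gram word2vec embeddings on a tokenized corpus (gensim 4.x API).
      from gensim.models import Word2Vec

      corpus = [["pets", "are", "allowed"],
                ["dogs", "are", "allowed"]]   # placeholder; a real collection has millions of sentences

      model = Word2Vec(sentences=corpus,
                       vector_size=100,       # embedding dimensionality
                       window=5,              # context window size
                       min_count=1,           # keep rare words in this toy corpus
                       sg=1)                  # 1 = skip-gram, 0 = CBOW

      vector = model.wv["dogs"]               # 100-dimensional vector for "dogs"
      print(model.wv.similarity("dogs", "pets"))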
  5. Word embeddings vs. Resources
     • Why not just use word embeddings?
       – Modern word embeddings can capture word relatedness
       – But they mix all types of semantic relations together
     • Development and maintenance of large lexical and domain-specific resources is still needed:
       – Faster creation of specialized resources (for example, in the information security domain)
       – Adding novel words and terms to existing resources
     • What are the current achievements of automated methods for lexical relation extraction and taxonomy construction?
  6. Approaches to Semantic Relation Extraction
     • Path-based (pattern-based) approaches:
       – Hearst's patterns with modifications (see the sketch below)
     • Distributional approaches based on word embeddings:
       – Supervised learning on embeddings
       – Projection learning
     • Combined approaches
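     A minimal sketch of the pattern-based idea, using one classic Hearst pattern ("Y such as X"); the regex and example sentence are illustrative only:

      import re

      # One classic Hearst pattern: "NP_hypernym such as NP_hyponym (, NP_hyponym)*"
      PATTERN = re.compile(r"(\w+)\s+such as\s+(\w+(?:\s*,\s*\w+)*)", re.IGNORECASE)

      def extract_hypernym_pairs(text):
          """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
          pairs = []
          for match in PATTERN.finditer(text):
              hypernym = match.group(1)
              hyponyms = re.split(r"\s*,\s*", match.group(2))
              pairs.extend((hypo, hypernym) for hypo in hyponyms)
          return pairs

      print(extract_hypernym_pairs("We welcome pets such as dogs, cats"))
      # [('dogs', 'pets'), ('cats', 'pets')]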
  7. Path-Based Approaches
     HypeNet (Shwartz et al., 2016), ACL 2016 best paper award:
     • An LSTM encodes a single dependency path between two words
     • The paths between the words are classified for hypernymy (path-based network); a rough sketch follows below
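     A rough PyTorch sketch of the path-based idea (not the authors' code): dependency paths between a candidate pair are embedded, encoded by an LSTM, averaged, and classified for hypernymy; the vocabulary size and dimensions are placeholders:

      import torch
      import torch.nn as nn

      class PathEncoder(nn.Module):
          """Encode dependency paths between a word pair and classify the pair."""
          def __init__(self, n_path_symbols=1000, emb_dim=50, hidden_dim=60):
              super().__init__()
              self.embed = nn.Embedding(n_path_symbols, emb_dim)  # lemma/POS/dependency edge symbols
              self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
              self.classifier = nn.Linear(hidden_dim, 2)          # hypernym vs. not

          def forward(self, paths):
              # paths: (n_paths, path_len) integer ids of path symbols for one word pair
              _, (h_n, _) = self.lstm(self.embed(paths))          # h_n: (1, n_paths, hidden_dim)
              pair_vec = h_n.squeeze(0).mean(dim=0)               # average the path encodings
              return self.classifier(pair_vec)

      model = PathEncoder()
      toy_paths = torch.randint(0, 1000, (3, 6))                  # 3 paths of length 6 for one pair
      print(model(toy_paths).shape)                               # torch.Size([2])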
  8. Supervised Distributional Models
     • Represent a pair (x, y) as a feature vector based on the word embeddings:
       – Concatenation x ⊕ y [Baroni et al., 2012]
       – Difference y − x [Roller et al., 2014]
     • Train a classifier to predict the semantic relation between x and y (see the sketch below)
       – Achieved very good results: more than 70% Accuracy
     • But [Levy et al., 2015]: "lexical memorization", i.e. overfitting to the most common relation of a specific word
       – Training: (cat, animal), (dog, animal), (cow, animal), ... all labeled as hypernymy
       – Model: (x, animal) is a hypernym pair, regardless of x
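     A minimal sketch of this supervised setup with scikit-learn; the embeddings are random stand-ins, and either the concatenation or the difference can serve as the feature vector:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      emb = {w: rng.normal(size=50) for w in
             ["cat", "dog", "cow", "animal", "table", "furniture", "paris", "france"]}

      def features(x, y, mode="diff"):
          # "concat": [x; y] (Baroni et al., 2012); "diff": y - x (Roller et al., 2014)
          return np.concatenate([emb[x], emb[y]]) if mode == "concat" else emb[y] - emb[x]

      train = [("cat", "animal", 1), ("dog", "animal", 1), ("table", "furniture", 1),
               ("animal", "cat", 0), ("paris", "france", 0), ("dog", "table", 0)]

      X = np.stack([features(x, y) for x, y, _ in train])
      labels = [label for _, _, label in train]

      clf = LogisticRegression(max_iter=1000).fit(X, labels)
      print(clf.predict([features("cow", "animal")]))   # with real embeddings this generalizes better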
  9. Datasets for Evaluation of Semantic Relation Extraction
     • Datasets contain comparable numbers of positive and negative examples for all types of relations
     • Measures: F-measure and Accuracy (see the toy example below)
     • But in reality the number of positive examples for any relation is much smaller
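     A toy illustration of why both measures are reported (the numbers are made up): with few positives, a classifier that never predicts the relation still gets high Accuracy but zero F-measure:

      from sklearn.metrics import accuracy_score, f1_score

      # 1,000 candidate pairs, only 20 of them true hypernym pairs
      y_true = [1] * 20 + [0] * 980
      y_pred = [0] * 1000                                  # a system that extracts nothing

      print(accuracy_score(y_true, y_pred))                # 0.98, looks impressive
      print(f1_score(y_true, y_pred, zero_division=0))     # 0.0, nothing was extracted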
  10. SemEval-2018: Hypernym Discovery Task (Camacho-Collados et al., 2018)
     • Task: to find hypernyms for a given hyponym X, i.e. to return an ordered list of hypernym candidates for X from a given text collection
     • Data: target word X, a large text collection, a vocabulary of possible hypernyms (words occurring more than 5 times in the given collection; see the sketch below), and a blacklist of too-general words
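     A small sketch of building the candidate hypernym vocabulary described above (frequency threshold of 5 plus a blacklist); tokenization is simplified to lower-cased whitespace splitting, and the function name and example blacklist words are illustrative:

      from collections import Counter

      def candidate_vocabulary(sentences, blacklist, min_count=5):
          """Words occurring more than min_count times, minus too-general blacklist words."""
          counts = Counter(token.lower()
                           for sentence in sentences
                           for token in sentence.split())
          return {word for word, freq in counts.items()
                  if freq > min_count and word not in blacklist}

      # vocabulary = candidate_vocabulary(collection, blacklist={"thing", "object", "entity"})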
  11. SemEval-2018: Results
     • MRR (Mean Reciprocal Rank) accounts for the first correct result in a ranked list of hypernym candidates (a computation sketch follows below)
     • Best approach (CRIM): patterns, linear transformations from hyponyms to hypernyms, negative sampling, linear regression
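     A minimal sketch of computing MRR over ranked hypernym candidates; the rankings and gold sets below are toy placeholders:

      def mean_reciprocal_rank(ranked_lists, gold_sets):
          """MRR over queries: 1/rank of the first correct candidate, 0 if none is correct."""
          total = 0.0
          for ranking, gold in zip(ranked_lists, gold_sets):
              rr = 0.0
              for rank, candidate in enumerate(ranking, start=1):
                  if candidate in gold:
                      rr = 1.0 / rank
                      break
              total += rr
          return total / len(ranked_lists)

      ranked = [["animal", "pet", "mammal"], ["fruit", "food"]]
      gold = [{"pet", "mammal"}, {"vegetable"}]
      print(mean_reciprocal_rank(ranked, gold))   # (1/2 + 0) / 2 = 0.25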
  12. Held, Habash: A Simple Approach to Hypernymy Extraction (ACL 2019)
     • Hearst patterns (47 patterns)
     • Simple distributional model (a sketch follows below):
       – The hypernyms of the most similar word from the training set (nearest neighbor, NN) are taken
       – The hypernyms are ordered in descending order of their frequencies
       – If the similarity to the NN is lower than a threshold, the most frequent hypernym from the training data is taken
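     A rough sketch of the distributional part described above (not the authors' code): take the hypernyms of the nearest training neighbor, rank them by training frequency, and back off to the most frequent training hypernyms when the similarity is below the threshold; the embedding and training structures are assumptions:

      from collections import Counter
      import numpy as np

      def cosine(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def predict_hypernyms(query, emb, train_hypernyms, threshold=0.5, k=5):
          """emb: word -> vector; train_hypernyms: training hyponym -> list of gold hypernyms."""
          freq = Counter(h for hyps in train_hypernyms.values() for h in hyps)
          # nearest neighbor of the query among the training hyponyms
          neighbor = max(train_hypernyms, key=lambda w: cosine(emb[query], emb[w]))
          if cosine(emb[query], emb[neighbor]) < threshold:
              # back off: the most frequent training hypernyms, best first
              return [h for h, _ in freq.most_common(k)]
          # reuse the neighbor's hypernyms, ordered by descending frequency
          return sorted(set(train_hypernyms[neighbor]), key=lambda h: -freq[h])[:k]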
  13. New Evaluation for Russian (until 29 February)
     • Published RuWordNet (Loukachevitch, 2019): 110 thousand Russian words and expressions
     • A new version of RuWordNet (130 thousand Russian words and expressions) has been prepared but not yet published
     • Evaluation task: for new words (nouns and verbs), predict the nearest synsets from the published version (a possible baseline is sketched below)
       – Correct answers should indicate:
         • direct hypernyms, if a new word creates a new synset
         • hypernyms one step higher in the hierarchy
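     The task invites simple distributional baselines; one hedged sketch (not an official baseline of the evaluation): represent each published synset by the centroid of its members' embeddings and rank synsets by cosine similarity to the new word, leaving the step up to hypernym synsets to further processing:

      import numpy as np

      def rank_synsets(new_word, emb, synsets, k=10):
          """emb: word -> vector; synsets: synset id -> list of member words."""
          query = emb[new_word]
          scores = {}
          for synset_id, members in synsets.items():
              vectors = [emb[w] for w in members if w in emb]
              if not vectors:
                  continue
              centroid = np.mean(vectors, axis=0)
              scores[synset_id] = float(np.dot(query, centroid) /
                                        (np.linalg.norm(query) * np.linalg.norm(centroid)))
          return sorted(scores, key=scores.get, reverse=True)[:k]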
  14. RUSSE 2020 Evaluation (until 29.02.2020)
     • Results are higher than at SemEval-2018 because the task is updating an existing resource
  15. Conclusion
     • Large lexical resources continue to play an important role in NLP applications
     • The capabilities of current automatic approaches (including neural networks) to extract specific lexical relations are relatively low
       – Even for the most frequent relations (synonyms, hypernyms)
     • Combined approaches can be useful in various applications, including:
       – Reuse of existing resources
       – Tuning them on domain-specific text collections
       – Semi-automatic approaches