
OpenTalks.AI - Natalia Loukachevitch, Automatic Taxonomy Extraction from Texts: Current Approaches and Evaluation

OpenTalks.AI
February 21, 2020


Transcript

  1. Automatic Taxonomy Extraction from Texts: Current Approaches and Evaluations
     Natalia Loukachevitch, Lomonosov Moscow State University, Moscow, Russia
     Developer of large Russian resources such as RuWordNet and RuThes; organizer of Russian NLP evaluations
  2. Lexical and Domain Knowledge in NLP
     • Lexical relations:
       – WordNet and wordnets for different languages
       – ImageNet was constructed over WordNet
     • Domain knowledge:
       – Medical ontologies and thesauri (UMLS, MeSH, Gene Ontology) are very influential in medical NLP and bioNLP
       – Industrial knowledge graphs
     • Necessity of large resources
  3. Lexical and Domain Knowledge in Applications
     • Types of lexical relations:
       – "pets are allowed" => "dogs are allowed" (hypernym)
       – restaurant in Japan => restaurant in Asia (holonym)
       – restaurant in Japan ≠ restaurant in China (co-hyponym)
       – good restaurant ≠ bad restaurant (antonym)
     • Question-answering systems:
       – When did Donald Trump visit Alabama?
         • Trump visited Huntsville on September 23
         • Trump visited Mississippi on June 21
     • Dialog state tracking (DST) – the first component of a dialog system
       – Asked for a "cheap pub in the east", the system should not recommend an "expensive restaurant in the pub"
  4. Word embeddings
     • Currently, vector representations of words (word embeddings) can be computed on large text collections using neural networks (word2vec, GloVe, fastText, ELMo, BERT, etc.); a minimal training sketch follows below
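     A minimal sketch of training such embeddings with word2vec, assuming the gensim library (gensim 4.x parameter names); the toy corpus and hyperparameters are placeholders, not values from the talk:

      # Train skip-gram word2vec embeddings on a tokenized corpus (gensim 4.x API).
      from gensim.models import Word2Vec

      corpus = [["pets", "are", "allowed"],
                ["dogs", "are", "allowed"]]   # placeholder; a real collection has millions of sentences

      model = Word2Vec(sentences=corpus,
                       vector_size=100,       # embedding dimensionality
                       window=5,              # context window size
                       min_count=1,           # keep rare words in this toy corpus
                       sg=1)                  # 1 = skip-gram, 0 = CBOW

      vector = model.wv["dogs"]               # 100-dimensional vector for "dogs"
      print(model.wv.similarity("dogs", "pets"))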
  5. Word embeddings vs. Resources
     • Why not just use word embeddings?
       – Modern word embeddings can capture word relatedness
       – But they mix all types of semantic relations together
     • Development and maintenance of large lexical and domain-specific resources is still needed:
       – Faster creation of specialized resources (for example, in the information security domain)
       – Adding novel words and terms to existing resources
     • What are the current achievements of automated methods for lexical relation extraction and taxonomy construction?
  6. Approaches to Semantic Relation Extraction
     • Path-based (pattern-based) approaches:
       – Hearst's patterns with modifications (see the sketch below)
     • Distributional approaches based on word embeddings:
       – Supervised learning on embeddings
       – Projection learning
     • Combined approaches
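     A minimal sketch of the pattern-based idea, using one classic Hearst pattern ("Y such as X"); the regex and example sentence are illustrative only:

      import re

      # One classic Hearst pattern: "NP_hypernym such as NP_hyponym (, NP_hyponym)*"
      PATTERN = re.compile(r"(\w+)\s+such as\s+(\w+(?:\s*,\s*\w+)*)", re.IGNORECASE)

      def extract_hypernym_pairs(text):
          """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
          pairs = []
          for match in PATTERN.finditer(text):
              hypernym = match.group(1)
              hyponyms = re.split(r"\s*,\s*", match.group(2))
              pairs.extend((hypo, hypernym) for hypo in hyponyms)
          return pairs

      print(extract_hypernym_pairs("We welcome pets such as dogs, cats"))
      # [('dogs', 'pets'), ('cats', 'pets')]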
  7. Path-Based Approaches
     HypeNet (Shwartz et al., 2016), ACL 2016 best paper award:
     • An LSTM encodes a single dependency path between two words
     • The paths between the words are classified for hypernymy (path-based network); a rough sketch follows below
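     A rough PyTorch sketch of the path-based idea (not the authors' code): dependency paths between a candidate pair are embedded, encoded by an LSTM, averaged, and classified for hypernymy; the vocabulary size and dimensions are placeholders:

      import torch
      import torch.nn as nn

      class PathEncoder(nn.Module):
          """Encode dependency paths between a word pair and classify the pair."""
          def __init__(self, n_path_symbols=1000, emb_dim=50, hidden_dim=60):
              super().__init__()
              self.embed = nn.Embedding(n_path_symbols, emb_dim)  # lemma/POS/dependency edge symbols
              self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
              self.classifier = nn.Linear(hidden_dim, 2)          # hypernym vs. not

          def forward(self, paths):
              # paths: (n_paths, path_len) integer ids of path symbols for one word pair
              _, (h_n, _) = self.lstm(self.embed(paths))          # h_n: (1, n_paths, hidden_dim)
              pair_vec = h_n.squeeze(0).mean(dim=0)               # average the path encodings
              return self.classifier(pair_vec)

      model = PathEncoder()
      toy_paths = torch.randint(0, 1000, (3, 6))                  # 3 paths of length 6 for one pair
      print(model(toy_paths).shape)                               # torch.Size([2])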
  8. Supervised Distributional Models
     • Represent a pair (x, y) as a feature vector based on the word embeddings:
       – Concatenation x ⊕ y [Baroni et al., 2012]
       – Difference y − x [Roller et al., 2014]
     • Train a classifier to predict the semantic relation between x and y (see the sketch below)
       – Achieved very good results: more than 70% Accuracy
     • But [Levy et al., 2015]: "lexical memorization", i.e. overfitting to the most common relation of a specific word
       – Training: (cat, animal), (dog, animal), (cow, animal), ... all labeled as hypernymy
       – Model: (x, animal) is a hypernym pair, regardless of x
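     A minimal sketch of this supervised setup with scikit-learn; the embeddings are random stand-ins, and either the concatenation or the difference can serve as the feature vector:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      emb = {w: rng.normal(size=50) for w in
             ["cat", "dog", "cow", "animal", "table", "furniture", "paris", "france"]}

      def features(x, y, mode="diff"):
          # "concat": [x; y] (Baroni et al., 2012); "diff": y - x (Roller et al., 2014)
          return np.concatenate([emb[x], emb[y]]) if mode == "concat" else emb[y] - emb[x]

      train = [("cat", "animal", 1), ("dog", "animal", 1), ("table", "furniture", 1),
               ("animal", "cat", 0), ("paris", "france", 0), ("dog", "table", 0)]

      X = np.stack([features(x, y) for x, y, _ in train])
      labels = [label for _, _, label in train]

      clf = LogisticRegression(max_iter=1000).fit(X, labels)
      print(clf.predict([features("cow", "animal")]))   # with real embeddings this generalizes better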
  9. Datasets for Evaluation of Semantic Relation Extraction
     • Datasets contain comparable numbers of positive and negative examples for all types of relations
     • Measures: F-measure and Accuracy (see the toy example below)
     • But in reality the number of positive examples for any relation is much smaller
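     A toy illustration of why both measures are reported (the numbers are made up): with few positives, a classifier that never predicts the relation still gets high Accuracy but zero F-measure:

      from sklearn.metrics import accuracy_score, f1_score

      # 1,000 candidate pairs, only 20 of them true hypernym pairs
      y_true = [1] * 20 + [0] * 980
      y_pred = [0] * 1000                                  # a system that extracts nothing

      print(accuracy_score(y_true, y_pred))                # 0.98, looks impressive
      print(f1_score(y_true, y_pred, zero_division=0))     # 0.0, nothing was extracted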
  10. SemEval-2018: Hypernym Discovery Task (Camacho-Collados et al., 2018)
     • Task: to find hypernyms for a given hyponym X, i.e. to return an ordered list of hypernym candidates for X from a given text collection
     • Data: target word X, a large text collection, a vocabulary of possible hypernyms (words occurring more than 5 times in the given collection; see the sketch below), and a blacklist of too-general words
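     A small sketch of building the candidate hypernym vocabulary described above (frequency threshold of 5 plus a blacklist); tokenization is simplified to lower-cased whitespace splitting, and the function name and example blacklist words are illustrative:

      from collections import Counter

      def candidate_vocabulary(sentences, blacklist, min_count=5):
          """Words occurring more than min_count times, minus too-general blacklist words."""
          counts = Counter(token.lower()
                           for sentence in sentences
                           for token in sentence.split())
          return {word for word, freq in counts.items()
                  if freq > min_count and word not in blacklist}

      # vocabulary = candidate_vocabulary(collection, blacklist={"thing", "object", "entity"})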
  11. SemEval-2018: Results
     • MRR (Mean Reciprocal Rank) accounts for the first correct result in a ranked list of hypernym candidates (a computation sketch follows below)
     • Best approach (CRIM): patterns, linear transformations from hyponyms to hypernyms, negative sampling, linear regression
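     A minimal sketch of computing MRR over ranked hypernym candidates; the rankings and gold sets below are toy placeholders:

      def mean_reciprocal_rank(ranked_lists, gold_sets):
          """MRR over queries: 1/rank of the first correct candidate, 0 if none is correct."""
          total = 0.0
          for ranking, gold in zip(ranked_lists, gold_sets):
              rr = 0.0
              for rank, candidate in enumerate(ranking, start=1):
                  if candidate in gold:
                      rr = 1.0 / rank
                      break
              total += rr
          return total / len(ranked_lists)

      ranked = [["animal", "pet", "mammal"], ["fruit", "food"]]
      gold = [{"pet", "mammal"}, {"vegetable"}]
      print(mean_reciprocal_rank(ranked, gold))   # (1/2 + 0) / 2 = 0.25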
  12. Held, Habash: A Simple Approach to Hypernymy Extraction (ACL 2019)
     • Hearst patterns (47 patterns)
     • Simple distributional model (a sketch follows below):
       – The hypernyms of the most similar word from the training set (nearest neighbor, NN) are taken
       – The hypernyms are ordered in descending order of their frequencies
       – If the similarity to the NN is lower than a threshold, the most frequent hypernym from the training data is taken
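     A rough sketch of the distributional part described above (not the authors' code): take the hypernyms of the nearest training neighbor, rank them by training frequency, and back off to the most frequent training hypernyms when the similarity is below the threshold; the embedding and training structures are assumptions:

      from collections import Counter
      import numpy as np

      def cosine(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def predict_hypernyms(query, emb, train_hypernyms, threshold=0.5, k=5):
          """emb: word -> vector; train_hypernyms: training hyponym -> list of gold hypernyms."""
          freq = Counter(h for hyps in train_hypernyms.values() for h in hyps)
          # nearest neighbor of the query among the training hyponyms
          neighbor = max(train_hypernyms, key=lambda w: cosine(emb[query], emb[w]))
          if cosine(emb[query], emb[neighbor]) < threshold:
              # back off: the most frequent training hypernyms, best first
              return [h for h, _ in freq.most_common(k)]
          # reuse the neighbor's hypernyms, ordered by descending frequency
          return sorted(set(train_hypernyms[neighbor]), key=lambda h: -freq[h])[:k]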
  13. New Evaluation for Russian (until 29 February)
     • Published RuWordNet (Loukachevitch, 2019): 110 thousand Russian words and expressions
     • A new version of RuWordNet (130 thousand Russian words and expressions) has been prepared but not yet published
     • Evaluation task: for new words (nouns and verbs), predict the nearest synsets from the published version (a possible baseline is sketched below)
       – Correct answers should indicate:
         • direct hypernyms, if a new word creates a new synset
         • hypernyms one step higher in the hierarchy
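     The task invites simple distributional baselines; one hedged sketch (not an official baseline of the evaluation): represent each published synset by the centroid of its members' embeddings and rank synsets by cosine similarity to the new word, leaving the step up to hypernym synsets to further processing:

      import numpy as np

      def rank_synsets(new_word, emb, synsets, k=10):
          """emb: word -> vector; synsets: synset id -> list of member words."""
          query = emb[new_word]
          scores = {}
          for synset_id, members in synsets.items():
              vectors = [emb[w] for w in members if w in emb]
              if not vectors:
                  continue
              centroid = np.mean(vectors, axis=0)
              scores[synset_id] = float(np.dot(query, centroid) /
                                        (np.linalg.norm(query) * np.linalg.norm(centroid)))
          return sorted(scores, key=scores.get, reverse=True)[:k]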
  14. RUSSE 2020 Evaluation (until 29.02.2020)
     • Results are higher than at SemEval-2018 because the task is updating an existing resource
  15. Conclusion
     • Large lexical resources continue to play an important role in NLP applications
     • The capabilities of current automatic approaches (including neural networks) to extract specific lexical relations are relatively low
       – Even for the most frequent relations (synonyms, hypernyms)
     • Combined approaches can be useful in various applications, including:
       – Reuse of existing resources
       – Tuning them on domain-specific text collections
       – Semi-automatic approaches