Constructing Large-Scale Person Ontology from Wikipedia
Yumi Shibaki, Masaaki Nagata and Kazuhide Yamamoto. Constructing Large-Scale Person Ontology from Wikipedia. Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (CCSR 2010), pp. 1-9 (2010.8)
Language resources • Previous works – Three related methods • Preliminary study • Proposed method • Results – Experimental results – Discussions • Conclusion
Goal and Motivation
• Motivation
  – Wikipedia contains a large number of articles and categories that represent a person
  – A large-scale person ontology would be useful for applications such as person search and named entity recognition
• What is a “person”?
  – Personal names, occupations, ethnic groups, etc.
machine learning-based classifiers, using
  • the structure of the Wikipedia category network
  • the semantic categories of words provided by the manually built Japanese thesaurus “Nihongo Goi-Taikei”
• Results
  – Categories: 99.3% precision, 98.4% recall
  – Articles (instances): 98.2% precision, 98.6% recall
• Person instances are extracted from the articles by heuristic rules
Example of person ontology
• Person instances: 130,000
• People by occupation is-a person
[Figure: example of the person ontology, with is-a links between categories and from articles to categories. Categories shown: person, People by occupation, Critics, Critics of Wikipedia, Ethnic groups, Ethnic groups in Africa, Astronauts, People who have walked on the Moon, People associated with Sports. Articles shown: Maasai people, Somali people, Neil Armstrong, Ball boy, Sports master.]
[Figure legend: Person category, Other category, Instance, is-a link; vertical axis from abstract to concrete.]
Language resources (1/2)
• Japanese thesaurus “Nihongo Goi-Taikei”
  – 100,000 common nouns (instances)
  – 2,700 hierarchical semantic categories
  – Each instance can have multiple categories
[Figure: excerpt of the Goi-Taikei hierarchy. Labels shown: plaything, sporting goods, officials, competitor, performer, model, dancer.]
Language resources (2/2)
• Japanese Wikipedia
  – 500,000 articles and 40,000 categories
  – An article has a title, a body and categories
    • In most articles, the first sentence of the body is the definition of the title
  – A category link does not necessarily represent an is-a relation
[Figure: the category American golfers with two listed articles, Michelle Wie (is-a) and PGA Tour (not-is-a); the category and the definition sentence are labeled.]
Ponzetto’s method [Ponzetto and Strube, 2007]
• Head matching
  – A category link is labeled as an is-a relation if the two categories share the same head word
  – In Japanese, the head word is the last word
• Problems
  – The result is not a single interconnected taxonomy
  – Relatively high precision, but low recall
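Head matching can be sketched in a few lines. This is only an illustration on English category names, assuming that the head word is the last whitespace-separated token; real Japanese category names would need a morphological analyzer to find the last word. The function names `head_word` and `is_head_match` are hypothetical.

```python
def head_word(category_name: str) -> str:
    """Return the last whitespace-separated token, lowercased.

    Assumption: the last token stands in for the head word.
    """
    return category_name.split()[-1].lower()


def is_head_match(category_a: str, category_b: str) -> bool:
    """True if the two category names share the same head word."""
    return head_word(category_a) == head_word(category_b)


print(is_head_match("Conductors", "Japanese conductors"))  # True
print(is_head_match("Journalists", "Sports writers"))      # False
```

The second pair is an is-a relation that shares no head word, which is the kind of link that head matching misses.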
of YAGO [Suchanek et al., 2007]
  – A Wikipedia category is associated with a Goi-Taikei category as its subcategory
  – The Wikipedia articles are extracted as the instances
[Figure: Goi-Taikei categories (referees, players) associated with Wikipedia categories (American golfers, Sportspeople) and their articles (Michelle Wie, Paula Creamer, Sportsperson).]
Kobayashi’s method (2/3) [Kobayashi et al., 2008]
• Defining the D-hypernym
  – The hypernym of a Wikipedia article, extracted from its definition sentence by pattern matching
[Figure: article text with the definition sentence and the D-hypernym marked: ") is a golf player. In 2006, she was named in a Time magazine article: "one of 100 people who shape our world.""]
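A minimal sketch of pattern-based hypernym extraction follows. The patterns actually used for Japanese definition sentences are not given on the slide, so the English "is a/an ..." pattern, the name `DEFINITION_PATTERN` and the function `extract_d_hypernym` are all assumptions made for illustration. The sketch returns the whole noun phrase; which part of it counts as the D-hypernym is not specified here.

```python
import re
from typing import Optional

# Hypothetical English stand-in for a definition pattern: "<title> is a/an <phrase>".
DEFINITION_PATTERN = re.compile(r"\bis an? (?P<hypernym>[^.,]+)")


def extract_d_hypernym(definition_sentence: str) -> Optional[str]:
    """Return the phrase following "is a/an", or None if the pattern fails."""
    match = DEFINITION_PATTERN.search(definition_sentence)
    return match.group("hypernym").strip() if match else None


print(extract_d_hypernym("Michelle Wie is a golf player."))  # golf player
print(extract_d_hypernym("No definition here"))              # None
```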
Kobayashi’s method (3/3) [Kobayashi et al., 2008]
• A Wikipedia category is associated with a Goi-Taikei category
  1. if the head word of the Wikipedia category matches an instance of a Goi-Taikei category
  2. if the D-hypernym of a Wikipedia article matches an instance of the same Goi-Taikei category
• Problem
  – Person categories that do not match any Goi-Taikei instance cannot be extracted, such as Animators, Percussionist
[Figure: the Wikipedia category American golfers and the Goi-Taikei category players, linked by head matching (1); article D-hypernyms are checked against the same category (2). The article PGA Tour (D-hypernym: Tour) is labeled “Doesn’t match”.]
Yamashita’s method [Yamashita, 2007]
• Personal name: taken from the Wikipedia article’s title
  – if the article is listed in a category such as “_th births”
• Problems
  – No hierarchical taxonomy is built
  – A simple rule with high precision, but low recall
Preliminary study (2/2)
Japanese Wikipedia (July 24, 2008)
  – categories are is-a relations (68% of all category links are is-a relations)
    • Ex. Sportspeople - Sportsmen
  – 97.3% of person instances are listed in one or more person categories with an is-a relation
    • Ex. the article “Isaac Newton” is listed in person categories such as “British physicists” and “Alchemists”
• If linked categories are both person categories, the category link can be regarded as an is-a relation.
• We therefore build only a person category classifier.
Proposed method (1/3)
1. Extracting person categories by using a machine learning-based classifier (SVM)
[Figure: Wikipedia category network. Categories shown: Music, Technology, people, Engineers, Announcers, Conductors, Japanese conductors.]
A category link is labeled as an is-a relation
  – if the linked categories are both person categories
[Figure: person category hierarchy. The root category person sits above the extracted person categories: Music people, Conductors, Japanese conductors, Musical instrument makers, Engineers, Announcers.]
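The labeling rule above can be sketched as a filter over category links. The link list below is an illustrative assumption built from category names on the slides, not data from the paper, and `label_is_a_links` is a hypothetical name. The set of person categories would come from the classifier in step 1.

```python
from typing import Iterable, List, Set, Tuple

CategoryLink = Tuple[str, str]  # (parent category, child category)


def label_is_a_links(
    links: Iterable[CategoryLink], person_categories: Set[str]
) -> List[CategoryLink]:
    """Keep only links whose two ends are both person categories."""
    return [
        (parent, child)
        for parent, child in links
        if parent in person_categories and child in person_categories
    ]


# Illustrative links (assumed structure, for demonstration only).
links = [
    ("Music", "Music people"),
    ("Music people", "Conductors"),
    ("Conductors", "Japanese conductors"),
    ("Technology", "Engineers"),
]
person_categories = {"Music people", "Conductors", "Japanese conductors", "Engineers"}

is_a_links = label_is_a_links(links, person_categories)
print(is_a_links)
```

Links such as Music to Music people are dropped because only one end is a person category.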
3. Extracting person instances from articles listed in person categories by heuristic rules
[Figure: person category hierarchy (person, Announcers, Music people) with the articles listed in each category: Sports announcer, News reader, Nippon Television Network school, Amati, Nicolas Chédeville, Kazuyoshi Akiyama, Takashi Asahina. Four of the articles are crossed out (×).]
Features for SVM (1/2)
• 6 kinds of words surrounding the target category
  – They often consist of common nouns
• We also used the information of “similar categories”; for details, please refer to the proceedings.
[Figure: the target category Crime writers with the six numbered sources of feature words, including the parent category, child category, sibling category, the listed articles (novelist, British novelist), and the article with the same name as the target category (Crime writer: “A crime writer is an author of crime fiction.”). Other categories shown: Writers, Organized crime novelists, Poets.]
Features for SVM (2/2)
• The manner in which each kind of word matches the Goi-Taikei instances
  b. No human senses (ex. Tour, Golf, Sports)
  c. Both human and non-human senses (ex. Model, Master, Center)
  d. Unknown (ex. Gamer, Sniper, Ubiquitous)
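The match manner can be read as a small categorical feature per word. The sketch below uses a toy dictionary in place of Goi-Taikei; the dictionary contents, the label strings and the function `match_manner` are assumptions for illustration. In particular, the `human_only` label is a guess at the remaining class, which does not appear in the text above.

```python
from typing import Dict, Set

# Toy stand-in for the thesaurus: word -> set of sense labels (assumed data).
TOY_THESAURUS: Dict[str, Set[str]] = {
    "tour": {"non-human"},
    "golf": {"non-human"},
    "model": {"human", "non-human"},
    "master": {"human", "non-human"},
    "dancer": {"human"},
}


def match_manner(word: str, thesaurus: Dict[str, Set[str]] = TOY_THESAURUS) -> str:
    """Classify how a word matches the thesaurus instances."""
    senses = thesaurus.get(word.lower())
    if not senses:
        return "unknown"  # word is not a thesaurus instance
    has_human = "human" in senses
    has_non_human = "non-human" in senses
    if has_human and has_non_human:
        return "both"
    # Assumed label for words with human senses only.
    return "human_only" if has_human else "no_human"


for word in ["Tour", "Model", "Gamer", "Dancer"]:
    print(word, match_manner(word))
```

Each of the six kinds of words would get its own copy of this feature before being passed to the classifier.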
of articles listed in person categories
  – Heuristic rules
    • The head word of the title or of the D-hypernym matches a Goi-Taikei instance in a person category
    • Most of the article’s categories are person categories
    • A category matches predefined patterns (ex. “_th births”)
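One way to read the three rules is as alternatives, any of which marks an article as a person instance. That OR-combination, the strict-majority reading of "most", the regular expression for “_th births”, and every name below (`Article`, `is_person_instance`, `BIRTH_CATEGORY`) are assumptions for this sketch, which also uses the last token as the head word.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical stand-in for a predefined category pattern such as "_th births".
BIRTH_CATEGORY = re.compile(r"^\d+(?:st|nd|rd|th)? births$")


@dataclass
class Article:
    title: str  # assumed non-empty
    d_hypernym: Optional[str]
    categories: List[str]


def is_person_instance(
    article: Article, person_nouns: Set[str], person_categories: Set[str]
) -> bool:
    """Apply the three heuristic rules; any one of them is enough (assumption)."""
    # Rule 1: head word of the title or of the D-hypernym is a person noun.
    candidates = [article.title.split()[-1].lower()]
    if article.d_hypernym:
        candidates.append(article.d_hypernym.split()[-1].lower())
    if any(word in person_nouns for word in candidates):
        return True
    # Rule 2: a strict majority of the article's categories are person categories.
    n_person = sum(1 for c in article.categories if c in person_categories)
    if article.categories and n_person * 2 > len(article.categories):
        return True
    # Rule 3: some category matches a predefined pattern.
    return any(BIRTH_CATEGORY.match(c) for c in article.categories)


person_nouns = {"announcer", "player"}
person_categories = {"Announcers", "American golfers"}

announcer = Article("Sports announcer", None, ["Announcers"])
tour = Article("PGA Tour", "Tour", ["American golfers", "Golf", "Sports"])
someone = Article("Some Person", None, ["20th births", "Tokyo"])

print(is_person_instance(announcer, person_nouns, person_categories))  # True
print(is_person_instance(tour, person_nouns, person_categories))       # False
print(is_person_instance(someone, person_nouns, person_categories))    # True
```

The categories given to the example articles are invented for the demonstration.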
Experimental result (2/2)
• The person instance extraction accuracy
  – Evaluation data: 1,000 (person: 281), randomly sampled

Method               Precision         Recall            F-measure
Kobayashi's method   96.0% (264/275)   94.0% (264/281)   95.0%
Proposed method      98.2% (277/282)   98.6% (277/281)   98.4%

• 96.2% of person instances are personal names
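The reported figures can be checked from the counts. The sketch assumes the F-measure is the balanced harmonic mean of precision and recall, which agrees with the numbers above; `precision_recall_f` is a hypothetical helper name.

```python
from typing import Tuple


def precision_recall_f(
    true_positives: int, extracted: int, gold: int
) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = true_positives / extracted
    recall = true_positives / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# (method, correct extractions, all extractions, person articles in the sample)
rows = [
    ("Kobayashi's method", 264, 275, 281),
    ("Proposed method", 277, 282, 281),
]
for name, tp, extracted, gold in rows:
    p, r, f = precision_recall_f(tp, extracted, gold)
    print(f"{name}: P={p:.1%} R={r:.1%} F={f:.1%}")
# Kobayashi's method: P=96.0% R=94.0% F=95.0%
# Proposed method: P=98.2% R=98.6% F=98.4%
```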
Discussions (1/2)
  – A variety of categories and instances that represent persons
    • Ex. Albert Einstein, Financial planner, Asian American
  – Person categories that do not match any Goi-Taikei instance (better than Kobayashi’s method)
    • Ex. Violinists, Animators, Percussionist
  – cannot be extracted by head matching (better than Ponzetto’s method)
    • Ex. Journalists - Sports writers
• The effect of using “similar category” and “Goi-Taikei”
  – Using similar categories results in a higher F-measure, regardless of the training data size.
  – With Goi-Taikei, the classifier achieves higher accuracy with little training data.
Training data: 1,000 to 30,000
Evaluation data: 9,767
[Figure: F-measure [%] (90.0 to 100.0) against the number of training data (0 to 30k) for three settings: Proposed method, Without using Goi-Taikei, Without using similar category.]
constructed
• The proposed method extracted person categories and articles accurately
  – Categories: 99.3% precision, 98.4% recall
  – Instances (articles): 98.2% precision, 98.6% recall
• In the future, we will attempt to apply our method to other objects
  – such as organizations and product names, for applications such as named entity recognition
The extraction accuracy of the pairs of categories
Wikipedia person categories)
• Precision: 98.3%
• Extracted pairs: 118,552
• Our method can extract is-a relations without reference to surface character strings.
  – 5,558 (38.6%) did not match their head words. Ponzetto’s method and Sakurai’s method cannot extract such pairs.
  – Ex. Journalists - Sports writers
The extraction accuracy of the pairs of person category and person instance
• Evaluation data: 1,000 pairs of category and article in Wikipedia, manually labeled (positive: 296, negative: 704)
• Example pairs shown (positive and negative): American golfers - Michelle Wie, Artists-Meritorious - Artist, Chiba clan - Ohsuga clan

Method            Precision         Recall            F-measure
Proposed method   98.0% (294/300)   99.3% (294/296)   98.7%
• Similar category: parent, child, and sibling categories whose head word matches that of the target category.
• The feature words of the similar categories are added to the feature words of the target category.
• There may not be enough text information from which features can be extracted, which could degrade the accuracy.
[Figure: the target category, its similar categories and their head words are marked. Categories shown: novelists, Poets, British crime fiction writers, Child writers.]
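Selecting similar categories can be sketched as a head-word filter over the neighbouring categories. The sketch assumes English names with the last token as head word, and it assumes the neighbour list already contains the parent, child and sibling categories of the target; `similar_categories` is a hypothetical name.

```python
from typing import Iterable, List


def head_word(category_name: str) -> str:
    """Last whitespace-separated token, lowercased (assumed head word)."""
    return category_name.split()[-1].lower()


def similar_categories(target: str, neighbours: Iterable[str]) -> List[str]:
    """Neighbouring categories whose head word equals the target's head word."""
    target_head = head_word(target)
    return [name for name in neighbours if head_word(name) == target_head]


# Parent, child and sibling categories of the target (roles assumed).
neighbours = [
    "Writers",
    "British crime fiction writers",
    "Child writers",
    "Organized crime novelists",
    "Poets",
]
print(similar_categories("Crime writers", neighbours))
# ['Writers', 'British crime fiction writers', 'Child writers']
```

The feature words collected around each returned category would then be pooled with those of the target category.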