Constructing Large-Scale Person Ontology from Wikipedia

Yumi Shibaki, Masaaki Nagata and Kazuhide Yamamoto. Constructing Large-Scale Person Ontology from Wikipedia. Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (CCSR 2010), pp. 1-9 (2010.8)

Natural Language Processing Laboratory

August 31, 2010


Transcript

  1. Constructing Large-Scale Person Ontology from Wikipedia
    Yumi Shibaki (1), Masaaki Nagata (2), Kazuhide Yamamoto (1)
    (1) Nagaoka University of Technology
    (2) NTT Communication Science Laboratories
  2. Outline
    • Introduction
      – Goal, Motivation, Approach and results
      – Language resources
    • Previous works
      – Three related methods
    • Preliminary study
    • Proposed method
    • Results
      – Experimental results
      – Discussions
    • Conclusion
  4. Goal and Motivation
    • Goal
      – Constructing a large-scale person ontology from Wikipedia
    • Motivation
      – Wikipedia contains a large number of articles and categories that represent a person
      – A large-scale person ontology would be useful for applications such as person search and named entity recognition
    • What is a “person”?
      – Personal names, occupations, ethnic groups, etc.
  5. Approach and Results
    • Approach
      – Detecting person categories by machine learning-based classifiers, using
        • Structure of the Wikipedia category network
        • Semantic categories of words provided by the manually made Japanese thesaurus “Nihongo Goi-Taikei”
      – Extracting person instances from the articles by heuristic rules
    • Results
      – Categories : 99.3% precision, 98.4% recall
      – Articles (instances) : 98.2% precision, 98.6% recall
  6. Example of person ontology
    • Size
      – Wikipedia person categories : 8,500
      – Wikipedia person instances : 130,000
    [Figure: example of the person ontology under the node "person". Categories: People by occupation, Critics, Critics of Wikipedia, Astronauts, People who have walked on the Moon, Ethnic groups, Ethnic groups in Africa, People associated with Sports. Articles: Neil Armstrong, Maasai people, Somali people, Ball boy, Sports master. Links are labeled is-a.]
  7. Language resources (1/2)
    • Japanese thesaurus “Nihongo Goi-Taikei”
      – 100,000 common nouns (instances)
      – 2,700 hierarchical semantic categories
      – Each instance can have multiple categories
    [Figure: thesaurus hierarchy under "common noun" (concrete, abstract; agents, places, objects) with is-a links. Person categories: officials, competitor, performer. Other category: plaything, sporting goods. Instances: model, dancer, toy; "model" appears under more than one category.]
  8. Language resources (2/2)
    • Japanese Wikipedia
      – 500,000 articles and 40,000 categories
      – An article has a title, a body and categories
        • In most articles, the first sentence of the body is the definition of the title (the definition sentence)
      – A category link does not necessarily represent an is-a relation
    [Figure: categories Sports, Sportspeople, People by occupation, Japanese golfers and American golfers, with the articles Michelle Wie and PGA Tour; links are labeled is-a or not-is-a.]
  10. Ponzetto’s method [Ponzetto and Strube, 2007]
    • Head matching
      – A category link is labeled as an is-a relation if the two categories share the same head word
      – Ex. English: CAPITALS - CAPITALS IN ASIA; Japanese: 首都 - アジアの首都
      – The head word is the last word in Japanese
    • Problems
      – Not a single interconnected taxonomy
      – Relatively high precision, but low recall
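The head-matching rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Ponzetto and Strube's implementation: it assumes category names are already split into words (real Japanese text would need a morphological analyzer), takes the last token as the head word as the slide states for Japanese, and uses hypothetical function names and tokenizations.

```python
def head_word(tokens):
    """Return the head word of a pre-tokenized Japanese category name.

    Assumption: the head word is simply the last token.
    """
    if not tokens:
        raise ValueError("category name has no tokens")
    return tokens[-1]


def is_isa_by_head_matching(parent_tokens, child_tokens):
    """Label a category link as is-a when both names share the same head word."""
    return head_word(parent_tokens) == head_word(child_tokens)


# Hypothetical tokenizations (not taken from the slides' data).
capitals = ["首都"]                          # "Capitals"
capitals_in_asia = ["アジア", "の", "首都"]   # "Capitals in Asia"
sports = ["スポーツ"]                        # "Sports"
sportspeople = ["スポーツ", "選手"]           # "Sportspeople"

print(is_isa_by_head_matching(capitals, capitals_in_asia))  # True
print(is_isa_by_head_matching(sports, sportspeople))        # False
```

The second pair shows why the rule is precise but misses links: any is-a link whose names do not share a head word is never labeled.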
  11. Kobayashi’s method (1/3) [Kobayashi et al., 2008]
    • Japanese equivalent of YAGO [Suchanek et al., 2007]
      – The Wikipedia category is associated with a Goi-Taikei category as its subcategory
      – The Wikipedia articles are extracted as the instances
    [Figure: Goi-Taikei categories (referees, players) and instance (Sportsperson) associated with Wikipedia categories (Sportspeople, American golfers) and articles (Michelle Wie, Paula Creamer).]
  12. Kobayashi’s method (2/3) [Kobayashi et al., 2008]
    • Defining D-hypernym
      – The hypernym of the Wikipedia article, extracted from its definition sentence by pattern matching
    • Example article: Michelle Wie
      – Definition sentence: Michelle Wie (Michelle Wie, born October 11, 1989) is a golf player.
      – D-hypernym: golf player
      – Rest of the body: In 2006, she was named in a Time magazine article: "one of 100 people who shape our world."
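As a rough illustration of extracting a D-hypernym from a definition sentence by pattern matching, the sketch below applies one English "X is a Y" pattern. This is an assumption made for readability: the method on the slide works on Japanese definition sentences, and its actual patterns are not shown in the deck. The function name and the regular expression are hypothetical.

```python
import re

# One illustrative pattern: "<title> is a/an/the <hypernym phrase>".
_IS_A = re.compile(r"\bis\s+(?:an|a|the)\s+([^.,;]+)")
# Parenthetical notes such as readings or birth dates are dropped first.
_PARENS = re.compile(r"\([^()]*\)")


def extract_d_hypernym(definition_sentence):
    """Return the hypernym phrase of a definition sentence, or None."""
    cleaned = _PARENS.sub("", definition_sentence)
    match = _IS_A.search(cleaned)
    return match.group(1).strip() if match else None


print(extract_d_hypernym(
    "Michelle Wie (Michelle Wie, born October 11, 1989) is a golf player."
))  # golf player
print(extract_d_hypernym("A crime writer is an author of crime fiction."))
# author of crime fiction
```

A sentence that does not fit the pattern yields `None`, which corresponds to an article with no usable D-hypernym.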
  13. Kobayashi’s method (3/3) [Kobayashi et al., 2008]
    • Conditions for associating a Wikipedia category with a Goi-Taikei category (numbered 1 and 2 in the figure)
      1. If the head word of the Wikipedia category matches an instance of a Goi-Taikei category
      2. If the D-hypernym of a Wikipedia article matches an instance of the same Goi-Taikei category
    • Problem
      – Person categories that do not match any Goi-Taikei instance cannot be extracted, such as Animators, Percussionist
    [Figure: Goi-Taikei category "players" with instances player and golfer; Wikipedia category American golfers (head matching); articles Michelle Wie (D-hypernym: golf player) and PGA Tour (D-hypernym: Tour); items that do not match are crossed out (×).]
  14. Yamashita’s method [Yamashita, 2007]
    • Extracting personal names from Wikipedia article titles
      – If the article is listed on the category “_th births” etc. (categories for sorting personal names, horses, and dogs)
    • Problems
      – No hierarchical taxonomy is made
      – A simple rule and high precision, but low recall
  16. Preliminary study (1/2): Japanese Wikipedia (July 24, 2008)
    • Size
      – All categories : 39,767
      – All articles : 477,094
      – Person categories : 8,485 (21.3%)
      – Person instances (articles) : 130,000 (27.7%) (estimated)
  17. Preliminary study (2/2): Japanese Wikipedia (July 24, 2008)
    • is-a relation
      – 98.7% of the links between person categories are is-a relations (68% of all category links are is-a relations)
        • Ex. Sportspeople - Sportsmen
      – 97.3% of person instances are listed on one or more person categories with an is-a relation
        • Ex. the article “Isaac Newton” is listed on person categories such as “British physicists” and “Alchemists”
    • If linked categories are both person categories, the category link can be regarded as an is-a relation
    • We make only a person category classifier
  19. Proposed method (1/3)
    1. Extracting person categories by using a machine learning-based classifier (SVM)
    [Figure: Wikipedia category hierarchy with the categories Music, Technology, Broadcasting, Music people, Engineers, Announcers, Announcer productions, Conductors, Japanese conductors and Musical instrument makers.]
  20. Proposed method (2/3)
    2. The category link is labeled as an is-a relation
      – If linked categories are both person categories
    [Figure: the same category hierarchy (Music, Technology, Broadcasting, Music people, Engineers, Announcers, Announcer productions, Conductors, Japanese conductors, Musical instrument makers); some links are labeled is-a, the remaining links are crossed out (×), and three nodes are labeled "Root category".]
  21. Proposed method (2/3)
    2. The category link is labeled as an is-a relation
      – If linked categories are both person categories
    [Figure: resulting person category hierarchy under the node "person": Music people, Engineers, Announcers, Conductors, Japanese conductors, Musical instrument makers, connected by is-a links; three nodes are labeled "Root category".]
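Step 2 reduces to a filter over the category links once the person categories are known. The sketch below assumes the classifier output is available as a plain set of names, and it treats a person category with no person parent as a root category; that definition and the example links are illustrative guesses, since the figure's exact link structure does not survive in the transcript.

```python
def label_isa_links(category_links, person_categories):
    """Keep only (parent, child) links whose two ends are both person categories."""
    return [
        (parent, child)
        for parent, child in category_links
        if parent in person_categories and child in person_categories
    ]


def root_categories(isa_links, person_categories):
    """Person categories that never appear as the child of an is-a link (assumed definition)."""
    children = {child for _, child in isa_links}
    return sorted(cat for cat in person_categories if cat not in children)


# Hypothetical links between the categories named on the slide.
links = [
    ("Music", "Music people"),
    ("Music people", "Conductors"),
    ("Conductors", "Japanese conductors"),
    ("Technology", "Engineers"),
    ("Engineers", "Musical instrument makers"),
    ("Music", "Musical instrument makers"),
    ("Broadcasting", "Announcers"),
    ("Broadcasting", "Announcer productions"),
]
# Stand-in for the output of the person category classifier.
persons = {
    "Music people", "Conductors", "Japanese conductors",
    "Engineers", "Musical instrument makers", "Announcers",
}

isa = label_isa_links(links, persons)
print(isa)
print(root_categories(isa, persons))  # ['Announcers', 'Engineers', 'Music people']
```

Links that touch a non-person category such as Music or Broadcasting are dropped, which is what splits the hierarchy into several trees that then hang under a single "person" node.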
  22. Proposed method (3/3)
    3. Extracting person instances from articles listed on person categories by heuristic rules
    [Figure: person category hierarchy (person, Music people, Engineers, Announcers, Conductors, Japanese conductors, Musical instrument makers) with the articles Sports announcer, News reader, Nippon Television Network school, Amati, Nicolas Chédeville, Kazuyoshi Akiyama and Takashi Asahina; four items are crossed out (×).]
  23. Features for SVM (1/2)
    • 6 kinds of words surrounding the target category (they often consist of common nouns)
      – Target category (ex. Crime writers)
      – Parent category (ex. Writers)
      – Child category (ex. Organized crime novelists)
      – Sibling category (ex. Poets)
      – D-hypernyms of articles in the target category (ex. novelist, British novelist)
      – D-hypernym of the article with the same name as the target category (ex. the article “Crime writer”: A crime writer is an author of crime fiction.)
    • We also used the information of “similar category”; for details, please refer to the proceedings
  24. Features for SVM (2/2)
    • The manner in which each kind of word matches the Goi-Taikei instance
      a. Only human senses (ex. Dancer, Artist, Novelist)
      b. No human senses (ex. Tour, Golf, Sports)
      c. Both human and non-human senses (ex. Model, Master, Center)
      d. Unknown (ex. Gamer, Sniper, Ubiquitous)
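The four match types (a to d) can be illustrated with a toy lookup. The dictionary below is a stand-in for Goi-Taikei, not its real content: the word senses, the semantic category names and the set of "person" semantic categories are all assumptions made for this sketch.

```python
from enum import Enum


class MatchType(Enum):
    ONLY_HUMAN = "a"   # every sense is a person sense
    NO_HUMAN = "b"     # no sense is a person sense
    BOTH = "c"         # person and non-person senses
    UNKNOWN = "d"      # word is not in the thesaurus


# Toy thesaurus: word -> set of semantic categories (hypothetical entries).
TOY_THESAURUS = {
    "dancer": {"performer"},
    "tour": {"travel"},
    "model": {"performer", "plaything"},
}
# Semantic categories treated as person categories (hypothetical).
PERSON_SEMANTIC_CATEGORIES = {"performer", "competitor", "officials"}


def match_type(word, thesaurus=TOY_THESAURUS,
               person_categories=PERSON_SEMANTIC_CATEGORIES):
    """Classify how a word matches the thesaurus instances."""
    senses = thesaurus.get(word.lower())
    if not senses:
        return MatchType.UNKNOWN
    human_senses = senses & person_categories
    if human_senses == senses:
        return MatchType.ONLY_HUMAN
    if human_senses:
        return MatchType.BOTH
    return MatchType.NO_HUMAN


for w in ["Dancer", "Tour", "Model", "Gamer"]:
    print(w, match_type(w).value)
# Dancer a
# Tour b
# Model c
# Gamer d
```

Each of the six kinds of words from the previous slide would get its own match type, so one category yields several such values as classifier features.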
  25. Extracting instances
    • Extracting person instances from the titles of articles listed on person categories
      – Heuristic rules
        • The head word of the title or D-hypernym matches a Goi-Taikei instance in a person category
        • Most categories of the article are person categories
        • A category matches predefined patterns (ex. “_th births”)
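The three heuristic rules can be written as independent checks. The slide does not say how the rules are combined or what the predefined patterns look like, so this sketch only reports which rules fire. The word list, the "more than half" reading of "most", and the `births` pattern are all assumptions.

```python
import re

# Toy stand-in for Goi-Taikei instances that belong to person categories.
PERSON_WORDS = {"player", "novelist", "conductor"}
# Hypothetical pattern for categories such as "1989 births".
PERSON_CATEGORY_PATTERNS = [re.compile(r"\bbirths$")]


def rule_head_word(title_head, d_hypernym_head):
    """Rule 1: head word of the title or of the D-hypernym is a person word."""
    return title_head in PERSON_WORDS or d_hypernym_head in PERSON_WORDS


def rule_mostly_person_categories(article_categories, person_categories):
    """Rule 2: more than half of the article's categories are person categories."""
    if not article_categories:
        return False
    hits = sum(1 for cat in article_categories if cat in person_categories)
    return hits > len(article_categories) / 2


def rule_category_pattern(article_categories):
    """Rule 3: some category of the article matches a predefined pattern."""
    return any(p.search(cat)
               for cat in article_categories
               for p in PERSON_CATEGORY_PATTERNS)


def fired_rules(title_head, d_hypernym_head, article_categories, person_categories):
    """Return the names of the rules that fire for one article."""
    checks = [
        ("head_word", rule_head_word(title_head, d_hypernym_head)),
        ("mostly_person_categories",
         rule_mostly_person_categories(article_categories, person_categories)),
        ("category_pattern", rule_category_pattern(article_categories)),
    ]
    return [name for name, fired in checks if fired]


person_cats = {"American golfers", "Sportspeople"}
print(fired_rules("Wie", "player",
                  ["American golfers", "Sportspeople", "1989 births"], person_cats))
# ['head_word', 'mostly_person_categories', 'category_pattern']
print(fired_rules("Tour", "Tour", ["Golf tours"], person_cats))
# []
```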
  27. Experimental result (1/2)
    • The person categories extraction accuracy
      – Training data : 2,000 (person: 435) (randomly sampled)
      – Evaluation data : 37,767 (person: 8,050) (remaining)

                            Precision            Recall               F-measure
      Kobayashi’s method    92.8% (6727/7247)    83.6% (6727/8050)    88.0%
      Proposed method       99.3% (7922/7979)    98.4% (7922/8050)    98.8%

    • Person categories extracted : 8,357 (extracted + training)
    • Root categories : 224
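The percentages in this table follow from the counts in parentheses. As a check, the small sketch below recomputes precision, recall and F-measure (the harmonic mean of the two) from those counts; the function name is a hypothetical helper.

```python
def precision_recall_f(correct, extracted, gold):
    """Return (precision, recall, F-measure) as fractions."""
    precision = correct / extracted
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def as_percent(values):
    """Round fractions to percentages with one decimal place."""
    return tuple(round(100 * v, 1) for v in values)


# Counts taken from the table: correct / extracted, correct / gold.
print(as_percent(precision_recall_f(6727, 7247, 8050)))  # (92.8, 83.6, 88.0)
print(as_percent(precision_recall_f(7922, 7979, 8050)))  # (99.3, 98.4, 98.8)
```

The count of extracted person categories is also consistent: 7,922 correct categories from the evaluation data plus 435 person categories in the training data give 8,357.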
  28. Experimental result (2/2)
    • The person instances extraction accuracy
      – Evaluation data : 1,000 (person: 281) (randomly sampled)

                            Precision           Recall             F-measure
      Yamashita's method    100.0% (218/218)    77.6% (218/281)    87.4%
      Kobayashi's method    96.0% (264/275)     94.0% (264/281)    95.0%
      Proposed method       98.2% (277/282)     98.6% (277/281)    98.4%

    • 96.2% of person instances are personal names
  29. Discussions (1/2)
    • The proposed method can extract …
      – is-a relations that cannot be extracted by head matching (better than Ponzetto’s method)
        • Ex. Journalists - Sports writers
      – Person categories that do not match any Goi-Taikei instance (better than Kobayashi’s method)
        • Ex. Violinists, Animators, Percussionist
      – A variety of categories and instances that represent a person
        • Ex. Albert Einstein, Financial planner, Asian American
  30. Discussions (2/2)
    • The effect of using “similar category” and “Goi-Taikei”
      – Using similar categories results in a higher F-measure, regardless of the training data size
      – Using Goi-Taikei, the classifier achieves higher accuracy with little training data
    • Training data : 1,000 to 30,000
    • Evaluation data : 9,767
    [Figure: F-measure [%] (90.0 to 100.0) against the number of training data (0 to 30k) for three settings: Proposed method, Without using Goi-Taikei, Without using similar category.]
  32. Conclusion
    • A large-scale and up-to-date Wikipedia person ontology was constructed
    • The proposed method achieved accurate extraction of person categories and articles
      – Categories : 99.3% precision, 98.4% recall
      – Instances (articles) : 98.2% precision, 98.6% recall
    • In the future, we’ll attempt to apply our method to other objects
      – such as organizations and product names, for applications such as named entity recognition
  33. The extraction accuracy of the pairs of categories
    • Extracted pairs : 118,552
    • Evaluation data : 1,000 pairs of linked categories (from extracted Wikipedia person categories)
    • Precision : 98.3%
    • Our method can extract is-a relations without reference to surface character strings
      – 5,558 (38.6%) did not match their head words
      – Ponzetto’s method and Sakurai’s method cannot extract such pairs
      – Ex. Journalists - Sports writers
  34. The extraction accuracy of the pairs of person category and person instance
    • Evaluation data : 1,000 pairs of category and article in Wikipedia, with manually created labels (positive: 296, negative: 704)
    • Positive and negative examples shown on the slide: American golfers - Michelle Wie; Artists-Meritorious - Artist; Chiba clan - Ohsuga clan

                            Precision          Recall             F-measure
      Kobayashi’s method    95.9% (259/270)    87.5% (259/296)    91.5%
      Proposed method       98.0% (294/300)    99.3% (294/296)    98.7%
  35. Similar category
    • Similar category : parent, child, and sibling categories whose head word matches that of the target category
    • Words for the features of similar categories are added to the words for the features of the target category
    • There is a possibility that there is not enough text information from which features can be extracted, which could degrade the accuracy
    [Figure: categories People by occupation, Writers, Crime writers, Organized crime novelists, Poets, British crime fiction writers and Child writers; legend: target category, similar category, head word.]
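Selecting similar categories amounts to a head-word comparison over the target's neighbours. The sketch below uses the English category names from the slide and approximates the head word by the last word of the name, which is an assumption (the method itself works on Japanese names). The parent, child and sibling assignments are hypothetical, since the figure's links are not recoverable from the transcript.

```python
def head_word(category_name):
    """Approximate the head word by the last word of the name (assumption)."""
    return category_name.lower().split()[-1]


def similar_categories(target, parents, children, siblings):
    """Neighbouring categories whose head word matches the target's head word."""
    target_head = head_word(target)
    neighbours = list(parents) + list(children) + list(siblings)
    return [cat for cat in neighbours if head_word(cat) == target_head]


# Hypothetical neighbourhood of the target category "Crime writers".
similar = similar_categories(
    target="Crime writers",
    parents=["Writers"],
    children=["British crime fiction writers", "Organized crime novelists"],
    siblings=["Child writers", "Poets"],
)
print(similar)
# ['Writers', 'British crime fiction writers', 'Child writers']
```

Under this approximation, the feature words gathered for each selected category would then be pooled with those of the target category, which is how a category with little text of its own can still get usable features.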