Constructing Large-Scale Person Ontology from Wikipedia

Yumi Shibaki (1), Masaaki Nagata (2), Kazuhide Yamamoto (1)
(1) Nagaoka University of Technology
(2) NTT Communication Science Laboratories

Outline
• Introduction
  – Goal, motivation, approach and results
  – Language resources
• Previous works
  – Three related methods
• Preliminary study
• Proposed method
• Results
  – Experimental results
  – Discussions
• Conclusion

Goal and Motivation
• Goal
  – Constructing a large-scale person ontology from Wikipedia
• Motivation
  – Wikipedia contains a large number of articles and categories that represent persons
  – A large-scale person ontology would be useful for applications such as person search and named entity recognition
• What is a “person”?
  – Personal names, occupations, ethnic groups, etc.

Approach and Results
• Approach
  – Detecting person categories with machine learning-based classifiers, using
    • the structure of the Wikipedia category network
    • the semantic categories of words provided by the manually built Japanese thesaurus “Nihongo Goi-Taikei”
  – Extracting person instances from the articles by heuristic rules
• Results
  – Categories: 99.3% precision, 98.4% recall
  – Articles (instances): 98.2% precision, 98.6% recall

Example of person ontology
• Size
  – Wikipedia person categories: 8,500
  – Wikipedia person instances: 130,000
[Figure: example is-a hierarchy under “person”. Categories shown: People by occupation, Critics, Critics of Wikipedia, Astronauts, People who have walked on the Moon, Ethnic groups, Ethnic groups in Africa, People associated with Sports. Articles shown: Neil Armstrong, Maasai people, Somali people, Ball boy, Sports master.]

Language resources (1/2)
• Japanese thesaurus “Nihongo Goi-Taikei”
  – 100,000 common nouns (instances)
  – 2,700 hierarchical semantic categories
  – Each instance can have multiple categories
[Figure: excerpt of the Goi-Taikei hierarchy. Under “common noun”: concrete and abstract; under “concrete”: agents, places, objects. Person categories shown: officials, competitor, performer (instances: model, dancer). Other categories shown: plaything, sporting goods (instances: model, toy).]

Language resources (2/2)
• Japanese Wikipedia
  – An article has a title, a body and categories
    • In most articles, the first sentence of the body is the definition of the title
  – A category link does not necessarily represent an is-a relation
  – 500,000 articles and 40,000 categories
[Figure: categories Sports, Sportspeople, People by occupation, Japanese golfers, and American golfers, with the articles Michelle Wie and PGA Tour. Each link is marked is-a or not-is-a; the labels “category” and “definition sentence” point at parts of an article.]

Ponzetto’s method [Ponzetto and Strube, 2007]
• Head matching
  – A category link is labeled as an is-a relation if the two categories share the same head word
  – Ex. English: CAPITALS － CAPITALS IN ASIA
  – Ex. Japanese: 首都－アジアの首都 (the head word is the last word in Japanese)
• Problems
  – Not a single interconnected taxonomy
  – Relatively high precision, but low recall
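Head matching can be sketched in a few lines. This is a minimal illustration, not the implementation of Ponzetto and Strube: it assumes category names are already tokenized and that the head word is the final token, as the slide states for Japanese. The function names are hypothetical.

```python
def head_word(tokens):
    """Return the head word of a tokenized category name.

    Assumption: the head is the final token (true for Japanese,
    and for simple English noun compounds).
    """
    if not tokens:
        raise ValueError("category name has no tokens")
    return tokens[-1]


def is_a_by_head_matching(child_tokens, parent_tokens):
    """Label a category link as is-a when both names share a head word."""
    return head_word(child_tokens) == head_word(parent_tokens)


# The Japanese example from the slide: both names end in 首都 (capitals).
print(is_a_by_head_matching(["アジア", "の", "首都"], ["首都"]))

# A pair with different heads is not labeled, which is where recall is lost.
print(is_a_by_head_matching(["Sports", "writers"], ["Journalists"]))
```

The second call shows why the recall is low: a valid is-a pair whose surface strings differ is never labeled.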

Kobayashi’s method (1/3) [Kobayashi et al., 2008]
• Japanese equivalent of YAGO [Suchanek et al., 2007]
  – The Wikipedia category is associated with a Goi-Taikei category as its subcategory
  – The Wikipedia articles are extracted as the instances
[Figure: the Goi-Taikei categories “players” and “referees”; the Wikipedia categories “American golfers” and “Sportspeople” are associated with a Goi-Taikei category. Articles shown: Michelle Wie, Paula Creamer, Sportsperson.]

Kobayashi’s method (2/3) [Kobayashi et al., 2008]
• Defining D-hypernym
  – The hypernym of the Wikipedia article, extracted from its definition sentence by pattern matching
• Ex. article “Michelle Wie”
  – Definition sentence: “Michelle Wie (Michelle Wie, born October 11, 1989) is a golf player.”
  – D-hypernym: golf player
  – The body continues: In 2006, she was named in a Time magazine article: "one of 100 people who shape our world."
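The D-hypernym extraction can be illustrated with a single pattern. This is only a sketch: the original method works on Japanese definition sentences, and its actual patterns are not given on the slide. The English copula pattern and the function name `d_hypernym` below are assumptions made for illustration.

```python
import re

# Assumed pattern: "<title> is a/an/the <hypernym>" up to the next punctuation.
_PARENTHETICAL = re.compile(r"\([^)]*\)")
_DEFINITION = re.compile(r"\bis (?:an?|the) ([^.,;]+)")


def d_hypernym(definition_sentence):
    """Extract a hypernym from a definition sentence, or None if no match."""
    # Drop parenthetical asides such as readings and birth dates first.
    text = _PARENTHETICAL.sub("", definition_sentence)
    match = _DEFINITION.search(text)
    return match.group(1).strip() if match else None


sentence = (
    "Michelle Wie (Michelle Wie, born October 11, 1989) is a golf player. "
    "In 2006, she was named in a Time magazine article."
)
print(d_hypernym(sentence))  # golf player
```

A real extractor would need several patterns and a way to pick the head noun of longer phrases such as "author of crime fiction".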

Kobayashi’s method (3/3) [Kobayashi et al., 2008]
• A Wikipedia category is associated with a Goi-Taikei category
  1. If the head word of the Wikipedia category matches an instance of a Goi-Taikei category
  2. If the D-hypernym of a Wikipedia article matches an instance of the same Goi-Taikei category
• Problem
  – Person categories that do not match any Goi-Taikei instance cannot be extracted, such as Animators and Percussionist
[Figure: the Goi-Taikei category “players” (instances: player, golfer) and the Wikipedia category “American golfers”, linked by head matching. Articles with their D-hypernyms: Michelle Wie (golf player), PGA Tour (Tour). Pairs that do not match are marked ×.]
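The two matching conditions can be sketched as follows. This is an illustration under stated assumptions, not Kobayashi et al.'s implementation: it assumes both conditions must hold for the same Goi-Taikei category, that the thesaurus is a plain dictionary from category to instance words, and that a D-hypernym is matched by its last word. All names and the toy data are hypothetical.

```python
def last_word(phrase):
    """Head of a non-empty phrase, assumed to be its last word."""
    return phrase.split()[-1].lower()


def associate(category_head, article_d_hypernyms, thesaurus):
    """Return the Goi-Taikei categories a Wikipedia category is associated with.

    Assumed rule: (1) the category's head word is an instance of the
    Goi-Taikei category, and (2) at least one article's D-hypernym matches
    an instance of that same Goi-Taikei category.
    """
    associated = []
    for goi_category, instances in thesaurus.items():
        head_matches = category_head.lower() in instances
        article_matches = any(
            last_word(h) in instances for h in article_d_hypernyms
        )
        if head_matches and article_matches:
            associated.append(goi_category)
    return associated


# Toy thesaurus loosely following the slide's figure.
thesaurus = {"players": {"player", "golfer"}, "referees": {"referee"}}

# "American golfers": head "golfer", articles with D-hypernyms.
print(associate("golfer", ["golf player", "Tour"], thesaurus))

# A head word missing from the thesaurus can never be associated.
print(associate("animator", ["animator"], thesaurus))
```

The second call reproduces the stated problem: a person category whose head word is not in the thesaurus is lost regardless of its articles.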

Yamashita’s method [Yamashita, 2007]
• Extracting personal names from Wikipedia article titles
  – If the article is listed on a category such as “_th births”
  – Such a category is used for sorting personal names, horses, and dogs
• Problems
  – No hierarchical taxonomy is made
  – A simple rule with high precision, but low recall

Preliminary study (1/2)
Japanese Wikipedia (July 24, 2008)
• Size
  – All categories: 39,767
  – All articles: 477,094
  – Person categories: 8,485 (21.3%)
  – Person instances (articles): 130,000 (27.7%), estimated

Preliminary study (2/2)
Japanese Wikipedia (July 24, 2008)
• is-a relation
  – 98.7% of the links between person categories are is-a relations (68% of all category links are is-a relations)
    • Ex. Sportspeople - Sportsmen
  – 97.3% of person instances are listed on one or more person categories with an is-a relation
    • Ex. the article “Isaac Newton” is listed on person categories such as “British physicists” and “Alchemists”
• If the linked categories are both person categories, the category link can be regarded as an is-a relation.
• We therefore build only a person category classifier.

Proposed method (1/3)
1. Extracting person categories by using a machine learning-based classifier (SVM)
[Figure: Wikipedia category hierarchy containing Music, Technology, Broadcasting, Music people, Musical instrument makers, Engineers, Announcers, Announcer productions, Conductors, and Japanese conductors.]

Proposed method (2/3)
2. The category link is labeled as an is-a relation
  – If the linked categories are both person categories
[Figure: the same category hierarchy. Links between two person categories are labeled is-a; all other links are marked ×. Three person categories are marked as root categories.]

Proposed method (2/3), continued
[Figure: the resulting person category hierarchy. The root categories are linked to “person” by is-a. Categories shown: Music people, Musical instrument makers, Engineers, Conductors, Japanese conductors, Announcers.]
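Step 2 amounts to filtering the category links and attaching the remaining roots to “person”. The sketch below assumes the classifier output is available as a set of person category names and that links are (parent, child) pairs. The edges in the example are guesses for illustration, since the figure's exact links cannot be recovered from the transcript, and the function name is hypothetical.

```python
def build_person_hierarchy(category_links, person_categories):
    """Keep links whose two ends are both person categories.

    Returns (is_a_links, roots). A root is a person category with no
    person-category parent; each root is linked to the node "person".
    """
    is_a = [
        (parent, child)
        for parent, child in category_links
        if parent in person_categories and child in person_categories
    ]
    has_person_parent = {child for _, child in is_a}
    roots = sorted(c for c in person_categories if c not in has_person_parent)
    is_a += [("person", root) for root in roots]
    return is_a, roots


# Assumed (parent, child) links among the categories named on the slide.
links = [
    ("Music", "Music people"),
    ("Music people", "Conductors"),
    ("Conductors", "Japanese conductors"),
    ("Music people", "Musical instrument makers"),
    ("Technology", "Engineers"),
    ("Engineers", "Musical instrument makers"),
    ("Broadcasting", "Announcers"),
    ("Broadcasting", "Announcer productions"),
]
persons = {
    "Music people", "Conductors", "Japanese conductors",
    "Musical instrument makers", "Engineers", "Announcers",
}

is_a, roots = build_person_hierarchy(links, persons)
print(roots)  # ['Announcers', 'Engineers', 'Music people']
```

No link classifier is needed in this design: the only learned component is the person category classifier from step 1.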

Proposed method (3/3)
3. Extracting person instances from articles listed on person categories by heuristic rules
[Figure: the person category hierarchy with articles attached; articles that are not person instances are marked ×. Articles shown: Sports announcer, News reader, Nippon Television Network school, Amati, Nicolas Chédeville, Kazuyoshi Akiyama, Takashi Asahina.]

Features for SVM (1/2)
• 6 kinds of words surrounding the target category (numbered 1 to 6 on the slide)
  – Target category (ex. Crime writers)
  – Parent category (ex. Writers)
  – Child category (ex. Organized crime novelists)
  – Sibling category (ex. Poets)
  – D-hypernyms of articles in the target category (ex. novelist, British novelist)
  – D-hypernym of the article with the same name as the target category (ex. the article “Crime writer”: “A crime writer is an author of crime fiction.”)
• They often consist of common nouns
• We also used the information of “similar categories”; for details, please refer to the proceedings.

Features for SVM (2/2)
• The manner in which each kind of word matches the Goi-Taikei instance
  a. Only human senses (ex. Dancer, Artist, Novelist)
  b. No human senses (ex. Tour, Golf, Sports)
  c. Both human and non-human senses (ex. Model, Master, Center)
  d. Unknown (ex. Gamer, Sniper, Ubiquitous)
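The four match types (a to d) can be computed from a word's senses in the thesaurus. The sketch below is an assumption-laden illustration: the thesaurus is modeled as a dictionary from word to a set of semantic categories, and the entries (including the category name "travel") are toy data, not real Goi-Taikei content.

```python
def match_type(word, senses_of, human_categories):
    """Classify how a word matches the thesaurus.

    "a": only human senses, "b": no human senses,
    "c": both human and non-human senses, "d": unknown word.
    """
    senses = senses_of.get(word)
    if not senses:
        return "d"
    human = senses & human_categories
    if human == senses:
        return "a"
    if not human:
        return "b"
    return "c"


# Toy data: word -> semantic categories, and the human (person) categories.
senses_of = {
    "dancer": {"performer"},
    "tour": {"travel"},
    "model": {"performer", "plaything"},
}
human_categories = {"officials", "competitor", "performer"}

for word in ["dancer", "tour", "model", "gamer"]:
    print(word, match_type(word, senses_of, human_categories))
```

In a classifier, each of the six kinds of words would contribute its own match type, for example as a one-hot feature per kind.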

Extracting instances
• Extracting person instances from the titles of articles listed on person categories
  – Heuristic rules
    • The head word of the title or the D-hypernym matches a Goi-Taikei instance in a person category
    • Most of the article’s categories are person categories
    • A category matches predefined patterns (ex. “_th births”)
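The three heuristic rules can be sketched as one predicate. Several details are assumptions, since the slide does not state them: the rules are combined with OR, "most" means a strict majority, the head word is the last word of a phrase, and the "_" in a category pattern is a wildcard. The function names, the word list, and the example data are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as "_th births" into a regex ("_" = wildcard)."""
    parts = (re.escape(part) for part in pattern.split("_"))
    return re.compile(".+".join(parts))


def last_word(phrase):
    """Head of a phrase, assumed to be its last word (None if empty)."""
    words = phrase.split() if phrase else []
    return words[-1].lower() if words else None


def is_person_instance(title, d_hypernym, categories,
                       person_categories, person_words, patterns):
    # Rule 1: head of the title or of the D-hypernym is a person word.
    rule1 = (last_word(title) in person_words
             or last_word(d_hypernym) in person_words)
    # Rule 2: a strict majority of the article's categories are person ones.
    n_person = sum(c in person_categories for c in categories)
    rule2 = n_person * 2 > len(categories)
    # Rule 3: some category matches a predefined pattern.
    rule3 = any(p.fullmatch(c) for p in patterns for c in categories)
    return rule1 or rule2 or rule3


person_words = {"player", "announcer"}           # stand-in for Goi-Taikei
person_categories = {"American golfers", "Announcers"}
patterns = [pattern_to_regex("_th births")]

print(is_person_instance("Michelle Wie", "golf player", ["American golfers"],
                         person_categories, person_words, patterns))
print(is_person_instance("PGA Tour", "Tour", ["Golf tours"],
                         person_categories, person_words, patterns))
```

Combining the rules with OR favors recall; a stricter combination would trade recall for precision.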

Experimental result (1/2)
• Person categories extracted: 8,357 (extracted + training)
• Root categories: 224
• The person categories extraction accuracy
  – Training data: 2,000 (person: 435) (randomly sampled)
  – Evaluation data: 37,767 (person: 8,050) (remaining)

                     Precision            Recall               F-measure
Kobayashi’s method   92.8% (6727/7247)    83.6% (6727/8050)    88.0%
Proposed method      99.3% (7922/7979)    98.4% (7922/8050)    98.8%

Experimental result (2/2)
• The person instances extraction accuracy
  – Evaluation data: 1,000 (person: 281) (randomly sampled)

                     Precision            Recall               F-measure
Yamashita’s method   100.0% (218/218)     77.6% (218/281)      87.4%
Kobayashi’s method   96.0% (264/275)      94.0% (264/281)      95.0%
Proposed method      98.2% (277/282)      98.6% (277/281)      98.4%

• 96.2% of person instances are personal names
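The percentages in the result tables follow directly from the counts in parentheses. As a check, the sketch below recomputes precision, recall, and F-measure (the harmonic mean of the two) from the numbers of correct, extracted, and gold items; the function name is hypothetical.

```python
def precision_recall_f(correct, extracted, gold):
    """Return (precision, recall, F-measure) in percent, one decimal place."""
    if correct == 0:
        return (0.0, 0.0, 0.0)
    precision = correct / extracted
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * v, 1) for v in (precision, recall, f_measure))


# Person categories, proposed method: 7922 correct of 7979 extracted, 8050 gold.
print(precision_recall_f(7922, 7979, 8050))  # (99.3, 98.4, 98.8)

# Person instances, Yamashita's method: 218 correct of 218 extracted, 281 gold.
print(precision_recall_f(218, 218, 281))     # (100.0, 77.6, 87.4)
```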

Discussions (1/2)
• The proposed method can extract:
  – is-a relations that cannot be extracted by head matching (better than Ponzetto’s method)
    • Ex. Journalists － Sports writers
  – A variety of categories and instances that represent persons
    • Ex. Albert Einstein, Financial planner, Asian American
  – Person categories that do not match any Goi-Taikei instance (better than Kobayashi’s method)
    • Ex. Violinists, Animators, Percussionist

Discussions (2/2)
[Figure: F-measure [%] (90.0 to 100.0) plotted against the number of training data (0 to 30k) for three settings: Proposed method, Without using Goi-Taikei, Without using similar category.]
• The effect of using “similar category” and “Goi-Taikei”
  – Using similar categories results in a higher F-measure, regardless of the training data size
  – With Goi-Taikei, the classifier achieves higher accuracy with little training data
• Training data: 1,000 to 30,000
• Evaluation data: 9,767

Conclusion
• A large-scale and up-to-date Wikipedia person ontology was constructed
• The proposed method extracted person categories and articles accurately
  – Categories: 99.3% precision, 98.4% recall
  – Instances (articles): 98.2% precision, 98.6% recall
• In the future, we will attempt to apply our method to other objects
  – Such as organizations and product names, for applications such as named entity recognition

support documentation

The extraction accuracy of the pairs of categories
• Evaluation data: 1,000 pairs of linked categories (from extracted Wikipedia person categories)
• Precision: 98.3%
• Extracted pairs: 118,552
• Our method can extract is-a relations without reference to surface character strings.
  – 5,558 (38.6%) did not match their head words.
  – Ponzetto’s method and Sakurai’s method cannot extract such pairs.
  – Ex. Journalists － Sports writers

The extraction accuracy of the pairs of person category and person instance
• Evaluation data: 1,000 pairs of category and article in Wikipedia, manually labeled (positive: 296, negative: 704)
• Positive and negative example pairs shown on the slide: American golfers － Michelle Wie; Artists-Meritorious － Artist; Chiba clan － Ohsuga clan

                     Precision            Recall               F-measure
Kobayashi’s method   95.9% (259/270)      87.5% (259/296)      91.5%
Proposed method      98.0% (294/300)      99.3% (294/296)      98.7%

Similar category
• Similar category: parent, child, and sibling categories whose head word matches that of the target category
• The words for the features of similar categories are added to the words for the features of the target category
• Without them, there may not be enough text information from which features can be extracted, which could degrade the accuracy
[Figure: target category “Crime writers” with the surrounding categories People by occupation, Writers, Organized crime novelists, Poets, British crime fiction writers, and Child writers. The legend marks the target category, the similar categories, and the head word.]
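Selecting similar categories is a head-word filter over the neighbouring categories. The sketch below assumes English names whose head is the last word and takes the parent, child, and sibling categories as one list; the function names are hypothetical.

```python
def head_word(name):
    """Head of a category name, assumed to be its last word."""
    return name.split()[-1].lower()


def similar_categories(target, neighbours):
    """Return the neighbouring categories whose head word matches the target's."""
    target_head = head_word(target)
    return [n for n in neighbours if head_word(n) == target_head]


# Parent, child, and sibling categories named on the slide.
neighbours = [
    "People by occupation",
    "Writers",
    "Organized crime novelists",
    "Poets",
    "British crime fiction writers",
    "Child writers",
]
print(similar_categories("Crime writers", neighbours))
# ['Writers', 'British crime fiction writers', 'Child writers']
```

The feature words of the selected categories can then be pooled with those of the target, which helps when the target category itself has little text.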
