Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia
Masaaki Nagata, Yumi Shibaki and Kazuhide Yamamoto. Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia. Proceedings of 6th Workshop on Ontologies and Lexical Resources (OntoLex 2010), pp.11-18 (2010.8)
from Wikipedia. Upper level: an existing taxonomy, Nihongo Goi-Taikei (日本語語彙大系), a manually made, general-purpose taxonomy. Lower level: derived from Wikipedia categories, with fine-grained, up-to-date information; about 50k categories and 480k instances (articles). End result: an “extended” Nihongo Goi-Taikei. Categories: 2,715 += 23,289 (precision 92.8%). Instances: 274,379 += 263,631 (precision 98.6%).
Japanese Wikipedia Proposed method Category alignment Is-a link detection Instance extraction Experiment result Comparison to Previous Works Conclusion
is-a links in Wikipedia category network. Head matching (Ponzetto and Strube, 2007): a link between two categories is labeled as is-a if their head words are the same, e.g. CAPITALS – CAPITALS IN ASIA. Its Japanese equivalent is suffix matching (Sakurai et al., 2008), because Japanese is a head-final language, e.g. 首都 – アジアの首都 (capitals – capitals in Asia). The result is a set of taxonomic trees, not a single interconnected taxonomy. [Figure: head matching applied to the Wikipedia category network]
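The suffix-matching rule described on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: category names are compared as raw strings (no morphological analysis), the function names are hypothetical, and the second link in the toy data (シェーカー, "shaker") is an invented negative example.

```python
def is_a_by_suffix(parent: str, child: str) -> bool:
    """Label a category link as is-a when the child name ends with the parent name.

    Japanese is head-final, so the head of a noun phrase is its suffix.
    """
    return child != parent and child.endswith(parent)


def label_links(links):
    """Keep only the (parent, child) links that pass suffix matching."""
    return [(parent, child) for parent, child in links if is_a_by_suffix(parent, child)]


# Toy links (parent, child); the second pair is an invented non-is-a example.
links = [("首都", "アジアの首都"), ("カクテル", "シェーカー")]
print(label_links(links))
```

Because each accepted link only connects categories that share a head word, the output is a forest of small trees, which is the limitation the slide points out.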
Using an existing taxonomy as a core to make a single taxonomy from Wikipedia. YAGO (Suchanek et al., 2007): for each Wikipedia article, a Wikipedia category whose head noun is in plural form (a conceptual category) is extracted as its hypernym. A conceptual category is associated with a WordNet synset by heuristics such as head matching. The core taxonomy is extended by only one level (WordNet synsets -> conceptual categories -> Wikipedia articles). (Kobayashi et al., 2008) is a Japanese equivalent of YAGO that uses Nihongo Goi-Taikei as a core, e.g. “person/human” -> “American people in Japan” -> Konishiki Yasokichi.
with its category hierarchy. Approach: use an existing ontology as the upper ontology, and use the knowledge defined in the upper ontology to increase the coverage of head matching. [Figure: upper ontology (Nihongo Goi-Taikei) above Wikipedia categories and instances]
Japanese thesauri (Ikehara et al., 1997). Published as a book in 5 volumes and as a CD-ROM. The meanings of 274,379 words are defined using 2,715 hierarchical semantic categories. Each category has a unique ID number and a category name, such as 4:person or 388:place. Each word has up to 5 semantic categories. There are separate category hierarchies for common nouns, proper nouns, and verbs. We used only the common noun hierarchy; proper nouns are mapped to it using the mapping table.
The Japanese version has about 480k articles and 50k categories. Each article has a title, a body, and a list of categories; the first sentence of the body is usually the definition of the title. Each category has a title, a body, and a list of super categories.
Example article, original Japanese description:
<title>カクテル</title>
カクテル(英語:Cocktail)とは、主にベースとなる酒に、他の酒またはジュースなどを混ぜて作るアルコール飲料。 ...
<Category>カクテル</Category>
Example article, English translation:
<title>cocktail</title>
A cocktail (English: Cocktail) is an alcoholic beverage made by mixing a base liquor with other liquor or juice ...
<Category>cocktail</Category>
Example category, original Japanese description:
<title>Category:カクテル</title>
[[カクテル]]に関するカテゴリ ...
<Category>酒</Category>
Example category, English translation:
<title>Category:Cocktails</title>
Category about [[cocktails]] ...
<Category>alcoholic beverages</Category>
An article can belong to multiple categories, and a category can have multiple super categories. A hyperlink is not necessarily an is-a relation. Upper categories are a thematic classification (history, geography, culture, nature, technology, etc.). The lower a category link is in the hierarchy, the more likely it is to be an is-a relation: cocktail -> alcoholic beverage (is-a), shaker -> cocktail (not is-a).
First, Goi-Taikei leaf categories are aligned with Wikipedia categories. Junction category: a Wikipedia category aligned with a Goi-Taikei category. [Figure: Goi-Taikei categories joined to Wikipedia categories at the junction categories]
Is-a link detection: for each Goi-Taikei leaf category, is-a links in Wikipedia below the junction category are detected. Instance extraction: for each Wikipedia category, articles with is-a relations to the category are extracted. [Figure: Goi-Taikei categories, junction categories, and Wikipedia categories and instances]
A Wikipedia category is extracted as a junction category candidate if any of the following holds. (1) The Goi-Taikei category matches the Wikipedia category, e.g. (GC) alcoholic beverages <-> (WC) alcoholic beverages. (2) One of the instances of the Goi-Taikei category matches the Wikipedia category, e.g. (GC) public organization, (GI) school <-> (WC) school. (3) More than two instances of the Goi-Taikei category match either instances or subcategories of the Wikipedia category, e.g. (GC) athlete <-> (WC) sports people, with (GI) golfer, boxer, jockey and (WI) golfer, boxer, jockey. Here GC and GI denote a Goi-Taikei category and instance, and WC and WI a Wikipedia category and instance.
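The three candidate conditions can be written down as a small predicate. The sketch below is illustrative only: the two dataclasses and their field names are assumptions made for the example, matching is plain string equality, and the English names stand in for the Japanese entries used on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class GoiTaikeiCategory:
    name: str
    instances: set = field(default_factory=set)  # words listed under the category


@dataclass
class WikipediaCategory:
    name: str
    members: set = field(default_factory=set)  # article titles and subcategory names


def is_junction_candidate(gc: GoiTaikeiCategory, wc: WikipediaCategory) -> bool:
    """Return True if wc satisfies any of the three candidate conditions for gc."""
    if gc.name == wc.name:            # (1) category names match
        return True
    if wc.name in gc.instances:       # (2) a Goi-Taikei instance matches the category name
        return True
    # (3) more than two Goi-Taikei instances match Wikipedia instances or subcategories
    return len(gc.instances & wc.members) > 2


athlete = GoiTaikeiCategory("athlete", {"golfer", "boxer", "jockey"})
sports_people = WikipediaCategory("sports people", {"golfer", "boxer", "jockey"})
print(is_junction_candidate(athlete, sports_people))
```

The predicate only proposes candidates; in the experiment reported later, the final junction categories are selected manually from them.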
If a Wikipedia category is semantically equivalent to a Goi-Taikei category, they are aligned. If a Wikipedia category is semantically equivalent to an instance of a Goi-Taikei category, it becomes a subcategory of that Goi-Taikei category. This is a word sense disambiguation problem: ロケット (roketto) has two Goi-Taikei categories, 990:aircraft (rocket) and 834:accessory (locket), and the Wikipedia category ロケット (rocket) corresponds to the former. We aligned Goi-Taikei and Wikipedia manually because the accuracy of automatic alignment was not satisfactory.
For each article and category, a hypernym is extracted for is-a link detection. Lexico-syntactic patterns are applied to the definition sentence: [hyponym](と)は… [hypernym](である)<EOS>, that is, “[hyponym] is a [hypernym] …”. For example, カクテルとは… アルコール飲料。 yields カクテル (アルコール飲料), i.e. cocktail (alcoholic beverage). If there is an article whose title is the same as that of its category, the hypernym of the article is used as the hypernym of the category.
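As a rough illustration of the definition-sentence pattern, the following Python sketch extracts a hypernym with regular expressions. It is not the authors' implementation: a real system would likely rely on morphological analysis, whereas this sketch assumes that the hypernym is the final run of katakana and kanji characters before an optional である and the sentence-final 。. All names in the code are hypothetical.

```python
import re

# Parenthesized glosses such as (英語:Cocktail), in ASCII or full-width parentheses.
PAREN = re.compile(r"[(（][^()（）]*[)）]")
# [hyponym](と)は ... [hypernym](である)。
DEFINITION = re.compile(r"^(?P<hyponym>.+?)と?は、?(?P<body>.+?)(?:である)?。?$")
# Assumption: the hypernym is the trailing run of katakana/kanji in the body.
TRAILING_NOUN = re.compile(r"[\u30A0-\u30FF\u4E00-\u9FFF]+$")


def extract_hypernym(sentence: str):
    """Return the hypernym found in a definition sentence, or None if no match."""
    cleaned = PAREN.sub("", sentence.strip())
    match = DEFINITION.match(cleaned)
    if match is None:
        return None
    noun = TRAILING_NOUN.search(match.group("body"))
    return noun.group(0) if noun else None


definition = "カクテル(英語:Cocktail)とは、主にベースとなる酒に、他の酒またはジュースなどを混ぜて作るアルコール飲料。"
print(extract_hypernym(definition))  # アルコール飲料
```

The heuristic fails for hypernyms written in hiragana and for titles that themselves contain は, which is one reason a tokenizer-based approach would be preferable in practice.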
Is-a link detection is applied recursively. A link is regarded as is-a if the suffix of either the child category or its hypernym (from the definition) matches one of the hypernym candidates. The hypernym candidates for a Wikipedia category are the union of the following: (1) all super categories in Wikipedia from the current category to the junction category; (2) three super categories in Goi-Taikei from the junction category; (3) all instances of those three Goi-Taikei categories.
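The candidate set and the recursive test can be sketched as follows. Several points are assumptions made for the example rather than details from the slides: the Goi-Taikei ancestors are given as a list ordered upward from the aligned category, the recursion continues only through links already accepted as is-a, a path check guards against cycles, and the toy data is invented and does not reflect actual Goi-Taikei content.

```python
def hypernym_candidates(wiki_path, gt_ancestors, gt_instances):
    """Union of the Wikipedia path, three Goi-Taikei categories, and their instances."""
    top = list(gt_ancestors[:3])
    candidates = set(wiki_path) | set(top)
    for gt_cat in top:
        candidates |= set(gt_instances.get(gt_cat, ()))
    return candidates


def is_a_link(child, child_hypernym, candidates):
    """Suffix-match the child name, or its definition hypernym, against the candidates."""
    names = [child] + ([child_hypernym] if child_hypernym else [])
    return any(name.endswith(c) for name in names for c in candidates)


def detect(category, wiki_path, children, hypernyms, gt_ancestors, gt_instances, accepted=None):
    """Collect is-a links below `category`, descending only through accepted links."""
    if accepted is None:
        accepted = []
    path = wiki_path + [category]
    candidates = hypernym_candidates(path, gt_ancestors, gt_instances)
    for child in children.get(category, []):
        if child in path:  # guard against cycles in the category network
            continue
        if is_a_link(child, hypernyms.get(child), candidates):
            accepted.append((category, child))
            detect(child, path, children, hypernyms, gt_ancestors, gt_instances, accepted)
    return accepted


# Invented toy data: junction category 酒 with one is-a child and one thematic grandchild.
children = {"酒": ["カクテル"], "カクテル": ["シェーカー"]}
hypernyms = {"カクテル": "アルコール飲料"}  # hypernyms taken from definition sentences
gt_ancestors = ["酒", "飲料", "飲食物"]
gt_instances = {"酒": {"日本酒"}}
print(detect("酒", [], children, hypernyms, gt_ancestors, gt_instances))
```

In the toy run, カクテル is accepted because its definition hypernym アルコール飲料 ends with the Goi-Taikei category 飲料, a match that plain suffix matching on category names alone would miss.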
Instances are extracted from all Wikipedia articles listed on the category. A link is regarded as is-a if the suffix of either the article title or its hypernym (from the definition) matches one of the hypernym candidates of the article.
Japanese Wikipedia: about 50k categories and 479,231 articles. Nihongo Goi-Taikei: 2,715 categories and 274,379 instances (words), including 1,921 leaf categories with 108,247 instances. 6,301 junction category candidates were extracted automatically, and 2,477 junction categories were selected manually. 719 Goi-Taikei leaf categories have one or more junction categories (719/1,921 = 37.4%). Preliminary experiment on automatic selection using an SVM: standard ontology mapping features, i.e. whether the (class|instance) names of (self|parent|children|siblings) match each other; about 90% precision and 70% recall.
Wikipedia categories. We evaluated the accuracy using two criteria. Parent-child precision: whether the link between the current category and its immediate parent is an is-a relation. Ancestor-descendant precision: whether all the links from the current category to the root are is-a relations. At each depth, 100 categories were randomly sampled and manually evaluated.
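The difference between the two criteria is easiest to see on a small example. The sketch below assumes a tree stored as a child-to-parent map and a dictionary of manual judgments per link; both structures and all names are hypothetical, and the data is a toy example rather than the evaluation sample.

```python
def parent_child_ok(category, parent, judged):
    """True if the link to the immediate parent was judged to be is-a."""
    return judged[(parent[category], category)]


def ancestor_descendant_ok(category, parent, judged):
    """True only if every link from the category up to the root was judged is-a."""
    while category in parent:
        if not judged[(parent[category], category)]:
            return False
        category = parent[category]
    return True


def precision(sample, check):
    """Fraction of sampled categories that pass the given check."""
    return sum(1 for category in sample if check(category)) / len(sample)


# Toy tree: A is the root; the link A -> B was judged not to be is-a.
parent = {"B": "A", "C": "B", "D": "A"}
judged = {("A", "B"): False, ("B", "C"): True, ("A", "D"): True}
sample = ["B", "C", "D"]

pc = precision(sample, lambda c: parent_child_ok(c, parent, judged))
ad = precision(sample, lambda c: ancestor_descendant_ok(c, parent, judged))
print(f"parent-child: {pc:.3f}, ancestor-descendant: {ad:.3f}")
```

Category C passes the parent-child check but fails the ancestor-descendant check, because an incorrect link higher up invalidates the whole chain; ancestor-descendant precision is therefore never higher than parent-child precision on the same sample.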
Wikipedia articles. The category with the largest number of instances is JAPANESE ACTORS, with 5,632 instances. The average number of instances per category is 17.8. For evaluation, category-article pairs were first randomly sampled from the categories with 100% ancestor-descendant precision; we then evaluated manually whether they are is-a relations. Result: 98.6% (205/208) precision and 83.0% (205/247) recall.
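The reported percentages follow directly from the counts on the slide. In the short check below, the variable names reflect one reading of those counts (208 pairs extracted as is-a, 247 pairs that are truly is-a in the sample); that reading is an assumption, and only the numbers themselves come from the slide.

```python
correct = 205    # extracted pairs judged to be true is-a relations
extracted = 208  # pairs extracted as is-a (assumed meaning of the denominator)
relevant = 247   # true is-a pairs in the sample (assumed meaning of the denominator)

precision = correct / extracted
recall = correct / relevant
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```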
(Sakurai et al., 2008), the Japanese equivalent of (Ponzetto and Strube, 2007): parent-child precision 91.2%, 6,672 Wikipedia categories. Using an existing ontology as a core, (Kobayashi et al., 2008), the Japanese equivalent of YAGO (Suchanek et al., 2007): parent-child precision 93%, 19,426 Wikipedia categories. Our method: parent-child precision 92.8%, 23,239 Wikipedia categories, in a single connected taxonomy with a richer hierarchy.
Upper level: well-defined taxonomy inherited from an existing ontology (Goi-Taikei). Lower level: fine-grained and up-to-date taxonomy extracted from Wikipedia. Future work: automatic category alignment between Goi-Taikei and Wikipedia; extracting more categories and articles, because our method uses only half of Wikipedia. We will present a different approach that uses almost all category information by restricting target categories: Shibaki et al., “Constructing Large-Scale Person Ontology from Wikipedia”, CSSR-2010 (COLING workshop WS4, on Saturday).