Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia

Masaaki Nagata, Yumi Shibaki and Kazuhide Yamamoto. Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia. Proceedings of 6th Workshop on Ontologies and Lexical Resources (OntoLex 2010), pp.11-18 (2010.8)

自然言語処理研究室

August 31, 2010

Transcript

  1. Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia

    Masaaki Nagata (NTT Corporation), Yumi Shibaki and Kazuhide Yamamoto (Nagaoka University of Technology).
  2. Introduction

    A novel approach for building a large-scale Japanese ontology from Wikipedia.
    Upper level: an existing taxonomy, Nihongo Goi-Taikei (日本語語彙大系), a manually made, general-purpose taxonomy.
    Lower level: derived from Wikipedia categories, giving fine-grained, up-to-date information from about 50k categories and 480k instances (articles).
    End result: an “extended” Nihongo Goi-Taikei.
    Categories: 2,715 += 23,289 (precision 92.8%).
    Instances: 274,379 += 263,631 (precision 98.6%).
  3. Outline

    Introduction.
    Previous works.
    Our proposal.
    Language resources: Nihongo Goi-Taikei, Japanese Wikipedia.
    Proposed method: category alignment, is-a link detection, instance extraction.
    Experiment result.
    Comparison to previous works.
    Conclusion.
  4. Previous works on building an ontology from Wikipedia (1/2)

    Detecting is-a links in the Wikipedia category network.
    Head matching (Ponzetto and Strube, 2007): a link between two categories is labeled as is-a if their head words are the same, e.g. CAPITALS – CAPITALS IN ASIA.
    Its Japanese equivalent is suffix matching (Sakurai et al., 2008), as Japanese is a head-final language, e.g. 首都 (capitals) – アジアの首都 (capitals in Asia).
    The result is a set of taxonomic trees, not a single interconnected taxonomy.
    [Figure: head matching applied to the Wikipedia category network.]
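
Suffix matching as described on this slide can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: the function name `is_a_by_suffix` is hypothetical, and it compares raw strings, whereas a real system would presumably compare head words after morphological analysis.

```python
def is_a_by_suffix(parent: str, child: str) -> bool:
    """Label a category link as is-a when the child title ends with the parent title.

    Assumption: plain string suffixes stand in for head words, which works
    for a head-final language only as a rough approximation.
    """
    return child != parent and child.endswith(parent)


# Toy links: (parent category, child category)
links = [("首都", "アジアの首都"), ("酒", "酒神")]
labels = {link: is_a_by_suffix(*link) for link in links}
```

Here the first link is labeled is-a because アジアの首都 ends with 首都, while 酒神 does not end with 酒 and is rejected.
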
  5. Previous works on building an ontology from Wikipedia (2/2)

    Using an existing taxonomy as a core to make a single taxonomy from Wikipedia.
    YAGO (Suchanek et al., 2007): for each Wikipedia article, a Wikipedia category whose head noun is in plural form is extracted as its hypernym (a conceptual category).
    A conceptual category is associated with a WordNet synset by heuristics such as head matching.
    The core taxonomy is extended by only one level: WordNet synsets, then conceptual categories, then Wikipedia articles.
    (Kobayashi et al., 2008) is a Japanese equivalent of YAGO, using Nihongo Goi-Taikei as a core, e.g. “person/human”, then “American people in Japan”, then Konishiki Yasokichi.
  6. Our proposal

    Goal: building a single connected ontology from Wikipedia with its category hierarchy.
    Approach: using an existing ontology as the upper ontology, and using the knowledge defined in the upper ontology to increase the coverage of head matching.
    [Figure: the upper ontology, Nihongo Goi-Taikei, above Wikipedia categories and instances.]
  7. Outline

    Introduction.
    Previous works.
    Our proposal.
    Language resources: Nihongo Goi-Taikei, Japanese Wikipedia.
    Proposed method: category alignment, is-a link detection, instance extraction.
    Experiment result.
    Comparison to previous works.
    Conclusion.
  8. Nihongo Goi-Taikei (1/2)

    One of the largest and best-known Japanese thesauri (Ikehara et al., 1997).
    Published as a book in 5 volumes and as a CD-ROM.
    The meanings of 274,379 words are defined using 2,715 hierarchical semantic categories.
    Each category has a unique ID number and category name, such as 4:person and 388:place.
    Each word has up to 5 semantic categories.
    There are different category hierarchies for common nouns, proper nouns and verbs.
    We used only the common noun hierarchy; proper nouns are mapped to it using the mapping table.
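
The structure described on this slide (ID-numbered categories in a hierarchy, words carrying up to 5 categories) could be modeled roughly as follows. This is only a sketch: the class names `Category` and `Word` are hypothetical, and the parent links in the toy fragment skip intermediate levels of the real hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_SENSES = 5  # the slide states that each word has up to 5 semantic categories


@dataclass
class Category:
    cid: int                              # unique ID number, e.g. 4
    name: str                             # category name, e.g. "person"
    parent: Optional["Category"] = None   # None for the root

    def path(self) -> List[str]:
        """Return 'id:name' labels from the root down to this category."""
        labels = []
        node = self
        while node is not None:
            labels.append(f"{node.cid}:{node.name}")
            node = node.parent
        return labels[::-1]


@dataclass
class Word:
    surface: str
    senses: List[Category] = field(default_factory=list)

    def add_sense(self, category: Category) -> None:
        if len(self.senses) >= MAX_SENSES:
            raise ValueError("a word has at most 5 semantic categories")
        self.senses.append(category)


# Toy fragment (assumption: intermediate levels are omitted for brevity).
noun = Category(1, "noun")
person = Category(4, "person", noun)
author = Category(353, "author", person)
concrete_object = Category(533, "concrete object", noun)
appliance = Category(915, "household appliance", concrete_object)

raita = Word("ライター")
raita.add_sense(author)     # the "writer" sense
raita.add_sense(appliance)  # the "lighter" sense
```

Storing a parent pointer per category keeps the upward walk used later for hypernym candidates cheap; a word with two senses, such as ライター, simply holds two category references.
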
  9. Nihongo Goi-Taikei (2/2)

    Example: the Japanese word “raita” (ライター) has two senses, 353:author under 4:person (writer) and 915:household appliance under 533:concrete object (lighter).
    [Figure: the Goi-Taikei hierarchy, about 3,000 categories over about 300,000 words. Categories shown: 1:noun, 2:concrete, 1000:abstract, 3:agent, 388:place, 533:concrete object, 362:organization, 389:facility, 458:region, 468:nature, 534:animate being, 706:inanimate being, 915:household appliance, 1001:abstract object, 1235:thing, 2422:abstract relation, 4:person, 353:author.]
  10. Japanese Wikipedia (1/2)

    Wikipedia is a free, multilingual, online encyclopedia.
    The Japanese version has about 480k articles and 50k categories.
    Each article has a title, a body, and a list of categories. The first sentence of the body is usually the definition of the title.
    Each category has a title, a body, and a list of super categories.
    Article example, original Japanese: <title>カクテル</title> カクテル(英語:Cocktail)とは、主にベースとなる酒に、他の酒またはジュースなどを混ぜて作るアルコール飲料。 … <Category>カクテル</Category>
    Article example, English translation: <title>cocktail</title> A cocktail (English: Cocktail) is an alcoholic beverage made by mixing a base liquor with other liquor or juice … <Category>cocktail</Category>
    Category example, original Japanese: <title>Category:カクテル</title> [[カクテル]]に関するカテゴリ … <Category>酒</Category>
    Category example, English translation: <title>Category:Cocktails</title> Category about [[cocktails]] … <Category>alcoholic beverages</Category>
  11. Japanese Wikipedia (2/2)

    Wikipedia categories form a hierarchical network.
    An article can belong to multiple categories, and a category can have multiple super categories.
    A hyperlink is not necessarily an is-a relation.
    Upper categories are a thematic classification: history, geography, culture, nature, technology, etc.
    The lower a category link is in the hierarchy, the more likely it is to be an is-a relation.
    Examples: cocktail → alcoholic beverage (is-a); shaker → cocktail (not is-a).
  12. Outline

    Introduction.
    Previous works.
    Our proposal.
    Language resources: Nihongo Goi-Taikei, Japanese Wikipedia.
    Proposed method: category alignment, is-a link detection, instance extraction.
    Experiment result.
    Comparison to previous works.
    Conclusion.
  13. Outline of the proposed ontology building method (1/2)

    Category alignment: first, Goi-Taikei leaf categories are aligned with Wikipedia categories.
    Junction category: a Wikipedia category aligned with a Goi-Taikei category.
    [Figure: Goi-Taikei categories connected to Wikipedia categories through junction categories.]
  14. Outline of the proposed ontology building method (2/2)

    Is-a link detection: for each Goi-Taikei leaf category, is-a links in Wikipedia below the junction category are detected.
    Instance extraction: for each Wikipedia category, articles with is-a relations to the category are extracted.
    [Figure: Goi-Taikei categories, junction categories, and Wikipedia categories and instances.]
  15. Category Alignment (1/2)

    For each Goi-Taikei leaf category, a Wikipedia category is extracted as a junction category candidate if one of the following holds.
    The Goi-Taikei category matches the Wikipedia category: (GC) alcoholic beverages <-> (WC) alcoholic beverages.
    One of the instances of the Goi-Taikei category matches the Wikipedia category: (GC) public organization, (GI) school <-> (WC) school.
    More than two instances of the Goi-Taikei category match either instances or subcategories of the Wikipedia category: (GC) athlete <-> (WC) sports people; (GI) golfer, boxer, jockey; (WI) golfer, boxer, jockey.
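
The three candidate conditions on this slide can be written as a single check. The sketch below assumes exact string equality for "matches" and treats the conditions as alternatives; the function name and the returned labels are hypothetical, chosen only for illustration.

```python
from typing import Optional, Set


def junction_candidate_reason(
    gc_name: str,
    gc_instances: Set[str],
    wc_name: str,
    wc_members: Set[str],
) -> Optional[str]:
    """Return which condition makes the Wikipedia category (WC) a junction
    category candidate for the Goi-Taikei category (GC), or None.

    wc_members holds the titles of the WC's instances and subcategories.
    """
    if gc_name == wc_name:
        return "category-match"      # GC name matches WC name
    if wc_name in gc_instances:
        return "instance-match"      # one GC instance matches WC name
    if len(gc_instances & wc_members) > 2:
        return "member-overlap"      # more than two GC instances match WC members
    return None


# The slide's examples, in English gloss.
r1 = junction_candidate_reason("alcoholic beverages", set(), "alcoholic beverages", set())
r2 = junction_candidate_reason("public organization", {"school"}, "school", set())
r3 = junction_candidate_reason(
    "athlete", {"golfer", "boxer", "jockey"},
    "sports people", {"golfer", "boxer", "jockey", "coach"},
)
```

Returning the reason rather than a bare boolean makes it easier to inspect candidates by hand, which matters here because the final junction categories were selected manually.
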
  16. Category Alignment (2/2)

    If a Wikipedia category is semantically equivalent to a Goi-Taikei category, they are aligned.
    If a Wikipedia category is semantically equivalent to an instance of a Goi-Taikei category, it becomes a subcategory of the Goi-Taikei category.
    This is a word sense disambiguation problem: ロケット (roketto) has two Goi-Taikei categories, 990:aircraft (rocket) and 834:accessory (locket), and the Wikipedia category ロケット (rocket) corresponds to the former.
    We manually aligned Goi-Taikei and Wikipedia because the accuracy of automatic alignment was not satisfactory.
  17. Hypernym Extraction (Preparation for is-a Link Detection)

    For each article and category, a hypernym is extracted for is-a link detection.
    Lexico-syntactic patterns are applied to the definition sentence: [hyponym](と)は… [hypernym] (である)<EOS>, that is, “[hyponym] is a [hypernym] …”.
    Example: カクテルとは… アルコール飲料。 yields カクテル (アルコール飲料), i.e. cocktail (alcoholic beverage).
    If there is an article whose title is the same as its category, the hypernym of the article is used as that of the category.
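
A rough regular-expression version of the definition pattern is sketched below. It is an approximation under stated assumptions: it returns the final clause of the definition rather than a segmented head noun, on the expectation that later suffix matching tolerates leading modifiers. The slides do not say how the hypernym is segmented, and the names `DEFINITION` and `extract_hypernym_phrase` are hypothetical.

```python
import re
from typing import Optional, Tuple

# [hyponym](と)は … [hypernym](である)。  with an optional parenthetical after the hyponym
DEFINITION = re.compile(
    r"^(?P<hyponym>[^(（]+?)"        # title at the start of the sentence
    r"(?:[(（][^)）]*[)）])?"         # optional gloss such as (英語:Cocktail)
    r"(?:と)?は、?"                   # topic marker, optionally preceded by と
    r"(?P<rest>.+?)"                 # the defining phrase
    r"(?:である)?。"                  # optional copula, then end of sentence
)


def extract_hypernym_phrase(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (hyponym, hypernym phrase) or None if the pattern does not apply."""
    m = DEFINITION.match(sentence)
    if m is None:
        return None
    # Assumption: the hypernym is at the end of the last comma-separated clause.
    phrase = re.split(r"[、,]", m.group("rest"))[-1]
    return m.group("hyponym"), phrase


simple = extract_hypernym_phrase("カクテルとはアルコール飲料である。")
full = extract_hypernym_phrase(
    "カクテル(英語:Cocktail)とは、主にベースとなる酒に、"
    "他の酒またはジュースなどを混ぜて作るアルコール飲料。"
)
```

For the full definition sentence the returned phrase still carries modifiers (混ぜて作る…), but it ends with アルコール飲料, which is what a suffix comparison against hypernym candidates needs.
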
  18. Is-a link detection (1/2)

    Starting from a junction category, is-a link detection is applied recursively.
    A link is regarded as is-a if the suffix of either the child category or its hypernym (from the definition) matches one of the hypernym candidates.
    Hypernym candidates for a Wikipedia category are the union of the following.
    All super categories in Wikipedia from the current category to the junction category.
    Three super categories in Goi-Taikei from the junction category.
    All instances of the above three Goi-Taikei categories.
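
The recursive procedure on this slide might look like the sketch below. Several points are assumptions rather than details from the slides: the Goi-Taikei side (three super categories plus their instances) is passed in as one precomputed set, recursion stops at a rejected link, cycles are skipped, and the toy hypernym for ワイン is invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def matches_candidate(texts: List[str], candidates: Set[str]) -> bool:
    """True if any text ends with any hypernym candidate (suffix matching)."""
    return any(t.endswith(c) for t in texts for c in candidates)


def detect_is_a_links(
    junction: str,
    subcategories: Dict[str, List[str]],
    hypernym_of: Dict[str, str],
    goi_taikei_terms: Set[str],
) -> Set[Tuple[str, str]]:
    """Return accepted (parent, child) is-a links below the junction category."""
    accepted: Set[Tuple[str, str]] = set()

    def visit(parent: str, path: List[str]) -> None:
        # Candidates: Wikipedia categories from the junction down to the
        # current parent, plus the Goi-Taikei categories and instances.
        candidates = set(path) | goi_taikei_terms
        for child in subcategories.get(parent, []):
            if child in path:          # guard against cycles in the network
                continue
            texts = [child]
            if child in hypernym_of:   # hypernym from the definition sentence
                texts.append(hypernym_of[child])
            if matches_candidate(texts, candidates):
                accepted.add((parent, child))
                visit(child, path + [child])

    visit(junction, [junction])
    return accepted


# Toy data loosely based on the liquor example in the slides.
subcategories = {
    "酒": ["カクテル", "酒神", "ワイン"],
    "カクテル": ["ジンベースのカクテル"],
    "ワイン": ["フランスワイン"],
}
hypernym_of = {"カクテル": "アルコール飲料", "ワイン": "醸造酒"}  # ワイン entry is a toy value
goi_taikei_terms = {"酒", "飲み物", "飲料", "ウイスキー"}
links = detect_is_a_links("酒", subcategories, hypernym_of, goi_taikei_terms)
```

In this toy run カクテル is accepted through its hypernym (アルコール飲料 ends with the Goi-Taikei term 飲料), its subcategory is accepted through the Wikipedia path, and 酒神 is rejected because neither its title nor any hypernym ends with a candidate.
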
  19. Is-a link detection (2/2)

    [Figure: is-a link detection example. Goi-Taikei categories 857:飲み物 (beverages) and 861:酒 (liquor), with instances including 酒 (alcoholic beverage), ウイスキー (whisky) and 飲料 (beverage). Junction category: 酒 (alcoholic beverages). Wikipedia categories below it: カクテル (cocktails), ジンベースのカクテル (cocktails with gin), ウォッカベースのカクテル (cocktails with vodka), 蒸留酒 (distilled beverages), ウイスキー (whiskies), 醸造酒 (brewed beverages), ワイン (wine), フランスワイン (French wines), ボルドーワイン (Bordeaux wines), ビール (beer) and 日本酒 (sake). アルコール飲料 (alcoholic beverage) appears as an extracted hypernym. 酒神 (gods of alcoholic beverages) is labeled as not an is-a relation.]
  20. Instance Extraction (1/2)

    For each Wikipedia category, instances are extracted from all Wikipedia articles listed in the category.
    A link is regarded as is-a if the suffix of either the article title or its hypernym (from the definition) matches one of the hypernym candidates of the article.
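
Instance extraction applies the same suffix test to articles instead of subcategories. The sketch below assumes the hypernym candidates for the category have already been collected into a set; the function name is hypothetical, and the toy articles and hypernyms follow the cocktail example used in the slides.

```python
from typing import Dict, List, Set


def extract_instances(
    articles: List[str],
    hypernym_of: Dict[str, str],
    candidates: Set[str],
) -> List[str]:
    """Keep articles whose title or definition hypernym ends with a candidate."""
    instances = []
    for title in articles:
        texts = [title]
        if title in hypernym_of:
            texts.append(hypernym_of[title])
        if any(t.endswith(c) for t in texts for c in candidates):
            instances.append(title)
    return instances


# Articles listed in the category カクテル (cocktails), with extracted hypernyms.
candidates = {"カクテル", "酒", "飲み物", "飲料", "ウイスキー"}
articles = ["アースクエイク", "食前酒", "シェイカー"]
hypernym_of = {"アースクエイク": "カクテル", "食前酒": "酒", "シェイカー": "器具"}
instances = extract_instances(articles, hypernym_of, candidates)
```

With these inputs シェイカー (shaker) is dropped because neither its title nor its hypernym 器具 (appliance) ends with any candidate.
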
  21. Instance Extraction (2/2)

    [Figure: instance extraction example. Goi-Taikei categories 857:飲み物 (beverages) and 861:酒 (liquor), with instances including 酒 (alcoholic beverage), ウイスキー (whisky) and 飲料 (beverage). Junction category: 酒 (alcoholic beverages). Wikipedia categories: カクテル (cocktails), ジンベースのカクテル (cocktails with gin) and ウォッカベースのカクテル (cocktails with vodka). Candidate articles, with extracted hypernyms in parentheses: アースクエイク, earthquake (カクテル, cocktail); 食前酒, aperitif (酒, alcoholic beverage); シェイカー, shaker (器具, appliance). Links are marked ◦ or ×.]
  22. Outline

    Introduction.
    Previous works.
    Our proposal.
    Language resources: Nihongo Goi-Taikei, Japanese Wikipedia.
    Proposed method: category alignment, is-a link detection, instance extraction.
    Experiment result.
    Comparison to previous works.
    Conclusion.
  23. Category Alignment

    Japanese Wikipedia, as of July 24, 2008: 49,543 categories and 479,231 articles.
    Nihongo Goi-Taikei: 2,715 categories and 274,379 instances (words), of which 1,921 are leaf categories with 108,247 instances.
    6,301 junction category candidates were automatically extracted, and 2,477 junction categories were manually selected.
    719 Goi-Taikei leaf categories have one or more junction categories (719/1921=38.4%).
    Preliminary experiment on automatic selection using an SVM with standard ontology mapping features: whether the (class|instance) name of (self|parent|children|siblings) matches that of the other. This gave about 90% precision and 70% recall.
  24. Is-a Link Detection (1/2)

    23,289 categories were extracted from 49,543 Wikipedia categories.
    We evaluated the accuracy using two criteria.
    Parent-child precision: whether the link between the current category and its immediate parent is an is-a relation.
    Ancestor-descendant precision: whether all the links from the current category to the root are is-a relations.
    100 categories at each depth were randomly sampled and manually evaluated.
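
The difference between the two criteria is easiest to see on a small tree. The sketch below assumes each sampled category has a single chain of extracted parents and that manual judgments are available per link; the function name, category names and judgments are toy values.

```python
from typing import Dict, List, Tuple


def evaluate_sample(
    sampled: List[str],
    parent_of: Dict[str, str],
    link_is_a: Dict[Tuple[str, str], bool],
) -> Tuple[float, float]:
    """Return (parent-child precision, ancestor-descendant precision).

    Every sampled category must have at least one parent (it is not the root).
    """
    parent_child_hits = 0
    ancestor_hits = 0
    for category in sampled:
        # Collect the links from the category up to the root.
        links = []
        node = category
        while node in parent_of:
            links.append((parent_of[node], node))
            node = parent_of[node]
        parent_child_hits += link_is_a[links[0]]                 # immediate parent only
        ancestor_hits += all(link_is_a[link] for link in links)  # whole chain
    n = len(sampled)
    return parent_child_hits / n, ancestor_hits / n


parent_of = {
    "cocktails": "liquor",
    "cocktails with gin": "cocktails",
    "gods of liquor": "liquor",
    "Greek gods of liquor": "gods of liquor",
}
link_is_a = {  # toy manual judgments
    ("liquor", "cocktails"): True,
    ("cocktails", "cocktails with gin"): True,
    ("liquor", "gods of liquor"): False,
    ("gods of liquor", "Greek gods of liquor"): True,
}
pc, ad = evaluate_sample(["cocktails with gin", "Greek gods of liquor"], parent_of, link_is_a)
```

Both sampled categories have a correct immediate parent, so parent-child precision is 1.0, but one chain passes through a wrong link higher up, so ancestor-descendant precision is 0.5. The second criterion is therefore the stricter one.
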
  25. Is-a Link Detection (2/2)

    [Figure: parent-child precision and ancestor-descendant precision (scale 0.0 to 100.0) and the number of is-a categories (scale 0 to 25,000), plotted against the depth of is-a categories (1 to 13).]
  26. Instance Extraction

    263,631 articles were extracted as instances from 479,231 Wikipedia articles.
    The category with the largest number of instances is JAPANESE ACTORS, with 5,632 instances.
    The average number of instances is 17.8.
    For evaluation, category-article pairs were first randomly sampled from the categories with 100% ancestor-descendant precision; we then manually evaluated whether they are is-a relations.
    The result was 98.6% (205/208) precision and 83.0% (205/247) recall.
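
The reported figures follow directly from the counts on this slide. The short check below reads 208 as the number of extracted pairs and 247 as the number of true is-a pairs in the sample, which is the usual interpretation of the two denominators; the function name is illustrative.

```python
def precision_recall(correct: int, extracted: int, relevant: int):
    """Precision = correct / extracted, recall = correct / relevant."""
    return correct / extracted, correct / relevant


precision, recall = precision_recall(correct=205, extracted=208, relevant=247)
print(f"precision {precision:.1%}, recall {recall:.1%}")
# prints "precision 98.6%, recall 83.0%"
```
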
  27. Comparison to Previous Methods

    Head matching (suffix matching) (Sakurai et al., 2008), the Japanese equivalent of (Ponzetto and Strube, 2007): parent-child precision 91.2%, 6,672 Wikipedia categories.
    Using an existing ontology as a core (Kobayashi et al., 2008), the Japanese equivalent of YAGO (Suchanek et al., 2007): parent-child precision 93%, 19,426 Wikipedia categories.
    Our method: parent-child precision 92.8%, 23,289 Wikipedia categories, in a single connected taxonomy with a richer hierarchy.
  28. Conclusion and future work

    A large-scale Japanese ontology.
    Upper level: a well-defined taxonomy inherited from an existing ontology (Goi-Taikei).
    Lower level: a fine-grained and up-to-date taxonomy extracted from Wikipedia.
    Future work: automatic category alignment between Goi-Taikei and Wikipedia, and extracting more categories and articles, because our method uses only about half of Wikipedia.
    We will present a different approach that uses almost all category information by restricting the target categories: Shibaki et al., “Constructing Large-Scale Person Ontology from Wikipedia”, CSSR-2010 (COLING workshop WS4 on Saturday).