The Importance of Morphological Analysis in Japanese Search Engines

Minoru Osuka
October 17, 2022


Transcript

  1. Japanese characters
    Kanji (漢字): Kanji - Wikipedia (Kyūjitai - Wikipedia, Shinjitai - Wikipedia)
    Kanji characters were first introduced to the Japanese archipelago from China around 2,400 to 1,700 years ago. People in the Japanese archipelago wrote things down by applying Kanji characters that had the same meaning as their own words.
    Hiragana (平仮名): Hiragana - Wikipedia, Hiragana (Unicode block) - Wikipedia
    Hiragana is a simplified form of Man'yōgana (万葉仮名), which used Kanji characters to represent the sounds of Japanese.
    Katakana (片仮名): Katakana - Wikipedia, Katakana (Unicode block) - Wikipedia
    Katakana also originated from Man'yōgana at about the same time as Hiragana. While Hiragana is a simplified form of whole Man'yōgana characters, Katakana was extracted from parts of them. Japanese has many loanwords from foreign languages, and Katakana characters are often used to imitate their sounds.
    Halfwidth and Fullwidth: Halfwidth and fullwidth forms - Wikipedia, Halfwidth and Fullwidth Forms (Unicode block) - Wikipedia
    Old computer systems used DBCS (Double-Byte Character Sets), which take twice as many bytes per character as SBCS (Single-Byte Character Sets), to make Japanese text prettier and more readable. Full-width alphabets, numbers, and symbols are also included because some computer systems did not allow full-width and half-width characters to be placed on the same line.
  2. Japanese sentence
    A Japanese sentence is terminated by "。".
    - "。" (Kuten) corresponds to "." (period).
    - "、" (Touten) corresponds to "," (comma).
    今日は雨が降ると思うよ。 (I think that it will rain today.)
  3. Japanese clause
    A clause is the shortest segment into which a sentence can be divided while still carrying meaning. For example, the sentence "今日は雨が降ると思うよ" (I think that it will rain today) can be broken down into clauses as follows:
    今日は / 雨が / 降ると / 思うよ。
  4. Japanese word
    The smallest unit that has meaning and cannot be broken down further is called a word. For example, the previous sentence "今日は雨が降ると思うよ" can be broken down into words as follows:
    今日 / は / 雨 / が / 降る / と / 思う / よ / 。 (I think that it will rain today.)
    Think about the word "今日" here. "今日" means "today" only when "今" and "日" are combined. In short, if you split "今日" into "今" and "日", the meaning changes: "今" means "now", and "日" means "day", "sun", and so on ("今" and "日" have many more meanings than this).
  5. What is morphological analysis?
    Japanese text is split into words with a morphological analyzer because, unlike English and many other languages, words are not clearly separated by white space. Morphological analysis is the process of dividing natural language text that carries no explicit grammatical information into an array of morphemes (roughly, the smallest units of meaning in a language), based on information such as the parts of speech of words and the grammar of the target language (collected in a dictionary), and then determining the part of speech of each morpheme.
    今日 (Noun) / は (Particle) / 雨 (Noun) / が (Particle) / 降る (Verb) / と (Particle) / 思う (Verb) / よ (Particle) / 。 (Symbol)
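    To make this concrete, here is a minimal sketch of tokenizing the example sentence with Lindera, the Rust morphological analyzer whose internals appear later in this deck. The constructor and token field names are assumptions and vary between Lindera releases; treat this as a sketch rather than code for any specific version.

    use lindera::tokenizer::Tokenizer;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Build a tokenizer with its default (embedded) dictionary.
        // NOTE: the constructor and field names below are assumptions; the
        // Lindera API differs between releases.
        let tokenizer = Tokenizer::new()?;

        // Tokenize the example sentence from this slide.
        for token in tokenizer.tokenize("今日は雨が降ると思うよ。")? {
            // Each token carries its surface form and part-of-speech details.
            println!("{}\t{:?}", token.text, token.details);
        }
        Ok(())
    }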
  6. Dictionary
    We often use dictionaries maintained by NAIST (Nara Institute of Science and Technology) and NINJAL (National Institute for Japanese Language and Linguistics). Recently, Works Applications Co., Ltd. has also started providing Japanese language resources as OSS.
    How are the dictionaries and language models maintained?
    - By manpower
    - By supervised machine learning from a corpus (training data)
    The maintenance of such language resources is very hard.
  7. Dictionary format
    A dictionary is a data structure that provides information about the available terms as well as how likely those terms are to appear next to each other according to Japanese grammar or probability. Let's look at the IPADIC dictionary format. The important part is the first four columns (Surface, Left Context ID, Right Context ID, Cost); after that comes metadata such as the part of speech, reading, and pronunciation of the term.
    原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ
    大倉谷地,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,大倉谷地,オオクラヤチ,オークラヤチ
    駒ケ崎,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,駒ケ崎,コマガサキ,コマガサキ
    里本江,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,里本江,サトホンゴ,サトホンゴ
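    As an illustration, here is a small sketch (not Lindera's actual loader) that parses one IPADIC CSV row into the four columns that matter for the lattice plus the remaining metadata. The struct and function names are made up for this example.

    // Illustrative representation of one IPADIC term entry.
    struct TermEntry {
        surface: String,
        left_context_id: u16,
        right_context_id: u16,
        cost: i16,
        metadata: Vec<String>, // part of speech, reading, pronunciation, ...
    }

    fn parse_ipadic_row(row: &str) -> Option<TermEntry> {
        let fields: Vec<&str> = row.split(',').collect();
        if fields.len() < 4 {
            return None;
        }
        Some(TermEntry {
            surface: fields[0].to_string(),
            left_context_id: fields[1].parse().ok()?,
            right_context_id: fields[2].parse().ok()?,
            cost: fields[3].parse().ok()?,
            metadata: fields[4..].iter().map(|s| s.to_string()).collect(),
        })
    }

    fn main() {
        let row = "原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ";
        let entry = parse_ipadic_row(row).expect("well-formed row");
        println!("{} left={} right={} cost={}",
                 entry.surface, entry.left_context_id, entry.right_context_id, entry.cost);
    }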
  8. Dictionary details
    Surface: The string or term that should appear in the text.
    Left Context ID / Right Context ID: The context IDs of the term as seen from the left or the right. The IDs registered in the term dictionary are used as keys when looking up the term connection matrix.
    Cost: The cost indicates how likely the term is to appear. The smaller the cost, the more frequently the term appears or is used.
    These entries are read to build a term dictionary.
    原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ
    大倉谷地,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,大倉谷地,オオクラヤチ,オークラヤチ
    駒ケ崎,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,駒ケ崎,コマガサキ,コマガサキ
    里本江,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,里本江,サトホンゴ,サトホンゴ
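    Continuing the sketch from the previous slide (reusing the hypothetical TermEntry and parse_ipadic_row), the parsed entries could be grouped by surface form so that the tokenizer can look up every candidate term for a given string. Real analyzers store this far more compactly (for example in a trie or FST); this only shows the idea.

    use std::collections::HashMap;

    // Illustrative only: group parsed entries by surface form so each
    // candidate term for a given string can be retrieved.
    fn build_term_dictionary(rows: &[&str]) -> HashMap<String, Vec<TermEntry>> {
        let mut dict: HashMap<String, Vec<TermEntry>> = HashMap::new();
        for row in rows {
            if let Some(entry) = parse_ipadic_row(row) {
                dict.entry(entry.surface.clone()).or_default().push(entry);
            }
        }
        dict
    }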
  9. Connection matrix
    As with the term cost, the smaller the connection cost, the more likely the two terms are to be connected. The matrix.def file in the IPADIC dictionary represents this mapping. The first line gives the number of right context IDs and left context IDs (1316 each); each following line contains a left context ID, a right context ID, and a cost.
    1316 1316
    0 0 -434
    0 1 1
    0 2 -1630
    …
    1315 1313 -4369
    1315 1314 -1712
    1315 1315 -129
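    A minimal sketch of loading matrix.def into a flat cost table, with the column layout exactly as described on this slide (illustrative only, not Lindera's actual loader):

    // Illustrative parser for matrix.def: a header with the two dimensions,
    // then one "left_id right_id cost" triple per line.
    struct ConnectionMatrix {
        right_size: usize,
        costs: Vec<i16>, // indexed as left_id * right_size + right_id
    }

    impl ConnectionMatrix {
        fn parse(text: &str) -> Option<ConnectionMatrix> {
            let mut lines = text.lines();
            let mut header = lines.next()?.split_whitespace();
            let left_size: usize = header.next()?.parse().ok()?;
            let right_size: usize = header.next()?.parse().ok()?;
            let mut costs = vec![0i16; left_size * right_size];
            for line in lines {
                let mut cols = line.split_whitespace();
                let left: usize = cols.next()?.parse().ok()?;
                let right: usize = cols.next()?.parse().ok()?;
                let cost: i16 = cols.next()?.parse().ok()?;
                costs[left * right_size + right] = cost;
            }
            Some(ConnectionMatrix { right_size, costs })
        }

        // Look up the connection cost for a (left context ID, right context ID) pair.
        fn cost(&self, left_id: usize, right_id: usize) -> i16 {
            self.costs[left_id * self.right_size + right_id]
        }
    }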
  10. Building a lattice and choosing the best path
    Most Japanese tokenizers use lattice-based tokenization. As the name suggests, a lattice-based tokenizer builds a lattice (a graph-like data structure) consisting of all possible tokens (terms or substrings) whose surfaces appear in the input text. It then uses the Viterbi algorithm to find the best-connected path through the lattice.
    (Lattice diagram for "東京都に住む": candidate tokens 東, 東京, 京, 京都, 都, に, 住む between BOS and EOS.)
  11. Building a lattice
    pub fn set_text(
        &mut self,
        dict: &PrefixDict,
        user_dict: &Option<PrefixDict>,
        char_definitions: &CharacterDefinitions,
        unknown_dictionary: &UnknownDictionary,
        text: &str,
        search_mode: &Mode,
    ) {
        for start in 0..text.len() {
            ...
            let suffix = &text[start..];
            let mut found: bool = false;
            for (prefix_len, word_entry) in dict.prefix(suffix) {
                let edge = Edge {
                    edge_type: EdgeType::KNOWN,
                    word_entry,
                    left_edge: None,
                    start_index: start as u32,
                    stop_index: (start + prefix_len) as u32,
                    path_cost: i32::max_value(),
                    kanji_only: is_kanji_only(&suffix[..prefix_len]),
                };
                self.add_edge_in_lattice(edge);
                found = true;
            }
            ...
        }
    }
    github.com/lindera-morphology/lindera/lindera-core/src/viterbi.rs
    The code uses dict.prefix() to look up, via a common prefix search, the terms in the term dictionary whose surfaces are prefixes of the remaining text. If the text is "東京都に住む", a lattice like the one below is created.
    (Lattice diagram for "東京都に住む": starting positions start=0 to start=5; candidate edges 東 (6245), 東京 (3003), 京 (10791), 京都 (2135), 都 (9428), に (4304), 住む (7048) between BOS and EOS, with connection costs such as -310, -368, -3838, -283, -9617, -3573, -3547, -409 on the transitions. NOTE: This lattice is simplified.)
  12. Calculate path costs
    pub fn calculate_path_costs(
        &mut self,
        cost_matrix: &ConnectionCostMatrix,
        mode: &Mode,
    ) {
        let text_len = self.starts_at.len();
        for i in 0..text_len {
            let left_edge_ids = &self.ends_at[i];
            let right_edge_ids = &self.starts_at[i];
            for &right_edge_id in right_edge_ids {
                let right_word_entry = self.edge(right_edge_id).word_entry;
                let best_path = left_edge_ids
                    .iter()
                    .cloned()
                    .map(|left_edge_id| {
                        let left_edge = self.edge(left_edge_id);
                        let mut path_cost = left_edge.path_cost
                            + cost_matrix.cost(
                                left_edge.word_entry.right_id(),
                                right_word_entry.left_id());
                        (path_cost, left_edge_id)
                    })
                    .min_by_key(|&(cost, _)| cost);
                if let Some((best_cost, best_left)) = best_path {
                    let edge = &mut self.edges[right_edge_id.0 as usize];
                    edge.left_edge = Some(best_left);
                    edge.path_cost = right_word_entry.word_cost as i32 + best_cost;
                }
            }
        }
    }
    github.com/lindera-morphology/lindera/lindera-core/src/viterbi.rs
    The lower the term cost and the connection cost, the more likely it is that the term or the connection will appear. The most Japanese-like sequence of tokens (the "best" path) is found by computing the minimum cumulative cost of the edges from BOS to EOS, and then tracing the lowest-cost edges back from EOS to BOS.
    (Lattice diagram for "東京都に住む" with cumulative path costs such as 5962, 2693, 7729, 16385, 2504, 3252, 6736, 6327, 6768, 8195 annotated on the edges. NOTE: This lattice is simplified.)
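    The backtracking step described above can be sketched as follows; the types and field names are assumed to match the Edge struct shown earlier, not Lindera's actual implementation.

    // Illustrative backtracking: starting from the EOS edge, follow each
    // edge's left_edge pointer (filled in by calculate_path_costs) back to
    // BOS, then reverse to get the best token sequence in reading order.
    // left_edge is assumed to be an Option<usize> index here.
    fn best_path(edges: &[Edge], eos_index: usize) -> Vec<usize> {
        let mut path = Vec::new();
        let mut current = Some(eos_index);
        while let Some(index) = current {
            path.push(index);
            current = edges[index].left_edge;
        }
        path.reverse();
        path
    }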
  13. History of Japanese morphological analyzers
    It started with JUMAN, which was developed at Kyoto University. Projects implementing morphological analyzers in several programming languages continue to be launched, and many of them are influenced by JUMAN, ChaSen, and MeCab. More recently, Kuromoji, which was also contributed to the Lucene project, has had a great influence, and many self-contained (dictionary-embedded) morphological analyzers derived from it have appeared.
  14. A morphological analyzer is used as a tokenizer
    The analysis chain consists of three stages:
    - Character Filter: character normalization
    - Tokenizer: tokenizing with a morphological analyzer
    - Token Filter: token processing, removal, etc.
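    To illustrate the shape of this chain (purely a sketch; none of these function names come from a real library), the three stages can be composed as plain functions:

    // Illustrative analysis chain: character filter -> tokenizer -> token filter.
    fn char_filter(text: &str) -> String {
        // e.g. normalize old kanji forms and half-width/full-width characters
        text.to_string()
    }

    fn tokenize(text: &str) -> Vec<String> {
        // e.g. run a morphological analyzer such as Lindera here
        text.split_whitespace().map(|t| t.to_string()).collect()
    }

    fn token_filter(tokens: Vec<String>) -> Vec<String> {
        // e.g. drop particles and stop words, lemmatize verbs
        tokens
    }

    fn analyze(text: &str) -> Vec<String> {
        token_filter(tokenize(&char_filter(text)))
    }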
  15. Character Filter
    A Character Filter is preprocessing for tokenization. Examples:
    - Old character forms (Kyūjitai) to new character forms (Shinjitai) (e.g. 圓 → 円). In searching, it is desirable to treat new and old kanji the same way, so they are normalized so that they can be matched without being distinguished.
    - Halfwidth to fullwidth (e.g. ア → ア). Half-width Katakana characters were widely used in old computer systems to reduce the amount of data, and they are still sometimes seen in administrative documents. It is common to convert half-width Katakana to full-width Katakana.
    - Fullwidth to halfwidth (e.g. 9 → 9). Full-width alphabets and numbers are frequently seen in old Japanese documents. Even today, some documents are created without switching the IME, so alphabets and numbers that should be entered as half-width characters are unintentionally entered as full-width characters, just like other Japanese characters. These are generally converted to half-width characters. (A sketch of such a filter follows this list.)
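    Here is a minimal character-filter sketch for the fullwidth-to-halfwidth case. Full-width ASCII letters, digits, and symbols (U+FF01..U+FF5E) map to their half-width equivalents by subtracting a fixed offset of 0xFEE0; the old-to-new kanji and half-width Katakana conversions are table lookups and are omitted here.

    // Convert full-width ASCII forms to half-width by code point offset.
    fn fullwidth_to_halfwidth(text: &str) -> String {
        text.chars()
            .map(|c| match c as u32 {
                0xFF01..=0xFF5E => char::from_u32(c as u32 - 0xFEE0).unwrap(),
                0x3000 => ' ', // ideographic space -> ASCII space
                _ => c,
            })
            .collect()
    }

    fn main() {
        println!("{}", fullwidth_to_halfwidth("ＡＢＣ１２３")); // ABC123
    }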
  16. Token Filter
    A Token Filter is post-processing after tokenization. Examples:
    - Remove unnecessary parts of speech and stop words. Particles are tokens that are not needed in search, so they are generally removed; few people want to search for the word "は" in the sentence below.
      E.g. 今日 / は / 雨 / が / 降る → 今日 / 雨 / 降る
    - Stemming and lemmatization. The rule of omitting the long vowel mark (U+30FC, "ー") at the end of words is based on JIS (Japanese Industrial Standards) Z 8301, in which the long vowel mark at the end of words of four or more characters is omitted as a rule; most academic papers follow this standard. (A sketch of this rule follows this list.)
      E.g. ユーザー → ユーザ
      It is also common to lemmatize tokens in order to unify verbs into their base forms.
      E.g. 雨 / が / 降っ / た → 雨 / が / 降る / た
    - Kanji numerals to Arabic numerals. In some texts and addresses, numbers are written in Kanji numerals; it is common to normalize these to Arabic numerals.
      E.g. 二千二十一 → 2021 ✔
      However, it is best to avoid converting kanji numerals in people's names to Arabic numerals.
      E.g. 鈴木一郎 → 鈴木1郎 ✘
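    A sketch of the long-vowel-mark rule mentioned above (illustrative only; production token filters implement more nuanced conditions):

    // Drop a trailing long vowel mark "ー" (U+30FC) from tokens of four or
    // more characters, following the JIS Z 8301 rule described on this slide.
    fn strip_trailing_long_vowel(token: &str) -> String {
        let chars: Vec<char> = token.chars().collect();
        if chars.len() >= 4 && chars[chars.len() - 1] == 'ー' {
            chars[..chars.len() - 1].iter().collect()
        } else {
            token.to_string()
        }
    }

    fn main() {
        assert_eq!(strip_trailing_long_vowel("ユーザー"), "ユーザ");
        assert_eq!(strip_trailing_long_vowel("キー"), "キー"); // too short, kept as is
    }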
  17. Maintain a dictionary
    "外国人参政権"
    外国 / 人 / 参政 / 権 ← Expected
    We expect it to be tokenized as "外国 / 人 / 参政 / 権", meaning voting rights for foreign residents.
    外国 / 人参 / 政権 ← Actual
    Unfortunately, the actual result is completely unintended: "外国" (foreign country), "人参" (carrot), and "政権" (regime). The intended tokenization is not produced because the dictionary is not well maintained.
  18. Multiple interpretations
    "こちらが営業部長谷川です"
    こちら / が / 営業 / 部 / 長谷川 / です (This is Hasegawa from the sales department.)
    こちら / が / 営業 / 部長 / 谷川 / です (This is Tanigawa, the sales manager.)
    There are cases where multiple interpretations are possible. Only the person who wrote the document knows the correct answer.
  19. Why can't we just use n-gram?
    For example, imagine a search site for local information, and tokenize the following two addresses with 2-grams:
    Address 1: 京都府京都市下京区東塩小路高倉町 8-3 → 京都 / 都府 / 府京 / 京都 / 都市 / 市下 / …
    Address 2: 東京都港区六本木6-10-1六本木ヒルズ森タワー → 東京 / 京都 / 都港 / 港区 / 区六 / 六本 / …
    The problem is that a search for the keyword "京都" (Kyoto), intended to find a facility in Kyoto, will also hit the facility in 東京 (Tokyo), because the 2-gram "京都" also appears inside "東京都". Simply applying n-grams improves recall, but it may also reduce precision.
    (Map: 東京 (Tokyo) and 京都 (Kyoto) are about 513.6 km apart.)
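    A simple character 2-gram tokenizer reproduces the problem:

    // Character 2-gram tokenizer: both addresses below produce the bigram
    // "京都", so a query for Kyoto also matches the Tokyo address.
    fn bigrams(text: &str) -> Vec<String> {
        let chars: Vec<char> = text.chars().collect();
        chars.windows(2).map(|w| w.iter().collect::<String>()).collect()
    }

    fn main() {
        let kyoto = bigrams("京都府京都市下京区東塩小路高倉町");
        let tokyo = bigrams("東京都港区六本木");
        assert!(kyoto.contains(&"京都".to_string()));
        assert!(tokyo.contains(&"京都".to_string()));
        println!("{:?}", &tokyo[..4]); // ["東京", "京都", "都港", "港区"]
    }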
  20. Conclusion
    - Morphological analyzers are an important component of Japanese search engines.
    - Since there are many problems that cannot be solved by a tokenizer alone, various Character Filters and Token Filters are also needed.
    - No matter how good a search engine is, if it cannot handle Japanese well, it will unfortunately not be used in Japan.
    - For Japanese, it is common to use a combination of n-grams and morphological analysis.
    - With the advent of new natural language processing models such as BERT, new mechanisms may emerge for morphological analysis as well.
  21. References
    - JUMAN - KUROHASHI-CHU-MURAWAKI LAB (kyoto-u.ac.jp)
    - chasen legacy -- an old morphological analyzer (osdn.jp)
    - MeCab: Yet Another Part-of-Speech and Morphological Analyzer (taku910.github.io)
    - Wayback Machine (archive.org)
    - The Past, Present, and Future of Morphological Analysis [形態素解析の過去・現在・未来] (slideshare.net)
    - A Peek Behind Japanese Morphological Analysis: How MeCab Analyzes Morphemes [日本語形態素解析の裏側を覗く! MeCab はどのように形態素解析しているか] - Cookpad Developer Blog (cookpad.com)
    - PyCon JP 2015 - Learning Morphological Analysis by Building It in Python [Python で作って学ぶ形態素解析] (slideshare.net)
    - How Japanese Tokenizers Work. A deep dive into Japanese tokenization… by Wanasit Tanakitrungruang (Towards Data Science)
    - 17th Lucene/Solr Meetup #SolrJP - Issues with Morphological Analysis in Apache Lucene/Solr and an N-best Proposal [Apache Lucene Solrによる形態素解析の課題と N-bestの提案] (slideshare.net)
    - The Challenges of Chinese and Japanese Searching - hybrismart | SAP Commerce Cloud under the hood | SAP hybris