Slide 1

Slide 1 text

The Importance of Morphological Analysis in Japanese Search Engines

Slide 2

Slide 2 text

Features of Japanese Language

Slide 3

Slide 3 text

Japanese characters
- Kanji (漢字): Kanji - Wikipedia (Kyūjitai - Wikipedia, Shinjitai - Wikipedia)
Kanji characters were first introduced to the Japanese archipelago from China around 2,400 to 1,700 years ago. People in the Japanese archipelago wrote things down by applying Kanji that had the same meanings as their own words.
- Hiragana (平仮名): Hiragana - Wikipedia, Hiragana (Unicode block) - Wikipedia
Hiragana is a simplified form of Man'yogana (万葉仮名), which used Kanji characters to represent the sounds of Japanese.
- Katakana (片仮名): Katakana - Wikipedia, Katakana (Unicode block) - Wikipedia
Katakana also originated from Man'yogana (万葉仮名) at about the same time as Hiragana. While Hiragana is a simplified form of whole Man'yogana characters, Katakana was created by extracting parts of them. Japanese has many loanwords from foreign languages, and Katakana characters are often used to imitate their sounds.
- Halfwidth and Fullwidth: Halfwidth and fullwidth forms - Wikipedia, Halfwidth and Fullwidth Forms (Unicode block) - Wikipedia
Old computer systems used a DBCS (Double-Byte Character Set), which used twice as many bytes per character as an SBCS (Single-Byte Character Set), to make Japanese text prettier and more readable. Full-width alphabets, numbers, and symbols are also included because some computer systems did not allow full-width and half-width characters to be placed on the same line.

Slide 4

Slide 4 text

Japanese sentence
A Japanese sentence is a segment of text terminated by "。".
- "。" (Kuten) corresponds to "." (Period).
- "、" (Touten) corresponds to "," (Comma).
今日は雨が降ると思うよ。 (I think that it will rain today.)

Slide 5

Slide 5 text

Japanese clause
A clause is the shortest segment into which a sentence can be divided without breaking its meaning. For example, the sentence "今日は雨が降ると思うよ" can be broken down into clauses as follows:
今日は / 雨が / 降ると / 思うよ。 (I think that it will rain today.)

Slide 6

Slide 6 text

Japanese word
The smallest unit that has meaning and cannot be broken down further is called a word. For example, the previous sentence, "今日は雨が降ると思うよ", can be broken down into words as follows:
今日 / は / 雨 / が / 降る / と / 思う / よ / 。 (I think that it will rain today.)
Think about the word "今日" here. The word "今日" means "today" only when "今" and "日" are combined. In short, if you split "今日" into "今" and "日", the meaning changes: "今" means "now", and "日" means "day", "sun", and so on. Both characters have many more meanings than these.

Slide 7

Slide 7 text

Japanese Morphological Analysis

Slide 8

Slide 8 text

What is morphological analysis?
Japanese text is split into words with a morphological analyzer because, unlike English and many other languages, words are not clearly separated by white space. Morphological analysis is the process of dividing natural language text, which carries no explicit grammatical information, into an array of morphemes (roughly, the smallest units of meaning in a language), based on information such as the parts of speech of words and the grammar of the target language, collectively called a dictionary, and then determining the part of speech of each morpheme.
今日 (Noun) / は (Particle) / 雨 (Noun) / が (Particle) / 降る (Verb) / と (Particle) / 思う (Verb) / よ (Particle) / 。 (Symbol)
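To make the output concrete, here is a minimal, self-contained Rust sketch (not the API of any particular analyzer; the struct and values are illustrative only) that models the result of analyzing the sentence above as a list of morphemes with their parts of speech:

// Minimal sketch of a morphological analysis result (illustrative only).
struct Morpheme {
    surface: &'static str,        // the string as it appears in the text
    part_of_speech: &'static str, // part of speech assigned by the analyzer
}

fn main() {
    // Expected analysis of "今日は雨が降ると思うよ。"
    let morphemes = [
        Morpheme { surface: "今日", part_of_speech: "Noun" },
        Morpheme { surface: "は", part_of_speech: "Particle" },
        Morpheme { surface: "雨", part_of_speech: "Noun" },
        Morpheme { surface: "が", part_of_speech: "Particle" },
        Morpheme { surface: "降る", part_of_speech: "Verb" },
        Morpheme { surface: "と", part_of_speech: "Particle" },
        Morpheme { surface: "思う", part_of_speech: "Verb" },
        Morpheme { surface: "よ", part_of_speech: "Particle" },
        Morpheme { surface: "。", part_of_speech: "Symbol" },
    ];
    for m in &morphemes {
        println!("{}\t{}", m.surface, m.part_of_speech);
    }
}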

Slide 9

Slide 9 text

Dictionary
We often use dictionaries maintained by NAIST (Nara Institute of Science and Technology) and NINJAL (National Institute for Japanese Language and Linguistics). Recently, Works Applications Co., Ltd. has started providing Japanese language resources as OSS.
How are the dictionaries and language models maintained?
- Maintained by manpower
- Supervised machine learning from a corpus (training data)
The maintenance of such language resources is very hard.

Slide 10

Slide 10 text

Dictionary format
A dictionary is a data structure that provides information about the available terms as well as how likely those terms are to appear next to each other according to Japanese grammar or probability. Let's look into the IPADIC dictionary format. The important part is the first four columns (Surface, Left Context ID, Right Context ID, Cost). After that, metadata such as the part of speech, reading, and pronunciation of the term is described.
原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ
大倉谷地,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,大倉谷地,オオクラヤチ,オークラヤチ
駒ケ崎,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,駒ケ崎,コマガサキ,コマガサキ
里本江,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,里本江,サトホンゴ,サトホンゴ
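As a rough sketch (assuming plain comma-separated lines with no quoting, as in the sample above), each entry can be parsed into its first four columns plus the remaining metadata like this:

// Minimal sketch: parse an IPADIC-style CSV line (illustrative only).
#[derive(Debug)]
struct DictEntry {
    surface: String,
    left_context_id: u16,
    right_context_id: u16,
    cost: i32,
    details: Vec<String>, // part of speech, reading, pronunciation, ...
}

fn parse_line(line: &str) -> Option<DictEntry> {
    let fields: Vec<&str> = line.split(',').collect();
    if fields.len() < 4 {
        return None;
    }
    Some(DictEntry {
        surface: fields[0].to_string(),
        left_context_id: fields[1].parse().ok()?,
        right_context_id: fields[2].parse().ok()?,
        cost: fields[3].parse().ok()?,
        details: fields[4..].iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    let line = "原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ";
    println!("{:?}", parse_line(line));
}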

Slide 11

Slide 11 text

Dictionary details
Surface: The string (term) as it should appear in the text.
Left Context ID / Right Context ID: The context IDs used when the term is referenced from its left or right neighbor. The IDs registered in the term dictionary are used as keys when looking up the term connection matrix.
Cost: The cost indicates how likely the term is to appear. The smaller the cost, the more frequently the term appears or is used.
These entries are read to build a term dictionary.
原村,1293,1293,8684,名詞,固有名詞,地域,一般,*,*,原村,ハラムラ,ハラムラ
大倉谷地,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,大倉谷地,オオクラヤチ,オークラヤチ
駒ケ崎,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,駒ケ崎,コマガサキ,コマガサキ
里本江,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,里本江,サトホンゴ,サトホンゴ
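To illustrate how such a term dictionary is consulted during tokenization, here is a minimal Rust sketch of a common prefix search over surface forms (real analyzers use a trie or finite state transducer rather than a linear scan):

// Minimal sketch of a common prefix search (illustrative only).
// Returns all dictionary surfaces that are a prefix of the given text.
fn common_prefix_search<'a>(surfaces: &[&'a str], text: &str) -> Vec<&'a str> {
    surfaces
        .iter()
        .copied()
        .filter(|surface| text.starts_with(surface))
        .collect()
}

fn main() {
    let surfaces = ["東", "東京", "京", "京都", "都", "に", "住む"];
    // Dictionary terms that start at the beginning of "東京都に住む".
    println!("{:?}", common_prefix_search(&surfaces, "東京都に住む")); // ["東", "東京"]
}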

Slide 12

Slide 12 text

Connection matrix
As with the term cost, the smaller the connection cost, the more likely it is that a given pair of terms will be connected. The matrix.def file in the IPADIC dictionary represents that mapping. The first line gives the number of Right context IDs and Left context IDs (each has 1,316 IDs), followed by lines containing a Left context ID, a Right context ID, and a Cost.
1316 1316
0 0 -434
0 1 1
0 2 -1630
…
1315 1313 -4369
1315 1314 -1712
1315 1315 -129
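As a rough sketch (following the column order described above; the indexing convention in real implementations may differ), matrix.def can be loaded into a flat cost table and queried by a pair of context IDs like this:

// Minimal sketch: load a matrix.def-style text into a cost table (illustrative only).
struct ConnectionMatrix {
    num_right: usize,
    costs: Vec<i32>,
}

impl ConnectionMatrix {
    fn parse(def: &str) -> Option<ConnectionMatrix> {
        let mut lines = def.lines();
        let mut header = lines.next()?.split_whitespace();
        let num_right: usize = header.next()?.parse().ok()?;
        let num_left: usize = header.next()?.parse().ok()?;
        let mut costs = vec![0i32; num_left * num_right];
        for line in lines {
            let mut cols = line.split_whitespace();
            let left_id: usize = cols.next()?.parse().ok()?;
            let right_id: usize = cols.next()?.parse().ok()?;
            let cost: i32 = cols.next()?.parse().ok()?;
            costs[left_id * num_right + right_id] = cost;
        }
        Some(ConnectionMatrix { num_right, costs })
    }

    // Look up the connection cost for the given (left, right) context ID pair.
    fn cost(&self, left_id: usize, right_id: usize) -> i32 {
        self.costs[left_id * self.num_right + right_id]
    }
}

fn main() {
    let def = "2 2\n0 0 -434\n0 1 1\n1 0 -10\n1 1 5";
    let matrix = ConnectionMatrix::parse(def).unwrap();
    println!("{}", matrix.cost(0, 0)); // -434
}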

Slide 13

Slide 13 text

Building a lattice and choosing the best path
Most Japanese tokenizers use lattice-based tokenization. As the name suggests, a lattice-based tokenizer builds a lattice (a graph-like data structure) consisting of all possible tokens (terms or substrings) that appear in the input text. It then uses the Viterbi algorithm to find the best-connected path through the lattice.
(Lattice diagram for "東京都に住む": the candidate tokens 東, 京, 都, に, 住む, 京都, and 東京 form paths between BOS and EOS.)

Slide 14

Slide 14 text

Building a lattice

pub fn set_text(
    &mut self,
    dict: &PrefixDict,
    user_dict: &Option<PrefixDict>,
    char_definitions: &CharacterDefinitions,
    unknown_dictionary: &UnknownDictionary,
    text: &str,
    search_mode: &Mode,
) {
    for start in 0..text.len() {
        ...
        let suffix = &text[start..];
        let mut found: bool = false;
        for (prefix_len, word_entry) in dict.prefix(suffix) {
            let edge = Edge {
                edge_type: EdgeType::KNOWN,
                word_entry,
                left_edge: None,
                start_index: start as u32,
                stop_index: (start + prefix_len) as u32,
                path_cost: i32::max_value(),
                kanji_only: is_kanji_only(&suffix[..prefix_len]),
            };
            self.add_edge_in_lattice(edge);
            found = true;
        }
        ...
    }
}

The sample code uses dict.prefix() to look up terms in the term dictionary whose surface is a prefix of the remaining text (a common prefix search). If the text is "東京都に住む", a lattice like the one below will be created.
(Lattice diagram: between BOS and EOS, candidate edges start at each character position (start=0 東, start=1 京, start=2 都, start=3 に, start=4 住, start=5 む), with term costs 東 6245, 京 10791, 都 9428, に 4304, 住む 7048, 京都 2135, 東京 3003 and connection costs such as -310, -368, -3838, -283, -9617, -3573, -3547, -409 on the edges.)
github.com/lindera-morphology/lindera/lindera-core/src/viterbi.rs
NOTE: This lattice is simplified.

Slide 15

Slide 15 text

Calculate path costs

pub fn calculate_path_costs(&mut self, cost_matrix: &ConnectionCostMatrix, mode: &Mode) {
    let text_len = self.starts_at.len();
    for i in 0..text_len {
        let left_edge_ids = &self.ends_at[i];
        let right_edge_ids = &self.starts_at[i];
        for &right_edge_id in right_edge_ids {
            let right_word_entry = self.edge(right_edge_id).word_entry;
            let best_path = left_edge_ids
                .iter()
                .cloned()
                .map(|left_edge_id| {
                    let left_edge = self.edge(left_edge_id);
                    let mut path_cost = left_edge.path_cost
                        + cost_matrix.cost(
                            left_edge.word_entry.right_id(),
                            right_word_entry.left_id());
                    (path_cost, left_edge_id)
                })
                .min_by_key(|&(cost, _)| cost);
            if let Some((best_cost, best_left)) = best_path {
                let edge = &mut self.edges[right_edge_id.0 as usize];
                edge.left_edge = Some(best_left);
                edge.path_cost = right_word_entry.word_cost as i32 + best_cost;
            }
        }
    }
}

The lower the cost, for both the term cost and the connection cost, the more likely it is that the term or connection will appear. The most Japanese-like sequence of tokens (the "best" path) can be found by calculating the minimum cumulative cost of the edges from BOS to EOS, and then tracing the edges with the minimum cumulative cost back from EOS to BOS.
github.com/lindera-morphology/lindera/lindera-core/src/viterbi.rs
(Lattice diagram: the same simplified lattice as on the previous slide, with term costs 東 6245, 京 10791, 都 9428, に 4304, 住む 7048, 京都 2135, 東京 3003, connection costs on the edges, and cumulative path costs such as 5962, 2693, 7729, 16385, 2504, 3252, 6736, 6327, 6768, 8195 at each node.)
NOTE: This lattice is simplified.

Slide 16

Slide 16 text

History of Japanese morphological analyzers
Japanese morphological analyzers started with JUMAN, which was developed at Kyoto University. Currently, projects are being launched to implement morphological analyzers in several programming languages. Many morphological analyzers are influenced by JUMAN, ChaSen, and MeCab. Recently, Kuromoji, which has also been contributed to the Lucene project, has had a great influence, and many self-contained (dictionary-embedded) morphological analyzers derived from it have appeared.

Slide 17

Slide 17 text

How morphological analyzers are used in Japanese search engines

Slide 18

Slide 18 text

Morphological analyzers are used as tokenizers
- Character Filter: character normalization
- Tokenizer: tokenizing with a morphological analyzer
- Token Filter: token processing, removal, etc.
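As a minimal sketch of this chain (the function names and rules below are hypothetical placeholders, not any specific library's API), the three stages can be thought of as a simple composition:

// Minimal sketch of the analysis chain (hypothetical function names, illustrative rules).
fn char_filter(text: &str) -> String {
    // Character normalization, e.g. old Kanji form to new form.
    text.replace('圓', "円")
}

fn tokenize(text: &str) -> Vec<String> {
    // Stand-in for a morphological analyzer; splits on whitespace for illustration.
    text.split_whitespace().map(|s| s.to_string()).collect()
}

fn token_filter(tokens: Vec<String>) -> Vec<String> {
    // Token processing, e.g. removing particles as stop words.
    tokens.into_iter().filter(|t| t.as_str() != "は" && t.as_str() != "が").collect()
}

fn analyze(text: &str) -> Vec<String> {
    token_filter(tokenize(&char_filter(text)))
}

fn main() {
    println!("{:?}", analyze("今日 は 雨 が 降る")); // ["今日", "雨", "降る"]
}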

Slide 19

Slide 19 text

Character Filter
Character Filters perform preprocessing before tokenization. Examples:
- Old character forms (Kyujitai) to new character forms (Shinjitai) (E.g. 圓 → 円)
In searching, it is desirable that old and new Kanji forms can be searched in the same way, so they are normalized so that they can be matched without distinguishing between them.
- Halfwidth to Fullwidth (E.g. ｱ → ア)
Half-width Katakana characters were widely used in old computer systems to reduce the amount of data, and they are still sometimes used in administrative documents. It is common to convert half-width Katakana to full-width Katakana.
- Fullwidth to Halfwidth (E.g. ９ → 9)
Full-width alphabets and numbers are also frequently seen in old Japanese documents. Even today, some documents are created without switching the IME, so alphabets and numbers that should be input as half-width characters are unintentionally input as full-width characters, just like other Japanese characters. These are generally converted to half-width characters.
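As a minimal sketch of these normalizations (covering only a handful of characters for illustration; real character filters use complete mapping tables or Unicode normalization such as NFKC):

// Minimal character filter sketch (illustrative only; real filters use
// complete mapping tables or Unicode NFKC normalization).
fn character_filter(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '圓' => '円', // Kyujitai -> Shinjitai
            'ｱ' => 'ア',  // half-width Katakana -> full-width Katakana
            '０'..='９' => {
                // full-width digit -> half-width digit
                char::from_u32(c as u32 - '０' as u32 + '0' as u32).unwrap()
            }
            _ => c,
        })
        .collect()
}

fn main() {
    println!("{}", character_filter("圓ｱ９")); // 円ア9
}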

Slide 20

Slide 20 text

Token Filter
Token Filters perform post-processing after tokenization. Examples:
- Remove unnecessary parts of speech and stop words
Particles are tokens that are not needed in search, so they are generally removed. Few people want to search for the word "は" in the sentence below.
E.g. 今日 / は / 雨 / が / 降る → 今日 / 雨 / 降る
- Stemming and lemmatization
The rule of omitting the long vowel mark (U+30FC, "ー") at the end of a word is based on JIS (Japanese Industrial Standards) Z8301, in which the long vowel mark at the end of words of four or more characters is omitted as a rule. Most academic papers follow this standard.
E.g. ユーザー → ユーザ
It is also common to lemmatize tokens in order to unify verbs into their base forms.
E.g. 雨 / が / 降っ / た → 雨 / が / 降る / た
- Kanji numerals to Arabic numerals
In some texts and addresses, numbers are written in Kanji numerals. It is also common to normalize these to Arabic numerals.
E.g. 二千二十一 → 2021 ✔
However, it is best to avoid converting Kanji numerals in people's names to Arabic numerals.
E.g. 鈴木一郎 → 鈴木1郎 ✘
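As a minimal sketch of the first two filters (stop word removal and trimming a trailing long vowel mark), assuming a simplified stop word list and length rule; real filters are driven by the part-of-speech output of the analyzer:

// Minimal token filter sketch (illustrative only).
fn token_filter(tokens: &[&str]) -> Vec<String> {
    // A tiny stop word list standing in for particle/POS-based removal.
    let stop_words = ["は", "が", "と", "よ"];
    let mut result = Vec::new();
    for token in tokens {
        // Remove stop words.
        if stop_words.contains(token) {
            continue;
        }
        // Trim a trailing long vowel mark from words of four or more characters (JIS Z8301 style).
        let chars: Vec<char> = token.chars().collect();
        if chars.len() >= 4 && chars[chars.len() - 1] == 'ー' {
            result.push(chars[..chars.len() - 1].iter().collect());
        } else {
            result.push(token.to_string());
        }
    }
    result
}

fn main() {
    println!("{:?}", token_filter(&["今日", "は", "ユーザー", "が", "降る"]));
    // ["今日", "ユーザ", "降る"]
}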

Slide 21

Slide 21 text

Difficulties in Morphological Analysis

Slide 22

Slide 22 text

Maintaining a dictionary
"外国人参政権"
外国 / 人 / 参政 / 権 ← Expected
We expect it to be tokenized as "外国 / 人 / 参政 / 権", which means voting rights for foreign residents.
外国 / 人参 / 政権 ← Actual
Unfortunately, the actual result is the completely unintended "外国" (foreign country) / "人参" (carrot) / "政権" (regime). The intended tokenization is not produced when the dictionary is not well maintained.

Slide 23

Slide 23 text

Multiple interpretations
"こちらが営業部長谷川です"
こちら / が / 営業 / 部 / 長谷川 / です (This is Hasegawa from the sales department.)
こちら / が / 営業 / 部長 / 谷川 / です (This is Tanigawa, the sales manager.)
There are cases where multiple interpretations are possible. Only the person who wrote the document knows the correct answer.

Slide 24

Slide 24 text

Why can't we just use n-grams?
For example, let's imagine a search site for local information.
Address 1: 京都府京都市下京区東塩小路高倉町 8-3 → 京都 / 都府 / 府京 / 京都 / 都市 / 市下 / …
Address 2: 東京都港区六本木6-10-1六本木ヒルズ森タワー → 東京 / 京都 / 都港 / 港区 / 区六 / 六本 / ...
Tokenize the above two addresses with character 2-grams. The problem is that the keyword "京都" (Kyoto), entered with the intention of finding a facility in Kyoto, will also hit a facility in "東京" (Tokyo). Simply applying n-grams improves recall, but may also reduce precision.
(Map: 東京 Tokyo and 京都 Kyoto are 513.6 km apart.)
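As a rough sketch, the character 2-gram tokenization above and the resulting false match can be reproduced like this:

// Minimal character 2-gram tokenizer (illustrative only).
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

fn main() {
    let tokyo_address = bigrams("東京都港区六本木");
    println!("{:?}", tokyo_address); // ["東京", "京都", "都港", "港区", "区六", "六本", "本木"]
    // The query "京都" (Kyoto) matches a token produced from a Tokyo address.
    println!("{}", tokyo_address.contains(&"京都".to_string())); // true
}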

Slide 25

Slide 25 text

Conclusion

Slide 26

Slide 26 text

Conclusion
- Morphological analyzers are an important component of Japanese search engines.
- Since there are many problems that cannot be solved by a tokenizer alone, various Character Filters and Token Filters are also needed.
- No matter how good a search engine is, if it cannot handle Japanese well, it will unfortunately not be used in Japan.
- For Japanese, it is common to use a combination of n-grams and morphological analysis.
- With the advent of new natural language processing models such as BERT, new mechanisms may emerge for morphological analysis as well.

Slide 27

Slide 27 text

Appendix

Slide 28

Slide 28 text

References
- JUMAN - KUROHASHI-CHU-MURAWAKI LAB (kyoto-u.ac.jp)
- chasen legacy -- an old morphological analyzer (osdn.jp)
- MeCab: Yet Another Part-of-Speech and Morphological Analyzer (taku910.github.io)
- Wayback Machine (archive.org)
- 形態素解析の過去・現在・未来 (The Past, Present, and Future of Morphological Analysis) (slideshare.net)
- 日本語形態素解析の裏側を覗く! MeCab はどのように形態素解析しているか (A Look Behind Japanese Morphological Analysis: How MeCab Works) - クックパッド開発者ブログ (Cookpad Developer Blog) (cookpad.com)
- Pyconjp2015 - Python で作って学ぶ形態素解析 (Learning Morphological Analysis by Building It in Python) (slideshare.net)
- How Japanese Tokenizers Work. A deep dive into Japanese tokenization… | by Wanasit Tanakitrungruang | Towards Data Science
- 第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題と N-bestの提案 (The 17th Lucene/Solr Meetup #SolrJP – Issues in Morphological Analysis with Apache Lucene/Solr and an N-best Proposal) (slideshare.net)
- The Challenges of Chinese and Japanese Searching – hybrismart | SAP Commerce Cloud under the hood | SAP hybris