?/文節, sentence/文
• part-of-speech/品詞, noun/名詞, verb/動詞, particle/助詞
• (a kind of Japanese) modifier/連体詞
• (one of the verb conjugation forms)/動詞未然形
• content word/自立語, function word/付属語
• consonant/子音, vowel/母音
• "school grammar"/学校文法
  – the grammar taught in Japanese schools. It is well known and the de facto standard.
the smallest component of a word, or other linguistic unit, that has semantic meaning.
• One word (or one token) is not always one morpheme.
  – using = use + ing, better = good + er
  – unbreakable = un + break + able
  – 聞かせたくなかったようだ = 聞く + せる + たい + ない + た + ようだ
(a department store)
• 江頭2:50 (a comedian)
• モーニング娘。/ 関ジャニ∞ (both singer groups)
• Gone With the Wind, The Sound of Music
• 平成13年9月11日のアメリカ合衆国において発生したテロリストによる攻撃等に対応して行われる国際連合憲章の目的達成のための諸外国の活動に対して我が国が実施する措置及び関連する国際連合決議等に基づく人道的措置に関する特別措置法(テロ対策特別措置法) (a law)
(mainly a sentence), morphological analysis generally performs the following tasks.
• word segmentation / 分かち書き
  – necessary when processing non-segmented languages such as Japanese. English does not need to be segmented, but some words need to be concatenated into one unit (e.g. New York).
• part-of-speech (POS) tagging / 品詞付与
  – determines the POS of each word. e.g., "like" has five parts of speech: noun, verb, adjective, preposition, conjunction.
• pronunciation tagging / 読みがな付与
  – optional.
In Japanese the three processes are conducted simultaneously.
is looked up.
  – 長岡技術科学大学 vs 長岡+技術+科学+大学
  – A morpheme is defined as an entry in the dictionary.
• A lattice structure (explained later) is built.
• The best path is searched for among the many paths in the lattice.
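The dictionary lookup and lattice construction steps above can be sketched as follows. This is only a minimal illustration, not the implementation of any particular analyzer: the function name `build_lattice`, the toy dictionary contents, and the input string are all assumptions made for the example.

```python
from collections import defaultdict

# Toy dictionary (hypothetical): surface form -> list of POS tags.
DICTIONARY = {
    "は": ["p"],
    "はな": ["n"],
    "はなみ": ["n"],
    "な": ["n"],
    "なみ": ["n"],
    "み": ["n"],
}

def build_lattice(text, dictionary):
    """Return {start: [(end, surface, pos), ...]} for every dictionary match."""
    lattice = defaultdict(list)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            # Every substring found in the dictionary becomes a lattice node.
            for pos in dictionary.get(surface, []):
                lattice[start].append((end, surface, pos))
    return dict(lattice)

lattice = build_lattice("はなみ", DICTIONARY)
for start in sorted(lattice):
    print(start, lattice[start])
```

Each node records where it starts and ends in the input, so a path through the lattice is a chain of nodes in which every node starts where the previous one ended.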
of two problems:
• a path search problem, given a lattice for the input
  – how to find the best path among a huge number of candidates
• lattice construction, i.e., term dictionary construction.
  – The latter problem includes deciding the unit of a term
  – and collecting unknown words.
to connect some words;
  – In English, consecutive articles (a / the) never appear.
  – In Japanese,
    • a verb in the -nai form + ます
    • a particle + an auxiliary verb
    • a noun + a prefix
• We may exclude some candidates by considering morpheme connectivity.
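Excluding candidates by connectivity could look roughly like the sketch below. The rule set `FORBIDDEN`, the POS tag names, and the function `is_connectable` are hypothetical; only the English article rule stated above is encoded, and a real analyzer would hold many more rules.

```python
# Hypothetical rule set: (left POS, right POS) pairs that must not be adjacent.
FORBIDDEN = {
    ("article", "article"),   # consecutive articles, e.g. "a the"
}

def is_connectable(path, forbidden=FORBIDDEN):
    """path is a list of (surface, pos); True if no adjacent pair is forbidden."""
    return all((left[1], right[1]) not in forbidden
               for left, right in zip(path, path[1:]))

candidates = [
    [("the", "article"), ("dog", "noun")],
    [("a", "article"), ("the", "article"), ("dog", "noun")],
]
# Keep only the candidate paths that violate no connectivity rule.
valid = [path for path in candidates if is_connectable(path)]
print(valid)
```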
(paths) that we cannot see one by one.
• The practical way is to use "knowledge."
• A heuristic, a term derived from AI, is a piece of knowledge that is considered to be true in many cases, but not always.
/ 最長一致法
  – "The longer, the better."
  – はなみ vs は/なみ
• minimum unit method / 文節数最小法
  – a method that selects the path with the fewest units.
  – は/な/み (3 units) vs はな/み (2 units)
• minimum cost method
  – a generalization of the method above. The main difficulty is deciding the costs.
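The first two heuristics above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the toy word set `WORDS` and the function names are made up for the example, and a "unit" is taken to be a single dictionary word.

```python
WORDS = {"は", "はな", "はなみ", "な", "なみ", "み"}  # toy dictionary (assumption)

def longest_match(text, words):
    """Greedy segmentation: at each position take the longest dictionary word."""
    result, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in words:
                result.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no dictionary word starts at position {i}")
    return result

def fewest_units(text, words):
    """Dynamic programming: a segmentation with the smallest number of units."""
    best = [None] * (len(text) + 1)   # best[i] = shortest segmentation of text[:i]
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in words:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]            # None if the text cannot be segmented

print(longest_match("はなみ", WORDS))
print(fewest_units("はなみ", {"は", "はな", "な", "み"}))
```

With the full toy dictionary, longest match picks the single word はなみ. With a dictionary containing only は, はな, な, and み, the fewest-units search prefers はな/み (2 units) over は/な/み (3 units).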
n
はなみ(花見), n
な(菜), n
なみ(波), n
み(身), n
みの(蓑), n
の, p
はる(春), n
はる(貼る), v

cost of words: c(v) = 3, c(n) = 2, c(p) = 1
cost for connection: c(*, *) = 1

Given the word dictionary and costs above, and an input expression, the problem is to find a path (i.e. a combination of words) with the lowest cost.
[Figure: first step of the lattice search. Legend: "best path so far"; 葉, n: ignored hereafter. Node costs shown: 2, 3.]
Costs from the beginning to は:
c(葉) = c(n) + c(*,*) = 2 + 1 = 3
c(は) = c(p) + c(*,*) = 1 + 1 = 2
The latter is selected and stored, as it has the lowest cost.
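The step above, repeated for every position of the input, amounts to a dynamic programming search. The sketch below uses the word costs and the connection cost given earlier; the function name `min_cost_path`, the entry list layout, and the test inputs are assumptions for illustration. Following the worked step, each word adds its POS cost plus one connection cost, and no extra cost is charged at the end of the input (also an assumption). Storing only one best path per position is sufficient here because the connection cost is the same for every pair of words.

```python
WORD_COST = {"v": 3, "n": 2, "p": 1}   # cost of words, as given in the text
CONNECTION_COST = 1                    # c(*, *) = 1 for every connection

# (surface, POS) entries; は as noun (葉) and particle come from the worked step.
ENTRIES = [
    ("は", "n"), ("は", "p"), ("はなみ", "n"), ("な", "n"), ("なみ", "n"),
    ("み", "n"), ("みの", "n"), ("の", "p"), ("はる", "n"), ("はる", "v"),
]

def min_cost_path(text, entries=ENTRIES):
    """Return (cost, path) of the cheapest segmentation, or None if none exists."""
    # best[i] = (cost, path) of the cheapest way to cover text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0, [])
    for start in range(len(text)):
        if best[start] is None:        # position not reachable by any path
            continue
        cost_so_far, path = best[start]
        for surface, pos in entries:
            if text.startswith(surface, start):
                end = start + len(surface)
                cost = cost_so_far + WORD_COST[pos] + CONNECTION_COST
                # Keep only the cheapest path reaching each position.
                if best[end] is None or cost < best[end][0]:
                    best[end] = (cost, path + [(surface, pos)])
    return best[len(text)]

print(min_cost_path("は"))
print(min_cost_path("はなみの"))
```

For the single character は, the search reproduces the calculation above: the particle reading (cost 2) is kept and the noun reading (cost 3) is discarded.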