
Natural Language Processing (2) Morphological analysis (1)


Kazuhide Yamamoto
Nagaoka University of Technology

自然言語処理研究室

September 20, 2013



Transcript

  1. 1 / 25 Natural Language Processing (2) Morphological Analysis (1)

    Kazuhide Yamamoto, Dept. of Electrical Engineering, Nagaoka University of Technology
  2. 2 / 25 FYI: some linguistic terms

    • letter/文字, word/単語, ?/文節, sentence/文
    • part-of-speech/品詞, noun/名詞, verb/動詞, particle/助詞
    • (a kind of Japanese) modifier/連体詞
    • (one of the verb conjugation forms)/動詞未然形
    • content word/自立語, function word/付属語
    • consonant/子音, vowel/母音
    • "school grammar"/学校文法
      – the grammar taught in Japanese schools. It is well known and the de facto standard.
  3. 3 / 25 Morpheme / 形態素

    • A morpheme is the smallest component of a word, or other linguistic unit, that has semantic meaning.
    • One word (or one token) is not always one morpheme.
      – using = use + ing, better = good + er
      – unbreakable = un + break + able
      – 聞かせたくなかったようだ = 聞く + せる + たい + ない + た + ようだ
  4. 4 / 25 Each of the following consists of one morpheme.

    • 109 (a department store)
    • 江頭2:50 (a comedian)
    • モーニング娘。 / 関ジャニ∞ (both singing groups)
    • Gone With the Wind, The Sound of Music
    • 平成13年9月11日のアメリカ合衆国において発生したテロリストによる攻撃等に対応して行われる国際連合憲章の目的達成のための諸外国の活動に対して我が国が実施する措置及び関連する国際連合決議等に基づく人道的措置に関する特別措置法 (テロ対策特別措置法) (a law)
  5. 7 / 25 Morphological analysis / 形態素解析

    Given an expression (mainly a sentence), morphological analysis generally performs the following tasks.
    • word segmentation / 分かち書き
      – necessary when processing non-segmented languages such as Japanese. English does not need to be segmented, but some words must be concatenated (e.g. New York).
    • part-of-speech (POS) tagging / 品詞付与
      – determines the POS of each word. E.g., "like" has five parts of speech: noun, verb, adjective, preposition, conjunction.
    • pronunciation tagging / 読みがな付与
      – optional.
    In Japanese the three processes are conducted simultaneously.
  6. 8 / 25 Morphological analysis: procedure

    • A word dictionary is looked up.
      – 長岡技術科学大学 vs 長岡 + 技術 + 科学 + 大学
      – A morpheme is defined as an entry in the dictionary.
    • A lattice structure (explained later) is built.
    • The best path is searched for among the many paths in the lattice.
  7. 9 / 25 Procedure (cont'd)

    Morphological analysis in general consists of two problems:
    • a path search problem, given a lattice built from the input
      – how to find the best path among a huge number of candidates
    • lattice construction, i.e., term dictionary construction
      – This problem includes deciding the unit of a term and collecting unknown words.
  8. 10 / 25 Lattice structure

    • directed graph / 有向グラフ
    • partial order relation / 半順序関係
    • A path corresponds to an answer candidate.
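As a rough illustration of how such a lattice might be built by dictionary lookup, the sketch below represents the lattice as a list of directed edges between character positions. The toy dictionary, the tuple layout, and the name `build_lattice` are assumptions made for this example, not part of the original slides.

```python
# Toy dictionary: reading -> list of (written form, POS) entries.
# These entries are illustrative assumptions, not a real lexicon.
DICTIONARY = {
    "は": [("葉", "n"), ("は", "p")],
    "はな": [("花", "n")],
    "はなみ": [("花見", "n")],
    "な": [("菜", "n")],
    "なみ": [("波", "n")],
    "み": [("身", "n")],
}


def build_lattice(text, dictionary):
    """Return one edge (start, end, written_form, pos) per dictionary match.

    Every edge points from a smaller position to a larger one, so the
    result is a directed acyclic graph over character positions.
    """
    edges = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for written, pos in dictionary.get(text[start:end], []):
                edges.append((start, end, written, pos))
    return edges


lattice = build_lattice("はなみ", DICTIONARY)
for edge in lattice:
    print(edge)
```

Any sequence of edges that leads from position 0 to the end of the input is one answer candidate.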
  9. 11 / 25 Morpheme connectivity

    • Some words can never be connected:
      – In English, consecutive articles (a / the) never appear.
      – In Japanese, the following never appear:
        • a verb in the -nai form + ます
        • a particle + an auxiliary verb
        • a noun + a prefix
    • We may exclude some candidates by considering morpheme connectivity.
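One simple way to picture this kind of filtering is a table of forbidden part-of-speech pairs that is checked for every adjacent pair in a candidate. The sketch below is only an illustration: the tag names, the `FORBIDDEN` table, and the helper functions are hypothetical, and real analyzers typically use much finer-grained connection rules or costs.

```python
# Forbidden (left POS, right POS) pairs. The tag names are hypothetical
# labels chosen for this sketch.
FORBIDDEN = {
    ("article", "article"),
    ("particle", "auxiliary_verb"),
    ("noun", "prefix"),
}


def is_connectable(left_pos, right_pos):
    """Return True if the two POS tags may appear next to each other."""
    return (left_pos, right_pos) not in FORBIDDEN


def path_is_valid(pos_sequence):
    """Return True if no adjacent pair in the sequence is forbidden."""
    return all(
        is_connectable(left, right)
        for left, right in zip(pos_sequence, pos_sequence[1:])
    )


print(path_is_valid(["article", "noun", "verb"]))
print(path_is_valid(["article", "article", "noun"]))
```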
  10. 12 / 25 Search algorithm

    • The task of morphological analysis is to find the best path given a lattice.
    • Two search strategies:
      – depth-first search / 深さ優先探索
      – breadth-first search / 幅優先探索
    • Backtracking
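To make the depth-first strategy with backtracking concrete, the following sketch enumerates every path for a short input. It assumes a toy dictionary and a hypothetical helper `enumerate_paths`; it lists candidates exhaustively and does not yet pick a best one.

```python
# Toy dictionary: reading -> list of (written form, POS) entries.
# These entries are illustrative assumptions, not a real lexicon.
DICTIONARY = {
    "は": [("葉", "n"), ("は", "p")],
    "はな": [("花", "n")],
    "はなみ": [("花見", "n")],
    "な": [("菜", "n")],
    "なみ": [("波", "n")],
    "み": [("身", "n")],
}


def enumerate_paths(text, dictionary):
    """Enumerate all segmentations of text by depth-first search."""
    paths = []
    current = []

    def search(position):
        if position == len(text):
            paths.append(list(current))  # reached the end: one full path
            return
        for end in range(position + 1, len(text) + 1):
            for written, pos in dictionary.get(text[position:end], []):
                current.append((written, pos))
                search(end)
                current.pop()  # backtrack and try the next candidate

    search(0)
    return paths


all_paths = enumerate_paths("はなみ", DICTIONARY)
for path in all_paths:
    print(path)
```

Even this three-character input yields several paths, and the number grows quickly with input length, which is why exhaustive enumeration is rarely practical.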
  11. 13 / 25 Heuristics

    • There are so many candidates (paths) that we cannot examine them one by one.
    • The practical way is to use "knowledge."
    • A heuristic, a term derived from AI, is a piece of knowledge that is considered to be true in many cases, but not always.
  12. 14 / 25 Heuristics: examples

    • Students who sleep during class get low scores in the course.
    • Absent students get lower scores.
    • Absent students who never see these slides get the lowest scores.
  13. 15 / 25 Heuristics for Japanese analysis

    • longest-match method / 最長一致法
      – "The longer, the better."
      – はなみ vs は/なみ
    • minimum unit method / 文節数最小法
      – selects the path that has the fewest units.
      – は/な/み (3 units) vs はな/み (2 units)
    • minimum cost method
      – a generalization of the method above. The big problem here is how to decide the costs.
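The first two heuristics can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: the word sets are toy data, `longest_match` is a greedy left-to-right variant, and `fewest_units` counts dictionary words rather than true 文節 units. Both function names are hypothetical.

```python
def longest_match(text, words):
    """Greedy segmentation: at each position take the longest known word."""
    result = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            if text[position:end] in words:
                result.append(text[position:end])
                position = end
                break
        else:
            # No dictionary word starts here: emit a single character.
            result.append(text[position])
            position += 1
    return result


def fewest_units(text, words):
    """Return a segmentation with the fewest words, or None if none exists."""
    # best[i] holds a shortest segmentation of text[:i]; ties keep the
    # first candidate found.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in words:
                candidate = best[start] + [text[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


WITH_HANAMI = {"は", "はな", "はなみ", "な", "なみ", "み"}
WITHOUT_HANAMI = {"は", "はな", "な", "なみ", "み"}

print(longest_match("はなみ", WITH_HANAMI))
print(longest_match("はなみ", WITHOUT_HANAMI))
print(fewest_units("はなみ", WITHOUT_HANAMI))
```

Note that with the second word set two different two-unit segmentations exist, so the unit count alone cannot decide between them; this is one motivation for assigning costs.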
  14. 16 / 25 Viterbi algorithm

    The Viterbi algorithm is a dynamic programming (DP) approach that finds the minimum-cost path efficiently by storing the partial minimum-cost path at each step.
  15. 17 / 25 Example: 「はなみのはる」

    Word dictionary (reading, written form, POS):
    • は (葉), n
    • は, p
    • はな (花), n
    • はなみ (花見), n
    • な (菜), n
    • なみ (波), n
    • み (身), n
    • みの (蓑), n
    • の, p
    • はる (春), n
    • はる (貼る), v

    Cost of words (v = verb, n = noun, p = particle): c(v) = 3, c(n) = 2, c(p) = 1
    Cost of connection: c(*, *) = 1

    Given a word dictionary, the word and connection costs, and an input expression, the problem is to find the path (i.e. combination of words) with the lowest cost.
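As a small sketch of how the cost of one candidate path can be computed from these values, the code below charges each word its POS cost plus one connection cost, which matches calculations such as c(葉) = c(n) + c(*,*) in this example. The names `WORD_COST`, `CONNECTION_COST`, and `path_cost` are assumptions introduced for illustration.

```python
# Costs taken from the example: c(v) = 3, c(n) = 2, c(p) = 1, c(*, *) = 1.
WORD_COST = {"v": 3, "n": 2, "p": 1}
CONNECTION_COST = 1


def path_cost(path):
    """Sum word cost plus one connection cost for every word in the path.

    The path is a list of (written form, POS) pairs.
    """
    return sum(WORD_COST[pos] + CONNECTION_COST for _, pos in path)


spring = [("花見", "n"), ("の", "p"), ("春", "n")]
paste = [("花見", "n"), ("の", "p"), ("貼る", "v")]
print(path_cost(spring))  # (2 + 1) + (1 + 1) + (2 + 1) = 8
print(path_cost(paste))   # (2 + 1) + (1 + 1) + (3 + 1) = 9
```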
  16. 18 / 25 [Figure: lattice for はなみのはる with nodes 葉, は, 花, 花見, 菜, 波, 身, 蓑, の, 葉, は, 春, 貼る]

    There are many paths from the top left to the bottom right.
  17. 19 / 25 [Figure: lattice up to は, with node costs 葉 = 3 and は = 2]

    Costs from the beginning to は:
    c(葉) = c(n) + c(*,*) = 2 + 1 = 3
    c(は) = c(p) + c(*,*) = 1 + 1 = 2
    • は, p: selected and stored as the best path so far, as it has the lowest cost
    • 葉, n: ignored hereafter
  18. 20 / 25 [Figure: lattice up to はな, with nodes は, 花, 菜]

    Costs to な:
    c(は+菜) = 2 + c(n) + c(*,*) = 2 + 2 + 1 = 5
    c(花) = c(n) + c(*,*) = 2 + 1 = 3
    花 is stored as the best path to はな. The path は+菜 is deleted.
  19. 21 / 25 [Figure: lattice up to はなみ, with nodes は, 花, 花見, 波, 身]

    c(は+波) = 2 + c(n) + c(*,*) = 5
    c(花+身) = 3 + c(n) + c(*,*) = 6
    c(花見) = c(n) + c(*,*) = 3
    花見 is recorded as the best path to はなみ.
  20. 22 / 25 [Figure: lattice up to はなみの, with nodes 花, 花見, 蓑, の]

    c(花+蓑) = 3 + c(n) + c(*,*) = 6
    c(花見+の) = 3 + c(p) + c(*,*) = 5
    花見の is recorded as the best path to はなみの.
  21. 23 / 25 [Figure: lattice up to はなみのは, with nodes 花見, の, 葉, は]

    c(花見+の+葉) = 3 + 2 + c(n) + c(*,*) = 8
    c(花見+の+は) = 3 + 2 + c(p) + c(*,*) = 7
    花見のは is recorded as the best path to はなみのは.
  22. 24 / 25 [Figure: full lattice for はなみのはる, with nodes 花見, の, は, 春, 貼る, る]

    c(花見+の+春) = 3 + 2 + c(n) + c(*,*) = 8
    c(花見+の+貼る) = 3 + 2 + c(v) + c(*,*) = 9
    Finally, we get the best path 花見の春 for the input はなみのはる.
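The walkthrough above can be reproduced with a short dynamic-programming sketch. It is a simplified version that assumes the uniform connection cost c(*, *) = 1 used in this example, so one best cost per character position is enough; with POS-dependent connection costs the table would need one entry per (position, POS) pair. The function name `viterbi` and the data layout are assumptions made for this illustration.

```python
# Dictionary and costs from the example; reading -> (written form, POS).
DICTIONARY = {
    "は": [("葉", "n"), ("は", "p")],
    "はな": [("花", "n")],
    "はなみ": [("花見", "n")],
    "な": [("菜", "n")],
    "なみ": [("波", "n")],
    "み": [("身", "n")],
    "みの": [("蓑", "n")],
    "の": [("の", "p")],
    "はる": [("春", "n"), ("貼る", "v")],
}
WORD_COST = {"v": 3, "n": 2, "p": 1}
CONNECTION_COST = 1  # uniform connection cost, as in the example


def viterbi(text, dictionary):
    """Return (best path, cost); the path is None if no segmentation exists."""
    n = len(text)
    infinity = float("inf")
    best_cost = [infinity] * (n + 1)  # best_cost[i]: cheapest way to reach i
    back = [None] * (n + 1)           # back[i]: (start, written form, POS)
    best_cost[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best_cost[start] == infinity:
                continue  # position start is unreachable
            for written, pos in dictionary.get(text[start:end], []):
                cost = best_cost[start] + WORD_COST[pos] + CONNECTION_COST
                if cost < best_cost[end]:
                    # Keep only the cheapest partial path to this position.
                    best_cost[end] = cost
                    back[end] = (start, written, pos)
    if best_cost[n] == infinity:
        return None, infinity
    path = []
    position = n
    while position > 0:  # follow back-pointers from the end to the start
        start, written, pos = back[position]
        path.append((written, pos))
        position = start
    path.reverse()
    return path, best_cost[n]


best_path, total_cost = viterbi("はなみのはる", DICTIONARY)
print(best_path, total_cost)
```

Under these assumptions the sketch selects 花見 + の + 春 with a total cost of 8, in agreement with the slides.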
  23. 25 / 25 Today's key words

    • morpheme
    • morphological analysis
    • heuristics
    • Viterbi algorithm