Natural Language Processing (2) Morphological analysis (1)

Kazuhide Yamamoto
Nagaoka University of Technology

Natural Language Processing Laboratory (自然言語処理研究室)

September 20, 2013
Transcript

  1. 1.

    1 / 25 Natural Language Processing (2) Morphological Analysis (1)

    Kazuhide Yamamoto Dept. of Electrical Engineering Nagaoka University of Technology
  2. 2.

    2 / 25 FYI: some linguistic terms
    • letter/文字, word/単語, bunsetsu (phrase unit)/文節, sentence/文
    • part-of-speech/品詞, noun/名詞, verb/動詞, particle/助詞
    • adnominal (a kind of Japanese modifier)/連体詞
    • irrealis form (one of the verb conjugation forms)/動詞未然形
    • content word/自立語, function word/付属語
    • consonant/子音, vowel/母音
    • "school grammar"/学校文法
      – the grammar taught in Japanese schools. It is well known and the de facto standard.
  3. 3.

    3 / 25 Morpheme / 形態素
    • A morpheme is the smallest component of a word, or other linguistic unit, that has semantic meaning.
    • One word (or one token) is not always one morpheme.
      – using = use + ing, better = good + er
      – unbreakable = un + break + able
      – 聞かせたくなかったようだ = 聞く + せる + たい + ない + た + ようだ
  4. 4.

    4 / 25 Each of the following consists of one morpheme.
    • 109 (a department store)
    • 江頭2:50 (a comedian)
    • モーニング娘。/ 関ジャニ∞ (both music groups)
    • Gone With the Wind, The Sound of Music
    • 平成13年9月11日のアメリカ合衆国において発生したテロリストによる攻撃等に対応して行われる国際連合憲章の目的達成のための諸外国の活動に対して我が国が実施する措置及び関連する国際連合決議等に基づく人道的措置に関する特別措置法(テロ対策特別措置法) (a law)
  5. 7.

    7 / 25 Morphological analysis / 形態素解析
    Given an expression (mainly a sentence), morphological analysis generally performs the following tasks.
    • word segmentation / 分かち書き
      – necessary for processing non-segmented languages such as Japanese. English does not need to be segmented, but some words must be concatenated into one unit (e.g., New York).
    • part-of-speech (POS) tagging / 品詞付与
      – determines the POS of each word. For example, "like" has five parts of speech: noun, verb, adjective, particle, conjunction.
    • pronunciation tagging / 読みがな付与
      – optional.
    In Japanese, the three processes are conducted simultaneously.
  6. 8.

    8 / 25 Morphological analysis: procedure
    • A word dictionary is looked up.
      – 長岡技術科学大学 vs 長岡+技術+科学+大学
      – A morpheme is defined as an entry in the dictionary.
    • A lattice structure (explained later) is built.
    • The best path is searched for among the many paths in the lattice.
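The dictionary lookup step can be sketched as follows. This is a minimal illustration, assuming the toy dictionary from the 「はなみのはる」 example later in these slides; the names `DICTIONARY` and `lookup` are hypothetical, and real analyzers typically use a trie or similar structure rather than repeated substring lookups.

```python
# Toy dictionary taken from the slides' example: surface form -> [(word, POS)].
DICTIONARY = {
    "は": [("葉", "n"), ("は", "p")],
    "はな": [("花", "n")],
    "はなみ": [("花見", "n")],
    "な": [("菜", "n")],
    "なみ": [("波", "n")],
    "み": [("身", "n")],
    "みの": [("蓑", "n")],
    "の": [("の", "p")],
    "はる": [("春", "n"), ("貼る", "v")],
}


def lookup(text, start, dictionary=DICTIONARY):
    """Return every dictionary entry whose surface form begins at `start`.

    Each result is (start, end, word, POS), i.e. one candidate lattice edge.
    """
    max_len = max(len(surface) for surface in dictionary)
    found = []
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        for word, pos in dictionary.get(text[start:end], []):
            found.append((start, end, word, pos))
    return found


candidates = lookup("はなみのはる", 0)
```

Calling `lookup` at every position of the input yields all the edges needed to build the lattice.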
  7. 9.

    9 / 25 Procedure (cont'd)
    Morphological analysis in general consists of two problems:
    • a path search problem, given a lattice built from the input
      – how to find the best path among a huge number of candidates
    • lattice construction, i.e., term dictionary construction
      – The latter problem includes deciding the unit of a term.
      – unknown word collection
  8. 10.

    10 / 25 Lattice structure
    • directed graph / 有向グラフ
    • partial order relation / 半順序関係
    • A path corresponds to an answer candidate.
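One way to picture the lattice is as a directed graph whose nodes are character positions and whose edges are dictionary words. The sketch below assumes that representation and a small hypothetical dictionary `TOY` (a subset of the slides' example); `build_lattice` and `all_paths` are illustrative names.

```python
def build_lattice(text, dictionary):
    """Map each start position to the edges (end, word) that leave it."""
    lattice = {i: [] for i in range(len(text))}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for word in dictionary.get(text[start:end], []):
                lattice[start].append((end, word))
    return lattice


def all_paths(lattice, length, start=0):
    """Yield every word sequence leading from `start` to the end of the input."""
    if start == length:
        yield []
        return
    for end, word in lattice[start]:
        for rest in all_paths(lattice, length, end):
            yield [word] + rest


# Hypothetical toy dictionary: surface form -> candidate words.
TOY = {"は": ["葉", "は"], "はな": ["花"], "はなみ": ["花見"],
       "な": ["菜"], "なみ": ["波"], "み": ["身"]}

paths = list(all_paths(build_lattice("はなみ", TOY), 3))
for path in paths:
    print("/".join(path))
```

Every edge points forward in the text, so the graph has no cycles, and each printed line is one answer candidate.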
  9. 11.

    11 / 25 Morpheme connectivity
    • Some words can never be connected:
      – In English, consecutive articles (a / the) never appear.
      – In Japanese,
        • a verb in the -nai form + ます
        • a particle + an auxiliary verb
        • a noun + a prefix
    • We may exclude some candidates by considering morpheme connectivity.
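Connectivity constraints of this kind can be sketched as a table of forbidden part-of-speech pairs. The tag names and the `FORBIDDEN` table below are hypothetical and only mirror the examples on this slide; they are not taken from any real tagset.

```python
# Hypothetical table of (left POS, right POS) pairs that may never be adjacent.
FORBIDDEN = {
    ("article", "article"),    # e.g. "a the"
    ("particle", "aux_verb"),
    ("noun", "prefix"),
}


def is_connectable(left_pos, right_pos):
    """True unless the POS pair is listed as impossible."""
    return (left_pos, right_pos) not in FORBIDDEN


def path_is_valid(path):
    """`path` is a list of (word, POS); check every adjacent pair."""
    return all(is_connectable(a[1], b[1]) for a, b in zip(path, path[1:]))


print(path_is_valid([("the", "article"), ("cat", "noun")]))
print(path_is_valid([("a", "article"), ("the", "article")]))
```

Paths that fail the check can be dropped from the lattice before the search starts.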
  10. 12.

    12 / 25 Search algorithm
    • The task of morphological analysis is to find the best path in a given lattice.
    • Two search strategies:
      – depth-first search / 深さ優先探索
      – breadth-first search / 幅優先探索
    • Backtracking
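Depth-first search with backtracking can be sketched as below. This is only an illustration: it returns the first complete segmentation it finds, not the best one, and the word set `WORDS` (surface forms from the slides' later example) and the function name `segment_dfs` are assumptions.

```python
# Surface forms only; taken from the slides' 「はなみのはる」 example.
WORDS = {"は", "はな", "はなみ", "な", "なみ", "み", "みの", "の", "はる"}


def segment_dfs(text, start=0):
    """Return the first complete segmentation found depth-first, or None."""
    if start == len(text):
        return []
    for end in range(start + 1, len(text) + 1):   # try the shortest word first
        if text[start:end] in WORDS:
            rest = segment_dfs(text, end)
            if rest is not None:                  # the remainder could be segmented
                return [text[start:end]] + rest
            # otherwise backtrack and try a longer word from `start`
    return None


result = segment_dfs("はなみのはる")
print(result)
```

On this input the search first reaches は/な/み/の/は and gets stuck on る, which is not in the dictionary, so it backtracks one step and takes はる instead.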
  11. 13.

    13 / 25 Heuristics
    • There are so many candidates (paths) that we cannot examine them one by one.
    • The practical way is to use "knowledge."
    • A heuristic, a term derived from AI, is knowledge that is considered to be true in many cases, but not always.
  12. 14.

    14 / 25 Heuristics: examples
    • Students who sleep during class get low scores in the course.
    • Absent students get lower scores.
    • Absent students who never see these slides get the lowest scores.
  13. 15.

    15 / 25 Heuristics for Japanese analysis
    • longest-match method / 最長一致法
      – "The longer, the better."
      – はなみ vs は/なみ
    • minimum unit method / 文節数最小法
      – a method that selects the path with the fewest units.
      – は/な/み (3 units) vs はな/み (2 units)
    • minimum cost method
      – a generalization of the methods above. The big problem here is how to decide the costs.
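The first two heuristics can be sketched in a few lines. The code assumes a plain set of surface forms (`WORDS`, taken from the slides' later example) and treats an unknown character as a unit of its own, which is an assumption of this sketch rather than something stated in the slides.

```python
WORDS = {"は", "はな", "はなみ", "な", "なみ", "み", "みの", "の", "はる"}


def longest_match(text, words=WORDS):
    """Greedy left-to-right segmentation: always take the longest known word."""
    result, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):   # longest candidate first
            if text[pos:end] in words:
                break
        else:
            end = pos + 1   # assumption: an unknown character is its own unit
        result.append(text[pos:end])
        pos = end
    return result


def fewest_units(candidates):
    """Minimum unit method: pick the segmentation with the fewest units."""
    return min(candidates, key=len)


print(longest_match("はなみ"))
print(fewest_units([["は", "な", "み"], ["はな", "み"]]))
```

Longest match never reconsiders a choice, so it is fast but can commit to a long word that a cost-based method would reject.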
  14. 16.

    16 / 25 Viterbi algorithm
    The Viterbi algorithm is a dynamic programming (DP) approach that finds the minimum-cost path efficiently by storing the minimum-cost partial path at each step.
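The stored partial costs can be written as a recurrence. The notation below is an assumption made for illustration: best(w) is the lowest cost of any path that ends with lattice node w, prev(w) is the set of nodes that end where w begins, BOS is an artificial start node, c(w) is the word cost, and c(w', w) is the connection cost.

```latex
\begin{align*}
\mathrm{best}(\mathrm{BOS}) &= 0 \\
\mathrm{best}(w) &= c(w) + \min_{w' \in \mathrm{prev}(w)}
  \bigl[\, \mathrm{best}(w') + c(w', w) \,\bigr]
\end{align*}
```

The answer is the smallest best(w) among the nodes that end at the end of the input. Each node is evaluated once, so the work grows with the number of lattice edges rather than with the number of paths.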
  15. 17.

    17 / 25 Example: 「はなみのはる」
    Word dictionary (n = noun, p = particle, v = verb):
      は (葉, leaf), n
      は, p
      はな (花, flower), n
      はなみ (花見, flower viewing), n
      な (菜, greens), n
      なみ (波, wave), n
      み (身, body), n
      みの (蓑, straw raincoat), n
      の, p
      はる (春, spring), n
      はる (貼る, to paste), v
    Cost of words: c(v) = 3, c(n) = 2, c(p) = 1
    Cost for connection: c(*, *) = 1
    Given the word dictionary, the costs, and the input expression, the problem is to find the path (i.e., the combination of words) with the lowest cost.
  16. 18.

    18 / 25
    [Figure: lattice for the input は な み の は る, with candidate words 葉, は, 花, 花見, 菜, 波, 身, 蓑, の, 葉, は, 春, 貼る]
    There are many paths from the top left to the bottom right.
  17. 19.

    19 / 25 は
    • は, p (cost 2): stored as the best path so far
    • 葉, n (cost 3): ignored hereafter
    Costs from the beginning to は:
      c(葉) = c(n) + c(*,*) = 2 + 1 = 3
      c(は) = c(p) + c(*,*) = 1 + 1 = 2
    The latter is selected and stored, as it has the lowest cost.
  18. 20.

    20 / 25 は な
    [Figure: lattice up to な, with nodes は, 花 (n), and 菜 (n)]
    Cost to な:
      c(は+菜) = 2 + c(n) + c(*,*) = 2 + 2 + 1 = 5
      c(花) = c(n) + c(*,*) = 2 + 1 = 3
    花 is stored as the best path to はな. The path は+菜 is deleted.
  19. 21.

    21 / 25 は な み
    [Figure: lattice up to み, with nodes は, 花, 花見, 波, and 身]
      c(は+波) = 2 + c(n) + c(*,*) = 5
      c(花+身) = 3 + c(n) + c(*,*) = 6
      c(花見) = c(n) + c(*,*) = 3
    花見 is recorded as the best path to はなみ.
  20. 22.

    22 / 25 は な み の
    [Figure: lattice up to の, with nodes 花, 花見, 蓑, and の]
      c(花+蓑) = 3 + c(n) + c(*,*) = 6
      c(花見+の) = 3 + c(p) + c(*,*) = 5
    花見の is recorded as the best path to はなみの.
  21. 23.

    23 / 25 は な み の は
    [Figure: lattice up to は, with nodes 花見, の, 葉, and は]
      c(花見+の+葉) = 3 + 2 + c(n) + c(*,*) = 8
      c(花見+の+は) = 3 + 2 + c(p) + c(*,*) = 7
    花見のは is recorded as the best path to はなみのは.
  22. 24.

    24 / 25 は な み の は る
    [Figure: complete lattice, with nodes 花見, の, は, 春, 貼る, and る]
      c(花見+の+春) = 3 + 2 + c(n) + c(*,*) = 8
      c(花見+の+貼る) = 3 + 2 + c(v) + c(*,*) = 9
    Finally, we get the best path 花見の春 for the input はなみのはる.
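The whole walk-through can be reproduced with a short sketch. It uses the dictionary and costs from the example slide; the names `DICTIONARY`, `WORD_COST`, `CONNECTION_COST`, and `viterbi` are hypothetical. Because the connection cost here is the constant c(*,*) = 1, it is enough to keep one best path per character position; with POS-dependent connection costs, one best path per lattice node would be needed instead.

```python
WORD_COST = {"n": 2, "p": 1, "v": 3}   # c(n), c(p), c(v) from the slides
CONNECTION_COST = 1                    # c(*, *) = 1

# Surface form -> [(word, POS)], as given in the example dictionary.
DICTIONARY = {
    "は": [("葉", "n"), ("は", "p")],
    "はな": [("花", "n")],
    "はなみ": [("花見", "n")],
    "な": [("菜", "n")],
    "なみ": [("波", "n")],
    "み": [("身", "n")],
    "みの": [("蓑", "n")],
    "の": [("の", "p")],
    "はる": [("春", "n"), ("貼る", "v")],
}


def viterbi(text):
    """Return (cost, words) for the lowest-cost path, or None if none exists."""
    # best[j] holds the cheapest (cost, words) covering text[:j], if any.
    best = [None] * (len(text) + 1)
    best[0] = (0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is None:        # position not reachable
                continue
            for word, pos in DICTIONARY.get(text[start:end], []):
                cost = best[start][0] + WORD_COST[pos] + CONNECTION_COST
                if best[end] is None or cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [word])
    return best[len(text)]


cost, words = viterbi("はなみのはる")
print(cost, "/".join(words))
```

The intermediate entries of `best` correspond to the costs stored on each slide of the walk-through: 2 for は, 3 for 花, 3 for 花見, 5 for 花見の, and 8 for 花見の春.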
  23. 25.

    25 / 25 Today's key words
    • morpheme
    • morphological analysis
    • heuristics
    • Viterbi algorithm