Japanese linguistics in Lucene and Solr

An overview of the new Japanese language support in Apache Lucene and Apache Solr.

This is our talk from Berlin Buzzwords in 2012.

Transcript

  1. Japanese linguistics in Apache Lucene™ and Apache Solr™

    Christian Moen, cm@atilika.com, June 5th, 2012
  2. About me

    • MSc in computer science, University of Oslo, Norway
    • Worked with search at FAST (now Microsoft) for 10 years
      • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
      • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
    • Founded アティリカ株式会社 (Atilika) in 2009
      • We help companies innovate using new technologies and good ideas
      • We do information retrieval, natural language processing and big data
      • We are based in Tokyo, but we have clients everywhere
    • Newbie Lucene & Solr Committer
      • Mostly been working on Japanese language support (Kuromoji) so far
      • Please write to me at cm@atilika.com or cm@apache.org
  3. Today’s talk

  4. Today’s talk

    • Japanese 101 - ordering beer and toasting
    • Japanese language processing
    • Japanese features in Lucene/Solr
  7. Japanese 101

  8. ビールください (bi-ru kudasai)

  9. ビールください (bi-ru kudasai): A beer, please

  10. ありがとうございます！ (arigatō gozaimasu!)

  11. ありがとうございます！ (arigatō gozaimasu!): Thank you very much!

  12. 乾杯！ (kanpai!)

  13. 乾杯！ (kanpai!): Cheers!

  14. JR新宿駅の近くにビールを飲みに行こうか？ JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  15. JR新宿駅の近くにビールを飲みに行こうか？ JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

    Shall we go for a beer near JR Shinjuku station?
  16. JR新宿駅の近くにビールを飲みに行こうか？

  17. Romaji - ローマ字

    • Latin characters (26+)
    • Used for proper nouns, etc.
    JR新宿駅の近くにビールを飲みに行こうか？
  18. Katakana - カタカナ

    • Phonetic script (~50)
    • Typically used for loan words
    JR新宿駅の近くにビールを飲みに行こうか？
  19. Kanji - 漢字

    • Chinese characters (50,000+)
    • Used for stems & proper nouns
    JR新宿駅の近くにビールを飲みに行こうか？
  20. Hiragana - ひらがな

    • Phonetic script (~50)
    • Used for inflections & particles
    JR新宿駅の近くにビールを飲みに行こうか？
  21. JR新宿駅の近くにビールを飲みに行こうか？

    • Romaji - ローマ字: Latin characters (26+), used for proper nouns, etc.
    • Katakana - カタカナ: Phonetic script (~50), typically used for loan words
    • Kanji - 漢字: Chinese characters (50,000+), used for stems & proper nouns
    • Hiragana - ひらがな: Phonetic script (~50), used for inflections & particles
  22. JR新宿駅の近くにビールを飲みに行こうか？

  23. JR新宿駅の近くにビールを飲みに行こうか？ What are the words in this sentence?

  24. JR新宿駅の近くにビールを飲みに行こうか？ What are the words in this sentence?

    Words are implicit in Japanese - there is no white space that separates them.
  25. JR新宿駅の近くにビールを飲みに行こうか？ How do we index this for search, then?

  26. JR新宿駅の近くにビールを飲みに行こうか？ How do we index this for search, then?

    We need to segment the text into tokens first.
  27. Two major approaches for segmentation

    1. n-gramming
    2. morphological analysis (statistical approach)
  35. n-gramming (n=2)

    JR新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station?
    n=2: JR, R新, 新宿, 宿駅, 駅の, の近, 近く
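
The bigram build-up on the slide above can be sketched in a few lines of Python. This is plain overlapping character n-gramming written for illustration; the function name `char_ngrams` is made up here, and it is not the code Lucene uses for CJK bigrams.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか？"

# The first seven bigrams are the ones shown on the slide.
print(char_ngrams(sentence)[:7])
# ['JR', 'R新', '新宿', '宿駅', '駅の', 'の近', '近く']
```

Every character position starts a new term, which is why n-gramming produces many terms per document.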
  36. Problems with n-gramming

  47. Problems with n-gramming

    JR •, R新 ×, 新宿 •, 宿駅 ×, 駅の ×, の近 ×, 近く •, ...
    宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’
    • Does not preserve meaning well and often changes semantics
      • Impacts on ranking - search precision (many false positives)
    • Also generates many terms per document or query
      • Impacts on index size and performance
    • Still sometimes appropriate for certain search applications
      • Compliance, e-commerce with special product names, ...
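
A small sketch of the false-positive problem, assuming an index that stores nothing but character bigrams. The helper `bigram_terms` is a hypothetical name. A query for 宿駅 (post town) matches the beer sentence even though the sentence is about Shinjuku station.

```python
def bigram_terms(text):
    """Return the set of character bigrams a naive bigram index would store."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


document = "JR新宿駅の近くにビールを飲みに行こうか？"
index_terms = bigram_terms(document)

# 宿駅 straddles the boundary between 新宿 (Shinjuku) and 駅 (station).
print("宿駅" in index_terms)   # True: a false positive
print("新宿" in index_terms)   # True: the match we actually want
```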
  53. Morphological analysis

    JR新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station?
    JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ？
    • Tokens reflect what a Japanese speaker considers to be words
    • Machine-learned statistical approach
      • Conditional Random Fields (CRFs) decoded using Viterbi
      • Also does part-of-speech tagging, readings for kanji, etc.
    • Several statistical models available with high accuracy (F > 0.97)
      • Models/dictionaries are available as IPADIC, UniDic, ...
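
To give a feel for Viterbi decoding over a word lattice, here is a heavily simplified sketch. It assumes a toy dictionary with invented word costs and ignores the connection costs between adjacent tokens that a trained CRF model also supplies, so it is not how Kuromoji scores paths. `viterbi_segment` and `toy_costs` are names made up for this example.

```python
def viterbi_segment(text, word_costs, unknown_cost=10.0):
    """Segment text into the cheapest sequence of dictionary words.

    best[i] is the lowest total cost of any segmentation of text[:i];
    back[i] is the start index of the last word on that cheapest path.
    """
    n = len(text)
    max_len = max(len(word) for word in word_costs)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif end - start == 1:
                cost = unknown_cost  # single unknown character as a fallback
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


# Invented costs: lower means more likely.
toy_costs = {"JR": 2, "新宿": 2, "宿駅": 6, "駅": 3, "の": 1,
             "近く": 2, "近": 4, "く": 3, "に": 1}

print(viterbi_segment("JR新宿駅の近くに", toy_costs))
# ['JR', '新宿', '駅', 'の', '近く', 'に']
```

Because the whole path is scored, 新宿 + 駅 wins over an unknown 新 followed by 宿駅, which is the kind of mistake bigrams cannot avoid.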
  54. How does this actually work?

  55. Demo

  56. Japanese support in Lucene and Solr

  57. Japanese in Lucene/Solr

  62. Japanese in Lucene/Solr

    • New feature in Lucene/Solr 3.6
    • Available out-of-the-box
    • Easy to use with reasonable defaults
    • Provides sophisticated Japanese linguistics
    • Easy to customise
  63. How do we use it?

  64. How do we use it?

    • Use JapaneseAnalyzer
  65. How do we use it?

    • Use JapaneseAnalyzer
    • Use field type “text_ja” in example schema.xml
  66. Demo

  73. Feature summary / text_ja analyzer chain

    • JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
      • Token attributes for part-of-speech, base form, readings, etc.
      • Compound segmentation with compound synonyms
      • Segmentation is customisable using user dictionaries
    • JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
    • JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags
      • See example/solr/conf/lang/stoptags_ja.txt
    • CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
    • StopFilter: Stop-words removal
      • See example/solr/conf/lang/stopwords_ja.txt
    • JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
    • LowerCaseFilter: Lowercases
  74. Feature details

  78. Compound nouns

    How do we deal with compound nouns?
    • 関西国際空港: Kansai International Airport
    • シニアソフトウェアエンジニア: Senior Software Engineer
    These are one word in Japanese, so searching for 空港 (airport) doesn’t match.
    We need to segment the compounds, too.
  82. Compound segmentation

    • 関西国際空港 (Kansai International Airport): 関西 (Kansai), 国際 (International), 空港 (Airport)
    • シニアソフトウェアエンジニア (Senior Software Engineer): シニア (Senior), ソフトウェア (Software), エンジニア (Engineer)
    We are using a heuristic to implement this.
  83. Compound synonym tokens

    Position 1: 関西, Position 2: 国際, Position 3: 空港
    The compound 関西国際空港 is kept as a synonym token starting at position 1.
    • Segment the compounds into their parts
      • Good for recall - we can also search and match 空港 (airport)
    • We keep the compound itself as a synonym
      • Good for precision with an exact hit because of IDF
    • Approach benefits both precision and recall for overall good ranking
    • JapaneseTokenizer actually returns a graph of tokens
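
The compound synonym idea can be sketched as a list of (term, position, position length) tuples. This is an illustration only: `expand_compound` is a hypothetical helper, and neither the tuple layout nor the order in which tokens are emitted is claimed to match what JapaneseTokenizer does internally.

```python
def expand_compound(compound, parts, start=1):
    """Return the parts at consecutive positions plus the compound as a synonym.

    Each token is (term, position, position_length). The compound starts at
    the position of its first part and spans all of the parts.
    """
    tokens = [(part, start + offset, 1) for offset, part in enumerate(parts)]
    tokens.append((compound, start, len(parts)))
    return tokens


tokens = expand_compound("関西国際空港", ["関西", "国際", "空港"])
for term, position, length in tokens:
    print(term, position, length)

indexed_terms = {term for term, _, _ in tokens}
print("空港" in indexed_terms)          # recall: the part matches
print("関西国際空港" in indexed_terms)  # precision: the exact compound matches
```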
  90. Lemmatisation

    Japanese adjectives and verbs are highly inflected. How do we deal with that?
    Dictionary form: 買う (kau, to buy)
    Inflected forms (not exhaustive):
    買いなさい 買いなさるな 買いましたら 買いましたり 買いまして 買いましょう 買います 買いますまい 買いませば 買いません 買いませんで 買いませんでした 買いませんでしたら 買いませんでしたり 買いませんなら 買うだろう 買うでしょう 買うな 買うまい 買え 買えない 買えば 買えます 買えません 買える 買おう 買った 買ったら 買ったり 買って 買わせない 買わせます 買わせません 買わせられない 買わせられます 買わせられません 買わせられる 買わせる 買わない 買わないだろう 買わないで 買わないでしょう 買わなかった 買わなかったら 買わなかったり 買わなければ 買われない 買われます
    Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction).
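
Lemmatisation by reduction can be pictured as replacing each token's surface form with the base form that the tokenizer already looked up in its dictionary. The sketch below assumes hand-written (surface, base form) pairs rather than real tokenizer output, and `base_form_filter` is a made-up name.

```python
def base_form_filter(tokens):
    """Replace each surface form with its base form when one is known.

    tokens is a list of (surface, base_form) pairs; base_form is None
    when the token is already in dictionary form.
    """
    return [base if base is not None else surface for surface, base in tokens]


# Hypothetical token streams for 買った (bought) and 買わない (do not buy).
bought = [("買っ", "買う"), ("た", None)]
not_buy = [("買わ", "買う"), ("ない", None)]

print(base_form_filter(bought))    # ['買う', 'た']
print(base_form_filter(not_buy))   # ['買う', 'ない']
```

Both inflected stems reduce to 買う, so a query and a document using different inflections can still match.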
  91. Character width normalisation

    How do we deal with character widths?
    • Half-width (半角): Lucene, ｶﾀｶﾅ, 123
    • Full-width (全角): Ｌｕｃｅｎｅ, カタカナ, １２３
  92. Character width normalisation

    How do we deal with character widths?
    • Input text: Ｌｕｃｅｎｅ (full-width), ｶﾀｶﾅ (half-width), １２３ (full-width)
    • After CJKWidthFilter: Lucene (half-width), カタカナ (full-width), 123 (half-width)
    Use CJKWidthFilter to normalise them (Unicode NFKC subset).
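
The effect on the slide's example can be reproduced with full Unicode NFKC normalisation from Python's standard library. Note the assumption: NFKC does more than width folding, whereas the slide describes CJKWidthFilter as a fast subset of it, so this shows the same result for these inputs rather than the filter's implementation.

```python
import unicodedata


def normalise_width(text):
    """Fold full-width Latin/digits to half-width and half-width katakana to full-width."""
    return unicodedata.normalize("NFKC", text)


print(normalise_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
# Lucene カタカナ 123
```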
  93. Katakana end-vowel stemming

    A common spelling variation in katakana is an end long-vowel sound.
    • English: manager
    • Japanese spelling variations: マネージャー, マネージャ, マネジャー
  94. Katakana end-vowel stemming

    A common spelling variation in katakana is an end long-vowel sound.
    • English: manager
    • Japanese spelling variations: マネージャー, マネージャ, マネジャー
    • Input text: コピー (copy), マネージャー (manager), マネージャ (manager), マネジャー (“manager”)
    • After JapaneseKatakanaStemFilter: コピー, マネージャ, マネージャ, マネジャ
    We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms.
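
A rough sketch of end-vowel stemming that reproduces the slide's example. It assumes a rule of "drop one trailing prolonged sound mark (ー) from all-katakana terms of at least four characters"; the length threshold and the helper names are assumptions for illustration, not a statement of the filter's exact behaviour.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term):
    """True if every character lies in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def katakana_stem(term, minimum_length=4):
    """Strip one trailing ー from sufficiently long katakana terms."""
    if (len(term) >= minimum_length
            and term.endswith(PROLONGED_SOUND_MARK)
            and is_katakana(term)):
        return term[:-1]
    return term


for term in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(term, "->", katakana_stem(term))
```

Short terms such as コピー are left alone because stripping the final vowel from them is more likely to change the word.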
  95. User dictionaries

    • Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
    • File format is simple and there’s no need to assign weights, etc. before using them
    • Example custom dictionary:

        # Custom segmentation and POS entry for long entries
        関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
        # Custom reading and POS former sumo wrestler Asashoryu
        朝青龍,朝青龍,アサショウリュウ,カスタム人名
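
The four comma-separated fields in the example entries read as surface form, segmentation, readings and part-of-speech. A minimal parser for that layout might look like the following; the function and key names are invented for this sketch, and it is not the loader Lucene ships.

```python
import csv


def parse_user_dictionary(text):
    """Parse user dictionary lines into dicts, skipping comments and blanks."""
    lines = [line for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        entries.append({
            "surface": surface,
            "segments": segmentation.split(),   # space-separated parts
            "readings": readings.split(),       # one reading per part
            "pos": pos,
        })
    return entries


example = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(example):
    print(entry["surface"], entry["segments"], entry["pos"])
```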
  96. Japanese focus in 4.0

    • Improvements in JapaneseTokenizer
      • Improved search mode for katakana compounds
      • Improved unknown word segmentation
      • Some performance improvements
    • CharFilters for various character normalisations
      • Dates and numbers
      • Repetition marks (odoriji)
    • Japanese spell-checker
      • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  97. Acknowledgements

    • Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly and for friendly help
    • Michael McCandless: Thanks for streaming Viterbi and synonym compounds
    • Uwe Schindler: Thanks for performance improvements + being the policeman
    • Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
    • Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues
  98. Q & A

  99. ありがとうございました！ (arigatō gozaimashita!): Thank you very much!