Japanese linguistics in Lucene and Solr

An overview of the new Japanese language support in Apache Lucene and Apache Solr.

This is our talk from Berlin Buzzwords in 2012.

Transcript

  1. About me

    • MSc. in computer science, University of Oslo, Norway
    • Worked with search at FAST (now Microsoft) for 10 years
      - 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
      - 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
    • Founded アティリカ株式会社 in 2009
      - We help companies innovate using new technologies and good ideas
      - We do information retrieval, natural language processing and big data
      - We are based in Tokyo, but we have clients everywhere
    • Newbie Lucene & Solr committer
      - Mostly been working on Japanese language support (Kuromoji) so far
    • Please write to me at [email protected] or [email protected]
  2. Today’s talk

    • Japanese 101 - ordering beer and toasting
    • Japanese language processing
    • Japanese features in Lucene/Solr
  5. ＪＲ新宿駅の近くにビールを飲みに行こうか？

    Shall we go for a beer near JR Shinjuku station?
    JR Shinjuku eki no chikaku ni bīru o nomi ni ikō ka?
  6. Romaji - ローマ字

    • Latin characters (26+)
    • Used for proper nouns, etc.
    ＪＲ新宿駅の近くにビールを飲みに行こうか？
  7. Katakana - カタカナ

    • Phonetic script (~50)
    • Typically used for loan words
    ＪＲ新宿駅の近くにビールを飲みに行こうか？
  8. Kanji - 漢字

    • Chinese characters (50,000+)
    • Used for stems & proper nouns
    ＪＲ新宿駅の近くにビールを飲みに行こうか？
  9. Hiragana - ひらがな

    • Phonetic script (~50)
    • Used for inflections & particles
    ＪＲ新宿駅の近くにビールを飲みに行こうか？
  11. ＪＲ新宿駅の近くにビールを飲みに行こうか？

    What are the words in this sentence?
    Words are implicit in Japanese: there is no white space that separates them.
  18. Problems with n-gramming

    Bigrams: ＪＲ Ｒ新 新宿 宿駅 駅の の近 近く ...
    Change of semantics: the bigram 宿駅 means ‘post town’, ‘relay station’ or ‘stage’

    • Does not preserve meaning well and often changes semantics
      - Impacts ranking and search precision (many false positives)
    • Also generates many terms per document or query
      - Impacts index size and performance
    • Still sometimes appropriate for certain search applications
      - Compliance, e-commerce with special product names, ...
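The bigram problem described on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: `char_ngrams` is a hypothetical helper, not part of Lucene, and it ignores any special handling a real n-gram tokenizer may apply to Latin characters.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The start of the example sentence, with no word boundaries to guide us.
bigrams = char_ngrams("ＪＲ新宿駅の近く")
print(bigrams)

# The spurious term 宿駅 ('post town') appears even though the text
# is about 新宿 (Shinjuku) and 駅 (station).
print("宿駅" in bigrams)
```

A query for 宿駅 would therefore match this text, which is the kind of false positive the slide warns about.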
  23. Morphological analysis

    ＪＲ新宿駅の近くにビールを飲みに行こうか？
    Shall we go for a beer near JR Shinjuku station?

    Segmented: ＪＲ | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ？

    • Tokens reflect what a Japanese speaker considers to be words
    • Machine-learned statistical approach
      - Conditional Random Fields (CRFs) decoded using Viterbi
    • Also does part-of-speech tagging, readings for kanji, etc.
    • Several statistical models available with high accuracy (F > 0.97)
      - Models/dictionaries are available as IPADIC, UniDic, ...
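To give a feel for Viterbi-style decoding over a word lattice, here is a heavily simplified sketch. It is not Kuromoji's implementation: the lexicon and its costs are made up for illustration, and only per-word costs are used, whereas a CRF-based analyser also scores the transitions between adjacent words.

```python
import math

# Hypothetical toy lexicon: surface form -> cost (lower means more likely).
LEXICON = {
    "新宿": 1.0, "駅": 1.0, "の": 1.0, "近く": 1.0,
    "新": 3.0, "宿駅": 3.0, "近": 3.0, "く": 3.0,
}


def segment(text: str) -> list[str]:
    """Return the cheapest segmentation of text using words from LEXICON."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in LEXICON and best[start] + LEXICON[word] < best[end]:
                best[end] = best[start] + LEXICON[word]
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError("no segmentation possible with this lexicon")
    # Walk the back-pointers from the end of the text to recover the words.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


tokens = segment("新宿駅の近く")
print(tokens)
```

With these toy costs the path through 新宿 and 駅 is cheaper than the one through 新 and 宿駅, so the misleading bigram-style split is avoided.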
  26. Japanese in Lucene/Solr

    • New feature in Lucene/Solr 3.6
    • Available out-of-the-box
    • Easy to customise
    • Easy to use with reasonable defaults
    • Provides sophisticated Japanese linguistics
  33. Feature summary / text_ja analyzer chain

    • JapaneseTokenizer: segments Japanese text into tokens with very high accuracy
      - Token attributes for part-of-speech, base form, readings, etc.
      - Compound segmentation with compound synonyms
      - Segmentation is customisable using user dictionaries
    • JapaneseBaseFormFilter: adjective and verb lemmatisation (by reduction)
    • JapanesePartOfSpeechStopFilter: stop-word removal based on part-of-speech tags (see example/solr/conf/lang/stoptags_ja.txt)
    • CJKWidthFilter: character width normalisation (fast Unicode NFKC subset)
    • StopFilter: stop-word removal (see example/solr/conf/lang/stopwords_ja.txt)
    • JapaneseKatakanaStemFilter: normalises common katakana spelling variations
    • LowerCaseFilter: lowercases
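The idea of a tokenizer followed by a chain of filters can be sketched in plain Python. Everything below is a stand-in for illustration, not the Lucene API: the `Token` class, the English part-of-speech labels, and the tiny stop lists are assumptions, and the input is given pre-tokenized rather than produced by a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str
    pos: str  # part-of-speech tag, written in English here for readability


# Hypothetical stop lists; the real ones live in stoptags_ja.txt and stopwords_ja.txt.
STOP_TAGS = {"particle", "auxiliary verb", "symbol"}
STOP_WORDS = {"する", "ある"}


def pos_stop_filter(tokens):
    """Drop tokens whose part-of-speech tag is on the stop list."""
    return [t for t in tokens if t.pos not in STOP_TAGS]


def stop_filter(tokens):
    """Drop tokens whose surface form is a stop word."""
    return [t for t in tokens if t.surface not in STOP_WORDS]


def lower_case_filter(tokens):
    """Lowercase any Latin characters in the surface form."""
    return [Token(t.surface.lower(), t.pos) for t in tokens]


def analyze(tokens, filters):
    """Apply each filter in order, feeding the output of one into the next."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokenized = [
    Token("JR", "noun"), Token("新宿", "noun"), Token("駅", "noun"),
    Token("の", "particle"), Token("近く", "noun"),
]
chain = [pos_stop_filter, stop_filter, lower_case_filter]
terms = [t.surface for t in analyze(tokenized, chain)]
print(terms)
```

The particle の is removed because of its tag rather than its spelling, which is the point of filtering on part-of-speech.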
  36. Compound nouns

    How do we deal with compound nouns?

    Japanese | English
    関西国際空港 | Kansai International Airport
    シニアソフトウェアエンジニア | Senior Software Engineer

    These are one word in Japanese, so searching for 空港 (airport) doesn’t match.
    We need to segment the compounds, too.
  39. Compound segmentation

    関西国際空港 (Kansai International Airport)
      → 関西 (Kansai) | 国際 (International) | 空港 (Airport)
    シニアソフトウェアエンジニア (Senior Software Engineer)
      → シニア (Senior) | ソフトウェア (Software) | エンジニア (Engineer)

    We are using a heuristic to implement this.
  43. Compound synonym tokens

    Position 1 | Position 2 | Position 3
    関西 | 国際 | 空港
    関西国際空港 (compound synonym)

    • Segment the compounds into their parts
      - Good for recall: we can also search and match 空港 (airport)
    • We keep the compound itself as a synonym
      - Good for precision with an exact hit because of IDF
    • Approach benefits both precision and recall for overall good ranking
    • JapaneseTokenizer actually returns a graph of tokens
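A small sketch may help to picture the compound and its parts sharing positions. It is an illustration under assumptions: `compound_tokens` and the `(term, position, position_length)` tuples are hypothetical, and the order and attributes of the tokens that JapaneseTokenizer really emits may differ.

```python
from collections import defaultdict


def compound_tokens(parts, start=1):
    """Yield (term, position, position_length) for a compound and its parts."""
    # The compound starts where its first part starts and spans all parts.
    yield "".join(parts), start, len(parts)
    for offset, part in enumerate(parts):
        yield part, start + offset, 1


# A toy inverted index: term -> set of positions where it occurs.
index = defaultdict(set)
for term, position, length in compound_tokens(["関西", "国際", "空港"]):
    index[term].add(position)

print(dict(index))
```

Because both 空港 and 関西国際空港 are indexed, a query for the part still matches (recall), while a query for the whole compound hits a rarer, more specific term (precision).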
  46. Lemmatisation

    Japanese adjectives and verbs are highly inflected. How do we deal with that?

    Dictionary form: 買う (kau, to buy)

    Inflected forms (not exhaustive):
    買いなさい 買いなさるな 買いましたら 買いましたり 買いまして 買いましょう 買います 買いますまい
    買いませば 買いません 買いませんで 買いませんでした 買える 買おう 買った 買ったら
    買ったり 買って 買わせない 買わせます 買わせません 買わせられない 買わせられます 買わせられません
    買いませんでしたら 買いませんでしたり 買いませんなら 買うだろう 買うでしょう 買うな 買うまい 買え
    買えない 買えば 買えます 買えません 買わせられる 買わせる 買わない 買わないだろう
    買わないで 買わないでしょう 買わなかった 買わなかったら 買わなかったり 買わなければ 買われない 買われます

    Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction).
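Lemmatisation by reduction can be pictured as replacing each token with the base form that the morphological analyser has already attached to it. The sketch below assumes a hypothetical lookup table standing in for that attribute; it is not how JapaneseBaseFormFilter is implemented, only an illustration of the effect.

```python
# Hypothetical excerpt of base forms an analyser would attach to verb stems.
BASE_FORMS = {
    "買い": "買う",
    "買っ": "買う",
    "買わ": "買う",
    "まし": "ます",
}


def base_form_filter(surfaces):
    """Replace each token with its base form when one is known."""
    return [BASE_FORMS.get(surface, surface) for surface in surfaces]


# 買った ('bought') and 買わない ('do not buy'), given here as pre-segmented tokens.
bought = base_form_filter(["買っ", "た"])
not_buy = base_form_filter(["買わ", "ない"])
print(bought, not_buy)
```

Both inputs now share the term 買う, so a search for the dictionary form can match either inflection.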
  48. Character width normalisation

    How do we deal with character widths?

    Half-width (半角) | Full-width (全角)
    Lucene | Ｌｕｃｅｎｅ
    ｶﾀｶﾅ | カタカナ
    123 | １２３

    Input text: Ｌｕｃｅｎｅ | ｶﾀｶﾅ | １２３
    After CJKWidthFilter: Lucene (half-width) | カタカナ (full-width) | 123 (half-width)

    Use CJKWidthFilter to normalise them (Unicode NFKC subset).
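The effect on this example can be checked with Python's standard `unicodedata` module. Note the assumption: this applies full NFKC normalisation, whereas CJKWidthFilter is described above as a fast subset of it, so the two can differ on other input.

```python
import unicodedata

# Full-width Latin letters, half-width katakana, full-width digits.
text = "Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"

# NFKC folds full-width Latin/digits to ASCII and half-width katakana to full-width.
normalised = unicodedata.normalize("NFKC", text)
print(normalised)
```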
  50. Katakana end-vowel stemming

    A common spelling variation in katakana is a trailing long-vowel sound (ー).

    English: manager
    Japanese spelling variations: マネージャー, マネージャ, マネジャー

    Input text: コピー | マネージャー | マネージャ | マネジャー
    After JapaneseKatakanaStemFilter: コピー | マネージャ | マネージャ | マネジャ
    Gloss: copy | manager | manager | “manager”

    We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms.
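The behaviour in this example can be approximated with a short rule: drop a trailing long-vowel mark from katakana terms that are long enough. This is a sketch rather than the filter's source; in particular, the minimum length of 4 is an assumption chosen so that the short word コピー is left alone, as on the slide.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = 4) -> str:
    """Strip one trailing ー from katakana terms of at least minimum_length."""
    if (len(term) >= minimum_length
            and term.endswith(PROLONGED_SOUND_MARK)
            and is_katakana(term)):
        return term[:-1]
    return term


stemmed = [stem_katakana(t) for t in ["コピー", "マネージャー", "マネージャ", "マネジャー"]]
print(stemmed)
```

Note that マネジャ still differs from マネージャ after stemming, which is presumably why the last gloss on the slide is in quotes.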
  51. User dictionaries

    • Your own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
    • The file format is simple, and there is no need to assign weights, etc. before using them
    • Example custom dictionary:

        # Custom segmentation and POS entry for long entries
        関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

        # Custom reading and POS for former sumo wrestler Asashoryu
        朝青龍,朝青龍,アサショウリュウ,カスタム人名
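Reading the example entries, each line appears to hold four comma-separated fields: the surface form, its space-separated segmentation, the space-separated readings, and a part-of-speech name. The parser below is a sketch based on that reading of the example; `Entry` and `parse_user_dictionary` are hypothetical names, not part of Lucene.

```python
import csv
from typing import NamedTuple


class Entry(NamedTuple):
    surface: str
    segments: list
    readings: list
    pos: str


def parse_user_dictionary(text: str) -> list:
    """Parse dictionary lines, skipping blank lines and # comments."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        entries.append(Entry(surface, segmentation.split(), readings.split(), pos))
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

entries = parse_user_dictionary(SAMPLE)
print(entries)
```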
  52. Japanese focus in 4.0

    • Improvements in JapaneseTokenizer
      - Improved search mode for katakana compounds
      - Improved unknown word segmentation
      - Some performance improvements
    • CharFilters for various character normalisations
      - Dates and numbers
      - Repetition marks (odoriji)
    • Japanese spell-checker
      - Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  53. Acknowledgements

    • Robert Muir: thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for friendly help
    • Michael McCandless: thanks for streaming Viterbi and synonym compounds
    • Uwe Schindler: thanks for performance improvements and for being the policeman
    • Simon Willnauer: thanks for doing the Kuromoji code donation process so well
    • Gaute Lambertsen & Gerry Hocks: thanks for presentation feedback and for being great colleagues