Japanese linguistics in Lucene and Solr

Japanese linguistics in Apache Lucene™ and Apache Solr™
Christian Moen
[email protected]
June 5th, 2012

About me
• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
  • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
  • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 (Atilika Inc.) in 2009
  • We help companies innovate using new technologies and good ideas
  • We do information retrieval, natural language processing and big data
  • We are based in Tokyo, but we have clients everywhere
• Newbie Lucene & Solr Committer
  • Mostly been working on Japanese language support (Kuromoji) so far
• Please write to me at [email protected] or [email protected]

Today’s talk
• Japanese 101 - ordering beer and toasting
• Japanese language processing
• Japanese features in Lucene/Solr

Japanese 101

ビールください
bi-ru kudasai
A beer, please

ありがとうございます！
arigatō gozaimasu!
Thank you very much!

乾杯！
kanpai!
Cheers!

JR新宿駅の近くにビールを飲みに行こうか？
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?
Shall we go for a beer near JR Shinjuku station?

JR新宿駅の近くにビールを飲みに行こうか？
• Romaji (ローマ字): Latin characters (26+), used for proper nouns, etc. Here: JR
• Kanji (漢字): Chinese characters (50,000+), used for stems & proper nouns. Here: 新宿駅, 近, 飲, 行
• Hiragana (ひらがな): phonetic script (~50), used for inflections & particles. Here: の, くに, を, みに, こうか
• Katakana (カタカナ): phonetic script (~50), typically used for loan words. Here: ビール

JR新宿駅の近くにビールを飲みに行こうか？
What are the words in this sentence?
Words are implicit in Japanese - there is no white space that separates them

JR新宿駅の近くにビールを飲みに行こうか？
How do we index this for search, then?
We need to segment text into tokens first

Two major approaches for segmentation
1. n-gramming
2. morphological analysis (statistical approach)

n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか？
Shall we go for a beer near JR Shinjuku station?
n=2: JR, R新, 新宿, 宿駅, 駅の, の近, 近く, ...

Problems with n-gramming
JR, R新, 新宿, 宿駅, 駅の, の近, 近く, ...
• Several of these bigrams are not words at all
• 宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’
• Does not preserve meaning well and often changes semantics
  • Impacts ranking and search precision (many false positives)
• Also generates many terms per document or query
  • Impacts index size and performance
• Still sometimes appropriate for certain search applications
  • Compliance, e-commerce with special product names, ...
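
The bigram list above can be reproduced with a few lines of Python. This is only a minimal sketch of character n-gramming, not the tokenizer Lucene ships; the function name `char_ngrams` is made up for illustration.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか？"
bigrams = char_ngrams(sentence)

# The first few bigrams match the slide, including the misleading 宿駅.
print(bigrams[:7])
```

Note that a sentence of 21 characters already yields 20 bigrams, which is where the index-size problem comes from.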

Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか？
Shall we go for a beer near JR Shinjuku station?
Segmented into tokens: JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ？
• Tokens reflect what a Japanese speaker considers to be words
• Machine-learned statistical approach
  • Conditional Random Fields (CRFs) decoded using Viterbi
• Also does part-of-speech tagging, readings for kanji, etc.
• Several statistical models available with high accuracy (F > 0.97)
  • Models/dictionaries are available as IPADIC, UniDic, ...

How does this actually work?
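
As a rough illustration of the Viterbi idea, the sketch below picks the cheapest path through all possible dictionary segmentations of a string. It is a simplification under stated assumptions: the lexicon and its costs are invented for this example, and only per-word costs are used, whereas a real CRF-trained model also scores the connection between adjacent tokens. It is not Kuromoji's implementation.

```python
import math

# Hypothetical toy lexicon: surface form -> cost (lower means more likely).
# Real models learn these costs, plus connection costs between adjacent tokens.
LEXICON = {
    "JR": 2.0, "新宿": 2.0, "宿駅": 6.0, "駅": 3.0, "新": 5.0, "宿": 5.0,
    "の": 1.0, "近く": 2.0, "近": 5.0, "く": 4.0, "に": 1.0,
}
UNKNOWN_COST = 10.0  # assumed penalty for a single out-of-lexicon character
MAX_WORD_LENGTH = max(len(word) for word in LEXICON)


def segment(text):
    """Return the minimum-cost segmentation of text as a list of tokens."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last token on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LENGTH), end):
            word = text[start:end]
            if word in LEXICON:
                cost = LEXICON[word]
            elif len(word) == 1:
                cost = UNKNOWN_COST
            else:
                continue  # multi-character strings must be in the lexicon
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


print(segment("JR新宿駅の近くに"))
```

With these made-up costs the cheapest path is JR, 新宿, 駅, の, 近く, に: the candidate 宿駅 is in the lexicon but loses because 新宿 followed by 駅 is cheaper overall.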

Japanese support in Lucene and Solr

Japanese in Lucene/Solr
• New feature in Lucene/Solr 3.6
• Available out-of-the-box
• Easy to use with reasonable defaults
• Provides sophisticated Japanese linguistics
• Customisable

How do we use it?
• Use JapaneseAnalyzer
• Use field type “text_ja” in example schema.xml

Feature summary / text_ja analyzer chain
• JapaneseTokenizer: segments Japanese text into tokens with very high accuracy
  • Token attributes for part-of-speech, base form, readings, etc.
  • Compound segmentation with compound synonyms
  • Segmentation is customisable using user dictionaries
• JapaneseBaseFormFilter: adjective and verb lemmatisation (by reduction)
• JapanesePartOfSpeechStopFilter: stop-word removal based on part-of-speech tags
  • See example/solr/conf/lang/stoptags_ja.txt
• CJKWidthFilter: character width normalisation (fast Unicode NFKC subset)
• StopFilter: stop-word removal
  • See example/solr/conf/lang/stopwords_ja.txt
• JapaneseKatakanaStemFilter: normalises common katakana spelling variations
• LowerCaseFilter: lowercases

Feature details

Compound nouns
How do we deal with compound nouns?
Japese / English examples:
• 関西国際空港 / Kansai International Airport
• シニアソフトウェアエンジニア / Senior Software Engineer
These are one word in Japanese, so searching for 空港 (airport) doesn’t match
We need to segment the compounds, too

Compound segmentation
• 関西国際空港 (Kansai International Airport) is segmented into 関西 (Kansai), 国際 (International), 空港 (Airport)
• シニアソフトウェアエンジニア (Senior Software Engineer) is segmented into シニア (Senior), ソフトウェア (Software), エンジニア (Engineer)
We are using a heuristic to implement this

Compound synonym tokens
• Position 1: 関西
• Position 2: 国際
• Position 3: 空港
• Positions 1 to 3: 関西国際空港 (the compound, kept as a synonym)

• Segment the compounds into their parts
  • Good for recall - we can also search and match 空港 (airport)
• We keep the compound itself as a synonym
  • Good for precision with an exact hit because of IDF
• Approach benefits both precision and recall for overall good ranking
• JapaneseTokenizer actually returns a graph of tokens
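
One way to picture the token graph is as a list of tokens that each carry a position increment and a position length. The sketch below is illustrative only: the field names are borrowed from Lucene's position increment and position length token attributes, but the `Token` class, the helper functions and the order in which tokens are emitted are assumptions of this example, not the tokenizer's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position_increment: int   # 0 means "same position as the previous token"
    position_length: int = 1  # number of positions the token spans


def compound_tokens(compound, parts):
    """Emit the compound as a synonym spanning all of its parts, then the parts."""
    if "".join(parts) != compound:
        raise ValueError("parts must concatenate to the compound")
    tokens = [Token(compound, 1, len(parts))]
    for index, part in enumerate(parts):
        tokens.append(Token(part, 0 if index == 0 else 1))
    return tokens


def by_position(tokens):
    """Group terms by their absolute (1-based) position."""
    positions = {}
    current = 0
    for token in tokens:
        current += token.position_increment
        positions.setdefault(current, []).append(token.term)
    return positions


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
print(by_position(tokens))
```

A query for 空港 matches the token at position 3, while a query for the full compound matches the single rarer term at position 1, which is where the IDF benefit comes from.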

Lemmatisation
Japanese adjectives and verbs are highly inflected. How do we deal with that?
Dictionary form: 買う (kau, to buy)
Inflected forms (not exhaustive):
買いなさい, 買いなさるな, 買いましたら, 買いましたり, 買いまして, 買いましょう, 買います, 買いますまい, 買いませば, 買いません, 買いませんで, 買いませんでした,
買いませんでしたら, 買いませんでしたり, 買いませんなら, 買うだろう, 買うでしょう, 買うな, 買うまい, 買え, 買えない, 買えば, 買えます, 買えません,
買える, 買おう, 買った, 買ったら, 買ったり, 買って, 買わせない, 買わせます, 買わせません, 買わせられない, 買わせられます, 買わせられません,
買わせられる, 買わせる, 買わない, 買わないだろう, 買わないで, 買わないでしょう, 買わなかった, 買わなかったら, 買わなかったり, 買わなければ, 買われない, 買われます
Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
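
Conceptually, lemmatisation by reduction just swaps each token's surface form for the dictionary form that morphological analysis already attached to it. The sketch below assumes a hypothetical `Token` type with a `base_form` field and a hand-written analysis of 買った; it illustrates the idea and is not the filter's real code.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Token(NamedTuple):
    surface: str                     # text as it appears in the input
    base_form: Optional[str] = None  # dictionary form, if it differs from the surface


def base_form_filter(tokens: Iterable[Token]) -> Iterator[str]:
    """Yield the dictionary form where one is known, else the surface form."""
    for token in tokens:
        yield token.base_form if token.base_form else token.surface


# Assumed, hand-written analysis of ビールを買った ("bought a beer").
analysed = [Token("ビール"), Token("を"), Token("買っ", "買う"), Token("た")]
print(list(base_form_filter(analysed)))
```

Because documents and queries go through the same reduction, a search for 買う can match text that only contains an inflected form.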

Character width normalisation
How do we deal with character widths?
Half-width (半角) / Full-width (全角):
• Lucene / Ｌｕｃｅｎｅ
• ｶﾀｶﾅ / カタカナ
• 123 / １２３
Input text: Ｌｕｃｅｎｅ (full-width), ｶﾀｶﾅ (half-width), １２３ (full-width)
After CJKWidthFilter: Lucene (half-width), カタカナ (full-width), 123 (half-width)
Use CJKWidthFilter to normalise them (Unicode NFKC subset)
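
The same folding can be tried out with Python's standard library. Note the assumption: this sketch applies full Unicode NFKC normalisation, which does more than width folding, whereas the slide describes CJKWidthFilter as a fast subset of it. For the examples on the slide the results are the same.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin/digits and half-width katakana using NFKC."""
    return unicodedata.normalize("NFKC", text)


for sample in ["Ｌｕｃｅｎｅ", "ｶﾀｶﾅ", "１２３"]:
    print(sample, "->", normalize_width(sample))
```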

Katakana end-vowel stemming
A common spelling variation in katakana is an end long-vowel sound
English: manager
Japanese spelling variations: マネージャー, マネージャ, マネジャー
Input text: コピー, マネージャー, マネージャ, マネジャー
After JapaneseKatakanaStemFilter: コピー (copy), マネージャ (manager), マネージャ (manager), マネジャ (“manager”)
We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
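
The rule on this slide can be sketched as: if a term is all katakana, long enough, and ends in the prolonged sound mark ー, drop that final mark. The minimum length of 4 characters used below is an assumption chosen so that the short word コピー is left alone, as on the slide; the function names are illustrative.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term):
    """True if every character is in the Unicode Katakana block."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, minimum_length=4):
    """Strip one trailing ー from sufficiently long katakana terms."""
    if (len(term) >= minimum_length
            and term.endswith(PROLONGED_SOUND_MARK)
            and is_katakana(term)):
        return term[:-1]
    return term


for word in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(word, "->", stem_katakana(word))
```

Two of the three spellings of "manager" collapse to マネージャ; マネジャー becomes マネジャ, which is still a distinct term, hence the quotation marks on the slide.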

User dictionaries
• Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
• File format is simple and there’s no need to assign weights, etc. before using them
• Example custom dictionary:
    # Custom segmentation and POS entry for long entries
    関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
    # Custom reading and POS former sumo wrestler Asashoryu
    朝青龍,朝青龍,アサショウリュウ,カスタム人名
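
To make the four-column format concrete, here is a small sketch that reads such a dictionary into structured entries. The column meanings (surface form, space-separated segmentation, space-separated readings, part-of-speech name) are inferred from the example above; the `UserEntry` type and `parse_user_dictionary` function are hypothetical names, not part of Lucene.

```python
import csv
from typing import List, NamedTuple


class UserEntry(NamedTuple):
    surface: str         # text to match in the input
    segments: List[str]  # how the surface form should be segmented
    readings: List[str]  # one katakana reading per segment
    part_of_speech: str  # custom part-of-speech name


def parse_user_dictionary(text):
    """Parse comma-separated entries, skipping blank lines and # comments."""
    lines = [line for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        entries.append(UserEntry(surface, segmentation.split(), readings.split(), pos))
    return entries


EXAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

entries = parse_user_dictionary(EXAMPLE)
for entry in entries:
    print(entry.surface, entry.segments, entry.part_of_speech)
```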

Japanese focus in 4.0
• Improvements in JapaneseTokenizer
  • Improved search mode for katakana compounds
  • Improved unknown word segmentation
  • Some performance improvements
• CharFilters for various character normalisations
  • Dates and numbers
  • Repetition marks (odoriji)
• Japanese spell-checker
  • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary

Acknowledgements
• Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for friendly help
• Michael McCandless: Thanks for streaming Viterbi and synonym compounds
• Uwe Schindler: Thanks for performance improvements + being the policeman
• Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
• Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues

ありがとうございました！
arigatō gozaimashita!
Thank you very much!
