Language support and linguistics in Lucene, Solr and ElasticSearch, and the eco-system

Slide 1

Slide 1 text

Language support and linguistics in Lucene, Solr and ElasticSearch and the eco-system Christian Moen [email protected] June 3rd, 2013

Slide 2

Slide 2 text

About me • MSc in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, technical sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 (Atilika Inc.) in October 2009 • We help companies innovate using new technologies and good ideas • We do information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • We are a small company, but our customers are typically very big companies • Newbie Lucene & Solr committer • Have mostly been working on Japanese language support (Kuromoji) so far • Working on Korean support from a code donation (LUCENE-4956) • Please write to me at [email protected] or [email protected]

Slide 3

Slide 3 text

About this talk • Basic searching and matching • Challenges with natural language • Basic measurements for search quality • Linguistics in Apache Lucene • Linguistics in ElasticSearch (quick intro) • Linguistics in Apache Solr • Linguistics in the NLP eco-system • Summary and practical advice

Slide 4

Slide 4 text

Hands-on demos • Hands-on 1: Working with Apache Lucene analyzers • Hands-on 2: Multi-lingual search using ElasticSearch • Hands-on 3: Multi-lingual search with Apache Solr • Hands-on 4: Other text processing using OpenNLP

Slide 5

Slide 5 text

What is a search engine?

Slide 6

Slide 6 text

Documents Two documents (1 & 2) with English text: (1) Sushi is very tasty in Japan (2) Visiting the Tsukiji fish market is very fun

Slide 7

Slide 7 text

Text segmentation Step 1, two documents (1 & 2) with English text: (1) Sushi is very tasty in Japan (2) Visiting the Tsukiji fish market is very fun Step 2, documents are turned into searchable terms (tokenization): (1) Sushi | is | very | tasty | in | Japan (2) Visiting | the | Tsukiji | fish | market | is | very | fun

Slide 8

Slide 8 text

Text segmentation Step 1, two documents (1 & 2) with English text: (1) Sushi is very tasty in Japan (2) Visiting the Tsukiji fish market is very fun Step 2, documents are turned into searchable terms (tokenization): (1) Sushi | is | very | tasty | in | Japan (2) Visiting | the | Tsukiji | fish | market | is | very | fun Step 3, terms/tokens are converted to lowercase form (normalization): (1) sushi | is | very | tasty | in | japan (2) visiting | the | tsukiji | fish | market | is | very | fun
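The two steps on this slide (tokenization, then lowercase normalization) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the regex-based splitting and the function name `analyze` are hypothetical and far cruder than what Lucene's tokenizers do.

```python
import re

def analyze(text):
    """Split text into word tokens, then lowercase each token."""
    tokens = re.findall(r"\w+", text)      # tokenization: naive runs of word characters
    return [t.lower() for t in tokens]     # normalization: lowercase form

docs = {
    1: "Sushi is very tasty in Japan",
    2: "Visiting the Tsukiji fish market is very fun",
}

# Map each document id to its list of normalized tokens.
analyzed = {doc_id: analyze(text) for doc_id, text in docs.items()}
for doc_id, tokens in analyzed.items():
    print(doc_id, tokens)
```

Splitting on word characters works for the English examples here, but as later slides show, it is not enough for languages such as Japanese.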

Slide 9

Slide 9 text

Document indexing Tokenized documents with normalized tokens: (1) sushi | is | very | tasty | in | japan (2) visiting | the | tsukiji | fish | market | is | very | fun

Slide 10

Slide 10 text

Document indexing Tokenized documents with normalized tokens: (1) sushi | is | very | tasty | in | japan (2) visiting | the | tsukiji | fish | market | is | very | fun Inverted index, where tokens are mapped to the ids of the documents that contain them: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2
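As a rough illustration of the inverted index on this slide, the mapping from tokens to document ids can be built with a dictionary of sets. The function name `build_inverted_index` is hypothetical, and a real Lucene index stores much more (positions, frequencies, and so on).

```python
from collections import defaultdict

def build_inverted_index(analyzed_docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in analyzed_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

# The two tokenized, lowercased documents from the slide.
analyzed_docs = {
    1: "sushi is very tasty in japan".split(),
    2: "visiting the tsukiji fish market is very fun".split(),
}

index = build_inverted_index(analyzed_docs)
for token, doc_ids in index.items():
    print(token, sorted(doc_ids))
```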

Slide 11

Slide 11 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2

Slide 12

Slide 12 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Query: very tasty sushi

Slide 13

Slide 13 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Parsed query: very AND tasty AND sushi

Slide 17

Slide 17 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Parsed query: very AND tasty AND sushi Result: 1 hit (document 1)

Slide 19

Slide 19 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Query: visit fun market

Slide 20

Slide 20 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Parsed query: visit AND fun AND market

Slide 22

Slide 22 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Parsed query: visit AND fun AND market Problem: visit ≠ visiting

Slide 23

Slide 23 text

Searching Inverted index: sushi → 1; is → 1, 2; very → 1, 2; tasty → 1; in → 1; japan → 1; visiting → 2; the → 2; tsukiji → 2; fish → 2; market → 2; fun → 2 Parsed query: visit AND fun AND market Result: no hits (all terms need to match)
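The two searches walked through above can be reproduced with a small sketch: an AND query is the intersection of the posting sets of its terms. The function name `search_and` is an assumption made for illustration, and the index is copied from the slides.

```python
# Inverted index from the slides: token -> set of document ids.
index = {
    "sushi": {1}, "is": {1, 2}, "very": {1, 2}, "tasty": {1},
    "in": {1}, "japan": {1}, "visiting": {2}, "the": {2},
    "tsukiji": {2}, "fish": {2}, "market": {2}, "fun": {2},
}

def search_and(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())   # a missing term empties the result
    return result

print(search_and(index, "very tasty sushi"))   # document 1 matches all three terms
print(search_and(index, "visit fun market"))   # "visit" is not in the index
```

The second query fails only because the index holds `visiting` rather than `visit`, which is the matching problem that stemming and other linguistic processing try to address.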

Slide 24

Slide 24 text

What’s the problem? • Search engines are not magical answering machines • They match terms in queries against terms in documents, and order matches by rank

Slide 25

Slide 25 text

Key takeaways • Text processing affects search quality in a big way because it affects matching • The “magic” of a search engine is often provided by high quality text processing • Garbage in ⇒ garbage out

Slide 26

Slide 26 text

Natural language and search

Slide 27

Slide 27 text

日本語 English Deutsch Français العربية

Slide 28

Slide 28 text

English Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles.

Slide 29

Slide 29 text

English Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles. How do we want to index world's?

Slide 30

Slide 30 text

English Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles. How do we want to index world's? Should a search for style match styles? And should ferment match fermentation?

Slide 31

Slide 31 text

German Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München statt.

Slide 32

Slide 32 text

Slide 33

Slide 33 text

German Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München statt. The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich. How do we want to search ü, ö and ß?

Slide 34

Slide 34 text

Slide 35

Slide 35 text

French Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée.

Slide 36

Slide 36 text

French Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. Champagne is a French sparkling wine with a protected designation of origin.

Slide 37

Slide 37 text

French Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. Champagne is a French sparkling wine with a protected designation of origin. How do we want to search é, ç and ô?

Slide 38

Slide 38 text

French Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. Champagne is a French sparkling wine with a protected designation of origin. How do we want to search é, ç and ô? Do we want a search for aoc to match appellation d'origine contrôlée?
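One possible answer to the question about é, ç and ô is to fold accented characters to their base letters at both index and query time. The sketch below uses Python's standard `unicodedata` module; the function name `fold_accents` is hypothetical, and whether folding is desirable at all depends on the application.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["protégé", "français", "contrôlée"]:
    print(word, "->", fold_accents(word))
```

Note that this approach does not touch characters without a canonical decomposition, such as the German ß, which would need an explicit mapping.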

Slide 39

Slide 39 text

Arabic تُعْتَبَر القَهْوَة العَرَبِيّة الأصِـــــــيْلَة رَمْزاً مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي.

Slide 40

Slide 40 text

Arabic تُعْتَبَر القَهْوَة العَرَبِيّة الأصِـــــــيْلَة رَمْزاً مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. Reads from right to left

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Arabic تُعْتَبَر القَهْوَة العَرَبِيّة الأصِـــــــيْلَة رَمْزاً مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. How do we want to search الأصِـــــــيْلَة?

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Arabic تعتبر القهوة العربية الأصـــــــيلة رمزا من رموز الكرم عند العرب في العالم العربي. Diacritics normalized (removed) Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. How do we want to search الأصِـــــــيْلَة? Do we want to normalize diacritics?
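Removing Arabic diacritics, as shown on this slide, can be sketched as stripping the short-vowel marks in the Unicode range U+064B to U+0652. Also removing the tatweel (U+0640, the elongation character) is an assumption added here, not something the slide states, and `normalize_arabic` is a hypothetical name.

```python
import re

# Fathatan (U+064B) through sukun (U+0652): the common Arabic diacritics.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character; removing it is an assumption of this sketch

def normalize_arabic(text):
    """Strip diacritics and tatweel so vocalized and plain spellings match."""
    return ARABIC_DIACRITICS.sub("", text).replace(TATWEEL, "")

vocalized = "\u0631\u064E\u0645\u0652\u0632"   # "ramz" (symbol) written with diacritics
print(normalize_arabic(vocalized))
```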

Slide 45

Slide 45 text

Slide 46

Slide 46 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？

Slide 47

Slide 47 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station?

Slide 48

Slide 48 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? What are the words in this sentence? Which tokens do we index?

Slide 49

Slide 49 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? What are the words in this sentence? Which tokens do we index? Words are implicit in Japanese - there is no white space that separates them

Slide 50

Slide 50 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? What are the words in this sentence? Which tokens do we index? Words are implicit in Japanese - there is no white space that separates them But how do we find the tokens?

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? Do we want 飲む (to drink) to match 飲み?

Slide 53

Slide 53 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? Do we want 飲む (to drink) to match 飲み? Do we want ﾋﾞｰﾙ to match ビール? Does half-width match full-width?

Slide 54

Slide 54 text

Japanese ＪＲ新宿駅の近くにビールを飲みに行こうか？ Shall we go for a beer near JR Shinjuku station? Do we want 飲む (to drink) to match 飲み? Do we want ﾋﾞｰﾙ to match ビール? Does half-width match full-width? Do we want (emoji) to match?
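For the half-width versus full-width question, one common character normalization is Unicode NFKC, which maps half-width katakana to full-width katakana and full-width Latin letters to ASCII. The sketch below uses only Python's standard library; `normalize_width` is a hypothetical helper name, and NFKC is one option rather than the only one.

```python
import unicodedata

def normalize_width(text):
    """Apply NFKC so half-width and full-width variants compare equal."""
    return unicodedata.normalize("NFKC", text)

half_width_beer = "\uFF8B\uFF9E\uFF70\uFF99"   # half-width katakana for "beer"
full_width_beer = "\u30D3\u30FC\u30EB"         # full-width katakana for "beer"

print(normalize_width(half_width_beer) == full_width_beer)
print(normalize_width("\uFF2A\uFF32"))         # full-width Latin letters J and R
```

NFKC does not help with inflection (飲む versus 飲み) or with finding token boundaries; those need morphological analysis such as Kuromoji provides.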

Slide 55

Slide 55 text

Common traits •Segmenting source text into tokens • Dealing with non-space separated languages • Handling punctuation in space separated languages • Segmenting compounds into their parts • Apply relevant linguistic normalizations • Character normalization • Morphological (or grammatical) normalizations • Spelling variations • Synonyms and stopwords

Slide 56

Slide 56 text

Key take-aways • Natural language is very complex • Each language is different with its own set of complexities • We have had a high level look at Japanese, English, German, French and Arabic • But there is also Greek, Hebrew, Chinese, Korean, Russian, Thai, Spanish and many more ... • Search needs per-language processing • Many considerations to be made (often application-specific)

Slide 57

Slide 57 text

Basic search quality measurements

Slide 58

Slide 58 text

Precision Fraction of retrieved documents that are relevant: $\mathrm{precision} = \dfrac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{retrieved docs}\}|}$

Slide 59

Slide 59 text

Recall Fraction of relevant documents that are retrieved: $\mathrm{recall} = \dfrac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{relevant docs}\}|}$

Slide 60

Slide 60 text

Precision vs. Recall Should I optimize for precision or recall?

Slide 61

Slide 61 text

Precision vs. Recall Should I optimize for precision or recall? • That depends on your application

Slide 62

Slide 62 text

Precision vs. Recall Should I optimize for precision or recall? • That depends on your application • A lot of tuning work is in practice often about improving recall without hurting precision
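Both measurements can be computed directly from a set of retrieved document ids and a set of relevant document ids. The sketch below follows the definitions from the two preceding slides; the function name `precision_recall` and the sample ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgment: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). Retrieving more documents tends to raise recall while risking lower precision.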

Slide 63

Slide 63 text

Linguistics in Lucene

Slide 64

Slide 64 text

Simplified architecture Index document or query

Slide 65

Slide 65 text

Simplified architecture Index document or query Lucene analysis chain / Analyzer 1. Analyzes queries or documents in a pipelined fashion before indexing or searching 2. Analysis itself is done by an analyzer on a per-field basis 3. Key plug-in point for linguistics in Lucene

Slide 66

Slide 66 text

Analyzers What does an Analyzer do?

Slide 67

Slide 67 text

Analyzers What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens

Slide 68

Slide 68 text

Analyzers What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens • Tokens are produced by a Tokenizer

Slide 69

Slide 69 text

Analyzers What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens • Tokens are produced by a Tokenizer • Tokens can be processed further by a chain of TokenFilters downstream

Slide 70

Slide 70 text

Analyzer high-level concepts Reader → Tokenizer → TokenFilter → TokenFilter → TokenFilter → ... Reader • Stream to be analyzed is provided by a Reader (from java.io) • Can have a chain of associated CharFilters (not discussed) Tokenizer • Segments text provided by the Reader into tokens • Most interesting things happen in the incrementToken() method TokenFilter • Updates, mutates or enriches tokens • Most interesting things happen in the incrementToken() method

Slide 71

Slide 71 text

Lucene processing example Le champagne est protégé par une appellation d'origine contrôlée.

Slide 72

Slide 72 text

Le champagne est protégé par une appellation d'origine contrôlée. FrenchAnalyzer

Slide 73

Slide 73 text

StandardTokenizer Le champagne est protégé par une appellation d'origine contrôlée. FrenchAnalyzer

Slide 74

Slide 74 text

StandardTokenizer Le champagne est protégé par une appellation d'origine contrôlée. Le champagne est protégé par une appellation d'origine contrôlée FrenchAnalyzer

Slide 75

Slide 75 text

StandardTokenizer Le champagne est protégé par une appellation d'origine contrôlée. Le champagne est protégé par une appellation d'origine contrôlée ElisionFilter FrenchAnalyzer

Slide 76

Slide 76 text

StandardTokenizer Le champagne est protégé par une appellation d'origine contrôlée. Le champagne est protégé par une appellation d'origine contrôlée ElisionFilter Le champagne est protégé par une appellation origine contrôlée FrenchAnalyzer

Slide 77

Slide 77 text

Slide 78

Slide 78 text

LowerCaseFilter le champagne est protégé par une appellation origine contrôlée

Slide 79

Slide 79 text

LowerCaseFilter le champagne est protégé par une appellation origine contrôlée StopFilter

Slide 80

Slide 80 text

LowerCaseFilter le champagne est protégé par une appellation origine contrôlée StopFilter champagne protégé appellation origine contrôlée

Slide 81

Slide 81 text

LowerCaseFilter le champagne est protégé par une appellation origine contrôlée StopFilter champagne protégé appellation origine contrôlée FrenchLightStemFilter

Slide 82

Slide 82 text

LowerCaseFilter le champagne est protégé par une appellation origine contrôlée StopFilter champagne protégé appellation origine contrôlée champagn proteg apel origin control FrenchLightStemFilter

Slide 83

Slide 83 text

FrenchAnalyzer Le champagne est protégé par une appellation d'origine contrôlée. → StandardTokenizer → ElisionFilter → LowerCaseFilter → StopFilter → FrenchLightStemFilter → champagn proteg apel origin control
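The chained design of the FrenchAnalyzer can be imitated with Python generators, where each stage consumes the token stream of the previous one. This is a toy model and not the Lucene API: the stop word and elision lists are tiny hypothetical samples, and the stemming stage is left out, so the output corresponds to the tokens after the StopFilter step.

```python
import re

STOP_WORDS = {"le", "est", "par", "une"}        # tiny illustrative stop list
ELISIONS = {"l", "d", "j", "m", "n", "s", "t", "c", "qu"}  # sample elided articles

def tokenizer(text):
    """Yield word tokens, keeping apostrophes inside a token (d'origine)."""
    for match in re.finditer(r"\w+(?:['’]\w+)*", text):
        yield match.group()

def elision_filter(tokens):
    """Strip a leading elided article such as d' or l'."""
    for token in tokens:
        m = re.match(r"(\w+)['’](.+)", token)
        yield m.group(2) if m and m.group(1).lower() in ELISIONS else token

def lowercase_filter(tokens):
    for token in tokens:
        yield token.lower()

def stop_filter(tokens):
    for token in tokens:
        if token not in STOP_WORDS:
            yield token

def analyze(text):
    """Chain the stages: tokenizer -> elision -> lowercase -> stop words."""
    return list(stop_filter(lowercase_filter(elision_filter(tokenizer(text)))))

print(analyze("Le champagne est protégé par une appellation d'origine contrôlée."))
```

Because each stage only sees the stream produced by the one before it, the order matters: the stop filter here relies on tokens already being lowercased.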

Slide 84

Slide 84 text

Analyzer processing model • Analyzers provide a TokenStream • Retrieve it by calling tokenStream(field, reader) • tokenStream() bundles together tokenizers and any additional filters necessary for analysis • Input is advanced by incrementToken() • Information about the token itself is provided by so-called TokenAttributes attached to the stream • Attributes for term text, offset, token type, etc. • TokenAttributes are updated on incrementToken()

Slide 85

Slide 85 text

Hands-on: Working with analyzers in code See demo code on http://github.com/atilika/berlin-buzzwords-2013

Slide 86

Slide 86 text

Synonyms

Slide 87

Slide 87 text

Synonyms • Synonyms are flexible and easy-to-use • Very powerful tools for improving recall • Two types of synonyms • One way/mapping “sparkling wine => champagne” • Two way/equivalence “aoc, appellation d'origine contrôlée” • Can be applied index-time or query-time • Apply synonyms on one side - not both • Best practice is to apply synonyms query-side • Allows for updating synonyms without reindexing • Allows for turning synonyms on and off easily
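The difference between the two synonym types can be illustrated with a small query-side expansion sketch. The data structures and the function name `expand_query` are hypothetical; here a one-way mapping replaces the phrase with its target, while an equivalence group expands to all of its members.

```python
# One way/mapping: the left-hand phrase is rewritten to the right-hand terms.
ONE_WAY = {"sparkling wine": ["champagne"]}

# Two way/equivalence: any member expands to the whole group.
EQUIVALENT = [{"aoc", "appellation d'origine contrôlée"}]

def expand_query(phrase):
    """Return the list of phrases to search for, applied at query time."""
    phrase = phrase.lower()
    if phrase in ONE_WAY:
        return sorted(ONE_WAY[phrase])
    for group in EQUIVALENT:
        if phrase in group:
            return sorted(group)
    return [phrase]

print(expand_query("AOC"))
print(expand_query("sparkling wine"))
print(expand_query("beer"))
```

Since the expansion happens only when a query is processed, the synonym tables can be changed or switched off without reindexing any documents.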

Slide 88

Slide 88 text

Hands-on: French analysis with synonyms See demo code on http://github.com/atilika/berlin-buzzwords-2013

Slide 89

Slide 89 text

Linguistics in ElasticSearch (quick intro)

Slide 90

Slide 90 text

ElasticSearch linguistics highlights • Uses Lucene analyzers, tokenizers & filters • Analyzers are made available through a provider interface • Some analyzers are available through plugins, e.g. kuromoji, smartcn, icu • Analyzers can be set up in your mapping • Analyzers can also be chosen based on a field in your document, e.g. a lang field

Slide 91

Slide 91 text

Hands-on: Simple multi-language example See example on http://github.com/atilika/berlin-buzzwords-2013

Slide 92

Slide 92 text

Linguistics in Solr

Slide 93

Slide 93 text

Linguistics in Solr • Uses Lucene analyzers, tokenizers & filters • Linguistic processing is defined by field types in schema.xml • Different processing can be applied on indexing and querying side if desired • A rich set of pre-defined and ready-to-use per-language field types are available • Defaults can be used as starting points for further configuration or as they are

Slide 94

Slide 94 text

French in schema.xml

Slide 95

Slide 95 text

Arabic in schema.xml

Slide 96

Slide 96 text

Field types in schema.xml • text_ar Arabic • text_bg Bulgarian • text_ca Catalan • text_cjk CJK • text_cz Czech • text_da Danish • text_de German • text_el Greek • text_es Spanish • text_eu Basque • text_fa Farsi • text_fi Finnish • text_fr French • text_ga Irish • text_gl Galician • text_hi Hindi • text_hu Hungarian • text_hy Armenian • text_id Indonesian • text_it Italian • text_lv Latvian • text_nl Dutch • text_no Norwegian • text_pt Portuguese • text_ro Romanian • text_ru Russian • text_sv Swedish • text_th Thai • text_tr Turkish

Slide 97

Slide 97 text

Field types in schema.xml • text_ar Arabic • text_bg Bulgarian • text_ca Catalan • text_cjk CJK • text_cz Czech • text_da Danish • text_de German • text_el Greek • text_es Spanish • text_eu Basque • text_fa Farsi • text_fi Finnish • text_fr French • text_ga Irish • text_gl Galician • text_hi Hindi • text_hu Hungarian • text_hy Armenian • text_id Indonesian • text_it Italian • text_lv Latvian • text_nl Dutch • text_no Norwegian • text_pt Portuguese • text_ro Romanian • text_ru Russian • text_sv Swedish • text_th Thai • text_tr Turkish • text_ko Korean (coming soon, see LUCENE-4956)

Slide 98

Slide 98 text

Solr processing

Slide 99

Slide 99 text

Adding document details Index ∙∙∙

Slide 101

Slide 101 text

Adding document details Index id ... title ... body ... UpdateRequestHandler handles request 1. Receives a document via HTTP in XML (or JSON, CSV, ...) 2. Converts document to a SolrInputDocument 3. Activates the update chain

Slide 103

Slide 103 text

Adding document details Index id ... title ... body ... Update chain of UpdateRequestProcessors 1. Processes one document at a time with an operation (add) 2. Plugin logic can mutate the SolrInputDocument, e.g. add fields or do other processing as desired

Slide 104

Slide 104 text

Slide 105

Slide 105 text

Slide 106

Slide 106 text

Slide 107

Slide 107 text

Adding document details Index id ... title ... body ... lang ... Update chain of UpdateRequestProcessors 1. Update processor added a lang field by analyzing body 2. Finish by calling RunUpdateProcessor (usually)
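The idea of an update chain, where each processor may mutate the document before a final step hands it to the index, can be modelled in plain Python. Everything below is a hypothetical stand-in and not Solr code: the stop-word based `detect_language` is a deliberately crude substitute for a real language detector, and the final processor merely appends to a list.

```python
import re

# Tiny, hypothetical stop-word lists; real detectors use far richer models.
STOP_WORDS = {
    "en": {"the", "is", "and", "in", "of"},
    "fr": {"le", "la", "est", "et", "une", "par"},
    "de": {"das", "der", "ist", "und", "die"},
}

def detect_language(text):
    """Guess the language by counting stop-word hits per language."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def language_processor(doc):
    """Add a lang field by analyzing the body field."""
    doc["lang"] = detect_language(doc.get("body", ""))
    return doc

def make_index_processor(index):
    """Final step of the chain: hand the document over to the index."""
    def index_processor(doc):
        index.append(doc)
        return doc
    return index_processor

def run_update_chain(doc, processors):
    for processor in processors:
        doc = processor(doc)
    return doc

index = []
chain = [language_processor, make_index_processor(index)]
run_update_chain({"id": "1", "title": "Champagne",
                  "body": "Le champagne est un vin pétillant français"}, chain)
print(index[0]["lang"])
```

The detected `lang` value could then be used to route the body text to a language-specific field, which is where per-field analysis takes over.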

Slide 108

Slide 108 text

Slide 109

Slide 109 text

Slide 110

Slide 110 text

Index id ... title ... body ... lang ... id ... title ... body ... lang ... Lucene analyzer chain 1. Fields are analyzed individually Adding document details

Slide 111

Slide 111 text

Index id ... title ... body ... lang ... id ... title ... body ... lang ... Lucene analyzer chain 1. No analysis on id Adding document details

Slide 112

Slide 112 text

Index id ... title ... body ... lang ... title ... body ... lang ... Lucene analyzer chain 1. Field title being processed id ... Adding document details

Slide 116

Slide 116 text

Index id ... title ... body ... lang ... title ... body ... lang ... Lucene analyzer chain 1. Field body being processed id ... Adding document details

Slide 117

Slide 117 text

Index id ... title ... body ... lang ... title ... lang ... Lucene analyzer chain 1. Field body being processed id ... body ... Adding document details

Slide 120

Slide 120 text

Index id ... title ... body ... lang ... title ... lang ... Lucene analyzer chain 1. Field lang being processed 2. Uses a different analyzer chain id ... body ... Adding document details

Slide 121

Slide 121 text

Index id ... title ... body ... lang ... title ... Lucene analyzer chain 1. Field lang being processed 2. Uses a different analyzer chain id ... body ... lang ... Adding document details

Slide 122

Slide 122 text

id ... title ... body ... lang ... Index Lucene analyzer chain 1. All fields analyzed Adding document details

Slide 123

Slide 123 text

Index id ... title ... body ... genre ... Adding document details

Slide 124

Slide 124 text

Index query Search details

Slide 125

Slide 125 text

SearchHandler Index query Search details

Slide 126

Slide 126 text

Index query Search components Search details

Slide 127

Slide 127 text

Analysis chain Index query Search details

Slide 128

Slide 128 text

Index query Search details

Slide 129

Slide 129 text

Index query Search components Search details

Slide 130

Slide 130 text

Index result SearchHandler Search details

Slide 131

Slide 131 text

Hands-on: Multi-lingual search with Solr See example on http://github.com/atilika/berlin-buzzwords-2013

Slide 132

Slide 132 text

Multi-language challenges • How do we detect language accurately? • Indexing side is feasible (accuracy > 99.1%), but query side is hard because of ambiguity • How do we deal with language on the query side? • Supply language to use in the application (best if possible) • Search all relevant language variants (OR query) • Search a fallback field using n-gramming • Boost important language or content • Not knowing the language of the query terms will most likely have a negative impact on overall ranking
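Two of the query-side options above can be sketched briefly: building an OR query across per-language fields, and producing character n-grams for a language-independent fallback field. The field names `body_en` and `body_de` and both function names are assumptions made for illustration.

```python
def or_query(terms, fields):
    """Search the same terms in every language-specific field (OR query)."""
    clauses = [f"{field}:({terms})" for field in fields]
    return " OR ".join(clauses)

def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams for a fallback field."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(or_query("beer", ["body_en", "body_de"]))
print(char_ngrams("beer"))
print(char_ngrams("\u30D3\u30FC\u30EB"))   # works without knowing word boundaries
```

N-grams need no language knowledge, which makes them a useful fallback, but they typically match more loosely than language-specific analysis and so tend to trade precision for recall.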

Slide 133

Slide 133 text

NLP eco-system

Slide 134

Slide 134 text

Basis Technology • High-end provider of text analytics software • Rosette Linguistics Platform (RLP) highlights • Language and encoding identification (55 languages and 45 encodings) • Segmentation for Chinese, Japanese and Korean • De-compounding for German, Dutch, Korean, etc. • Lemmatization for a range of languages • Part-of-speech tagging for a range of languages • Sentence boundary detection • Named entity extraction • Name indexing, transliteration and matching • Integrates well with Lucene/Solr

Slide 135

Slide 135 text

Apache OpenNLP • Machine learning toolkit for NLP • Implements a range of common and best-practice algorithms • Very easy-to-use tools and APIs targeted towards NLP • Features and applications • Tokenization • Sentence segmentation • Part-of-speech tagging • Named entity recognition • Chunking • Licensing terms • Code itself has an Apache License 2.0 • Some models are available, but licensing terms and F-scores are unclear... • See LUCENE-2899 for an OpenNLP-based Lucene Analyzer (work-in-progress)

Slide 136

Slide 136 text

Hands-on: Basic text processing with OpenNLP See demo code on http://github.com/atilika/berlin-buzzwords-2013

Slide 137

Slide 137 text

Other eco-system options

Slide 138

Slide 138 text

Summary

Slide 139

Slide 139 text

Summary • Getting languages right is a hard problem • Linguistics helps improve search quality • Linguistics in Lucene, ElasticSearch and Solr • A wide range of languages are supported out-of-the-box • Considerations to be made on indexing and query side • Lucene Analyzers work on a per-field level • Solr UpdateRequestProcessors work on the document level • Solr has functionality for automatically detecting language (available in ElasticSearch as a plugin) • Linguistics options also available in the eco-system

Slide 140

Slide 140 text

Practical advice

Slide 141

Slide 141 text

Practical advice • Understand your content and your users’ needs • Understand your language and its issues • Understand what users want from search • Do you have issues with recall? • Consider synonyms, stemming • Consider compound-segmentation for European languages • Consider WordDelimiterFilter, phonetic matching • Do you have issues with precision? • Consider using ANDs instead of ORs for terms • Consider improving content quality? Search fewer fields? • Is some content more important than other content? • Consider boosting content with a boost query

Slide 142

Slide 142 text

Thank you • Jan Høydahl (www.cominvent.com): thanks for some slide material • Bushra Zawaydeh: thanks for fun Arabic language lessons • Gaute Lambertsen: thanks for helping with talk preparations

Slide 143

Slide 143 text

Example code • Example code is available on Github • https://github.com/atilika/berlin-buzzwords-2013 • Get started using • git clone git://github.com/atilika/berlin-buzzwords-2013.git • less berlin-buzzwords-2013/README.md • Contact us if you have any questions • [email protected]

Slide 144

Slide 144 text

ありがとうございました Thank you very much شكرا جزيلا Vielen Dank Merci beaucoup