Apache Lucene Core
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
http://lucene.apache.org/core/
As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …
Stuff a search engine can do

Agenda
1. Document Analysis
2. Indexing
3. Searching and Ranking
4. Suggestions, More Like This, etc.
5. Would you like to know more?
Document Analysis
The sample text above is passed to an Analyzer.
Document Analysis
Anatomy of the Analyzer: Elasticsearch comes with pre-built analyzers; you can also create your own.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
1. Character Filter
2. Tokenizer
3. Token Filter
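The three analyzer stages can be sketched in a few lines of Python. This is only an illustration of the character filter → tokenizer → token filter order, not Elasticsearch's implementation: the function names, the HTML-stripping rule, and the tiny stopword set are all assumptions made for the example.

```python
import re
from typing import Callable, List

def strip_html(text: str) -> str:
    """Character filter (hypothetical): replace anything that looks like a tag."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text: str) -> List[str]:
    """Tokenizer (hypothetical): split on runs of non-word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: fold every token to lower case."""
    return [t.lower() for t in tokens]

STOPWORDS = {"a", "the", "if"}  # illustrative list, not a real analyzer's default

def remove_stopwords(tokens: List[str]) -> List[str]:
    """Token filter: drop tokens found in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    # 1. character filters rewrite the raw string
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer splits the string into tokens
    tokens = tokenizer(text)
    # 3. token filters run in order over the token stream
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>He</b> who controls the spice, controls the universe.",
    [strip_html],
    word_tokenizer,
    [lowercase, remove_stopwords],
)
print(tokens)
```

The order of the token filters matters: lowercasing runs first so that "The" and "the" are both caught by the stopword filter.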
Indexing
• Elasticsearch terms:
  ‒ An Index: data structure that houses documents (think RDBMS "table")
  ‒ Index a document: insert into an Index
  ‒ Document: a JSON object (hash map)

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'
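For readers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch under assumptions: the helper name `build_index_request` is made up for the example, the `Content-Type` header is added because newer Elasticsearch versions reject requests without it, and the request is only built, not sent, so the snippet runs without a server.

```python
import json
import urllib.request

def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/{doc_id}",
        data=body,
        method="PUT",
        # Assumption: an explicit JSON content type, which newer versions require.
        headers={"Content-Type": "application/json"},
    )

request = build_index_request(
    "http://localhost:9200",
    "twitter",
    "tweet",
    1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)
print(request.get_method(), request.full_url)
# To actually send it against a running node: urllib.request.urlopen(request)
```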
# document id 1
{"text": "He who controls the spice, controls the universe."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1
# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1
A           2            1
mad         2            1
man         2            1
sees        2            1
what        2            1
he          2            1
# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
A           2            1
mad         2,3          2
man         2,3          2
sees        2            1
what        2            1
he          2            1
What        3            1
if          3            1
a           3            1
controlled  3            1
Lower case token filter

token       document_id  frequency
he          1,2          2
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
sees        2            1
what        2,3          2
if          3            1
controlled  3            1
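The token tables before and after lowercasing can be reproduced with a small inverted index in Python. This is a minimal sketch, assuming a plain regex tokenizer and treating the "frequency" column as the number of documents containing the token; `build_index` and its `normalize` parameter are names invented for the example.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

def build_index(docs, normalize=lambda token: token):
    """Map each (normalized) token to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            postings[normalize(token)].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

raw = build_index(DOCS)                           # case-sensitive tokens
lowered = build_index(DOCS, normalize=str.lower)  # with a lower case filter

for token, doc_ids in lowered.items():
    # frequency = how many documents contain the token
    print(f"{token:<12}{','.join(map(str, doc_ids)):<8}{len(doc_ids)}")
```

Without normalization, "He" and "he" (or "What" and "what") occupy separate rows; after lowercasing they merge into one row that points at both documents.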
+ Stemmer

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
the         1,3          2
spice       1            1
univers     1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
if          3            1
- Stopwords

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
Searching and Ranking
Structured | Full-text | Others
• Similar to SQL
• Find exact values
• Ranges
• Group by
• Match
• Match Phrase
• Relevancy and boosting
• More Like This
• Multifield Search
• Pipeline Aggregations
• Geolocation
• Proximity Matching
GET my_index/_search
{
  "query": {
    "match" : {
      "text" : {
        "query" : "control spice"
      }
    }
  }
}

token
control
spice

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
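How the query tokens meet the index can be sketched as a dictionary lookup. The code below is an illustration only: the index is hard-coded from the table above, any document containing at least one query token counts as a hit, and ordering by the number of matched tokens is a crude stand-in for real relevance scoring. The function name `match` is chosen for the example.

```python
# Inverted index copied from the table (token -> ids of documents containing it).
INDEX = {
    "he": [1, 2],
    "who": [1],
    "control": [1, 3],
    "spice": [1],
    "univers": [1, 3],
    "mad": [2, 3],
    "man": [2, 3],
    "see": [2],
    "what": [2, 3],
}

def match(index, query_tokens):
    """Return (doc_id, matched_token_count) pairs, best match first."""
    hits = {}
    for token in query_tokens:
        for doc_id in index.get(token, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    # Sort by number of matched tokens (descending), then by document id.
    return sorted(hits.items(), key=lambda pair: (-pair[1], pair[0]))

results = match(INDEX, ["control", "spice"])
print(results)
```

Document 1 contains both tokens and document 3 only "control", so document 1 comes first; document 2 contains neither and is not returned.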
Three main factors determine a document’s score:
• TF (term frequency): the more often a token appears in a doc, the more important it is
• IDF (inverse document frequency): the more documents contain the term, the less important it is
• Field length: shorter docs are more likely to be relevant than longer docs
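One common way to combine these three factors is the BM25 formula. The sketch below uses the textbook form with k1 = 1.2 and b = 0.75; it is not a copy of Lucene's or Elasticsearch's scoring code, and the token lists are an assumption (the three sample documents after lowercasing, stemming, and stopword removal).

```python
import math
from collections import Counter

# Assumed analyzed form of the three sample documents.
DOCS = {
    1: ["he", "who", "control", "spice", "control", "univers"],
    2: ["mad", "man", "see", "what", "he", "see"],
    3: ["what", "mad", "man", "control", "univers"],
}

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query tokens with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            df = sum(1 for other in docs.values() if term in other)
            if df == 0 or tf[term] == 0:
                continue
            # IDF: terms found in many documents contribute less.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Field length: longer-than-average docs are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            # TF: more occurrences help, with diminishing returns.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores

scores = bm25_scores(DOCS, ["control", "spice"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Document 1 ranks first because it contains both query tokens, including the rarer "spice"; document 3 matches only "control"; document 2 matches nothing and scores zero.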
"BM25 Demystified" by Britta Weber
https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25
Would you like to know more?
Code - https://github.com/elastic/
Documentation - https://www.elastic.co/guide/index.html
Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
Discuss Forum - https://discuss.elastic.co/
Private or Public Training - https://training.elastic.co/
Subscriptions - https://www.elastic.co/subscriptions