Slide 1

Slide 1 text

1 João Duarte, Log Whisperer @elastic

Stuff a search engine can do 🙂

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

3

Slide 4

Slide 4 text

4

Slide 5

Slide 5 text

5

Slide 6

Slide 6 text

6 Apache Lucene Core

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.

http://lucene.apache.org/core/

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

8

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

10

Slide 11

Slide 11 text

11

Slide 12

Slide 12 text

12

Slide 13

Slide 13 text

13

Slide 14

Slide 14 text

14 Elasticsearch Cluster

Slide 15

Slide 15 text

15

Slide 16

Slide 16 text

16

Slide 17

Slide 17 text

17 As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …

Slide 18

Slide 18 text

18

Slide 19

Slide 19 text

19 Agenda

1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

Slide 20

Slide 20 text

20 Document Analysis

As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …

Slide 21

Slide 21 text

21 Document Analysis

As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …

Analyzer

Slide 22

Slide 22 text

22 Document Analysis

Anatomy of the Analyzer: Elasticsearch comes with pre-built analyzers, and you can create your own.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

1 Character Filter
2 Tokenizer
3 Token Filter
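
The three analyzer stages (character filter, tokenizer, token filter) can be sketched in plain Python. This is only an illustrative model, not Elasticsearch's implementation: the function names, the tag-stripping character filter, and the word-character tokenizer are all assumptions chosen for the example.

```python
import re

def char_filter(text):
    """Character filter: rewrite the raw character stream (here: strip HTML-like tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Tokenizer: split the character stream into tokens (here: runs of word characters)."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Token filter: change, add, or drop tokens (here: lowercase each one)."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the stages in order: character filter, then tokenizer, then token filter."""
    return lowercase_filter(tokenizer(char_filter(text)))

tokens = analyze("<p>I got the job.</p>")
print(tokens)  # ['i', 'got', 'the', 'job']
```

The order matters: character filters see the raw text, the tokenizer decides where tokens begin and end, and token filters only ever see the tokens it produced.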

Slide 23

Slide 23 text

23

Slide 24

Slide 24 text

24 Agenda

1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

Slide 25

Slide 25 text

25 Indexing

• Elasticsearch terms:
  ‒ An index: a data structure that houses documents (think RDBMS "table");
  ‒ Index a document: insert it into an index;
  ‒ Document: a JSON object (hash map).

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'
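
The same indexing request can be assembled from Python's standard library. This is a minimal sketch: `build_index_request` is a hypothetical helper, not part of any Elasticsearch client, and the `Content-Type` header is an assumption added because newer Elasticsearch versions reject requests without it. The request is only built here, not sent, so no running node is needed.

```python
import json
import urllib.request

def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
    )

request = build_index_request(
    "http://localhost:9200",
    "twitter",
    "tweet",
    1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)
print(request.get_method(), request.full_url)
# To send it against a running node: urllib.request.urlopen(request)
```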

Slide 26

Slide 26 text

26 Indexing

# document id 1
{"text": "He who controls the spice, controls the universe."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1

Slide 27

Slide 27 text

27 Indexing

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1
A           2            1
mad         2            1
man         2            1
sees        2            1
what        2            1
he          2            1

Slide 28

Slide 28 text

28 Indexing

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
A           2            1
mad         2,3          2
man         2,3          2
sees        2            1
what        2            1
he          2            1
What        3            1
if          3            1
a           3            1
controlled  3            1
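
The table for the three documents can be reproduced with a small inverted index. This is a sketch in plain Python, assuming a tokenizer that splits on runs of word characters and does no lowercasing yet; `build_inverted_index` is a hypothetical helper name.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):  # case is preserved at this stage
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

index = build_inverted_index(DOCS)
for token, doc_ids in index.items():
    # The "frequency" column counts matching documents, not occurrences.
    print(f"{token:<12}{','.join(map(str, doc_ids)):<13}{len(doc_ids)}")
```

Because case is preserved, "He" and "he", "A" and "a", and "What" and "what" end up as separate entries, which is the problem the lower case token filter addresses.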

Slide 29

Slide 29 text

29 Indexing

Lower case token filter

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
sees        2            1
what        2,3          2
if          3            1
controlled  3            1

Slide 30

Slide 30 text

30 Indexing

+ Stemmer

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
the         1,3          2
spice       1            1
univers     1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
if          3            1

Slide 31

Slide 31 text

31 Indexing

- Stopwords

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
the         1,3          2
spice       1            1
univers     1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
if          3            1

Slide 32

Slide 32 text

32 Indexing

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
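
The whole chain (lowercase, stem, drop stop words) can be sketched as follows. The `STEMS` lookup table is a hypothetical stand-in for a real stemming algorithm, and `STOPWORDS` lists only the stop words that occur in these three documents; both are assumptions made to keep the example small.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

# Hypothetical lookup table standing in for a real stemmer.
STEMS = {
    "controls": "control",
    "controlled": "control",
    "universe": "univers",
    "sees": "see",
}
# Only the stop words that appear in DOCS.
STOPWORDS = {"a", "if", "the"}

def analyze(text):
    """Tokenize, lowercase, stem, then drop stop words."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    tokens = [STEMS.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]

def build_inverted_index(docs):
    """Map each analyzed token to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

index = build_inverted_index(DOCS)
for token, doc_ids in index.items():
    print(f"{token:<12}{','.join(map(str, doc_ids)):<13}{len(doc_ids)}")
```

Each filter shrinks the index: 16 distinct tokens before any filtering become 9 after all three steps.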

Slide 33

Slide 33 text

33 Agenda

1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

Slide 34

Slide 34 text

34 Searching and Ranking

Structured / Full-text / Others

• Similar to SQL
• Find exact values
• Ranges
• Group by
• Match
• Match Phrase
• Relevancy and boosting
• More Like This
• Multifield Search
• Pipeline Aggregations
• Geolocation
• Proximity Matching

Slide 35

Slide 35 text

35

Slide 36

Slide 36 text

36 Searching and Ranking

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "control spice"
      }
    }
  }
}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2

Slide 37

Slide 37 text

37 Searching and Ranking

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "control spice"
      }
    }
  }
}

token
control
spice

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
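
Looking the query tokens up in the inverted index can be sketched like this. It assumes the query string has already been analyzed into the tokens `control` and `spice`, and that any matching token is enough for a hit (the default OR behaviour of a `match` query). The `match` function is a hypothetical helper and does no ranking.

```python
# Inverted index copied from the table: token -> document ids.
INDEX = {
    "he": [1, 2],
    "who": [1],
    "control": [1, 3],
    "spice": [1],
    "univers": [1, 3],
    "mad": [2, 3],
    "man": [2, 3],
    "see": [2],
    "what": [2, 3],
}

def match(index, query_tokens):
    """Return {doc_id: [matched query tokens]} for documents containing any token."""
    hits = {}
    for token in query_tokens:
        for doc_id in index.get(token, []):
            hits.setdefault(doc_id, []).append(token)
    return hits

hits = match(INDEX, ["control", "spice"])
print(hits)  # {1: ['control', 'spice'], 3: ['control']}
```

Document 1 matches both tokens and document 3 only one, which is why a scoring step is needed to decide the order of the results.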

Slide 38

Slide 38 text

38 Searching and Ranking

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "control spice"
      }
    }
  }
}

token
control
spice

Slide 39

Slide 39 text

39 Searching and Ranking

Slide 40

Slide 40 text

40 Searching and Ranking

Three main factors determine a document’s score:
• TF (term frequency): the more often a token appears in a document, the more important it is to that document
• IDF (inverse document frequency): the more documents contain the term, the less important it is
• Field length: a match in a shorter field is more likely to be relevant than a match in a longer one
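
The three factors above can be combined into a rough score. The individual formulas below follow the classic Lucene TF/IDF similarity as it is usually described (square root of the term count, `1 + ln(N / (df + 1))`, and `1 / sqrt(field length)`), but the way they are combined here is a simplification: query normalization, coordination, and boosts are left out, so the absolute numbers will not match what Elasticsearch reports.

```python
import math

def tf(term_count):
    """Term frequency: grows with the count, but less than linearly."""
    return math.sqrt(term_count)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: shorter fields weigh more."""
    return 1 / math.sqrt(num_terms)

def score(query_tokens, doc_tokens, doc_freqs, num_docs):
    """Sum tf * idf * norm over the query tokens present in the document."""
    norm = field_norm(len(doc_tokens))
    total = 0.0
    for token in query_tokens:
        count = doc_tokens.count(token)
        if count:
            total += tf(count) * idf(num_docs, doc_freqs[token]) * norm
    return total

# Analyzed tokens of the three example documents (stop words removed).
DOCS = {
    1: ["he", "who", "control", "spice", "control", "univers"],
    2: ["mad", "man", "see", "what", "he", "see"],
    3: ["what", "mad", "man", "control", "univers"],
}
DOC_FREQS = {"control": 2, "spice": 1}

scores = {
    doc_id: score(["control", "spice"], tokens, DOC_FREQS, len(DOCS))
    for doc_id, tokens in DOCS.items()
}
for doc_id, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(value, 3))
```

Document 1 ranks first because it contains both query tokens and "control" twice; document 3 matches only "control"; document 2 matches nothing and scores zero.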

Slide 41

Slide 41 text

41 Searching and Ranking

Slide 42

Slide 42 text

42 Searching and Ranking

Slide 43

Slide 43 text

43 Searching and Ranking

Slide 44

Slide 44 text

44 Searching and Ranking

"BM25 Demystified" by Britta Weber
https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25
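
For orientation, here is a sketch of the per-term BM25 score in its common textbook form, not taken from the talk itself. The defaults `k1 = 1.2` and `b = 0.75` are the usual Elasticsearch defaults; constant factors differ between Lucene versions, so treat the absolute values as illustrative. The function name and the sample numbers are assumptions.

```python
import math

def bm25_term_score(term_count, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term against one document with the textbook BM25 formula."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_ratio = doc_len / avg_doc_len
    tf_part = (term_count * (k1 + 1)) / (term_count + k1 * (1 - b + b * length_ratio))
    return idf * tf_part

# Term frequency saturates: each extra occurrence adds less than the previous one.
once = bm25_term_score(1, doc_freq=2, num_docs=3, doc_len=6, avg_doc_len=6)
twice = bm25_term_score(2, doc_freq=2, num_docs=3, doc_len=6, avg_doc_len=6)
many = bm25_term_score(100, doc_freq=2, num_docs=3, doc_len=6, avg_doc_len=6)

# Length normalization: the same count scores higher in a shorter document.
short_doc = bm25_term_score(1, doc_freq=2, num_docs=3, doc_len=3, avg_doc_len=6)
long_doc = bm25_term_score(1, doc_freq=2, num_docs=3, doc_len=12, avg_doc_len=6)

print(once < twice < many, short_doc > long_doc)
```

Unlike the square-root term frequency of classic TF/IDF, the BM25 term-frequency part is bounded by `k1 + 1`, so repeating a term many times cannot inflate the score without limit.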

Slide 45

Slide 45 text

45

Slide 46

Slide 46 text

46 Agenda

1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

Slide 47

Slide 47 text

47 Agenda

1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

Slide 48

Slide 48 text

48 Would you like to know more?

Code - https://github.com/elastic/
Documentation - https://www.elastic.co/guide/index.html
Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
Discuss Forum - https://discuss.elastic.co/
Private or Public Training - https://training.elastic.co/
Subscriptions - https://www.elastic.co/subscriptions

Slide 49

Slide 49 text

49 The End. Thank you!