Stuff a Search Engine Can Do

This talk was given by João Duarte at the second event of the Elastic Lisboa meetup group (https://www.meetup.com/Elastic-Lisboa/events/235801377). Demo code is at https://gist.github.com/jsvd/cafccdcf20bd30969ed8419c8ae9a573

Elasticsearch Inc

January 05, 2017

Transcript

  5. 6.

    Apache Lucene Core: Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download. http://lucene.apache.org/core/
  15. 17.

    As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …
  17. 19.

    Stuff a search engine can do

    Agenda
    1 Document Analysis
    2 Indexing
    3 Searching and Ranking
    4 Suggestions, More Like This, etc.
    5 Would you like to know more..?
  19. 21.

    Document Analysis

    The same résumé passage as on slide 17, now labeled: Analyzer
  20. 22.

    Document Analysis

    Anatomy of the Analyzer: Elasticsearch comes with pre-built analyzers; you can also create your own. https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

    1 Character Filter
    2 Tokenizer
    3 Token Filter
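The three analyzer stages named on this slide can be sketched in plain Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function names and the "&" mapping are hypothetical, chosen to show one step per stage.

```python
import re


def char_filter(text):
    # Stage 1, character filter: rewrite raw characters before tokenizing.
    # The "&" -> "and" mapping is an illustrative assumption.
    return text.replace("&", " and ")


def tokenizer(text):
    # Stage 2, tokenizer: split the character stream into tokens.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    # Stage 3, token filter: transform the token stream (here: lowercase).
    return [token.lower() for token in tokens]


def analyze(text):
    # An analyzer chains the three stages in order.
    return lowercase_filter(tokenizer(char_filter(text)))


print(analyze("Spice & Universe"))  # ['spice', 'and', 'universe']
```

Keeping each stage as its own function mirrors the slide: any stage can be swapped without touching the other two.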
  23. 25.

    Indexing

    • Elasticsearch terms:
      ‒ An Index: data structure that houses documents (think RDBMS "table");
      ‒ Index a document: insert into an Index
      ‒ Document: a JSON object (hash map)

    $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
        "user" : "kimchy",
        "post_date" : "2009-11-15T14:12:12",
        "message" : "trying out Elasticsearch"
    }'
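The same indexing request can be built from Python with only the standard library. This is a minimal sketch: the helper name `build_index_request` is hypothetical, and the `Content-Type` header is an addition not shown on the slide (newer Elasticsearch releases expect it). The request is only constructed here, not sent, so no running cluster is needed.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    # Mirrors the slide's URL layout: /<index>/<type>/<id>.
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Assumption: header added for newer Elasticsearch versions.
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200",
    "twitter",
    "tweet",
    1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)
print(request.get_method(), request.full_url)
```

To actually index the document against a local node, pass the object to `urllib.request.urlopen(request)`.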
  24. 26.

    Indexing

    # document id 1
    {"text": "He who controls the spice, controls the universe."}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1            1
    spice       1            1
    universe    1            1
  25. 27.

    Indexing

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1            1
    spice       1            1
    universe    1            1
    A           2            1
    mad         2            1
    man         2            1
    sees        2            1
    what        2            1
    he          2            1
  26. 28.

    Indexing

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    A           2            1
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2            1
    he          2            1
    What        3            1
    if          3            1
    a           3            1
    controlled  3            1
  27. 29.

    Indexing: Lower case token filter

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    he          1,2          2
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2,3          2
    if          3            1
    controlled  3            1
  28. 30.

    Indexing: + Stemmer

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    the         1,3          2
    spice       1            1
    univers     1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
    if          3            1
  29. 31.

    Indexing: - Stopwords

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    the         1,3          2
    spice       1            1
    univers     1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
    if          3            1
  30. 32.

    Indexing

    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
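The indexing walk-through in these slides (tokenize, lowercase, stem, drop stopwords) can be reproduced with a small inverted index in Python. This is a sketch under stated assumptions: `TOY_STEMS` is a hypothetical lookup table standing in for a real stemmer, and `STOPWORDS` contains only the three words the slides remove.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

# Assumption: only the stopwords that disappear in the slides.
STOPWORDS = {"the", "a", "if"}

# Assumption: a lookup table standing in for a real stemming algorithm.
TOY_STEMS = {
    "controls": "control",
    "controlled": "control",
    "universe": "univers",
    "sees": "see",
}


def analyze(text):
    # Tokenize, lowercase, stem, then drop stopwords (same order as the slides).
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    stemmed = [TOY_STEMS.get(token, token) for token in tokens]
    return [token for token in stemmed if token not in STOPWORDS]


def build_index(docs):
    # Map each token to the sorted list of document ids that contain it.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


index = build_index(DOCS)
for token, doc_ids in index.items():
    # The "frequency" column in the slides is the number of matching documents.
    print(token, doc_ids, len(doc_ids))
```

The resulting nine entries match the final table: for example `control` maps to documents 1 and 3, and `the` is absent.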
  32. 34.

    Searching and Ranking

    Structured, Full-text, Others:
    • Similar to SQL
    • Find exact values
    • Ranges
    • Group by
    • Match
    • Match Phrase
    • Relevancy and boosting
    • More Like This
    • Multifield Search
    • Pipeline Aggregations
    • Geolocation
    • Proximity Matching
  34. 36.

    Searching and Ranking

    GET my_index/_search
    {
      "query": {
        "match": {
          "text": { "query": "control spice" }
        }
      }
    }

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
  35. 37.

    Searching and Ranking

    GET my_index/_search
    {
      "query": {
        "match": {
          "text": { "query": "control spice" }
        }
      }
    }

    Query tokens:
    token
    control
    spice

    Index:
    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
  36. 38.

    Searching and Ranking

    GET my_index/_search
    {
      "query": {
        "match": {
          "text": { "query": "control spice" }
        }
      }
    }

    Query tokens:
    token
    control
    spice
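How the query tokens meet the inverted index can be sketched as a simple lookup in Python. The `INDEX` literal is copied from the slide's table; ranking by the number of matched tokens is an illustrative stand-in for real relevance scoring, not what Elasticsearch does.

```python
from collections import Counter

# Inverted index as shown in the slides: token -> document ids.
INDEX = {
    "he": [1, 2],
    "who": [1],
    "control": [1, 3],
    "spice": [1],
    "univers": [1, 3],
    "mad": [2, 3],
    "man": [2, 3],
    "see": [2],
    "what": [2, 3],
}


def match(query_tokens, index):
    # Count how many query tokens each document contains.
    hits = Counter()
    for token in query_tokens:
        for doc_id in index.get(token, []):
            hits[doc_id] += 1
    # Assumption: rank by matched-token count, ties broken by document id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


print(match(["control", "spice"], INDEX))  # [(1, 2), (3, 1)]
```

Document 1 contains both tokens and document 3 only `control`, so document 1 ranks first; document 2 does not match at all.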
  38. 40.

    Searching and Ranking

    There are three main factors in a document’s score:
    • TF (term frequency): the more often a token appears in a doc, the more important it is
    • IDF (inverse document frequency): the more documents contain the term, the less important it is
    • Field length: shorter docs are more likely to be relevant than longer docs
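The three factors above can be turned into a toy score in Python. The individual formulas follow the classic Lucene TF/IDF similarity as described in Elasticsearch: The Definitive Guide; multiplying them into a single number is a simplifying assumption, and the real scoring function has more terms.

```python
import math


def tf(term_count):
    # Term frequency: grows with the number of occurrences in the field.
    return math.sqrt(term_count)


def idf(doc_freq, num_docs):
    # Inverse document frequency: shrinks as more documents contain the term.
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Field-length norm: shorter fields weigh more.
    return 1 / math.sqrt(num_terms)


def score(term_count, doc_freq, num_docs, num_terms):
    # Assumption: a plain product of the three factors, for illustration only.
    return tf(term_count) * idf(doc_freq, num_docs) * field_norm(num_terms)


# "control" occurs twice in document 1 (6 analyzed tokens) and once in
# document 3 (5 analyzed tokens); it appears in 2 of the 3 documents.
doc1 = score(term_count=2, doc_freq=2, num_docs=3, num_terms=6)
doc3 = score(term_count=1, doc_freq=2, num_docs=3, num_terms=5)
print(doc1 > doc3)  # True
```

Document 1 wins because the extra occurrence of the term outweighs its slightly longer field.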
  39. 44.

    Searching and Ranking

    "BM25 Demystified" by Britta Weber: https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25
  43. 48.

    Would you like to know more?

    Code - https://github.com/elastic/
    Documentation - https://www.elastic.co/guide/index.html
    Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
    Discuss Forum - https://discuss.elastic.co/
    Private or Public Training - https://training.elastic.co/
    Subscriptions - https://www.elastic.co/subscriptions