Stuff a Search Engine Can Do

This talk was given by João Duarte at the 2nd event of the Elastic Lisboa meetup group (https://www.meetup.com/Elastic-Lisboa/events/235801377). Demo code is at https://gist.github.com/jsvd/cafccdcf20bd30969ed8419c8ae9a573

Elasticsearch Inc

January 05, 2017

Transcript

  1. 1
    João Duarte
    Log Whisperer @elastic
    Stuff a search engine can do

  6. 6
    Apache Lucene Core
    Apache Lucene™ is a high-performance,
    full-featured text search engine library written
    entirely in Java. It is a technology suitable for
    nearly any application that requires full-text
    search, especially cross-platform.
    Apache Lucene is an open source project
    available for free download.
    http://lucene.apache.org/core/

  14. 14
    Elasticsearch Cluster

  17. 17
    As a law student, I went on a few job interviews. At
    one, the interviewer’s first comment was “It’s so
    unusual that I see a résumé without any typos.”
    “Are you serious?” I said.
    She said, “Yes, probably 90% of the résumés I get
    have typos. And that includes the ones we get from
    the top schools.”
    I got the job. Probably there were better-qualified
    candidates, but they damaged their chances with
    sloppy résumés. The irony is that those people, who
    most needed to hear the interviewer’s feedback,
    weren’t in the room. Because they never got an
    interview.

    ...

  19. 19
    Stuff a search engine can do
    Agenda
    1 Document Analysis
    2 Indexing
    3 Searching and Ranking
    4 Suggestions, More Like This, etc.
    5 Would you like to know more..?

  21. 21
    Stuff a search engine can do
    Document Analysis
    The résumé passage from slide 17 is fed into an Analyzer.

  22. 22
    Stuff a search engine can do
    Document Analysis
    Anatomy of the Analyzer:
    1 Character Filter → 2 Tokenizer → 3 Token Filter
    Elasticsearch comes with pre-built analyzers; you can also create your own.
    https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
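
The three analyzer stages can be sketched in plain Python. This is only an illustration of the idea, not how Elasticsearch implements it: the function names, the HTML-stripping character filter, and the sample input are assumptions made for the example.

```python
import re


def char_filter(text):
    """Stage 1 (assumed example): replace HTML-like tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Stage 2: split the cleaned text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Stage 3 (assumed example): lower-case every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


tokens = analyze("<p>I got the <b>job</b>.</p>")
print(tokens)  # ['i', 'got', 'the', 'job']
```

Each stage only sees the output of the previous one, which is why the order (characters, then tokens, then token-level changes) matters.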

  25. 25
    Stuff a search engine can do
    Indexing
    • Elasticsearch terms:
    ‒ An Index: a data structure that houses documents (think RDBMS "table")
    ‒ To index a document: to insert it into an Index
    ‒ Document: a JSON object (hash map)
    $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
      "user" : "kimchy",
      "post_date" : "2009-11-15T14:12:12",
      "message" : "trying out Elasticsearch"
    }'
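
The same indexing request can be built from Python with only the standard library. This is a sketch under assumptions: the helper name `build_index_request` is made up for the example, and the `Content-Type` header is added here because newer Elasticsearch versions are understood to require it. The request is built but not sent, so no cluster is needed to run it.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://localhost:9200",
    "twitter",
    "tweet",
    1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)
print(request.get_method(), request.full_url)
# To actually send it (needs a running cluster on localhost:9200):
# urllib.request.urlopen(request)
```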

  28. 28
    Stuff a search engine can do
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    A           2            1
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2            1
    he          2            1
    What        3            1
    if          3            1
    a           3            1
    controlled  3            1

  29. 29
    Stuff a search engine can do
    Indexing
    Lower case token filter

    token       document_id  frequency
    he          1,2          2
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2,3          2
    if          3            1
    controlled  3            1

  30. 30
    Stuff a search engine can do
    Indexing
    + Stemmer

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    the         1,3          2
    spice       1            1
    univers     1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
    if          3            1

  31. 31
    Stuff a search engine can do
    Indexing
    - Stopwords ("the", "a" and "if" are dropped)

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
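
The indexing steps above (tokenize, lower-case, stem, drop stopwords) can be sketched as a small inverted-index builder. The lookup table `TOY_STEMS` is a hand-written stand-in for a real stemmer, and the stopword list holds only the three words removed on the slides; both are assumptions for illustration.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

# Only the words the slides drop; a real stopword list is longer.
STOPWORDS = {"the", "a", "if"}

# Toy stand-in for a stemmer: covers only the words in DOCS.
TOY_STEMS = {
    "controls": "control",
    "controlled": "control",
    "universe": "univers",
    "sees": "see",
}


def analyze(text):
    """Tokenize, lower-case, stem (toy lookup) and drop stopwords."""
    tokens = re.findall(r"[A-Za-z]+", text)
    tokens = [token.lower() for token in tokens]
    tokens = [TOY_STEMS.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]


def build_index(docs):
    """Map each token to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


index = build_index(DOCS)
for token, doc_ids in index.items():
    # The third column is the number of documents containing the token.
    print(token, doc_ids, len(doc_ids))
```

The printed rows match the final table: nine tokens, with `control` and `univers` pointing at documents 1 and 3.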

  34. 34
    Stuff a search engine can do
    Searching and Ranking
    Structured | Full-text | Others
    • Similar to SQL
    • Find exact values
    • Ranges
    • Group by
    • Match
    • Match Phrase
    • Relevancy and boosting
    • More Like This
    • Multifield Search
    • Pipeline Aggregations
    • Geolocation
    • Proximity Matching

  37. 37
    Stuff a search engine can do
    Searching and Ranking
    GET my_index/_search
    {
      "query": {
        "match" : {
          "text" : {
            "query" : "control spice"
          }
        }
      }
    }

    token
    control
    spice

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
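
How the two query tokens are looked up in the inverted index can be sketched as follows. The index literal is copied from the table above; the function name `match` and the "count matching tokens" ordering are simplifications assumed for the example and are not the scoring Elasticsearch actually uses.

```python
# Inverted index copied from the slide: token -> document ids.
INDEX = {
    "he": [1, 2],
    "who": [1],
    "control": [1, 3],
    "spice": [1],
    "univers": [1, 3],
    "mad": [2, 3],
    "man": [2, 3],
    "see": [2],
    "what": [2, 3],
}


def match(index, query_tokens):
    """Return (doc_id, matched_token_count) pairs, best match first.

    A document qualifies if it contains at least one query token (OR semantics).
    """
    hits = {}
    for token in query_tokens:
        for doc_id in index.get(token, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


results = match(INDEX, ["control", "spice"])
print(results)  # [(1, 2), (3, 1)]
```

Document 1 contains both tokens, document 3 only `control`, and document 2 neither, so it is not returned.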

  40. 40
    Stuff a search engine can do
    Searching and Ranking
    Three main factors determine a document’s score:
    • TF (term frequency): the more often a token appears in a doc, the
    more important it is to that doc
    • IDF (inverse document frequency): the more documents
    contain the term, the less important it is
    • Field length: a match in a shorter field is more likely to be
    relevant than a match in a longer one
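
The three factors can be combined in a small scoring sketch. The formulas follow the classic TF/IDF shape (square root of the term frequency, a logarithmic IDF, and a 1/sqrt(length) norm), but this is a simplified illustration, not the exact Lucene or BM25 computation. The token lists are the three example documents after analysis.

```python
import math
from collections import Counter

# The three example documents after analysis (lower-cased, stemmed, no stopwords).
DOCS = {
    1: ["he", "who", "control", "spice", "control", "univers"],
    2: ["mad", "man", "see", "what", "he", "see"],
    3: ["what", "mad", "man", "control", "univers"],
}


def score(query_tokens, doc_tokens, docs):
    """Sum tf * idf * norm over the query tokens found in one document."""
    counts = Counter(doc_tokens)
    norm = 1.0 / math.sqrt(len(doc_tokens))  # field length: shorter scores higher
    total = 0.0
    for token in query_tokens:
        freq = counts[token]
        if freq == 0:
            continue
        doc_freq = sum(1 for tokens in docs.values() if token in tokens)
        tf = math.sqrt(freq)  # more occurrences, higher score
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))  # rarer term, higher score
        total += tf * idf * norm
    return total


ranking = sorted(
    ((doc_id, score(["control", "spice"], tokens, DOCS)) for doc_id, tokens in DOCS.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, value in ranking:
    print(doc_id, round(value, 3))
```

Document 1 ranks first because it contains `control` twice and is the only one with the rarer `spice`; document 2 matches nothing and scores zero.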

  44. 44
    Stuff a search engine can do
    Searching and Ranking
    "BM25 Demystified" by Britta Weber
    https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25

  48. 48
    Stuff a search engine can do
    Would you like to know more?
    Code - https://github.com/elastic/
    Documentation - https://www.elastic.co/guide/index.html
    Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
    Discuss Forum - https://discuss.elastic.co/
    Private or Public Training - https://training.elastic.co/
    Subscriptions - https://www.elastic.co/subscriptions

  49. 49
    Stuff a search engine can do
    The End.
    Thank you!
