Stuff a Search Engine Can Do

This talk was given by João Duarte at the second event of the Elastic Lisboa meetup group: https://www.meetup.com/Elastic-Lisboa/events/235801377. Demo code is at https://gist.github.com/jsvd/cafccdcf20bd30969ed8419c8ae9a573.

Elasticsearch Inc

January 05, 2017

Transcript

  1. 1
    João Duarte
    Log Whisperer @elastic
    Stuff a search engine can do
    :slightly_smiling_face:

  2. 6
    Apache Lucene Core
    Apache Lucene™ is a high-performance,
    full-featured text search engine library written
    entirely in Java. It is a technology suitable for
    nearly any application that requires full-text
    search, especially cross-platform.
    Apache Lucene is an open source project
    available for free download.
    http://lucene.apache.org/core/

  5. 14
    Elasticsearch Cluster

  6. 17
    As a law student, I went on a few job interviews. At
    one, the interviewer’s first comment was “It’s so
    unusual that I see a résumé without any typos.”
    “Are you serious?” I said.
    She said, “Yes, probably 90% of the résumés I get
    have typos. And that includes the ones we get from
    the top schools.”
    I got the job. Probably there were better-qualified
    candidates, but they damaged their chances with
    sloppy résumés. The irony is that those people, who
    most needed to hear the interviewer’s feedback,
    weren’t in the room. Because they never got an
    interview.

    ...

  8. 19
    Agenda
    1. Document Analysis
    2. Indexing
    3. Searching and Ranking
    4. Suggestions, More Like This, etc.
    5. Would you like to know more?

  10. 21
    Document Analysis
    As a law student, I went on a few job interviews. At
    one, the interviewer’s first comment was “It’s so
    unusual that I see a résumé without any typos.”
    “Are you serious?” I said.
    She said, “Yes, probably 90% of the résumés I get
    have typos. And that includes the ones we get from
    the top schools.”
    I got the job. Probably there were better-qualified
    candidates, but they damaged their chances with
    sloppy résumés. The irony is that those people, who
    most needed to hear the interviewer’s feedback,
    weren’t in the room. Because they never got an
    interview.

    ...
    Analyzer

  11. 22
    Document Analysis
    Anatomy of the Analyzer:
    1. Character Filter
    2. Tokenizer
    3. Token Filter
    Elasticsearch comes with pre-built analyzers; you can also create your own.
    https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
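
The three analyzer stages can be sketched as a small pipeline. This is only an illustration in plain Python, not Elasticsearch code: the function names (`strip_html`, `word_tokenizer`, `lowercase`, `analyze`) are made up for this example, and the HTML-stripping regex is an assumed stand-in for a real character filter.

```python
import re
from typing import Callable, List, Sequence


def strip_html(text: str) -> str:
    """Character filter (hypothetical): replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    """Tokenizer (hypothetical): split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: Sequence[str]) -> List[str]:
    """Token filter (hypothetical): lower-case every token."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: Sequence[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: Sequence[Callable[[Sequence[str]], List[str]]],
) -> List[str]:
    """Run the stages in order: character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return list(tokens)


tokens = analyze(
    "<p>He who controls the Spice</p>",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase],
)
print(tokens)  # ['he', 'who', 'controls', 'the', 'spice']
```

Character filters see the raw string, the tokenizer turns it into tokens, and token filters only ever see tokens, which is why the order of the stages is fixed.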

  14. 25
    Indexing
    • Elasticsearch terms:
    ‒ An Index: data structure that houses documents (think RDBMS "table")
    ‒ Index a document: insert into an Index
    ‒ Document: a JSON object (hash map)
    $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
      "user" : "kimchy",
      "post_date" : "2009-11-15T14:12:12",
      "message" : "trying out Elasticsearch"
    }'

  15. 26
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1            1
    spice       1            1
    universe    1            1

  16. 27
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1            1
    spice       1            1
    universe    1            1
    A           2            1
    mad         2            1
    man         2            1
    sees        2            1
    what        2            1
    he          2            1

  17. 28
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    He          1            1
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    A           2            1
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2            1
    he          2            1
    What        3            1
    if          3            1
    a           3            1
    controlled  3            1

  18. 29
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}
    Lower case token filter

    token       document_id  frequency
    he          1,2          2
    who         1            1
    controls    1            1
    the         1,3          2
    spice       1            1
    universe    1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    sees        2            1
    what        2,3          2
    if          3            1
    controlled  3            1

  19. 30
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}
    + Stemmer

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    the         1,3          2
    spice       1            1
    univers     1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
    if          3            1

  20. 31
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}
    - Stopwords

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    the         1,3          2
    spice       1            1
    univers     1,3          2
    a           2,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
    if          3            1

  21. 32
    Indexing
    # document id 1
    {"text": "He who controls the spice, controls the universe."}
    # document id 2
    {"text": "A mad man sees what he sees."}
    # document id 3
    {"text": "What if a mad man controlled the universe?"}

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
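
The table above can be reproduced with a few lines of Python. This is a rough sketch rather than how Lucene builds its index: `STEMS` is a hand-written lookup table standing in for a real stemming algorithm, `STOPWORDS` contains only the three words the slides drop, and the names `analyze` and `build_index` are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

# Assumption: only the stopwords removed on the slides.
STOPWORDS = {"the", "a", "if"}

# Assumption: a lookup table stands in for a real stemmer.
STEMS = {
    "controls": "control",
    "controlled": "control",
    "universe": "univers",
    "sees": "see",
}


def analyze(text: str) -> List[str]:
    """Tokenize, lower-case, stem, then drop stopwords (same order as the slides)."""
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [STEMS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


index = build_index(DOCS)
for token, doc_ids in index.items():
    # Columns: token, document ids, number of documents containing the token.
    print(f"{token:<10}{','.join(map(str, doc_ids)):<8}{len(doc_ids)}")
```

The "frequency" column printed here is the number of documents containing the token, matching the slides, not the number of times the token occurs.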

  23. 34
    Searching and Ranking
    Structured
    • Similar to SQL
    • Find exact values
    • Ranges
    • Group by
    Full-text
    • Match
    • Match Phrase
    • Relevancy and boosting
    • Proximity Matching
    Others
    • More Like This
    • Multifield Search
    • Pipeline Aggregations
    • Geolocation

  26. 37
    Searching and Ranking
    GET my_index/_search
    {
      "query": {
        "match": {
          "text": {
            "query": "control spice"
          }
        }
      }
    }

    token
    control
    spice

    token       document_id  frequency
    he          1,2          2
    who         1            1
    control     1,3          2
    spice       1            1
    univers     1,3          2
    mad         2,3          2
    man         2,3          2
    see         2            1
    what        2,3          2
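
Looking up the query tokens in the index can be sketched as follows. This is a simplified illustration, not the Elasticsearch implementation: the function name `match` is made up, and the query is analyzed with a plain lower-case-and-split, whereas a real `match` query runs the field's analyzer over the query string. A document is returned if it contains any of the query tokens, which mirrors the default OR behaviour of `match`.

```python
from typing import Dict, List

# The inverted index from the slide: token -> ids of documents containing it.
INDEX: Dict[str, List[int]] = {
    "he": [1, 2],
    "who": [1],
    "control": [1, 3],
    "spice": [1],
    "univers": [1, 3],
    "mad": [2, 3],
    "man": [2, 3],
    "see": [2],
    "what": [2, 3],
}


def match(index: Dict[str, List[int]], query: str) -> Dict[int, int]:
    """Return {doc_id: number of query tokens found in that document}."""
    hits: Dict[int, int] = {}
    # Assumption: lower-casing and splitting on whitespace stands in for the analyzer.
    for token in query.lower().split():
        for doc_id in index.get(token, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits


hits = match(INDEX, "control spice")
print(hits)  # {1: 2, 3: 1}
```

Document 1 contains both tokens and document 3 only one, which is the starting point for ranking one above the other.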

  29. 40
    Searching and Ranking
    Three main factors determine a document’s score:
    • TF (term frequency): the more often a token appears in a doc, the
    more important it is
    • IDF (inverse document frequency): the more documents
    contain the term, the less important it is
    • Field length: a match in a shorter field is more likely to be
    relevant than a match in a longer one
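
A toy scoring function shows how the three factors combine. Treat this as a sketch under assumptions, not as the formula Elasticsearch uses: the square root for TF, the logarithm for IDF and the inverse square root for field length are illustrative choices, and the names `DOCS` and `score` are invented here. The token lists are the analyzed forms of the three example documents from the Indexing slides.

```python
import math
from collections import Counter
from typing import Dict, List

# Analyzed tokens of the three example documents (stopwords already removed).
DOCS: Dict[int, List[str]] = {
    1: ["he", "who", "control", "spice", "control", "univers"],
    2: ["mad", "man", "see", "what", "he", "see"],
    3: ["what", "mad", "man", "control", "univers"],
}


def score(query_tokens: List[str], doc_tokens: List[str], all_docs: Dict[int, List[str]]) -> float:
    """Sum TF * IDF * length norm over the query tokens (illustrative formulas)."""
    counts = Counter(doc_tokens)
    # Field length: shorter token lists get a larger norm.
    length_norm = 1 / math.sqrt(len(doc_tokens))
    total = 0.0
    for token in query_tokens:
        # TF: grows with the number of occurrences in this document.
        tf = math.sqrt(counts[token])
        # IDF: shrinks as more documents contain the token.
        doc_freq = sum(1 for tokens in all_docs.values() if token in tokens)
        idf = 1 + math.log(len(all_docs) / (doc_freq + 1))
        total += tf * idf * length_norm
    return total


query = ["control", "spice"]
ranked = sorted(
    ((doc_id, score(query, tokens, DOCS)) for doc_id, tokens in DOCS.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, value in ranked:
    print(doc_id, round(value, 3))
```

With these assumptions document 1 ranks first (it contains both tokens, one of them twice), document 3 second, and document 2 scores zero because it contains neither token.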

  33. 44
    Searching and Ranking
    "BM25 Demystified" by Britta Weber
    https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25

  37. 48
    Would you like to know more?
    Code - https://github.com/elastic/
    Documentation - https://www.elastic.co/guide/index.html
    Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
    Discuss Forum - https://discuss.elastic.co/
    Private or Public Training - https://training.elastic.co/
    Subscriptions - https://www.elastic.co/subscriptions

  38. 49
    The End.
    Thank you!
