Common Terms Query

Zachary Tong
November 22, 2013

Presentation given at the Chicago Elasticsearch November Meetup

Transcript

  1. Common Terms
    Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
    Have your cake and eat it too
    Friday, November 22, 13

  2. @ZacharyTong
    polyfractal on IRC
    Developing - Writing - Training
    ಠ_ಠ
    (amoeba)

  3. Stop Words

  4. or, it, be, a, and, to

  5. Often empty of meaning
    or, it, be, a, and, to

  6. Often empty of meaning
    or, it, be, a, and, to
    “the quick and brown fox jumped over a ledge”

  7. Often empty of meaning
    or, it, be, a, and, to
    “the quick and brown fox jumped over a ledge”

  8. Often empty of meaning
    Used frequently
    “the quick and brown fox jumped over a ledge”
    33%
    or, it, be, a, and, to
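As a quick check on the 33% figure, the sketch below counts stop words in the example sentence. It assumes plain whitespace tokenization and adds "the" to the slide's stop list (an assumption: the slide's list omits it, but the figure only works out if "the" is counted).

```python
# Stop list from the slide, plus "the" (assumption, see the note above).
STOP_WORDS = {"or", "it", "be", "a", "and", "to", "the"}


def stop_word_share(sentence: str) -> float:
    """Return the fraction of whitespace-separated tokens that are stop words."""
    tokens = sentence.lower().split()
    stops = [t for t in tokens if t in STOP_WORDS]
    return len(stops) / len(tokens)


share = stop_word_share("the quick and brown fox jumped over a ledge")
print(f"{share:.0%}")  # 3 of the 9 tokens are stop words
```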

  9. Often empty of meaning
    Used frequently
    Bloat inverted index
    or, it, be, a, and, to

  10. Stop Words...
    ...are frequent
    ...have little discriminatory value
    ...hurt performance

  11. Stop Words...
    Stop Filter

  12. “To be or not to be”

  13. “To be or not to be”

  14. Multi-field mapping
    - with stop filter
    - without stop filter

  15. Multi-field mapping
    {
      "query": {
        "bool": {
          "should": [
            { "match": { "body": { "query": "quick fox", "boost": 3 } } },
            { "match": { "body.without_stop": { "query": "quick fox", "boost": 1 } } }
          ]
        }
      }
    }
    Boost stop-removed match
    But check stop-words too

  16. Remember, stop-words:
    Bloat inverted index
    ?

  17. Multi-field mapping:
    Bloat inverted index
    2x!

  18. Other general problems:
    - manually maintain stop-list
    - language dependent
    - domain dependent
    - makes query scoring tricky

  19. Common Terms
    Intelligent stop-word removal

  20. Overview
    - identify “important” terms in query
    - find documents with “important” terms
    - score those matching docs with entire query

  21. “important”
    - quick
    - brown
    - fox
    - jumped
    - over
    - ledge
    “unimportant”
    - the
    - and
    - a

  22. Document Frequency:
    high
    - the
    - and
    - a
    low
    - quick
    - brown
    - fox
    - jumped
    - over
    - ledge
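A minimal sketch of this split, assuming a term counts as high-frequency when the share of documents containing it exceeds the cutoff. The document counts and the helper name `split_by_frequency` are hypothetical, chosen only to reproduce the grouping above.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_freq, high_freq) lists by document frequency."""
    low, high = [], []
    for term in dict.fromkeys(terms):  # drop repeated terms, keep query order
        share = doc_freq.get(term, 0) / total_docs
        if share > cutoff_frequency:
            high.append(term)
        else:
            low.append(term)
    return low, high


# Hypothetical document counts for an index of one million documents.
TOTAL_DOCS = 1_000_000
DOC_FREQ = {
    "the": 950_000, "and": 900_000, "a": 870_000,
    "quick": 800, "brown": 600, "fox": 300,
    "jumped": 150, "over": 900, "ledge": 20,
}

query_terms = "the quick and brown fox jumped over a ledge".split()
low, high = split_by_frequency(query_terms, DOC_FREQ, TOTAL_DOCS, 0.001)
print("high:", high)
print("low:", low)
```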

  23. The Query
    {
      "common": {
        "body": {
          "query": "the quick and brown fox jumped over a ledge",
          "cutoff_frequency": 0.001
        }
      }
    }

  24. internal execution (roughly)
    step 1: find docs w/ “important” terms
    {
      "query": {
        "bool": {
          "must": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
          ]
        }
      }
    }

  25. internal execution (roughly)
    step 2: score matching docs
    {
      "query": {
        "bool": {
          "must": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
          ],
          "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
          ]
        }
      }
    }
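The two steps can be sketched as a small helper that assembles the rough equivalent shown on this slide. The function name and its arguments are hypothetical; this only mirrors the approximate rewrite, not the actual internal implementation.

```python
import json


def rough_common_terms(field, low_freq_terms, high_freq_terms):
    """Build the approximate bool query from the slide.

    Low-frequency terms select documents ("must"); high-frequency terms
    only add to the score of documents that already matched ("should").
    """
    def term(value):
        return {"term": {field: value}}

    return {
        "query": {
            "bool": {
                "must": [term(t) for t in low_freq_terms],
                "should": [term(t) for t in high_freq_terms],
            }
        }
    }


query = rough_common_terms(
    "body",
    ["quick", "brown", "fox", "jumped", "over", "ledge"],
    ["the", "and", "a"],
)
print(json.dumps(query, indent=2))
```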

  26. Controlling Leniency
    use “or” for low-freq terms
    {
      "common": {
        "body": {
          "query": "the quick and brown fox jumped over a ledge",
          "cutoff_frequency": 0.001,
          "low_freq_operator": "or",
          "high_freq_operator": "or",
          "minimum_should_match": {
            "low_freq": "60%",
            "high_freq": "20%"
          }
        }
      }
    }

  27. Controlling Leniency
    how many clauses should match
    {
      "common": {
        "body": {
          "query": "the quick and brown fox jumped over a ledge",
          "cutoff_frequency": 0.001,
          "low_freq_operator": "or",
          "high_freq_operator": "or",
          "minimum_should_match": {
            "low_freq": "60%",
            "high_freq": "20%"
          }
        }
      }
    }

  28. Controlling Leniency
    internal execution (roughly)
    (use “or”)
    {
      "query": {
        "bool": {
          "must": {
            "bool": {
              "should": [
                { "term": { "body": "quick" } },
                { "term": { "body": "brown" } },
                { "term": { "body": "fox" } },
                { "term": { "body": "jumped" } },
                { "term": { "body": "over" } },
                { "term": { "body": "ledge" } }
              ]
            }
          },
          "should": {
            "bool": {
              "should": [
                { "term": { "body": "the" } },
                { "term": { "body": "and" } },
                { "term": { "body": "a" } }
              ]
            }
          }
        }
      }
    }

  29. Controlling Leniency
    internal execution (roughly)
    low-freq “should”: 4 clauses must match (60%)
    high-freq “should”: 1 clause must match (20%)
    {
      "query": {
        "bool": {
          "must": {
            "bool": {
              "should": [
                { "term": { "body": "quick" } },
                { "term": { "body": "brown" } },
                { "term": { "body": "fox" } },
                { "term": { "body": "jumped" } },
                { "term": { "body": "over" } },
                { "term": { "body": "ledge" } }
              ]
            }
          },
          "should": {
            "bool": {
              "should": [
                { "term": { "body": "the" } },
                { "term": { "body": "and" } },
                { "term": { "body": "a" } }
              ]
            }
          }
        }
      }
    }

  30. Controlling Importance
    adjust the high/low cutoff (0.1%)
    {
      "common": {
        "body": {
          "query": "the quick and brown fox jumped over a ledge",
          "cutoff_frequency": 0.001,
          "low_freq_operator": "or",
          "high_freq_operator": "or",
          "minimum_should_match": {
            "low_freq": "60%",
            "high_freq": "20%"
          }
        }
      }
    }

  31. Document Frequency:
    high
    - the
    - and
    - a
    low
    - quick
    - brown
    - fox
    - jumped
    - over
    - ledge

  32. Controlling Importance
    adjust the high/low cutoff (10.0%)
    {
      "common": {
        "body": {
          "query": "the quick and brown fox jumped over a ledge",
          "cutoff_frequency": 0.10,
          "low_freq_operator": "or",
          "high_freq_operator": "or",
          "minimum_should_match": {
            "low_freq": "60%",
            "high_freq": "20%"
          }
        }
      }
    }

  33. Document Frequency:
    high
    - the
    - and
    - a
    - over
    - quick
    low
    - brown
    - fox
    - jumped
    - ledge

  34. “To be or not to be”

  35. All high-frequency terms
    {
      "common": {
        "body": {
          "query": "to be or not to be",
          "cutoff_frequency": 0.001
        }
      }
    }

  36. All high-frequency terms
    internal execution (roughly)
    automatically a “must”
    {
      "query": {
        "bool": {
          "must": [
            { "term": { "body": "to" } },
            { "term": { "body": "be" } },
            { "term": { "body": "or" } },
            { "term": { "body": "not" } },
            { "term": { "body": "be" } }
          ]
        }
      }
    }
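A sketch of that fallback, under the assumption that the rewrite only changes shape when the low-frequency group is empty: with nothing selective left to filter on, every high-frequency term becomes required. The helper name `rough_execution` is hypothetical.

```python
def rough_execution(field, low_freq_terms, high_freq_terms):
    """Approximate rewrite, including the all-high-frequency case."""
    def term(value):
        return {"term": {field: value}}

    if not low_freq_terms:
        # No "important" terms in the query: require the common ones instead.
        return {"bool": {"must": [term(t) for t in high_freq_terms]}}

    return {
        "bool": {
            "must": [term(t) for t in low_freq_terms],
            "should": [term(t) for t in high_freq_terms],
        }
    }


all_common = rough_execution("body", [], ["to", "be", "or", "not"])
mixed = rough_execution("body", ["fox"], ["the"])
print(all_common)
print(mixed)
```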

  37. Controlling Leniency
    how many clauses should match
    {
      "common": {
        "body": {
          "query": "to be or not to be",
          "cutoff_frequency": 0.001,
          "minimum_should_match": {
            "low_freq": "60%"
          }
        }
      }
    }

  38. All high-frequency terms
    internal execution (roughly)
    becomes a “should”
    3 clauses must match (60%)
    {
      "query": {
        "bool": {
          "should": [
            { "term": { "body": "to" } },
            { "term": { "body": "be" } },
            { "term": { "body": "or" } },
            { "term": { "body": "not" } },
            { "term": { "body": "be" } }
          ]
        }
      }
    }

  39. Adaptive Stop-lists
    Are these stop words?
    - “video”
    - “movie”
    - “film”

  40. Adaptive Stop-lists
    Are these stop words?
    - “video”
    - “movie”
    - “film”
    For YouTube, they might be!

  41. Adaptive Stop-lists
    - Common Terms uses your index for frequency
    - Adapts to your domain
    - No manual stop-list creation/maintenance
    - Adapts to language, etc
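To illustrate the point, the sketch below derives the high-frequency set from two tiny, made-up corpora. It assumes whitespace tokenization and treats a term as high-frequency when the share of documents containing it exceeds the cutoff; the corpora and function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def high_frequency_terms(docs, cutoff_frequency):
    """Return the terms whose share of documents exceeds the cutoff."""
    df = document_frequency(docs)
    return {term for term, count in df.items() if count / len(docs) > cutoff_frequency}


# Two hypothetical "indexes" from different domains.
VIDEO_SITE = ["funny cat video", "cooking video tutorial", "music video live", "cat tutorial"]
NEWS_SITE = ["city council vote", "video of council meeting", "storm hits city"]

print(high_frequency_terms(VIDEO_SITE, 0.6))  # "video" acts like a stop word here
print(high_frequency_terms(NEWS_SITE, 0.6))   # but not here
```

The same cutoff yields a different effective stop-list for each corpus, with no list maintained by hand.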

  42. Limitations
    - Frequencies are per-index, not per-type

  43. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency

  44. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency
    - Takes data to “warm” the query

  45. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency
    - Takes data to “warm” the query
    - Some advanced behavior missing (fuzzy, etc.)

  46. Questions?
    ಠ_ಠ
    @ZacharyTong
    polyfractal on IRC

  47. Resources
    Common Terms Docs: http://bit.ly/1an7NOd
    “Stop Stopping Stopwords”: http://bit.ly/17hE2uq