Common Terms Query

Zachary Tong
November 22, 2013

Presentation given at the Chicago Elasticsearch November Meetup

Transcript

  1. Common Terms: Have your cake and eat it too
    Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.
    Friday, November 22, 13
  2. @ZacharyTong, polyfractal on IRC
    Developing - Writing - Training
    ಠ_ಠ (amoeba)
  3. or, it, be, a, and, to
  4. or, it, be, a, and, to
    - Often empty of meaning
  5. or, it, be, a, and, to
    - Often empty of meaning
    “the quick and brown fox jumped over a ledge”
  6. or, it, be, a, and, to
    - Often empty of meaning
    “the quick and brown fox jumped over a ledge”
  7. or, it, be, a, and, to
    - Often empty of meaning
    - Used frequently
    “the quick and brown fox jumped over a ledge”: 33%
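The 33% figure on this slide can be checked by counting tokens. The sketch below is a minimal illustration, assuming whitespace tokenization and a hypothetical stop list made of the slide's six words plus "the" (which the slide does not list but the example sentence contains).

```python
# Hypothetical stop list for illustration: the slide's six words plus "the".
STOP_WORDS = {"or", "it", "be", "a", "and", "to", "the"}


def stop_word_fraction(text, stop_words=STOP_WORDS):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


sentence = "the quick and brown fox jumped over a ledge"
fraction = stop_word_fraction(sentence)
print(f"{fraction:.0%}")  # 3 of the 9 tokens are stop words
```

Under these assumptions, "the", "and" and "a" account for three of the nine tokens.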
  8. or, it, be, a, and, to
    - Often empty of meaning
    - Used frequently
    - Bloat inverted index
  9. Stop Words...
    - ...frequent
    - ...little discriminatory value
    - ...hurt performance
  10. Stop Words...
    Stop Filter
  11. “To be or not to be”
  12. “To be or not to be”
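The point of this example is presumably that a stop filter can erase an entire query. The sketch below is a toy stand-in for an analyzer's stop filter, not Elasticsearch's implementation; the stop list is a hypothetical one chosen for illustration.

```python
# Hypothetical stop list; real analyzers ship their own per-language lists.
STOP_WORDS = {"a", "and", "be", "it", "not", "or", "the", "to"}


def stop_filter(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [token for token in text.lower().split() if token not in stop_words]


remaining = stop_filter("To be or not to be")
kept = stop_filter("the quick and brown fox")
print(remaining)  # nothing survives, so there is nothing left to search for
print(kept)
```

With this list, every token of the quotation is removed, while the second phrase keeps its content words.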
  13. Multi-field mapping
    - with stop filter
    - without stop filter
  14. Multi-field mapping
    { "query": { "bool": { "should": [
        { "match": { "body": { "query": "quick fox", "boost": 3 } } },
        { "match": { "body.without_stop": { "query": "quick fox", "boost": 1 } } }
    ] } } }
    Boost stop-removed match
    But check stop-words too
  15. Remember, stop-words:
    Bloat inverted index
    ?
  16. Bloat inverted index
    Multi-field mapping: 2x!
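To make the index-size argument concrete, the sketch below builds a toy inverted index twice, once keeping stop words and once filtering them, and counts postings. The documents, the stop list and the helper names are all hypothetical, and a real index stores far more than posting lists, so this only illustrates the direction of the effect.

```python
from collections import defaultdict

# Hypothetical stop list and documents, for illustration only.
STOP_WORDS = {"a", "and", "be", "it", "not", "or", "the", "to"}
DOCS = [
    "the quick and brown fox jumped over a ledge",
    "to be or not to be",
    "the fox and the ledge",
]


def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return index


def posting_count(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(postings) for postings in index.values())


unfiltered = posting_count(build_index(DOCS))
filtered = posting_count(build_index(DOCS, STOP_WORDS))
multi_field = unfiltered + filtered  # a multi-field mapping stores both versions

print(unfiltered, filtered, multi_field)
```

A multi-field mapping pays for both indexes at once, which is the cost the slide is pointing at.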
  17. Other general problems:
    - manually maintain stop-list
    - language dependent
    - domain dependent
    - makes query scoring tricky
  18. Common Terms
    Intelligent stop-word removal
  19. Overview
    - identify “important” terms in query
    - find documents with “important” terms
    - score those matching docs with entire query
  20. “important”: quick, brown, fox, jumped, over, ledge
    “unimportant”: the, and, a
  21. Document Frequency:
    - high: the, and, a
    - low: quick, brown, fox, jumped, over, ledge
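The high/low split can be sketched as a document-frequency calculation: the fraction of documents that contain a term, compared against a cutoff. The corpus, the 0.5 cutoff and the function names below are hypothetical; the cutoff is far larger than anything used in the deck only because the toy corpus has four documents.

```python
def document_frequencies(docs):
    """Return, for each term, the fraction of documents that contain it."""
    total = len(docs)
    counts = {}
    for text in docs:
        for term in set(text.lower().split()):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / total for term, count in counts.items()}


def split_by_cutoff(frequencies, cutoff):
    """Terms whose document frequency exceeds the cutoff are high-frequency."""
    high = {term for term, freq in frequencies.items() if freq > cutoff}
    low = set(frequencies) - high
    return high, low


# Hypothetical four-document corpus.
docs = [
    "the quick and brown fox",
    "the lazy dog and a cat",
    "a fox jumped over the ledge",
    "the dog and a ledge",
]
df = document_frequencies(docs)
high, low = split_by_cutoff(df, cutoff=0.5)
print(sorted(high))
print(sorted(low))
```

In this corpus only "the", "and" and "a" appear in more than half of the documents, so they land in the high-frequency group.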
  22. The Query
    { "common": { "body": {
        "query": "the quick and brown fox jumped over the ledge",
        "cutoff_frequency": 0.001
    } } }
  23. internal execution (roughly)
    step 1: find docs w/ “important”
    { "query": { "bool": { "must": [
        { "term": { "body": "quick" } },
        { "term": { "body": "brown" } },
        { "term": { "body": "fox" } },
        { "term": { "body": "jumped" } },
        { "term": { "body": "over" } },
        { "term": { "body": "ledge" } }
    ] } } }
  24. internal execution (roughly)
    step 2: score matching docs
    { "query": { "bool": {
        "must": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
        ],
        "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
        ]
    } } }
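The two steps can be imitated over an in-memory list of documents. This is a rough sketch under stated assumptions, not how Elasticsearch or Lucene execute the query: the set of high-frequency terms is passed in by hand, and the "score" is simply the number of distinct query terms a document contains, standing in for real relevance scoring.

```python
def common_terms_search(docs, query, high_freq_terms):
    """Filter on the low-frequency terms, then score with every query term."""
    terms = query.lower().split()
    low_freq = [term for term in terms if term not in high_freq_terms]
    results = []
    for doc_id, text in enumerate(docs):
        tokens = set(text.lower().split())
        # Step 1: a document qualifies only if it has every "important" term.
        # (If low_freq is empty, every document qualifies in this sketch.)
        if not all(term in tokens for term in low_freq):
            continue
        # Step 2: score qualifying documents with the entire query.
        score = sum(1 for term in set(terms) if term in tokens)
        results.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical documents.
docs = [
    "the quick brown fox",
    "quick brown fox",
    "the lazy dog",
    "a quick fox and the brown dog",
]
results = common_terms_search(docs, "the quick brown fox", {"the", "and", "a"})
print(results)
```

The document containing only "the" never enters the result set, yet "the" still raises the score of documents that matched on the important terms.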
  25. Controlling Leniency
    use “or” for low-freq terms
    { "common": { "body": {
        "query": "the quick and brown fox jumped over the ledge",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "or",
        "high_freq_operator": "or",
        "minimum_should_match": { "low_freq": "60%", "high_freq": "20%" }
    } } }
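When the query is assembled in application code, the optional parameters are easy to manage as keyword arguments. The helper below is a hypothetical convenience function, not part of any Elasticsearch client; it only builds the request body as a plain dict using the parameter names shown on the slide.

```python
import json


def common_terms_query(field, text, cutoff_frequency,
                       low_freq_operator=None, high_freq_operator=None,
                       low_freq_min=None, high_freq_min=None):
    """Build a `common` query body as a dict, omitting unset options."""
    params = {"query": text, "cutoff_frequency": cutoff_frequency}
    if low_freq_operator is not None:
        params["low_freq_operator"] = low_freq_operator
    if high_freq_operator is not None:
        params["high_freq_operator"] = high_freq_operator
    minimum = {}
    if low_freq_min is not None:
        minimum["low_freq"] = low_freq_min
    if high_freq_min is not None:
        minimum["high_freq"] = high_freq_min
    if minimum:
        params["minimum_should_match"] = minimum
    return {"common": {field: params}}


body = common_terms_query(
    "body",
    "the quick and brown fox jumped over the ledge",
    cutoff_frequency=0.001,
    low_freq_operator="or",
    high_freq_operator="or",
    low_freq_min="60%",
    high_freq_min="20%",
)
print(json.dumps(body, indent=2))
```

Called with only the first three arguments, the same helper produces the minimal form of the query.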
  26. Controlling Leniency
    how many clauses should match
    { "common": { "body": {
        "query": "the quick and brown fox jumped over the ledge",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "or",
        "high_freq_operator": "or",
        "minimum_should_match": { "low_freq": "60%", "high_freq": "20%" }
    } } }
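A percentage only becomes meaningful once it is converted to a clause count. The sketch below does that conversion for six low-frequency and three high-frequency terms. The rounding direction is left as a parameter on purpose: rounding up reproduces the counts the deck gives for this query (4 and 1), whereas Elasticsearch's `minimum_should_match` documentation describes rounding the computed number down, which would give 3 and 0. The function name is hypothetical.

```python
def required_clauses(percent, clause_count, round_up=False):
    """Convert a whole-number percentage into a minimum number of clauses."""
    scaled = clause_count * percent  # integer arithmetic avoids float error
    if round_up:
        return -(-scaled // 100)  # ceiling division
    return scaled // 100  # floor division


low_up = required_clauses(60, 6, round_up=True)
high_up = required_clauses(20, 3, round_up=True)
low_down = required_clauses(60, 6)
high_down = required_clauses(20, 3)
print(low_up, high_up, low_down, high_down)
```

Checking the behavior against the Elasticsearch version in use is the safest way to know which count applies.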
  27. Controlling Leniency
    internal execution (roughly)
    (use “or”)
    { "query": { "bool": {
        "must": [ { "bool": { "should": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
        ] } } ],
        "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
        ]
    } } }
  28. Controlling Leniency
    internal execution (roughly)
    { "query": { "bool": {
        "must": [ { "bool": { "should": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
        ] } } ],
        "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
        ]
    } } }
    - low-frequency group: 4 clauses must match (60%)
    - high-frequency group: 1 clause must match (20%)
  29. Controlling Importance
    adjust the high/low cutoff (0.1%)
    { "common": { "body": {
        "query": "the quick and brown fox jumped over the ledge",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "or",
        "high_freq_operator": "or",
        "minimum_should_match": { "low_freq": "60%", "high_freq": "20%" }
    } } }
  30. Document Frequency:
    - high: the, and, a
    - low: quick, brown, fox, jumped, over, ledge
  31. Controlling Importance
    adjust the high/low cutoff (10.0%)
    { "common": { "body": {
        "query": "the quick and brown fox jumped over the ledge",
        "cutoff_frequency": 0.10,
        "low_freq_operator": "or",
        "high_freq_operator": "or",
        "minimum_should_match": { "low_freq": "60%", "high_freq": "20%" }
    } } }
  32. Document Frequency:
    - high: the, and, a, over, quick
    - low: brown, fox, jumped, ledge
  33. “To be or not to be”
  34. All high-frequency terms
    { "common": { "body": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.001
    } } }
  35. All high-frequency terms
    internal execution (roughly)
    automatically a “must”
    { "query": { "bool": { "must": [
        { "term": { "body": "to" } },
        { "term": { "body": "be" } },
        { "term": { "body": "or" } },
        { "term": { "body": "not" } },
        { "term": { "body": "be" } }
    ] } } }
  36. Controlling Leniency
    how many clauses should match
    { "common": { "body": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.001,
        "minimum_should_match": { "low_freq": "60%" }
    } } }
  37. All high-frequency terms
    internal execution (roughly)
    becomes a “should”
    { "query": { "bool": { "should": [
        { "term": { "body": "to" } },
        { "term": { "body": "be" } },
        { "term": { "body": "or" } },
        { "term": { "body": "not" } },
        { "term": { "body": "be" } }
    ] } } }
    3 clauses must match (60%)
  38. Adaptive Stop-lists
    Are these stop words?
    - “video”
    - “movie”
    - “film”
  39. Adaptive Stop-lists
    Are these stop words?
    - “video”
    - “movie”
    - “film”
    For YouTube, they might be!
  40. Adaptive Stop-lists
    - Common Terms uses your index for frequency
    - Adapts to your domain
    - No manual stop-list creation/maintenance
    - Adapts to language, etc.
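The adaptive behavior follows from computing frequencies on the index itself, so the same word can be common in one corpus and rare in another. The sketch below compares two hypothetical corpora with a hypothetical cutoff of 0.5 (large only because each toy corpus holds four documents); the function name is illustrative.

```python
def high_frequency_terms(docs, cutoff):
    """Return the terms found in more than `cutoff` of the documents."""
    total = len(docs)
    counts = {}
    for text in docs:
        for term in set(text.lower().split()):
            counts[term] = counts.get(term, 0) + 1
    return {term for term, count in counts.items() if count / total > cutoff}


# Two hypothetical corpora from different domains.
video_site = [
    "funny cat video",
    "cooking video tutorial",
    "music video live",
    "travel video vlog",
]
news_site = [
    "election results tonight",
    "viral video sparks debate",
    "markets close higher",
    "storm warning issued",
]

video_stop_like = high_frequency_terms(video_site, cutoff=0.5)
news_stop_like = high_frequency_terms(news_site, cutoff=0.5)
print(video_stop_like, news_stop_like)
```

No stop list is maintained anywhere: "video" behaves like a stop word on the video site purely because of how often it occurs there.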
  41. Limitations
    - Frequencies are per-index, not per-type
  42. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency
  43. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency
    - Takes data to “warm” the query
  44. Limitations
    - Frequencies are per-index, not per-type
    - No good way to pick cutoff frequency
    - Takes data to “warm” the query
    - Some advanced behavior missing (fuzzy, etc.)
  45. Questions?
    ಠ_ಠ
    @ZacharyTong, polyfractal on IRC
  46. Resources
    - Common Terms Docs: http://bit.ly/1an7NOd
    - “Stop Stopping Stopwords”: http://bit.ly/17hE2uq