Slide 1

Slide 1 text

Common Terms
Have your cake and eat it too
Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.
Friday, November 22, 2013

Slide 2

Slide 2 text

@ZacharyTong
polyfractal on IRC
Developing - Writing - Training
ಠ_ಠ (amoeba)

Slide 3

Slide 3 text

Stop Words

Slide 4

Slide 4 text

or, it, be, a, and, to

Slide 5

Slide 5 text

or, it, be, a, and, to
- Often empty of meaning

Slide 6

Slide 6 text

or, it, be, a, and, to
- Often empty of meaning
“the quick and brown fox jumped over a ledge”

Slide 7

Slide 7 text

or, it, be, a, and, to
- Often empty of meaning
“the quick and brown fox jumped over a ledge”

Slide 8

Slide 8 text

or, it, be, a, and, to
- Often empty of meaning
- Used frequently
“the quick and brown fox jumped over a ledge”
33% of the words are stop words (the, and, a)
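The 33% figure can be checked with a few lines of Python. This is only a sketch: the stop-list is the one from the slides plus "the", which is an assumption (the slide's list does not include it, but the figure only works out if it counts), and `stop_word_fraction` is a hypothetical helper name.

```python
# Stop-list from the slides, plus "the" (an assumption, see the note above).
STOP_WORDS = {"or", "it", "be", "a", "and", "to", "the"}


def stop_word_fraction(text, stop_words=STOP_WORDS):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


sentence = "the quick and brown fox jumped over a ledge"
fraction = stop_word_fraction(sentence)
print(f"{fraction:.0%} of the tokens are stop words")
```

Three of the nine tokens ("the", "and", "a") are on the list, which is where the one-third comes from.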

Slide 9

Slide 9 text

or, it, be, a, and, to
- Often empty of meaning
- Used frequently
- Bloat inverted index

Slide 10

Slide 10 text

Stop Words...
...are frequent
...have little discriminatory value
...hurt performance

Slide 11

Slide 11 text

Stop Words...
Stop Filter

Slide 12

Slide 12 text

“To be or not to be”

Slide 13

Slide 13 text

“To be or not to be”

Slide 14

Slide 14 text

Multi-field mapping
- with stop filter
- without stop filter

Slide 15

Slide 15 text

Multi-field mapping

{
  "query": {
    "bool": {
      "should": [
        { "match": { "body": { "query": "quick fox", "boost": 3 } } },
        { "match": { "body.without_stop": { "query": "quick fox", "boost": 1 } } }
      ]
    }
  }
}

- "body" (boost 3): boost the stop-removed match
- "body.without_stop" (boost 1): but check stop-words too

Slide 16

Slide 16 text

Remember, stop-words:
- Bloat inverted index
?

Slide 17

Slide 17 text

Bloat inverted index
Multi-field mapping: 2x!

Slide 18

Slide 18 text

Other general problems:
- manually maintain stop-list
- language dependent
- domain dependent
- makes query scoring tricky

Slide 19

Slide 19 text

Common Terms
Intelligent stop-word removal

Slide 20

Slide 20 text

Overview
- identify “important” terms in query
- find documents with “important” terms
- score those matching docs with entire query

Slide 21

Slide 21 text

“important”
- quick
- brown
- fox
- jumped
- over
- ledge

“unimportant”
- the
- and
- a

Slide 22

Slide 22 text

Document Frequency:

high
- the
- and
- a

low
- quick
- brown
- fox
- jumped
- over
- ledge
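Document frequency here means the number of documents that contain a term at least once, regardless of how often it repeats inside one document. A minimal sketch of that count, using a hypothetical four-document corpus invented for illustration (`document_frequencies` is not an Elasticsearch API):

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        # set() ensures a term repeated within one document is counted once.
        counts.update(set(doc.lower().split()))
    return counts


# Hypothetical corpus, invented for illustration.
docs = [
    "the quick fox",
    "the brown dog and the cat",
    "a fox jumped over the ledge",
    "the dog and a cat",
]
df = document_frequencies(docs)
for term in ("the", "and", "fox", "ledge"):
    print(f"{term}: {df[term]} of {len(docs)} docs")
```

Even in this tiny corpus "the" appears in every document while "ledge" appears in one, which is the high/low contrast the slide describes.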

Slide 23

Slide 23 text

The Query

{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001
    }
  }
}
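One way to picture what `cutoff_frequency` does is as a threshold on document frequency: terms above it go into the high-frequency group, the rest into the low-frequency group. The sketch below assumes that a cutoff below 1 is a fraction of all documents and a cutoff of 1 or more is an absolute document count. The document counts are hypothetical, and `split_by_cutoff` is an illustrative helper, not part of Elasticsearch.

```python
import math


def split_by_cutoff(doc_freqs, total_docs, cutoff_frequency):
    """Split terms into (low, high) frequency groups around a cutoff.

    Assumption: cutoff < 1 is a fraction of total_docs, cutoff >= 1 is an
    absolute document count.
    """
    if cutoff_frequency >= 1:
        threshold = int(cutoff_frequency)
    else:
        threshold = math.ceil(cutoff_frequency * total_docs)
    low = [t for t, df in doc_freqs.items() if df <= threshold]
    high = [t for t, df in doc_freqs.items() if df > threshold]
    return low, high


# Hypothetical document frequencies in an index of 1,000,000 documents.
TOTAL_DOCS = 1_000_000
DOC_FREQS = {
    "the": 950_000, "quick": 800, "and": 900_000, "brown": 700,
    "fox": 500, "jumped": 300, "over": 900, "a": 880_000, "ledge": 40,
}

low, high = split_by_cutoff(DOC_FREQS, TOTAL_DOCS, 0.001)
print("low: ", low)
print("high:", high)
```

With these made-up counts, a cutoff of 0.001 (about 1,000 documents) puts "the", "and" and "a" in the high group and everything else in the low group.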

Slide 24

Slide 24 text

internal execution (roughly)
step 1: find docs with “important” terms

{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "quick" } },
        { "term": { "body": "brown" } },
        { "term": { "body": "fox" } },
        { "term": { "body": "jumped" } },
        { "term": { "body": "over" } },
        { "term": { "body": "ledge" } }
      ]
    }
  }
}

Slide 25

Slide 25 text

internal execution (roughly)
step 2: score matching docs

{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "quick" } },
        { "term": { "body": "brown" } },
        { "term": { "body": "fox" } },
        { "term": { "body": "jumped" } },
        { "term": { "body": "over" } },
        { "term": { "body": "ledge" } }
      ],
      "should": [
        { "term": { "body": "the" } },
        { "term": { "body": "and" } },
        { "term": { "body": "a" } }
      ]
    }
  }
}
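The two steps can be sketched as a small builder that takes the low- and high-frequency term groups and emits the rough bool equivalent. This only illustrates the shape of the query; `rough_bool_query` is a hypothetical helper, and the real query is built internally rather than rewritten as JSON.

```python
import json


def rough_bool_query(field, low_freq_terms, high_freq_terms):
    """Build the 'roughly equivalent' bool query.

    Low-frequency terms select documents ("must"); high-frequency terms
    only contribute to the score of documents that already match ("should").
    """
    def term(value):
        return {"term": {field: value}}

    return {
        "query": {
            "bool": {
                "must": [term(t) for t in low_freq_terms],
                "should": [term(t) for t in high_freq_terms],
            }
        }
    }


query = rough_bool_query(
    "body",
    low_freq_terms=["quick", "brown", "fox", "jumped", "over", "ledge"],
    high_freq_terms=["the", "and", "a"],
)
print(json.dumps(query, indent=2))
```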

Slide 26

Slide 26 text

Controlling Leniency
use “or” for low-freq terms

{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001,
      "low_freq_operator": "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
        "low_freq": "60%",
        "high_freq": "20%"
      }
    }
  }
}

Slide 27

Slide 27 text

Controlling Leniency
how many clauses should match

{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001,
      "low_freq_operator": "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
        "low_freq": "60%",
        "high_freq": "20%"
      }
    }
  }
}

Slide 28

Slide 28 text

Controlling Leniency
internal execution (roughly)
(use “or”)

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
          ]
        }
      },
      "should": {
        "bool": {
          "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
          ]
        }
      }
    }
  }
}

Slide 29

Slide 29 text

Controlling Leniency
internal execution (roughly)

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "term": { "body": "quick" } },
            { "term": { "body": "brown" } },
            { "term": { "body": "fox" } },
            { "term": { "body": "jumped" } },
            { "term": { "body": "over" } },
            { "term": { "body": "ledge" } }
          ]
        }
      },
      "should": {
        "bool": {
          "should": [
            { "term": { "body": "the" } },
            { "term": { "body": "and" } },
            { "term": { "body": "a" } }
          ]
        }
      }
    }
  }
}

- low-frequency group: 4 clauses must match (60%)
- high-frequency group: 1 clause must match (20%)
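The clause counts on this slide can be reproduced by converting each percentage into a number of clauses. The sketch below rounds to the nearest whole clause, which is an assumption chosen because it matches the slide's numbers. Elasticsearch's general `minimum_should_match` documentation describes percentages as rounded down, which would give 3 and 0 here, so check the behavior of your version. `clauses_required` is a hypothetical helper.

```python
def clauses_required(percent, clause_count):
    """Convert a percentage such as 60 into a clause count, rounding half up.

    Round-to-nearest is an assumption made to match the slide's numbers;
    a round-down rule gives smaller counts for fractional results.
    """
    return (clause_count * percent + 50) // 100


low_freq_terms = ["quick", "brown", "fox", "jumped", "over", "ledge"]
high_freq_terms = ["the", "and", "a"]

low_required = clauses_required(60, len(low_freq_terms))    # 60% of 6 clauses
high_required = clauses_required(20, len(high_freq_terms))  # 20% of 3 clauses
print(low_required, high_required)
```

Integer arithmetic is used so that the result does not depend on floating-point rounding.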

Slide 30

Slide 30 text

Controlling Importance
adjust the high/low cutoff (0.1%)

{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001,
      "low_freq_operator": "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
        "low_freq": "60%",
        "high_freq": "20%"
      }
    }
  }
}

Slide 31

Slide 31 text

Document Frequency:

high
- the
- and
- a

low
- quick
- brown
- fox
- jumped
- over
- ledge

Slide 32

Slide 32 text

Controlling Importance
adjust the high/low cutoff (10.0%)

{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.10,
      "low_freq_operator": "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
        "low_freq": "60%",
        "high_freq": "20%"
      }
    }
  }
}

Slide 33

Slide 33 text

Document Frequency:

high
- the
- and
- a
- over
- quick

low
- brown
- fox
- jumped
- ledge

Slide 34

Slide 34 text

“To be or not to be”

Slide 35

Slide 35 text

All high-frequency terms

{
  "common": {
    "body": {
      "query": "to be or not to be",
      "cutoff_frequency": 0.001
    }
  }
}

Slide 36

Slide 36 text

All high-frequency terms
internal execution (roughly)
automatically a “must”

{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "to" } },
        { "term": { "body": "be" } },
        { "term": { "body": "or" } },
        { "term": { "body": "not" } },
        { "term": { "body": "be" } }
      ]
    }
  }
}

Slide 37

Slide 37 text

Controlling Leniency
how many clauses should match

{
  "common": {
    "body": {
      "query": "to be or not to be",
      "cutoff_frequency": 0.001,
      "minimum_should_match": {
        "low_freq": "60%"
      }
    }
  }
}

Slide 38

Slide 38 text

All high-frequency terms
internal execution (roughly)
becomes a “should”
3 clauses must match (60%)

{
  "query": {
    "bool": {
      "should": [
        { "term": { "body": "to" } },
        { "term": { "body": "be" } },
        { "term": { "body": "or" } },
        { "term": { "body": "not" } },
        { "term": { "body": "be" } }
      ]
    }
  }
}
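The switch from “must” to “should” for a query made up only of high-frequency terms can be sketched as below. `all_high_freq_query` is a hypothetical helper that mirrors the rough equivalents on the slides, and the rounding rule for the percentage is an assumption (it does not matter here, since 60% of 5 is exactly 3).

```python
def all_high_freq_query(field, terms, min_should_match_percent=None):
    """Rough bool query for a search containing only high-frequency terms.

    Without a minimum-should-match setting every term is required ("must").
    With one, the terms become optional ("should") and a count is enforced.
    """
    clauses = [{"term": {field: t}} for t in terms]
    if min_should_match_percent is None:
        return {"bool": {"must": clauses}}
    # Assumption: round to the nearest whole clause, using integer arithmetic.
    required = (len(clauses) * min_should_match_percent + 50) // 100
    return {"bool": {"should": clauses, "minimum_should_match": required}}


terms = ["to", "be", "or", "not", "be"]  # term list as shown on the slide
strict = all_high_freq_query("body", terms)
lenient = all_high_freq_query("body", terms, min_should_match_percent=60)
print(sorted(strict["bool"]))
print(lenient["bool"]["minimum_should_match"])
```

The strict form requires every term, while the lenient form accepts any document matching three of the five clauses.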

Slide 39

Slide 39 text

Adaptive Stop-lists
Are these stop words?
- “video”
- “movie”
- “film”

Slide 40

Slide 40 text

Adaptive Stop-lists
Are these stop words?
- “video”
- “movie”
- “film”
For YouTube, they might be!

Slide 41

Slide 41 text

Adaptive Stop-lists
- Common Terms uses your index for frequency
- Adapts to your domain
- No manual stop-list creation/maintenance
- Adapts to language, etc.

Slide 42

Slide 42 text

Limitations
- Frequencies are per-index, not per-type

Slide 43

Slide 43 text

Limitations
- Frequencies are per-index, not per-type
- No good way to pick cutoff frequency

Slide 44

Slide 44 text

Limitations
- Frequencies are per-index, not per-type
- No good way to pick cutoff frequency
- Takes data to “warm” the query

Slide 45

Slide 45 text

Limitations
- Frequencies are per-index, not per-type
- No good way to pick cutoff frequency
- Takes data to “warm” the query
- Some advanced behavior missing (fuzzy, etc.)

Slide 46

Slide 46 text

Questions?
ಠ_ಠ
@ZacharyTong
polyfractal on IRC

Slide 47

Slide 47 text

Resources
- Common Terms Docs: http://bit.ly/1an7NOd
- “Stop Stopping Stopwords”: http://bit.ly/17hE2uq