Common Terms
Have your cake and eat it too
Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.
Friday, November 22, 2013
@ZacharyTong
polyfractal on IRC
Developing - Writing - Training
ಠ_ಠ (amoeba)
Stop Words
or, it, be, a, and, to
- Often empty of meaning
- Used frequently: in “the quick and brown fox jumped over a ledge”, 33% of the words (the, and, a: 3 of 9) are stop words
- Bloat inverted index
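The 33% figure can be checked with a few lines of Python. This is only a sketch: the stop list below is the slide's examples plus "the", which the 33% count appears to include (an assumption), and `stop_word_share` is a hypothetical helper name.

```python
# Slide examples plus "the" (assumed, since the 33% figure needs it).
STOP_WORDS = {"or", "it", "be", "a", "and", "to", "the"}


def stop_word_share(text: str) -> float:
    """Return the fraction of whitespace-separated words that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in STOP_WORDS)
    return hits / len(words)


sentence = "the quick and brown fox jumped over a ledge"
share = stop_word_share(sentence)
print(f"{share:.0%} of the words are stop words")
```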
Stop Words...
...are frequent
...have little discriminatory value
...hurt performance
“To be or not to be”
- every word here is a typical stop word, so removing stop words leaves nothing to search for
Multi-field mapping
- with stop filter
- without stop filter
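As a rough illustration of the multi-field workaround, the sketch below builds index settings that analyze the same text twice, once with a stop filter and once without. It uses the `fields` sub-field syntax of more recent Elasticsearch releases, which differs from the mapping syntax in use when this talk was given. The field name `body`, the sub-field name `with_stops`, and both analyzer names are hypothetical.

```python
import json

# Hypothetical names throughout: "body", "with_stops", "stops_removed",
# and "stops_kept" are illustrative, not taken from the slides.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Analyzer WITH the stop filter: stop words are dropped.
                "stops_removed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                },
                # Analyzer WITHOUT the stop filter: every word is indexed.
                "stops_kept": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "stops_removed",
                # Second copy of the same text, analyzed without stop removal.
                "fields": {
                    "with_stops": {"type": "text", "analyzer": "stops_kept"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Phrases such as “to be or not to be” could then be searched against `body.with_stops`, at the cost of indexing the text twice.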
Remember, stop-words bloat the inverted index
Multi-field mapping: 2x! (the same text is indexed twice)
Other general problems:
- manually maintain stop-list
- language dependent
- domain dependent
- makes query scoring tricky
Common Terms
Intelligent stop-word removal
Overview
- identify “important” terms in query
- find documents with “important” terms
- score those matching docs with entire query
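The three steps in the overview can be sketched over a tiny in-memory corpus. This is a toy model, not Elasticsearch's implementation: the documents, the 0.5 cutoff, the "at least one important term" matching rule, and the IDF-style score are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

# Hypothetical six-document corpus.
DOCS = {
    1: "the quick brown fox",
    2: "the lazy dog and the cat",
    3: "a fox jumped over the ledge",
    4: "the cat and a dog",
    5: "a bird and the bee",
    6: "the sun and a moon",
}


def tokenize(text):
    return text.lower().split()


def common_terms_search(query, docs, cutoff):
    """Toy version of the overview: split terms, match on rare, score on all."""
    tokens = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for terms_in_doc in tokens.values():
        doc_freq.update(terms_in_doc)
    total = len(docs)
    terms = list(dict.fromkeys(tokenize(query)))  # de-duplicate, keep order

    # Step 1: identify "important" (low document frequency) terms.
    low = [t for t in terms if doc_freq[t] / total <= cutoff]
    high = [t for t in terms if doc_freq[t] / total > cutoff]

    # Step 2: find documents containing at least one important term.
    candidates = [d for d, toks in tokens.items() if any(t in toks for t in low)]

    # Step 3: score those candidates with the entire query (toy IDF sum).
    def score(doc_id):
        return sum(
            math.log(1 + total / doc_freq[t])
            for t in terms
            if t in tokens[doc_id]
        )

    ranked = sorted(candidates, key=lambda d: (-score(d), d))
    return low, high, ranked


low, high, ranked = common_terms_search(
    "the quick and brown fox jumped over the ledge", DOCS, cutoff=0.5
)
print("important:", low)
print("unimportant:", high)
print("ranked doc ids:", ranked)
```

Documents that contain only “the” and “and” never become candidates, yet those terms still contribute to the score of documents that do match.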
“important”: quick, brown, fox, jumped, over, ledge
“unimportant”: the, and, a
Document Frequency:
- high: the, and, a
- low: quick, brown, fox, jumped, over, ledge
The Query
{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001
    }
  }
}
Controlling Leniency
{
  "common": {
    "body": {
      "query": "the quick and brown fox jumped over the ledge",
      "cutoff_frequency": 0.001,
      "low_freq_operator": "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
        "low_freq": "60%",
        "high_freq": "20%"
      }
    }
  }
}
- low_freq_operator: use “or” for low-freq terms
- minimum_should_match: how many clauses should match
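To make the percentages concrete, the sketch below works out how many clauses each setting would require, assuming the documented behaviour that a positive percentage in `minimum_should_match` is rounded down. The term groups are taken from the document-frequency slide (three high-frequency terms, six low-frequency terms); `required_clauses` is a hypothetical helper, not an Elasticsearch function.

```python
def required_clauses(num_clauses: int, percent: int) -> int:
    """Clauses that must match for a positive percentage, rounded down."""
    return (num_clauses * percent) // 100


# Term groups as shown on the document-frequency slide.
low_freq_terms = ["quick", "brown", "fox", "jumped", "over", "ledge"]
high_freq_terms = ["the", "and", "a"]

low_required = required_clauses(len(low_freq_terms), 60)
high_required = required_clauses(len(high_freq_terms), 20)

print(f"low_freq 60% of {len(low_freq_terms)} terms -> {low_required} required")
print(f"high_freq 20% of {len(high_freq_terms)} terms -> {high_required} required")
```

Under these assumptions, three of the six low-frequency terms must match, while 20% of three high-frequency terms rounds down to zero, so none of them is required.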
Document Frequency:
- high: the, and, a
- low: quick, brown, fox, jumped, over, ledge
Document Frequency:
- high: the, and, a, over, quick
- low: brown, fox, jumped, ledge
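The slides do not say what moved “over” and “quick” into the high-frequency group; a different cutoff or a different corpus would both do it. The sketch below illustrates the cutoff case only, with invented document frequencies for an index of one million documents. `split_by_cutoff` is a hypothetical helper.

```python
# Hypothetical document frequencies; none of these numbers come from the slides.
TOTAL_DOCS = 1_000_000
DOC_FREQ = {
    "the": 950_000,
    "and": 900_000,
    "a": 880_000,
    "over": 40_000,
    "quick": 25_000,
    "brown": 4_000,
    "fox": 900,
    "jumped": 700,
    "ledge": 300,
}


def split_by_cutoff(doc_freq, total_docs, cutoff):
    """Split terms into (high, low) groups by relative document frequency."""
    high = [t for t, df in doc_freq.items() if df / total_docs > cutoff]
    low = [t for t, df in doc_freq.items() if df / total_docs <= cutoff]
    return high, low


for cutoff in (0.1, 0.01):
    high, low = split_by_cutoff(DOC_FREQ, TOTAL_DOCS, cutoff)
    print(f"cutoff={cutoff}: high={high} low={low}")
```

With these made-up numbers, lowering the cutoff from 0.1 to 0.01 reproduces the shift between the two groupings.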
“To be or not to be”
All high-frequency terms
{
  "common": {
    "body": {
      "query": "to be or not to be",
      "cutoff_frequency": 0.001
    }
  }
}
Controlling Leniency
{
  "common": {
    "body": {
      "query": "to be or not to be",
      "cutoff_frequency": 0.001,
      "minimum_should_match": {
        "low_freq": "60%"
      }
    }
  }
}
- minimum_should_match: how many clauses should match
Adaptive Stop-lists
Are these stop words?
- “video”
- “movie”
- “film”
For YouTube, they might be!
Adaptive Stop-lists
- Common Terms uses your index for frequency
- Adapts to your domain
- No manual stop-list creation/maintenance
- Adapts to language, etc.
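The idea that the index itself decides what counts as a stop word can be illustrated by computing document frequencies for two small corpora. Both corpora, the 0.5 cutoff, and the helper name `high_frequency_terms` are invented for this sketch.

```python
from collections import Counter


def high_frequency_terms(docs, cutoff):
    """Return terms whose share of documents exceeds the cutoff."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return {term for term, df in doc_freq.items() if df / len(docs) > cutoff}


# Hypothetical titles from a video site and from a news site.
VIDEO_SITE = [
    "funny cat video",
    "cooking video tutorial",
    "music video live",
    "travel video vlog",
]
NEWS_SITE = [
    "election results tonight",
    "new video game released",
    "storm hits coast",
    "markets rally again",
]

video_common = high_frequency_terms(VIDEO_SITE, cutoff=0.5)
news_common = high_frequency_terms(NEWS_SITE, cutoff=0.5)
print("video site:", video_common)
print("news site:", news_common)
```

In this toy data, “video” behaves like a stop word on the video site but stays a discriminating term on the news site, with no stop list maintained by hand.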
Limitations
- Frequencies are per-index, not per-type
- No good way to pick cutoff frequency
- Takes data to “warm” the query
- Some advanced behavior missing (fuzzy, etc.)
Questions?
ಠ_ಠ
@ZacharyTong
polyfractal on IRC