The Significant Terms Aggregation

Significant terms

Terms aggregation
Given a query, gather from a field, across all docs that match:
• the most frequent terms
• the least frequent terms
• terms in alphabetical order
• …
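A minimal sketch of what such a request could look like, written as a Python dict that serializes to the JSON search body. The field names "body" and "places" are borrowed from the Reuters mapping in this deck; the aggregation name "top_terms" and the helper function are made up for illustration.

```python
import json


def terms_aggregation_request(query_text, field, size=10):
    """Build a search body: match docs by query, then count terms in `field`."""
    return {
        # Assumption: the article text lives in a field called "body".
        "query": {"match": {"body": query_text}},
        # Return no hits, only the aggregation result.
        "size": 0,
        "aggregations": {
            # "top_terms" is a hypothetical, user-chosen aggregation name.
            "top_terms": {"terms": {"field": field, "size": size}}
        },
    }


request_body = terms_aggregation_request("oil", "places")
print(json.dumps(request_body, indent=2))
```

By default the terms aggregation returns the most frequent terms first; the other orderings listed above are selected through its `order` option.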

Example: Reuters

    "article": {
      "properties": {
        "body":          { "type": "string" },
        "date":          { "type": "string" },
        "organisations": { "type": "string" },
        "places":        { "type": "string", "index": "not_analyzed" },
        "title":         { "type": "string" },
        "topics":        { "type": "string" }
      }
    }

Field contents:
• body: article text
• organisations: opec, worldbank, un, …
• places: usa, uk, …
• topics: earn, money, grain, …

Term aggregation?

Significant terms
Terms that appear more often in the subset than in the whole document collection.

Is Y a significant term for a class of documents?
• Documents in class (matching query, for example term X)
• Documents not containing term Y
• Documents containing term Y
• Documents in class and containing term Y

In math terms
• C: all documents
• T: documents containing the term
• S: subset of documents matching the query
• I: documents in both S and T (I = S ∩ T)
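With these four sets, "more often in the subset than in the whole collection" means comparing |I| / |S| against |T| / |C|. The sketch below turns that comparison into one plausible score, the absolute change times the relative change. This is an assumption chosen for illustration and not necessarily the formula Elasticsearch applies.

```python
def significance(c_size, s_size, t_size, i_size):
    """Score how much more frequent a term is in the subset S than in C.

    c_size = |C|, s_size = |S|, t_size = |T|, i_size = |I|.
    """
    if c_size <= 0 or s_size <= 0 or t_size <= 0:
        raise ValueError("C, S and T must be non-empty")
    foreground = i_size / s_size  # share of subset docs containing the term
    background = t_size / c_size  # share of all docs containing the term
    if foreground <= background:
        return 0.0  # not more frequent in the subset, so not significant
    # Illustrative choice: absolute change multiplied by relative change.
    return (foreground - background) * (foreground / background)


# Term occurs in 25 of 100 subset docs but only 50 of 1000 docs overall.
score = significance(c_size=1000, s_size=100, t_size=50, i_size=25)
print(score)
```

A term that is common everywhere (foreground equal to background) scores 0, which is what separates this from a plain terms aggregation.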

Example 1: Text features

Example 2: Recommendations

Movielens
Rating 1..5; use 4, 5 as good, 1..3 as bad

    {
      "gender": "F",
      "age": "35",
      "pos": " 3798 2059 1257 1259 586 589 1 5 3…",
      "zipcode": "97401\n",
      "neg": " 719 3791 2054 585 587 3004 7 …",
      "occupation": "7"
    }

Example 2: Recommendations
Given a movie id:
1. find all users liking this movie
2. find the significant terms within the "good" field ("pos" in the Movielens document)
-> "People who liked this movie also liked…"
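The two steps map onto one search request: a query that selects the users, and a significant_terms aggregation over the same field. A hedged sketch as a Python dict follows. It assumes the "pos" field is analyzed so that each whitespace-separated movie id becomes its own term; the aggregation name "also_liked" and the helper are hypothetical.

```python
import json


def also_liked_request(movie_id, liked_field="pos", size=10):
    """Build a search body for 'people who liked this movie also liked…'."""
    return {
        # Step 1: all users whose liked-movies field contains this id.
        "query": {"term": {liked_field: str(movie_id)}},
        # Only the aggregation is of interest, so request no hits.
        "size": 0,
        "aggregations": {
            # Step 2: movie ids unusually frequent among those users.
            "also_liked": {
                "significant_terms": {"field": liked_field, "size": size}
            }
        },
    }


request_body = also_liked_request(1259)
print(json.dumps(request_body, indent=2))
```

The queried movie itself will typically show up among the returned terms, since every matching user liked it, so a caller would usually filter it out.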

If results are unexpected…
Try setting shard_size or shard_min_doc_count.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
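A sketch of where these settings go in the request body, again as a Python dict. The parameter is spelled `shard_min_doc_count` in the reference documentation. The concrete numbers, the field "topics" and the aggregation name are placeholders, not recommendations.

```python
import json


def tuned_significant_terms(field, shard_size=200, shard_min_doc_count=2):
    """Build a significant_terms aggregation with per-shard tuning options."""
    return {
        "aggregations": {
            # "significant" is a hypothetical, user-chosen aggregation name.
            "significant": {
                "significant_terms": {
                    "field": field,
                    # Candidate terms each shard sends to the coordinating node.
                    "shard_size": shard_size,
                    # Minimum docs a term needs on a shard to be considered.
                    "shard_min_doc_count": shard_min_doc_count,
                }
            }
        }
    }


aggregation_body = tuned_significant_terms("topics")
print(json.dumps(aggregation_body, indent=2))
```

A larger shard_size trades memory and network traffic for accuracy, because each shard only sees its own slice of the counts.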

WIP
• Chi square
• G square
• Google normalized distance
• Mutual information
• …
…although we really do not know if you actually need that :-)
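To give an idea of what one of these measures looks like, here is the textbook Pearson chi-square statistic on the 2x2 table formed by "in subset or not" and "contains term or not", using the set sizes |C|, |S|, |T| and |I| from the math slide. This is a sketch of the standard formula, not Elasticsearch's implementation.

```python
def chi_square(c_size, s_size, t_size, i_size):
    """Pearson chi-square for the 2x2 table: subset membership vs. term presence."""
    a = i_size                               # in S, contains the term
    b = s_size - i_size                      # in S, does not contain the term
    c = t_size - i_size                      # outside S, contains the term
    d = c_size - s_size - t_size + i_size    # outside S, does not contain it
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return c_size * (a * d - b * c) ** 2 / denominator


# Term occurs in 25 of 100 subset docs but only 50 of 1000 docs overall.
print(chi_square(c_size=1000, s_size=100, t_size=50, i_size=25))
```

Note that chi-square is symmetric: a term that is unusually rare in the subset also scores high, so a caller would still check the direction of the difference.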

If you want to know more…
Talk about significant terms: http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
Blog: http://www.elasticsearch.org/blog/significant-terms-aggregation/

Data
Movie reviews: http://www.cs.cornell.edu/people/pabo/movie-review-data/
Movie-lens: http://grouplens.org/datasets/movielens/
Reuters: https://github.com/fergiemcdowall/reuters-21578-json

Elasticsearch Inc
