Slide 1

Slide 1 text

Significant terms

Slide 2

Slide 2 text

Terms aggregation

Given a query, gather the terms in a field across all docs that match, for example by:
• most frequent terms
• least frequent terms
• alphabetical order
• …
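A minimal sketch of what such a request body could look like, written as a Python dict. The helper name `terms_aggregation_request`, the aggregation name `terms_in_matches`, and the default fields `body` and `topics` (borrowed from the Reuters example in this deck) are assumptions for illustration only. Ordering by `_count` gives the most or least frequent terms; alphabetical ordering uses a key-based order whose name differs between Elasticsearch versions (`_term` in older releases, `_key` in newer ones), so check the docs for the version in use.

```python
import json


def terms_aggregation_request(query_text, query_field="body",
                              agg_field="topics", size=10, order=None):
    """Build a search body: a match query plus a terms aggregation.

    All names here are illustrative; adapt them to your own mapping.
    """
    terms = {"field": agg_field, "size": size}
    if order is not None:
        # Only set an explicit order when asked; the default is
        # descending document count (most frequent terms first).
        terms["order"] = order
    return {
        "query": {"match": {query_field: query_text}},
        "size": 0,  # return no hits, only the aggregation buckets
        "aggs": {"terms_in_matches": {"terms": terms}},
    }


most_frequent = terms_aggregation_request("oil")
least_frequent = terms_aggregation_request("oil", order={"_count": "asc"})
print(json.dumps(most_frequent, indent=2))
```

The dict can be sent as the JSON body of a search request with whatever client is at hand.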

Slide 3

Slide 3 text

Example: Reuters

"article": {
  "properties": {
    "body": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "organisations": {
      "type": "string"
    },
    "places": {
      "type": "string",
      "index": "not_analyzed"
    },
    "title": {
      "type": "string"
    },
    "topics": {
      "type": "string"
    }
  }
}

Example field values:
body: article text
organisations: opec, worldbank, un, …
places: usa, uk, …
topics: earn, money, grain, …

Slide 4

Slide 4 text

Term aggregation?

Slide 5

Slide 5 text

Significant terms

Terms that appear relatively more often in the subset than in the whole document collection.

Slide 6

Slide 6 text

Is Y a significant term for a class of documents?
• Documents in class (matching query, for example term X)
• Documents not containing term Y
• Documents containing term Y
• Documents in class and containing term Y

Slide 7

Slide 7 text

In math terms

T: documents containing the term
C: all documents
S: subset of documents matching the query
I: documents in both S and T
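To make the sets concrete, here is a small sketch that compares the share of S containing the term (|I| / |S|) with the share of C containing it (|T| / |C|). The function name `significance` and the toy document ids are made up for illustration. The final score is loosely modelled on the JLH heuristic described in the Elasticsearch documentation (absolute change times relative change); that choice is an assumption, not something these slides specify.

```python
def significance(C, S, T):
    """Compare how common a term is in subset S versus collection C.

    C, S and T are sets of document ids, named as on the slide.
    Returns (foreground_rate, background_rate, score).
    """
    if not S or not C:
        return 0.0, 0.0, 0.0
    I = S & T                    # documents in S that contain the term
    fg = len(I) / len(S)         # share of the subset containing the term
    bg = len(T) / len(C)         # share of the whole collection
    if fg <= bg:
        # Not more common in the subset than overall: not significant.
        return fg, bg, 0.0
    # JLH-like score (assumed): absolute change * relative change.
    return fg, bg, (fg - bg) * (fg / bg)


# Toy data: 100 documents, 10 match the query, and the term occurs in
# 10 documents, 5 of which are among the matches.
C = set(range(1, 101))
S = set(range(1, 11))
T = set(range(6, 16))
fg, bg, score = significance(C, S, T)
print(fg, bg, score)
```

A term that is equally common inside and outside the subset scores zero here, which is the point of comparing against the background.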

Slide 8

Slide 8 text

Example 1: Text features

Slide 9

Slide 9 text

Example 2: Recommendations

Slide 10

Slide 10 text

Movielens

Ratings run from 1 to 5; 4 and 5 count as good, 1 to 3 as bad.

{
  "gender": "F",
  "age": "35",
  "pos": " 3798 2059 1257 1259 586 589 1 5 3…",
  "zipcode": "97401\n",
  "neg": " 719 3791 2054 585 587 3004 7 …",
  "occupation": "7"
}

Slide 11

Slide 11 text

Example 2: Recommendations

Given a movie id:
1. find all users liking this movie
2. find the significant terms within the “good” field (presumably "pos" in the example document)

-> “People who liked this movie also liked…”
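The two steps above could be sketched as a single request body plus a small helper for reading the response. The function names, the aggregation name `also_liked`, and the use of the `pos` field as the "good" field are assumptions for illustration. The helper relies on the usual aggregation response layout (`aggregations` → aggregation name → `buckets`, each with a `key`). Since every matching user liked the queried movie, that movie will typically come back as a significant term itself, so the helper drops it.

```python
def recommendation_request(movie_id, liked_field="pos", size=10):
    """Step 1: select users who liked the movie.
    Step 2: ask for significant terms in the same field.
    """
    return {
        "query": {"term": {liked_field: str(movie_id)}},
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "also_liked": {
                "significant_terms": {"field": liked_field, "size": size}
            }
        },
    }


def other_liked_movies(response, movie_id, agg_name="also_liked"):
    """Return bucket keys from a search response, minus the queried movie."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [b["key"] for b in buckets if str(b["key"]) != str(movie_id)]


body = recommendation_request(1257)
```

`body` is what would be sent to the search endpoint; `other_liked_movies` is then applied to the parsed JSON response.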

Slide 12

Slide 12 text

If results are unexpected…

Try setting shard_size or shard_min_doc_count.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

Slide 13

Slide 13 text

WIP
• Chi square
• G square
• Google normalized distance
• Mutual information
• …

…although we really do not know if you actually need that :-)

Slide 14

Slide 14 text

If you want to know more…

Talk about significant terms: http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
Blog: http://www.elasticsearch.org/blog/significant-terms-aggregation/

Slide 15

Slide 15 text

Data

Movie reviews: http://www.cs.cornell.edu/people/pabo/movie-review-data/
Movie-lens: http://grouplens.org/datasets/movielens/
Reuters: https://github.com/fergiemcdowall/reuters-21578-json