The Significant Terms Aggregation

Presented by Britta Weber at the Elasticsearch Switzerland User Group

You can find the scripts that accompany this presentation at
https://gist.github.com/brwe/292681b8e4ab2612633f

Abstract:
The significant terms aggregation is a feature that allows users to identify terms that are relevant to a particular set of documents. Relevance here does not only mean how often a term occurs, but how much more often it occurs in the set than in the whole document collection. In these slides, Britta explains how "significance" is measured and shows some use cases for this type of aggregation.

Elasticsearch Inc

June 05, 2014

Transcript

  1. Significant terms

  2. Terms aggregation

    Given a query, gather from a field, for all docs that match:
    • most frequent terms
    • least frequent terms
    • alphabetical order
    • …
  3. Example: Reuters

    "article": {
      "properties": {
        "body": { "type": "string" },
        "date": { "type": "string" },
        "organisations": { "type": "string" },
        "places": { "type": "string", "index": "not_analyzed" },
        "title": { "type": "string" },
        "topics": { "type": "string" }
      }
    }

    Sample values shown on the slide: article text; opec, worldbank, un, …; usa, uk, …; earn, money, grain, …
  4. Term aggregation?

  5. Significant terms

    Terms that appear more often in the subset than in the whole document collection.
  6. Is Y a significant term for a class of documents?

    • Documents in class (matching query, for example term X)
    • Documents not containing term Y
    • Documents containing term Y
    • Documents in class and containing term Y
  7. In math terms

    T: documents containing the term
    C: all documents
    S: subset of documents matching the query
    I: documents in S and T
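The sets on this slide can be turned into a score by comparing how often the term occurs in the subset (|I| / |S|) with how often it occurs in the whole collection (|T| / |C|). The sketch below is only an illustration: the function name and the counts are hypothetical, and the formula (absolute change times relative change) is modeled on the JLH heuristic as I understand it from the Elasticsearch documentation, so treat that choice as an assumption rather than the exact measure used in the talk.

```python
def significance(c_size, t_size, s_size, i_size):
    """Score how over-represented a term is in subset S versus collection C.

    c_size = |C|, t_size = |T|, s_size = |S|, i_size = |I| (hypothetical helper).
    """
    if min(c_size, t_size, s_size) <= 0:
        raise ValueError("C, T and S must be non-empty")
    foreground = i_size / s_size  # share of subset docs containing the term
    background = t_size / c_size  # share of all docs containing the term
    if foreground <= background:
        return 0.0  # not more frequent in the subset, so not significant
    # Assumed scoring: absolute change times relative change (JLH-style).
    return (foreground - background) * (foreground / background)


# Hypothetical counts: 1000 docs, 50 contain the term,
# 100 match the query, 20 match the query and contain the term.
score = significance(c_size=1000, t_size=50, s_size=100, i_size=20)
print(score)
```

A term that is common everywhere gets a low score even if it is frequent in the subset, because the background share is high as well.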
  8. Example 1: Text features

  9. Example 2: Recommendations

  10. Movielens

    Ratings 1..5; use 4, 5 as good, 1..3 as bad

    {
      "gender": "F",
      "age": "35",
      "pos": " 3798 2059 1257 1259 586 589 1 5 3…",
      "zipcode": "97401\n",
      "neg": " 719 3791 2054 585 587 3004 7 …",
      "occupation": "7"
    }
  11. Example 2: Recommendations

    Given a movie id:
    1. find all users liking this movie
    2. find the significant terms within the “good” field
    -> “People who liked this movie also liked…”
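The two steps above could be expressed as a single search request: a query selecting users who liked the movie, plus a significant_terms aggregation over the same field. The sketch below only builds the request body. It assumes the “good” field is the "pos" field of the sample Movielens document; the function name and the aggregation name "also_liked" are hypothetical.

```python
import json


def also_liked_request(movie_id, field="pos"):
    """Build a search body: users who liked movie_id, then significant terms."""
    return {
        # Step 1: select users whose list of liked movies contains movie_id.
        "query": {"match": {field: str(movie_id)}},
        # Only the aggregation result is of interest, not the matching users.
        "size": 0,
        # Step 2: movie ids that are unusually common among those users.
        "aggregations": {
            "also_liked": {
                "significant_terms": {"field": field}
            }
        },
    }


body = also_liked_request(1259)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the search endpoint of the index holding the user documents.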
  12. If results are unexpected…

    Try setting shard_size or shard_min_doc_count

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
  13. WIP

    • Chi square
    • G square
    • Google normalized distance
    • Mutual information
    • …

    …although we really do not know if you actually need that :-)
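As an illustration of one of the measures listed, chi square can be computed from the same four counts used on the "In math terms" slide (|C|, |T|, |S|, |I|) by arranging them in a 2x2 table. This is a sketch of the textbook statistic with hypothetical names and counts, not necessarily how the work-in-progress implementation computes it.

```python
def chi_square(c_size, t_size, s_size, i_size):
    """Textbook chi-square statistic for the 2x2 table (in subset?, has term?)."""
    a = i_size                             # in subset, contains term
    b = s_size - i_size                    # in subset, does not contain term
    c = t_size - i_size                    # outside subset, contains term
    d = c_size - s_size - t_size + i_size  # outside subset, no term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: 1000 docs, 50 contain the term,
# 100 match the query, 20 match the query and contain the term.
chi2 = chi_square(c_size=1000, t_size=50, s_size=100, i_size=20)
print(chi2)
```

The statistic is zero when the term is exactly as frequent in the subset as in the rest of the collection. It does not distinguish over-representation from under-representation, so a caller would still need to check the direction of the difference.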
  14. If you want to know more…

    Talk about significant terms: http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
    Blog: http://www.elasticsearch.org/blog/significant-terms-aggregation/
  15. Data

    Movie reviews: http://www.cs.cornell.edu/people/pabo/movie-review-data/
    Movie-lens: http://grouplens.org/datasets/movielens/
    Reuters: https://github.com/fergiemcdowall/reuters-21578-json