The Significant Terms Aggregation

Presented by Britta Weber at the Elasticsearch Switzerland User Group

You can find the scripts that accompany this presentation at
https://gist.github.com/brwe/292681b8e4ab2612633f

Abstract:
The significant terms aggregation is a feature that allows users to identify terms that are relevant to a particular set of documents. Relevance here means not only how often a term occurs, but how much more often it occurs in the set than in the whole document collection. In these slides, Britta explains how "significance" is measured and shows some use cases for this type of aggregation.

Elasticsearch Inc

June 05, 2014

Transcript

  1. Significant terms

  2. Terms aggregation
    Given a query, collect the terms of a field across all matching documents, ordered for example by
    • highest frequency
    • lowest frequency
    • alphabetical order
    • …
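
As a rough illustration of what a terms aggregation computes, the following Python sketch counts the terms of one field over the documents that matched a query and orders the buckets in the three ways listed above. The sample documents, the `places` values, and the helper name `terms_buckets` are made up for illustration; the code does not call Elasticsearch.

```python
from collections import Counter

# Hypothetical documents that already matched some query.
matching_docs = [
    {"places": ["usa", "uk"]},
    {"places": ["usa"]},
    {"places": ["japan", "usa"]},
    {"places": ["uk"]},
]


def terms_buckets(docs, field, order="count_desc"):
    """Count in how many docs each term of `field` occurs, then sort the buckets."""
    counts = Counter()
    for doc in docs:
        # Use a set so a term is counted once per document (a document count).
        counts.update(set(doc.get(field, [])))
    if order == "count_desc":    # most frequent terms first
        key = lambda item: (-item[1], item[0])
    elif order == "count_asc":   # least frequent terms first
        key = lambda item: (item[1], item[0])
    elif order == "term":        # alphabetical order
        key = lambda item: item[0]
    else:
        raise ValueError(f"unknown order: {order}")
    return sorted(counts.items(), key=key)


print(terms_buckets(matching_docs, "places"))
print(terms_buckets(matching_docs, "places", order="count_asc"))
print(terms_buckets(matching_docs, "places", order="term"))
```

Note that none of these orderings says anything about how unusual a term is for the matched documents, which is the gap the significant terms aggregation addresses.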

  3. Example: Reuters
    "article": {
      "properties": {
        "body": {
          "type": "string"
        },
        "date": {
          "type": "string"
        },
        "organisations": {
          "type": "string"
        },
        "places": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        },
        "topics": {
          "type": "string"
        }
      }
    }
    Field contents:
    • body: article text
    • organisations: opec, worldbank, un, …
    • places: usa, uk, …
    • topics: earn, money, grain, …

  4. Terms aggregation?

  5. Significant terms
    Terms that appear more often in the subset than in
    the whole document collection.

  6. Is Y a significant term for a class of documents?
    • Documents in class (matching the query, for example term X)
    • Documents containing term Y
    • Documents not containing term Y
    • Documents in class and containing term Y

  7. In math terms
    T: documents containing the term
    C: all documents
    S: subset of documents matching the query
    I: documents in both S and T (the intersection)
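
To make the four sets concrete, here is a minimal Python sketch that scores a term from their sizes. It compares the term's frequency in the subset, |I|/|S|, with its frequency in the whole collection, |T|/|C|, and multiplies the absolute difference by the ratio of the two. This particular formula is an assumption chosen for illustration (to my knowledge it resembles the JLH heuristic that Elasticsearch uses by default); the slides do not state which formula is used, and the function name and example counts are invented.

```python
def significance_score(i_count, s_count, t_count, c_count):
    """Score a term from |I|, |S|, |T| and |C| (assumed, JLH-like heuristic)."""
    if s_count == 0 or c_count == 0 or t_count == 0:
        return 0.0
    subset_freq = i_count / s_count      # |I| / |S|: frequency inside the subset
    background_freq = t_count / c_count  # |T| / |C|: frequency in the collection
    if subset_freq <= background_freq:
        # The term is no more frequent in the subset than anywhere else.
        return 0.0
    absolute_change = subset_freq - background_freq
    relative_change = subset_freq / background_freq
    return absolute_change * relative_change


# Made-up counts: the term is in 50 of 100 subset docs but only 500 of 100,000 overall.
rare_but_focused = significance_score(50, 100, 500, 100_000)
# A common word: in 90 of 100 subset docs and in 90,000 of 100,000 docs overall.
common_everywhere = significance_score(90, 100, 90_000, 100_000)
print(rare_but_focused, common_everywhere)
```

The common word scores zero even though it occurs in most subset documents, while the rarer term that is concentrated in the subset scores high. This is the distinction between "frequent" and "significant".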

  8. Example 1: Text features

  9. Example 2: Recommendations

  10. MovieLens
    Ratings range from 1 to 5; 4 and 5 count as good, 1 to 3 as bad.
    {
      "gender": "F",
      "age": "35",
      "pos": " 3798 2059 1257 1259 586 589 1 5 3…",
      "zipcode": "97401\n",
      "neg": " 719 3791 2054 585 587 3004 7 …",
      "occupation": "7"
    }

  11. Example 2: Recommendations
    Given a movie id:
    1. find all users who liked this movie
    2. find the significant terms within the “good” field ("pos" in the example document)
    -> “People who liked this movie also liked…”
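
The two steps can be combined in a single search request: a query selects the users who liked the movie, and a `significant_terms` aggregation runs on the same field. The Python sketch below only builds and prints the request body. The field name `pos` is taken from the example user document, while the helper name `recommendation_request`, the aggregation name `also_liked`, and the choice of a `match` query are assumptions for illustration.

```python
import json


def recommendation_request(movie_id, field="pos", max_terms=10):
    """Build a search body: users who liked `movie_id` -> significant ids in `field`."""
    return {
        "size": 0,  # only the aggregation is of interest, not the user documents
        # Step 1: select the users whose "liked" field contains the movie id.
        "query": {"match": {field: str(movie_id)}},
        # Step 2: find movie ids that are unusually frequent among those users.
        "aggs": {
            "also_liked": {
                "significant_terms": {"field": field, "size": max_terms}
            }
        },
    }


body = recommendation_request(1)
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of the index holding the user documents. The queried movie itself will typically show up among the results, since every user in the subset liked it, so a caller would likely filter it out.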

  12. If results are unexpected…
    Try setting shard_size or shard_min_doc_count.
    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
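
As a sketch of where these settings go, the Python snippet below builds a `significant_terms` aggregation body with both parameters set. The values 200 and 2 are arbitrary and the helper and aggregation names are hypothetical; the parameter name `shard_min_doc_count` follows the reference documentation linked above.

```python
import json


def tuned_significant_terms(field, shard_size=200, shard_min_doc_count=2):
    """Build an aggregation body with per-shard tuning parameters (values are arbitrary)."""
    return {
        "aggs": {
            "significant": {
                "significant_terms": {
                    "field": field,
                    # How many candidate terms each shard reports back.
                    "shard_size": shard_size,
                    # Minimum per-shard document count for a term to be a candidate.
                    "shard_min_doc_count": shard_min_doc_count,
                }
            }
        }
    }


tuned = tuned_significant_terms("body")
print(json.dumps(tuned, indent=2))
```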

  13. WIP
    • Chi square
    • G square
    • Google normalized distance
    • Mutual information
    • …
    …although we really do not know if you actually need that :-)
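
For reference, the first measure in the list can be computed from the same four counts defined on the "In math terms" slide (|I|, |S|, |T|, |C|). The Python sketch below implements the standard chi-square statistic for a 2×2 contingency table. It illustrates the statistic itself and is not taken from the Elasticsearch implementation; the function name and sample counts are made up.

```python
def chi_square(i_count, s_count, t_count, c_count):
    """Chi-square statistic for the 2x2 table 'in subset?' x 'contains term?'."""
    a = i_count                                  # in subset, contains term
    b = s_count - i_count                        # in subset, lacks term
    c = t_count - i_count                        # outside subset, contains term
    d = c_count - s_count - t_count + i_count    # outside subset, lacks term
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no information
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / denominator


# Term frequency is 10% both inside the subset and overall: no association.
print(chi_square(10, 100, 1_000, 10_000))
# Term concentrated in the subset: strong association.
print(chi_square(50, 100, 500, 100_000))
```

Unlike a one-sided heuristic, this statistic is symmetric: it is also large when a term is unusually rare in the subset, so a caller interested only in over-represented terms would additionally check that |I|/|S| exceeds |T|/|C|.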

  14. If you want to know more…
    Talk about significant terms:
    http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
    Blog:
    http://www.elasticsearch.org/blog/significant-terms-aggregation/

  15. Data
    Movie reviews:
    http://www.cs.cornell.edu/people/pabo/movie-review-data/
    MovieLens:
    http://grouplens.org/datasets/movielens/
    Reuters:
    https://github.com/fergiemcdowall/reuters-21578-json
