More About Aggregations: Meetup Talk

This talk was presented at an Elastic Meetup hosted by Walmart Labs: https://www.meetup.com/Silicon-Valley-Elastic-Fantastics/events/241098356/

Session Abstract:
Aggregations are a powerful tool to gain insights into your data, but how do they actually work? In this talk, we will explore:

* how aggregations are executed "under the hood"
* what users should be aware of when running aggregations
* the limitations of aggregations
* some ways to tweak their execution for both performance and accuracy

Attendees will leave this presentation with a rich understanding of how aggregations can make their lives better and their data even more useful.

Content was derived from this previous Elasticon talk: https://speakerdeck.com/elastic/all-about-aggregations

Elastic Co

July 18, 2017

Transcript

  1. More than search

    • Originally built on Lucene for text-based searching
    • Lucene and Elasticsearch work together to provide new storage formats and data types specific to numeric and keyword metrics
    • Aggregations alongside searching
  2. Searching & Aggregating

    price  color  make    sold
    10000  red    honda   10/28/2016
    20000  red    honda   11/05/2016
    30000  green  ford    05/08/2016
    15000  blue   toyota  07/02/2016
    12000  green  toyota  08/19/2016
    20000  red    honda   11/05/2016
    80000  red    bmw     01/01/2016
    25000  blue   ford    02/12/2016
  3. Data Structures For Field Values on Shards

    color: red, red, green, blue, green, red, red, blue

    • Two considerations for our data
      • Fast querying by values
      • Fast aggregating by values
  4. Inverted Index: terms-to-documents

    color   doc1  doc2  doc3
    red     ‒     ‒     ‒
    blue    ‒     ‒     ‒
    green   ‒     ‒     ‒
    purple  ‒     ‒     ‒
    orange  ‒     ‒     ‒
    white   ‒     ‒     ‒
    black   ‒     ‒     ‒
    brown   ‒     ‒     ‒
  5. Doc Values: documents-to-terms

    • 1 value per document
    • 1 column per field

    price  color  make    sold
    10000  red    honda   10/28/2016
    20000  red    honda   11/05/2016
    30000  green  ford    05/08/2016
    15000  blue   toyota  07/02/2016
    12000  green  toyota  08/19/2016
    20000  red    honda   11/05/2016
    80000  red    bmw     01/01/2016
    25000  blue   ford    02/12/2016
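The two structures from slides 4 and 5 can be illustrated with a minimal Python sketch. It is a toy model, not how Lucene stores data on disk: the `color` column is copied from slide 3, list positions stand in for document ids, and the helper names `search` and `terms_agg` are made up for this example.

```python
from collections import Counter, defaultdict

# "color" column copied from the slide; the list position plays the role of doc id.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

# Inverted index: term -> list of doc ids (answers "which docs contain X?").
inverted_index = defaultdict(list)
for doc_id, term in enumerate(colors):
    inverted_index[term].append(doc_id)

# Doc values: doc id -> term, i.e. the column itself (answers "what does doc N hold?").
doc_values = colors


def search(term):
    """Query by value: a single lookup in the inverted index."""
    return inverted_index.get(term, [])


def terms_agg(matching_doc_ids):
    """Aggregate by value: visit each matching doc and read its column entry."""
    return Counter(doc_values[doc_id] for doc_id in matching_doc_ids)


print(search("red"))
print(terms_agg(range(len(doc_values))).most_common())
```

Searching only touches the entry for one term, while aggregating walks documents and needs the value for each one, which is why the two access patterns are served by different structures.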
  6. How Do Distributed Aggregations Work?

    [Diagram: data nodes and coordinating node]
    • Inline with search query
    • Executed in isolation on each shard
    • 4 phases
      • Parse
      • Collect
      • Combine
      • Reduce
  7. Phase 1: Parse

    [Diagram: data nodes and coordinating node]
    • Coordinating node splits the request into shard requests
    • Shards parse aggregations and initialize data structures
  8. Phases 2 and 3: Collect, Combine

    [Diagram: data nodes and coordinating node]
    • Shards process all matching documents
    • Once done, they combine aggregated data into an aggregation
  9. Phase 4: Reduce

    [Diagram: data nodes and coordinating node]
    • Shards send their aggregations to the coordinating node
    • Which reduces them into a single aggregation
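The four phases can be traced with a small Python simulation. This is only a sketch of the flow described on the slides, not Elasticsearch's implementation: the two-shard split of the `color` column and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical split of the slide's "color" column across two shards.
shards = [
    ["red", "red", "green", "blue"],
    ["green", "red", "red", "blue"],
]


def parse(request):
    """Phase 1: a shard turns the request into an empty in-memory collector."""
    return {"field": request["field"], "counts": Counter()}


def collect(agg, values):
    """Phase 2: a single pass over the shard's matching documents."""
    for value in values:
        agg["counts"][value] += 1


def combine(agg):
    """Phase 3: the shard packages its state as one partial aggregation."""
    return dict(agg["counts"])


def reduce_partials(partials):
    """Phase 4: the coordinating node merges the per-shard partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


request = {"field": "color"}
partials = []
for shard_values in shards:
    shard_agg = parse(request)
    collect(shard_agg, shard_values)
    partials.append(combine(shard_agg))

result = reduce_partials(partials)
print(partials)
print(result)
```

Each shard works only on its own documents, and the coordinating node sees nothing but the partial results, which is what makes a single network round-trip possible.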
  10. Designed for speed

    • Single network round-trip
    • Single pass through data on shards
    • Aggregates are computed in memory
    • Trades accuracy for speed
    • Only pay for documents that match query
    • Can be composed (average response time, broken down by day)
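Composition, as in the "average response time, broken down by day" example, amounts to running a metric inside each bucket of an outer bucket aggregation. The Python sketch below mimics that idea on a made-up request log; the records and variable names are hypothetical and nothing here calls Elasticsearch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request log: (day, response time in milliseconds).
requests_log = [
    ("2017-07-17", 120),
    ("2017-07-17", 80),
    ("2017-07-18", 200),
    ("2017-07-18", 100),
    ("2017-07-18", 150),
]

# Outer bucket aggregation: group the documents by day.
buckets = defaultdict(list)
for day, response_ms in requests_log:
    buckets[day].append(response_ms)

# Inner metric aggregation: an average computed separately within each bucket.
avg_by_day = {day: mean(values) for day, values in sorted(buckets.items())}
print(avg_by_day)
```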
  11. Types of Aggregations

    • Bucket
      • Terms
      • (Date) Histograms
      • Filter
      • Range
      • …
    • Metric
      • Stats
      • Percentiles
      • Cardinality (unique counts)
      • Top Hits
      • Scripted
      • …
  12. Example Terms Aggregation Query

    GET products/_search
    {
      "size": 0,
      "query": { "match_all": {} },
      "aggs": {
        "my_product_ids": {
          "terms": { "field": "pid", "size": 3 }
        }
      }
    }
  13. Example Terms Aggregation Response

    {
      "hits": {…},
      "aggregations": {
        "my_product_ids": {
          "doc_count_error_upper_bound": 3302,
          "sum_other_doc_count": 8879020,
          "buckets": [
            { "key": "030758836X", "doc_count": 7440 },
            { "key": "0439023483", "doc_count": 6717 },
            { "key": "0375831002", "doc_count": 4864 }
          ]
        }
      }
    }
  14. Things To Consider

    Annotations on the response from the previous slide:
    • "doc_count_error_upper_bound": 3302 is the upper bound on the error in the counts for each term
    • "sum_other_doc_count": 8879020 is the number of docs not included in the buckets
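To get a feel for the two annotated fields, the arithmetic can be worked through in Python using the numbers from the example response. Two assumptions are made here and are not stated on the slide: each document carries a single `pid`, so bucket counts plus `sum_other_doc_count` cover every matching document, and shard counts can only be missed, never over-counted, so a true count lies between the reported value and the reported value plus the error bound.

```python
# Values copied from the example response on the slide.
agg = {
    "doc_count_error_upper_bound": 3302,
    "sum_other_doc_count": 8879020,
    "buckets": [
        {"key": "030758836X", "doc_count": 7440},
        {"key": "0439023483", "doc_count": 6717},
        {"key": "0375831002", "doc_count": 4864},
    ],
}

# Documents accounted for by the three returned buckets.
returned = sum(bucket["doc_count"] for bucket in agg["buckets"])

# Assumption: one "pid" per document, so the two sums cover all matching docs.
total = returned + agg["sum_other_doc_count"]

# Assumption: counts are only ever under-reported, by at most the error bound.
bound = agg["doc_count_error_upper_bound"]
ranges = {
    bucket["key"]: (bucket["doc_count"], bucket["doc_count"] + bound)
    for bucket in agg["buckets"]
}

print(returned, total)
for key, (low, high) in ranges.items():
    print(key, low, high)
```

Under these assumptions the possible range for the third bucket (4864 to 8166) overlaps the reported count of the second (6717), so even the ordering of the returned terms is not guaranteed.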
  15. Locality Bias: Top N(1)

    Node A's Counts    Node B's Counts    Global Counts
    RED    5           RED    2           RED    7
    GREEN  4           GREEN  4           GREEN  8
    BLUE   2           BLUE   1           BLUE   3
  16. Shard Size: Top 3

    [Diagram: data nodes and coordinating node, labeled 15, 15, 15, 15 and 3]
    • How many buckets to return per shard?
    • "shard_size"
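The locality-bias numbers from slide 15 make a compact worked example of why the per-shard bucket count matters. The Python sketch below is a simplified model of the merge step, not Elasticsearch code; `distributed_top` and its parameters are illustrative names that mirror `size` and `shard_size`.

```python
from collections import Counter

# Per-node counts copied from the locality-bias slide.
node_a = {"RED": 5, "GREEN": 4, "BLUE": 2}
node_b = {"RED": 2, "GREEN": 4, "BLUE": 1}


def distributed_top(shard_counts, size, shard_size):
    """Merge only the top `shard_size` terms of each shard, then keep the top `size`."""
    merged = Counter()
    for counts in shard_counts:
        for term, count in Counter(counts).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Ground truth: the top term over the global counts.
exact = (Counter(node_a) + Counter(node_b)).most_common(1)

# Each node reports only its local top 1: RED from A, GREEN from B.
biased = distributed_top([node_a, node_b], size=1, shard_size=1)

# Each node reports its top 3, so nothing is lost before the merge.
widened = distributed_top([node_a, node_b], size=1, shard_size=3)

print(exact)
print(biased)
print(widened)
```

With one bucket per shard the merged answer is RED with a count of 5, although GREEN is the true global winner with 8. Asking each shard for more buckets than the final `size` recovers the correct result at the cost of shipping and merging more data.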
  17. Example Terms Aggregation Query

    GET products/_search
    {
      "size": 0,
      "query": { "match_all": {} },
      "aggs": {
        "my_product_ids": {
          "terms": { "field": "pid", "size": 3, "shard_size": 999999 }
        }
      }
    }
  18. Summary

    • Aggregations are powerful & fast
    • Need to trade accuracy for speed/memory in some cases
    • Use `shard_size` to help manage accuracy with terms aggregation
    • Leverage Kibana to help write aggregations!
    • Profile your aggregations using the Query Profiler
  19. What We Missed

    • Pipeline Aggregations: Aggregations of Aggregations
    • Using `requests.cache` to cache complex static aggregations
    • Matrix Aggregations: covariance and correlation
    • New aggregation types introduced all the time
  20. What to expect?

    • Efficient sparse doc-value reading and writing
    • Index-time sorting
    • Removal of types
    • Cross-cluster search
    • Upgrading to 6.0 with rolling restarts!
    • And so much more!
  21. Resources

    • Elastic Discussion Forums: https://discuss.elastic.co/
    • Aggregation Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
    • Terms Aggregation Approximation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
    • Similar deck from my colleagues Adrien and Colin: https://www.elastic.co/elasticon/2015/sf/all-about-aggregations