All About Aggregations

Elastic Co
March 10, 2015

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:
Aggregations are a powerful tool to gain insights into your data, but how do they actually work? In this talk, we will explore:

* how aggregations are executed "under the hood"
* what users should be aware of when running aggregations
* the limitations of aggregations
* some ways to tweak their execution for both performance and accuracy

Attendees will leave this presentation with a rich understanding of how aggregations can make their lives better and their data even more useful.

Transcript

  1. Outline [CC-BY-ND 4.0]

    • How do aggregations work?
    • Aggregation features
    • Things to be aware of
      – Memory usage
      – Accuracy
  2. How do aggs work?

    • Inline with search query
    • Executed in isolation on each shard
    • 4 phases
      – Parse
      – Collect
      – Combine
      – Reduce
    (Diagram: coordinating node and data nodes)
  3. Phase 1: Parse

    • Coordinating node splits the request into shard requests
    • Shards parse aggregations and initialize data structures
  4. Phases 2 and 3: Collect + Combine

    • Shards process all matching documents
    • Once done, they combine aggregated data into an aggregation
  5. Phase 4: Reduce

    • Shards send their aggregation to the coordinating node
    • Which reduces them into a single aggregation
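
The four phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Elasticsearch is implemented: the in-memory `SHARDS` list, the `color` field, and all function names are hypothetical, and the aggregation is a plain term count.

```python
from collections import Counter

# Hypothetical in-memory "shards": each one is a list of documents.
SHARDS = [
    [{"color": "blue"}, {"color": "red"}, {"color": "blue"}],
    [{"color": "red"}, {"color": "green"}],
    [{"color": "blue"}],
]


def collect_and_combine(shard_docs, field):
    """Phases 2 and 3: visit each matching document once, then emit
    one partial aggregation for the shard."""
    counts = Counter()           # data structure set up during parsing
    for doc in shard_docs:       # collect: single pass over the documents
        counts[doc[field]] += 1
    return dict(counts)          # combine: the shard-level aggregation


def reduce_partials(partials):
    """Phase 4: the coordinating node merges the shard aggregations."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)   # adds counts term by term
    return dict(merged)


def run_terms_count(shards, field):
    # Phase 1: the request is split into one request per shard.
    partials = [collect_and_combine(docs, field) for docs in shards]
    return reduce_partials(partials)


result = run_terms_count(SHARDS, "color")
print(result)
```

Each shard works in isolation and only the small partial results travel to the coordinating node, which is what makes a single network round-trip possible.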
  6. Designed for speed

    • Single network round-trip
    • Single pass through data on shards
    • Aggregates are computed in-memory
    • Trades accuracy for speed
  7. Features you will like

    • Leverages the inverted index
      – You only pay the price for documents that match your query
    • Can be composed
      – Average response time, broken down by day
    • Near real-time
      – Don't wait for a batch job to run
  8. Aggregations you will like

    • Bucket
      – Terms
      – (Date) histograms
      – Filter
      – Range
      – Nested
      – Children
      – ...
    • Metrics
      – Stats
      – Percentiles
      – Cardinality (unique counts)
      – Top hits
      – Scripted
      – ...
  9. Example

    Request:

      {
        "aggs": {
          "by_day": {
            "date_histogram": { "field": "timestamp", "interval": "day" },
            "aggs": {
              "max_temperature": { "max": { "field": "temperature" } }
            }
          }
        }
      }

    Response:

      {
        "aggregations": {
          "by_day": {
            "buckets": [
              {
                "key": "2015-05-05T00:00:00.000Z",
                "doc_count": 24,
                "max_temperature": { "value": 13 }
              },
              …
            ]
          }
        }
      }
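
To make the composition in the example concrete, here is a rough Python sketch of what a daily histogram with a nested max computes. The three sample readings are invented for illustration, and `by_day_max` is a hypothetical helper, not an Elasticsearch API; only the field names and the bucket shape follow the slide.

```python
from collections import defaultdict

# Hypothetical readings; field names follow the request on the slide.
docs = [
    {"timestamp": "2015-05-05T03:00:00.000Z", "temperature": 9},
    {"timestamp": "2015-05-05T14:00:00.000Z", "temperature": 13},
    {"timestamp": "2015-05-06T14:00:00.000Z", "temperature": 11},
]


def by_day_max(documents):
    """Bucket documents by day, then keep the max temperature per bucket."""
    buckets = defaultdict(lambda: {"doc_count": 0, "max": None})
    for doc in documents:
        # Truncate the ISO timestamp to midnight of the same day.
        day = doc["timestamp"][:10] + "T00:00:00.000Z"
        bucket = buckets[day]
        bucket["doc_count"] += 1
        value = doc["temperature"]
        bucket["max"] = value if bucket["max"] is None else max(bucket["max"], value)
    return [
        {
            "key": key,
            "doc_count": bucket["doc_count"],
            "max_temperature": {"value": bucket["max"]},
        }
        for key, bucket in sorted(buckets.items())
    ]


buckets = by_day_max(docs)
print(buckets)
```

The outer loop plays the role of the bucket aggregation and the running max plays the role of the metric nested inside it.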
  10. Things to consider

    • Filter aggregation runtime
    • Depth-first vs breadth-first
    • Terms aggregation accuracy
  11. Filter aggregation vs query filter

    Filter aggregation:

      {
        "query": { "match_all": {} },
        "aggs": {
          "my_filter_counts": {
            "filter": { "term": { "in_stock": true } }
          }
        }
      }

    Filtered query:

      {
        "query": {
          "filtered": {
            "query": { "match_all": {} },
            "filter": { "term": { "in_stock": true } }
          }
        }
      }
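
The two requests differ in where the filter is applied. The sketch below shows that difference on a tiny hypothetical product list; it models only the semantics (which documents are hits and which are counted), not Elasticsearch internals or timings, and both function names are invented.

```python
# Hypothetical documents; only the in_stock field comes from the slide.
products = [
    {"id": 1, "in_stock": True},
    {"id": 2, "in_stock": False},
    {"id": 3, "in_stock": True},
]


def filter_aggregation(docs):
    """match_all query: every document is a hit, and the filter is
    evaluated inside the aggregation for each of those hits."""
    hits = list(docs)
    count = sum(1 for doc in hits if doc["in_stock"])
    return {
        "hits": [doc["id"] for doc in hits],
        "my_filter_counts": {"doc_count": count},
    }


def filtered_query(docs):
    """Filtered query: the filter narrows the hits before anything
    else sees the documents."""
    hits = [doc for doc in docs if doc["in_stock"]]
    return {"hits": [doc["id"] for doc in hits]}


agg_result = filter_aggregation(products)
query_result = filtered_query(products)
print(agg_result)
print(query_result)
```

In the first form the aggregation still has to look at every document the query matched; in the second, documents that fail the filter never reach the aggregations at all.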
  12. Breadth-first mode

    • Useful for:
      – Deeply nested aggregations
      – High cardinality aggregations
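
As a hedged illustration of how this mode is selected, the snippet below builds a request body that sets `collect_mode` to `breadth_first` on a terms aggregation with a nested metric. The aggregation names `top_categories` and `min_price` and the `category` and `price` fields are assumptions made for the example.

```python
import json

# Request body as a Python dict; the names and fields are illustrative only.
request_body = {
    "aggs": {
        "top_categories": {
            "terms": {
                "field": "category",
                "size": 2,
                # Ask the terms aggregation to defer its sub-aggregations
                # until the top buckets are known.
                "collect_mode": "breadth_first",
            },
            "aggs": {
                "min_price": {"min": {"field": "price"}},
            },
        }
    }
}

print(json.dumps(request_body, indent=2))
```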
  13. Depth-first mode

    Min price of top two categories

      Id  Category  Price
      1   Shoes     60
      2   Clothing  80
      3   Shoes     50
      4   Sports    10
      5   Sports    35

    Buckets after document 1 (category, doc count, min price):
    • Shoes, 1, 60
  14. Depth-first mode

    Buckets after document 2:
    • Shoes, 1, 60
    • Clothing, 1, 80
  15. Depth-first mode

    Buckets after document 3:
    • Shoes, 2, 50
    • Clothing, 1, 80
  16. Depth-first mode

    Buckets after document 4:
    • Shoes, 2, 50
    • Clothing, 1, 80
    • Sports, 1, 10
  17. Depth-first mode

    Buckets after document 5:
    • Shoes, 2, 50
    • Clothing, 1, 80
    • Sports, 2, 10
  18. Depth-first mode (result)

    Top two categories:
    • Shoes, 2, 50
    • Sports, 2, 10
    Set aside:
    • Clothing, 1, 80
  19. Depth-first mode (result)

    • Shoes, 2, 50
    • Sports, 2, 10
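
The walkthrough above can be reproduced with a short Python sketch. It uses the five documents from the slides; `depth_first_min_price` is a hypothetical name, and the second return value (how many buckets carried a min-price sub-aggregation at the peak) is added here only to make the memory cost visible.

```python
# (id, category, price) rows from the slides.
DOCS = [
    (1, "Shoes", 60),
    (2, "Clothing", 80),
    (3, "Shoes", 50),
    (4, "Sports", 10),
    (5, "Sports", 35),
]


def depth_first_min_price(docs, top_n):
    """Compute the sub-aggregation for every bucket while collecting,
    then keep only the top_n buckets by document count."""
    buckets = {}
    for _doc_id, category, price in docs:
        bucket = buckets.setdefault(category, {"doc_count": 0, "min_price": price})
        bucket["doc_count"] += 1
        bucket["min_price"] = min(bucket["min_price"], price)
    peak_sub_aggs = len(buckets)  # every category held a min-price value
    # sorted() is stable, so ties keep first-seen order.
    top = sorted(buckets.items(), key=lambda kv: kv[1]["doc_count"], reverse=True)
    return dict(top[:top_n]), peak_sub_aggs


result, peak = depth_first_min_price(DOCS, top_n=2)
print(result, peak)
```

Three min-price values were maintained even though only two survive, which is the waste that grows with high-cardinality fields.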
  20. Breadth-first mode (record)

    Min price of top two categories

      Id  Category  Price
      1   Shoes     60
      2   Clothing  80
      3   Shoes     50
      4   Sports    10
      5   Sports    35

    Recorded so far (category, doc count, document ids):
    • Shoes, 1, [1]
  21. Breadth-first mode (record)

    • Shoes, 1, [1]
    • Clothing, 1, [2]
  22. Breadth-first mode (record)

    • Shoes, 2, [1, 3]
    • Clothing, 1, [2]
    • Sports, 2, [4, 5]
  23. Breadth-first mode (prune)

    • Shoes, 2, [1, 3]
    • Sports, 2, [4, 5]
  24. Breadth-first mode (replay)

    Category, doc count, min price:
    • Shoes, 2, 50
    • Sports, 2, 10
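
The record, prune, and replay steps can be sketched the same way. Again this is a toy model under assumptions: `breadth_first_min_price` is a hypothetical name, document ids are kept in plain lists, and the second return value counts how many buckets ever had a min-price value computed.

```python
# (id, category, price) rows from the slides.
DOCS = [
    (1, "Shoes", 60),
    (2, "Clothing", 80),
    (3, "Shoes", 50),
    (4, "Sports", 10),
    (5, "Sports", 35),
]


def breadth_first_min_price(docs, top_n):
    # Record: remember which documents fell into each bucket.
    recorded = {}
    for doc_id, category, _price in docs:
        recorded.setdefault(category, []).append(doc_id)

    # Prune: keep only the top_n buckets by document count
    # (sorted() is stable, so ties keep first-seen order).
    top = sorted(recorded.items(), key=lambda kv: len(kv[1]), reverse=True)[:top_n]

    # Replay: run the sub-aggregation only for the surviving buckets.
    prices = {doc_id: price for doc_id, _category, price in docs}
    result = {
        category: {
            "doc_count": len(doc_ids),
            "min_price": min(prices[doc_id] for doc_id in doc_ids),
        }
        for category, doc_ids in top
    }
    return result, len(result)


result, sub_aggs_computed = breadth_first_min_price(DOCS, top_n=2)
print(result, sub_aggs_computed)
```

The answer matches the depth-first one, but the min price is computed for two buckets instead of three; the price paid is the memory for the recorded document ids.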
  25. Accuracy: Getting the top N terms

    • We want to find the top 5 colors in our index

         Shard A      Shard B      Shard C
      1  Blue (25)    Blue (30)    Blue (45)
      2  Red (18)     Red (25)     Green (44)
      3  Green (6)    Orange (17)  Maroon (36)
      4  Yellow (3)   Maroon (16)  Brown (30)
      5  Purple (2)   Brown (15)   Purple (29)
      6  Orange (2)   Pink (14)    Pink (28)
      7  Brown (2)    Teal (10)    White (2)
      8  Pink (2)     White (8)    Yellow (1)
      9  Teal (1)     Green (6)
  26. Accuracy: Getting the top N terms

    • Ask each shard for its top 5 colors
    • Combine results to get top 5 list

         Color   Count
      1  Blue    100
      2  Maroon  52
      3  Green   50
      4  Brown   45
      5  Red     43
  27. Accuracy: Getting the top N terms

    • Blue has an accurate count as it appeared in the top 5 on every shard
  28. Accuracy: Getting the top N terms

    • Count for green is not accurate
    • Didn't make it into the top 5 for shard B
  29. Accuracy: Getting the top N terms

    • Pink didn't make it into the top 5 for any shard
    • But has a doc count of 44 so should be 5th
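
The numbers on these slides can be checked with a small simulation. The shard data is copied from the table in the deck; `merged_top` and `exact_top` are hypothetical helper names, and the merge is a plain sum of whatever each shard returns.

```python
from collections import Counter

# Per-shard term counts, in descending order, copied from the slides.
SHARDS = {
    "A": [("Blue", 25), ("Red", 18), ("Green", 6), ("Yellow", 3), ("Purple", 2),
          ("Orange", 2), ("Brown", 2), ("Pink", 2), ("Teal", 1)],
    "B": [("Blue", 30), ("Red", 25), ("Orange", 17), ("Maroon", 16), ("Brown", 15),
          ("Pink", 14), ("Teal", 10), ("White", 8), ("Green", 6)],
    "C": [("Blue", 45), ("Green", 44), ("Maroon", 36), ("Brown", 30), ("Purple", 29),
          ("Pink", 28), ("White", 2), ("Yellow", 1)],
}


def merged_top(shards, size, shard_size):
    """Each shard returns only its top shard_size terms; the
    coordinating node sums them and keeps the top size."""
    merged = Counter()
    for terms in shards.values():
        for term, count in terms[:shard_size]:
            merged[term] += count
    return merged.most_common(size)


def exact_top(shards, size):
    """Reference answer: let every shard return all of its terms."""
    longest = max(len(terms) for terms in shards.values())
    return merged_top(shards, size, shard_size=longest)


approximate = merged_top(SHARDS, size=5, shard_size=5)
exact = exact_top(SHARDS, size=5)
print(approximate)
print(exact)
```

With every term returned, Green rises to 56, Brown to 47, and Pink (44) replaces Red (43) in fifth place, which is exactly the discrepancy the slides describe.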
  30. Accuracy: Getting the N top terms

    Request:

      {
        "aggs": {
          "colors": {
            "terms": {
              "field": "color.raw",
              "size": 5,
              "shard_size": 5
            }
          }
        }
      }

    Response (excerpt):

      "doc_count_error_upper_bound": 46
  31. Accuracy: Getting the N top terms

    Request:

      {
        "aggs": {
          "colors": {
            "terms": {
              "field": "color.raw",
              "size": 5,
              "shard_size": 5,
              "show_term_doc_count_error": true
            }
          }
        }
      }

    Response (excerpt):

      { "key": "Blue", "doc_count": 100, "doc_count_error_upper_bound": 0 },
      { "key": "Maroon", "doc_count": 52, "doc_count_error_upper_bound": 2 },
      …
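
The error bounds in these responses can be reproduced from the per-shard top 5 lists in the color example earlier in the deck. The sketch assumes the bound is built from the count of the last term each shard returned: summed over all shards for the overall bound, and summed over the shards that did not return a given term for the per-term bound. That assumption matches the 46, 0, and 2 shown on the slides; the function names are hypothetical.

```python
# What each shard returns with shard_size 5 (from the color example).
SHARD_TOP5 = {
    "A": [("Blue", 25), ("Red", 18), ("Green", 6), ("Yellow", 3), ("Purple", 2)],
    "B": [("Blue", 30), ("Red", 25), ("Orange", 17), ("Maroon", 16), ("Brown", 15)],
    "C": [("Blue", 45), ("Green", 44), ("Maroon", 36), ("Brown", 30), ("Purple", 29)],
}


def overall_error_bound(shard_results):
    """A term missing from every shard's list can have at most the
    count of the last returned term on each shard."""
    return sum(terms[-1][1] for terms in shard_results.values())


def term_error_bound(shard_results, term):
    """Only shards that did not return the term contribute uncertainty."""
    bound = 0
    for terms in shard_results.values():
        if term not in dict(terms):
            bound += terms[-1][1]
    return bound


print(overall_error_bound(SHARD_TOP5))
print({t: term_error_bound(SHARD_TOP5, t) for t in ("Blue", "Maroon", "Green")})
```

Raising `shard_size` lowers each shard's last returned count, which is why it tightens these bounds at the cost of sending more terms to the coordinating node.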
  32. Summary

    • Aggregations are powerful and fast
    • Need to trade accuracy for speed/memory in some cases
    • Be aware of the memory requirements you are placing on your cluster
    • Consider using `breadth_first` mode to help with memory stress
    • Use `shard_size` to help manage accuracy with terms aggregation
  33. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License.

    To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA