All About Aggregations

Elastic Co
March 10, 2015

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:
Aggregations are a powerful tool to gain insights into your data, but how do they actually work? In this talk, we will explore:

* how aggregations are executed 'under the hood'
* what users should be aware of when running aggregations
* the limitations of aggregations
* some ways to tweak their execution for both performance and accuracy

Attendees will leave this presentation with a rich understanding of how aggregations can make their lives better and their data even more useful.

Transcript

  1. All About Aggregations, by Adrien Grand and Colin Goodheart-Smithe

  2. { } CC-BY-ND 4.0 Outline

    • How do Aggregations work?
    • Aggregation features
    • Things to be aware of
      – Memory usage
      – Accuracy
  3. How do aggs work?

    • `inline` with search query
    • Executed in isolation on each shard
    • 4 phases
      – Parse
      – Collect
      – Combine
      – Reduce
    [Diagram labels: Data nodes, Coordinating node]
  4. Phase 1: Parse

    • Coordinating node splits the request into shard requests
    • Shards parse aggregations and initialize data structures
    [Diagram labels: Data nodes, Coordinating node]
  5. Phases 2 and 3: Collect + Combine

    • Shards process all matching documents
    • Once done, they combine aggregated data into an aggregation
    [Diagram labels: Data nodes, Coordinating node]
  6. Phase 4: Reduce

    • Shards send their aggregation to the coordinating node
    • The coordinating node reduces them into a single aggregation
    [Diagram labels: Data nodes, Coordinating node]
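
The four phases on slides 3 to 6 can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names `collect_and_combine` and `reduce_partials`, the `color` field, and the sample documents are all hypothetical, and a simple term count stands in for a real aggregation.

```python
from collections import Counter

def collect_and_combine(shard_docs, field):
    """Phases 2 and 3 on one shard (hypothetical helper)."""
    counts = Counter()
    for doc in shard_docs:        # collect: one pass over the matching documents
        counts[doc[field]] += 1
    return dict(counts)           # combine: package the shard-local result

def reduce_partials(partials):
    """Phase 4 on the coordinating node: merge the per-shard results."""
    total = Counter()
    for partial in partials:
        total.update(partial)     # adds the counts of each shard result
    return dict(total)

# Two shards, each processed in isolation (sample data is made up).
shards = [
    [{"color": "blue"}, {"color": "red"}],
    [{"color": "blue"}],
]
partials = [collect_and_combine(docs, "color") for docs in shards]
result = reduce_partials(partials)
print(result)
```

Each shard returns one partial result, so the coordinating node needs only a single round-trip per shard before it reduces.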
  7. Designed for speed

    • Single network round-trip
    • Single pass through data on shards
    • Aggregates are computed in-memory
    • Trades accuracy for speed
  8. Features you will like

    • Leverages the inverted index
      – You only pay the price for documents that match your query
    • Can be composed
      – Average response time, broken down by day
    • Near real-time
      – Don't wait for a batch job to run
  9. Aggregations you will like

    • Bucket
      – Terms
      – (Date) histograms
      – Filter
      – Range
      – Nested
      – Children
      – ...
    • Metrics
      – Stats
      – Percentiles
      – Cardinality (unique counts)
      – Top hits
      – Scripted
      – ...
  10. Example

    Request:

        {
          "aggs": {
            "by_day": {
              "date_histogram": { "field": "timestamp", "interval": "day" },
              "aggs": {
                "max_temperature": { "max": { "field": "temperature" } }
              }
            }
          }
        }

    Response:

        {
          "aggregations": {
            "by_day": {
              "buckets": [
                {
                  "key": "2015-05-05T00:00:00.000Z",
                  "doc_count": 24,
                  "max_temperature": { "value": 13 }
                },
                …
              ]
            }
          }
        }
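
To make the composition on slide 10 concrete, here is a minimal in-memory sketch of what "bucket by day, then take the max per bucket" computes. It is not how Elasticsearch executes the request. The helper name `max_by_day` and the sample readings are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

def max_by_day(docs):
    """Group readings into one bucket per day and track the max temperature."""
    buckets = defaultdict(lambda: {"doc_count": 0, "max_temperature": None})
    for doc in docs:
        day = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S").date().isoformat()
        bucket = buckets[day]
        bucket["doc_count"] += 1
        temperature = doc["temperature"]
        if bucket["max_temperature"] is None or temperature > bucket["max_temperature"]:
            bucket["max_temperature"] = temperature
    # Return buckets ordered by day, similar in shape to the response on the slide.
    return [{"key": day, **buckets[day]} for day in sorted(buckets)]

# Hypothetical sample documents.
readings = [
    {"timestamp": "2015-05-05T01:00:00", "temperature": 9},
    {"timestamp": "2015-05-05T14:00:00", "temperature": 13},
    {"timestamp": "2015-05-06T02:00:00", "temperature": 7},
]
by_day = max_by_day(readings)
print(by_day)
```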
  11. Things to consider

    • Filter aggregation runtime
    • Depth-first vs Breadth-first
    • Terms aggregation accuracy
  12. Filter aggregation vs query filter

    Filter Aggregation:

        {
          "query": { "match_all": {} },
          "aggs": {
            "my_filter_counts": {
              "filter": { "term": { "in_stock": true } }
            }
          }
        }

    Filtered Query:

        {
          "query": {
            "filtered": {
              "query": { "match_all": {} },
              "filter": { "term": { "in_stock": true } }
            }
          }
        }
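
The difference between the two requests on slide 12 can be sketched with a toy search function. This is an illustrative model only: `search`, `is_in_stock`, and the sample products are hypothetical. The point it tries to show is that a filter aggregation is evaluated over every document the query matched, while a filter in the query narrows the set of documents before any aggregation sees them.

```python
def search(docs, query_filter=None, agg_filter=None):
    """Toy model: the query decides the hits, aggregations run over those hits."""
    hits = [d for d in docs if query_filter is None or query_filter(d)]
    response = {"hits": hits}
    if agg_filter is not None:
        # The filter aggregation still has to look at every hit.
        matching = sum(1 for d in hits if agg_filter(d))
        response["my_filter_counts"] = {"doc_count": matching}
    return response

def is_in_stock(doc):
    return doc["in_stock"]

# Hypothetical sample documents.
products = [
    {"id": 1, "in_stock": True},
    {"id": 2, "in_stock": False},
    {"id": 3, "in_stock": True},
]

with_filter_agg = search(products, agg_filter=is_in_stock)        # all 3 hits, count of 2
with_filtered_query = search(products, query_filter=is_in_stock)  # only 2 hits
print(len(with_filter_agg["hits"]), with_filter_agg["my_filter_counts"])
print(len(with_filtered_query["hits"]))
```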
  13. Breadth-first mode

    • Useful for:
      – Deeply nested aggregations
      – High cardinality aggregations
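
As a sketch of how breadth-first mode is requested, the terms aggregation accepts a `collect_mode` parameter. The request below is built as a Python dictionary so it can be serialized to JSON. The aggregation names `top_categories` and `min_price` and the fields `category` and `price` are assumptions, chosen to match the walkthrough on the following slides.

```python
import json

# Hypothetical request body: top two categories, with the min price per category.
request = {
    "aggs": {
        "top_categories": {
            "terms": {
                "field": "category",
                "size": 2,
                "collect_mode": "breadth_first",  # defer sub-aggregations until pruning is done
            },
            "aggs": {
                "min_price": {"min": {"field": "price"}}
            },
        }
    }
}
body = json.dumps(request, indent=2)
print(body)
```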
  14. Depth-first mode

    Goal: min price of top two categories

    Id  Category  Price
    1   Shoes     60
    2   Clothing  80
    3   Shoes     50
    4   Sports    10
    5   Sports    35

    Buckets so far (category, doc count, min price):
    Shoes     1  60
  15. Depth-first mode

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, min price):
    Shoes     1  60
    Clothing  1  80
  16. Depth-first mode

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, min price):
    Shoes     2  50
    Clothing  1  80
  17. Depth-first mode

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, min price):
    Shoes     2  50
    Clothing  1  80
    Sports    1  10
  18. Depth-first mode

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, min price):
    Shoes     2  50
    Clothing  1  80
    Sports    2  10
  19. Depth-first mode (result)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets (category, doc count, min price):
    Shoes     2  50
    Sports    2  10
    Clothing  1  80
  20. Depth-first mode (result)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets (category, doc count, min price):
    Shoes     2  50
    Sports    2  10
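
The depth-first walkthrough on slides 14 to 20 can be replayed in a few lines. This is a sketch under assumptions: `depth_first_min_price` is a hypothetical name, ties in document count are broken alphabetically (the slides do not say how ties are handled), and the data is the five-row table from the slides.

```python
# (id, category, price) rows from the slides.
DOCS = [
    (1, "Shoes", 60),
    (2, "Clothing", 80),
    (3, "Shoes", 50),
    (4, "Sports", 10),
    (5, "Sports", 35),
]

def depth_first_min_price(docs, top_n):
    """Compute count and min price for every category, then keep the top N."""
    buckets = {}
    for _doc_id, category, price in docs:
        bucket = buckets.setdefault(category, {"doc_count": 0, "min_price": price})
        bucket["doc_count"] += 1
        bucket["min_price"] = min(bucket["min_price"], price)
    # Pruning happens only at the end, after the sub-aggregation ran for all buckets.
    ranked = sorted(buckets.items(), key=lambda kv: (-kv[1]["doc_count"], kv[0]))
    return dict(ranked[:top_n]), len(buckets)

top_two, buckets_held = depth_first_min_price(DOCS, top_n=2)
print(top_two, buckets_held)
```

Here `buckets_held` is 3: the min price for Clothing was computed and kept in memory even though that bucket is discarded in the result.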
  21. Breadth-first mode (record)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, recorded doc ids):
    Shoes     1  [1]
  22. Breadth-first mode (record)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, recorded doc ids):
    Shoes     1  [1]
    Clothing  1  [2]
  23. Breadth-first mode (record)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets so far (category, doc count, recorded doc ids):
    Shoes     2  [1, 3]
    Clothing  1  [2]
    Sports    2  [4, 5]
  24. Breadth-first mode (prune)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets kept (category, doc count, recorded doc ids):
    Shoes     2  [1, 3]
    Sports    2  [4, 5]
  25. Breadth-first mode (replay)

    Goal: min price of top two categories (document table as on slide 14)

    Buckets (category, doc count, min price):
    Shoes     2  50
    Sports    2  10
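
The record, prune, and replay steps on slides 21 to 25 can be sketched the same way. Again this is an illustration, not the actual implementation: `breadth_first_min_price` is a hypothetical name, the `prices` lookup stands in for re-reading document values during replay, and ties are broken alphabetically as an assumption.

```python
# (id, category, price) rows from the slides.
DOCS = [
    (1, "Shoes", 60),
    (2, "Clothing", 80),
    (3, "Shoes", 50),
    (4, "Sports", 10),
    (5, "Sports", 35),
]

def breadth_first_min_price(docs, top_n):
    prices = {doc_id: price for doc_id, _category, price in docs}

    # Record: remember only which documents fell into which bucket.
    recorded = {}
    for doc_id, category, _price in docs:
        recorded.setdefault(category, []).append(doc_id)

    # Prune: keep the top N buckets by document count.
    ranked = sorted(recorded.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    survivors = dict(ranked[:top_n])

    # Replay: run the sub-aggregation only for documents in surviving buckets.
    result = {
        category: {"doc_count": len(ids), "min_price": min(prices[i] for i in ids)}
        for category, ids in survivors.items()
    }
    return recorded, survivors, result

recorded, survivors, replayed = breadth_first_min_price(DOCS, top_n=2)
print(recorded)
print(survivors)
print(replayed)
```

The result matches depth-first mode, but the min price is never computed for the pruned Clothing bucket. The cost is holding the recorded document ids until the prune step.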
  26. Accuracy: Getting the top N terms

    • We want to find the top 5 colors in our index

    Rank  Shard A      Shard B      Shard C
    1     Blue (25)    Blue (30)    Blue (45)
    2     Red (18)     Red (25)     Green (44)
    3     Green (6)    Orange (17)  Maroon (36)
    4     Yellow (3)   Maroon (16)  Brown (30)
    5     Purple (2)   Brown (15)   Purple (29)
    6     Orange (2)   Pink (14)    Pink (28)
    7     Brown (2)    Teal (10)    White (2)
    8     Pink (2)     White (8)    Yellow (1)
    9     Teal (1)     Green (6)
  27. Accuracy: Getting the top N terms

    • Ask each shard for its top 5 colors
    • Combine results to get top 5 list

    (Shard table as on slide 26.)

    Combined results:

    Rank  Color   Count
    1     Blue    100
    2     Maroon  52
    3     Green   50
    4     Brown   45
    5     Red     43
  28. Accuracy: Getting the top N terms

    • Blue has an accurate count as it appeared in the top 5 on every shard

    (Shard table and combined results as on slides 26 and 27.)
  29. Accuracy: Getting the top N terms

    • Count for green is not accurate
      – Didn't make it into the top 5 for shard B

    (Shard table and combined results as on slides 26 and 27.)
  30. Accuracy: Getting the top N terms

    • Pink didn't make it into the top 5 for any shard
      – But it has a doc count of 44, so it should be 5th

    (Shard table and combined results as on slides 26 and 27.)
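
The merge described on slides 26 to 30 can be reproduced from the shard table. This is a sketch: `top_terms` is a hypothetical helper and the shard lists are copied from the slides in their listed order. Raising `shard_size` to cover every term on every shard gives the exact counts for comparison.

```python
from collections import Counter

# Per-shard term counts from the slides, already sorted by count.
SHARDS = {
    "A": [("Blue", 25), ("Red", 18), ("Green", 6), ("Yellow", 3), ("Purple", 2),
          ("Orange", 2), ("Brown", 2), ("Pink", 2), ("Teal", 1)],
    "B": [("Blue", 30), ("Red", 25), ("Orange", 17), ("Maroon", 16), ("Brown", 15),
          ("Pink", 14), ("Teal", 10), ("White", 8), ("Green", 6)],
    "C": [("Blue", 45), ("Green", 44), ("Maroon", 36), ("Brown", 30), ("Purple", 29),
          ("Pink", 28), ("White", 2), ("Yellow", 1)],
}

def top_terms(shards, size, shard_size):
    """Each shard returns its top `shard_size` terms; the merged top `size` is kept."""
    merged = Counter()
    for terms in shards.values():
        for term, count in terms[:shard_size]:
            merged[term] += count
    return merged.most_common(size)

approximate = top_terms(SHARDS, size=5, shard_size=5)
exact = top_terms(SHARDS, size=5, shard_size=9)
print(approximate)
print(exact)
```

With `shard_size` of 5, Green is undercounted (50 instead of 56) and Pink is missing. With every term returned, Pink takes fifth place with 44 and Red drops out with 43.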
  31. Accuracy: Getting the N top terms

    Request:

        {
          "aggs": {
            "colors": {
              "terms": {
                "field": "color.raw",
                "size": 5,
                "shard_size": 5
              }
            }
          }
        }

    Response (excerpt):

        "doc_count_error_upper_bound": 46
  32. Accuracy: Getting the N top terms

    Request:

        {
          "aggs": {
            "colors": {
              "terms": {
                "field": "color.raw",
                "size": 5,
                "shard_size": 5,
                "show_term_doc_count_error": true
              }
            }
          }
        }

    Response (excerpt):

        { "key": "Blue", "doc_count": 100, "doc_count_error_upper_bound": 0 },
        { "key": "Maroon", "doc_count": 52, "doc_count_error_upper_bound": 2 },
        …
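
The error bounds on slides 31 and 32 are consistent with a simple rule: for each shard, take the count of the last term it returned, since any term the shard left out can have at most that many documents there. The sketch below applies that rule to the top 5 lists from slide 26. The helper name `error_bounds` is hypothetical, and the rule assumes every shard had more terms than it returned, which holds for this data.

```python
# What each shard returns with shard_size = 5 (from the table on slide 26).
SHARD_TOP5 = {
    "A": [("Blue", 25), ("Red", 18), ("Green", 6), ("Yellow", 3), ("Purple", 2)],
    "B": [("Blue", 30), ("Red", 25), ("Orange", 17), ("Maroon", 16), ("Brown", 15)],
    "C": [("Blue", 45), ("Green", 44), ("Maroon", 36), ("Brown", 30), ("Purple", 29)],
}

def error_bounds(returned):
    """Return (overall upper bound, per-term upper bounds) for the merged counts."""
    # Count of the last term each shard returned.
    last_count = {shard: terms[-1][1] for shard, terms in returned.items()}
    overall = sum(last_count.values())

    per_term = {}
    for shard_terms in returned.values():
        for term, _count in shard_terms:
            # Only shards that did NOT return the term can hide missing documents.
            per_term[term] = sum(
                last_count[shard]
                for shard, terms in returned.items()
                if term not in dict(terms)
            )
    return overall, per_term

overall_bound, term_bounds = error_bounds(SHARD_TOP5)
print(overall_bound)
print(term_bounds["Blue"], term_bounds["Maroon"], term_bounds["Green"])
```

This reproduces the values shown on the slides: 46 overall, 0 for Blue, and 2 for Maroon. Green gets a bound of 15 because shard B did not return it.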
  33. Summary

    • Aggregations are powerful and fast
    • Need to trade accuracy for speed/memory in some cases
    • Be aware of the memory requirements you are placing on your cluster
    • Consider using `breadth_first` mode to help with memory stress
    • Use `shard_size` to help manage accuracy with terms aggregation
  34. Thank you! @jpountz @colings86ES

  35. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License (CC-BY-ND 4.0). To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.