Quantitative Cluster Sizing

Slide 1

Slide 1 text

Quantitative Cluster Sizing
Christian Dahlqvist, Solution Architect, @acdahlqvist
Ryan Schneider, Education Engineer, @djschny

Slide 2

Slide 2 text

It depends...

Slide 3

Slide 3 text

Agenda
1. Understanding why "it depends"
2. Sizing methodology
3. Scenario and experiment results
4. Interpreting results and expanding to other scenarios

Slide 4

Slide 4 text

Terminology
• Node
• Index
• Shards
  ‒ Primary
  ‒ Replica
• Mapping

Slide 5

Slide 5 text

Elasticsearch Factors
• Size of shards
• Number of shards on each node
• Size of each document
• Mapping configuration
  ‒ which fields are searchable
  ‒ automatic multi-fields
  ‒ whether message and _all are enabled
• Backing server capacity (SSD vs. HD, CPU, etc.)

Slide 6

Slide 6 text

Your Organization Requirements / SLAs
• Retention period of data
• Ratio and quantity of index vs. search
• Nature of use case
• Continuous vs. bulk indexing
• Kinds of queries being executed
• Desired response time for queries that are run frequently vs. occasionally
• Required sustained vs. peak indexing rate
• Budget & failure tolerance

Slide 7

Slide 7 text

Let's try to determine
• How much disk storage will N documents require?
• When is a single shard too big for my requirements?
• How many active shards saturate my particular hardware?
• How many shards/nodes will I need to sustain X index rate and Y search response?

Slide 8

Slide 8 text

Agenda: 2. Sizing methodology

Slide 9

Slide 9 text

Methodology of Experiments
Each experiment tries to accomplish a discrete goal and builds upon the previous one.
1. Determine disk utilization for various mapping configurations
2. Determine breaking point of a shard
3. Determine saturation point of a node
4. Test configuration on small cluster

Slide 10

Slide 10 text

Experiment One
Determine disk utilization for various mapping configurations
• Use a single node cluster with one index
  ‒ 1 primary
  ‒ 0 replica
• Index a decent amount of data (1GB or about 10 million docs)
• Calculate storage on disk both as-is and after a _forcemerge
• Repeat the above calculations with different mapping configurations
  ‒ _all both enabled and disabled
  ‒ settings for each field
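The comparison in this experiment comes down to one ratio per mapping: index size on disk divided by raw JSON size. A minimal Python sketch, assuming the byte counts have already been measured on the test node. The function name and all sample numbers are hypothetical placeholders, not results from the talk.

```python
def index_to_raw_ratio(raw_json_bytes: int, index_bytes: int) -> float:
    """Return the on-disk index size as a fraction of the raw JSON size."""
    if raw_json_bytes <= 0:
        raise ValueError("raw_json_bytes must be positive")
    return index_bytes / raw_json_bytes


# Hypothetical measurements (bytes) for the same data set under two mappings,
# each taken as-is and again after a force merge.
RAW_JSON_BYTES = 1_000_000_000
measurements = {
    ("default mapping", "as-is"): 1_200_000_000,
    ("default mapping", "force-merged"): 1_050_000_000,
    ("custom mapping", "as-is"): 900_000_000,
    ("custom mapping", "force-merged"): 760_000_000,
}

ratios = {
    key: index_to_raw_ratio(RAW_JSON_BYTES, size)
    for key, size in measurements.items()
}

for (mapping, state), ratio in ratios.items():
    print(f"{mapping:16s} {state:13s} ratio={ratio:.3f}")
```

Recording the ratio rather than the absolute size makes runs with different data volumes comparable.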

Slide 11

Slide 11 text

Experiment Two
Determine breaking point of a shard
• Use a single node cluster with one index
  ‒ 1 primary
  ‒ 0 replica
• Index realistic data and use realistic queries
• Plot index speed and query response time
• Determine where point of diminishing returns is for your requirements
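Once shard size and query response time have been plotted, the breaking point can be read off against a latency requirement. A minimal Python sketch of that step, assuming the measurements are available as (shard size, latency) pairs. The function name, the sample values, and the 500 ms target used here are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def max_shard_size_within_target(
    samples: Sequence[Tuple[float, float]],
    target_latency_ms: float,
) -> Optional[float]:
    """Return the largest shard size (GB) reached before latency first
    exceeds the target, or None if even the smallest sample is too slow.

    samples: (shard_size_gb, query_latency_ms) pairs from the benchmark.
    """
    best = None
    for size_gb, latency_ms in sorted(samples):
        if latency_ms > target_latency_ms:
            break
        best = size_gb
    return best


# Hypothetical measurements: latency grows as the single shard fills up.
samples = [(5, 120), (10, 210), (20, 480), (30, 790), (40, 1150)]
limit = max_shard_size_within_target(samples, target_latency_ms=500)
print(limit)
```

The same function can be rerun with a different target when the response time requirement changes.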

Slide 12

Slide 12 text

Experiment Three
Determine saturation point of a node
• Use a single node cluster with one index
  ‒ 2 primary
  ‒ 0 replica
• Repeat experiment two to see how performance varies
• Keep adding more shards to see when point of diminishing returns occurs
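One way to make "point of diminishing returns" concrete is to stop at the shard count where one more shard no longer improves throughput by some minimum margin. A minimal Python sketch of that rule. The 5% threshold, the function name, and the sample rates are assumptions for illustration, not values from the talk.

```python
from typing import Sequence, Tuple


def saturation_shard_count(
    samples: Sequence[Tuple[int, float]],
    min_gain: float = 0.05,
) -> int:
    """Return the last shard count that still improved throughput by at
    least min_gain (relative) over the previous accepted shard count.

    samples: (shard_count, indexing_rate_events_per_s) pairs.
    """
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one sample")
    best_count, best_rate = ordered[0]
    for count, rate in ordered[1:]:
        if rate < best_rate * (1 + min_gain):
            break  # adding this shard gained less than min_gain
        best_count, best_rate = count, rate
    return best_count


# Hypothetical single-node results: throughput flattens after 3 shards.
samples = [(1, 9000.0), (2, 16000.0), (3, 20000.0), (4, 20500.0), (5, 20300.0)]
print(saturation_shard_count(samples))
```

The same rule can be applied to query latency by feeding in the inverse of the measured latency as the rate.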

Slide 13

Slide 13 text

Experiment Four
Test desired configuration on small cluster
• Configure small representative cluster
• Add representative data volume
• Run realistic benchmarks:
  ‒ Max indexing rate
  ‒ Querying across varying data volumes
  ‒ Benchmark concurrent querying and indexing at various levels
• Measure resource usage, overall docs, disk usage, etc.

Slide 14

Slide 14 text

Agenda: 3. Scenario and experiment results

Slide 15

Slide 15 text

Practical Sizing Example

Slide 16

Slide 16 text

Sizing Scenario

Data
• Structured Logging
• Events in JSON
• Average size: 1.5kB
• 40% Structured
• 60% Analyzed Text

Use Case
• 15 days retention
• Kibana Dashboard for error analysis (interactive)
• Complex Kibana Dashboards for trends
• Small number of users

Platform
• Evaluating Elastic Cloud
• 1:16 RAM/Disk ratio
• 64GB RAM / node
• 1TB SSD storage / node

Slide 17

Slide 17 text

Benchmarking Setup
[Diagram labels: AWS EC2, Benchmark Driver, S3 Snapshot, Elastic Cloud, Elasticsearch, Master node, 2 x 64GB Instances in 2 AZ]

Slide 18

Slide 18 text

Sizing Methodology
1. Disk Utilization (current step)
2. Shard Sizing
3. Single Node Benchmarking
4. Multi-Node Benchmarking

Slide 19

Slide 19 text

Disk Utilization
From raw events to securely indexed on disk
[Diagram: Raw Data, JSON, Indexed, Indexed & Replicated]

Slide 20

Slide 20 text

Disk Utilization
From JSON to indexed size on disk

Data                                Default Logstash Mapping   Custom Mapping
100% Structured                     0.585 ratio                0.401 ratio (-31.4%)
40% Structured, 60% Analyzed Text   1.055 ratio                0.761 ratio (-27.8%)
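The ratios on this slide can be turned into a quick storage estimate. A minimal Python sketch, assuming index size scales linearly with raw JSON volume and that each replica stores a full extra copy. The function name, the dictionary keys, and the 100GB input are hypothetical.

```python
# Index-size-to-JSON ratios as reported on this slide.
RATIOS = {
    ("structured", "default"): 0.585,
    ("structured", "custom"): 0.401,
    ("semistructured", "default"): 1.055,
    ("semistructured", "custom"): 0.761,
}


def estimated_index_gb(
    raw_json_gb: float,
    data_kind: str,
    mapping: str,
    replicas: int = 0,
) -> float:
    """Estimate disk usage for raw_json_gb of JSON, including replica copies.

    Assumes linear scaling, which should be verified with your own data.
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return raw_json_gb * RATIOS[(data_kind, mapping)] * (1 + replicas)


# Hypothetical input: 100GB of raw JSON, custom mapping, one replica.
primary_only = estimated_index_gb(100, "semistructured", "custom")
with_replica = estimated_index_gb(100, "semistructured", "custom", replicas=1)
print(f"{primary_only:.1f} GB primary, {with_replica:.1f} GB with one replica")
```

These ratios come from one data set and one Elasticsearch version, so treat the output as a starting point for your own Experiment One rather than a prediction.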

Slide 21

Slide 21 text

Sizing Methodology: 2. Shard Sizing

Slide 22

Slide 22 text

Shard Sizing
[Charts: Dashboard Latency against Shard Size and Record Count, for Structured Data and Semistructured Data]

Slide 23

Slide 23 text

Shard Sizing
[Chart annotations: ~500 ms, ~20 GB, ~18M records]

Slide 24

Slide 24 text

Sizing Methodology: 3. Single Node Benchmarking

Slide 25

Slide 25 text

Single Node Querying

Slide 26

Slide 26 text

Single Node Indexing
How many shards per node is optimal for indexing?

Slide 27

Slide 27 text

Sizing Methodology: 4. Multi-Node Benchmarking

Slide 28

Slide 28 text

Scaling for Benchmarking
Creating small representative benchmarking cluster for log analytics
• N data nodes: X queries, Y index requests
• 2 data nodes: X queries, Y * 2/N index requests
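The scaling rule on this slide keeps the query load at X and scales only the indexing load, to Y * 2/N. A minimal Python sketch of that arithmetic. The function name and the example figures (10,000 events/s across 10 nodes) are hypothetical.

```python
def benchmark_index_rate(
    production_rate: float,
    production_nodes: int,
    benchmark_nodes: int = 2,
) -> float:
    """Scale a production indexing rate Y down to a small benchmark cluster.

    Implements Y * benchmark_nodes / N from the slide; the query load X is
    left unchanged and is therefore not a parameter here.
    """
    if production_nodes < benchmark_nodes or benchmark_nodes < 1:
        raise ValueError("expected 1 <= benchmark_nodes <= production_nodes")
    return production_rate * benchmark_nodes / production_nodes


# Hypothetical target: 10,000 events/s on a planned 10-node cluster.
rate = benchmark_index_rate(10_000, production_nodes=10)
print(rate)
```

If the two-node benchmark sustains this rate at acceptable query latency, the planned N-node cluster is a reasonable candidate for the full load.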

Slide 29

Slide 29 text

Maximum Indexing Rate
[Charts: Events per Second and Throughput (MB/s), for Structured Data and Semistructured Data]

Slide 30

Slide 30 text

Maximum Indexing Rate
[Charts: Events per Second and Throughput (MB/s) with increasing event size, for Structured Data and Semistructured Data]

Slide 31

Slide 31 text

Maximum Indexing Rate
[Charts: Events per Second and Throughput (MB/s) with increasing event size, from Smallest Events to Largest Events, for Structured Data and Semistructured Data]

Slide 32

Slide 32 text

Concurrent Indexing and Querying
Indexing rate vs Dashboard query latency vs Data volume queried
[Chart: Achieved Indexing Rate, Target Indexing Rate, Dashboard Latency, Data Volume Queried]

Slide 33

Slide 33 text

Concurrent Indexing and Querying
Indexing rate vs Dashboard query latency vs Data volume queried
[Chart annotations: Achieved Rate < Target Rate; Increasing Query Latency]

Slide 34

Slide 34 text

Agenda: 4. Interpreting results and expanding to other scenarios

Slide 35

Slide 35 text

Interpreting Results
Applying the experiment results to a capacity plan

Slide 36

Slide 36 text

Interpreting Results
What can we learn from this simple benchmark?
1. 1TB storage, 15 days retention => ~68GB index size/day
2. 68GB index size => 700 events/s, 89GB raw JSON logs/day
3. More ingest or retention => Scale out
4. Evaluate more settings => Optimize further
5. Simplified example => Your results WILL be different
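The first two results can be reproduced approximately from figures given earlier in the deck: 1TB storage, 15 days retention, the 0.761 index-to-JSON ratio, and the 1.5kB average event size. A rough Python sketch of that arithmetic. The unit conventions (1TB = 1024GB, 1GB = 10^9 bytes, 1.5kB = 1500 bytes) are assumptions, so the output only approximates the rounded slide values.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the sizing scenario and disk utilization slides.
storage_gb = 1024             # 1TB of storage, assumed to mean 1024GB
retention_days = 15
index_to_json_ratio = 0.761   # custom mapping, 40% structured / 60% analyzed text
avg_event_bytes = 1_500       # 1.5kB average event, assumed to mean 1500 bytes

# Step 1: daily index budget that fits the retention period.
index_gb_per_day = storage_gb / retention_days

# Step 2: raw JSON volume and event rate that produce that much index per day.
raw_json_gb_per_day = index_gb_per_day / index_to_json_ratio
events_per_second = raw_json_gb_per_day * 1e9 / avg_event_bytes / SECONDS_PER_DAY

print(f"index size/day : {index_gb_per_day:.0f} GB")
print(f"raw JSON/day   : {raw_json_gb_per_day:.0f} GB")
print(f"events/s       : {events_per_second:.0f}")
```

The results land close to the slide's ~68GB, 89GB, and 700 events/s. The small differences come from rounding and the unit assumptions above.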

Slide 41

Slide 41 text

Other Factors
Extend the methodology to your situation

Slide 42

Slide 42 text

Other Factors to Consider
• Hot vs. cold nodes and architecture
• Risk and fault tolerances (one replica not enough?)
• Mixed use clusters

Slide 43

Slide 43 text

Recommendations / Tips
• The more realistic the data and queries, the better the results
• Be systematic; standard scientific method
• Record your results
• Script your tests
• Rerun your tests
  ‒ when you need to upgrade hardware
  ‒ when your requirements change greatly
• Monitor your cluster and usage

Slide 44

Slide 44 text

Summary
We now know...
• Why "it depends"
• A methodology to apply
• An example to use for reference
• How to apply it to each unique situation
• Elastic is here to help
  ‒ Community resources
  ‒ Subscription support
  ‒ Professional services

Slide 45

Slide 45 text

It depends... but we know how to quantify it!

Slide 46

Slide 46 text

Questions?
Also find us at the AMA Booth

Slide 47

Slide 47 text

Please attribute Elastic with a link to elastic.co
Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.