Scaling Sensu Go

portertech
September 10, 2019


For over eight years, our community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year, and was designed to be more portable, easier and faster to deploy, and most importantly: more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.


Transcript

Slide 3: Overview

1. How we 10X'd performance in 6 months
2. Deployment architectures
3. Hardware recommendations
4. Summary
5. Questions


Slide 9: Scaling Sensu Core (1.x)

• Steep learning curve
• Requires RabbitMQ and Redis expertise
• Capable of scaling*


Slide 14: Step 2 - Test environment

• Used AWS EC2
• Instance types from m5.2xlarge to i3.metal
• Agent session load tool
• Disappointing results (~5k)
• Inconsistent results


Slide 20: Backend hardware

• AMD Threadripper 2920X (12 cores, 3.5GHz)
• Gigabyte X399 AORUS PRO
• 16GB DDR4 2666MHz CL16 (2x 8GB)
• Two Intel 660p Series M.2 PCIe 512GB SSDs
• Intel Gigabit CT PCIe network card

Slide 21: Agents hardware

• AMD Threadripper 2990WX (32 cores, 3.0GHz)
• Gigabyte X399 AORUS PRO
• 32GB DDR4 2666MHz CL16 (4x 8GB)
• Intel 660p Series M.2 PCIe 512GB SSD

Slide 22: Network hardware

• Two Ubiquiti UniFi 8-port 60W switches
• Separate load-tool and data planes


Slide 24: The first results

• Consistently delivered disappointing results!
• Agents: 4,000; Checks: 8 at a 5s interval; Events/s: 6,400
• Produced data!

Slide 25: The first results

• Identified several possible bottlenecks
• Identified bugs while under load!
• Began experimentation...

Slide 26: The primary offender

• Sensu Events!
• ~95% of etcd write operations
• Disabled Event persistence: 11,200 Events/s
• etcd max database size (10GB*)
• Needed to move the workload


Slide 29: PostgreSQL hardware

• AMD Threadripper 2920X (12 cores, 3.5GHz)
• Gigabyte X399 AORUS PRO
• 16GB DDR4 2666MHz CL16 (2x 8GB)
• Two Intel 660p Series M.2 PCIe 512GB SSDs
• Three Intel Gigabit CT PCIe network cards

Slide 30: New results with PostgreSQL

• Agents: 4,000; Checks: 14 at a 5s interval; Events/s: 11,200
• Not good enough!

Slide 31: PostgreSQL tuning

• Multi-Version Concurrency Control
• Many updates: needs aggressive auto-vacuuming!

vacuum_cost_delay = 10ms
vacuum_cost_limit = 10000
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025

Slide 32: PostgreSQL tuning

• Tune write-ahead logging
• Reduce the number of disk writes

wal_sync_method = fdatasync
wal_writer_delay = 5000ms
max_wal_size = 5GB
min_wal_size = 1GB

Slide 33: A huge bug!

• Burying Check TTL switch set on every Event!
• Additional etcd PUT and DELETE operations

Slide 34: New results with bug fix

• Agents: 4,000; Checks: 40 at a 5s interval; Events/s: 32,000
• Much better! Still not good enough.

Slide 35: Entity and silenced caches

• Several etcd range (read) requests per Event
• Caching reduced etcd range requests by 50%
• No improvement to Event throughput :(

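The caching idea on this slide can be sketched as a read-through cache in Go. All names below are illustrative, not Sensu's actual code: a lookup reaches the backing store (etcd, in Sensu's case) only on a miss, which is how repeated range requests per Event get cut down.

```go
package main

import "fmt"

// store stands in for the backing key-value store (etcd in Sensu's case).
type store struct {
	reads int
	data  map[string]string
}

func (s *store) Get(key string) string {
	s.reads++ // count how often the slow store is hit
	return s.data[key]
}

// cache is a minimal read-through cache: serve from memory when possible,
// fall back to the store on a miss and remember the result.
type cache struct {
	backing *store
	local   map[string]string
}

func (c *cache) Get(key string) string {
	if v, ok := c.local[key]; ok {
		return v
	}
	v := c.backing.Get(key)
	c.local[key] = v
	return v
}

func main() {
	s := &store{data: map[string]string{"entity:web-01": "{...}"}}
	c := &cache{backing: s, local: map[string]string{}}
	for i := 0; i < 10; i++ {
		c.Get("entity:web-01") // only the first call reaches the store
	}
	fmt.Println(s.reads) // prints 1
}
```

As the slide notes, fewer reads did not translate into higher Event throughput, because writes (not reads) were the bottleneck.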
Slide 36: Serialization

• Every object is serialized for transport and storage
• Changed from JSON to Protobuf
  ◦ Applied to Agent transport and etcd store
  ◦ Reduced serialized object size!
  ◦ Less CPU time

Slide 37: Internal queues and workers

• Increased Backend internal queue lengths
  ◦ From 100 to 1,000 (made configurable)
• Increased Backend internal worker counts
  ◦ From 100 to 1,000 (made configurable)
• Increases concurrency and absorbs latency spikes

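The queue-and-worker pattern above maps naturally onto Go channels: a buffered channel acts as the internal queue, and a configurable number of goroutines drain it. A minimal sketch, with names and numbers that are illustrative rather than Sensu's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// process pushes events through a buffered channel (the "internal queue")
// drained by a pool of workers; both sizes are parameters, mirroring the
// configurable queue length and worker count described on the slide.
func process(queueLen, workers, events int) int64 {
	queue := make(chan int, queueLen) // buffer absorbs latency spikes
	var handled int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // each worker drains the shared queue
				atomic.AddInt64(&handled, 1)
			}
		}()
	}
	for i := 0; i < events; i++ {
		queue <- i // blocks only when the buffer is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	fmt.Println(process(1000, 100, 5000)) // prints 5000
}
```

A longer buffer lets producers keep publishing through short consumer stalls, and more workers raise concurrency; making both configurable lets operators tune for their own hardware.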

Slide 42: The performance project

• https://github.com/sensu/sensu-perf
• Performance tests are reproducible
• Users can test their own deployments!
• Now part of release QA!

Slide 44: Multi-site Federation

• 40,000 Agents per cluster
• Run multiple/distributed Sensu Go clusters
• Centralized RBAC policy management
• Centralized visibility via the WebUI


Slide 54: Backend requirements

• 16 vCPUs
• 16GB memory
• Attached NVMe SSD: >50MB/s and >5k sustained random IOPS
• Gigabit Ethernet (low latency)

Slide 55: PostgreSQL requirements

• 16 vCPUs
• 16GB memory
• Attached NVMe SSD: >300MB/s and >5k sustained random IOPS
• 10 Gigabit Ethernet (low latency)
