● Two Ubiquiti UniFi 8 Port 60W Switches
● Separate load-tool and data planes
Network hardware
● Consistently delivered disappointing results!
Agents: 4,000
Checks: 8 at 5s interval
Events/s: 6,400
● Produced data!
The first results
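(The events-per-second figure follows directly from the test parameters: 4,000 agents × 8 checks / 5s interval = 6,400 events/s.)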
● Identified several possible bottlenecks
● Identified bugs while under load!
● Began experimentation...
The first results
● Sensu Events!
● ~95% of etcd write operations
● Disabled Event persistence - 11,200 Events/s
● etcd max database size (10GB*)
● Needed to move the workload
The primary offender
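(The 10GB figure refers to etcd's backend quota, set with --quota-backend-bytes; it defaults to 2GB, and the etcd documentation recommends staying well under ~8GB, so a workload that persists every Event in etcd runs into this ceiling quickly.)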
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Three Intel Gigabit CT PCIe Network Cards
PostgreSQL hardware
Agents: 4,000
Checks: 14 at 5s interval
Events/s: 11,200
Not good enough!
New results with PostgreSQL
● Multi-Version Concurrency Control
● Many updates - needs aggressive autovacuuming!
vacuum_cost_delay = 10ms                # sleep briefly each time the cost limit is reached
vacuum_cost_limit = 10000               # allow far more vacuum I/O per cycle (default 200)
autovacuum_naptime = 10s                # wake the autovacuum launcher every 10s (default 1min)
autovacuum_vacuum_scale_factor = 0.05   # vacuum at ~5% dead tuples (default 20%)
autovacuum_analyze_scale_factor = 0.025 # analyze at ~2.5% changed tuples (default 10%)
PostgreSQL tuning
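(For context: autovacuum vacuums a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples, so dropping the scale factor from its 0.2 default to 0.05 starts cleanup at roughly 5% dead rows instead of 20%.)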
● Tune write-ahead logging
● Reduce the number of disk writes
wal_sync_method = fdatasync    # flush data blocks without forcing a metadata sync
wal_writer_delay = 5000ms      # batch WAL flushes (default 200ms)
max_wal_size = 5GB             # trigger forced checkpoints far less often (default 1GB)
min_wal_size = 1GB             # recycle WAL segments instead of removing them (default 80MB)
PostgreSQL tuning
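(The trade-off in the settings above: a longer wal_writer_delay batches flushes of asynchronously committed transactions, cutting disk writes at the cost of a slightly wider window of recent work that could be lost in a crash; synchronous commits still flush at commit time.)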
● Burying the Check TTL switchset on every Event!
● Additional etcd PUT and DELETE operations (see the sketch below)
A huge bug!
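A minimal Go sketch of the shape of the fix, with hypothetical names standing in for Sensu's internals: only touch the etcd-backed TTL switchset when a check actually declares a TTL, rather than burying it for every Event.

package eventd

import "context"

// Hypothetical, simplified stand-ins for Sensu's internal types.
type Check struct {
	TTL int64 // seconds; 0 means the check has no TTL configured
}

type Event struct {
	Check *Check
}

// SwitchSet models the etcd-backed liveness switches behind check TTLs.
type SwitchSet interface {
	Alive(ctx context.Context, id string, ttl int64) error // etcd PUT
	Bury(ctx context.Context, id string) error             // etcd DELETE
}

// handleEvent issues switchset traffic only for checks that declare a TTL,
// instead of burying the switch for every single Event.
func handleEvent(ctx context.Context, s SwitchSet, id string, e *Event) error {
	if e.Check != nil && e.Check.TTL > 0 {
		return s.Alive(ctx, id, e.Check.TTL) // re-arm the TTL timer
	}
	return nil // no TTL: no extra etcd PUT/DELETE at all
}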
Agents: 4,000
Checks: 40 at 5s interval
Events/s: 32,000
Much better! Still not good enough.
New results with bug fix
● Several etcd range requests (reads) per Event
● Caching reduced etcd range requests by 50% (see the sketch below)
● No improvement to Event throughput :(
Entity and silenced caches
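A minimal sketch of the caching idea, with illustrative names rather than Sensu's actual implementation: serve per-Event entity and silenced-entry lookups from an in-memory snapshot that a background loop rebuilds, so the hot path stops issuing etcd range requests.

package cache

import (
	"sync"
	"time"
)

// Snapshot answers reads from memory; a background loop rebuilds it.
type Snapshot[T any] struct {
	mu   sync.RWMutex
	data map[string]T
}

// Get performs a lookup without touching etcd.
func (s *Snapshot[T]) Get(key string) (T, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

// RefreshLoop swaps in a fresh copy on every tick; fetch performs the
// now-infrequent etcd range read.
func (s *Snapshot[T]) RefreshLoop(interval time.Duration, fetch func() map[string]T) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		next := fetch()
		s.mu.Lock()
		s.data = next
		s.mu.Unlock()
	}
}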
● Every object is serialized for transport and storage
● Changed from JSON to Protobuf
○ Applied to Agent transport and etcd store
○ Reduced serialized object size!
○ Less CPU time
Serialization
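One way to picture the change above, as a sketch with illustrative names (not Sensu's actual API): a single serializer interface behind which JSON is swapped for protobuf. Note that proto.Marshal only accepts types generated from .proto definitions.

package wire

import (
	"encoding/json"

	"google.golang.org/protobuf/proto"
)

// Serializer abstracts how objects are encoded for transport and storage.
type Serializer interface {
	Marshal(v any) ([]byte, error)
	Unmarshal(data []byte, v any) error
}

// jsonCodec: human-readable, but larger payloads and more CPU.
type jsonCodec struct{}

func (jsonCodec) Marshal(v any) ([]byte, error)   { return json.Marshal(v) }
func (jsonCodec) Unmarshal(d []byte, v any) error { return json.Unmarshal(d, v) }

// protoCodec: smaller serialized objects and less CPU time, but it only
// works for generated protobuf message types.
type protoCodec struct{}

func (protoCodec) Marshal(v any) ([]byte, error)   { return proto.Marshal(v.(proto.Message)) }
func (protoCodec) Unmarshal(d []byte, v any) error { return proto.Unmarshal(d, v.(proto.Message)) }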
● Increased Backend internal queue lengths
○ From 100 to 1000 (made configurable)
● Increased Backend internal worker counts
○ From 100 to 1000 (made configurable)
● Increases concurrency and absorbs latency spikes
Internal queues and workers
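In Go terms, this tuning amounts to a bigger buffered channel drained by more goroutines. A sketch under that assumption (the real knobs are Sensu's now-configurable queue lengths and worker counts):

package pipeline

import "sync"

// startWorkers builds a buffered queue drained by a pool of goroutines.
// A longer buffer absorbs latency spikes; more workers raise concurrency.
func startWorkers[T any](queueLen, workers int, handle func(T)) (chan<- T, *sync.WaitGroup) {
	queue := make(chan T, queueLen) // e.g. raised from 100 to 1000
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ { // e.g. raised from 100 to 1000
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range queue {
				handle(item)
			}
		}()
	}
	return queue, &wg
}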
Agents: 36,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 34,200
Almost there!!!
New results
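(These numbers line up if agents are spread evenly across the four subscriptions: each agent then runs 38 / 4 = 9.5 checks on average, and 36,000 agents × 9.5 checks / 10s = 34,200 events/s.)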
Agents: 40,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 38,000
New results
● https://github.com/sensu/sensu-perf
● Performance tests are reproducible
● Users can test their own deployments!
● Now part of release QA!
The performance project
What’s next for scaling Sensu?
Multi-site Federation
● 40,000 Agents per cluster
● Run multiple/distributed Sensu Go clusters
● Centralized RBAC policy management
● Centralized visibility via the WebUI