Slide 1

Scaling Sensu Go
Sean Porter, Co-founder & CTO

Slide 2

Who am I?
● Creator of Sensu
● Co-founder
● CTO
● @PorterTech

Slide 3

Overview
1. How we 10X’d performance in 6 months
2. Deployment architectures
3. Hardware recommendations
4. Summary
5. Questions

Slide 4

Goals for Sensu Go

Slide 5

Slide 6

Slide 7

Scale
In terms of:
● Performance
● Organization

Slide 8

GA
December 5th, 2018

Slide 9

Scaling Sensu Core (1.X)
● Steep learning curve
● Requires RabbitMQ and Redis expertise
● Capable of scaling*

Slide 10

Scaling Sensu Core (1.X)

Slide 11

Scaling Sensu Core (1.X)

Slide 12

Slide 13

Step 1 - Instrument

Slide 14

Step 2 - Test environment
● Used AWS EC2
● m5.2xlarge to i3.metal
● Agent session load tool (sketched below)
● Disappointing results (~5k)
● Inconsistent
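
A rough sketch of what an agent session load tool can look like: open many WebSocket sessions against a backend and hold them, in the spirit of the later sensu-perf tooling. The endpoint URL, port, and session count here are assumptions for illustration, not the actual tool.

    package main

    import (
        "log"
        "sync"

        "github.com/gorilla/websocket"
    )

    func main() {
        const sessions = 1000         // illustrative session count
        url := "ws://localhost:8081/" // assumed backend agent WebSocket endpoint

        var wg sync.WaitGroup
        for i := 0; i < sessions; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                conn, _, err := websocket.DefaultDialer.Dial(url, nil)
                if err != nil {
                    log.Printf("session %d: %v", id, err)
                    return
                }
                defer conn.Close()
                // Hold the session open by blocking on reads until the
                // backend closes the connection.
                for {
                    if _, _, err := conn.ReadMessage(); err != nil {
                        return
                    }
                }
            }(i)
        }
        wg.Wait()
    }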

Slide 15

Step 3 - Get serious

Slide 16

Spent $10k on gaming hardware.

Slide 17

Slide 18

Why bare metal?
● Control
● Consistency
● Capacity

Slide 19

Slide 20

Backend hardware
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Intel Gigabit CT PCIe Network Card

Slide 21

Agent hardware
● AMD Threadripper 2990WX (32 Cores, 3.0GHz)
● Gigabyte X399 AORUS PRO
● 32GB DDR4 2666MHz CL16 (4x 8GB)
● Intel 660p Series M.2 PCIe 512GB SSD

Slide 22

Network hardware
● Two Ubiquiti UniFi 8 Port 60W Switches
● Separate load tool and data planes

Slide 23

Slide 24

The first results
● Consistently delivered disappointing results!
    Agents: 4,000
    Checks: 8 at 5s interval
    Events/s: 6,400
● Produced data!

Slide 25

The first results
● Identified several possible bottlenecks
● Identified bugs while under load!
● Began experimentation...

Slide 26

The primary offender
● Sensu Events!
● ~95% of etcd write operations
● Disabled Event persistence - 11,200 Events/s
● etcd max database size (10GB*, see the note below)
● Needed to move the workload
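
On the 10GB asterisk: etcd's storage quota defaults to 2GB, and the etcd documentation recommends staying under roughly 8GB, so any larger figure implies a raised quota. A minimal example for a standalone etcd follows; the flag is etcd's own, and sensu-backend exposes a corresponding setting for its embedded etcd (treat the exact sensu-backend flag name as an assumption here).

    # Raise etcd's backend quota (default 2GB); the value is in bytes.
    etcd --quota-backend-bytes=$((8 * 1024 * 1024 * 1024))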

Slide 27

Slide 28

Slide 29

PostgreSQL hardware
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Three Intel Gigabit CT PCIe Network Cards

Slide 30

New results with PostgreSQL
Agents: 4,000
Checks: 14 at 5s interval
Events/s: 11,200
Not good enough!

Slide 31

PostgreSQL tuning
● Multi-Version Concurrency Control
● Many updates - need aggressive auto-vacuuming!

    vacuum_cost_delay = 10ms
    vacuum_cost_limit = 10000
    autovacuum_naptime = 10s
    autovacuum_vacuum_scale_factor = 0.05
    autovacuum_analyze_scale_factor = 0.025

Slide 32

PostgreSQL tuning
● Tune write-ahead logging
● Reduce the number of disk writes (applying these settings is sketched below)

    wal_sync_method = fdatasync
    wal_writer_delay = 5000ms
    max_wal_size = 5GB
    min_wal_size = 1GB
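
These are standard postgresql.conf parameters, and all of the ones on these two slides can be applied without a full restart. A sketch using plain PostgreSQL administration commands:

    -- Apply the tuning from the two slides above (superuser required).
    ALTER SYSTEM SET autovacuum_naptime = '10s';
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
    ALTER SYSTEM SET wal_sync_method = 'fdatasync';
    ALTER SYSTEM SET wal_writer_delay = '5000ms';
    ALTER SYSTEM SET max_wal_size = '5GB';
    SELECT pg_reload_conf();  -- reload the config; no restart needed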

Slide 33

A huge bug!
● Check TTL switch set, then buried, on every Event!
● Additional etcd PUT and DELETE operations (see the sketch below)
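
A minimal sketch of the shape of the fix, not Sensu's actual code: setSwitch is a hypothetical stand-in for the etcd writes that maintain the check-TTL dead man's switch.

    package main

    type Check struct{ TTL int64 }
    type Event struct{ Check Check }

    // Stand-in for the etcd PUT (and matching DELETE) that maintain
    // the check-TTL dead man's switch.
    func setSwitch(e *Event) {}

    func handleEvent(e *Event) {
        // The bug: the switch was set and buried for every Event,
        // costing extra etcd writes even with no TTL configured.
        // The fix: only maintain the switch when a TTL is in use.
        if e.Check.TTL > 0 {
            setSwitch(e)
        }
    }

    func main() { handleEvent(&Event{Check: Check{TTL: 90}}) }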

Slide 34

New results with bug fix
Agents: 4,000
Checks: 40 at 5s interval
Events/s: 32,000
Much better! Still not good enough.

Slide 35

Entity and silenced caches
● Several etcd range requests (reads) per Event
● Caching reduced etcd range requests by 50% (see the sketch below)
● No improvement to Event throughput :(
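
A minimal sketch of the caching idea, assuming a periodic snapshot refresh; the type and refresh policy are illustrative, not Sensu's implementation.

    package main

    import (
        "sync"
        "time"
    )

    // entityCache answers lookups that would otherwise each be an
    // etcd range request per Event.
    type entityCache struct {
        mu      sync.RWMutex
        entries map[string][]byte // entity name -> serialized entity
    }

    func (c *entityCache) get(name string) ([]byte, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        v, ok := c.entries[name]
        return v, ok
    }

    // refresh periodically swaps in a fresh snapshot from the backing
    // store (etcd), turning many per-Event reads into one range read.
    func (c *entityCache) refresh(load func() map[string][]byte, every time.Duration) {
        for {
            snapshot := load()
            c.mu.Lock()
            c.entries = snapshot
            c.mu.Unlock()
            time.Sleep(every)
        }
    }

    func main() {
        c := &entityCache{entries: map[string][]byte{}}
        go c.refresh(func() map[string][]byte {
            return map[string][]byte{"web-01": []byte(`{"class":"agent"}`)}
        }, 10*time.Second)
        _, _ = c.get("web-01")
    }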

Slide 36

Serialization
● Every object is serialized for transport and storage
● Changed from JSON to Protobuf (see the sketch below)
    ○ Applied to Agent transport and etcd store
    ○ Reduced serialized object size!
    ○ Less CPU time
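
A runnable sketch of the two serialization paths using the google.golang.org/protobuf module. structpb is only a stand-in so the example runs without code generation; it still embeds key names, so the size reduction reported above comes from generated messages with numeric field tags (as Sensu's types are), not from this dynamic form.

    package main

    import (
        "encoding/json"
        "fmt"

        "google.golang.org/protobuf/proto"
        "google.golang.org/protobuf/types/known/structpb"
    )

    func main() {
        // Stand-in payload; Sensu serializes Events for agent
        // transport and the etcd store.
        payload := map[string]interface{}{
            "entity": "web-01",
            "check":  "check_cpu",
            "status": 0,
        }

        jsonBytes, err := json.Marshal(payload)
        if err != nil {
            panic(err)
        }

        msg, err := structpb.NewStruct(payload)
        if err != nil {
            panic(err)
        }
        protoBytes, err := proto.Marshal(msg)
        if err != nil {
            panic(err)
        }

        fmt.Printf("JSON: %d bytes, Protobuf: %d bytes\n",
            len(jsonBytes), len(protoBytes))
    }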

Slide 37

Internal queues and workers
● Increased Backend internal queue lengths
    ○ From 100 to 1000 (made configurable)
● Increased Backend internal worker counts
    ○ From 100 to 1000 (made configurable)
● Increases concurrency and absorbs latency spikes (see the sketch below)
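
A minimal sketch of the queue-and-worker pattern described above; the sizes mirror the slide, but the structure is illustrative rather than Sensu's actual internals.

    package main

    import (
        "fmt"
        "sync"
    )

    func process(ev string) { _ = ev } // stand-in for event handling

    func main() {
        const queueLen = 1000 // was 100; a larger buffer absorbs latency spikes
        const workers = 1000  // was 100; more workers raise concurrency

        events := make(chan string, queueLen) // buffered internal queue

        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for ev := range events {
                    process(ev)
                }
            }()
        }

        // Producers block only when the buffer is full.
        for i := 0; i < 5000; i++ {
            events <- fmt.Sprintf("event-%d", i)
        }
        close(events)
        wg.Wait()
    }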

Slide 38

New results
Agents: 36,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 34,200
Almost there!!!

Slide 39

Slide 40

New results
Agents: 40,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 38,000

Slide 41

Slide 42

The performance project
● https://github.com/sensu/sensu-perf
● Performance tests are reproducible
● Users can test their own deployments!
● Now part of release QA!

Slide 43

What’s next for scaling Sensu?

Slide 44

Multi-site Federation
● 40,000 Agents per cluster
● Run multiple/distributed Sensu Go clusters
● Centralized RBAC policy management
● Centralized visibility via the WebUI

Slide 45

Deployment architectures

Slide 46

Slide 47

Slide 48

Slide 49

Slide 50

Slide 51

Slide 52

Slide 53

Hardware recommendations*

Slide 54

Backend requirements
● 16 vCPU
● 16GB memory
● Attached NVMe SSD
    ○ >50MB/s and >5k sustained random IOPS
● Gigabit ethernet (low latency)

Slide 55

PostgreSQL requirements
● 16 vCPU
● 16GB memory
● Attached NVMe SSD
    ○ >300MB/s and >5k sustained random IOPS
● 10 gigabit ethernet (low latency)

Slide 56

Summary

Slide 57

Slide 58

Questions?