Scaling Sensu Go

portertech
September 10, 2019


For over eight years, our community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year, and was designed to be more portable, easier and faster to deploy, and most importantly: more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.


Transcript

Slide 3: Overview

1. How we 10X'd performance in 6 months
2. Deployment architectures
3. Hardware recommendations
4. Summary
5. Questions


Slide 9: Scaling Sensu Core (1.x)

• Steep learning curve
• Requires RabbitMQ and Redis expertise
• Capable of scaling*


Slide 14: Step 2 - Test environment

• Used AWS EC2
• Instance types from m5.2xlarge to i3.metal
• Agent session load tool
• Disappointing results (~5k)
• Inconsistent results


Slide 20: Backend hardware

• AMD Threadripper 2920X (12 cores, 3.5GHz)
• Gigabyte X399 AORUS PRO
• 16GB DDR4 2666MHz CL16 (2x 8GB)
• Two Intel 660p Series M.2 PCIe 512GB SSDs
• Intel Gigabit CT PCIe network card

Slide 21: Agents hardware

• AMD Threadripper 2990WX (32 cores, 3.0GHz)
• Gigabyte X399 AORUS PRO
• 32GB DDR4 2666MHz CL16 (4x 8GB)
• Intel 660p Series M.2 PCIe 512GB SSD

Slide 22: Network hardware

• Two Ubiquiti UniFi 8-port 60W switches
• Separate load-tool and data planes


Slide 24: The first results

• Consistently delivered disappointing results!
• Agents: 4,000; Checks: 8 at a 5s interval; Events/s: 6,400
• Produced data!

Slide 25: The first results

• Identified several possible bottlenecks
• Identified bugs while under load!
• Began experimentation...

Slide 26: The primary offender

• Sensu Events!
• ~95% of etcd write operations
• Disabled Event persistence: 11,200 Events/s
• etcd max database size (10GB*)
• Needed to move the workload


Slide 29: PostgreSQL hardware

• AMD Threadripper 2920X (12 cores, 3.5GHz)
• Gigabyte X399 AORUS PRO
• 16GB DDR4 2666MHz CL16 (2x 8GB)
• Two Intel 660p Series M.2 PCIe 512GB SSDs
• Three Intel Gigabit CT PCIe network cards

Slide 30: New results with PostgreSQL

• Agents: 4,000; Checks: 14 at a 5s interval; Events/s: 11,200
• Not good enough!

Slide 31: PostgreSQL tuning

• Multi-Version Concurrency Control
• Many updates: needs aggressive auto-vacuuming!

vacuum_cost_delay = 10ms
vacuum_cost_limit = 10000
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025

Slide 32: PostgreSQL tuning

• Tune write-ahead logging
• Reduce the number of disk writes

wal_sync_method = fdatasync
wal_writer_delay = 5000ms
max_wal_size = 5GB
min_wal_size = 1GB

Slide 33: A huge bug!

• Burying Check TTL switch set on every Event!
• Additional etcd PUT and DELETE operations

Slide 34: New results with bug fix

• Agents: 4,000; Checks: 40 at a 5s interval; Events/s: 32,000
• Much better! Still not good enough.

Slide 35: Entity and silenced caches

• Several etcd range (read) requests per Event
• Caching reduced etcd range requests by 50%
• No improvement to Event throughput :(

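The caching idea on this slide can be sketched as a read-through cache in Go. All names below are illustrative, not Sensu's actual code: a lookup reaches the backing store (etcd, in Sensu's case) only on a miss, which is how repeated range requests per Event get cut down.

```go
package main

import "fmt"

// store stands in for the backing key-value store (etcd in Sensu's case).
type store struct {
	reads int
	data  map[string]string
}

func (s *store) Get(key string) string {
	s.reads++ // count how often the slow store is hit
	return s.data[key]
}

// cache is a minimal read-through cache: serve from memory when possible,
// fall back to the store on a miss and remember the result.
type cache struct {
	backing *store
	local   map[string]string
}

func (c *cache) Get(key string) string {
	if v, ok := c.local[key]; ok {
		return v
	}
	v := c.backing.Get(key)
	c.local[key] = v
	return v
}

func main() {
	s := &store{data: map[string]string{"entity:web-01": "{...}"}}
	c := &cache{backing: s, local: map[string]string{}}
	for i := 0; i < 10; i++ {
		c.Get("entity:web-01") // only the first call reaches the store
	}
	fmt.Println(s.reads) // prints 1
}
```

As the slide notes, fewer reads did not translate into higher Event throughput, because writes (not reads) were the bottleneck.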
Slide 36: Serialization

• Every object is serialized for transport and storage
• Changed from JSON to Protobuf
  ◦ Applied to Agent transport and etcd store
  ◦ Reduced serialized object size!
  ◦ Less CPU time

Slide 37: Internal queues and workers

• Increased Backend internal queue lengths
  ◦ From 100 to 1,000 (made configurable)
• Increased Backend internal worker counts
  ◦ From 100 to 1,000 (made configurable)
• Increases concurrency and absorbs latency spikes

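The queue-and-worker pattern above maps naturally onto Go channels: a buffered channel acts as the internal queue, and a configurable number of goroutines drain it. A minimal sketch, with names and numbers that are illustrative rather than Sensu's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// process pushes events through a buffered channel (the "internal queue")
// drained by a pool of workers; both sizes are parameters, mirroring the
// configurable queue length and worker count described on the slide.
func process(queueLen, workers, events int) int64 {
	queue := make(chan int, queueLen) // buffer absorbs latency spikes
	var handled int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // each worker drains the shared queue
				atomic.AddInt64(&handled, 1)
			}
		}()
	}
	for i := 0; i < events; i++ {
		queue <- i // blocks only when the buffer is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	fmt.Println(process(1000, 100, 5000)) // prints 5000
}
```

A longer buffer lets producers keep publishing through short consumer stalls, and more workers raise concurrency; making both configurable lets operators tune for their own hardware.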

Slide 42: The performance project

• https://github.com/sensu/sensu-perf
• Performance tests are reproducible
• Users can test their own deployments!
• Now part of release QA!

Slide 44: Multi-site Federation

• 40,000 Agents per cluster
• Run multiple/distributed Sensu Go clusters
• Centralized RBAC policy management
• Centralized visibility via the WebUI


Slide 54: Backend requirements

• 16 vCPUs
• 16GB memory
• Attached NVMe SSD: >50MB/s and >5k sustained random IOPS
• Gigabit Ethernet (low latency)

Slide 55: PostgreSQL requirements

• 16 vCPUs
• 16GB memory
• Attached NVMe SSD: >300MB/s and >5k sustained random IOPS
• 10 Gigabit Ethernet (low latency)
