Scaling Sensu Go

portertech
September 10, 2019

For over eight years, our community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year, and was designed to be more portable, easier and faster to deploy, and most importantly: more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.

Transcript

  1. Scaling Sensu Go, by Sean Porter, Co-founder & CTO.

  2. Who am I? • Creator of Sensu • Co-founder • CTO • @PorterTech

  3. Overview 1. How we 10X’d performance in 6 months 2. Deployment architectures 3. Hardware recommendations 4. Summary 5. Questions

  4. Goals for Sensu Go

  7. Scale, in terms of: • Performance • Organization

  8. GA: December 5th, 2018

  9. Scaling Sensu Core (1.X) • Steep learning curve • Requires RabbitMQ and Redis expertise • Capable of scaling*

  10. Scaling Sensu Core (1.X)

  11. Scaling Sensu Core (1.X)

  13. Step 1 - Instrument (see the profiling sketch below)
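
The deck doesn't show what was instrumented or how; as a purely illustrative sketch, a Go service like the Sensu backend can expose the runtime's built-in profiler so CPU, heap, and goroutine hotspots can be inspected under load (the port and setup here are assumptions, not Sensu's actual code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiles can then be pulled while the system is under load, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```
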

  14. Step 2 - Test environment • Used AWS EC2 • m5.2xlarge to i3.metal • Agent session load tool • Disappointing results (~5k Events/s) • Inconsistent

  15. Step 3 - Get serious

  16. Spent $10k on gaming hardware.

  18. Why bare metal? • Control • Consistency • Capacity

  20. Backend hardware • AMD Threadripper 2920X (12 cores, 3.5GHz) • Gigabyte X399 AORUS PRO • 16GB DDR4 2666MHz CL16 (2x 8GB) • Two Intel 660p Series M.2 PCIe 512GB SSDs • Intel Gigabit CT PCIe network card

  21. Agent hardware • AMD Threadripper 2990WX (32 cores, 3.0GHz) • Gigabyte X399 AORUS PRO • 32GB DDR4 2666MHz CL16 (4x 8GB) • Intel 660p Series M.2 PCIe 512GB SSD

  22. Network hardware • Two Ubiquiti UniFi 8-port 60W switches • Separate load-tool and data planes

  24. The first results • Consistently delivered disappointing results! ◦ Agents: 4,000 ◦ Checks: 8 at a 5s interval ◦ Events/s: 6,400 • Produced data!

  25. The first results • Identified several possible bottlenecks • Identified bugs while under load! • Began experimentation...

  26. The primary offender • Sensu Events! • ~95% of etcd write operations • Disabling Event persistence: 11,200 Events/s • etcd max database size (10GB*) • Needed to move the workload (see the datastore sketch below)
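
Moving the Event workload out of etcd is what became Sensu Go's PostgreSQL datastore. As a rough sketch, the backend can be pointed at PostgreSQL with a store resource along these lines (the name, DSN, and pool size are illustrative, and availability depends on your Sensu Go version and license):

```yaml
---
type: PostgresConfig
api_version: store/v1
metadata:
  name: postgres-events
spec:
  dsn: "postgresql://sensu:secret@10.0.0.2:5432/sensu_events"
  pool_size: 20
```

Created with something like `sensuctl create --file postgres.yaml`, after which Events are written to PostgreSQL instead of etcd.
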

  29. PostgreSQL hardware • AMD Threadripper 2920X (12 cores, 3.5GHz) • Gigabyte X399 AORUS PRO • 16GB DDR4 2666MHz CL16 (2x 8GB) • Two Intel 660p Series M.2 PCIe 512GB SSDs • Three Intel Gigabit CT PCIe network cards

  30. New results with PostgreSQL • Agents: 4,000 • Checks: 14 at a 5s interval • Events/s: 11,200 • Not good enough!

  31. PostgreSQL tuning • Multi-Version Concurrency Control • Many updates need aggressive auto-vacuuming! • vacuum_cost_delay = 10ms • vacuum_cost_limit = 10000 • autovacuum_naptime = 10s • autovacuum_vacuum_scale_factor = 0.05 • autovacuum_analyze_scale_factor = 0.025

  32. PostgreSQL tuning • Tune write-ahead logging • Reduce the number of disk writes • wal_sync_method = fdatasync • wal_writer_delay = 5000ms • max_wal_size = 5GB • min_wal_size = 1GB (both slides' settings are collected below)
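
Collected as they would appear in postgresql.conf, the settings from slides 31 and 32 look like this; they are starting points from the talk's benchmark rather than universal defaults:

```
# Aggressive autovacuum for an update-heavy Event workload (MVCC bloat control)
vacuum_cost_delay = 10ms
vacuum_cost_limit = 10000
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025

# Write-ahead log tuning to reduce the number of disk writes
wal_sync_method = fdatasync
wal_writer_delay = 5000ms
max_wal_size = 5GB
min_wal_size = 1GB
```
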
  33. A huge bug! • Buried the Check TTL switch set on every Event! • Additional etcd PUT and DELETE operations

  34. New results with bug fix • Agents: 4,000 • Checks: 40 at a 5s interval • Events/s: 32,000 • Much better! Still not good enough.

  35. Entity and silenced caches • Several etcd range (read) requests per Event • Caching reduced etcd range requests by 50% • No improvement to Event throughput :( (a minimal cache sketch follows)
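
The cache implementation itself isn't shown in the deck; the sketch below is a minimal, hypothetical read-through cache in Go illustrating the idea: repeated entity/silenced lookups are served from memory so each Event doesn't cost an etcd range request. A real implementation also needs invalidation (e.g. watching etcd for changes), which this omits.

```go
package main

import (
	"fmt"
	"sync"
)

// entityCache is a hypothetical read-through cache: hits skip etcd entirely,
// misses fall through to a fetch function and populate the cache.
type entityCache struct {
	mu    sync.RWMutex
	items map[string][]byte
}

func (c *entityCache) get(key string, fetch func(string) ([]byte, error)) ([]byte, error) {
	c.mu.RLock()
	v, ok := c.items[key]
	c.mu.RUnlock()
	if ok {
		return v, nil // cache hit: no etcd round trip
	}
	v, err := fetch(key) // cache miss: one etcd range request
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.items[key] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	c := &entityCache{items: map[string][]byte{}}
	v, _ := c.get("entity:i-424242", func(string) ([]byte, error) {
		return []byte(`{"entity_class":"agent"}`), nil // stand-in for an etcd read
	})
	fmt.Println(string(v))
}
```
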
  36. Serialization • Every object is serialized for transport and storage • Changed from JSON to Protobuf ◦ Applied to Agent transport and etcd store ◦ Reduced serialized object size! ◦ Less CPU time (comparison sketch below)
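
A self-contained way to see the JSON-versus-Protobuf trade-off from Go is sketched below. It uses protobuf's well-known Struct type so it runs without generated code; Sensu's actual generated message types, with numbered fields instead of key strings on the wire, shrink payloads and CPU cost further, which is the win this slide describes. All field names and values are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// An event-like payload; not Sensu's real Event schema.
	payload := map[string]interface{}{
		"entity":    "i-424242",
		"check":     "check_cpu",
		"status":    0,
		"timestamp": 1568073600,
	}

	jsonBytes, err := json.Marshal(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Encode the same payload as protobuf via the well-known Struct type.
	s, err := structpb.NewStruct(payload)
	if err != nil {
		log.Fatal(err)
	}
	protoBytes, err := proto.Marshal(s)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("json: %d bytes, protobuf: %d bytes\n", len(jsonBytes), len(protoBytes))
}
```
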
  37. Internal queues and workers • Increased Backend internal queue lengths ◦ from 100 to 1,000 (made configurable) • Increased Backend internal worker counts ◦ from 100 to 1,000 (made configurable) • Increases concurrency and absorbs latency spikes (see the sketch below)
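
This queue-and-worker pattern maps naturally onto Go's buffered channels. A minimal sketch with the talk's sizes (all names illustrative): the buffer absorbs bursts while the worker count bounds concurrency, and making both configurable lets operators trade memory for latency headroom.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Sizes mirror the change described on the slide (100 -> 1,000) and
	// would be configurable in practice.
	const queueLength, workerCount = 1000, 1000

	queue := make(chan string, queueLength) // buffered queue absorbs latency spikes
	var wg sync.WaitGroup

	// Pool of workers draining the internal queue concurrently.
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for event := range queue {
				_ = event // process the event (filter, mutate, persist, ...)
			}
		}()
	}

	// Producer side: enqueue work; blocks only when the buffer is full.
	for i := 0; i < 10000; i++ {
		queue <- fmt.Sprintf("event-%d", i)
	}
	close(queue)
	wg.Wait()
}
```
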
  38. New results • Agents: 36,000 • Checks: 38 at a 10s interval (4 subscriptions) • Events/s: 34,200 • Almost there!!!

  40. New results • Agents: 40,000 • Checks: 38 at a 10s interval (4 subscriptions) • Events/s: 38,000

  42. The performance project • https://github.com/sensu/sensu-perf • Performance tests are reproducible • Users can test their own deployments! • Now part of release QA!

  43. What’s next for scaling Sensu?

  44. Multi-site Federation • 40,000 Agents per cluster • Run multiple/distributed Sensu Go clusters • Centralized RBAC policy management • Centralized visibility via the WebUI

  45. Deployment architectures

  46.-52. (deployment architecture diagrams; images only)

  53. Hardware recommendations*

  54. Backend requirements • 16 vCPU • 16GB memory • Attached NVMe SSD ◦ >50MB/s and >5k sustained random IOPS • Gigabit Ethernet (low latency)

  55. PostgreSQL requirements • 16 vCPU • 16GB memory • Attached NVMe SSD ◦ >300MB/s and >5k sustained random IOPS • 10 Gigabit Ethernet (low latency)

  56. Summary


  58. Questions?