
Feedback on scalability and load testing of a configuration management software

🎥 https://www.youtube.com/watch?v=rQlLzVbXEXk
🧑 Nicolas Charles
📅 Configuration Management Camp 2020

RUDDER is an API/web application that lets users define and verify their configurations. Relying on agents on every system, it checks and remediates configurations every 5 minutes and centralizes the results of their application. Each result is made up of hundreds of events that are historized, and each configuration change involves recalculating and displaying configurations and compliance for users within a reasonable time.

A RUDDER instance can handle 20 000 nodes. Can you imagine what this implies from a network, CPU and storage point of view? How do you reach and maintain this level of performance? What were the steps that made it possible, and what tools were put in place?

This presentation will explain the technical stack used (Scala, PostgreSQL, C and Rust), as well as the path, the failures and the successes that today allow us to reproduce these environments, and to test and validate the hypotheses needed to achieve and maintain these results.

Rudder

February 04, 2020

Transcript

  1. • RUDDER is continuous configuration management software with a compliance focus • Configurations are defined with the GUI or API • Configuration is checked on every node every 5 minutes • RUDDER computes the compliance of every node RUDDER - Why scalability is a topic?
  2. RUDDER - Defining configurations [diagram]: Params, Rules, Directives (with their components), Groups, global RUDDER config (policy mode, schedule), Node properties (policy mode, schedule) and environmental context are combined into a per-node configuration (id, generation timestamp) written to files; the inputs and the generated node configurations are historized.
  3. RUDDER - Computing compliance [diagram]: on every node, the agent gets its policy, applies it, and sends configuration reports whose metadata carries the node id, config id, run timestamp and a signature (for integrity); on the server, expected reports (node id, config id, timestamp) are stored at generation time for each rule, directive and component, then compared with the run reports to compute compliance, which is historized.
  4. • Load testing is the process of putting demand on

    a system and measuring its response.[2] • Simulating real-world usage, and ensuring the system can sustain it. Load testing? [2] https://en.wikipedia.org/wiki/Load_testing
  5. • For SaaS systems, you manage the lifecycle of your application, so you can test for the scale expected over the coming months • For software deployed on users' infrastructures, upgrades are done during maintenance windows ◦ They need (long) migration tests ⇨ You have to think further ahead Testing at what scale??
  6. • A German car manufacturer manages 13 000 nodes with a 3-year-old RUDDER 3.1 instance, and 4 500 nodes with a 2-year-old RUDDER 4.1 instance • Need to consolidate on one instance • Expect 20 000 nodes this year How hard can it be? RUDDER - Testing at what scale?
  7. • RUDDER supports “old” OSes - RHEL 7, SLES 12 • Uses system dependencies, which can be antediluvian: PostgreSQL 9.2, rsyslog 8.24.0 ◦ Their performance can be problematic RUDDER - Some constraints
  8. RUDDER - architecture [diagram]: a Rudder Root Server exposing interfaces (CLI, Web UI, API) to users and applications, with the Rudder Engine + plugins handling compliance, configuration and inventory; below it, relays and nodes running the Rudder Agent.
  9. RUDDER - components [diagram]: the same architecture annotated with its building blocks: the engine in Scala, with PostgreSQL, OpenLDAP and rsyslog on the server side, relays, and the Rudder Agent on nodes.
  10. RUDDER - where could the bottlenecks be? [diagram]: the same architecture again, used to ask which of these components (server engine, databases, relays, agents, network) will limit scalability.
  11. • Testing every part of Rudder ◦ In isolation first, then all together ◦ Ramp up the load • Inventories • Set of groups, techniques, directives, rules • Reports from all nodes • Relevant logs and a way to analyze them Testing for 20 000 nodes - what is needed?
  12. • Script to generate inventories ◦ Template + Tsung, contributed by the community https://github.com/Normation/rudder-tools/tree/master/contrib/stress_suite ◦ Bash scripts with more flexibility: generated certificates, multiple templates, etc.; plus persistent data to generate reports https://github.com/Normation/rudder-tools/pull/534/files Testing for 20 000 nodes - Tooling?
  13. • Scripts to generate reports from created nodes ◦ Python script that gets expected reports from the database, then generates and sends reports over syslog https://github.com/Normation/rudder-tools/tree/master/contrib/load-database ◦ Python script that gets expected reports from the database, generates reports, signs them using the static data and sends them over HTTPS https://github.com/Normation/rudder-tools/pull/534/files Testing for 20 000 nodes - Tooling?
  14. • First test, with 6000 nodes • Should be easy,

    as earlier versions of RUDDER supported 13000 nodes… Testing for 20 000 nodes
  15. • First test, with 6000 nodes • Should be easy,

    as earlier versions of RUDDER supported 13000 nodes… • Added features in 5.0 have a significant cost in scalability :( Testing for 20 000 nodes
  16. • The most resource-intensive part a. Snapshot inventories, groups, directives and rules b. Historize data c. Convert them into per-node policy mappings d. Generate files with data and policies for the nodes e. Save expected reports f. Run post-generation hooks RUDDER - Policy generation
  17. • Extract timings ◦ Each step of Policy Generation is measured… but Done XXX in: 322 350 ms is clearly not precise enough ◦ Need more detailed timings on each part ▪ Aggregate timings over threads and components • Process timings ◦ Automate extracting statistics from thousands of measures (see the sketch below) ▪ 100 000 messages “mergeCompareByRules in x ms” RUDDER - Policy generation
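The slide above calls for automating the extraction of statistics from tens of thousands of timing log lines. A minimal sketch of such a post-processing step in Scala, assuming log lines contain a fragment of the form "someStep in 42 ms" (the exact log format and the file path argument are assumptions for illustration, not RUDDER's actual format):

```scala
import scala.io.Source

object TimingStats {
  // Matches hypothetical timing lines such as "... mergeCompareByRules in 42 ms"
  private val TimingLine = """.*\b(\w+) in (\d+) ms.*""".r

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    try {
      // Collect (step name, duration) pairs and group the durations per step
      val timings: Map[String, Seq[Long]] = source
        .getLines()
        .collect { case TimingLine(step, ms) => step -> ms.toLong }
        .toSeq
        .groupBy(_._1)
        .map { case (step, pairs) => step -> pairs.map(_._2) }

      // Print count / total / mean / max per step, largest total first
      timings.toSeq.sortBy { case (_, ds) => -ds.sum }.foreach { case (step, ds) =>
        println(f"$step%-30s count=${ds.size}%7d total=${ds.sum}%9d ms " +
          f"mean=${ds.sum.toDouble / ds.size}%9.1f ms max=${ds.max}%7d ms")
      }
    } finally source.close()
  }
}
```

Aggregating like this over the ~100 000 "mergeCompareByRules in x ms" messages turns an unreadable log into a handful of per-step statistics.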
  18. • Inspect slow parts ◦ Variable computation ◦ Database reads/writes

    ◦ I/O • Ensure that memory usage is reasonable • The core of RUDDER is written in Scala, so there is a lot of available tooling RUDDER - Policy generation
  19. • jmap (fast) ◦ histogram of all allocated objects, with their count and memory used ◦ dump of the JVM state RUDDER - JVM tooling
  20. • Eclipse Memory Analyzer (async) ◦ analyses dumps and suggests fixes ◦ doesn’t work great on large heaps RUDDER - JVM tooling
  21. • YourKit ◦ Instrument the JVM ◦ Measure object creation, method calls, memory used, etc ◦ Great detailed vision RUDDER - JVM tooling
  22. • YourKit ◦ Instrument the JVM ◦ Measure object creation, method calls, memory used, etc ◦ Great detailed vision of systems ◦ But massive slowdown of the application when activated RUDDER - JVM tooling
  23. • YourKit ◦ Instrument the JVM ◦ Measure object creation, method calls, memory used, etc ◦ Great detailed vision of systems ◦ But massive slowdown of the application when activated RUDDER - JVM tooling
  24. • Facts ◦ Too many objects are created with super short lifespans ▪ Strains the GC, increases CPU usage ◦ Some unnecessary computations ◦ Serializing and deserializing JSON is CPU and memory expensive ◦ I/O is super expensive RUDDER - Policy generation
  25. • Optimizations (see the sketch below) ◦ Use collection views when necessary ▪ Caution, sometimes views are hidden and cause leaks ◦ Fix some algorithms to remove unnecessary lookups ◦ Identify the necessary lifespan of big objects, to ensure they are not referenced longer than necessary ◦ Batch all requests to the database ▪ Lower memory usage ▪ Avoids big strings from PostgreSQL being deserialized in one go RUDDER - Policy generation
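Two of these optimizations, shown as a small self-contained Scala sketch (illustrative only, not RUDDER's actual code): a collection view avoids materializing intermediate collections but must be forced once the small result is computed, otherwise it silently keeps the large source alive; and lookups are issued in batches rather than one by one.

```scala
object OptimizationSketch {

  // Hypothetical lookup standing in for a query that would normally hit the database.
  def fetchByIds(ids: Seq[Int]): Map[Int, String] =
    ids.map(id => id -> s"row-$id").toMap

  def main(args: Array[String]): Unit = {
    val source = (1 to 1000000).toVector

    // A view chains map/filter lazily: no intermediate million-element vectors.
    // Caution: the view keeps a reference to `source`; force it (toVector) once
    // only the small result is needed, so the big collection can be collected.
    val smallResult: Vector[Int] =
      source.view.map(_ * 2).filter(_ % 100000 == 0).toVector

    // Batching: one lookup per 1000 ids instead of one lookup per id, and no
    // single huge result to hold and deserialize in one go.
    val ids = 1 to 10000
    val rows: Map[Int, String] =
      ids.grouped(1000).map(batch => fetchByIds(batch)).reduce(_ ++ _)

    println(s"${smallResult.size} results kept, ${rows.size} rows fetched in batches")
  }
}
```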
  26. • Python script to generate reports • 3000+ syslog messages/second

    • Catastrophic compliance RUDDER - Reports and compliance
  27. • Python script to generate reports • 3000+ syslog messages/second • Catastrophic compliance • netstat -suna ◦ 80% of messages are dropped... RUDDER - Reports and compliance
  28. • rsyslog inserts reports into PostgreSQL one by one ◦ Need to upgrade to 2019 versions • Increase the UDP buffer to handle the load • Increase the number of workers https://docs.rudder.io/reference/5.0/administration/performance.html#_increase_the_udp_buffer • Fixed the network part for now RUDDER - Reports and compliance
  29. • rsyslog inserts reports into PostgreSQL one by one ◦ Need to upgrade to 2019 versions • Increase the UDP buffer to handle the load • Increase the number of workers https://docs.rudder.io/reference/5.0/administration/performance.html#_increase_the_udp_buffer • Fixed the network part for now ◦ In 6.0, we dropped rsyslog and use HTTPS reports parsed by a Rust module RUDDER - Reports and compliance
  30. • Compliance computation can’t deal with all reports • Process for computation (sketched below) ◦ Fetch nodes’ last runs ◦ Get expected reports ◦ Get reports from these nodes ◦ Expand variables, and compute compliance • Depends on PostgreSQL • Computations and lookups on the Scala side RUDDER - Reports and compliance
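A minimal, self-contained sketch of the comparison at the heart of this process: matching run reports against expected reports per node and deriving a compliance ratio. The types and field names are illustrative assumptions, not RUDDER's actual data model, and variable expansion is left out:

```scala
object ComplianceSketch {
  // Illustrative data model: an expected report and a report received from a run.
  final case class Expected(nodeId: String, configId: String, component: String)
  final case class RunReport(nodeId: String, configId: String, component: String, status: String)

  // Compliance per node: share of expected (configId, component) pairs that were
  // reported as successful by that node's last run.
  def compliance(expected: Seq[Expected], runs: Seq[RunReport]): Map[String, Double] = {
    val okByNode: Map[String, Set[(String, String)]] =
      runs.filter(_.status == "success")
        .groupBy(_.nodeId)
        .map { case (node, rs) => node -> rs.map(r => (r.configId, r.component)).toSet }

    expected.groupBy(_.nodeId).map { case (node, exp) =>
      val ok      = okByNode.getOrElse(node, Set.empty)
      val matched = exp.count(e => ok.contains((e.configId, e.component)))
      node -> (matched.toDouble / exp.size * 100)
    }
  }

  def main(args: Array[String]): Unit = {
    val expected = Seq(Expected("node1", "cfg-42", "ntp"), Expected("node1", "cfg-42", "ssh"))
    val runs     = Seq(RunReport("node1", "cfg-42", "ntp", "success"))
    println(compliance(expected, runs)) // Map(node1 -> 50.0)
  }
}
```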
  31. • PgBadger ◦ Analyze PostgreSQL logs ◦ Very complete stats

    on database usage ▪ Top queries, most expensive queries, temp files RUDDER - Tooling for PostgreSQL
  32. • https://explain.depesz.com/ ◦ Explain query plan ◦ In an understandable

    way ◦ Help correct the queries RUDDER - Tooling for PostgreSQL
  33. • Give enough resources to PostgreSQL to handle the load ◦ shared_buffers ◦ work_mem ◦ maintenance_work_mem ◦ temp_buffers ◦ wal_buffers • Configure it based on the reality of the system (example below) ◦ random_page_cost is too large for SSDs by default ◦ effective_cache_size RUDDER - Configure PostgreSQL
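An illustrative postgresql.conf excerpt for this kind of sizing. The values below are generic community starting points assuming a dedicated host with around 32 GB of RAM and SSD storage; they are not RUDDER's official recommendations and must be adapted to the actual machine:

```
# postgresql.conf - illustrative values only, size them for the real host
shared_buffers = 8GB             # commonly ~25% of RAM
work_mem = 64MB                  # per sort/hash operation, per connection
maintenance_work_mem = 1GB       # vacuum, index creation
temp_buffers = 32MB              # per-session temporary tables
wal_buffers = 16MB               # write-ahead log buffers
effective_cache_size = 24GB      # how much the OS page cache is expected to hold
random_page_cost = 1.1           # default 4.0 assumes spinning disks; lower on SSD
```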
  34. • Optimizations in the Scala part ◦ Improvements in variable detection ◦ Lighter data structures RUDDER - Reports and compliance
  35. • Options to disable features to save resources ◦ Full historization of compliance in the database ◦ Historization of non-compliance reports ◦ Changes computation ◦ Recomputation of dynamic groups RUDDER - Feature flags
  36. • Load testing is time consuming ◦ 10 minutes to 12 hours for a relevant test ◦ Enormous logs ▪ Scripts to extract stats from these logs automatically ◦ Analyzing the results can be tedious ▪ Especially as I/O can be erratic (sometimes a 4x perf degradation) • Generating relevant data is hard ◦ A German car manufacturer gave us access to a system with their inventories to have real data - thank you! RUDDER - Load Testing Takeaway
  37. • Beware of intuitions ◦ Guesses and hypotheses are often wrong ▪ Need to be meticulous ◦ Some code may look bad, but turns out to be pretty efficient • You need to be persistent ◦ Only after looking many times at the same code did I see potential issues ◦ Some changes needed a lot of thinking and interaction with other developers ▪ Thank you for bearing with me!! RUDDER - Load Testing Takeaway
  38. • Sometimes, making things faster makes them slower ◦ Because the faster part starves other parts of resources ◦ Policy generation suffered from improved compliance computation • Libraries/dependencies have a real impact on performance ◦ Our JSON serializing/deserializing library is 8 years old, and really inefficient ◦ The templating engine library is not thread-safe ◦ Upgrading rsyslog, PostgreSQL and Java helped a lot RUDDER - Load Testing Takeaway
  39. • The JVM JIT optimizes a lot of stuff ◦ Some of our optimizations failed to change anything: the JIT had already optimized the hell out of that code ◦ Unit tests show massive perf improvements over successive execution iterations (see below) RUDDER - Load Testing Takeaway
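A tiny, self-contained way to observe this warm-up effect (an illustration, not one of RUDDER's actual tests): time the same work repeatedly inside one JVM and watch per-iteration times drop once the JIT has compiled the hot path, which is exactly why naive before/after timings in unit tests can overstate an optimization.

```scala
object JitWarmup {
  // Arbitrary CPU-bound work for the JIT to optimize.
  def work(n: Int): Long =
    (1 to n).foldLeft(0L)((acc, i) => acc + (i.toLong * i) % 7)

  def main(args: Array[String]): Unit = {
    for (iteration <- 1 to 10) {
      val start     = System.nanoTime()
      val result    = work(2000000)
      val elapsedMs = (System.nanoTime() - start) / 1000000
      // Early iterations run mostly interpreted code and are noticeably slower.
      println(s"iteration $iteration: $elapsedMs ms (result $result)")
    }
  }
}
```

For real measurements, a benchmarking harness such as JMH handles warm-up and iteration counting for you.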
  40. • Some changes are simply too large for a minor version ◦ The next optimizations can only fit in a major version: ▪ stream from/to the database ▪ rewrite algorithms ▪ change serialization/deserialization libraries ▪ change the templating engine ▪ generate policies node by node RUDDER - Load Testing Takeaway
  41. • Improving scalability at one point reveals the next bottlenecks ◦ It can feel like a never-ending task ◦ But they get smaller and smaller RUDDER - Load Testing Takeaway
  42. • Results ◦ 28 PRs for performance in the 3 latest minor versions of Rudder ◦ 30% improvement in memory usage ◦ 40% faster compliance computation RUDDER - Did it matter?
  43. • Policy generation time is linear in the number of nodes and components • Memory usage is linear-ish in the number of nodes and components • With a complex configuration (JS variables, many components per node): ◦ 4 500 nodes, 1.6 million components: 10 GB, 13 minutes for policy generation ◦ 10 000 nodes, 11 million components: 67 GB, 40 minutes ◦ 12 000 nodes, 13.2 million components: 82 GB, 48 minutes • With the actual German manufacturer configuration ◦ Full policy generation for 12 500 nodes in 7 min 40 s RUDDER - Did it matter?
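A quick sanity check of the "linear-ish" claim from these numbers: memory comes out at roughly 6 GB per million components in all three runs (10/1.6 ≈ 6.3, 67/11 ≈ 6.1, 82/13.2 ≈ 6.2), and the two larger runs both take about 3.6 minutes per million components.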
  44. • RUDDER is limited by available IO and Memory ◦

    It does make really good use of CPUs • With enough RAM, RUDDER supports 20k nodes RUDDER - Did it matter?
  45. • Test on platform with 1TB RAMDisk ◦ Lower cost

    of I/O • Automate all these tests and measures ◦ System is now stable enough to be automated RUDDER - What’s next?
  46. • Repeat on branch 6.0 ◦ Totally different underlying framework ◦ Code changed a lot between the two branches ◦ Many new features in 6.0 ◦ A lot of work has already been done, but we hope for better performance than in 5.0 RUDDER - What’s next?
  47. • Build with observability ◦ Gather data, measure evolution ◦ No need to add logs when an issue arises ◦ Helpful to identify performance regressions • Keep the libraries up to date ◦ Evaluate new libraries regularly ◦ Enormous cost of changing deprecated libraries ◦ Huge impact on performance and quality What could have been better?
  48. • Why not test with 20 000 real nodes? ◦ Mimicking the nodes gives pertinent insights ◦ It’s cost effective (the test platform costs ~100€/month) ◦ Setting up an infrastructure (storage, network, etc.) for 20 000 nodes is *really* complex and time consuming ▪ Even in the cloud ▪ And it needs to run for quite a long time The elephant in the room
  49. • Why not Gatling??? ◦ A lot of iterations and

    tests of tests ◦ Tests on several versions with different methods ◦ Need to consolidate The elephant in the room