Feedback on scalability and load testing of a configuration management software

Rudder
February 04, 2020

🎥 https://www.youtube.com/watch?v=rQlLzVbXEXk
🧑 Nicolas Charles
📅 Configuration Management Camp 2020

RUDDER is built around an API and web application that lets users define and verify their configurations. Relying on an agent on every system, it checks and remediates configurations every 5 minutes and centralizes the results. Each result is made up of hundreds of events that are historized, and each configuration change involves computing and displaying the configurations and compliance for users within a reasonable time.

A RUDDER instance can handle 20 000 nodes. Can you imagine what this implies from a network, CPU and storage point of view? How do you reach and maintain this level of performance? What steps made it possible? And what tools were put in place?

This presentation explains the technical stack used (Scala, PostgreSQL, C and Rust), as well as the path, failures and successes that today allow us to reproduce these environments, and to test and validate hypotheses in order to achieve and keep these results.

Transcript

  1. rudder.io
    Feedback on scalability and load
    testing of a configuration management
    software
    Nicolas CHARLES
    [email protected] - @nico_charles

  2. ● RUDDER is continuous configuration management software with a
    compliance focus
    ● Configurations are defined through the GUI or the API
    ● Configurations are checked on every node every 5 minutes
    ● RUDDER computes the compliance of every node
    RUDDER - Why scalability is a topic?
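
To make the scale concrete, here is a back-of-the-envelope sketch of the report rate a root server has to absorb. The 5-minute run interval and the 20 000-node target come from the talk; the figure of 200 reports per run is an assumption standing in for "hundreds of events", and the function name is hypothetical.

```python
def reports_per_second(nodes: int, reports_per_run: int, run_interval_s: int = 300) -> float:
    """Average number of reports per second arriving at the server."""
    return nodes * reports_per_run / run_interval_s


# 20 000 nodes, one run every 5 minutes, 200 reports per run (assumed value).
rate = reports_per_second(20_000, 200)
print(f"{rate:.0f} reports/s")
```

Under these assumptions the server receives on the order of ten thousand reports per second on average, before any burst effects.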

  3. RUDDER - Defining configurations

  4. RUDDER - Defining configurations
    [Diagram. Labels: PARAM; RULE (Id); DIRECTIVE (Id, Components); GROUP (Id); RUDDER config (global): Policy Mode, Schedule; NODE: Properties, Policy Mode, Schedule; Environmental context; Node configuration (Id, Generated, Files); Historization]

  5. RUDDER - Computing compliance
    [Diagram. Labels: Node configuration (Id, Generated, Files); Get Policy; RUN (Reports; METADATA: node id, config id, run timestamp; Signature); Send configuration reports; Run reports; Expected reports (node id, config id, timestamp); Store expected reports; Metadata, Integrity, Signature, Config Id; Historization; Compliance (historized); For Rule R, Directive D1, Component C, on every node]

  6. RUDDER - Compliance

  7. ● Load testing is the process of putting demand on a system and measuring its
    response.[2]
    ● Simulating real-world usage, and ensuring the system can sustain it.
    Load testing?
    [2] https://en.wikipedia.org/wiki/Load_testing

  8. ● For SaaS systems, you manage the lifecycle of your application, so you can
    test for the top of next month's expected scale.
    ● For software deployed on users' infrastructure, upgrades are done
    during maintenance windows
    ○ They need (long) migration tests
    ⇨ You have to think ahead
    Testing at what scale?

  9. ● A German car manufacturer manages 13 000 nodes with a 3-year-old
    RUDDER 3.1 instance, and 4 500 nodes with a 2-year-old RUDDER 4.1
    instance
    ● Need to consolidate on one instance
    ● Expect 20 000 nodes this year
    How hard can it be?
    RUDDER - Testing at what scale?

  10. ● RUDDER supports “old” OS - RHEL 7, SLES 12
    ● Uses system dependencies, which can be antediluvian: PostgreSQL 9.2,
    rsyslog 8.24.0
    ○ Their performance can be problematic
    RUDDER - Some constraints

  11. RUDDER - architecture
    [Diagram. Labels: Users and Applications; Interfaces (CLI, WEB UI, API); Rudder Root Server (Rudder Engine + Plugins: Compliance, Configuration, Inventory); Relay Node; Nodes, each with a Rudder Agent]

  12. RUDDER - components
    [Diagram. Same architecture as the previous slide, annotated with the components used: Scala, PostgreSQL, OpenLDAP, rsyslog]

  13. RUDDER - where could the bottlenecks be?
    [Diagram. Labels: Users and Applications; Interfaces (CLI, WEB UI, API); Rudder Root Server (Rudder Engine + Plugins: Compliance, Configuration, Inventory); Relay Node; Nodes, each with a Rudder Agent]

  14. ● Testing every part of Rudder
    ○ In isolation first, then all together
    ○ Ramp up the load
    ● Inventories
    ● Set of groups, techniques, directives, rules
    ● Reports from all nodes
    ● Relevant logs and a way to analyze them
    Testing for 20 000 nodes - what is needed?

  15. ● Script to generate inventories
    ○ Template + Tsung, contributed by community
    https://github.com/Normation/rudder-tools/tree/master/contrib/stress_suite
    ○ Bash scripts with more flexibility: generated certificates, multiple templates, etc; plus
    persistent data to generate reports
    https://github.com/Normation/rudder-tools/pull/534/files
    Testing for 20 000 nodes - Tooling?

  16. ● Script to generate reports from created nodes
    ○ Python script to get expected reports from the database, generate and send reports over
    syslog
    https://github.com/Normation/rudder-tools/tree/master/contrib/load-database
    ○ Python script to get expected reports from the database, generate reports, sign them using
    the static data and send them over HTTPS
    https://github.com/Normation/rudder-tools/pull/534/files
    Testing for 20 000 nodes - Tooling?
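
As an illustration of the syslog variant, here is a minimal sketch of how simulated reports could be pushed over UDP. It is not the script from the linked repository: the report field layout, the `rudder-loadtest` tag and all function names are hypothetical, and only the syslog priority computation (facility * 8 + severity) follows the syslog standard.

```python
import socket
from datetime import datetime, timezone


def build_report(node_id: str, config_id: str, rule_id: str, status: str, message: str) -> str:
    """Format one report line. The field layout is illustrative, not RUDDER's real format."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return f"{timestamp}|{node_id}|{config_id}|{rule_id}|{status}|{message}"


def to_syslog_datagram(report: str, facility: int = 16, severity: int = 6) -> bytes:
    """Prefix the report with a syslog priority value (facility * 8 + severity)."""
    priority = facility * 8 + severity
    return f"<{priority}>rudder-loadtest: {report}".encode("utf-8")


def send_reports(reports, host: str = "127.0.0.1", port: int = 514) -> int:
    """Send each report as one UDP datagram and return how many were sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for report in reports:
            sock.sendto(to_syslog_datagram(report), (host, port))
            sent += 1
    return sent
```

Because UDP gives no delivery feedback, the count returned by `send_reports` only says what left the generator; what the server actually received has to be measured on the server side.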

  18. ● First test, with 6000 nodes
    ● Should be easy, as earlier versions of RUDDER supported 13000 nodes…
    ● Added features in 5.0 have a significant cost in scalability :(
    Testing for 20 000 nodes

  19. ● The most resource intensive part
    a. Snapshot inventories, groups, directives and rules
    b. Historize data
    c. Convert them into policy mappings
    d. Generate files with data and policies for the nodes
    e. Save expected reports
    f. Run post-generation hooks
    RUDDER - Policy generation

  20. ● Extract timings
    ○ Each step of Policy Generation is measured… but
    Done XXX in: 322 350 ms
    is clearly not precise enough
    ○ Need more detailed timings for each part
    ■ Aggregate timings across threads and components
    ● Process timings
    ○ Automate extracting features from thousands of measurements
    ■ 100 000 messages like “mergeCompareByRules in x ms”
    RUDDER - Policy generation
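
A minimal sketch of the "process timings" step, assuming log lines that contain `<step> in <n> ms` as in the `mergeCompareByRules` message quoted above. The sample lines and the `otherStep` name are made up for illustration, and this is not the script used by the RUDDER team.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches messages such as "mergeCompareByRules in 12 ms".
TIMING = re.compile(r"(?P<step>\w+) in (?P<ms>\d+) ms")


def aggregate_timings(lines):
    """Group durations by step name and return count, total, mean and max per step."""
    durations = defaultdict(list)
    for line in lines:
        match = TIMING.search(line)
        if match:
            durations[match["step"]].append(int(match["ms"]))
    return {
        step: {"count": len(v), "total_ms": sum(v), "mean_ms": mean(v), "max_ms": max(v)}
        for step, v in durations.items()
    }


sample = [  # made-up log lines for illustration
    "DEBUG mergeCompareByRules in 12 ms",
    "DEBUG mergeCompareByRules in 30 ms",
    "DEBUG otherStep in 7 ms",
]
stats = aggregate_timings(sample)
print(stats["mergeCompareByRules"])
```

Summing per step is what turns 100 000 individually harmless messages into a visible total.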

  21. ● Inspect slow parts
    ○ Variable computation
    ○ Database reads/writes
    ○ I/O
    ● Ensure that memory usage is reasonable
    ● The core of RUDDER is written in Scala, so there is a lot of available tooling
    RUDDER - Policy generation

  22. ● jmap (fast)
    ○ histogram of all allocated objects, with their count and memory used
    ○ dump of JVM state
    RUDDER - JVM tooling

  23. ● Eclipse Memory Analyzer (async)
    ○ analyzes a heap dump and suggests fixes
    ○ doesn’t work well with large heaps
    RUDDER - JVM tooling

  25. ● YourKit
    ○ Instrument the JVM
    ○ Measure object creation, method calls, memory used, etc
    ○ Very detailed view of the system
    ○ But massive slowdown of the application when activated
    RUDDER - JVM tooling

  27. ● https://gceasy.io/
    ○ Analyze GC logs and extract metrics and health of GC
    RUDDER - JVM tooling

  28. ● Facts
    ○ Too many very short-lived objects are created
    ■ Strains the GC and increases CPU usage
    ○ Some unnecessary computations
    ○ Serializing and deserializing JSON is expensive in CPU and memory
    ○ I/O is super expensive
    RUDDER - Policy generation

  29. ● Optimizations
    ○ Use collection views when necessary
    ■ Caution, sometimes views are hidden and cause leaks
    ○ Fix some algorithms to remove unnecessary lookups
    ○ Identify the necessary lifespan of big objects, to ensure they are not referenced longer than
    necessary
    ○ Batch all requests to the database
    ■ Lower memory usage
    ■ Big strings from PostgreSQL that are deserialized in one go
    RUDDER - Policy generation
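
One possible reading of the batching bullet, sketched in Python rather than the Scala used in RUDDER: instead of loading every row in one go, ids are processed in fixed-size batches so that only one batch of results is materialized at a time. The `fetch` callable is a hypothetical stand-in for a database query, and the batch size of 500 is an arbitrary assumption.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fetch_in_batches(node_ids, fetch, size=500):
    """Call `fetch` once per batch of ids and merge the per-batch results."""
    results = {}
    for batch in batched(node_ids, size):
        results.update(fetch(batch))
    return results


# Usage with a fake fetch function that records how often it is called.
calls = []


def fake_fetch(ids):
    calls.append(len(ids))
    return {i: f"config-{i}" for i in ids}


configs = fetch_in_batches(range(1200), fake_fetch, size=500)
print(len(configs), calls)
```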

  31. ● Python script to generate reports
    ● 3000+ syslog messages/second
    ● Catastrophic compliance
    ● netstat -suna
    ○ 80% of messages are dropped...
    RUDDER - Reports and compliance

  33. ● rsyslog inserts reports into PostgreSQL one by one
    ○ Need to upgrade to 2019 versions
    ● Increase the UDP buffer to handle the load
    ● Increase the number of workers
    https://docs.rudder.io/reference/5.0/administration/performance.html#_increase_the_udp_buffer
    ● Fixed the network part for now
    ○ In 6.0, we dropped rsyslog in favor of HTTPS reports parsed by a Rust module
    RUDDER - Reports and compliance
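
To see why the UDP buffer matters, this small sketch asks the operating system for a larger receive buffer on a UDP socket and reads back what was actually granted. It only illustrates the socket-level mechanism and is not the rsyslog or kernel configuration described in the linked documentation; the 1 MiB request is an arbitrary assumption.

```python
import socket


def request_udp_receive_buffer(requested_bytes: int) -> int:
    """Ask the kernel for a larger UDP receive buffer and return the size actually granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


granted = request_udp_receive_buffer(1024 * 1024)
print(f"requested 1 MiB, kernel granted {granted} bytes")
```

On Linux the granted size is capped by the `net.core.rmem_max` sysctl, so a receiver can silently end up with a much smaller buffer than it asked for; datagrams arriving while that buffer is full are dropped.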

  34. ● Compliance computation can’t deal with all reports
    ● Process for computation
    ○ Fetch the nodes’ last runs
    ○ Get expected reports
    ○ Get reports from these nodes
    ○ Expand variables, and compute compliance
    ● Depends on PostgreSQL
    ● Computations and lookups in the Scala part
    RUDDER - Reports and compliance
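
The last step of this process can be sketched as a comparison between what was expected and what a run reported. This is a simplified illustration in Python, not RUDDER's Scala implementation: the status names (`success`, `error`, `missing`) and the data shapes are assumptions, and variable expansion is left out.

```python
from collections import Counter


def compute_compliance(expected, reports):
    """Compare expected components with received reports for one node run.

    expected: set of (rule, directive, component) tuples
    reports:  dict mapping the same tuples to a status string
    """
    statuses = Counter()
    for key in expected:
        # A component with no report at all is counted as "missing".
        statuses[reports.get(key, "missing")] += 1
    total = sum(statuses.values())
    percent = 100.0 * statuses["success"] / total if total else 0.0
    return percent, dict(statuses)


expected = {("R", "D1", "C1"), ("R", "D1", "C2"), ("R", "D2", "C1"), ("R", "D2", "C2")}
reports = {
    ("R", "D1", "C1"): "success",
    ("R", "D1", "C2"): "success",
    ("R", "D2", "C1"): "error",
}
percent, breakdown = compute_compliance(expected, reports)
print(percent, breakdown)
```

Each node run requires one such pass over all its expected components, which is why both the database lookups and the in-memory computation matter at 20 000 nodes.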

  35. ● PgBadger
    ○ Analyze PostgreSQL logs
    ○ Very complete stats on database usage
    ■ Top queries, most expensive queries, temp files
    RUDDER - Tooling for PostgreSQL

  36. ● https://explain.depesz.com/
    ○ Explains query plans
    ○ In an understandable way
    ○ Helps correct the queries
    RUDDER - Tooling for PostgreSQL

  37. ● Give enough resources to PostgreSQL to handle the load
    ○ shared_buffers
    ○ work_mem
    ○ maintenance_work_mem
    ○ temp_buffers
    ○ wal_buffers
    ● Configure it based on the reality of the system
    ○ the default random_page_cost is too high for SSDs
    ○ effective_cache_size
    RUDDER - Configure PostgreSQL
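
As a hedged illustration of "configure it based on the reality of the system", the sketch below derives starting values from the amount of RAM. The ratios are common rules of thumb (about 25% of RAM for `shared_buffers`, about 50% for `effective_cache_size`, `random_page_cost` near 1.1 on SSDs versus the default of 4.0), not values given in the talk, and the function name is hypothetical. They are starting points to refine with measurements such as PgBadger reports.

```python
def postgres_starting_points(ram_gb: float, ssd: bool = True) -> dict:
    """Return rough starting values to be refined with measurements."""
    return {
        "shared_buffers": f"{int(ram_gb * 1024 * 0.25)}MB",
        "effective_cache_size": f"{int(ram_gb * 1024 * 0.50)}MB",
        # 4.0 is PostgreSQL's default; lower values suit storage with cheap random reads.
        "random_page_cost": 1.1 if ssd else 4.0,
    }


for name, value in postgres_starting_points(64).items():
    print(f"{name} = {value}")
```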

  38. ● Optimizations in the Scala part
    ○ Improved variable detection
    ○ Lighter data structures
    RUDDER - Reports and compliance

  39. ● Options to disable features to save resources
    ○ Full historization of compliance in database
    ○ Historization of non-compliance reports
    ○ Changes computation
    ○ Recomputation of dynamic groups
    RUDDER - Feature flags

  40. ● Define timeouts, parallelism...
    RUDDER - Tunability

  41. ● Load testing is time consuming
    ○ 10 minutes to 12 hours for a relevant test
    ○ Enormous logs
    ■ Scripts to extract stats from these logs automatically
    ○ Analyzing the results can be tedious
    ■ Especially as I/O can be erratic (sometimes a 4x performance degradation)
    ● Generating relevant data is hard
    ○ A German car manufacturer gave us access to a system with their inventories to have real
    data - thank you!
    RUDDER - Load Testing Takeaway

  42. ● Beware of intuitions
    ○ Guesses and hypotheses are often wrong
    ■ Need to be meticulous
    ○ Some code may look bad, but turn out to be pretty efficient
    ● You need to be persistent
    ○ Only after looking many times at the same code did I see potential issues.
    ○ Some changes needed a lot of thinking and interaction with other developers
    ■ Thank you for bearing with me!!
    RUDDER - Load Testing Takeaway

  43. ● Sometimes, making things faster makes them slower
    ○ As they starve other resources
    ○ Policy generation suffered from improved compliance computation
    ● Libraries/dependencies have a real impact on performance
    ○ Our JSON serializing/deserializing library is 8 years old, and really inefficient
    ○ Templating engine library is not thread-safe
    ○ Upgrading rsyslog, PostgreSQL and Java helped a lot
    RUDDER - Load Testing Takeaway

  44. ● JVM JIT optimizes a lot of stuff
    ○ Some optimizations failed to change anything
    ○ The JIT had already optimized the hell out of it
    ○ Unit tests show massive perf improvement from one execution iteration to the next
    RUDDER - Load Testing Takeaway

  45. ● Some changes are simply too large for a minor version
    ○ The next optimizations can only fit in a major version:
    ■ stream from/to database
    ■ rewrite algorithms
    ■ change serialization/deserialization libraries
    ■ change templating engine
    ■ generate policies node by node
    RUDDER - Load Testing Takeaway

  46. ● Improving scalability at one point reveals the next bottlenecks
    ○ It can feel like a never-ending task
    ○ But they get smaller and smaller
    RUDDER - Load Testing Takeaway

  47. ● Results
    ○ 28 performance PRs across the 3 latest minor versions of Rudder
    ○ 30% improvement in memory usage
    ○ 40% faster compliance computation
    RUDDER - Did it matter?

  48. ● Policy generation time is linear in number of nodes and components
    ● Memory usage is linear-ish in number of nodes and components
    ● With complex configuration (JS variables, many components per node):
    ○ 4500 nodes, 1.6 million components: 10 GB, 13 minutes for policy generation
    ○ 10 000 nodes, 11 million components: 67 GB, 40 minutes
    ○ 12 000 nodes, 13.2 million components: 82 GB, 48 minutes
    ● With the actual German manufacturer configuration
    ○ Full policy generation for 12 500 nodes in 7 min 40 s
    RUDDER - Did it matter?
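
The linearity claims can be checked directly against the three measurements on this slide. The sketch below only divides the published figures; the ratio names are chosen here for illustration.

```python
measurements = [  # (nodes, components, memory_gb, minutes), taken from the slide
    (4_500, 1_600_000, 10, 13),
    (10_000, 11_000_000, 67, 40),
    (12_000, 13_200_000, 82, 48),
]


def per_unit(measure):
    """Normalize one measurement by component count and by node count."""
    nodes, components, memory_gb, minutes = measure
    return {
        "gb_per_million_components": memory_gb / (components / 1_000_000),
        "minutes_per_thousand_nodes": minutes / (nodes / 1_000),
    }


ratios = [per_unit(m) for m in measurements]
for (nodes, *_), r in zip(measurements, ratios):
    print(f"{nodes:>6} nodes: {r['gb_per_million_components']:.2f} GB per million components, "
          f"{r['minutes_per_thousand_nodes']:.2f} min per thousand nodes")
```

Memory stays between roughly 6.1 and 6.25 GB per million components across all three setups, and the two larger setups both take 4 minutes per thousand nodes, which is consistent with the linear behavior described above.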

  49. ● RUDDER is limited by available IO and Memory
    ○ It does make really good use of CPUs
    ● With enough RAM, RUDDER supports 20k nodes
    RUDDER - Did it matter?

  50. ● Test on platform with 1TB RAMDisk
    ○ Lower cost of I/O
    ● Automate all these tests and measures
    ○ System is now stable enough to be automated
    RUDDER - What’s next?

  51. ● Repeat on branch 6.0
    ○ Totally different underlying framework
    ○ Code changed a lot between the two branches
    ○ Many new features in 6.0
    ○ A lot of work has already been done, and we hope to get better performance than in 5.0
    RUDDER - What’s next?

  52. ● Build with observability
    ○ Gather data, measure evolution
    ○ No need to add logs when an issue arises
    ○ Helpful to identify performance regressions

    ● Keep the libraries up to date
    ○ Evaluate new libraries regularly
    ○ Enormous cost of changing deprecated libraries
    ○ Huge impact on performance and quality
    What could have been better?
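
A minimal sketch of what "build with observability" can look like in code: every named step records its call count and elapsed time in a shared, thread-safe registry, so the data is already there when a regression shows up. It is written in Python for illustration and is not RUDDER's implementation; the class and the step name are hypothetical.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class TimingRegistry:
    """Accumulate call counts and elapsed time per named step, across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = defaultdict(lambda: [0, 0.0])  # name -> [calls, seconds]

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                entry = self._totals[name]
                entry[0] += 1
                entry[1] += elapsed

    def snapshot(self):
        """Return a copy of the totals as {name: (calls, seconds)}."""
        with self._lock:
            return {name: (calls, seconds) for name, (calls, seconds) in self._totals.items()}


timings = TimingRegistry()
with timings.measure("save_expected_reports"):  # hypothetical step name
    sum(range(10_000))
print(timings.snapshot())
```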

  53. rudder.io
    Feedback on scalability and load testing of a configuration
    management software
    Questions ?

  54. ● Why not test with 20 000 real nodes?
    ○ Mimicking the nodes gives relevant insights
    ○ It’s cost-effective (the test platform costs ~100€/month)
    ○ Setting up an infra (storage, network, etc) for 20 000 nodes is *really* complex and
    time-consuming
    ■ Even in the cloud
    ■ And it needs to run for quite a long time
    The elephant in the room

  55. ● Why not Gatling???
    ○ A lot of iterations and tests of tests
    ○ Tests on several versions with different methods
    ○ Need to consolidate
    The elephant in the room
