Feedback on scalability and load testing of a configuration management software

rudder.io
Nicolas CHARLES - @nico_charles

RUDDER - Why is scalability a topic?
• RUDDER is a continuous configuration management software with a compliance focus
• Configurations are defined through the GUI or the API
• Configuration is checked on every node every 5 minutes
• RUDDER computes the compliance of every node

RUDDER - Defining configurations
[Diagram: inputs that make up a node configuration]
• PARAM
• RULE: Id
• DIRECTIVE: Id, (Components)
• GROUP: Id
• RUDDER config (global): Policy Mode, Schedule
• NODE: Properties, Policy Mode, Schedule
• Environmental context
• Node configuration: Id, Generated, Files
• Inputs and node configuration are both historized

RUDDER - Computing compliance
[Diagram: compliance flow, on every node]
• Node configuration: Id, Generated, Files
• Config Id: for Rule R, Directive D1, Component C
• Get Policy: the server stores the expected reports (node id, config id, timestamp), which are historized
• RUN: Reports, with METADATA (node id, config id, run timestamp, Signature)
• Send configuration reports: metadata is checked (Integrity, Signature)
• Run reports are compared with expected reports to produce the compliance (historized)

RUDDER - Compliance

Load testing?
• Load testing is the process of putting demand on a system and measuring its response. [2]
• Simulating real-world usage, and ensuring the system can sustain it.
[2] https://en.wikipedia.org/wiki/Load_testing

Testing at what scale?
• For SaaS systems, you manage the lifecycle of your application, so you can test for the peak scale expected in the coming months.
• For software deployed on users' infrastructures, upgrades are done during maintenance windows
◦ They need (long) migration tests
⇨ You have to think ahead

RUDDER - Testing at what scale?
• A German car manufacturer manages 13 000 nodes with a 3-year-old RUDDER 3.1 instance, and 4 500 nodes with a 2-year-old RUDDER 4.1 instance
• They need to consolidate on one instance
• They expect 20 000 nodes this year
How hard can it be?

RUDDER - Some constraints
• RUDDER supports “old” OSes: RHEL 7, SLES 12
• It uses system dependencies, which can be antediluvian: PostgreSQL 9.2, rsyslog 8.24.0
◦ Their performance can be problematic

RUDDER - Architecture
[Diagram]
• Users and Applications reach the Rudder Root Server through its Interfaces: CLI, Web UI, API
• Rudder Engine + Plugins: Compliance, Configuration, Inventory
• Relay
• Nodes, each running a Rudder Agent

RUDDER - Components
[Same architecture diagram, annotated with the underlying technologies: Scala, PostgreSQL, OpenLDAP, rsyslog]

RUDDER - Where could the bottlenecks be?
[Same architecture diagram]

Testing for 20 000 nodes - What is needed?
• Testing every part of Rudder
◦ In isolation first, then all together
◦ Ramp up the load
• Inventories
• A set of groups, techniques, directives and rules
• Reports from all nodes
• Relevant logs and a way to analyze them

Testing for 20 000 nodes - Tooling?
• Script to generate inventories
◦ Template + Tsung, contributed by the community
https://github.com/Normation/rudder-tools/tree/master/contrib/stress_suite
◦ Bash scripts with more flexibility: generated certificates, multiple templates, etc., plus persistent data to generate reports
https://github.com/Normation/rudder-tools/pull/534/files

Testing for 20 000 nodes - Tooling?
• Script to generate reports from created nodes
◦ Python script that gets expected reports from the database, then generates reports and sends them over syslog
https://github.com/Normation/rudder-tools/tree/master/contrib/load-database
◦ Python script that gets expected reports from the database, generates reports, signs them using the static data and sends them over HTTPS
https://github.com/Normation/rudder-tools/pull/534/files

Testing for 20 000 nodes
• First test, with 6000 nodes
• Should be easy, as earlier versions of RUDDER supported 13000 nodes…
• Features added in 5.0 have a significant scalability cost :(

RUDDER - Policy generation
• The most resource-intensive part
a. Snapshot inventories, groups, directives and rules
b. Historize data
c. Convert them into a policy mapping
d. Generate files with data and policies for the nodes
e. Save expected reports
f. Run post-generation hooks

RUDDER - Policy generation
• Extract timings
◦ Each step of policy generation is measured… but “Done XXX in: 322 350 ms” is clearly not precise enough
◦ Need more detailed timings on each part
▪ Aggregate timings over each thread and component
• Process timings
◦ Automate extracting features from thousands of measures
▪ 100 000 messages like “mergeCompareByRules in x ms”
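A minimal sketch of how such timing messages could be aggregated automatically. It assumes log lines shaped like "<step> in <n> ms"; the real RUDDER log format may differ, and the `buildNodeConfig` step name and thread prefixes in the sample are hypothetical.

```python
import re
from collections import defaultdict

# Assumed message shape: "<step name> in <duration> ms".
# This is an illustration, not the actual RUDDER log format.
TIMING_RE = re.compile(r"(?P<step>\w+) in (?P<ms>\d+) ms")


def aggregate_timings(lines):
    """Return {step: (count, total_ms, max_ms)} for every timing message found."""
    stats = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        match = TIMING_RE.search(line)
        if match is None:
            continue  # not a timing message
        entry = stats[match.group("step")]
        duration = int(match.group("ms"))
        entry[0] += 1
        entry[1] += duration
        entry[2] = max(entry[2], duration)
    return {step: tuple(values) for step, values in stats.items()}


# Hypothetical sample lines, standing in for thousands of real ones.
sample = [
    "[thread-1] mergeCompareByRules in 12 ms",
    "[thread-2] mergeCompareByRules in 30 ms",
    "[thread-1] buildNodeConfig in 7 ms",
    "unrelated log line",
]
summary = aggregate_timings(sample)
for step, (count, total, worst) in sorted(summary.items()):
    print(f"{step}: n={count} total={total} ms max={worst} ms")
```

Summing per step across threads gives the aggregate view, while the maximum helps spot outliers that a single "Done in" total hides.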

RUDDER - Policy generation
• Inspect slow parts
◦ Variable computation
◦ Database reads/writes
◦ I/O
• Ensure that memory usage is reasonable
• The core of RUDDER is written in Scala, so a lot of JVM tooling is available

RUDDER - JVM tooling
• jmap (fast)
◦ Histogram of all allocated objects, with memory used and instance count
◦ Dump of the JVM state

RUDDER - JVM tooling
• Eclipse Memory Analyzer (async)
◦ Analyzes a dump and suggests fixes
◦ Doesn’t work great for large heaps

RUDDER - JVM tooling
• YourKit
◦ Instruments the JVM
◦ Measures object creation, method calls, memory used, etc.
◦ Great detailed view of the system
◦ But massive slowdown of the application when activated

RUDDER - JVM tooling
• https://gceasy.io/
◦ Analyzes GC logs and extracts metrics and GC health

RUDDER - Policy generation
• Facts
◦ Too many objects are created with a very short life
▪ Strain on the GC, increased CPU usage
◦ Some unnecessary computations
◦ Serializing and deserializing JSON is CPU- and memory-expensive
◦ I/O is super expensive

RUDDER - Policy generation
• Optimizations
◦ Use collection views when necessary
▪ Caution: sometimes views are hidden and cause leaks
◦ Fix some algorithms to remove unnecessary lookups
◦ Identify the necessary lifespan of big objects, to ensure they are not referenced when not necessary
◦ Batch all requests to the database
▪ Lower memory usage
▪ Big strings from PostgreSQL that are deserialized in one go
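The batching idea can be sketched as follows. RUDDER's core is Scala, so this Python version only illustrates the general technique; `fetch_batch`, the batch size of 1000 and the `node-N` identifiers are assumptions made for the example.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fetch_in_batches(node_ids, fetch_batch, size=1000):
    """Run one query per batch of ids instead of one query for everything.

    `fetch_batch` is a hypothetical callable that takes a list of ids and
    returns a dict of results for exactly those ids.
    """
    results = {}
    for batch in batched(node_ids, size):
        results.update(fetch_batch(batch))
    return results


# Stand-in for a database call, recording how large each request was.
calls = []


def fake_fetch(batch):
    calls.append(len(batch))
    return {node_id: f"expected-reports-for-{node_id}" for node_id in batch}


all_reports = fetch_in_batches([f"node-{i}" for i in range(2500)], fake_fetch)
print(len(all_reports), calls)
```

Each request then returns a bounded amount of data, so only one batch worth of raw rows has to be held and deserialized at a time.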

RUDDER - Reports and compliance
• Python script to generate reports
• 3000+ syslog messages/second
• Catastrophic compliance
• netstat -suna
◦ 80% of messages are dropped...
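A rough sketch of how a drop ratio can be derived from UDP counters. It assumes the statistics layout printed by the Linux net-tools `netstat -su`; the counter values in `SAMPLE` are invented to mirror the 80% figure and are not measurements from the talk. The ratio is an approximation: receive errors divided by receive errors plus delivered packets.

```python
import re

# Invented sample in the layout of Linux `netstat -su` output.
SAMPLE = """\
Udp:
    2000 packets received
    0 packets to unknown port received
    8000 packet receive errors
    150 packets sent
    8000 receive buffer errors
    0 send buffer errors
"""


def udp_drop_ratio(netstat_output):
    """Approximate the share of incoming UDP packets that were dropped."""
    counters = {}
    in_udp_section = False
    for line in netstat_output.splitlines():
        if not line.startswith((" ", "\t")):
            # Section headers are not indented; only "Udp:" is of interest.
            in_udp_section = line.strip() == "Udp:"
            continue
        if in_udp_section:
            match = re.match(r"\s*(\d+) (.+)", line)
            if match:
                counters[match.group(2).strip()] = int(match.group(1))
    received = counters.get("packets received", 0)
    errors = counters.get("packet receive errors", 0)
    total = received + errors
    return errors / total if total else 0.0


ratio = udp_drop_ratio(SAMPLE)
print(f"dropped: {ratio:.0%}")
```

Comparing the counters before and after a test run, rather than reading absolute values, isolates the drops caused by the load test itself.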

RUDDER - Reports and compliance
• rsyslog inserts reports into PostgreSQL one by one
◦ Need to upgrade to 2019 versions
• Increase the UDP buffer to handle the load
• Increase the number of workers
https://docs.rudder.io/reference/5.0/administration/performance.html#_increase_the_udp_buffer
• This fixed the network part for now
◦ In 6.0, we dropped rsyslog in favor of HTTPS reports parsed by a Rust module

RUDDER - Reports and compliance
• Compliance computation can’t keep up with all the reports
• Process for computation
◦ Fetch the nodes’ last runs
◦ Get expected reports
◦ Get reports from these nodes
◦ Expand variables and compute compliance
• Depends on PostgreSQL
• Computations and lookups on the Scala side
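The last step of that process can be pictured with a small sketch. This is not RUDDER's algorithm: it assumes expected reports are keyed by (rule, directive, component), that each run report carries a single status string, and that the status names "success", "error" and "missing" are acceptable stand-ins. Variable expansion and unexpected reports are left out.

```python
from collections import Counter


def compute_compliance(expected, reports):
    """Return the percentage of expected components per status.

    expected: set of (rule, directive, component) keys the node must report on.
    reports:  dict mapping such keys to a status string from the last run.
    Components without a report are counted as "missing".
    """
    if not expected:
        return {}
    statuses = Counter(reports.get(key, "missing") for key in expected)
    return {
        status: 100.0 * count / len(expected)
        for status, count in statuses.items()
    }


# Hypothetical expected reports for one node and one configuration id.
expected = {
    ("R", "D1", "C1"),
    ("R", "D1", "C2"),
    ("R", "D2", "C1"),
    ("R", "D2", "C2"),
}
# Hypothetical run reports: one component never reported.
reports = {
    ("R", "D1", "C1"): "success",
    ("R", "D1", "C2"): "success",
    ("R", "D2", "C1"): "error",
}
compliance = compute_compliance(expected, reports)
print(compliance)
```

Even in this reduced form, the cost grows with the number of nodes times the number of components, which is why both the database lookups and the in-memory computation matter.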

RUDDER - Tooling for PostgreSQL
• pgBadger
◦ Analyzes PostgreSQL logs
◦ Very complete stats on database usage
▪ Top queries, most expensive queries, temp files

RUDDER - Tooling for PostgreSQL
• https://explain.depesz.com/
◦ Explains query plans in an understandable way
◦ Helps correct the queries

RUDDER - Configure PostgreSQL
• Give enough resources to PostgreSQL to handle the load
◦ shared_buffers
◦ work_mem
◦ maintenance_work_mem
◦ temp_buffers
◦ wal_buffers
• Configure it based on the reality of the system
◦ random_page_cost is too large for SSDs by default
◦ effective_cache_size

RUDDER - Reports and compliance
• Optimizations in the Scala part
◦ Improved variable detection
◦ Lighter data structures

RUDDER - Feature flags
• Options to disable features to save resources
◦ Full historization of compliance in the database
◦ Historization of non-compliance reports
◦ Changes computation
◦ Recomputation of dynamic groups

RUDDER - Tunability
• Define timeouts, parallelism...

RUDDER - Load Testing Takeaway
• Load testing is time-consuming
◦ 10 minutes to 12 hours for a relevant test
◦ Enormous logs
▪ Scripts to extract stats from these logs automatically
◦ Analyzing the results can be tedious
▪ Especially as I/O can be erratic (sometimes a 4x performance degradation)
• Generating relevant data is hard
◦ A German car manufacturer gave us access to a system with their inventories to have real data - thank you!

RUDDER - Load Testing Takeaway
• Beware of intuitions
◦ Guesses and hypotheses are often wrong
▪ Need to be meticulous
◦ Some code may look bad, but turn out to be pretty efficient
• You need to be persistent
◦ Only after looking many times at the same code did I see potential issues
◦ Some changes needed a lot of thinking and interaction with other developers
▪ Thank you for bearing with me!!

RUDDER - Load Testing Takeaway
• Sometimes, making things faster makes them slower
◦ As they starve other resources
◦ Policy generation suffered from improved compliance computation
• Libraries/dependencies have a real impact on performance
◦ Our JSON serialization/deserialization library is 8 years old, and really inefficient
◦ The templating engine library is not thread-safe
◦ Upgrading rsyslog, PostgreSQL and Java helped a lot

RUDDER - Load Testing Takeaway
• The JVM JIT optimizes a lot of stuff
◦ Some optimizations failed to change anything
◦ The JIT optimized the hell out of it
◦ Unit tests show massive performance improvements with each execution iteration

RUDDER - Load Testing Takeaway
• Some changes are simply too large for a minor version
◦ The next optimizations can only fit in a major version:
▪ stream from/to the database
▪ rewrite algorithms
▪ change serialization/deserialization libraries
▪ change the templating engine
▪ generate policies node by node

RUDDER - Load Testing Takeaway
• Improving scalability at one point reveals the next bottlenecks
◦ It can feel like a never-ending task
◦ But they get smaller and smaller

RUDDER - Did it matter?
• Results
◦ 28 PRs for performance in the 3 latest minor versions of Rudder
◦ 30% improvement in memory usage
◦ 40% faster compliance computation

RUDDER - Did it matter?
• Policy generation time is linear in the number of nodes and components
• Memory usage is linear-ish in the number of nodes and components
• With complex configuration (JS variables, many components per node):
◦ 4 500 nodes, 1.6 million components: 10 GB, 13 minutes for policy generation
◦ 10 000 nodes, 11 million components: 67 GB, 40 minutes
◦ 12 000 nodes, 13.2 million components: 82 GB, 48 minutes
• With the actual German manufacturer configuration
◦ Full policy generation for 12 500 nodes in 7m40
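The scaling claim can be checked with simple arithmetic on the three complex-configuration measurements. The snippet below only divides the figures from the slide by the number of components; the function name and the per-million normalization are choices made for this illustration.

```python
# (nodes, components, memory in GB, policy generation time in minutes),
# copied from the complex-configuration measurements on the slide.
measurements = [
    (4_500, 1_600_000, 10, 13),
    (10_000, 11_000_000, 67, 40),
    (12_000, 13_200_000, 82, 48),
]


def per_million_components(rows):
    """Return (nodes, GB per million components, minutes per million components)."""
    result = []
    for nodes, components, memory_gb, minutes in rows:
        millions = components / 1_000_000
        result.append((nodes, memory_gb / millions, minutes / millions))
    return result


ratios = per_million_components(measurements)
for nodes, gb_per_million, min_per_million in ratios:
    print(
        f"{nodes} nodes: {gb_per_million:.2f} GB and "
        f"{min_per_million:.2f} min per million components"
    )
```

Memory stays between roughly 6.1 and 6.25 GB per million components across the three runs. Generation time is about 3.6 minutes per million components for the two larger runs and about 8.1 for the smallest one; the arithmetic alone does not say why the smallest run differs.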

RUDDER - Did it matter?
• RUDDER is limited by available I/O and memory
◦ It does make really good use of CPUs
• With enough RAM, RUDDER supports 20k nodes

RUDDER - What’s next?
• Test on a platform with a 1 TB RAM disk
◦ Lower cost of I/O
• Automate all these tests and measures
◦ The system is now stable enough to be automated

RUDDER - What’s next?
• Repeat on branch 6.0
◦ Totally different underlying framework
◦ Code changed a lot between the two branches
◦ Many new features in 6.0
◦ A lot of work has already been done, and we hope to have better performance than in 5.0

What could have been better?
• Build with observability
◦ Gather data, measure evolution
◦ No need to add logs when an issue arises
◦ Helpful to identify performance regressions
• Keep the libraries up to date
◦ Evaluate new libraries regularly
◦ Enormous cost of changing deprecated libraries
◦ Huge impact on performance and quality

Questions?

The elephant in the room
• Why not test with 20 000 real nodes?
◦ Mimicking the nodes gives pertinent insights
◦ It’s cost-effective (the test platform costs ~100€/month)
◦ Setting up an infrastructure (storage, network, etc.) for 20 000 nodes is *really* complex and time-consuming
▪ Even in the cloud
▪ And it needs to run for quite a long time

The elephant in the room
• Why not Gatling???
◦ A lot of iterations and tests of tests
◦ Tests on several versions with different methods
◦ Need to consolidate
