
Testing Distributed Systems with Fuzzy Monkey Testing

One of the keys to good software is good testing. There are well-known test frameworks for back-end code, such as JUnit and py.test, and good front-end testing tools, such as Selenium. For testing distributed systems there are far fewer well-known tools, because the problem is quite different, and harder. These slides cover the “Fuzzy Monkey” methodology used for testing three different successful distributed systems (including the Assimilation Suite): its history, and how and why it works.


Alan Robertson

October 03, 2016


  1. Resilience-Testing Distributed Systems with “Fuzzy Monkey” testing
    #AssimProj @OSSAlanR
    Alan Robertson, Assimilation Systems Limited
  2. Biography
    • 35+ years in IT/development
      – 10 years in system management (SysAdmin)
    • Founded Linux-HA project, led 1998-2007
      – aka “Heartbeat”, now called Pacemaker
    • Founded Assimilation Project in 2010
    • Founded Assimilation Systems Limited in 2013
    • Alumnus of Bell Labs, SuSE, IBM
  3. What Is Fuzzy Monkey Testing?
    • A method of testing distributed systems
    • Specializes in resilience testing
      – testing for robustness in the presence of failures
  4. Fuzzy Monkey History
    • First conceived in Fall of 2001
    • Initially implemented by CTS as part of the Linux-HA project
      – CTS == Cluster Testing System
    • Continued into the Pacemaker Project
    • CTS adopted by Corosync
    • Fuzzy Monkey method re-implemented for Assimilation Project in 2014
    • Came up with Fuzzy Monkey name in May 2016
  5. Why create a unique testing method?
    • Testing distributed systems is hard
    • Manual testing is rarely successful
    • No good tools out there
    • Eliminate embarrassment:
      – I was tired of having egg on my face when I put out a release of Linux-HA with bugs that I should have caught
      – I hated doing manual testing, and I was bad at it
  6. Why is Automated Testing Important?
    • Automated testing speeds up product releases
    • Automated testing available to developers decreases end-user-visible bugs
    • Continuous Integration needs automated testing
    • Continuous Deployment requires automated testing
    Modern software development cannot rely on manual testing
  7. How do normal automated tests work?
    • Fixed list of tests which do fixed things
    • Tests are typically synchronous
    • Each test tests one thing
      – often just calls a function and looks at the result
    • Each test expects one correct answer
    • When tests complete, they leave things as they were when they started
    • It’s easy to tell when a test is complete: when the function returns
    • Tests are not subject to timing problems
    • Testing is complete when the last test completes
  8. Why is Automated Distributed System Testing Hard?
    • Tests are always asynchronous
    • If you run the same test twice, you might get two different correct answers
    • The results of the test depend on the current distributed configuration
    • Events happen when they do, and are observed at a random time later
    • The tests are specifically trying to create timing issues
    • You need randomness in the tests to make it more likely you hit all the timing windows
    • It’s hard to tell when a test is “done”
    • It’s hard to know when you’ve tested enough
  9. What does this mean for Distributed testing?
    • Tests need to be run multiple times
    • Tests need to be selected at random
    • Tests need to include randomness in them
    • Some tests will deliberately make configuration changes
    • Tests need to expect all the possible correct outcomes
    • Tests need to allow for the fact that things might not happen in a particular order
    • Often need white box testing to aim at timing windows
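
The first few bullets on this slide can be sketched in a few lines of Python. This is only an illustration, not code from CTS or the Assimilation test suite: `Scenario`, `plan_run`, `outcome_is_correct`, and the node and outcome names are all hypothetical. It shows tests and target systems being chosen at random from a seeded generator (so a failing run can be replayed), with each test listing every outcome it accepts as correct.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Scenario:
    """A hypothetical test description: a name plus every acceptable outcome."""
    name: str
    acceptable: FrozenSet[str]


def plan_run(scenarios: Sequence[Scenario], nodes: Sequence[str],
             iterations: int, seed: int) -> List[Tuple[str, str]]:
    """Pick (test name, target node) pairs at random, with replacement."""
    rng = random.Random(seed)  # private generator: the seed alone fixes the plan
    plan = []
    for _ in range(iterations):
        scenario = rng.choice(scenarios)  # the same test may be picked many times
        node = rng.choice(nodes)          # each run targets a random system
        plan.append((scenario.name, node))
    return plan


def outcome_is_correct(scenario: Scenario, observed: str) -> bool:
    """A run passes if it produced any one of the acceptable outcomes."""
    return observed in scenario.acceptable


if __name__ == "__main__":
    scenarios = [
        Scenario("stop_node", frozenset({"node_down"})),
        Scenario("restart_node", frozenset({"node_up", "node_up_after_retry"})),
    ]
    for step in plan_run(scenarios, ["n1", "n2", "n3"], iterations=5, seed=2016):
        print(step)
```

Recording the seed alongside the results is one simple way to make a randomized run repeatable when a failure needs to be investigated.
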
  10. How does “Fuzzy Monkey” testing work?
    • Direct all syslogs to a central machine (using TCP)
    • Key results come through syslog to the central machine
    • Tests exercise the systems over ssh or similar
    • Each test typically picks one or more systems at random to use in the test
    • Each test expects certain syslog messages indicating success, allowing for possible alternatives and typically ignoring ordering
    • “Oh Darn!” messages cause tests to be marked as failed
    • Each test must wait for systems to stabilize before marking it complete
    • Audits of system state are performed after each test
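
The log-matching step described above might look roughly like the following. It is a minimal sketch under stated assumptions, not the LogWatcher class from CTS: the function name `scan_log`, the log line format, and the patterns are invented for illustration. Each expectation is a list of alternative regular expressions, expectations may be satisfied in any order, and any line matching a failure pattern is collected.

```python
import re
from typing import Iterable, List, Sequence, Tuple


def scan_log(lines: Iterable[str],
             expected: Sequence[Sequence[str]],
             bad_patterns: Sequence[str]) -> Tuple[List[List[str]], List[str]]:
    """Return (unmet expectations, lines that matched a failure pattern).

    `expected` is a list of expectations; each expectation is a list of
    alternative regexes, any one of which satisfies it. Order is ignored.
    """
    pending = [[re.compile(p) for p in alternatives] for alternatives in expected]
    bad = [re.compile(p) for p in bad_patterns]
    bad_lines: List[str] = []
    for line in lines:
        if any(b.search(line) for b in bad):
            bad_lines.append(line)
            continue
        for i, alternatives in enumerate(pending):
            if any(a.search(line) for a in alternatives):
                del pending[i]  # this expectation is satisfied
                break           # one line satisfies at most one expectation
    unmet = [[a.pattern for a in alternatives] for alternatives in pending]
    return unmet, bad_lines


if __name__ == "__main__":
    # Hypothetical central-syslog excerpt; note the lines arrive "out of order".
    log = [
        "node2 app: INFO: node3 joined",
        "node1 app: INFO: node3 marked dead",
    ]
    wanted = [[r"node3 marked dead", r"node3 timed out"], [r"node3 joined"]]
    print(scan_log(log, wanted, [r"ERROR:", r"CRIT:"]))
```

A test passes this step only when the first returned list and the second are both empty.
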
  11. When does a test succeed?
    • It finds all the regular expressions it expects in syslog before the timeout
    • It doesn’t find any “Oh Darn!” messages
    • It passes the post-test system sanity audit
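
The three success conditions can be combined into a single verdict. The sketch below is an assumption-laden illustration rather than the actual implementation: `judge` is a hypothetical name, log events are modeled as `(timestamp, line)` pairs sorted by time, and the audit is any callable returning `True` or `False`.

```python
import re
from typing import Callable, Iterable, Sequence, Tuple


def judge(events: Iterable[Tuple[float, str]], start: float, timeout: float,
          expected: Sequence[str], bad_patterns: Sequence[str],
          audit: Callable[[], bool]) -> Tuple[bool, str]:
    """Return (passed, reason) for one test run.

    Assumes `events` are (timestamp, log line) pairs in time order.
    """
    deadline = start + timeout
    remaining = [re.compile(p) for p in expected]
    bad = [re.compile(p) for p in bad_patterns]
    for stamp, line in events:
        if stamp > deadline:
            break  # anything later arrived too late to count
        if any(b.search(line) for b in bad):
            return False, "bad message: " + line
        # Drop every expected pattern this line satisfies.
        remaining = [r for r in remaining if not r.search(line)]
    if remaining:
        missing = ", ".join(r.pattern for r in remaining)
        return False, "timed out waiting for: " + missing
    if not audit():
        return False, "post-test audit failed"
    return True, "ok"


if __name__ == "__main__":
    events = [(1.0, "svc started on node1"), (2.5, "svc started on node2")]
    wanted = [r"started on node1", r"started on node2"]
    print(judge(events, 0.0, 5.0, wanted, [r"ERROR"], lambda: True))
```

Returning a reason string along with the verdict makes it easier to tell a timing problem (a missing message) from a defensive-check failure or a failed audit.
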
  12. How could this possibly work?
    Doesn’t seem likely, does it?
  13. What does this require from applications?
    • They have to be willing to log important things to syslog
    • They need to be defensive, and output “Oh Darn!” messages
    • They need to be consistent in their “Oh Darn!” messages
    • You will likely have to log a few more things than originally planned
    • You need to know what kinds of things are likely to stress the application
    • You need to know what kinds of sanity audits you can perform quickly, and which others you may only run infrequently
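
On the application side, consistency can be as simple as giving every defensive failure message the same severity tag, so that one regular expression catches them all. The following sketch uses Python's standard `logging` module; the format string, the `OH_DARN` pattern, and the helpers `make_logger` and `expect_state` are assumptions made for illustration, not conventions taken from Linux-HA or Assimilation. It writes to an in-memory stream so it runs anywhere; a real deployment could attach `logging.handlers.SysLogHandler` instead.

```python
import io
import logging
import re

# One pattern is enough because every failure message carries the same tag.
OH_DARN = re.compile(r"\b(ERROR|CRITICAL): ")


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger whose lines always read 'name: LEVEL: message'."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s: %(message)s"))
    logger.handlers = [handler]  # replace any handlers from earlier calls
    return logger


def expect_state(logger: logging.Logger, what: str, expected, actual) -> bool:
    """Defensive check: complain loudly instead of continuing silently."""
    if expected != actual:
        logger.error("%s: expected %r, found %r", what, expected, actual)
        return False
    return True


if __name__ == "__main__":
    buf = io.StringIO()
    log = make_logger("demo_app", buf)
    log.info("node3 joined")               # an "important thing" worth logging
    expect_state(log, "ring size", 3, 2)   # emits a tagged failure message
    print(buf.getvalue(), end="")
```
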
  14. How does this work out?
    • Basically, once we test a certain kind of interaction and pass the test, it never fails in the field
    • If used with good unit testing, you likely need dozens of tests, not hundreds
    • Expect your product to work more reliably than any competitor’s
      – Red Hat left the business of producing their own unique HA solution
    • Expect to learn things about how the application works, maybe some you didn’t know even as author
    • Expect that some tests will need tweaking to make sure they’re really “done” before going to the next test
    • Syslog regex testing can be a little fragile
      – you will have to update your tests as the code changes
    • If you use Docker, you may run into Docker bugs or unexpected interactions, which will change over time...
    • These kinds of tests can take days to run
  15. Why “Re-Invent” CTS for Assimilation?
    • CTS assumed you set up the syslog and real/virtual machines “somehow”
      – Distributing software and setting up syslog correctly is a pain
      – In Assimilation system testing, that’s all automated
      – We used Docker, which is lower overhead than virtual machines and designed from the ground up for automation
    • I added test-specific post-queries to validate the database
      – in CTS all post-query audits were identical
    • I was concerned about intellectual property issues
    • I did reuse (and improve) the LogWatcher class from CTS
  16. Why call it “Fuzzy Monkey” Testing?
    • It’s a tribute to two other related testing methods
      – Fuzz testing: having tests not always give the same input
      – Chaos Monkey testing: which also tries to break the system, in production (we predate Chaos Monkey by several years)
    • I liked it :-D
    • It’s unrelated to the vodka-based cocktail ;-)
  17. Where to find this online?
    • ted-systems-with-fuzzy-monkey-testing/
    • OR
    • Assimilation Source Code:
      – /tree/master/cma/systemtests
    • These Slides:
    © 2015 Assimilation Systems Limited
  18. Get Involved (Assimilation Project)!
    • Get Assimilated!
    • Contribute!
      – Users: give it a try
      – Security best practice experts
      – Testers, System Management, Continuous Integration
      – Designers
      – Developers (C, Python, Shell, PowerShell, JavaScript)
      – Porters (esp. Windows)
      – Promoters, Publicists, Packagers, etc.
  19. Resistance Is Futile!
    These slides:
    Mailing List:
    @OSSAlanR
    #assimilation on
    Assimilation Web Site:
    Company Web Site:
    Download: