Slide 1

Slide 1 text

Resilience-Testing Distributed Systems with “Fuzzy Monkey” testing
#AssimProj @OSSAlanR
Alan Robertson
Assimilation Systems Limited
owasp.org/index.php/OWASP_Assimilation_Project
http://AssimilationSystems.com

Slide 2

Slide 2 text

Biography
● 35+ years in IT/development
  – 10 years in system management (SysAdmin)
● Founded Linux-HA project - led 1998-2007
  – aka “Heartbeat” - now called Pacemaker
● Founded Assimilation Project in 2010
● Founded Assimilation Systems Limited in 2013
● Alumnus of Bell Labs, SuSE, IBM

Slide 3

Slide 3 text

What Is Fuzzy Monkey Testing?
● A method of testing distributed systems
● Specializes in resilience testing
  – testing for robustness in the presence of failures

Slide 4

Slide 4 text

Fuzzy Monkey History
● First conceived in Fall of 2001
● Initially implemented by CTS as part of the Linux-HA project
  – CTS == Cluster Testing System
● Continued into the Pacemaker Project
● CTS adopted by Corosync
● Fuzzy Monkey method re-implemented for Assimilation Project in 2014
● Came up with Fuzzy Monkey name in May 2016

Slide 5

Slide 5 text

Why create a unique testing method?
● Testing distributed systems is hard
● Manual testing is rarely successful
● No good tools out there
● Eliminate Embarrassment:
  – I was tired of having egg on my face when I put out a release of Linux-HA with bugs that I should have caught
  – I hated doing manual testing – and I was bad at it

Slide 6

Slide 6 text

Why is Automated Testing Important?
● Automated testing speeds up product releases
● Automated testing available to developers decreases end-user-visible bugs
● Continuous Integration needs automated testing
● Continuous Deployment requires automated testing
Modern software development cannot rely on manual testing

Slide 7

Slide 7 text

How do normal automated tests work?
● Fixed list of tests which do fixed things
● Tests are typically synchronous
● Each test tests one thing
  – often just calls a function and looks at the result
● Each test expects one correct answer
● When tests complete, they leave things like when they started
● It’s easy to tell when a test is complete
  – when the function returns
● Tests are not subject to timing problems
● Tests are complete when the last test completes

Slide 8

Slide 8 text

Why is Automated Distributed System Testing Hard?
● Tests are always asynchronous
● If you run the same test twice, you might get two different correct answers
● The results of the test depend on the current distributed configuration
● Events happen when they do, and are observed at a random time later
● The tests are specifically trying to create timing issues
● You need randomness in the tests to make it more likely you hit all the timing windows
● It’s hard to tell when a test is “done”
● It’s hard to know when you’ve tested enough

Slide 9

Slide 9 text

What does this mean for Distributed testing?
● Tests need to be run multiple times
● Tests need to be selected at random
● Tests need to include randomness in them
● Some tests will deliberately make configuration changes
● Tests need to expect all the correct possible outcomes
● Tests need to allow for the fact that things might not happen in a particular order
● Often need white box testing to aim at timing windows
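
The first three points above can be pictured as a driver loop that picks tests at random from a seeded generator, so that a failing sequence can be replayed. This is only a minimal Python sketch: the test functions, `run_random_tests`, and the seed handling are illustrative assumptions, not code from CTS or the Assimilation Project.

```python
import random


def start_node(rng, nodes):
    """Hypothetical test: pick one node at random and 'start' it."""
    return ("start", rng.choice(nodes))


def stop_node(rng, nodes):
    """Hypothetical test: pick one node at random and 'stop' it."""
    return ("stop", rng.choice(nodes))


def run_random_tests(tests, nodes, iterations, seed):
    """Run `iterations` tests, each one chosen at random.

    A fixed seed makes the random sequence repeatable, which helps
    when trying to reproduce a failure.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(iterations):
        test = rng.choice(tests)          # random test selection
        history.append(test(rng, nodes))  # randomness inside the test too
    return history


if __name__ == "__main__":
    plan = run_random_tests(
        [start_node, stop_node], ["node1", "node2", "node3"], 10, seed=42
    )
    for action, node in plan:
        print(action, node)
```

In a real harness each test would act on live systems (for example over ssh) rather than return a tuple; the tuple only keeps the sketch self-contained.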

Slide 10

Slide 10 text

How does “Fuzzy Monkey” testing work?
● Direct all syslogs to a central machine (using TCP)
● Key results come through syslog to central machine
● Tests exercise the systems over ssh or similar
● Each test typically randomly picks one or more systems to use in the test
● Each test expects certain syslog messages indicating success, allowing for possible alternatives, and typically ignoring ordering
● “Oh, darn!” messages cause tests to be marked as failed
● Each test must wait for systems to stabilize before marking it complete
● Audits of system state are performed after each test

Slide 11

Slide 11 text

When does a test succeed?
● It finds all the regular expressions it expects in syslog before the timeout
● It doesn’t find any “Oh Darn!” messages
● It passes the post-test system sanity audit
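
Read as a rule, the three conditions above must all hold at once. A minimal Python sketch of that verdict, assuming a hypothetical `Outcome` record that a harness would fill in (none of these names come from CTS or the Assimilation code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Outcome:
    """Hypothetical record of what one test run observed."""

    # Expected regexes that were NOT seen in syslog before the timeout.
    missing_patterns: List[str] = field(default_factory=list)
    # Syslog lines that matched a bad-news ("Oh Darn!") pattern.
    bad_news_lines: List[str] = field(default_factory=list)
    # Problems reported by the post-test sanity audit.
    audit_errors: List[str] = field(default_factory=list)

    def succeeded(self) -> bool:
        """A test passes only if all three lists are empty."""
        return not (
            self.missing_patterns or self.bad_news_lines or self.audit_errors
        )

    def reasons(self) -> List[str]:
        """Human-readable reasons for a failure (empty on success)."""
        reasons = ["timed out waiting for: " + p for p in self.missing_patterns]
        reasons += ["bad news logged: " + line for line in self.bad_news_lines]
        reasons += ["audit failed: " + err for err in self.audit_errors]
        return reasons


if __name__ == "__main__":
    outcome = Outcome(audit_errors=["database does not match cluster state"])
    print(outcome.succeeded())
    print(outcome.reasons())
```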

Slide 12

Slide 12 text

How could this possibly work?
Doesn’t seem likely, does it?

Slide 13

Slide 13 text

What does this require from applications?
● They have to be willing to log important things to syslog
● They need to be defensive – and output “Oh Darn!” messages
● Need to be consistent in “Oh Darn!” messages
● You will likely have to log a few more things than originally planned
● You need to know what kinds of things are likely to stress the application
● You need to know what kinds of sanity audits you can perform quickly, and others that maybe you only run infrequently
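
On the application side, consistency can be as simple as routing every bad-news message through one helper that adds a fixed marker, so the test harness needs only one regular expression to spot trouble. The Python sketch below is an assumption-laden illustration: the marker string `OH-DARN`, the helper `oh_darn`, and the `join_cluster` function are invented for the example. It logs to an in-memory stream so it runs anywhere; a real deployment would attach `logging.handlers.SysLogHandler` instead.

```python
import io
import logging

BAD_NEWS_MARKER = "OH-DARN"  # hypothetical marker; pick one and never vary it


def make_logger(stream):
    """Build a logger that writes to `stream` (stand-in for syslog)."""
    logger = logging.getLogger("demo-app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def oh_darn(logger, message, *args):
    """Log a defensive 'this should never happen' message consistently."""
    logger.error(BAD_NEWS_MARKER + ": " + message, *args)


def join_cluster(logger, node, peers):
    """Hypothetical operation that logs its key result and checks itself."""
    if node in peers:
        oh_darn(logger, "node %s asked to join but is already a member", node)
        return peers
    logger.info("node %s joined cluster", node)
    return peers + [node]


if __name__ == "__main__":
    stream = io.StringIO()
    log = make_logger(stream)
    members = join_cluster(log, "node1", [])
    join_cluster(log, "node1", members)  # triggers the bad-news message
    print(stream.getvalue(), end="")
```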

Slide 14

Slide 14 text

How does this work out?
● Basically once we test a certain kind of interaction – and pass the test – it never fails in the field
● If used with good unit testing, you likely need dozens of tests, not hundreds
● Expect your product to work more reliably than any competitor’s
  – Red Hat left the business of producing their own unique HA solution
● Expect to learn things about how the application works
  – maybe some you didn’t know even as author
● Expect that some tests will need tweaking to make sure they’re really “done” before going to the next test
● Syslog regex testing can be a little fragile
  – you will have to update your tests as the code changes
● If you use Docker, you may run into Docker bugs or unexpected interactions
  – which will change over time...
● These kinds of tests can take days to run

Slide 15

Slide 15 text

Why “Re-Invent” CTS for Assimilation?
● CTS assumed you set up the syslog and real/virtual machines “somehow”
  – Distributing software and setting up syslog correctly is a pain
  – In Assimilation system testing, that’s all automated
  – We used Docker, which is lower overhead than virtual machines and designed from the ground up for automation
● I added test-specific post-queries to validate the database
  – in CTS all post-query audits were identical
● I was concerned about intellectual property issues
● I did reuse (and improve) the LogWatcher class from CTS

Slide 16

Slide 16 text

Why call it “Fuzzy Monkey” Testing?
● It’s a tribute to two other related testing methods
  – Fuzz testing – having tests not always give the same input
  – Chaos Monkey testing – which also tries to break the system – in production (we predate Chaos Monkey by several years)
● I liked it :-D
● It’s unrelated to the vodka-based cocktail ;-)

Slide 17

Slide 17 text

Where to find this online?
● http://assimilationsystems.com/2016/05/24/testing-distributed-systems-with-fuzzy-monkey-testing/
● OR http://bit.ly/FuzzyMonkey
● Assimilation Source Code:
  – https://github.com/assimilation/assimilation-official/tree/master/cma/systemtests
● These Slides:
  – https://bit.ly/FuzzyMonkeySlides2016
© 2015 Assimilation Systems Limited

Slide 18

Slide 18 text

Get Involved (Assimilation Project)!
● Get Assimilated!
● Contribute!
  – Users – give it a try
  – Security best practice experts
  – Testers, System Management, Continuous Integration
  – Designers
  – Developers (C, Python, Shell, PowerShell, JavaScript)
  – Porters (esp Windows)
  – Promoters, Publicists, Packagers, etc.

Slide 19

Slide 19 text

Resistance Is Futile!
These slides: bit.ly/FuzzyMonkeySlides16
Mailing List: bit.ly/AssimML
@OSSAlanR
#assimilation on irc.freenode.net
Assimilation Web Site: assimproj.org
https://www.owasp.org/index.php/OWASP_Assimilation_Project
Company Web Site: assimilationsystems.com
Download: assimilationsystems.com/download