Testing Distributed Systems with Fuzzy Monkey Testing

Resilience-Testing Distributed Systems with “Fuzzy Monkey” testing
#AssimProj @OSSAlanR
Alan Robertson <[email protected]>
Assimilation Systems Limited
owasp.org/index.php/OWASP_Assimilation_Project
http://AssimilationSystems.com

2/19 Biography
• 35+ years in IT/development
  – 10 years in system management (SysAdmin)
• Founded Linux-HA project - led 1998-2007
  – aka “Heartbeat” - now called Pacemaker
• Founded Assimilation Project in 2010
• Founded Assimilation Systems Limited in 2013
• Alumnus of Bell Labs, SuSE, IBM

3/19 What Is Fuzzy Monkey Testing?
• A method of testing distributed systems
• Specializes in resilience testing
  – testing for robustness in the presence of failures

4/19 Fuzzy Monkey History
• First conceived in Fall of 2001
• Initially implemented by CTS as part of the Linux-HA project
  – CTS == Cluster Testing System
• Continued into the Pacemaker Project
• CTS adopted by Corosync
• Fuzzy Monkey method re-implemented for Assimilation Project in 2014
• Came up with Fuzzy Monkey name in May 2016

5/19 Why create a unique testing method?
• Testing distributed systems is hard
• Manual testing is rarely successful
• No good tools out there
• Eliminate embarrassment:
  – I was tired of having egg on my face when I put out a release of Linux-HA with bugs that I should have caught
  – I hated doing manual testing – and I was bad at it

6/19 Why is Automated Testing Important?
• Automated testing speeds up product releases
• Automated testing available to developers decreases end-user-visible bugs
• Continuous Integration needs automated testing
• Continuous Deployment requires automated testing
Modern software development cannot rely on manual testing

7/19 How do normal automated tests work?
• Fixed list of tests which do fixed things
• Tests are typically synchronous
• Each test tests one thing
  – often just calls a function and looks at the result
• Each test expects one correct answer
• When tests complete, they leave things as they were when they started
• It’s easy to tell when a test is complete
  – when the function returns
• Tests are not subject to timing problems
• Testing is complete when the last test completes
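
For contrast with what follows, a conventional test of this kind might look like the sketch below. `parse_port` is a hypothetical function invented for illustration; it is not taken from the slides or from any project mentioned in them.

```python
import unittest


def parse_port(text):
    """Hypothetical function under test: parse a TCP port number."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_valid_port(self):
        # One synchronous call, one correct answer, no timing involved.
        self.assertEqual(parse_port("514"), 514)

    def test_rejects_out_of_range(self):
        # The test is complete as soon as the call returns or raises.
        with self.assertRaises(ValueError):
            parse_port("70000")


def run_suite():
    """Run the fixed list of tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Nothing here depends on other machines, on ordering, or on waiting for the system to settle, which is exactly what changes in the distributed case.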

8/19 Why is Automated Distributed System Testing Hard?
• Tests are always asynchronous
• If you run the same test twice, you might get two different correct answers
• The results of the test depend on the current distributed configuration
• Events happen when they do, and are observed at a random time later
• You are specifically trying to create timing issues
• You need randomness in the tests to make it more likely you hit all the timing windows
• It’s hard to tell when a test is “done”
• It’s hard to know when you’ve tested enough

9/19 What does this mean for Distributed testing?
• Tests need to be run multiple times
• Tests need to be selected at random
• Tests need to include randomness in them
• Some tests will deliberately make configuration changes
• Tests need to expect all the correct possible outcomes
• Tests need to allow for the fact that things might not happen in a particular order
• Often need white box testing to aim at timing windows
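
A minimal sketch of the first three points (repeated runs, random test selection, randomness inside each test) could look like this. The test functions and the idea of recording a seed so a failing run can be replayed are assumptions made for illustration; the slides do not say how CTS or the Assimilation tests choose or replay tests.

```python
import random


def stop_and_restart(targets, rng):
    """Hypothetical test body; a real one would act on `targets` and watch the logs."""
    return True


def partition_network(targets, rng):
    """Hypothetical test body standing in for a second kind of failure."""
    return True


def run_random_tests(tests, nodes, iterations, seed):
    """Run `iterations` randomly chosen tests against randomly chosen nodes."""
    rng = random.Random(seed)  # assumption: keep the seed so a run can be replayed
    history = []
    for _ in range(iterations):
        test = rng.choice(tests)                      # random test selection
        how_many = rng.randint(1, len(nodes))         # random number of targets
        targets = rng.sample(nodes, how_many)         # random choice of nodes
        passed = test(targets, rng)                   # the test may use rng itself
        history.append((test.__name__, targets, passed))
    return history
```

Calling `run_random_tests([stop_and_restart, partition_network], ["n1", "n2", "n3"], 10, seed=42)` twice yields the same sequence, which makes a failure found by a random run easier to investigate.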

10/19 How does “Fuzzy Monkey” testing work?
• Direct all syslogs to a central machine (using TCP)
• Key results come through syslog to the central machine
• Tests exercise the systems over ssh or similar
• Each test typically randomly picks one or more systems to use in the test
• Each test expects certain syslog messages indicating success, allowing for possible alternatives and typically ignoring ordering
• “Oh, darn!” messages cause tests to be marked as failed
• Each test must wait for systems to stabilize before marking it complete
• Audits of system state are performed after each test
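
The shape of a single test, as described in the bullets above, can be sketched as follows. This is only an illustration under stated assumptions: the class name, the `systemctl stop` command, and the wording of the expected log message are all hypothetical, and the real tests in CTS or the Assimilation system tests may be organized quite differently. The only external tool used is the standard `ssh` command-line client.

```python
import random
import re
import subprocess


def ssh_run(host, command):
    """Run `command` on `host` using the ssh command-line client."""
    return subprocess.run(["ssh", host, command], check=False).returncode


class StopServiceTest:
    """Hypothetical test: stop a service on one randomly chosen node."""

    def __init__(self, nodes, service, run=ssh_run, rng=None):
        self.nodes = list(nodes)
        self.service = service
        self.run = run                      # injectable so the sketch can run without ssh
        self.rng = rng or random.Random()

    def start(self):
        """Pick a node, act on it, and return the regexes to look for in syslog."""
        host = self.rng.choice(self.nodes)
        self.run(host, f"systemctl stop {self.service}")
        # Assumption: the application logs a line naming the host and the service.
        return [rf"{re.escape(host)} .*{re.escape(self.service)}.*stopped"]
```

The test does not inspect the command's result directly; it only states which messages should later appear in the central syslog, and the harness decides pass or fail from there.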

11/19 When does a test succeed?
• It finds all the regular expressions it expects in syslog before the timeout
• It doesn’t find any “Oh Darn!” messages
• It passes the post-test system sanity audit
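
These three criteria can be combined into one check, sketched below. It is a simplified stand-in, not the LogWatcher class mentioned later in the slides: the function name and parameters are assumptions, `lines` is any iterable of log lines, and the sketch stops scanning as soon as every expected pattern has been seen, whereas a real harness would also wait for the systems to stabilize.

```python
import re
import time


def check_outcome(lines, expected, bad_news, timeout, audit, clock=time.monotonic):
    """Return True only if all three success criteria hold.

    lines    -- iterable of syslog lines (for example, a followed log file)
    expected -- regexes that must all appear, in any order
    bad_news -- regexes that must never appear ("Oh Darn!" messages)
    timeout  -- seconds allowed for the expected messages to show up
    audit    -- callable returning True if the post-test sanity audit passes
    """
    deadline = clock() + timeout
    pending = [re.compile(p) for p in expected]
    bad = [re.compile(p) for p in bad_news]
    for line in lines:
        if clock() > deadline:
            return False                    # ran out of time
        if any(b.search(line) for b in bad):
            return False                    # an "Oh Darn!" message fails the test
        # Drop every expected pattern this line satisfies; order does not matter.
        pending = [p for p in pending if not p.search(line)]
        if not pending:
            break
    return not pending and audit()
```

For example, given the lines `"node1 cma: INFO: service stopped"` and `"node2 cma: INFO: service started"`, the expected patterns `[r"service started", r"service stopped"]` are satisfied even though the messages arrive in the opposite order.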

12/19 How could this possibly work?
Doesn’t seem likely, does it?

13/19 What does this require from applications?
• They have to be willing to log important things to syslog
• They need to be defensive – and output “Oh Darn!” messages
• Need to be consistent in “Oh Darn!” messages
• You will likely have to log a few more things than originally planned
• You need to know what kinds of things are likely to stress the application
• You need to know what kinds of sanity audits you can perform quickly, and others that maybe you only run infrequently
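
On the application side, being defensive and consistent might look like the sketch below, which uses Python's standard `logging` module. The `OH-DARN:` prefix, the helper names, and the heartbeat check are hypothetical; the slides do not specify what the marker text is, only that it should be consistent so a test harness can match it with one regular expression.

```python
import logging
import logging.handlers
import socket

BAD_NEWS_PREFIX = "OH-DARN:"  # hypothetical marker; any fixed, greppable string works


def make_syslog_handler(host, port=514):
    """Build a handler that forwards records to a central syslog server over TCP."""
    return logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_STREAM
    )


def oh_darn(logger, message, *args):
    """Log a 'this should never happen' event with one consistent prefix."""
    logger.error(BAD_NEWS_PREFIX + " " + message, *args)


def check_heartbeat_interval(logger, interval):
    """Hypothetical defensive check inside the application."""
    if interval <= 0:
        oh_darn(logger, "non-positive heartbeat interval %r", interval)
        return False
    # Also log the normal outcome, so tests have a success message to look for.
    logger.info("heartbeat interval set to %s seconds", interval)
    return True
```

Routing every unexpected condition through a single helper keeps the failure pattern in the tests to one regex, and logging the normal outcome as well gives the tests something positive to wait for.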

14/19 How does this work out?
• Basically once we test a certain kind of interaction – and pass the test – it never fails in the field
• If used with good unit testing, you likely need dozens of tests, not hundreds
• Expect your product to work more reliably than any competitor’s
  – RedHat left the business of producing their own unique HA solution
• Expect to learn things about how the application works – maybe some you didn’t know even as author
• Expect that some tests will need tweaking to make sure they’re really “done” before going to the next test
• Syslog regex testing can be a little fragile – you will have to update your tests as the code changes
• If you use Docker, you may run into Docker bugs or unexpected interactions – which will change over time...
• These kinds of tests can take days to run

15/19 Why “Re-Invent” CTS for Assimilation?
• CTS assumed you set up the syslog and real/virtual machines “somehow”
  – Distributing software and setting up syslog correctly is a pain
  – In Assimilation system testing, that’s all automated
  – We used Docker, which is lower overhead than virtual machines and designed from the ground up for automation
• I added test-specific post-queries to validate the database
  – in CTS all post-query audits were identical
• I was concerned about intellectual property issues
• I did reuse (and improve) the LogWatcher class from CTS

16/19 Why call it “Fuzzy Monkey” Testing?
• It’s a tribute to two other related testing methods
  – Fuzz testing – having tests not always give the same input
  – Chaos Monkey testing – which also tries to break the system – in production (we predate Chaos Monkey by several years)
• I liked it :-D
• It’s unrelated to the vodka-based cocktail ;-)

17/19 Where to find this online?
• http://assimilationsystems.com/2016/05/24/testing-distributed-systems-with-fuzzy-monkey-testing/
• OR http://bit.ly/FuzzyMonkey
• Assimilation Source Code:
  – https://github.com/assimilation/assimilation-official/tree/master/cma/systemtests
• These Slides:
  – https://bit.ly/FuzzyMonkeySlides2016
© 2015 Assimilation Systems Limited

18/19 Get Involved (Assimilation Project)!
• Get Assimilated!
• Contribute!
  – Users – give it a try
  – Security best practice experts
  – Testers, System Management, Continuous Integration
  – Designers
  – Developers (C, Python, Shell, PowerShell, JavaScript)
  – Porters (esp Windows)
  – Promoters, Publicists, Packagers, etc.

19/19 Resistance Is Futile!
These slides: bit.ly/FuzzyMonkeySlides16
Mailing List: bit.ly/AssimML
@OSSAlanR
#assimilation on irc.freenode.net
Assimilation Web Site: assimproj.org
https://www.owasp.org/index.php/OWASP_Assimilation_Project
Company Web Site: assimilationsystems.com
Download: assimilationsystems.com/download
