Testing Distributed Systems with Fuzzy Monkey Testing

One of the keys to good software is good testing. There are well-known testing suites for back-end code – things like JUnit and py.test. There are also good front-end testing tools – things like Selenium. But for testing distributed systems there aren’t so many well-known tools, because the problem is quite different, and harder. These slides cover the “Fuzzy Monkey” methodology used for testing three different successful distributed systems (including the Assimilation Suite): its history, and how and why it works.

http://bit.ly/FuzzyMonkey

Alan Robertson

October 03, 2016

Transcript

  1. Resilience-Testing Distributed Systems with “Fuzzy Monkey” testing
    #AssimProj @OSSAlanR
    Alan Robertson
    Assimilation Systems Limited
    owasp.org/index.php/OWASP_Assimilation_Project
    http://AssimilationSystems.com

  2. Biography

    35+ years in IT/development – 10 years in system
    management (SysAdmin)

    Founded Linux-HA project - led 1998-2007 – aka “Heartbeat” -
    now called Pacemaker

    Founded Assimilation Project in 2010

    Founded Assimilation Systems Limited in 2013

    Alumnus of Bell Labs, SuSE, IBM

  3. What Is Fuzzy Monkey Testing?

    A method of testing distributed systems

    Specializes in resilience testing
    – testing for robustness in the
    presence of failures

  4. Fuzzy Monkey History

    First conceived in Fall of 2001

    Initially implemented by CTS as part of the Linux-HA project
    – CTS == Cluster Testing System

    Continued into the Pacemaker Project

    CTS Adopted by Corosync

    Fuzzy Monkey method re-implemented for Assimilation
    Project in 2014

    Came up with Fuzzy Monkey name in May 2016

  5. Why create a unique testing method?

    Testing distributed systems is hard

    Manual testing is rarely successful

    No good tools out there

    Eliminate Embarrassment:
    – I was tired of having egg on my face when I put out a release of
    Linux-HA with bugs that I should have caught
    – I hated doing manual testing – and I was bad at it

  6. Why is Automated Testing Important?

    Automated testing speeds up product releases

    Automated testing available to developers decreases end-user-visible bugs

    Continuous Integration needs automated testing

    Continuous Deployment requires automated testing

    Modern software development cannot rely on manual testing

  7. How do normal automated tests work?

    Fixed list of tests which do fixed things

    Tests are typically synchronous

    Each test tests one thing – often just calls a function and looks at the
    result

    Each test expects one correct answer

    When tests complete, they leave things as they were when they started

    It’s easy to tell when a test is complete – when the function returns

    Tests are not subject to timing problems

    Tests are complete when the last test completes

  8. Why is Automated Distributed System Testing Hard?

    Tests are always asynchronous

    If you run the same test twice, you might get two different correct answers

    The results of the test depend on the current distributed configuration

    Events happen when they do, and are observed at a random time later

    You are specifically trying to create timing issues

    You need randomness in the tests to make it more likely you hit all the timing
    windows

    It’s hard to tell when a test is “done”

    It’s hard to know when you’ve tested enough

  9. What does this mean for Distributed testing?

    Tests need to be run multiple times

    Tests need to be selected at random

    Tests need to include randomness in them

    Some tests will deliberately make configuration changes

    Tests need to expect all the correct possible outcomes

    Tests need to allow for the fact that things might not
    happen in a particular order

    Often need white box testing to aim at timing windows
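
The first three points above can be sketched as a small driver loop. This is only an illustration, not the CTS or Assimilation code: the function names and the two stand-in tests are hypothetical, and seeding the random number generator is an assumption added here so that a failing sequence of choices can be replayed (the timing of a real cluster will still differ between runs).

```python
import random

def run_random_tests(tests, iterations, seed):
    """Run `iterations` randomly chosen tests; return a list of (name, passed)."""
    rng = random.Random(seed)  # assumption: seeded so the choices can be replayed
    results = []
    for _ in range(iterations):
        test = rng.choice(tests)  # tests are selected at random, so each runs many times
        # Each test gets the RNG so it can add its own randomness (targets, delays).
        results.append((test.__name__, bool(test(rng))))
    return results

# Hypothetical stand-ins for real tests, which would act on live systems.
def stop_random_node(rng):
    node = rng.choice(["node1", "node2", "node3"])
    return node.startswith("node")

def restart_random_node(rng):
    delay = rng.uniform(0.0, 2.0)  # random delay to help hit different timing windows
    return 0.0 <= delay <= 2.0

if __name__ == "__main__":
    for name, passed in run_random_tests(
            [stop_random_node, restart_random_node], iterations=10, seed=42):
        print(name, "PASS" if passed else "FAIL")
```

Logging the seed alongside the results would make it possible to repeat the same order of tests when investigating a failure.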

  10. How does “Fuzzy Monkey” testing work?

    Direct all syslogs to a central machine (using TCP)

    Key results come through syslog to central machine

    Tests exercise the systems over ssh or similar

    Each test typically randomly picks one or more systems to use in the test

    Each test expects certain syslog messages indicating success, allowing
    for possible alternatives, and typically ignoring ordering

    “Oh, darn!” messages cause tests to be marked as failed

    Each test must wait for systems to stabilize before marking it complete

    Audits of system state are performed after each test
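
A rough sketch of the life cycle of one test, under stated assumptions: `action`, `is_stable`, and the audit callables are hypothetical hooks (in a real harness the action would be a command sent over ssh, and stability would be judged from the central syslog). The log matching itself is left out here.

```python
import random
import time

def wait_until_stable(is_stable, timeout=5.0, poll=0.05):
    """Poll is_stable() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_stable():
            return True
        time.sleep(poll)
    return bool(is_stable())  # one last check at the deadline

def run_one_test(rng, systems, action, is_stable, audits, timeout=5.0):
    """Pick random targets, exercise them, wait for quiet, then audit.

    Returns (passed, targets).
    """
    # Randomly pick one or more systems to use in this test.
    targets = rng.sample(systems, k=rng.randint(1, len(systems)))
    action(targets)  # hypothetical hook, e.g. stop a service on each target
    # The test is not complete until the systems have settled.
    if not wait_until_stable(is_stable, timeout=timeout):
        return False, targets
    # Audits of system state run after every test.
    failed = [audit for audit in audits if not audit()]
    return not failed, targets

if __name__ == "__main__":
    # In-memory stand-in for a cluster, for illustration only.
    state = {"running": {"node1", "node2", "node3"}}

    def stop_nodes(targets):
        state["running"] -= set(targets)

    def no_unknown_nodes():
        return state["running"] <= {"node1", "node2", "node3"}

    print(run_one_test(random.Random(7), ["node1", "node2", "node3"],
                       stop_nodes, lambda: True, [no_unknown_nodes]))
```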

  11. When does a test succeed?

    It finds all the regular expressions it expects in syslog
    before the timeout

    It doesn’t find any “Oh Darn!” messages

    It passes the post-test system sanity audit
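
The three conditions above could be checked roughly as follows. This is a simplified sketch, not the LogWatcher class: it assumes the harness has already collected the syslog lines that arrived before the timeout, and the patterns and log lines in the example are made up.

```python
import re

def judge_test(log_lines, expected, bad_patterns, audit_ok=True):
    """Return (passed, missing_patterns, bad_lines) for one test.

    log_lines: lines that reached the central log before the timeout.
    expected: regexes that must each match at least one line, in any order.
    bad_patterns: regexes marking "Oh Darn!" messages.
    """
    missing = [re.compile(p) for p in expected]
    bad = [re.compile(p) for p in bad_patterns]
    bad_lines = []
    for line in log_lines:
        if any(b.search(line) for b in bad):
            bad_lines.append(line)
        # Drop every expected pattern this line satisfies; order is ignored.
        missing = [m for m in missing if not m.search(line)]
    passed = bool(not missing and not bad_lines and audit_ok)
    return passed, [m.pattern for m in missing], bad_lines

if __name__ == "__main__":
    # Hypothetical log lines and patterns.
    lines = ["node2: service stopped", "node1: takeover complete for node2"]
    expected = [r"takeover complete for node\d", r"node\d: service stopped"]
    print(judge_test(lines, expected, [r"ERROR", r"CRIT"]))
```

Returning the unmatched patterns and the offending lines, not just a pass/fail flag, makes a failed run much easier to diagnose.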

  12. How could this possibly work?

    Doesn’t seem likely, does it?

  13. What does this require from applications?

    They have to be willing to log important things to syslog

    They need to be defensive – and output “Oh Darn!” messages

    Need to be consistent in “Oh Darn!” messages

    You will likely have to log a few more things than originally
    planned

    You need to know what kinds of things are likely to stress the
    application

    You need to know what kinds of sanity audits you can perform
    quickly, and others that maybe you only run infrequently
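
One way an application might keep its “Oh Darn!” messages consistent is to route them all through a single helper. This is a sketch, not code from any of the projects mentioned: the `OH-DARN:` marker, the logger name, and `apply_update` are hypothetical. In a real deployment the logger would be given a syslog handler (Python's standard library provides `logging.handlers.SysLogHandler`) so that the messages reach the central machine.

```python
import logging

OH_DARN = "OH-DARN:"  # hypothetical marker; any fixed, greppable prefix would do

log = logging.getLogger("myapp")  # hypothetical application logger

def oh_darn(message, *args):
    """Log a should-never-happen condition with the one marker the tests look for."""
    log.error(OH_DARN + " " + message, *args)

def apply_update(version, current):
    """Defensive check: versions are assumed never to go backwards."""
    if version < current:
        oh_darn("update %d is older than current version %d", version, current)
        return current
    log.info("applied update %d", version)  # an important event worth logging
    return version

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    apply_update(6, 5)
    apply_update(3, 6)
```

With one marker, the test harness needs only one regular expression to catch every such message, whatever the individual message text is.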

  14. How does this work out?

    Basically once we test a certain kind of interaction – and pass the test – it never fails in the field

    If used with good unit testing, you likely need dozens of tests, not hundreds

    Expect your product to work more reliably than any competitor’s
    – Red Hat left the business of producing their own unique HA solution

    Expect to learn things about how the application works – maybe some you didn’t know,
    even as the author

    Expect that some tests will need tweaking to make sure they’re really “done” before going to
    next test

    Syslog regex testing can be a little fragile – you will have to update your tests as the code
    changes

    If you use Docker, you may run into Docker bugs or unexpected interactions – which will
    change over time...

    These kinds of tests can take days to run

  15. Why “Re-Invent” CTS for Assimilation?

    CTS assumed you set up the syslog and real/virtual machines
    “somehow”
    – Distributing software and setting up syslog correctly is a pain
    – In Assimilation system testing, that’s all automated
    – We used Docker which is lower overhead than virtual machines and
    designed from the ground up for automation

    I added test-specific post-queries to validate the database – in
    CTS all post-query audits were identical

    I was concerned about intellectual property issues

    I did reuse (and improve) the LogWatcher class from CTS

  16. Why call it “Fuzzy Monkey” Testing?

    It’s a tribute to two other related testing methods
    – Fuzz testing – having tests not always give the same input
    – Chaos Monkey testing – which also tries to break the system –
    in production (we predate Chaos Monkey by several years)

    I liked it :-D

    It’s unrelated to the vodka-based cocktail ;-)

  17. Where to find this online?

    http://assimilationsystems.com/2016/05/24/testing-distributed-systems-with-fuzzy-monkey-testing/

    OR http://bit.ly/FuzzyMonkey

    Assimilation Source Code:
    – https://github.com/assimilation/assimilation-official/tree/master/cma/systemtests

    These Slides:
    – https://bit.ly/FuzzyMonkeySlides2016

    © 2015 Assimilation Systems Limited

  18. Get Involved (Assimilation Project)!

    Get Assimilated!

    Contribute!
    – Users – give it a try
    – Security best practice experts
    – Testers, System Management, Continuous Integration
    – Designers
    – Developers (C, Python, Shell, PowerShell, JavaScript)
    – Porters (esp Windows)
    – Promoters, Publicists, Packagers, etc.

  19. Resistance Is Futile!

    These slides: bit.ly/FuzzyMonkeySlides16
    Mailing List: bit.ly/AssimML
    @OSSAlanR
    #assimilation on irc.freenode.net
    Assimilation Web Site: assimproj.org
    https://www.owasp.org/index.php/OWASP_Assimilation_Project
    Company Web Site: assimilationsystems.com
    Download: assimilationsystems.com/download
