The Limits of Testing and How to Exceed Them

Your unit tests pass and your code coverage looks great, so you can just hit “Deploy” and head out for the weekend, right? Unfortunately, passing tests, while useful, do not begin to guarantee that your system works correctly. However, we can do better! Techniques such as property based testing, dependent types, and manual testing can be combined with unit testing to ensure highly reliable software. How do you know when you are “covered” by a unit test and when you must employ other techniques? We will consider precisely what unit testing really does, where it fails, and how to create the best plan for ensuring the overall quality of your application.

Craig Stuntz

October 08, 2015

Transcript

  1. The Limits of Testing and How to Exceed Them
    Craig Stuntz
    https://speakerdeck.com/craigstuntz

  2. https://www.flickr.com/photos/10159247@N04/4335602802/
    In ancient times, programmer life was simple. Dinosaurs roamed the Earth, we didn’t write unit tests, and we employed people to slowly and painstakingly find bugs for us.

  3. And then we decided testing was good.

    And then people said we should test all the …. time.

    And from then on our software was perfectly reliable and secure. We can all go home.

    That’s the end of my presentation, thanks for coming….

  4. Now it turns out that fuzzing software makes security bugs jump out at you in a way that tests never will.

  6. Now it turns out that static analysis makes resource leak bugs jump out at you in a way that tests never will.

    Now it turns out that…

    Wait. This is getting complicated.

    What to do?

  7. Agenda
    • Unit testing doesn’t (and isn’t intended to) find every program error
    • Other techniques complement testing to find errors that unit tests can’t find
    • These methods are useful on real-world software, today
    I’m interested in building correct software. Sometimes people start by writing this off as impossible.

    It’s easier to dismiss something as impossible than to ask if you can bite off a big chunk of it.

  8. Immediate Digression
    Manual Testing
    Really useful, but doesn’t fit the theme of the rest of the talk.

    Still, really useful, so let’s talk anyway!

    How is manual testing fundamentally different from unit tests, automated tests, etc.?

  9. Sometimes we think of manual testing as poking weird values into inputs. And hey, it works sometimes: both Android and iPhone lock screens have been broken by “boredom testing.” But computers can do this faster.

    The best application for manual testing: What is something that computers can never do by themselves today?

  10. https://medium.com/backchannel/how-technology-led-a-hospital-to-give-a-patient-38-times-his-dosage-ded7b3688558
    This is life or death. This is an alert screen.

    Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone.

    Contains the number “160” in 14 different places.

    Nurse clicked through this and patient received 38 1/2 tablets of an antibiotic. He’s fortunate to have survived.

    Use human testing for things only people can do!

  11. Unit Testing
    It works until it doesn’t.
    For the rest of this presentation I’m going to talk about tests performed by a computer.

    For many people, unit tests are both a design methodology and the first line of defense against bugs.

  12. Let’s Write a parseInt!
    let parseInt (value: string) : int = ???
    Because I’m an NIH developer, and because it’s a really simple example to play with, I’ll write my own parseInt. It’s simple, right? Maybe too simple to say anything worthwhile?

  13. Test First!
    [<Test>]
    member this.``Should parse 0``() =
        let actual = parseInt "0"
        actual |> should equal 0
    But I believe in test first and TDD, so… What sort of tests do I need for parseInt?

    This looks like a good start. Of course, this test does not pass yet, because I haven't implemented the method. That failure is an important piece of information! If I can’t parse 0, my parseInt isn’t very good.

    So let's say that I go and implement some parseInt code. At least enough to make the test pass. Now, this test tells me very little about the correctness of the method. That's interesting! Implementing the method removed information from the system! That seems really weird, but…
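
    To make that concrete, here is a minimal sketch of a deliberately useless implementation that still makes the first test pass. It is a hypothetical illustration, not the implementation from the talk, and a plain Boolean check stands in for the FsUnit test.

    ```fsharp
    // Hypothetical "just enough to pass" implementation: it ignores its input.
    let parseInt (value: string) : int = 0

    // Plain Boolean check standing in for the "Should parse 0" unit test.
    let shouldParseZero () : bool = parseInt "0" = 0

    printfn "Should parse 0 passes: %b" (shouldParseZero ())

    // The passing test says nothing about any other input.
    printfn "parseInt \"1\" returns %d" (parseInt "1")
    ```

    The test is green, yet the function is wrong for every input except one.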

  14. Test First!
    [<Test>]
    member this.``Should parse 0``() =
        let actual = parseInt "0"
        actual |> should equal 0
    [<Test>]
    member this.``Should parse 1``() =
        let actual = parseInt "1"
        actual |> should equal 1
    Maybe I should add another test.

    Am I missing anything?

  15. Test First!
    [<Test>]
    member this.``Should parse -1``() =
        let actual = parseInt "-1"
        actual |> should equal -1
    [<Test>]
    member this.``Should parse with whitespace``() =
        let actual = parseInt " 123 "
        actual |> should equal 123

  16. Test First!
    [<Test>]
    member this.``Should parse +1 with whitespace``() =
        let actual = parseInt " +1 "
        actual |> should equal 1
    [<Test>]
    member this.``Should do ??? with freeform prose``() =
        let actual = parseInt "A QA engineer walks into a bar…"
        actual |> should equal ???
    Anything else? null, MaxInt+1, non-%20 whitespace, MaxInt, MinInt, 1729?

    I’m starting to realize I have more questions than answers!

  17. More Questions
    • Is this for trusted or non-trusted input?
    1) Trusted = exception; untrusted = fail gracefully.

    2) For a private method, maybe. For a library function, no! Need tests per invocation?

    3) ",", "$", etc.?

    It sounds like we might need a lot of tests. How many? Does it seem weird that we’re talking more about corner cases than “success”? Does this teeny little helper method really need to be perfect? I just wanna parse 123!

  18. More Questions
    • Is this for trusted or non-trusted input?
    • Can I trust that my function will be invoked correctly?

  19. More Questions
    • Is this for trusted or non-trusted input?
    • Can I trust that my function will be invoked correctly?
    • What is the culture of the input?

  20. Getting one digit wrong really can get your company into the headlines.

    Also, what about security-sensitive code: hashes, RNGs?

    Does it seem like the test case suggestions focused on error cases? Even if 90% of the time we get expected input, I’m far more interested in the reasons which explain 90% of the failures.

  21. Bad Error Handling Kills
    “Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software.”
    https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
    The study only looked at software designed for high reliability (Cassandra, HDFS, Hadoop…).

    “But it does suggest that top-down testing, say, using input and error injection techniques, will be challenged by the large input and state space.”

  22. Simple Testing Can Prevent Most Critical Failures, Yuan et al.
    92% of the time the catastrophe was caused not by the error itself but rather the combination of the error and then handling it incorrectly!

  23. How Can I Be Completely Confident in a Simple Function?
    (Or at least do the right thing when it fails)
    (And also ensure it’s always called correctly)
    (Every. Single. Time)
    Let’s face it, this is the bare minimum first step for trusting an application, right?

    You might ask, “Why is this idiot rambling on about parseInt? I have 10 million lines of code to test.” I think it’s sometimes informative to start with the simplest thing which could possibly work.

  24. Unit Tests
    Are Great:
    • Helping you think through bottom-up designs
    • Preventing regressions
    • Getting you to the point where at least something works.
    Not So Helpful:
    • Showing overall design consistency (top-down)
    • Finding security holes
    • Proving correctness or totality of implementation
    We can use techniques like strong typing, fuzzing, and formal methods to complement testing and give more control over code correctness. You will still need tests, but you’ll get much more “coverage” with fewer tests.

    Looking at the lists here, a theme emerges.

  25. When My Test
    • Fails: I know I’ve found a bug (useful!)
    • Passes: I know my function works for at least one input out of billions (maybe useless?)
    Does this make sense to everyone? Do you agree that a passing test doesn’t tell you much about the overall quality of the system?

    Is there a way to ensure we always get correct output for any input?

    Yes, but before we even get there, there’s a bigger problem we haven’t talked about yet.

  26. How Can I Be Completely Confident When Composing Two Functions?
    (Composing two correct functions should produce the correct result.)
    (Every. Single. Time)
    Let’s face it, this is the bare minimum second step for trusting an application, right?

    More generally, I would like to be able to build complete, correct programs from a foundation of correct functions. Now verifying my 10 million lines of code is easy; start with correct functions, then combine them correctly!

  27. parseIntAndReturnAbsoluteValue = abs ∘ parseInt
    If I have two good functions, like abs and parseInt, I would like to be able to combine them in order to produce a correct program.

    But there’s a problem: parseInt, as written, isn’t total. (A total function returns a result for every possible value of its argument type.) I can call it with strings which aren’t integers, and it’s really hard to use tests to ensure I call it correctly 100% of the time. How do I know it will always return something useful?
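
    A minimal sketch of the composition and of the totality problem, assuming a stand-in parseInt built on the .NET Int32.Parse (the talk never shows its own implementation):

    ```fsharp
    open System
    open System.Globalization

    // Hypothetical stand-in for the talk's parseInt; it delegates to the .NET parser.
    let parseInt (value: string) : int =
        Int32.Parse(value, CultureInfo.InvariantCulture)

    // abs ∘ parseInt, written with F#'s left-to-right composition operator.
    let parseIntAndReturnAbsoluteValue : string -> int = parseInt >> abs

    printfn "%d" (parseIntAndReturnAbsoluteValue "-42")

    // parseInt is partial: this argument type-checks, but no integer comes back.
    let outcome : Result<int, string> =
        try
            Ok (parseIntAndReturnAbsoluteValue "forty-two")
        with
        | :? FormatException as ex -> Error ex.Message

    printfn "%A" outcome
    ```

    The composed function inherits the gap: the type string -> int promises an integer for every string, and the exception is how that promise gets broken.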

  28. let parseInt (str) =
        // implementation
    One thing I need to do is ensure that people call my function passing a string as the argument, and that the thing it returns is actually an integer, in every case.

  29. let parseInt (value: string) : int =
        // implementation
    That’s not too hard. I can prove this with the type system.

    As long as I don’t do anything which subverts the type system (unsafe casts, unchecked exceptions, null), or I use a language which won’t allow it, I can at least be sure I’m in the right ballpark.

    But how do I ensure I’m only passed a string representing an integer? Or should I? Can I force the caller to “do the right thing” and handle the error if they happen to pass a bad value?

  30. public static bool TryParse(
        string s,
        out int result
    )
    {
        // …
    }
    Again, you can do it with the type system! I’m showing a C# example here, since the idiomatic F# solution is different.

  31. public static bool TryParse(
        string s,
        out int result
    )
    {
        // …
    }
    // appropriate when input is “trusted”
    int betterBeAnInt = ParseInt(input);
    // appropriate for untrusted input
    int couldBeAnInt;
    if (TryParse(input, out couldBeAnInt))
    {
        // …
    }
    It is now difficult to invoke the function without testing success. You have to go out of your way. This probably eliminates the need to use tests to ensure that every case in which this function is invoked checks the success of the function.

    Consider input validation: bad input is part of the contract, so exceptions are inappropriate. Instead of returning an integer, return an integer and a Boolean.
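
    For comparison, here is a sketch of what the F# counterpart might look like, using an option instead of an out parameter. The names tryParseInt and describe are hypothetical; Int32.TryParse is the real .NET method.

    ```fsharp
    open System
    open System.Globalization

    // Hypothetical wrapper: Some number on success, None on bad input.
    let tryParseInt (value: string) : int option =
        match Int32.TryParse(value, NumberStyles.Integer, CultureInfo.InvariantCulture) with
        | true, number -> Some number
        | false, _ -> None

    // The caller cannot reach the integer without deciding what to do on failure.
    let describe (input: string) : string =
        match tryParseInt input with
        | Some number -> sprintf "parsed %d" number
        | None -> sprintf "rejected %A" input

    printfn "%s" (describe " 123 ")
    printfn "%s" (describe "A QA engineer walks into a bar…")
    ```

    Here the compiler flags a match that forgets the None case, so “did the caller check for failure?” no longer needs a test per call site.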

  32. But There’s Still The Matter of That String Argument
    We can prove that we do the right thing when our parseInt correctly classifies a given input value as a legitimate integer and parses it, or rejects it as invalid, but how can we show that we do that correctly? Aren’t we back at square one? Types are super neat because you get this confidence essentially for free, and it never fails, but even the F# type system can’t make sure I return the right integer.

  33. State Space
    [Diagram: a black box; labels 0, 1, A, B]
    In principle, your app, or your function, is a black box. Same input, same output. Easy to test, right?

    This application should have only two possible states!

    To be totally confident in your system you need to test, by some means, the entire state space (LoC discussion).

  34. State Space
    [Diagram: a black box; labels “Hello”, “World”, A, B]

    It gets harder quickly. If my inputs are two strings instead of two bits, I now have considerably more possible test cases!

    (Click) In the real world, you have additional “inputs” like time and randomness, and whatever is in your database.
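
    A rough sketch of how fast the input space grows. The length bound of 5 and the choice to count UTF-16 code units are arbitrary assumptions made only to get a concrete number.

    ```fsharp
    // Number of distinct strings of length 0..maxLength over an alphabet of the given size.
    let countStrings (alphabetSize: bigint) (maxLength: int) : bigint =
        List.sum [ for length in 0 .. maxLength -> bigint.Pow(alphabetSize, length) ]

    // Two one-bit inputs: 2 * 2 input combinations.
    let twoBits = 2I * 2I

    // Two strings of at most 5 UTF-16 code units each (a tiny, arbitrary bound).
    let oneShortString = countStrings 65536I 5
    let twoShortStrings = oneShortString * oneShortString

    printfn "two bits:          %A input combinations" twoBits
    printfn "two short strings: %A input combinations" twoShortStrings
    ```

    Even with strings capped at five characters, enumerating every input pair is out of reach, and real strings have no such cap.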

  35. Formal Methods
    Using formal methods means the design is driven by a mathematical formalism. By definition, this is not test driven development, although you will probably still write tests. Formal methods are sometimes considered controversial in the software development community, because they acknowledge the existence and utility of math.

  36. ____ + 1234 ____
    [ \t]*[+-]?[0-9]+[ \t]*
    It’s easier to use formal methods if there’s an off-the-shelf formalism you can use. For the problem of parsing, these exist!

    One way to reduce the input domain of the parseInt function from an untestably large number of potential states is to use a regular expression. This is not the sort of regular expression you might encounter in Perl or Ruby; it is a much more restricted syntax typically used on the front end of a compiler. The important point, here, is that we can reduce the number of potential states of the function to a number that you can count on your fingers.

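  37. [Diagram: finite state machine with states 0 through 4; transitions labeled [ \t], [+-], and [0-9]]
    Regular expressions convert to finite state machines.

    States 3 and 4 are accepting states.

    Four or five states, two of them accepting: far fewer than “any possible string!”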
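
    A hand-written sketch of such a machine for the regular expression on the previous slide. The state names are my own and will not line up with the slide's numbering; the explicit Rejected state is an addition that makes the transition function total.

    ```fsharp
    // States of a DFA for [ \t]*[+-]?[0-9]+[ \t]*
    type State =
        | Leading    // skipping leading whitespace
        | Signed     // consumed a sign, still need a digit
        | Digits     // consumed one or more digits (accepting)
        | Trailing   // skipping trailing whitespace (accepting)
        | Rejected   // dead state: no transition leads out of it

    let isBlank (c: char) = c = ' ' || c = '\t'
    let isSign (c: char) = c = '+' || c = '-'
    let isDigit (c: char) = c >= '0' && c <= '9'

    // One transition: current state plus next character gives the next state.
    let step (state: State) (c: char) : State =
        match state with
        | Leading when isBlank c -> Leading
        | Leading when isSign c -> Signed
        | Leading when isDigit c -> Digits
        | Signed when isDigit c -> Digits
        | Digits when isDigit c -> Digits
        | Digits when isBlank c -> Trailing
        | Trailing when isBlank c -> Trailing
        | _ -> Rejected

    // Run the machine over the whole input and check for an accepting state.
    let accepts (input: string) : bool =
        match Seq.fold step Leading input with
        | Digits | Trailing -> true
        | _ -> false

    printfn "%b" (accepts " +1234 ")
    printfn "%b" (accepts "A QA engineer walks into a bar…")
    ```

    Every string, however long, ends in one of five states, so the classification logic can be checked one transition at a time.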

  38. Totality checking. Breaking my vow to avoid showing implementations.

    Lots of code here, but the important word is at the top.

    I’ve hesitated about showing implementations until now, but I can’t avoid it here, because…

    The proof is built into the implementation

  39. When My Test
    • Fails: I know I’ve found a bug (useful!)
    • Passes: I know my function works for at least one input out of billions (maybe useless?)
    When My Type Checker
    • Fails: I might have a bug (sometimes useful, sometimes frustrating)
    • Passes: There is a class of bugs which cannot exist (awesome!)
    We can expand this chart now.

    Tests and types are not opponents; they complement each other.

    Where one succeeds, the other fails, and vice versa.

  40. Property Based Testing
    Still, there are cases where it’s hard to use formal methods.

    Not every problem has an off-the-shelf formalism ready to use.

    But we don’t have to just give up and accept unit tests as the best we can do!

  41. let parsedIntEqualsOriginalNumber =
        fun (number: int) ->
            number = parseInt (number.ToString())
    > open FsCheck;;
    > Check.Quick parsedIntEqualsOriginalNumber;;
    Falsifiable, after 1 test (1 shrink) (StdGen (1764712907,296066647)):
    Original:
    -2
    Shrunk:
    -1
    val it : unit = ()
    >
    Can you state things about your system which will always be true?

    What must be true for my system to work?

    Looks like I have to do some work on my implementation here!

    Important: I didn’t have to specify the failing case, as I would with a unit test. FsCheck found it for me.
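
    To show the idea without any library, here is a hand-rolled sketch of the same property check. It is not FsCheck and does no shrinking; the digits-only parseInt is a hypothetical buggy implementation chosen so that the property has something to find.

    ```fsharp
    open System

    // Deliberately incomplete parser (hypothetical): treats every character as a digit.
    let parseInt (value: string) : int =
        value |> Seq.fold (fun total c -> total * 10 + (int c - int '0')) 0

    // The property from the slide: parsing the printed form gives back the number.
    let parsedIntEqualsOriginalNumber (number: int) : bool =
        number = parseInt (string number)

    // Stand-in for a property-based testing library: try many random
    // inputs and return the first one that falsifies the property.
    let findCounterexample (property: int -> bool) (attempts: int) (seed: int) : int option =
        let random = Random(seed)
        Seq.init attempts (fun _ -> random.Next(-1000, 1000))
        |> Seq.tryFind (fun candidate -> not (property candidate))

    match findCounterexample parsedIntEqualsOriginalNumber 1000 42 with
    | Some number -> printfn "Falsified by %d" number
    | None -> printfn "No counterexample found"
    ```

    As on the slide, nobody told the checker to try negative numbers; the generator stumbled onto them.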

  42. PBT: Great for helping to find bugs in specific routines.

    Fuzzing: Great for finding unhandled errors in entire systems.

  44. http://colin-scott.github.io/blog/2015/10/07/fuzzing-raft-for-fun-and-profit/
    It often makes sense to write a custom fuzzer. It’s not hard, and the return is huge. This example is more similar to property-based testing, since it uses the stated invariants from the Raft specification to test an implementation.

    (Policy story)
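
    A minimal sketch of how small a custom fuzzer can start out. It is far simpler than the Raft fuzzer linked above and works on a single function rather than a whole system; tryParseInt, randomString, tryRun, and fuzz are all hypothetical names, and the only invariant checked is “never throws.”

    ```fsharp
    open System
    open System.Globalization

    // Function under test (hypothetical wrapper over the real Int32.TryParse).
    let tryParseInt (value: string) : int option =
        match Int32.TryParse(value, NumberStyles.Integer, CultureInfo.InvariantCulture) with
        | true, number -> Some number
        | false, _ -> None

    // A short string of arbitrary UTF-16 code units, valid text or not.
    let randomString (random: Random) : string =
        let length = random.Next(0, 12)
        System.String(Array.init length (fun _ -> char (random.Next(0, 65536))))

    // Run the target once; report the exception if it throws.
    let tryRun (target: string -> 'a) (input: string) : exn option =
        try
            target input |> ignore
            None
        with ex ->
            Some ex

    // Fuzz loop: collect every input that makes the target throw.
    let fuzz (target: string -> 'a) (iterations: int) (seed: int) : (string * exn) list =
        let random = Random(seed)
        List.init iterations (fun _ -> randomString random)
        |> List.choose (fun input -> tryRun target input |> Option.map (fun ex -> input, ex))

    let crashes = fuzz tryParseInt 10000 1
    printfn "%d crashing inputs" (List.length crashes)
    ```

    Swapping in stronger invariants, like the ones a specification states, is what moves this from crash hunting toward property-based testing.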

  45. Runtime Validation
    Sometimes the most important value to test is the only one that matters to you at runtime.

    Assertions are a little under-used, because we tend to think of them as checking trivial things.

    But using the techniques of property-based testing, we can do end to end validation of our system.

  46. let input = " +123 "
    let number = parseInt input // 123
    let test = number.ToString() // "123"
    if test <> input // true!
    then
        let testNumber = parseInt test // 123
        if number <> testNumber // false (yay!)
        then failwith "Uh oh!"
    // We’re safe now! Use number…
    Similar to property-based testing.

  47. http://lefthandedgoat.github.io/canopy/
    Integration testing should always be automated. Deals with coupling between systems not covered by type safety (DB, DOM, etc.)

    Use Canopy

    Also: write integration test method.

  48. The Quality Landscape
    • Manual testing
    • Integration tests
    • Unit tests
    • Runtime validation
    • Property based testing
    • Fuzzing
    • Formal methods
    • Static analysis
    • Type systems
    • Totality checking
    The long and the short of it: Think big! Don’t “test all the ___ing time” because somebody told you to. Keep your eyes on the prize of software correctness.

    Ask yourself which things are most important to the overall quality of your system. Pick the tool(s) which give you the biggest return.

    Synopsis of each.

  49. Craig Stuntz
    @craigstuntz
    http://blogs.teamb.com/craigstuntz
    http://www.meetup.com/Papers-We-Love-Columbus/
