The Limits of Unit Testing, and How to Exceed Them

Slide 1

Slide 1 text

The Limits of Testing and How to Exceed Them Craig Stuntz https://speakerdeck.com/craigstuntz

Slide 2

Slide 2 text

? I have a couple of questions for you. How many of you are QAs? Developers? (I’m a developer.) I want to talk about the interactions between different team members, software, and customers.

Slide 3

Slide 3 text

Why QA software? What’s QA, really? We also care about correctness and suitability for the purpose (you know, the things the EULA says we don’t care about). Who does it, and what are we doing? How do we split the work? Should devs care? Another ?: Consider the product you’re working on right now.

Slide 4

Slide 4 text

The Prime Directive Your first job must be to protect the physical and mental health of users and coworkers.

Slide 5

Slide 5 text

What kind of bugs do you want in your final, QA approved product?

Slide 6

Slide 6 text

“None?” People want to say “none,” but that’s setting a high bar to clear.

Slide 7

Slide 7 text

https://www.flickr.com/photos/filipbossuyt/21409291292/ Not impossible, though! Jumping over a bar 2 meters in the air isn’t easy, but it can be done if you’re prepared to work at it. It takes years of practice. You won’t have a lot of time to do much else. Can you wait years to ship your product? So if you want no defects, I’ll tell you how to do that. Cut most of your features. Then do it again.

Slide 8

Slide 8 text

80/20 80/20 rule for software: If you cut 80% of the features, maybe 20% of users will notice.

Slide 9

Slide 9 text

Most software has far too many features. This is the bottom of the third page of the fifth tab of the options dialog for the Java plug-in. If you select the highlighted option, this instructs Java to not install malware on your machine during security updates. Naturally, it’s de-selected by default. So there is always plenty of room to cut features.

Slide 10

Slide 10 text

MVP
• What must your product do (or you don’t have a product)?
• What else must it do to protect the mental and physical health of your employees and users?
• What else must be true for the two conditions above to hold?
How do you choose? Here’s a list of features you probably can’t compromise on. Everything else is a tradeoff between time and stability.

Slide 11

Slide 11 text

What kind of bugs do you want in your final, QA approved product? OK, once more: Stew on that a second while I talk about history a bit.

Slide 12

Slide 12 text

https://www.flickr.com/photos/10159247@N04/4335602802/ I promised to talk about unit testing, so let’s do that. In ancient times, programmer life was simple. Dinosaurs roamed the Earth, we didn’t write unit tests, and we employed people to manually and painstakingly find bugs for us.

Slide 13

Slide 13 text

And then we decided testing was good. And then people said we should test all the …. time. And from then on our software was perfectly reliable and secure. We can all go home. That’s the end of my presentation, thanks for coming…. I’m joking, but devs who care about QA at all often take a unit-test-focused (obsessed?) approach. Got quality concerns? Write more tests!

Slide 14

Slide 14 text

But there are other techniques! It turns out that fuzzing software makes security bugs jump out at you in a way that tests never will.

Slide 16

Slide 16 text

Now it turns out that static analysis makes resource leak bugs jump out at you in a way that tests never will. Now it turns out that model checking… Wait. This is getting complicated. What to do?

Slide 17

Slide 17 text

Agenda
• Safety for users and team
• Whole project quality
• Goal driven
• Works today
I’m interested in building correct software. Sometimes people start by writing this off as impossible. It’s easier to dismiss something as impossible than to ask if you can bite off a big chunk of it.
Whole project quality: not just individual pieces of testing piled on top of each other.
Goal driven: other techniques complement testing to find errors that unit tests can’t find.
Realistic: these methods are useful on real-world software, today.

Slide 18

Slide 18 text

https://www.flickr.com/photos/taylor-mcbride/3732682242/ It turns out the QA landscape is huge and there are some beautiful techniques available that you can combine to implement a realistic plan for achieving a desired level of quality. The biggest danger that will stop you from getting there is looking to just one technique, or a pile of techniques, to solve all your problems. Focus on the quality plan, not the mechanism. Another question: How is manual testing fundamentally different from unit tests, automated tests, etc.? (Anyone?)

Slide 19

Slide 19 text

Sometimes we think of manual testing as poking weird values into inputs. And hey, it works sometimes: Both Android and iPhone lock screens broken by “boredom testing.” But computers can do this faster. The best application for manual testing: What is something that computers can never do by themselves today?

Slide 20

Slide 20 text

https://medium.com/backchannel/how-technology-led-a-hospital-to-give-a-patient-38-times-his-dosage-ded7b3688558 No jokes here: This is life or death. This is an alert screen. Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. Contains the number “160” in 14 different places. Nurse clicked through this and patient received 38 1/2 tablets of an antibiotic. He’s fortunate to have survived. Use human testing for things only people can do!

Slide 21

Slide 21 text

https://lobste.rs/s/fdmbn5 For the rest of this presentation I’m going to talk about tests performed by a computer. For many people, unit tests are both a design methodology and the first line of defense against bugs. In some cases, though, it’s just a metric you hit to get a paycheck.

Slide 22

Slide 22 text

Let’s Write a parseInt!

    let parseInt (value: string) : int = ???

Because I’m an NIH developer, and because it’s a really simple example to play with, I’ll write my own parseInt. It’s simple, right? Maybe too simple to say anything worthwhile?

Slide 23

Slide 23 text

Test First!

    // The attribute name was lost in extraction; NUnit's [<Test>] is assumed here.
    [<Test>]
    member this.``Should parse 0``() =
        let actual = parseInt "0"
        actual |> should equal 0

But I believe in test first and TDD, so… What sort of tests do I need for parseInt? This looks like a good start. Of course, this test does not pass yet, because I haven't implemented the method. That failure is an important piece of information! If I can’t parse 0, my parseInt isn’t very good. So let's say that I go and implement some parseInt code. At least enough to make the test pass. Now, this test tells me very little about the correctness of the method. That's interesting! Implementing the method removed information from the system! That seems really weird, but…

Slide 24

Slide 24 text

Test First!

    // The attribute name was lost in extraction; NUnit's [<Test>] is assumed here.
    [<Test>]
    member this.``Should parse 0``() =
        let actual = parseInt "0"
        actual |> should equal 0

    [<Test>]
    member this.``Should parse 1``() =
        let actual = parseInt "1"
        actual |> should equal 1

Maybe I should add another test. Am I missing anything?

Slide 25

Slide 25 text

Test First!

    // The attribute name was lost in extraction; NUnit's [<Test>] is assumed here.
    [<Test>]
    member this.``Should parse -1``() =
        let actual = parseInt "-1"
        actual |> should equal -1

    [<Test>]
    member this.``Should parse with whitespace``() =
        let actual = parseInt " 123 "
        actual |> should equal 123

Slide 26

Slide 26 text

Test First!

    // The attribute name was lost in extraction; NUnit's [<Test>] is assumed here.
    [<Test>]
    member this.``Should parse +1 with whitespace``() =
        let actual = parseInt " +1 "
        actual |> should equal 1

    [<Test>]
    member this.``Should do ??? with freeform prose``() =
        let actual = parseInt "A QA engineer walks into a bar…"
        actual |> should equal ???

Anything else? null, MaxInt+1, non-%20 whitespace, MaxInt, MinInt, 1729? I’m starting to realize I have more questions than answers!

Slide 27

Slide 27 text

More Questions
• Is this for trusted or non-trusted input?

Slide 28

Slide 28 text

More Questions
• Is this for trusted or non-trusted input?
• Can I trust that my function will be invoked correctly?

Slide 29

Slide 29 text

More Questions
• Is this for trusted or non-trusted input?
• Can I trust that my function will be invoked correctly?
• What is the culture of the input?
1) Trusted = exception; untrusted = fail gracefully.
2) For a private method, maybe. For a library function, no! Need tests per invocation?
3) “,”, “$”, etc.?
It sounds like we might need a lot of tests. How many? Does it seem weird that we’re talking more about corner cases than “success?” Does this teeny little helper method really need to be perfect? I just wanna parse 123!

Slide 30

Slide 30 text

Getting one digit wrong really can get your company into the headlines. Also, what about security-sensitive code? Hashes, RNGs. Does it seem like test case suggestions focused on error cases? Even if 90% of the time we get expected input, I’m far more interested in the reasons which explain 90% of the failures.

Slide 31

Slide 31 text

Bad Error Handling Kills “Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software.” https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan The study only examined software designed for high reliability (Cassandra, HDFS, Hadoop…). “But it does suggest that top-down testing, say, using input and error injection techniques, will be challenged by the large input and state space.”

Slide 32

Slide 32 text

Simple Testing Can Prevent Most Critical Failures, Yuan et al. 92% of the time the catastrophe was caused not by the error itself but rather the combination of the error and then handling it incorrectly!

Slide 33

Slide 33 text

How Can I Be Completely Confident in a Simple Function? (Or at least do the right thing when it fails) (And also ensure it’s always called correctly) (Every. Single. Time) Let’s face it, this is the bare minimum first step for trusting an application, right? You might ask, “Why is this idiot rambling on about parseInt? I have 10 million lines of code to test.” I think it’s sometimes informative to start with the simplest thing which could possibly work.

Slide 34

Slide 34 text

Unit Tests
Are Great:
• Helping you think through bottom-up designs
• Preventing regressions
• Getting you to the point where at least something works
Not So Helpful:
• Showing overall design consistency (top-down)
• Finding security holes
• Proving correctness or totality of implementation
We can use techniques like strong typing, fuzzing, and formal methods to complement testing to give more control over code correctness. You will still need tests, but you’ll get much more “coverage” with fewer tests. Looking at the lists here, a theme emerges. To write a test, you needed a mental model of how your function should work. Having written the tests, however, you have thrown away the model. All that's left are the examples.

Slide 35

Slide 35 text

When My Test
Fails: I know I’ve found a bug (useful!)
Passes: I know my function works for at least one input out of billions (maybe useless?)
Does this make sense to everyone? Do you agree that a passing test doesn’t tell you much about the overall quality of the system? Is there a way to ensure we always get correct output for any input? Yes, but before we even get there, there’s a bigger problem we haven’t talked about yet.

Slide 36

Slide 36 text

How Can I Be Completely Confident When Composing Two Functions? (Composing two correct functions should produce the correct result.) (Every. Single. Time) Let’s face it, this is the bare minimum second step for trusting an application, right? More generally, I would like to be able to build complete, correct programs from a foundation of correct functions. Now verifying my 10 million lines of code is easy; start with correct functions, then combine them correctly!

Slide 37

Slide 37 text

    parseIntAndReturnAbsoluteValue = abs ∘ parseInt

If I have two good functions, like abs and parseInt, I would like to be able to combine them in order to produce a correct program. But there’s a problem: parseInt, as written, isn’t total (define). I can call it with strings which aren’t integers, and it’s really hard to use tests to ensure I call it correctly 100% of the time. How do I know it will always return something useful?

Slide 38

Slide 38 text

    let parseInt (str) = // implementation

One thing I need to do is ensure that people call my function passing a string as the argument, and that the thing it returns is actually an integer, in every case.

Slide 39

Slide 39 text

    let parseInt (value: string) : int = // implementation

That’s not too hard. I can prove this with the type system. As long as I don’t do anything which subverts the type system (unsafe casts, unchecked exceptions, null; or use a language which won’t allow it!), I can at least be sure I’m in the right ballpark. But how do I ensure I’m only passed a string representing an integer? Or should I? Can I force the caller to “do the right thing” and handle the error if they happen to pass a bad value?

Slide 40

Slide 40 text

    public static bool TryParse(string s, out int result) { ... }

Again, you can do it with the type system! I’m showing a C# example here, since the idiomatic F# solution is different.

Slide 41

Slide 41 text

    public static bool TryParse(string s, out int result) { ... }

    // appropriate when input is “trusted”
    int betterBeAnInt = ParseInt(input);

    // appropriate for untrusted input
    int couldBeAnInt;
    if (TryParse(input, out couldBeAnInt)) {
        // ...
    }

It is now difficult to invoke the function without testing success. You have to go out of your way. This probably eliminates the need to use tests to ensure that every case in which this function is invoked checks the success of the function. Consider input validation. Bad input is in the contract. Exceptions inappropriate. Instead of returning an integer, return an integer and a Boolean.
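The notes above mention that the idiomatic F# solution differs from the C# one. One common F# shape, sketched here as an assumption and not as the talk's own code, is to return an option, so the compiler pushes every caller to handle the failure case. The names `tryParseInt`, `tryParseAbs`, and `describe` are hypothetical; `Int32.TryParse` is the real .NET method, and this overload uses the current culture's number format.

```fsharp
open System

// Hypothetical wrapper (not from the talk): turn .NET's bool + out-parameter
// pattern into an option. F# exposes the out parameter as a tuple element.
let tryParseInt (value: string) : int option =
    match Int32.TryParse(value) with
    | true, number -> Some number
    | false, _ -> None

// Composition in the spirit of `abs ∘ parseInt`: the failure case flows
// through Option.map instead of being silently forgotten.
let tryParseAbs : string -> int option =
    tryParseInt >> Option.map abs

// A caller has to match on both cases to get at the integer.
let describe (input: string) : string =
    match tryParseAbs input with
    | Some number -> sprintf "parsed %d" number
    | None -> "not an integer"
```

The design choice is the same as with `TryParse`: bad input is part of the function's contract, so it appears in the return type and not as an exception.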

Slide 42

Slide 42 text

But There’s Still The Matter of That String Argument We can prove that we do the right thing when our parseInt correctly classifies a given input value as a legitimate integer and parses it, or rejects it as invalid, but how can we show that we do that correctly? Aren’t we back at square one? Types are super neat because you get this confidence essentially for free, and it never fails, but even the F# type system can’t make sure I return the right integer.

Slide 43

Slide 43 text

State Space
[Diagram: a box with two inputs, A and B, with the values 0 and 1]
In principle, your app, or your function, is a black box. Same input, same output. Easy to test, right? This application should have only two possible states! To be totally confident in your system you need to test, by some means, the entire state space (LoC discussion).

Slide 44

Slide 44 text

State Space
[Diagram: the same box with inputs A and B, now the strings “Hello” and “World”, plus a die]
It gets harder quickly. If my inputs are two strings instead of two bits, I now have considerably more possible test cases! (Click) In the real world, you have additional “inputs” like time and randomness, and whatever is in your database.

Slide 45

Slide 45 text

Formal Methods Using formal methods means the design is driven by a mathematical formalism. By definition, this is not test driven development, although you will probably still write tests. Formal methods are sometimes considered controversial in the software development community, because they acknowledge the existence and utility of math.

Slide 46

Slide 46 text

____ + 1234 ____ [ \t]*[+-]?[0-9]+[ \t]* It’s easier to use formal methods if there’s an off-the-shelf formalism you can use. For the problem of parsing, these exist! One way to reduce the input domain of the parseInt function from an untestably large number of potential states is to use a regular expression. This is not the sort of regular expression you might encounter in Perl or Ruby; it is a much more restricted syntax typically used on the front end of a compiler. The important point, here, is that we can reduce the number of potential states of the function to a number that you can count on your fingers.

Slide 47

Slide 47 text

[Diagram: finite state machine with five states, 0 through 4; transitions are labeled [ \t], [+-], and [0-9]]
REs convert to FSMs. States 3 and 4 are accepting. That is 4-5 states, 2 of them accepting, far fewer than “any possible string!”
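To make the conversion concrete, here is a minimal sketch of a state machine for the regular expression `[ \t]*[+-]?[0-9]+[ \t]*` from the previous slide. It is an illustration under assumptions, not the machine drawn on the slide: the state names are invented, the numbering does not follow the diagram, and an explicit `Rejected` state stands in for every missing transition.

```fsharp
// Hypothetical state names; two of them are accepting.
type State =
    | LeadingSpace   // only optional whitespace so far
    | AfterSign      // saw '+' or '-', a digit must follow
    | Digits         // at least one digit seen (accepting)
    | TrailingSpace  // whitespace after the digits (accepting)
    | Rejected       // no way back from here

let isBlank (c: char) : bool = c = ' ' || c = '\t'
let isDigit (c: char) : bool = c >= '0' && c <= '9'

// One transition of the machine.
let step (state: State) (c: char) : State =
    match state, c with
    | LeadingSpace, c when isBlank c -> LeadingSpace
    | LeadingSpace, ('+' | '-') -> AfterSign
    | (LeadingSpace | AfterSign | Digits), c when isDigit c -> Digits
    | (Digits | TrailingSpace), c when isBlank c -> TrailingSpace
    | _ -> Rejected

// Run the machine over the whole input and check for an accepting state.
// A null string is outside this sketch: Seq.fold would throw on it.
let accepts (input: string) : bool =
    match Seq.fold step LeadingSpace input with
    | Digits | TrailingSpace -> true
    | _ -> false
```

With the input domain reduced to a handful of states, each transition can be checked individually, which is not an option for "any possible string".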

Slide 48

Slide 48 text

Totality checking. Breaking my vow to avoid showing implementations. Lots of code here, but the important word is at the top. I’ve hesitated about showing implementations until now, but I can’t avoid it here, because… The proof is built into the implementation

Slide 49

Slide 49 text

When My Test
Fails: I know I’ve found a bug (useful!)
Passes: I know my function works for at least one input out of billions (maybe useless?)
When My Type Checker
Fails: I might have a bug (sometimes useful, sometimes frustrating)
Passes: There is a class of bugs which cannot exist (awesome!)
We can expand this chart now. Tests and types are not opponents; they complement each other. Where one succeeds, the other fails, and vice versa.

Slide 50

Slide 50 text

Property Based Testing Still, there are cases where it’s hard to use formal methods. Not every problem has an off-the-shelf formalism ready to use. But we don’t have to just give up and accept unit tests as the best we can do!

Slide 51

Slide 51 text

    let parsedIntEqualsOriginalNumber =
        fun (number: int) -> number = parseInt (number.ToString())

    > open FsCheck;;
    > Check.Quick parsedIntEqualsOriginalNumber;;
    Falsifiable, after 1 test (1 shrink) (StdGen (1764712907,296066647)):
    Original:
    -2
    Shrunk:
    -1

Can you state things about your system which will always be true? What must be true for my system to work? Looks like I have to do some work on my implementation here! Important: I didn’t have to specify the failing case, as I would with a unit test. FsCheck found it for me. In unit testing, you start with a mental model of the specification, and write your own tests. With property based testing, you write down the specification, and the tests are generated for you.
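To show what "the tests are generated for you" means at its simplest, here is a hand-rolled sketch of the idea. It is not FsCheck and it does no shrinking. The `parseInt` below is a stand-in that delegates to .NET (an assumption, since the talk's implementation is not shown), and `findCounterexample` is a hypothetical helper.

```fsharp
open System
open System.Globalization

// Stand-in for the talk's parseInt (an assumption): delegate to .NET,
// pinning the culture so that printing and parsing agree.
let parseInt (value: string) : int =
    Int32.Parse(value, CultureInfo.InvariantCulture)

// The property from the slide: print a number, parse it, get it back.
let roundTrips (number: int) : bool =
    number = parseInt (number.ToString(CultureInfo.InvariantCulture))

// Hypothetical helper: try boundary values first, then random ones, and
// return the first input that falsifies the property, if there is one.
let findCounterexample (property: int -> bool) (trials: int) (seed: int) : int option =
    let rng = Random(seed)
    let boundaries = [ 0; 1; -1; Int32.MinValue; Int32.MaxValue ]
    let randoms = Seq.init trials (fun _ -> rng.Next(Int32.MinValue, Int32.MaxValue))
    Seq.append boundaries randoms
    |> Seq.tryFind (fun number -> not (property number))
```

Calling `findCounterexample roundTrips 1000 42` searches for a falsifying input; an implementation that mishandles negative numbers, like the one in the FsCheck session above, would be caught by the `-1` boundary value.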

Slide 52

Slide 52 text

PBT: Great for helping to find bugs in specific routines.
Fuzzing: Great for finding unhandled errors in entire systems.

Slide 54

Slide 54 text

http://colin-scott.github.io/blog/2015/10/07/fuzzing-raft-for-fun-and-profit/ It often makes sense to write a custom fuzzer. It’s not hard, and the return is huge. This example is more similar to property based testing, since it uses the stated invariants from the Raft specification to test an implementation. (Fizil story)
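As a rough idea of how small a custom fuzzer can start out, here is a minimal sketch that throws random printable strings at a target and records every input that raises an exception. All names (`randomString`, `crashOf`, `fuzz`, `parseTarget`) are hypothetical, and the input generator is deliberately naive; a fuzzer for a real system would generate inputs shaped like that system's actual inputs.

```fsharp
open System

// Random printable-ASCII string of length 0..maxLength.
let randomString (rng: Random) (maxLength: int) : string =
    let length = rng.Next(0, maxLength + 1)
    let chars = Array.init length (fun _ -> char (rng.Next(32, 127)))
    String(chars)

// Run the target once; report the exception it raised, if any.
let crashOf (target: string -> unit) (input: string) : exn option =
    try
        target input
        None
    with ex -> Some ex

// Feed `trials` random inputs to the target and collect the crashes.
let fuzz (target: string -> unit) (trials: int) (seed: int) : (string * exn) list =
    let rng = Random(seed)
    List.init trials (fun _ -> randomString rng 8)
    |> List.choose (fun input ->
        crashOf target input |> Option.map (fun ex -> (input, ex)))

// Example target: rejecting bad input is expected behaviour, so those
// exceptions are swallowed; anything else escapes and is reported.
let parseTarget (input: string) : unit =
    try
        Int32.Parse(input) |> ignore
    with
    | :? FormatException | :? OverflowException -> ()
```

The target decides which failures are expected, so whatever `fuzz` returns is, by construction, an unhandled error.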

Slide 55

Slide 55 text

Runtime Validation Sometimes the most important value to test is the only one that matters to you at runtime. Assertions are a little under-used, because we tend to think of them as checking trivial things. But using the techniques of property-based testing, we can do end to end validation of our system.

Slide 56

Slide 56 text

    let input = " +123 "
    let number = parseInt input           // 123
    let test = number.ToString()          // "123"
    if test <> input                      // true!
    then
        let testNumber = parseInt test    // 123
        if number <> testNumber           // false (yay!)
        then failwith "Uh oh!"
    // We’re safe now! Use number…

Similar to property based testing
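The same check can be packaged so that every call site gets it. The following is a sketch under assumptions: `parseWithRoundTripCheck` is a hypothetical name, and the parser is passed in as an argument because the talk's own `parseInt` implementation is not shown.

```fsharp
open System
open System.Globalization

// Hypothetical wrapper: parse the input, then re-parse the canonical text
// of the result. If the two parses disagree, fail loudly instead of
// handing a suspect value to the rest of the program.
let parseWithRoundTripCheck (parseInt: string -> int) (input: string) : int =
    let number = parseInt input
    let canonical = number.ToString(CultureInfo.InvariantCulture)
    if canonical <> input && parseInt canonical <> number then
        failwithf "parseInt disagrees with itself on %s" canonical
    number

// Example use, with .NET's parser standing in for the talk's parseInt.
let netParse (value: string) : int =
    Int32.Parse(value, CultureInfo.InvariantCulture)

let number = parseWithRoundTripCheck netParse " +123 "
```

This costs one extra parse per call, which is the trade: a runtime check only covers the values that actually occur, but those are the values that matter in production.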

Slide 57

Slide 57 text

http://lefthandedgoat.github.io/canopy/ Integration testing should always be automated. Deals with coupling between systems not covered by type safety (DB, DOM, etc.). Use Canopy. Also: write integration test method.

Slide 58

Slide 58 text

The Quality Landscape
• Manual testing
• Integration tests
• Unit tests
• Runtime validation
• Property based testing
• Fuzzing
• Formal methods
• Static analysis
• Type systems
• Totality checking
The long and the short of it: Think big! Don’t “test all the ___ing time” because somebody told you to. Keep your eyes on the prize of software correctness. Ask yourself which things are most important to the overall quality of your system. Pick the tool(s) which give you the biggest return. Synopsis of each.

Slide 59

Slide 59 text

Craig Stuntz @craigstuntz http://blogs.teamb.com/craigstuntz http://www.meetup.com/Papers-We-Love-Columbus/