Slide 1

Slide 1 text

Mashing Up QA and Security. Craig Stuntz, Improving. https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/Fizil When I submit talk proposals, I ask myself two questions: 1) What will the audience take away from this talk, and 2) What's the weirdest stuff the conference will possibly accept? So I'm going to cover a lot of material, and some of it will seem quite strange. Please do feel free to raise your hand or just shout out a question. I'll be happy to slow down and explain. I'm not planning on leaving the last 10 minutes empty for questions, so do ask whenever you feel it would be helpful. My goal here is to leave you with food for thought. I think I'll have succeeded if you find at least one or two of the ideas here intriguing enough to want to research yourself next week.

Slide 2

Slide 2 text

https://www.flickr.com/photos/futureshape/566200801 Security is domain-specific QA, but the fields often seem worlds apart. Maybe it shouldn’t be that way?

Slide 3

Slide 3 text

QAs and security analysts are smart people

Slide 4

Slide 4 text

https://what-if.xkcd.com/49/ who are sometimes considered a bit strange or maybe even feared by developers and business types

Slide 5

Slide 5 text

but care very deeply about software correctness

Slide 6

Slide 6 text

Unfortunately, they go to totally different parties

Slide 7

Slide 7 text

Except CodeMash Can we take this opportunity to learn from each other?

Slide 8

Slide 8 text

Software Correctness. OK, spoiler alert! There are four core ideas I'm going to explore in this talk. Security and QA both explore software correctness. I'd like to add some precision to what we mean when we say that software is "insecure" or "buggy."

Slide 9

Slide 9 text

Manual Analysis. Manual analysis adds value when we deal with human-computer interaction. But you can do anything with manual analysis! Sometimes, stuff you shouldn't…

Slide 10

Slide 10 text

Undefined Behavior. Software often has unintended behaviors. Eliminating these helps security and quality, and can be automated much more commonly than generally suspected. We're going to talk about how to specify the behavior of real systems consistently. It is a truth universally acknowledged that deleting code is an excellent way of improving the quality and security of an application.

Slide 11

Slide 11 text

Implementing This Stuff. I used these ideas to build a tool with legs in both disciplines, with interesting results! Now, my code is not a magic fix for all the world's problems. It's an experiment. I hope you'll be inspired to try some experiments of your own. You'll probably do it better than I do.

Slide 12

Slide 12 text

I started down this rabbit hole years ago when I first learned about AFL. Has anyone here used it/heard of it? For a super simple technique, it delivers some pretty impressive results. We'll talk more about how it works later, but first I'll give an example of why I find it worthy of study.

Slide 13

Slide 13 text

I started down this rabbit hole years ago when I first learned about AFL. Has anyone here used it/heard of it? For a super simple technique, it delivers some pretty impressive results. We'll talk more about how it works later, but first I'll give an example of why I find it worthy of study.

Slide 14

Slide 14 text

20 TB SWF files from Google index https://security.googleblog.com/2011/08/fuzzing-at-scale.html Here’s one example of AFL in use. Google researchers took…

Slide 15

Slide 15 text

1 week run time on 2000 cores to find minimal set of 20000 SWF files https://security.googleblog.com/2011/08/fuzzing-at-scale.html They simply observe the execution paths — the traces from runtime profiling — to select from those 20 TB of SWF files a representative subset which maximizes the many different ways a file can be parsed.
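In code, that corpus distillation step might look something like this sketch; pathHashFor is a made-up name standing in for whatever run-and-trace machinery you have, so treat this as an illustration rather than Google's actual tooling:

    // Keep one representative input per distinct execution path.
    let distill (pathHashFor: byte[] -> uint64) (corpus: seq<byte[]>) : byte[] list =
        corpus
        |> Seq.fold (fun (seen, kept) input ->
            let hash = pathHashFor input
            if Set.contains hash seen
            then (seen, kept)                          // path already covered; drop the input
            else (Set.add hash seen, input :: kept))   // new path; keep this input
            (Set.empty, [])
        |> snd
        |> List.rev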

Slide 16

Slide 16 text

3 weeks run time on 2000 cores with mutated inputs https://security.googleblog.com/2011/08/fuzzing-at-scale.html …and then go to town. Take those inputs, fiddle with the bits, and test again

Slide 17

Slide 17 text

⇒ 400 unique crash signatures https://security.googleblog.com/2011/08/fuzzing-at-scale.html and you start finding lots of bugs

Slide 18

Slide 18 text

⇒ 106 distinct security bugs https://security.googleblog.com/2011/08/fuzzing-at-scale.html which turn out to be pretty serious. This was around 5 years ago and it's much more efficient today. As automated tests go, a month runtime on 2000 cores is fairly high. But with cloud computing this kind of infrastructure is available to anyone who needs it.

Slide 19

Slide 19 text

But Flash is a giant pile of C code, and human beings can’t write safe C code. Finding memory access and overflow bugs in a giant pile of C code is sort of pedestrian.

Slide 20

Slide 20 text

Also, the test is pretty simple: If the app crashes, there’s a bug (probably exploitable).

Slide 21

Slide 21 text

https://www.flickr.com/photos/sloth_rider/392367929 Sometimes the best way to understand something is to implement it yourself. Could we pick a harder problem?

Slide 22

Slide 22 text

https://commons.wikimedia.org/wiki/File:ACT_recycling_truck.jpg Can rule out a whole lot of routine C memory errors with managed code / garbage collection.

Slide 23

Slide 23 text

Can start with a system under test considered to be stable instead of Flash.

Slide 24

Slide 24 text

So I started a project….

Slide 25

Slide 25 text

Fizil vs. AFL
  Runs on Windows: Fizil ✅ / AFL: There's a fork
  Runs on Unix: Fizil ❌ / AFL ✅
  Fast: Fizil ❌ / AFL ✅
  Bunnies!: Fizil ❌
  Process models: Fizil: In Process, Out of Process / AFL: Fork Server, Out of Process
  Instrumentation guided: Fizil: Soon? / AFL ✅
  Automatic instrumentation: Fizil: .NET Assemblies / AFL: Clang, GCC, Python
  Rich suite of fuzzing strategies: Fizil: Getting there! / AFL ✅
  Automatically disables crash reporting: Fizil ✅ / AFL ❌
  Rich tooling: Fizil ❌ / AFL ✅
  Proven track record: Fizil ❌ / AFL ✅
  Stable: Fizil ❌ / AFL ✅
  License: Fizil: Apache 2.0 / AFL: Apache 2.0
I take a lot of inspiration from AFL, going so far as to port some of the AFL C code to F# in Fizil. But my goals are really different, and the two tools do very different things.

Slide 26

Slide 26 text

and eventually I was able to test real software

Slide 27

Slide 27 text

and eventually I was able to test real software

Slide 28

Slide 28 text

So naturally the first bug I found was in some giant pile of C code.

Slide 29

Slide 29 text

https://unsplash.com/search/bug?photo=emTCWiq2txk Later, I started finding actual bugs. Now at this point, some of you are probably thinking, “yeah, yeah, get on to the interesting parts already,” and some are probably wondering what fuzzing is

Slide 30

Slide 30 text

Fuzzing https://commons.wikimedia.org/wiki/File:Rabbit_american_fuzzy_lop_buck_white.jpg It won't take long to explain, so I'm going to give a quick overview so everyone can follow along. Fuzzing is a technique we associate with security research, but it's a special case of specification-based random testing, and it's useful for QA as well, but very much underutilized!

Slide 31

Slide 31 text

{ "a" : "bc" } Run a program or function, the system under test, with some input. I’m testing a JSON parser, so let’s start with something simple

Slide 32

Slide 32 text

A B D C E Observe the execution path. Like watching test code coverage, except we track the order in which lines are executed, not just whether or not they ever ran. We store a hash of the path.

Slide 33

Slide 33 text

A B D C E Observe the execution path. Like watching test code coverage, except we track the order in which lines are executed, not just whether or not they ever ran. We store a hash of the path.
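How that path hash might be computed, as a minimal sketch (my own illustration with made-up names, not Fizil's or AFL's exact code); hashing the ordered sequence of block ids means A-B-C and A-C-B produce different hashes:

    let hashPath (blockIds: seq<int>) : uint64 =
        // FNV-style fold over the trace; the order of the blocks changes the result.
        blockIds
        |> Seq.fold (fun acc blockId -> (acc * 1099511628211UL) ^^^ uint64 blockId) 14695981039346656037UL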

Slide 34

Slide 34 text

{ "a" : "bc" } ✅ For this input we note the parser terminates without crashing and indicates it’s valid JSON

Slide 35

Slide 35 text

{ "a" : "bc" } Alter that input by mutating the original and run again, still observing the path

Slide 36

Slide 36 text

{ "a" : "bc" } | Alter that input by mutating the original and run again, still observing the path

Slide 37

Slide 37 text

A B D C E This time, the path changes

Slide 38

Slide 38 text

A B D C E This time, the path changes

Slide 39

Slide 39 text

| "a" : "bc" } ❌ The program terminates without crashing, but marks the JSON as invalid. We always need a property to test after termination of the system under test In AFL, the most common property to test is whether the system under test crashes. JSON.NET doesn’t tend to crash, so I want to test whether it correctly validates JSON.

Slide 40

Slide 40 text

https://www.flickr.com/photos/29278394@N00/696701369 Do this a lot. Hundreds of thousands, preferably millions or more. Do it at a large enough scale and you’ll probably find interesting things. Any questions about how fuzzing works in general before I go on?

Slide 41

Slide 41 text

Impossible? Or just really, amazingly difficult? https://commons.wikimedia.org/wiki/File:Impossible_cube_illusion_angle.svg I'd like to talk about some things which security and quality assurance have in common. Here's one: They're obviously important, but widely considered to be impossible to finish.

Slide 42

Slide 42 text

When Is QA “Done”? Some people say this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.

Slide 43

Slide 43 text

When Can We Call a System “Secure”? Some people say this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.

Slide 44

Slide 44 text

https://xkcd.com/1316/ Part of why this is hard is that software keeps getting larger — both the apps and the OS they run on. Open source actually made this a lot worse — more parts, and it keeps changing. It has often been the case that a code base can grow larger than a team of humans can secure/QA, even using automation.

Slide 45

Slide 45 text

That now happens on an empty project. You do File -> New and you have megabytes of code that you're lucky if you can even run, never mind test.

Slide 46

Slide 46 text

When a task is too big and too repetitive for a human, we want to use a computer instead, even if it’s not obvious how to do it.

Slide 47

Slide 47 text

Exploratory https://dojo.ministryoftesting.com/lessons/exploratory-testing-an-api In both security and QA we do various sorts of automated testing and analysis and manual exploration and study. Good testers think a lot about where that line should be drawn. Can we use computers to automate exploratory testing? I think we have to explore this question, because contemporary software is too big to do 100% of exploratory testing manually. There will always be a need to test how humans react to software, but we must automate as much analysis as possible.

Slide 48

Slide 48 text

Security ⊇ QA? To try to start to answer these questions, let's back up and ask what we know about QA and security. And am I correct at all to describe security as domain-specific QA? What even is security? QA?

Slide 49

Slide 49 text

Behavior Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 50

Slide 50 text

Behavior Specification Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 51

Slide 51 text

Behavior Specification Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 52

Slide 52 text

Behavior Specification Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 53

Slide 53 text

Behavior Specification Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 54

Slide 54 text

Behavior Specification Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

Slide 55

Slide 55 text

People https://www.flickr.com/photos/wocintechchat/25677176162/ What sort of people work in QA and security jobs?

Slide 56

Slide 56 text

http://amanda.secured.org/in-securities-comic/ Characteristics of Security Pro, Michał Zalewski (Mee-how) https://lcamtuf.blogspot.com/2016/08/so-you-want-to-work-in-security-but-are.html Based on Parisa Tabriz's article, but boiled down to a list of 4 points I have time to cover here today. Infosec is all about the mismatch between our intuition and the actual behavior of the systems we build. Security is a protoscience. Think of chemistry in the early 19th century: a glorious and messy thing, and it will change! If you are not embarrassed by the views you held two years ago, you are getting complacent - and complacency kills. Walk in [the] shoes [of software engineers] for a while: write your own code, show it to the world, and be humiliated by all the horrible mistakes you will inevitably make.

Slide 57

Slide 57 text

https://www.quora.com/What-qualities-make-a-good-QA-engineer —Thomas Peham Characteristics of QA Similar to security pros, but heavy emphasis on communication. Do developers value quality?

Slide 58

Slide 58 text

Tools https://commons.wikimedia.org/wiki/File:Tools,_arsenical_copper,_Naxos,_2700%E2%80%932200_BC,_BM,_GR_1969.12-31,_142703.jpg You might think that QA and security tools would be really different,

Slide 59

Slide 59 text

but they can be surprisingly similar. Fiddler is used by people in both domains,

Slide 60

Slide 60 text

And ZAP is quite similar to Fiddler with some security-specific features.

Slide 61

Slide 61 text

Simple Testing https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf Some tools are remarkably effective! You can fix large classes of bugs using static analysis in your CI system, but nothing forces you to use it.

Slide 62

Slide 62 text

https://laurent22.github.io/so-injections/ You can prevent most SQLi via parameterized queries, but nothing forces you to use them… unless you run Coverity (which is free for open source) or Veracode or…
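For instance, a parameterized query in .NET looks roughly like this sketch (the connection, table, and column names are made up for illustration):

    open System.Data.SqlClient

    let findUserNames (conn: SqlConnection) (name: string) : string list =
        // The value travels as a parameter, never spliced into the SQL text,
        // so input like "'; DROP TABLE Users; --" stays harmless data.
        use cmd = new SqlCommand("SELECT Id, Name FROM Users WHERE Name = @name", conn)
        cmd.Parameters.AddWithValue("@name", name) |> ignore
        use reader = cmd.ExecuteReader()
        [ while reader.Read() do yield reader.GetString(1) ]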

Slide 63

Slide 63 text

While it’s true you can squash large classes of defects and security issues with simple fixes, you won’t get to 100% this way. Defects can be introduced downstream, even at the microcode level, and people can make intentional tradeoffs away from quality. Intel used to spend effort on formal validation of microcode. Rumor has it that’s been drastically scaled back

Slide 64

Slide 64 text

https://en.wikipedia.org/wiki/File:Row_hammer.svg We see security bugs in hardware, too, like the rowhammer attack on physical memory. (Explain?) So some specific problems are easy to fix, but making an entire system correct and secure is much harder. Still, it does seem like we ought to at least pick the low-hanging fruit.

Slide 65

Slide 65 text

Specifications https://lorinhochstein.wordpress.com/2014/06/04/crossing-the-river-with-tla/ Specifications. To say whether software behaves correctly, we have to define what is correct. QAs are sometimes cynical of claims for specifications, since we've been promised they're the means to automatically create perfect software and, well, we're not there yet. Specifying a system is not simple and most people aren't good at it. Does that mean they're a waste of time? We can't talk about the correctness of a system without a specification. Specifications aren't a royal road to correctness; they're an obligation, a precondition of any assertion of quality.

Slide 66

Slide 66 text

[<Test>]
let testReadDoubleWithExponent() =
    let actual = parseString "10.0e1"
    actual |> shouldEqual (Parsed (JsonNumber "10.0e1"))
But "specification" sounds formal and academic and probably we don't generally get these handed to us on a silver platter. What does it mean in the real world? Well, a specification is simply something which must always be true. A unit test is a simple kind of spec. It's useful because it's enforced by the CI build. It's limited because it's only one example, but much better than nothing!

Slide 67

Slide 67 text

let toHexString (bytes: byte[]) : string = //... Another example of something which is always true is a type signature. In F#, this function cannot return null, ever. That’s really useful! That single feature of the type system eliminates — for real — about a quarter of the bugs I see in production C#, Java, and JavaScript systems I’m asked to maintain. More specs: This function is invertible
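"Invertible" is itself a property we can write down and check. A sketch with FsCheck, assuming a hypothetical fromHexString inverse that isn't shown on the slide:

    open FsCheck

    // Round-trip property: decoding what we just encoded gives back the original bytes.
    let hexRoundTrip (bytes: byte[]) =
        fromHexString (toHexString bytes) = bytes

    // Check.Quick hexRoundTrip   // FsCheck would run this against 100 random byte arrays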

Slide 68

Slide 68 text

http://d3s.mff.cuni.cz/research/seminar/download/2010-02-23-Tobies-HypervisorVerification.pdf Formal methods combine specifications with theorem provers to demonstrate when specifications do or do not always hold in the code. And they work! I’ll give examples later on in the talk.

Slide 69

Slide 69 text

Best Known Example: QuickCheck (Released: 1999)
  Informal Spec: "Reversing a list twice should result in the same list"
  Formal Spec: prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int]
  Execute: Main> quickCheck prop_RevRev
           OK, passed 100 tests.
Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher

Slide 70

Slide 70 text

Best Known Example: QuickCheck (Released: 1999)
  Informal Spec: "Reversing a list twice should result in the same list"
  Formal Spec: prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int]
  Execute: Main> quickCheck prop_RevRev
           OK, passed 100 tests.
Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher

Slide 71

Slide 71 text

Best Known Example: QuickCheck (Released: 1999)
  Informal Spec: "Reversing a list twice should result in the same list"
  Formal Spec: prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int]
  Execute: Main> quickCheck prop_RevRev
           OK, passed 100 tests.
Best Known Example: AFL (Released: 2007)
  Informal Spec: "System under test shouldn't crash no matter what I pass to it"
  Formal Spec: if (WIFSIGNALED(status) && !stop_soon) { kill_signal = WTERMSIG(status); return FAULT_CRASH; }
  Execute: ./afl-fuzz -i testcase_dir -o findings_dir -- /path/to/tested/program [...program's cmdline...]
Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher

Slide 72

Slide 72 text

Best Known Example: QuickCheck (Released: 1999)
  Informal Spec: "Reversing a list twice should result in the same list"
  Formal Spec: prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int]
  Execute: Main> quickCheck prop_RevRev
           OK, passed 100 tests.
Best Known Example: AFL (Released: 2007)
  Informal Spec: "System under test shouldn't crash no matter what I pass to it"
  Formal Spec: if (WIFSIGNALED(status) && !stop_soon) { kill_signal = WTERMSIG(status); return FAULT_CRASH; }
  Execute: ./afl-fuzz -i testcase_dir -o findings_dir -- /path/to/tested/program [...program's cmdline...]
Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher

Slide 73

Slide 73 text

https://www.flickr.com/photos/x1brett/2279939232 Unfortunately, having a method for showing when specifications hold makes it really obvious that producing coherent specs for your software isn’t always easy! Very easy to add requirements until they contradict each other

Slide 74

Slide 74 text

https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d Even simple specifications can sometimes be difficult to prove. PRNG from Safari on left, old version of Chrome on right. Obviously, there's an issue. Specification well-understood, but impossible to write a test which passes 100% of the time for a perfect PRNG. We need a spec for what our system should do, but that doesn't automatically make it testable

Slide 75

Slide 75 text

Thought Experiment: What If Automated Tests Were Perfect? But let's say we solved all those problems. You could prove a software system perfectly conformed to the spec. Let's say the spec was also right. Problem solved? QA = spec + people

Slide 76

Slide 76 text

This is life or death. This is an alert screen. Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. (click) Contains the number “160” (or “6160”) in 14 different places. None are “wrong” values. Nurse clicked through this and patient received 38 1/2 times the correct dose of an antibiotic. He’s fortunate to have survived. dark patterns

Slide 77

Slide 77 text

This is life or death. This is an alert screen. Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. (click) Contains the number “160” (or “6160”) in 14 different places. None are “wrong” values. Nurse clicked through this and patient received 38 1/2 times the correct dose of an antibiotic. He’s fortunate to have survived. dark patterns

Slide 78

Slide 78 text

What If Security Analysis Tools Were Perfect? You could prove edge cases outside the spec did not exist. Let's say the spec was also right. No SQLi, no overflows, no… anything. Problem solved?

Slide 79

Slide 79 text

–DNI James Clapper “Something like 90 percent of cyber intrusions start with phishing… Somebody always falls for it.” https://twitter.com/ODNIgov/status/776070411482193920 Security = spec + people. People who open phishing emails aren’t even dumb. You have someone in HR, reviewing resumes, and so an essential part of their job is opening PDFs and Word files emailed to them by clueless, anonymous people on the Internet, right?

Slide 80

Slide 80 text

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45366.pdf If you display a security indication to a human being, how will they react? Google and UC Berkeley have conducted research on user response to security warnings

Slide 81

Slide 81 text

The same team also produced an opt-in Chrome extension which would show you a survey if you clicked through a security warning. Knowing how humans react to security UI is a security-critical test!

Slide 82

Slide 82 text

Manual Testing
  Examples: Exploratory testing, Binary analysis
  Effort: Very high
  Killer App: Finding cases where code technically correct but fails at human-computer interaction
  Major Disadvantage: Often misused
OK, so proof that some piece of software conforms to a specification isn't sufficient to call it perfect in terms of quality or security, but could we even have that proof? Let's consider what we can do today, and the return on that effort.

Slide 83

Slide 83 text

Dynamic Analysis
  Examples: QuickCheck, AFL, sqlmap
  Effort: Low
  Killer App: More like an app killer, amiright?
  Major Disadvantage: Tends to find a few specific (though important!) bugs

Slide 84

Slide 84 text

Static Analysis
  Examples: FxCop, FindBugs, Coverity, Veracode
  Effort: Very low
  Killer App: Cheaper than air. Just do it.
  Major Disadvantage: Limited to finding a few hundred important kinds of bugs

Slide 85

Slide 85 text

Formal Verification / Symbolic Execution
  Examples: VCC, TLA+, Cryptol
  Effort: High effort but correspondingly high return
  Killer App: MiTLS, Hyper-V Memory Manager
  Major Disadvantage: Hard to find people with skill set

Slide 86

Slide 86 text

Program Synthesis
  Examples: Nothing off the shelf, really, but Agda and Z3 help
  Effort: PhD-level research
  Killer App: Elimination of incidental complexity
  Major Disadvantage: Doesn't really exist in general form
This may be the future of software. When you write code today you're writing an informal specification in a language designed for general control of hardware. What if you wrote in a language better suited to your problem domain? This is not a DSL, as we think of it today, because DSLs don't verify the consistency and satisfiability of specifications.

Slide 87

Slide 87 text

For developers, the message here is: Don’t fix the bug. When you receive a defect report from a security or QA team, don’t fix it. Add something to your process, like static analysis, which makes the entire class of defect impossible. Then fix everything it catches.

Slide 88

Slide 88 text

That’s all a bunch of words and hand-waving. Let’s bring this down to earth, shall we?

Slide 89

Slide 89 text

How Amazon Web Services Uses Formal Methods "Formal methods are a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found through any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality." http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html First, this stuff works in the real world, today. The scale at which customers use AWS services is far too large for even Amazon to comprehensively test. They work around this limitation by formally specifying all of their protocols in TLA+

Slide 90

Slide 90 text

Researchers at INRIA and Microsoft Research used a dependently typed language called F* to implement TLS and compare the behavior of the formally verified implementation to others on fuzzed input. Six major vulnerabilities so far, including Triple Handshake, FREAK, and Logjam

Slide 91

Slide 91 text

Researchers at INRIA and Microsoft Research used a dependently typed language called F* to implement TLS and compare the behavior of the formally verified implementation to others on fuzzed input. Six major vulnerabilities so far, including Triple Handshake, FREAK, and Logjam

Slide 92

Slide 92 text

“Finding and Understanding Bugs in C Compilers,” Yang et al. https://www.flux.utah.edu/paper/yang-pldi11 It's possible to write a correct C compiler, but the best C developers in the world can't do it in C. You can’t test every possible C program with ad hoc test suites. Yang et al. used “smart” fuzzing of the C language to cover the maximum surface area of a compiler. They report: “Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites— the main way that compilers are tested—are an inadequate mechanism for quality control.” The only C compiler to survive this kind of testing was CompCert, which is formally verified in Coq. (Rust story?)

Slide 93

Slide 93 text

These are phenomenally effective tools, and surprisingly under-utilized in industry. It turns out, you can try this at home! What I Learned Writing a .NET Fuzzer

Slide 94

Slide 94 text

===================================
Technical "whitepaper" for afl-fuzz
===================================
This document provides a quick overview of the guts of American Fuzzy Lop. See README for the general instruction manual; and for a discussion of motivations and design goals behind AFL, see historical_notes.txt.
0) Design statement
-------------------
American Fuzzy Lop does its best not to focus on any singular principle of operation and not be a proof-of-concept for any specific theory. The tool can be thought of as a collection of hacks that have been tested in practice, found to be surprisingly effective, and have been implemented in the simplest, most robust way I could think of at the time. Many of the resulting features are made possible thanks to the availability of lightweight instrumentation that served as a foundation for the tool, but this mechanism should be thought of merely as a means to an end. The only true governing principles are speed, reliability, and ease of use.
1) Coverage measurements
------------------------
The instrumentation injected into compiled programs captures branch (edge) coverage, along with coarse branch-taken hit counts. The code injected at branch points is essentially equivalent to:
  cur_location = <COMPILE_TIME_RANDOM>;
  shared_mem[cur_location ^ prev_location]++;
  prev_location = cur_location >> 1;
http://lcamtuf.coredump.cx/afl/technical_details.txt
So as I said earlier, I love the concept of AFL, and especially the large amount of documentation that Michał Zalewski has written explaining the design choices and implementation details Now I want to explain my own journey
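Translating that edge-counting idea to .NET, a minimal F# sketch looks roughly like this; the names and the 64 KB map size are my assumptions for illustration, not Fizil's actual implementation:

    let mapSize = 65536
    let sharedMem : byte[] = Array.zeroCreate mapSize   // coverage map shared with the fuzzer process
    let mutable prevLocation = 0

    // Called from every instrumented block; compileTimeRandom identifies the block.
    let trace (compileTimeRandom: int) =
        let index = (compileTimeRandom ^^^ prevLocation) % mapSize
        sharedMem.[index] <- sharedMem.[index] + 1uy     // bump the edge's hit count
        prevLocation <- compileTimeRandom >>> 1          // shift so A->B and B->A hash to different edges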

Slide 95

Slide 95 text

Memory. AFL often finds memory crashes. You get this almost free. We know how to almost entirely eliminate these. In a managed language…. But the idea is still good. Can we find other kinds of errors as "easily" as memory-related crashes? This turned out to be quite a long road.

Slide 96

Slide 96 text

{ "a" : "bc" } But you have to start somewhere. Find a corpus. I started with. It’s valid JSON! Can we parse this?

Slide 97

Slide 97 text

let jsonNetResult =
    try
        JsonConvert.DeserializeObject(str) |> ignore   // ← Test
        Success
    with
    | :? JsonReaderException as jre -> jre.Message |> Error
    | :? JsonSerializationException as jse -> jse.Message |> Error
    | :? System.FormatException as fe ->               // ← Special case error stuff
        if fe.Message.StartsWith("Invalid hex character") // hard coded in Json.NET
        then fe.Message |> Error
        else reraise()
I had to write a short program to run the deserializer, which I'll call the test harness

Slide 98

Slide 98 text

use proc = new Process()
// ← Set up
proc.StartInfo.FileName <- executablePath
inputMethod.BeforeStart proc testCase.Data
proc.StartInfo.UseShellExecute <- false
proc.StartInfo.RedirectStandardOutput <- true
proc.StartInfo.RedirectStandardError <- true
proc.StartInfo.EnvironmentVariables.Add(SharedMemory.environmentVariableName, sharedMemoryName)
let output = new System.Text.StringBuilder()
let err = new System.Text.StringBuilder()
// ← Read results
proc.OutputDataReceived.Add(fun args -> output.Append(args.Data) |> ignore)
proc.ErrorDataReceived.Add (fun args -> err.Append(args.Data) |> ignore)
proc.Start() |> ignore
inputMethod.AfterStart proc testCase.Data
proc.BeginOutputReadLine()
proc.BeginErrorReadLine()
proc.WaitForExit()
let exitCode = proc.ExitCode
let crashed = exitCode = WinApi.ClrUnhandledExceptionCode   // ← Important bit
And another program to read the input data and execute the test harness executable and then see if it succeeded Pretty simple, so far. There's a lot of code here. Don't worry about the details. Setup, execute, collect data. The code is on my GitHub and I gave you the link at the beginning of the slides if you want to look deeper. But my original sample input wasn't very interesting.

Slide 99

Slide 99 text

Then I stood on the shoulders of giants. Turns out lots of people (well, two) like to collect problematic JSON. Now I have about 200 good test cases. But I want hundreds of thousands.

Slide 100

Slide 100 text

/// An ordered list of functions to use when starting with a single piece of
/// example data and producing new examples to try
let private allStrategies = [
    bitFlip 1
    bitFlip 2
    bitFlip 4
    byteFlip 1
    byteFlip 2
    byteFlip 4
    arith8
    arith16
    arith32
    interest8
    interest16
]
So I wrote a bunch of ways to transform that input into new cases This list is just copied from AFL

Slide 101

Slide 101 text

let totalBits = bytes.Length * 8
let testCases = seq {
    for bit = 0 to totalBits - flipBits do
        let newBytes = Array.copy bytes
        let firstByte = bit / 8
        let firstByteMask, secondByteMask = bitMasks(bit, flipBits)
        let newFirstByte = bytes.[firstByte] ^^^ firstByteMask   // Fuzz one byte; ^^^ means xor
        newBytes.[firstByte] <- newFirstByte
        let secondByte = firstByte + 1
        if secondByteMask <> 0uy && secondByte < bytes.Length then
            let newSecondByte = bytes.[secondByte] ^^^ secondByteMask
            newBytes.[secondByte] <- newSecondByte
        yield newBytes
}
And I translated the AFL fuzz C code into F# So now I have a bunch of test cases, but I need to understand them. If I have an input and I flip one bit, maybe that's a valuable new test case, or more likely it's totally useless. How do I know?

Slide 102

Slide 102 text

https://commons.wikimedia.org/wiki/File:CPT-Recursion-Factorial-Code.svg I need to trace all of the call stacks executed during the test. I’m looking for tests which produce new sequences of stacks.
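In fuzzer terms, the interesting inputs are the ones whose trace hashes we haven't seen before; a minimal sketch (names are mine, not Fizil's):

    let mutable seenPaths : Set<uint64> = Set.empty

    // Keep only the test cases that exercise a code path we haven't observed yet.
    let isInteresting (pathHash: uint64) : bool =
        if Set.contains pathHash seenPaths
        then false
        else
            seenPaths <- Set.add pathHash seenPaths
            true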

Slide 103

Slide 103 text

private static void F(string arg)
{
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
}
How do I know which call stacks happen during testing? Instrumentation. At the simplest level, we want to turn this:

Slide 104

Slide 104 text

private static void F(string arg)
{
    instrument.Trace(29875);   // ← Random number
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
}
into this. Which is to say, whenever we enter some block, inform an external observer what just happened. In order to inject this code, I have a couple of choices

Slide 105

Slide 105 text

private static void F(string arg)
{
#if MANUAL_INSTRUMENTATION
    instrument.Trace(29875);
#endif
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
}
I could manually add the instrumentation, which is painful

Slide 106

Slide 106 text

Or I could write a profiler. The .NET framework provides a profiling API, but it requires hosting the runtime and is annoying in other ways

Slide 107

Slide 107 text

An easier solution is Mono.Cecil, which is sort of like .NET reflection except you can modify .NET assemblies without loading them into your process first.
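A minimal sketch of that workflow; ReadAssembly, MainModule.Types, and Write are the Mono.Cecil calls I rely on, while the function name, paths, and the printfn placeholder are mine:

    open Mono.Cecil

    // Read an assembly from disk without loading it for execution,
    // walk its methods, and write a modified copy back out.
    let rewrite (inputPath: string) (outputPath: string) =
        let assembly = AssemblyDefinition.ReadAssembly(inputPath)
        for typeDef in assembly.MainModule.Types do
            for methodDef in typeDef.Methods do
                if methodDef.HasBody then
                    // instrumentation would be injected here (see the ILProcessor code later)
                    printfn "would instrument %s" methodDef.FullName
        assembly.Write(outputPath)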

Slide 108

Slide 108 text

let stringify (ob: obj) : string = JsonConvert.SerializeObject(ob) I want to be able to work with any .NET executable, so I instrument binaries instead of source. So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That’s complicated, so don’t worry much about it. The important thing is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs

Slide 109

Slide 109 text

let stringify (ob: obj) : string = JsonConvert.SerializeObject(ob)

// Method: System.String\u0020Program::stringify(System.Object)
.body stringify {
    arg_02_0 [generated]
    arg_07_0 [generated]
    nop()
    arg_02_0 = ldloc(ob)
    arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
    ret(arg_07_0)
}
I want to be able to work with any .NET executable, so I instrument binaries instead of source. So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That's complicated, so don't worry much about it. The important thing is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs

Slide 110

Slide 110 text

let stringify (ob: obj) : string = JsonConvert.SerializeObject(ob)

// Method: System.String\u0020Program::stringify(System.Object)
.body stringify {
    arg_02_0 [generated]
    arg_07_0 [generated]
    nop()
    arg_02_0 = ldloc(ob)
    arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
    ret(arg_07_0)
}

// Method: System.String\u0020Program::stringify(System.Object)
.body stringify {
    arg_05_0 [generated]
    arg_0C_0 [generated]
    arg_11_0 [generated]
    arg_05_0 = ldc.i4(23831)
    call(Instrument::Trace, arg_05_0)
    nop()
    arg_0C_0 = ldloc(ob)
    arg_11_0 = call(JsonConvert::SerializeObject, arg_0C_0)
    ret(arg_11_0)
}
I want to be able to work with any .NET executable, so I instrument binaries instead of source. So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That's complicated, so don't worry much about it. The important thing is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs

Slide 111

Slide 111 text

So what do we have so far? A ton of inputs, and lots of data about how the system under test behaves with them. Unfortunately, it doesn’t tell me very much. The only thing I was able to crash with this method was

Slide 112

Slide 112 text

…Windows. Or maybe the Parallels memory manager. Not what I was after. But this was a rather difficult problem to work around. It’s hard to find bugs in an application when your OS is blue screening!

Slide 113

Slide 113 text

http://www.json.org/ I need a way to determine if Json.NET is parsing the JSON correctly. So I thought I should write a JSON validator to check its behavior. Fortunately, there’s a standard! “Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.” -Crockford

Slide 114

Slide 114 text

https://tools.ietf.org/html/rfc4627 So, naturally, there’s also another JSON standard

Slide 115

Slide 115 text

http://www.ecma-international.org/ecma-262/5.1/#sec-15.12 And another JSON standard

Slide 116

Slide 116 text

http://www.ecma-international.org/publications/standards/Ecma-404.htm And another JSON standard

Slide 117

Slide 117 text

https://tools.ietf.org/html/rfc7158 And another JSON standard

Slide 118

Slide 118 text

https://tools.ietf.org/html/rfc7159 And another JSON standard. And no, they don’t all agree on everything, nor is there a single, “latest” version. Despite this multitude of standards, there are still edge cases intentionally delegated to the implementer — what we would call “undefined behavior” in C.

Slide 119

Slide 119 text

https://github.com/nst/STJSON I was going to write my own validator, but… Nicolas Seriot wrote a validator called STJSON which attempts to synthesize these as much as possible.

Slide 120

Slide 120 text

https://github.com/CraigStuntz/Fizil/blob/master/StJson/StJsonParser.fs Swift doesn’t readily compile to Windows, but if you squint hard enough it kind of looks like F#, so I ported the code and used it to validate Json.NET's behavior.

Slide 121

Slide 121 text

Standard Accepts, Json.NET Rejects
  Value: 88888888888888888888888888888888888888888888888888
         88888888888888888888888888888888888888888888888888
         88888888888888888888888888888888888888888888888888
         88888888888888888888888888888888888888888888888888
         88888888888888888888888888888888888888888888888888
  Standard Says: No limit
  Json.NET: MaximumJavascriptIntegerCharacterLength = 380;
Things JSON.NET fails on that the standard accepts

Slide 122

Slide 122 text

Standard Rejects, Json.NET Accepts
  Value: [,,,]
  Standard Says: A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false null true
  Json.NET: [null, null, null, null]
Things JSON.NET succeeds on that the standard rejects

Slide 123

Slide 123 text

Implementation Details. I have lots of interesting stories from implementing this code, but some of them get kind of low level, so I'll share just a couple I think are of general interest. Please do feel free to pull my code or reach out to me if you want complete details!

Slide 124

Slide 124 text

let private insertTraceInstruction(ilProcessor: ILProcessor, before: Instruction, state) =
    let compileTimeRandom = state.Random.Next(0, UInt16.MaxValue |> Convert.ToInt32)
    let ldArg = ilProcessor.Create(OpCodes.Ldc_I4, compileTimeRandom)
    let callTrace = ilProcessor.Create(OpCodes.Call, state.Trace)
    ilProcessor.InsertBefore(before, ldArg)
    ilProcessor.InsertAfter (ldArg, callTrace)
This margin is too narrow to contain a try/finally example, so see: https://goo.gl/W4y7JH
Inserting the IL instructions I needed was fairly easy. Here is the important bit of the code which does it. How did I learn how to write this? I instrumented a small program "manually" by writing the instrumentation code myself, and then decompiled that program to figure out which IL instructions I needed. Inserting them with Mono.Cecil is just a few lines of code. try/finally is much, much harder. I won't even try to walk you through it here. Look at the GitHub repo if you want to see how it's done.

Slide 125

Slide 125 text

Strong naming was a consistent pain for me. I’m altering the binaries of assemblies, and part of the point of strong naming is to stop you from doing just that, so naturally if the assembly is strongly named it can’t be loaded when I’m finished.

Slide 126

Slide 126 text

let private removeStrongName (assemblyDefinition : AssemblyDefinition) =
    let name = assemblyDefinition.Name;
    name.HasPublicKey <- false;
    name.PublicKey <- Array.empty;
    assemblyDefinition.Modules
    |> Seq.iter (fun moduleDefinition ->
        moduleDefinition.Attributes <- moduleDefinition.Attributes &&& ~~~ModuleAttributes.StrongNameSigned)
    let aptca = assemblyDefinition.CustomAttributes.FirstOrDefault(
                    fun attr -> attr.AttributeType.FullName = typeof<AllowPartiallyTrustedCallersAttribute>.FullName)
    assemblyDefinition.CustomAttributes.Remove aptca |> ignore
    assembly.MainModule.AssemblyReferences
    |> Seq.filter (fun reference -> Set.contains reference.Name assembliesToInstrument)
    |> Seq.iter (fun reference -> reference.PublicKeyToken <- null)
So I need to remove the strong name from any assembly I fuzz, but I also need to remove the PublicKeyToken from any other assembly which references it. Doing this in Mono.Cecil is not well-documented, and after quite a bit of time spent in GitHub issues and trial and error I figured out that it takes 5 distinct steps to do this.

Slide 127

Slide 127 text

In / Out of Process. In order to stop Windows/Parallels from crashing I decided to try and fuzz the system under test in process with the fuzzer itself to reduce the number of processes I was creating. This worked, but had a surprising result: The number of distinct paths through the code I found during testing changed. Why? Isn't it running the same code with the same inputs? Yes, absolutely! But when you run a function, the .NET framework doesn't guarantee that the same instructions will be executed each time. Let's look at why.

Slide 128

Slide 128 text

So my test harness is going to execute some code, and I’ve instrumented that code, which means I’ll get a series of trace events as the code runs. I might expect these events to come in the same order for executing a single function.

Slide 129

Slide 129 text

But that’s not what happens. This is an actual trace of the method I just showed you, and it doesn’t happen at the same time. Why?

Slide 130

Slide 130 text

–ECMA-335, Common Language Infrastructure (CLI), Partition I “If marked BeforeFieldInit then the type’s initializer method is executed at, or sometime before, first access to any static field defined for that type.” The .NET CLR guarantees a type initializer will be invoked before the type is used, but it doesn’t specify exactly when.

Slide 131

Slide 131 text

f(x) = f(x) ✅
time(f(x)) != time(f(x)) ❌
For many methods, the CLR guarantees they'll return the same result for the same argument, but does not guarantee the same instructions will be executed in the same order to get that result. What does this tell us about QA? What does this tell us about security? (Any thoughts on that?)

Slide 132

Slide 132 text

Unicode
  Original JSON: { "a": "bc" }
  ASCII Bytes: 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D
  UTF-8 with Byte Order Mark: EF BB BF 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D
  UTF-16 BE with BOM: FE FF 00 7B 00 20 00 22 00 61 00 22 00 20 00 3A 00 20 00 22 00 62 00 63 00 22 00 20 00 7D
What does it mean to fuzz JSON? Which of these byte arrays should I fuzz? Web uses UTF-8. Browsers use WTF-8. Windows, C#, and JSON.NET like UTF-16. We must choose an encoding for our test corpus and then choose whether to convert or just fuzz as-is
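To make the table concrete, this is roughly how those byte arrays arise in .NET (a sketch; the BOM handling is the part that usually surprises people):

    open System.Text

    let json = """{ "a": "bc" }"""

    let asciiBytes     = Encoding.ASCII.GetBytes(json)                                                                      // 7B 20 22 61 ...
    let utf8WithBom    = Array.append (Encoding.UTF8.GetPreamble()) (Encoding.UTF8.GetBytes(json))                          // EF BB BF 7B ...
    let utf16BeWithBom = Array.append (Encoding.BigEndianUnicode.GetPreamble()) (Encoding.BigEndianUnicode.GetBytes(json))  // FE FF 00 7B ...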

Slide 133

Slide 133 text

Thank You!
  Presentation Review: Cassandra Faris, Chad James, Damian Synadinos, Doug Mair, Tommy Graves
  Source Code Inspiration: Michał Zalewski, Nicolas Seriot, Everyone Who Works on dnSpy & Mono.Cecil
Finally…

Slide 134

Slide 134 text

No content

Slide 135

Slide 135 text

Craig Stuntz @craigstuntz [email protected] http://www.craigstuntz.com http://www.meetup.com/Papers-We-Love-Columbus/ https://speakerdeck.com/craigstuntz