Mashing Up QA and Security - CodeMash 2017 - with notes

This version of the deck has speaker notes. I've published a separate version without notes.

Security is domain-specific quality assurance, but developers, testers, and security professionals often don’t work together. When this type of disconnect exists between big groups of people who are very good at their jobs, there is usually a mostly untapped potential for learning. I’ve been exploring this landscape by writing an open source fuzzer aimed at discovering new test cases (not just crashes!) using binary rewriting of managed executables and genetic modification of a test corpus, implemented in F# and using Mono.Cecil. I’ll contrast the fundamentals of each discipline, demonstrate tools used by experts on both sides of the security and QA fence, and challenge the audience to find new ways to mix them up. Expect to see lots of code and leave with ideas for making entire communities better, not just your own team!

Craig Stuntz

January 13, 2017

Transcript

  1. Mashing Up QA and Security. Craig Stuntz, Improving. https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/Fizil When I submit talk proposals, I ask myself two questions: 1) What will the audience take away from this talk, and 2) What’s the weirdest stuff the conference will possibly accept? So I’m going to cover a lot of material, and some of it will seem quite strange. Please do feel free to raise your hand or just shout out a question. I’ll be happy to slow down and explain. I’m not planning on leaving the last 10 minutes empty for questions, so do ask whenever you feel it would be helpful. My goal here is to leave you with food for thought. I think I’ll have succeeded if you find at least one or two of the ideas here intriguing enough to want to research yourself next week.
  2. https://www.flickr.com/photos/futureshape/566200801 Security is domain-specific QA, but the fields often seem

    worlds apart. Maybe it shouldn’t be that way?
  3. QAs and security analysts are smart people

  4. https://what-if.xkcd.com/49/ who are sometimes considered a bit strange or maybe

    even feared by developers and business types
  5. but care very deeply about software correctness

  6. Unfortunately, they go to totally different parties

  7. Except CodeMash Can we take this opportunity to learn from

    each other?
  8. Software Correctness <spoilers> OK, spoiler alert! There are four core ideas I’m going to explore in this talk. Security and QA both explore software correctness. I’d like to add some precision to what we mean when we say that software is “insecure” or “buggy.”
  9. Manual Analysis <spoilers> Manual analysis adds value when we deal with human-computer interaction. But you can do anything with manual analysis! Sometimes, stuff you shouldn’t…
  10. Undefined Behavior <spoilers> Software often has unintended behaviors. Eliminating these helps security and quality, and can be automated much more commonly than generally suspected. We’re going to talk about how to specify the behavior of real systems consistently. It is a truth universally acknowledged that deleting code is an excellent way of improving the quality and security of an application.
  11. Implementing This Stuff <spoilers> I used these ideas to build a tool with legs in both disciplines, with interesting results! Now, my code is not a magic fix for all the world’s problems. It’s an experiment. I hope you’ll be inspired to try some experiments of your own. You’ll probably do it better than I do.
  12. I started down this rabbit hole years ago when I

    first learned about AFL Has anyone here used it/heard of it? For a super simple technique, it delivers some pretty impressive results We’ll talk more about how it works later, but first I’ll give an example of why I find it worthy of study.
  14. 20 TB SWF files from Google index https://security.googleblog.com/2011/08/fuzzing-at-scale.html Here’s one

    example of AFL in use. Google researchers took…
  15. 1 week run time on 2000 cores to find minimal

    set of 20000 SWF files https://security.googleblog.com/2011/08/fuzzing-at-scale.html They simply observe the execution paths — the traces from runtime profiling — to select from those 20 TB of SWF files a representative subset which maximizes the many different ways a file can be parsed.
  16. 3 weeks run time on 2000 cores with mutated inputs

    https://security.googleblog.com/2011/08/fuzzing-at-scale.html …and then go to town. Take those inputs, fiddle with the bits, and test again
  17. ⇒ 400 unique crash signatures https://security.googleblog.com/2011/08/fuzzing-at-scale.html and you start finding

    lots of bugs
  18. ⇒ 106 distinct security bugs https://security.googleblog.com/2011/08/fuzzing-at-scale.html which turn out to

    be pretty serious. This was around 5 years ago and it’s much more efficient today. As automated tests go, a month runtime on 2000 cores is fairly high. But with cloud computing this kind of infrastructure is available to anyone who needs it.
  19. But Flash is a giant pile of C code, and

    human beings can’t write safe C code. Finding memory access and overflow bugs in a giant pile of C code is sort of pedestrian.
  20. Also, the test is pretty simple: If the app crashes,

    there’s a bug (probably exploitable).
  21. https://www.flickr.com/photos/sloth_rider/392367929 Sometimes the best way to understand something is to

    implement it yourself. Could we pick a harder problem?
  22. https://commons.wikimedia.org/wiki/File:ACT_recycling_truck.jpg Can rule out a whole lot of routine C

    memory errors with managed code / garbage collection.
  23. Can start with a system under test considered to be

    stable instead of Flash.
  24. So I started a project….

  25. Fizil vs. AFL:
    Runs on Windows: Fizil ✅ / AFL: there’s a fork
    Runs on Unix: Fizil ❌ / AFL ✅
    Fast: Fizil ❌ / AFL ✅
    Bunnies!: Fizil ❌
    Process models: Fizil: In Process, Out of Process / AFL: Fork Server, Out of Process
    Instrumentation guided: Fizil: Soon? / AFL ✅
    Automatic instrumentation: Fizil: .NET Assemblies / AFL: Clang, GCC, Python
    Rich suite of fuzzing strategies: Fizil: Getting there! / AFL ✅
    Automatically disables crash reporting: Fizil ✅ / AFL ❌
    Rich tooling: Fizil ❌ / AFL ✅
    Proven track record: Fizil ❌ / AFL ✅
    Stable: Fizil ❌ / AFL ✅
    License: Fizil: Apache 2.0 / AFL: Apache 2.0
    I take a lot of inspiration from AFL, going so far as to port some of the AFL C code to F# in Fizil. But my goals are really different, and the two tools do very different things.
  26. and eventually I was able to test real software

  28. So naturally the first bug I found was in some

    giant pile of C code.
  29. https://unsplash.com/search/bug?photo=emTCWiq2txk Later, I started finding actual bugs. Now at this

    point, some of you are probably thinking, “yeah, yeah, get on to the interesting parts already,” and some are probably wondering what fuzzing is
  30. Fuzzing https://commons.wikimedia.org/wiki/File:Rabbit_american_fuzzy_lop_buck_white.jpg It won’t take long to explain, so I’m going to give a quick overview so everyone can follow along. Fuzzing is a technique we associate with security research, but it’s a special case of specification-based random testing, and it’s useful for QA as well, but very much underutilized!
  31. { "a" : "bc" } Run a program or function,

    the system under test, with some input. I’m testing a JSON parser, so let’s start with something simple
  32. A B D C E Observe the execution path. Like

    watching test code coverage, except we track the order in which lines are executed, not just whether or not they ever ran. We store a hash of the path.
  34. { "a" : "bc" } ✅ For this input we

    note the parser terminates without crashing and indicates it’s valid JSON
  35. { "a" : "bc" } Alter that input by mutating

    the original and run again, still observing the path
  36. { "a" : "bc" } | Alter that input by

    mutating the original and run again, still observing the path
  37. A B D C E This time, the path changes

  39. | "a" : "bc" } ❌ The program terminates without
    crashing, but marks the JSON as invalid. We always need a property to test after termination of the system under test. In AFL, the most common property to test is whether the system under test crashes. JSON.NET doesn’t tend to crash, so I want to test whether it correctly validates JSON.
  40. https://www.flickr.com/photos/29278394@N00/696701369 Do this a lot. Hundreds of thousands, preferably millions

    or more. Do it at a large enough scale and you’ll probably find interesting things. Any questions about how fuzzing works in general before I go on?
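The loop described on the last few slides is simple enough to sketch end to end. Here is a hypothetical, minimal coverage-guided fuzzer in Python (Fizil itself is F#; the toy parse function, bit_flip, and fuzz below are all illustrative inventions, not the real tool): run the system under test, hash the execution path, and keep any mutated input that produces a path hash we have not seen before.

```python
import random

def parse(data):
    """Toy system under test: returns (is_valid, execution_path)."""
    path = ["start"]
    if not data.startswith(b"{"):
        path.append("no-open")
        return False, tuple(path)
    path.append("open")
    if data.endswith(b"}"):
        path.append("close")
        return True, tuple(path)
    path.append("no-close")
    return False, tuple(path)

def bit_flip(data, rng):
    """Mutate one input by flipping a single random bit, AFL-style."""
    buf = bytearray(data)
    i = rng.randrange(len(buf) * 8)
    buf[i // 8] ^= 1 << (i % 8)
    return bytes(buf)

def fuzz(corpus, iterations=10_000, seed=0):
    """Keep any mutated input that exercises a path we haven't seen."""
    rng = random.Random(seed)
    seen = set()
    for data in corpus:
        _, path = parse(data)
        seen.add(hash(path))
    corpus = list(corpus)
    for _ in range(iterations):
        candidate = bit_flip(rng.choice(corpus), rng)
        _, path = parse(candidate)
        if hash(path) not in seen:      # new behavior: keep this test case
            seen.add(hash(path))
            corpus.append(candidate)
    return corpus

corpus = fuzz([b'{ "a" : "bc" }'])
```

Run against the seed input, the corpus grows only when a mutant exercises a genuinely new path, which is the same "interesting test case" criterion AFL uses.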
  41. Impossible? Or just really, amazingly difficult? https://commons.wikimedia.org/wiki/File:Impossible_cube_illusion_angle.svg I’d like to talk about some things which security and quality assurance have in common. Here’s one: They’re obviously important, but widely considered to be impossible to finish.
  42. When Is QA “Done”? Some people say this is impossible.

    Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.
  43. When Can We Call a System “Secure”? Some people say

    this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.
  44. https://xkcd.com/1316/ Part of why this is hard is that software keeps getting larger — both the apps and the OS they run on. Open source actually made this a lot worse — more parts. And it keeps changing. It has often been the case that a code base can grow larger than a team of humans can secure/QA, even using automation.
  45. That now happens on an empty project. You do File->New
    and you have megabytes of code you’re lucky if you can even run, never mind test.
  46. When a task is too big and too repetitive for

    a human, we want to use a computer instead, even if it’s not obvious how to do it.
  47. Exploratory https://dojo.ministryoftesting.com/lessons/exploratory-testing-an-api In both security and QA we do various sorts of automated testing and analysis and manual exploration and study. Good testers think a lot about where that line should be drawn. Can we use computers to automate exploratory testing? I think we have to explore this question, because contemporary software is too big to do 100% of exploratory testing manually. There will always be a need to test how humans react to software, but we must automate as much analysis as possible.
  48. Security ⊇ QA? To try to start to answer these questions, let’s back up and ask what we know about QA and security. And am I correct at all to describe security as domain-specific QA? What even is security? QA?
  49. Behavior Developers, QAs, security analysts, and users are all interested
    in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal. QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails. They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.
  50. Behavior Specification Developers, QAs, security analysts, and users are all
    interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal. QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails. They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.
  55. People https://www.flickr.com/photos/wocintechchat/25677176162/ What sort of people work in QA and security jobs?
  56. http://amanda.secured.org/in-securities-comic/ Characteristics of Security Pro, Michał Zalewski (Mee-how) https://lcamtuf.blogspot.com/2016/08/so-you-want-to-work-in-security-but-are.html Based

    on Parisa Tabriz’s article, but boiled down to a list of 4 points I have time to cover here today Infosec is all about the mismatch between our intuition and the actual behavior of the systems we build Security is a protoscience. Think of chemistry in the early 19th century: a glorious and messy thing, and will change! If you are not embarrassed by the views you held two years ago, you are getting complacent - and complacency kills. Walk in [the] shoes [of software engineers] for a while: write your own code, show it to the world, and be humiliated by all the horrible mistakes you will inevitably make.
  57. https://www.quora.com/What-qualities-make-a-good-QA-engineer —Thomas Peham Characteristics of QA Similar to security pros,

    but heavy emphasis on communication. Do developers value quality?
  58. Tools https://commons.wikimedia.org/wiki/File:Tools,_arsenical_copper,_Naxos,_2700%E2%80%932200_BC,_BM,_GR_1969.12-31,_142703.jpg You might think that QA and security tools would be really different,
  59. but they can be surprisingly similar. Fiddler is used by

    people in both domains,
  60. And ZAP is quite similar to Fiddler with some security-specific

    features.
  61. Simple Testing https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf Some tools are remarkably effective! You can fix large classes of bugs using static analysis in your CI system, but nothing forces you to use it.
  62. https://laurent22.github.io/so-injections/ You can prevent most SQLi via parameterized queries, but

    nothing forces you to use them… unless you run Coverity (which is free for open source) or Veracode or…
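To make the parameterized-query point concrete, here is a small illustrative sketch using Python’s standard-library sqlite3 module (my own example, not anything from the talk). The classic injection string turns a string-concatenated query into "return everything," while the parameterized version treats the same string as plain data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

attacker_input = "' OR '1'='1"

# Vulnerable: string concatenation lets the input rewrite the query.
unsafe_sql = "SELECT role FROM users WHERE name = '" + attacker_input + "'"
leaked = conn.execute(unsafe_sql).fetchall()   # returns every row

# Safe: a parameterized query treats the input purely as data.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (attacker_input,)
).fetchall()   # returns no rows, since no user has that literal name
```

The fix is mechanical, which is exactly why a static analyzer can enforce it across an entire code base.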
  63. While it’s true you can squash large classes of defects

    and security issues with simple fixes, you won’t get to 100% this way. Defects can be introduced downstream, even at the microcode level, and people can make intentional tradeoffs away from quality. Intel used to spend effort on formal validation of microcode. Rumor has it that’s been drastically scaled back
  64. https://en.wikipedia.org/wiki/File:Row_hammer.svg We see security bugs in hardware, too, like the

    rowhammer attack on physical memory. (Explain?) So some specific problems are easy to fix, but making an entire system correct and secure is much harder. Still, it does seem like we ought to at least pick the low-hanging fruit.
  65. Specifications https://lorinhochstein.wordpress.com/2014/06/04/crossing-the-river-with-tla/ Specifications. To say whether software behaves correctly, we have to define what is correct. QAs are sometimes cynical of claims for specifications, since we’ve been promised they’re the means to automatically create perfect software and, well, we’re not there yet. Specifying a system is not simple and most people aren’t good at it. Does that mean they’re a waste of time? We can’t talk about the correctness of a system without a specification. Specifications aren’t a royal road to correctness; they’re an obligation, a precondition of any assertion of quality.
  66. [<Test>]
    let testReadDoubleWithExponent() =
        let actual = parseString "10.0e1"
        actual |> shouldEqual (Parsed (JsonNumber "10.0e1"))
    But “specification” sounds formal and academic and probably we don’t generally get these handed to us on a silver platter. What does it mean in the real world? Well, a specification is simply something which must always be true. A unit test is a simple kind of spec. It’s useful because it’s enforced by the CI build. It’s limited because it’s only one example, but much better than nothing!
  67. let toHexString (bytes: byte[]) : string = //... Another example

    of something which is always true is a type signature. In F#, this function cannot return null, ever. That’s really useful! That single feature of the type system eliminates — for real — about a quarter of the bugs I see in production C#, Java, and JavaScript systems I’m asked to maintain. More specs: This function is invertible
  68. http://d3s.mff.cuni.cz/research/seminar/download/2010-02-23-Tobies-HypervisorVerification.pdf Formal methods combine specifications with theorem provers to demonstrate

    when specifications do or do not always hold in the code. And they work! I’ll give examples later on in the talk.
  69. Best Known Example Released Informal Spec Formal Spec Execute QuickCheck

    1999 “Reversing a list twice should result in the same list” prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int] Main> quickCheck prop_RevRev OK, passed 100 tests. Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher
  71. Best Known Example Released Informal Spec Formal Spec Execute QuickCheck
    1999 “Reversing a list twice should result in the same list” prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int] Main> quickCheck prop_RevRev OK, passed 100 tests. AFL 2007 “System under test shouldn’t crash no matter what I pass to it” if (WIFSIGNALED(status) && !stop_soon) { kill_signal = WTERMSIG(status); return FAULT_CRASH; } ./afl-fuzz -i testcase_dir -o findings_dir -- /path/to/tested/program [...program's cmdline...] Earlier I mentioned specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher
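The QuickCheck idea on the slide ports to any language. Here is a hypothetical, dependency-free sketch in Python (prop_rev_rev and quickcheck are invented names, not a real library): generate random inputs, check that the specification holds for each one, and report the first counterexample found.

```python
import random

def prop_rev_rev(xs):
    """Spec: reversing a list twice yields the original list."""
    return list(reversed(list(reversed(xs)))) == xs

def quickcheck(prop, trials=100, seed=0):
    """Specification-based random testing: try the property on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randrange(20))]
        if not prop(xs):
            return xs          # a counterexample: the raw material of a bug report
    return None                # property held on every trial
```

A fuzzer is the same loop with a much weaker property ("the program doesn't crash") and far more trials.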
  73. https://www.flickr.com/photos/x1brett/2279939232 Unfortunately, having a method for showing when specifications hold

    makes it really obvious that producing coherent specs for your software isn’t always easy! Very easy to add requirements until they contradict each other
  74. https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d Even simple specifications can sometimes be difficult to verify.
    PRNG from Safari on left, old version of Chrome on right. Obviously, there’s an issue. Specification well-understood, but impossible to write a test which passes 100% of the time for a perfect PRNG. We need a spec for what our system should do, but that doesn’t automatically make it testable.
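One way to see the problem: any test of "the output is uniform" has to be statistical. Here is a sketch using Python’s standard-library random module (seeded so the run is reproducible; the 20% tolerance is an arbitrary illustrative choice): a perfect PRNG will still occasionally fail any fixed threshold, so no such test can pass 100% of the time.

```python
import random

def frequency_test(rng, buckets=10, samples=10_000, tolerance=0.2):
    """Check that samples land in each bucket roughly equally often.
    The tolerance is unavoidable: exact equality would fail a good PRNG."""
    counts = [0] * buckets
    for _ in range(samples):
        counts[int(rng.random() * buckets)] += 1
    expected = samples / buckets
    return all(abs(c - expected) / expected < tolerance for c in counts)

ok = frequency_test(random.Random(42))
```

The spec ("uniformly distributed") is crisp; the test of it can only ever be probabilistic.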
  75. Thought Experiment: What If Automated Tests Were Perfect? But let’s say we solved all those problems. You could prove a software system perfectly conformed to the spec. Let’s say the spec was also right. Problem solved? QA = spec + people
  76. This is life or death. This is an alert screen.

    Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. (click) Contains the number “160” (or “6160”) in 14 different places. None are “wrong” values. Nurse clicked through this and patient received 38 1/2 times the correct dose of an antibiotic. He’s fortunate to have survived. dark patterns
  78. What If Security Analysis Tools Were Perfect? You could prove edge cases outside the spec did not exist. Let’s say the spec was also right. No SQLi, no overflows, no… anything. Problem solved?
  79. –DNI James Clapper “Something like 90 percent of cyber intrusions

    start with phishing… Somebody always falls for it.” https://twitter.com/ODNIgov/status/776070411482193920 Security = spec + people. People who open phishing emails aren’t even dumb. You have someone in HR, reviewing resumes, and so an essential part of their job is opening PDFs and Word files emailed to them by clueless, anonymous people on the Internet, right?
  80. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45366.pdf If you display a security indication to a human
    being, how will they react? Google and UC Berkeley have conducted research on user response to security warnings
  81. The same team also produced an opt-in Chrome extension which

    would show you a survey if you clicked through a security warning. Knowing how humans react to security UI is a security-critical test!
  82. Manual Testing. Examples: Exploratory testing, binary analysis. Effort: Very high. Killer App: Finding cases where code is technically correct but fails at human-computer interaction. Major Disadvantage: Often misused. OK, so proof that some piece of software conforms to a specification isn’t sufficient to call it perfect in terms of quality or security, but could we even have that proof? Let’s consider what we can do today, and the return on that effort.
  83. Dynamic Analysis. Examples: QuickCheck, AFL, sqlmap. Effort: Low. Killer App: More like an app killer, amiright? Major Disadvantage: Tends to find a few specific (though important!) bugs.
  84. Static Analysis. Examples: FxCop, FindBugs, Coverity, Veracode. Effort: Very low. Killer App: Cheaper than air. Just do it. Major Disadvantage: Limited to finding a few hundred important kinds of bugs.
  85. Formal Verification / Symbolic Execution. Examples: VCC, TLA+, Cryptol. Effort: High effort but correspondingly high return. Killer App: MiTLS, Hyper-V Memory Manager. Major Disadvantage: Hard to find people with the skill set.
  86. Program Synthesis. Examples: Nothing off the shelf, really, but Agda and Z3 help. Effort: PhD-level research. Killer App: Elimination of incidental complexity. Major Disadvantage: Doesn’t really exist in general form. This may be the future of software. When you write code today you’re writing an informal specification in a language designed for general control of hardware. What if you wrote in a language better suited to your problem domain? This is not a DSL, as we think of it today, because DSLs don’t verify the consistency and satisfiability of specifications.
  87. For developers, the message here is: Don’t fix the bug.

    When you receive a defect report from a security or QA team, don’t fix it. Add something to your process, like static analysis, which makes the entire class of defect impossible. Then fix everything it catches.
  88. That’s all a bunch of words and hand-waving. Let’s bring

    this down to earth, shall we?
  89. How Amazon Web Services Uses Formal Methods “Formal methods are
    a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found through any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality.” http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html First, this stuff works in the real world, today. The scale at which customers use AWS services is far too large for even Amazon to comprehensively test. They work around this limitation by formally specifying all of their protocols in TLA+
  90. Researchers at INRIA and Microsoft Research used a dependently typed
    language called F* to implement TLS and compare the behavior of the formally verified implementation to others on fuzzed input. Six major vulnerabilities so far, including Triple Handshake, FREAK, and Logjam
  92. “Finding and Understanding Bugs in C Compilers,” Yang et al.

    https://www.flux.utah.edu/paper/yang-pldi11 It's possible to write a correct C compiler, but the best C developers in the world can't do it in C. You can’t test every possible C program with ad hoc test suites. Yang et al. used “smart” fuzzing of the C language to cover the maximum surface area of a compiler. They report: “Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites— the main way that compilers are tested—are an inadequate mechanism for quality control.” The only C compiler to survive this kind of testing was CompCert, which is formally verified in Coq. (Rust story?)
  93. These are phenomenally effective tools, and surprisingly under-utilized in industry.

    It turns out, you can try this at home! What I Learned Writing a .NET Fuzzer
  94. =================================== Technical "whitepaper" for afl-fuzz =================================== This document provides a

    quick overview of the guts of American Fuzzy Lop. See README for the general instruction manual; and for a discussion of motivations and design goals behind AFL, see historical_notes.txt. 0) Design statement ------------------- American Fuzzy Lop does its best not to focus on any singular principle of operation and not be a proof-of-concept for any specific theory. The tool can be thought of as a collection of hacks that have been tested in practice, found to be surprisingly effective, and have been implemented in the simplest, most robust way I could think of at the time. Many of the resulting features are made possible thanks to the availability of lightweight instrumentation that served as a foundation for the tool, but this mechanism should be thought of merely as a means to an end. The only true governing principles are speed, reliability, and ease of use. 1) Coverage measurements ------------------------ The instrumentation injected into compiled programs captures branch (edge) coverage, along with coarse branch-taken hit counts. The code injected at branch points is essentially equivalent to: cur_location = <COMPILE_TIME_RANDOM>; shared_mem[cur_location ^ prev_location]++; prev_location = cur_location >> 1; http://lcamtuf.coredump.cx/afl/technical_details.txt So as I said earlier, I love the concept of AFL, and especially the large amount of documentation that Michał Zalewski has written explaining the design choices and implementation details Now I want to explain my own journey
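The three lines of injected code quoted above are the heart of AFL. A Python model of that bookkeeping (MAP_SIZE matches AFL’s default; the block ids here are made-up stand-ins for AFL’s compile-time random values) shows why it captures edge direction, not just line coverage:

```python
MAP_SIZE = 1 << 16   # AFL's default 64 KB shared-memory map

def trace(block_ids):
    """Model the instrumentation AFL injects at branch points:
    shared_mem[cur ^ prev]++ , then prev = cur >> 1."""
    shared_mem = [0] * MAP_SIZE
    prev_location = 0
    for cur_location in block_ids:       # compile-time random ids per block
        shared_mem[(cur_location ^ prev_location) % MAP_SIZE] += 1
        prev_location = cur_location >> 1
    return shared_mem

a, b = 0x1234, 0x89AB                    # two hypothetical basic blocks
ab, ba = trace([a, b]), trace([b, a])
# The shift of prev_location makes A->B and B->A hit different map cells,
# so the map records the order blocks ran in, not just whether they ran.
```

Without the `>> 1` shift, A→B and B→A would XOR to the same value and the two paths would be indistinguishable.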
  95. Memory AFL often finds memory
    crashes. You get this almost free. We know how to almost entirely eliminate these. In a managed language… But the idea is still good. Can we find other kinds of errors as “easily” as memory-related crashes? This turned out to be quite a long road.
  96. { "a" : "bc" } But you have to start

    somewhere. Find a corpus. I started with. It’s valid JSON! Can we parse this?
  97. let jsonNetResult =
        try
            JsonConvert.DeserializeObject<obj>(str) |> ignore   // ← Test
            Success
        with
        | :? JsonReaderException as jre -> jre.Message |> Error
        | :? JsonSerializationException as jse -> jse.Message |> Error
        | :? System.FormatException as fe ->                    // ← Special case error stuff
            if fe.Message.StartsWith("Invalid hex character")   // hard coded in Json.NET
            then fe.Message |> Error
            else reraise()
    I had to write a short program to run the deserializer, which I’ll call the test harness
  98. use proc = new Process()                                                   ⃪ Set up
      proc.StartInfo.FileName <- executablePath
      inputMethod.BeforeStart proc testCase.Data
      proc.StartInfo.UseShellExecute <- false
      proc.StartInfo.RedirectStandardOutput <- true
      proc.StartInfo.RedirectStandardError <- true
      proc.StartInfo.EnvironmentVariables.Add(SharedMemory.environmentVariableName, sharedMemoryName)
      let output = new System.Text.StringBuilder()
      let err = new System.Text.StringBuilder()
      proc.OutputDataReceived.Add(fun args -> output.Append(args.Data) |> ignore)
      proc.ErrorDataReceived.Add (fun args -> err.Append(args.Data) |> ignore)
      proc.Start() |> ignore
      inputMethod.AfterStart proc testCase.Data
      proc.BeginOutputReadLine()                                                 ⃪ Read results
      proc.BeginErrorReadLine()
      proc.WaitForExit()
      let exitCode = proc.ExitCode
      let crashed = exitCode = WinApi.ClrUnhandledExceptionCode                  ⃪ Important bit

    And another program to read the input data, execute the test harness executable, and then see if it succeeded. Pretty simple so far. There's a lot of code here; don't worry about the details: set up, execute, collect data. The code is on my GitHub (I gave you the link at the beginning of the slides) if you want to look deeper. But my original sample input wasn't very interesting.
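The loop above boils down to: start the target with the test case, capture its output, and compare the exit code against a crash sentinel. A minimal Python analogue (my own sketch; `run_one` and the exit-code constant are illustrative, not Fizil's actual code):

```python
import subprocess

# Illustrative sentinel: the Windows SEH code for an unhandled CLR exception.
CRASH_EXIT_CODE = 0xE0434352

def run_one(argv, test_data: bytes):
    """Run the target once with the test case on stdin; report crash + output."""
    proc = subprocess.run(argv, input=test_data, capture_output=True)
    crashed = proc.returncode == CRASH_EXIT_CODE
    return crashed, proc.stdout, proc.stderr
```

Set up, execute, collect data: the shape is the same in any language.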
  99. Then I stood on the shoulders of giants. Turns out

    lots of people (well, two) like to collect problematic JSON. Now I have about 200 good test cases. But I want hundreds of thousands.
  100. /// An ordered list of functions to use when starting with a single piece of
       /// example data and producing new examples to try
       let private allStrategies = [
           bitFlip  1
           bitFlip  2
           bitFlip  4
           byteFlip 1
           byteFlip 2
           byteFlip 4
           arith8
           arith16
           arith32
           interest8
           interest16
       ]

    So I wrote a bunch of ways to transform that input into new cases. This list is just copied from AFL.
  101. let totalBits = bytes.Length * 8
       let testCases = seq {
           for bit = 0 to totalBits - flipBits do                      ⃪ Fuzz one byte
               let newBytes = Array.copy bytes
               let firstByte = bit / 8
               let firstByteMask, secondByteMask = bitMasks(bit, flipBits)
               let newFirstByte = bytes.[firstByte] ^^^ firstByteMask  ⃪ ^^^ means xor
               newBytes.[firstByte] <- newFirstByte
               let secondByte = firstByte + 1
               if secondByteMask <> 0uy && secondByte < bytes.Length then
                   let newSecondByte = bytes.[secondByte] ^^^ secondByteMask
                   newBytes.[secondByte] <- newSecondByte
               yield newBytes
       }

    And I translated the AFL fuzz C code into F#. So now I have a bunch of test cases, but I need to understand them. If I have an input and I flip one bit, maybe that's a valuable new test case, or more likely it's totally useless. How do I know?
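The same bit-flip strategy can be sketched compactly in Python (my own illustration; `bit_flips` is not a Fizil function): slide a window of `flip_bits` consecutive bits across the input and XOR each window.

```python
def bit_flips(data: bytes, flip_bits: int = 1):
    """Yield every variant of `data` with `flip_bits` consecutive bits inverted."""
    total_bits = len(data) * 8
    for bit in range(total_bits - flip_bits + 1):
        mutant = bytearray(data)
        for offset in range(flip_bits):
            byte_index, bit_index = divmod(bit + offset, 8)
            mutant[byte_index] ^= 0x80 >> bit_index  # XOR toggles exactly one bit
        yield bytes(mutant)
```

A one-byte input yields eight single-bit mutants; the window can straddle a byte boundary, which is what the second-byte-mask logic in the F# handles.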
  102. https://commons.wikimedia.org/wiki/File:CPT-Recursion-Factorial-Code.svg I need to trace all of the call stacks

    executed during the test. I’m looking for tests which produce new sequences of stacks.
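One simple way to spot "new sequences" (my own simplification, not necessarily Fizil's exact algorithm) is to reduce each trace to its set of edges, and keep any input that contributes an edge we have never seen before:

```python
# Global record of every edge (pair of consecutive instrumented blocks) seen so far.
seen_edges = set()

def is_interesting(trace):
    """`trace` is the list of instrumentation IDs hit during one run."""
    edges = set(zip(trace, trace[1:]))   # consecutive pairs approximate branch coverage
    new_edges = edges - seen_edges
    seen_edges.update(edges)
    return bool(new_edges)               # keep the input only if it found new coverage
```

This is the feedback loop that separates coverage-guided fuzzing from blind random mutation.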
  103. private static void F(string arg)
       {
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    How do I know which call stacks happen during testing? Instrumentation. At the simplest level, we want to turn this:
  104. private static void F(string arg)
       {
           instrument.Trace(29875);   ← Random number
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    into this. Which is to say, whenever we enter some block, inform an external observer what just happened. In order to inject this code, I have a couple of choices.
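In a language with first-class functions you can fake this kind of instrumentation without any binary rewriting. A Python sketch of the same idea as a decorator (my own illustration, not Fizil code):

```python
import random

trace_log = []  # the "external observer"

def instrumented(fn):
    """Give the function a compile-time-style random ID and log entry into it."""
    block_id = random.randrange(2 ** 16)  # stands in for the compile-time random
    def wrapper(*args, **kwargs):
        trace_log.append(block_id)        # inform the observer we entered this block
        return fn(*args, **kwargs)
    wrapper.block_id = block_id
    return wrapper

@instrumented
def f(arg):
    return arg.upper()
```

The point of the random ID is that the observer doesn't need symbols or source: a stream of IDs is enough to compare paths between runs.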
  105. private static void F(string arg)
       {
       #if MANUAL_INSTRUMENTATION
           instrument.Trace(29875);
       #endif
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    I could manually add the instrumentation, which is painful.
  106. Or I could write a profiler. The .NET framework provides

    a profiling API, but it requires hosting the runtime and is annoying in other ways
  107. An easier solution is Mono.Cecil, which is sort of like

    .NET reflection except you can modify .NET assemblies without loading them into your process first.
  108. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

    I want to be able to work with any .NET executable, so I instrument binaries instead of source. So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That's complicated, so don't worry much about it. The important thing is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs.
  109. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_02_0 [generated]
           arg_07_0 [generated]
           nop()
           arg_02_0 = ldloc(ob)
           arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
           ret(arg_07_0)
       }
  110. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_02_0 [generated]
           arg_07_0 [generated]
           nop()
           arg_02_0 = ldloc(ob)
           arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
           ret(arg_07_0)
       }

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_05_0 [generated]
           arg_0C_0 [generated]
           arg_11_0 [generated]
           arg_05_0 = ldc.i4(23831)
           call(Instrument::Trace, arg_05_0)
           nop()
           arg_0C_0 = ldloc(ob)
           arg_11_0 = call(JsonConvert::SerializeObject, arg_0C_0)
           ret(arg_11_0)
       }
  111. So what do we have so far? A ton of

    inputs, and lots of data about how the system under test behaves with them. Unfortunately, it doesn’t tell me very much. The only thing I was able to crash with this method was
  112. …Windows. Or maybe the Parallels memory manager. Not what I

    was after. But this was a rather difficult problem to work around. It’s hard to find bugs in an application when your OS is blue screening!
  113. http://www.json.org/ I need a way to determine if Json.NET is

    parsing the JSON correctly. So I thought I should write a JSON validator to check its behavior. Fortunately, there’s a standard! “Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.” -Crockford
  114. https://tools.ietf.org/html/rfc4627 So, naturally, there’s also another JSON standard

  115. http://www.ecma-international.org/ecma-262/5.1/#sec-15.12 And another JSON standard

  116. http://www.ecma-international.org/publications/standards/Ecma-404.htm And another JSON standard

  117. https://tools.ietf.org/html/rfc7158 And another JSON standard

  118. https://tools.ietf.org/html/rfc7159 And another JSON standard. And no, they don’t all

    agree on everything, nor is there a single, “latest” version. Despite this multitude of standards, there are still edge cases intentionally delegated to the implementer — what we would call “undefined behavior” in C.
  119. https://github.com/nst/STJSON I was going to write my own validator, but…

    Nicolas Seriot wrote a validator called STJSON which attempts to synthesize these as much as possible.
  120. https://github.com/CraigStuntz/Fizil/blob/master/StJson/StJsonParser.fs Swift doesn’t readily compile to Windows, but if you

    squint hard enough it kind of looks like F#, so I ported the code and used it to validate Json.NET's behavior.
  121. Standard Accepts, Json.NET Rejects

       Value:
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888

       Standard says: No limit
       Json.NET:      MaximumJavascriptIntegerCharacterLength = 380;

    Things Json.NET fails on that the standard accepts
  122. Standard Rejects, Json.NET Accepts

       Value:         [,,,]
       Standard says: A JSON value MUST be an object, array, number, or string, or one of
                      the following three literal names: false null true
       Json.NET:      [null, null, null, null]

    Things Json.NET succeeds on that the standard rejects
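As an aside of my own (not from the talk), Python's standard-library json module takes the stricter reading here and rejects the bare commas that Json.NET reportedly turns into nulls:

```python
import json

def parses(text: str) -> bool:
    """Return True if the standard-library parser accepts `text` as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Comparing several implementations against each other like this is differential testing: any disagreement is a bug in at least one of them, or in the standard.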
  123. I m p l e m e n t a

    t i o n D e t a i l s I have lots of interesting stories from implementing this code, but some of them get kind of low level, so I’ll share just a couple I think are of general interest. Please do feel free to pull my code or reach out to me if you want complete details!
  124. let private insertTraceInstruction(ilProcessor: ILProcessor, before: Instruction, state) =
           let compileTimeRandom = state.Random.Next(0, UInt16.MaxValue |> Convert.ToInt32)
           let ldArg     = ilProcessor.Create(OpCodes.Ldc_I4, compileTimeRandom)
           let callTrace = ilProcessor.Create(OpCodes.Call,   state.Trace)
           ilProcessor.InsertBefore(before, ldArg)
           ilProcessor.InsertAfter (ldArg,  callTrace)

       This margin is too narrow to contain a try/finally example, so see:
       https://goo.gl/W4y7JH

    Inserting the IL instructions I needed was fairly easy. Here is the important bit of the code which does it. How did I learn how to write this? I instrumented a small program "manually" by writing the instrumentation code myself, and then decompiled that program to figure out which IL instructions I needed. Inserting them with Mono.Cecil is just a few lines of code. try/finally is much, much harder. I won't even try to walk you through it here. Look at the GitHub repo if you want to see how it's done.
  125. Strong naming was a consistent pain for me. I’m altering

    the binaries of assemblies, and part of the point of strong naming is to stop you from doing just that, so naturally if the assembly is strongly named it can’t be loaded when I’m finished.
  126. let private removeStrongName (assemblyDefinition : AssemblyDefinition) =
           let name = assemblyDefinition.Name
           name.HasPublicKey <- false
           name.PublicKey <- Array.empty
           assemblyDefinition.Modules
           |> Seq.iter (fun moduleDefinition ->
               moduleDefinition.Attributes <-
                   moduleDefinition.Attributes &&& ~~~ModuleAttributes.StrongNameSigned)
           let aptca =
               assemblyDefinition.CustomAttributes.FirstOrDefault(fun attr ->
                   attr.AttributeType.FullName =
                       typeof<System.Security.AllowPartiallyTrustedCallersAttribute>.FullName)
           assemblyDefinition.CustomAttributes.Remove aptca |> ignore

       assembly.MainModule.AssemblyReferences
       |> Seq.filter (fun reference -> Set.contains reference.Name assembliesToInstrument)
       |> Seq.iter   (fun reference -> reference.PublicKeyToken <- null)

    So I need to remove the strong name from any assembly I fuzz, but I also need to remove the PublicKeyToken from any other assembly which references it. Doing this in Mono.Cecil is not well documented, and after quite a bit of time spent in GitHub issues and trial and error I figured out that it takes five distinct steps.
  127. I n / O u t o f P r

    o c e s s In order to stop Windows/Parallels from crashing I decided to try and fuzz the system under test in process with the fuzzer itself to reduce the number of processes I was creating. This worked, but had a surprising result: The number of distinct paths through the code I found during testing changed. Why? Isn’t it running the same code with the same inputs? Yes, absolutely! But when you run a function, the .NET framework doesn’t guarantee that the same instructions will be executed each time. Let’s look at why.
  128. So my test harness is going to execute some code,

    and I’ve instrumented that code, which means I’ll get a series of trace events as the code runs. I might expect these events to come in the same order for executing a single function.
  129. But that’s not what happens. This is an actual trace of the method I just showed you, and the same trace events don’t occur at the same point in every run. Why?
  130. –ECMA-335, Common Language Infrastructure (CLI), Partition I “If marked BeforeFieldInit

    then the type’s initializer method is executed at, or sometime before, first access to any static field defined for that type.” The .NET CLR guarantees a type initializer will be invoked before the type is used, but it doesn’t specify exactly when.
  131. f ( x ) = f ( x ) t

    i m e ( f ( x ) ) ! = t i m e ( f ( x ) ) ✅ ❌ For many methods, the CLR guarantees they’ll return the same result for the same argument, but does not guarantee the same instructions will be executed in the same order to get that result. What does this tell us about QA? What does this tell us about security? (Any thoughts on that?)
  132. U n i c o d e Original JSON {

    "a": "bc" } ASCII Bytes 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D UTF-8 with Byte Order Mark EF BB BF 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D UTF-16 BE with BOM FE FF 00 7B 00 20 00 22 00 61 00 22 00 20 00 3A 00 20 00 22 00 62 00 63 00 22 00 20 00 7D What does it mean to fuzz JSON? Which of these byte arrays should I fuzz? Web uses UTF-8. Browsers use WTF-8. Windows, C#, and JSON.NET like UTF-16. We must choose an encoding for our test corpus and then choose whether to convert or just fuzz as-is
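The slide's three byte arrays can be reproduced from the same JSON text with a few lines of Python (my own sketch):

```python
import codecs

text = '{ "a" : "bc" }'

# Three encodings of the same JSON value produce three different byte arrays.
ascii_bytes       = text.encode("ascii")
utf8_with_bom     = codecs.BOM_UTF8 + text.encode("utf-8")          # EF BB BF ...
utf16_be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 7B ...
```

A fuzzer that only mutates one of these representations will never exercise the BOM-handling or multi-byte-decoding paths of the parser.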
  133. T h a n k Y o u ! Presentation

    Review Cassandra Faris Chad James Damian Synadinos Doug Mair Tommy Graves Source Code Inspiration Michał Zalewski Nicolas Seriot Everyone Who Works on dnSpy & Mono.Cecil Finally…
  134. None
  135. C r a i g S t u n t

    z @craigstuntz Craig.Stuntz@Improving.com http://www.craigstuntz.com http://www.meetup.com/Papers-We-Love-Columbus/ https://speakerdeck.com/craigstuntz