Mashing Up QA and Security - CodeMash 2017 - with notes

This version of the deck has speaker notes. I've published a separate version without notes.

Security is domain-specific quality assurance, but developers, testers, and security professionals often don’t work together. When this type of disconnect exists between big groups of people who are very good at their jobs, there is usually a mostly untapped potential for learning. I’ve been exploring this landscape by writing an open source fuzzer aimed at discovering new test cases (not just crashes!) using binary rewriting of managed executables and genetic modification of a test corpus, implemented in F# and using Mono.Cecil. I’ll contrast the fundamentals of each discipline, demonstrate tools used by experts on both sides of the security and QA fence, and challenge the audience to find new ways to mix them up. Expect to see lots of code and leave with ideas for making entire communities better, not just your own team!

Craig Stuntz

January 13, 2017

Transcript

  1. Mashing Up QA and Security. Craig Stuntz, Improving. https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/Fizil When I submit talk proposals, I ask myself two questions: 1) What will the audience take away from this talk, and 2) What’s the weirdest stuff the conference will possibly accept? So I’m going to cover a lot of material, and some of it will seem quite strange. Please do feel free to raise your hand or just shout out a question. I’ll be happy to slow down and explain. I’m not planning on leaving the last 10 minutes empty for questions, so do ask whenever you feel it would be helpful. My goal here is to leave you with food for thought. I think I’ll have succeeded if you find at least one or two of the ideas here intriguing enough to want to research yourself next week.
  2. https://www.flickr.com/photos/futureshape/566200801 Security is domain-specific QA, but the fields often seem

    worlds apart. Maybe it shouldn’t be that way?
  3. QAs and security analysts are smart people

  4. https://what-if.xkcd.com/49/ who are sometimes considered a bit strange or maybe

    even feared by developers and business types
  5. but care very deeply about software correctness

  6. Unfortunately, they go to totally different parties

  7. Except CodeMash Can we take this opportunity to learn from

    each other?
  8. Software Correctness <spoilers> OK, spoiler alert! There are four core ideas I’m going to explore in this talk. Security and QA both explore software correctness. I’d like to add some precision to what we mean when we say that software is “insecure” or “buggy.”
  9. Manual Analysis <spoilers> Manual analysis adds value when we deal with human-computer interaction. But you can do anything with manual analysis! Sometimes, stuff you shouldn’t…
  10. Undefined Behavior <spoilers> Software often has unintended behaviors. Eliminating these helps security and quality, and can be automated much more commonly than generally suspected. We’re going to talk about how to specify the behavior of real systems consistently. It is a truth universally acknowledged that deleting code is an excellent way of improving the quality and security of an application.
  11. Implementing This Stuff <spoilers> I used these ideas to build a tool with legs in both disciplines, with interesting results! Now, my code is not a magic fix for all the world’s problems. It’s an experiment. I hope you’ll be inspired to try some experiments of your own. You’ll probably do it better than I do.
  12. I started down this rabbit hole years ago when I

    first learned about AFL Has anyone here used it/heard of it? For a super simple technique, it delivers some pretty impressive results We’ll talk more about how it works later, but first I’ll give an example of why I find it worthy of study.
  14. 20 TB SWF files from Google index https://security.googleblog.com/2011/08/fuzzing-at-scale.html Here’s one

    example of AFL in use. Google researchers took…
  15. 1 week run time on 2000 cores to find minimal

    set of 20000 SWF files https://security.googleblog.com/2011/08/fuzzing-at-scale.html They simply observe the execution paths — the traces from runtime profiling — to select from those 20 TB of SWF files a representative subset which maximizes the many different ways a file can be parsed.
  16. 3 weeks run time on 2000 cores with mutated inputs

    https://security.googleblog.com/2011/08/fuzzing-at-scale.html …and then go to town. Take those inputs, fiddle with the bits, and test again
  17. ⇒ 400 unique crash signatures https://security.googleblog.com/2011/08/fuzzing-at-scale.html and you start finding

    lots of bugs
  18. ⇒ 106 distinct security bugs https://security.googleblog.com/2011/08/fuzzing-at-scale.html which turn out to

    be pretty serious. This was around 5 years ago and it’s much more efficient today. As automated tests go, a month runtime on 2000 cores is fairly high. But with cloud computing this kind of infrastructure is available to anyone who needs it.
  19. But Flash is a giant pile of C code, and

    human beings can’t write safe C code. Finding memory access and overflow bugs in a giant pile of C code is sort of pedestrian.
  20. Also, the test is pretty simple: If the app crashes,

    there’s a bug (probably exploitable).
  21. https://www.flickr.com/photos/sloth_rider/392367929 Sometimes the best way to understand something is to

    implement it yourself. Could we pick a harder problem?
  22. https://commons.wikimedia.org/wiki/File:ACT_recycling_truck.jpg Can rule out a whole lot of routine C

    memory errors with managed code / garbage collection.
  23. Can start with a system under test considered to be

    stable instead of Flash.
  24. So I started a project….

  25. Fizil vs. AFL:
    Runs on Windows: Fizil ✅ / AFL: there’s a fork
    Runs on Unix: Fizil ❌ / AFL ✅
    Fast: Fizil ❌ / AFL ✅
    Bunnies!: Fizil ❌
    Process models: Fizil: In Process, Out of Process / AFL: Fork Server, Out of Process
    Instrumentation guided: Fizil: Soon? / AFL ✅
    Automatic instrumentation: Fizil: .NET Assemblies / AFL: Clang, GCC, Python
    Rich suite of fuzzing strategies: Fizil: Getting there! / AFL ✅
    Automatically disables crash reporting: Fizil ✅ / AFL ❌
    Rich tooling: Fizil ❌ / AFL ✅
    Proven track record: Fizil ❌ / AFL ✅
    Stable: Fizil ❌ / AFL ✅
    License: Fizil: Apache 2.0 / AFL: Apache 2.0
    I take a lot of inspiration from AFL, going so far as to port some of the AFL C code to F# in Fizil. But my goals are really different, and the two tools do very different things.
  26. and eventually I was able to test real software

  28. So naturally the first bug I found was in some

    giant pile of C code.
  29. https://unsplash.com/search/bug?photo=emTCWiq2txk Later, I started finding actual bugs. Now at this

    point, some of you are probably thinking, “yeah, yeah, get on to the interesting parts already,” and some are probably wondering what fuzzing is
  30. Fuzzing https://commons.wikimedia.org/wiki/File:Rabbit_american_fuzzy_lop_buck_white.jpg It won’t take long to explain, so I’m going to give a quick overview so everyone can follow along. Fuzzing is a technique we associate with security research, but it’s a special case of specification-based random testing, and it’s useful for QA as well, but very much underutilized!
  31. { "a" : "bc" } Run a program or function,

    the system under test, with some input. I’m testing a JSON parser, so let’s start with something simple
  32. A B D C E Observe the execution path. Like

    watching test code coverage, except we track the order in which lines are executed, not just whether or not they ever ran. We store a hash of the path.
  34. { "a" : "bc" } ✅ For this input we

    note the parser terminates without crashing and indicates it’s valid JSON
  35. { "a" : "bc" } Alter that input by mutating

    the original and run again, still observing the path
  36. { "a" : "bc" } | Alter that input by

    mutating the original and run again, still observing the path
  37. A B D C E This time, the path changes

  39. | "a" : "bc" } ❌ The program terminates without
    crashing, but marks the JSON as invalid. We always need a property to test after termination of the system under test. In AFL, the most common property to test is whether the system under test crashes. JSON.NET doesn’t tend to crash, so I want to test whether it correctly validates JSON.
  40. https://www.flickr.com/photos/29278394@N00/696701369 Do this a lot. Hundreds of thousands, preferably millions

    or more. Do it at a large enough scale and you’ll probably find interesting things. Any questions about how fuzzing works in general before I go on?
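The loop described on the last few slides is simple enough to sketch end to end. Here is a hypothetical, minimal coverage-guided fuzzer in Python (Fizil itself is F#; the toy parse function, bit_flip, and fuzz below are all illustrative inventions, not the real tool): run the system under test, hash the execution path, and keep any mutated input that produces a path hash we have not seen before.

```python
import random

def parse(data):
    """Toy system under test: returns (is_valid, execution_path)."""
    path = ["start"]
    if not data.startswith(b"{"):
        path.append("no-open")
        return False, tuple(path)
    path.append("open")
    if data.endswith(b"}"):
        path.append("close")
        return True, tuple(path)
    path.append("no-close")
    return False, tuple(path)

def bit_flip(data, rng):
    """Mutate one input by flipping a single random bit, AFL-style."""
    buf = bytearray(data)
    i = rng.randrange(len(buf) * 8)
    buf[i // 8] ^= 1 << (i % 8)
    return bytes(buf)

def fuzz(corpus, iterations=10_000, seed=0):
    """Keep any mutated input that exercises a path we haven't seen."""
    rng = random.Random(seed)
    seen = set()
    for data in corpus:
        _, path = parse(data)
        seen.add(hash(path))
    corpus = list(corpus)
    for _ in range(iterations):
        candidate = bit_flip(rng.choice(corpus), rng)
        _, path = parse(candidate)
        if hash(path) not in seen:      # new behavior: keep this test case
            seen.add(hash(path))
            corpus.append(candidate)
    return corpus

corpus = fuzz([b'{ "a" : "bc" }'])
```

Run against the seed input, the corpus grows only when a mutant exercises a genuinely new path, which is the same "interesting test case" criterion AFL uses.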
  41. Impossible? Or just really, amazingly difficult? https://commons.wikimedia.org/wiki/File:Impossible_cube_illusion_angle.svg I’d like to talk about some things which security and quality assurance have in common. Here’s one: They’re obviously important, but widely considered to be impossible to finish.
  42. When Is QA “Done”? Some people say this is impossible.

    Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.
  43. When Can We Call a System “Secure”? Some people say

    this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.
  44. https://xkcd.com/1316/ Part of why this is hard is that software keeps getting larger — both the apps and the OS they run on. Open source actually made this a lot worse — more parts. And it keeps changing. It has often been the case that a code base can grow larger than a team of humans can secure/QA, even using automation.
  45. That now happens on an empty project. You do File->New
    and you have megabytes of code you’re lucky if you can even run, never mind test.
  46. When a task is too big and too repetitive for

    a human, we want to use a computer instead, even if it’s not obvious how to do it.
  47. Exploratory https://dojo.ministryoftesting.com/lessons/exploratory-testing-an-api In both security and QA we do various sorts of automated testing and analysis and manual exploration and study. Good testers think a lot about where that line should be drawn. Can we use computers to automate exploratory testing? I think we have to explore this question, because contemporary software is too big to do 100% of exploratory testing manually. There will always be a need to test how humans react to software, but we must automate as much analysis as possible.
  48. Security ⊇ QA? To try to start to answer these questions, let’s back up and ask what we know about QA and security. And am I correct at all to describe security as domain-specific QA? What even is security? QA?
  49. Behavior Developers, QAs, security analysts, and users are all interested
    in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal. QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails. They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.
  50. Behavior Specification Developers, QAs, security analysts, and users are all
    interested in the actual behavior of software (click) Some of this will be described by a specification, formal or, more commonly, informal. QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails. They usually also expand the specification as they work (click) Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.
  55. People https://www.flickr.com/photos/wocintechchat/25677176162/ What sort of people work in QA and security jobs?
  56. http://amanda.secured.org/in-securities-comic/ Characteristics of Security Pro, Michał Zalewski (Mee-how) https://lcamtuf.blogspot.com/2016/08/so-you-want-to-work-in-security-but-are.html Based

    on Parisa Tabriz’s article, but boiled down to a list of 4 points I have time to cover here today Infosec is all about the mismatch between our intuition and the actual behavior of the systems we build Security is a protoscience. Think of chemistry in the early 19th century: a glorious and messy thing, and will change! If you are not embarrassed by the views you held two years ago, you are getting complacent - and complacency kills. Walk in [the] shoes [of software engineers] for a while: write your own code, show it to the world, and be humiliated by all the horrible mistakes you will inevitably make.
  57. https://www.quora.com/What-qualities-make-a-good-QA-engineer —Thomas Peham Characteristics of QA Similar to security pros,

    but heavy emphasis on communication. Do developers value quality?
  58. Tools https://commons.wikimedia.org/wiki/File:Tools,_arsenical_copper,_Naxos,_2700%E2%80%932200_BC,_BM,_GR_1969.12-31,_142703.jpg You might think that QA and security tools would be really different,
  59. but they can be surprisingly similar. Fiddler is used by

    people in both domains,
  60. And ZAP is quite similar to Fiddler with some security-specific

    features.
  61. Simple Testing https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf Some tools are remarkably effective! You can fix large classes of bugs using static analysis in your CI system, but nothing forces you to use it.
  62. https://laurent22.github.io/so-injections/ You can prevent most SQLi via parameterized queries, but

    nothing forces you to use them… unless you run Coverity (which is free for open source) or Veracode or…
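To make the parameterized-query point concrete, here is a small illustrative sketch using Python’s standard-library sqlite3 module (my own example, not anything from the talk). The classic injection string turns a string-concatenated query into "return everything," while the parameterized version treats the same string as plain data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

attacker_input = "' OR '1'='1"

# Vulnerable: string concatenation lets the input rewrite the query.
unsafe_sql = "SELECT role FROM users WHERE name = '" + attacker_input + "'"
leaked = conn.execute(unsafe_sql).fetchall()   # returns every row

# Safe: a parameterized query treats the input purely as data.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (attacker_input,)
).fetchall()   # returns no rows, since no user has that literal name
```

The fix is mechanical, which is exactly why a static analyzer can enforce it across an entire code base.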
  63. While it’s true you can squash large classes of defects

    and security issues with simple fixes, you won’t get to 100% this way. Defects can be introduced downstream, even at the microcode level, and people can make intentional tradeoffs away from quality. Intel used to spend effort on formal validation of microcode. Rumor has it that’s been drastically scaled back
  64. https://en.wikipedia.org/wiki/File:Row_hammer.svg We see security bugs in hardware, too, like the

    rowhammer attack on physical memory. (Explain?) So some specific problems are easy to fix, but making an entire system correct and secure is much harder. Still, it does seem like we ought to at least pick the low-hanging fruit.
  65. Specifications https://lorinhochstein.wordpress.com/2014/06/04/crossing-the-river-with-tla/ Specifications. To say whether software behaves correctly, we have to define what is correct. QAs are sometimes cynical of claims for specifications, since we’ve been promised they’re the means to automatically create perfect software and, well, we’re not there yet. Specifying a system is not simple and most people aren’t good at it. Does that mean they’re a waste of time? We can’t talk about the correctness of a system without a specification. Specifications aren’t a royal road to correctness; they’re an obligation, a precondition of any assertion of quality.
  66. [<Test>]
    let testReadDoubleWithExponent() =
        let actual = parseString "10.0e1"
        actual |> shouldEqual (Parsed (JsonNumber "10.0e1"))
    But “specification” sounds formal and academic and probably we don’t generally get these handed to us on a silver platter. What does it mean in the real world? Well, a specification is simply something which must always be true. A unit test is a simple kind of spec. It’s useful because it’s enforced by the CI build. It’s limited because it’s only one example, but much better than nothing!
  67. let toHexString (bytes: byte[]) : string = //... Another example

    of something which is always true is a type signature. In F#, this function cannot return null, ever. That’s really useful! That single feature of the type system eliminates — for real — about a quarter of the bugs I see in production C#, Java, and JavaScript systems I’m asked to maintain. More specs: This function is invertible
  68. http://d3s.mff.cuni.cz/research/seminar/download/2010-02-23-Tobies-HypervisorVerification.pdf Formal methods combine specifications with theorem provers to demonstrate

    when specifications do or do not always hold in the code. And they work! I’ll give examples later on in the talk.
  69. Best Known Example Released Informal Spec Formal Spec Execute QuickCheck

    1999 “Reversing a list twice should result in the same list” prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int] Main> quickCheck prop_RevRev OK, passed 100 tests. Earlier I mentioned Specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher
  71. Best Known Example Released Informal Spec Formal Spec Execute QuickCheck
    1999 “Reversing a list twice should result in the same list” prop_RevRev xs = reverse (reverse xs) == xs where types = xs::[Int] Main> quickCheck prop_RevRev OK, passed 100 tests. AFL 2007 “System under test shouldn’t crash no matter what I pass to it” if (WIFSIGNALED(status) && !stop_soon) { kill_signal = WTERMSIG(status); return FAULT_CRASH; } ./afl-fuzz -i testcase_dir -o findings_dir -- /path/to/tested/program [...program's cmdline...] Earlier I mentioned specification-based random testing. Idea is to produce lots of random input and see if a certain specification holds. QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click) But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher
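The QuickCheck idea on the slide ports to any language. Here is a hypothetical, dependency-free sketch in Python (prop_rev_rev and quickcheck are invented names, not a real library): generate random inputs, check that the specification holds for each one, and report the first counterexample found.

```python
import random

def prop_rev_rev(xs):
    """Spec: reversing a list twice yields the original list."""
    return list(reversed(list(reversed(xs)))) == xs

def quickcheck(prop, trials=100, seed=0):
    """Specification-based random testing: try the property on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randrange(20))]
        if not prop(xs):
            return xs          # a counterexample: the raw material of a bug report
    return None                # property held on every trial
```

A fuzzer is the same loop with a much weaker property ("the program doesn't crash") and far more trials.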
  73. https://www.flickr.com/photos/x1brett/2279939232 Unfortunately, having a method for showing when specifications hold

    makes it really obvious that producing coherent specs for your software isn’t always easy! Very easy to add requirements until they contradict each other
  74. https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d Even simple specifications can sometimes be difficult to verify.
    PRNG from Safari on left, old version of Chrome on right. Obviously, there’s an issue. Specification well-understood, but impossible to write a test which passes 100% of the time for a perfect PRNG. We need a spec for what our system should do, but that doesn’t automatically make it testable.
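One way to see the problem: any test of "the output is uniform" has to be statistical. Here is a sketch using Python’s standard-library random module (seeded so the run is reproducible; the 20% tolerance is an arbitrary illustrative choice): a perfect PRNG will still occasionally fail any fixed threshold, so no such test can pass 100% of the time.

```python
import random

def frequency_test(rng, buckets=10, samples=10_000, tolerance=0.2):
    """Check that samples land in each bucket roughly equally often.
    The tolerance is unavoidable: exact equality would fail a good PRNG."""
    counts = [0] * buckets
    for _ in range(samples):
        counts[int(rng.random() * buckets)] += 1
    expected = samples / buckets
    return all(abs(c - expected) / expected < tolerance for c in counts)

ok = frequency_test(random.Random(42))
```

The spec ("uniformly distributed") is crisp; the test of it can only ever be probabilistic.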
  75. Thought Experiment: What If Automated Tests Were Perfect? But let’s say we solved all those problems. You could prove a software system perfectly conformed to the spec. Let’s say the spec was also right. Problem solved? QA = spec + people
  76. This is life or death. This is an alert screen.

    Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. (click) Contains the number “160” (or “6160”) in 14 different places. None are “wrong” values. Nurse clicked through this and patient received 38 1/2 times the correct dose of an antibiotic. He’s fortunate to have survived. dark patterns
  78. What If Security Analysis Tools Were Perfect? You could prove edge cases outside the spec did not exist. Let’s say the spec was also right. No SQLi, no overflows, no… anything. Problem solved?
  79. –DNI James Clapper “Something like 90 percent of cyber intrusions

    start with phishing… Somebody always falls for it.” https://twitter.com/ODNIgov/status/776070411482193920 Security = spec + people. People who open phishing emails aren’t even dumb. You have someone in HR, reviewing resumes, and so an essential part of their job is opening PDFs and Word files emailed to them by clueless, anonymous people on the Internet, right?
  80. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45366.pdf If you display a security indication to a human
    being, how will they react? Google and UC Berkeley have conducted research on user response to security warnings
  81. The same team also produced an opt-in Chrome extension which

    would show you a survey if you clicked through a security warning. Knowing how humans react to security UI is a security-critical test!
  82. Manual Testing. Examples: Exploratory testing, binary analysis. Effort: Very high. Killer App: Finding cases where code is technically correct but fails at human-computer interaction. Major Disadvantage: Often misused. OK, so proof that some piece of software conforms to a specification isn’t sufficient to call it perfect in terms of quality or security, but could we even have that proof? Let’s consider what we can do today, and the return on that effort.
  83. Dynamic Analysis. Examples: QuickCheck, AFL, sqlmap. Effort: Low. Killer App: More like an app killer, amiright? Major Disadvantage: Tends to find a few specific (though important!) bugs.
  84. Static Analysis. Examples: FxCop, FindBugs, Coverity, Veracode. Effort: Very low. Killer App: Cheaper than air. Just do it. Major Disadvantage: Limited to finding a few hundred important kinds of bugs.
  85. Formal Verification / Symbolic Execution. Examples: VCC, TLA+, Cryptol. Effort: High effort but correspondingly high return. Killer App: MiTLS, Hyper-V Memory Manager. Major Disadvantage: Hard to find people with the skill set.
  86. Program Synthesis. Examples: Nothing off the shelf, really, but Agda and Z3 help. Effort: PhD-level research. Killer App: Elimination of incidental complexity. Major Disadvantage: Doesn’t really exist in general form. This may be the future of software. When you write code today you’re writing an informal specification in a language designed for general control of hardware. What if you wrote in a language better suited to your problem domain? This is not a DSL, as we think of it today, because DSLs don’t verify the consistency and satisfiability of specifications.
  87. For developers, the message here is: Don’t fix the bug.

    When you receive a defect report from a security or QA team, don’t fix it. Add something to your process, like static analysis, which makes the entire class of defect impossible. Then fix everything it catches.
  88. That’s all a bunch of words and hand-waving. Let’s bring

    this down to earth, shall we?
  89. How Amazon Web Services Uses Formal Methods “Formal methods are
    a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found through any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality.” http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html First, this stuff works in the real world, today. The scale at which customers use AWS services is far too large for even Amazon to comprehensively test. They work around this limitation by formally specifying all of their protocols in TLA+
  90. Researchers at INRIA and Microsoft Research used a dependently typed
    language called F* to implement TLS and compare the behavior of the formally verified implementation to others on fuzzed input. Six major vulnerabilities so far, including Triple Handshake, FREAK, and Logjam
  92. “Finding and Understanding Bugs in C Compilers,” Yang et al.

    https://www.flux.utah.edu/paper/yang-pldi11 It's possible to write a correct C compiler, but the best C developers in the world can't do it in C. You can’t test every possible C program with ad hoc test suites. Yang et al. used “smart” fuzzing of the C language to cover the maximum surface area of a compiler. They report: “Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites— the main way that compilers are tested—are an inadequate mechanism for quality control.” The only C compiler to survive this kind of testing was CompCert, which is formally verified in Coq. (Rust story?)
  93. These are phenomenally effective tools, and surprisingly under-utilized in industry.

    It turns out, you can try this at home! What I Learned Writing a .NET Fuzzer
  94. =================================== Technical "whitepaper" for afl-fuzz =================================== This document provides a

    quick overview of the guts of American Fuzzy Lop. See README for the general instruction manual; and for a discussion of motivations and design goals behind AFL, see historical_notes.txt. 0) Design statement ------------------- American Fuzzy Lop does its best not to focus on any singular principle of operation and not be a proof-of-concept for any specific theory. The tool can be thought of as a collection of hacks that have been tested in practice, found to be surprisingly effective, and have been implemented in the simplest, most robust way I could think of at the time. Many of the resulting features are made possible thanks to the availability of lightweight instrumentation that served as a foundation for the tool, but this mechanism should be thought of merely as a means to an end. The only true governing principles are speed, reliability, and ease of use. 1) Coverage measurements ------------------------ The instrumentation injected into compiled programs captures branch (edge) coverage, along with coarse branch-taken hit counts. The code injected at branch points is essentially equivalent to: cur_location = <COMPILE_TIME_RANDOM>; shared_mem[cur_location ^ prev_location]++; prev_location = cur_location >> 1; http://lcamtuf.coredump.cx/afl/technical_details.txt So as I said earlier, I love the concept of AFL, and especially the large amount of documentation that Michał Zalewski has written explaining the design choices and implementation details Now I want to explain my own journey
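The three lines of injected code quoted above are the heart of AFL. A Python model of that bookkeeping (MAP_SIZE matches AFL’s default; the block ids here are made-up stand-ins for AFL’s compile-time random values) shows why it captures edge direction, not just line coverage:

```python
MAP_SIZE = 1 << 16   # AFL's default 64 KB shared-memory map

def trace(block_ids):
    """Model the instrumentation AFL injects at branch points:
    shared_mem[cur ^ prev]++ , then prev = cur >> 1."""
    shared_mem = [0] * MAP_SIZE
    prev_location = 0
    for cur_location in block_ids:       # compile-time random ids per block
        shared_mem[(cur_location ^ prev_location) % MAP_SIZE] += 1
        prev_location = cur_location >> 1
    return shared_mem

a, b = 0x1234, 0x89AB                    # two hypothetical basic blocks
ab, ba = trace([a, b]), trace([b, a])
# The shift of prev_location makes A->B and B->A hit different map cells,
# so the map records the order blocks ran in, not just whether they ran.
```

Without the `>> 1` shift, A→B and B→A would XOR to the same value and the two paths would be indistinguishable.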
  95. Memory AFL often finds memory
    crashes. You get this almost free. We know how to almost entirely eliminate these. In a managed language… But the idea is still good. Can we find other kinds of errors as “easily” as memory-related crashes? This turned out to be quite a long road.
  96. { "a" : "bc" } But you have to start

    somewhere. Find a corpus. I started with. It’s valid JSON! Can we parse this?
  97. let jsonNetResult =
        try
            JsonConvert.DeserializeObject<obj>(str) |> ignore   // ← Test
            Success
        with
        | :? JsonReaderException as jre -> jre.Message |> Error
        | :? JsonSerializationException as jse -> jse.Message |> Error
        | :? System.FormatException as fe ->                    // ← Special case error stuff
            if fe.Message.StartsWith("Invalid hex character")   // hard coded in Json.NET
            then fe.Message |> Error
            else reraise()
    I had to write a short program to run the deserializer, which I’ll call the test harness
  98. use proc = new Process()                                                   ⃪ Set up
      proc.StartInfo.FileName <- executablePath
      inputMethod.BeforeStart proc testCase.Data
      proc.StartInfo.UseShellExecute <- false
      proc.StartInfo.RedirectStandardOutput <- true
      proc.StartInfo.RedirectStandardError <- true
      proc.StartInfo.EnvironmentVariables.Add(SharedMemory.environmentVariableName, sharedMemoryName)
      let output = new System.Text.StringBuilder()
      let err = new System.Text.StringBuilder()
      proc.OutputDataReceived.Add(fun args -> output.Append(args.Data) |> ignore)
      proc.ErrorDataReceived.Add (fun args -> err.Append(args.Data) |> ignore)
      proc.Start() |> ignore
      inputMethod.AfterStart proc testCase.Data
      proc.BeginOutputReadLine()                                                 ⃪ Read results
      proc.BeginErrorReadLine()
      proc.WaitForExit()
      let exitCode = proc.ExitCode
      let crashed = exitCode = WinApi.ClrUnhandledExceptionCode                  ⃪ Important bit

    And another program to read the input data, execute the test harness executable, and then see if it succeeded. Pretty simple so far. There's a lot of code here; don't worry about the details: set up, execute, collect data. The code is on my GitHub (I gave you the link at the beginning of the slides) if you want to look deeper. But my original sample input wasn't very interesting.
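The loop above boils down to: start the target with the test case, capture its output, and compare the exit code against a crash sentinel. A minimal Python analogue (my own sketch; `run_one` and the exit-code constant are illustrative, not Fizil's actual code):

```python
import subprocess

# Illustrative sentinel: the Windows SEH code for an unhandled CLR exception.
CRASH_EXIT_CODE = 0xE0434352

def run_one(argv, test_data: bytes):
    """Run the target once with the test case on stdin; report crash + output."""
    proc = subprocess.run(argv, input=test_data, capture_output=True)
    crashed = proc.returncode == CRASH_EXIT_CODE
    return crashed, proc.stdout, proc.stderr
```

Set up, execute, collect data: the shape is the same in any language.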
  99. Then I stood on the shoulders of giants. Turns out

    lots of people (well, two) like to collect problematic JSON. Now I have about 200 good test cases. But I want hundreds of thousands.
  100. /// An ordered list of functions to use when starting with a single piece of
       /// example data and producing new examples to try
       let private allStrategies = [
           bitFlip  1
           bitFlip  2
           bitFlip  4
           byteFlip 1
           byteFlip 2
           byteFlip 4
           arith8
           arith16
           arith32
           interest8
           interest16
       ]

    So I wrote a bunch of ways to transform that input into new cases. This list is just copied from AFL.
  101. let totalBits = bytes.Length * 8
       let testCases = seq {
           for bit = 0 to totalBits - flipBits do                      ⃪ Fuzz one byte
               let newBytes = Array.copy bytes
               let firstByte = bit / 8
               let firstByteMask, secondByteMask = bitMasks(bit, flipBits)
               let newFirstByte = bytes.[firstByte] ^^^ firstByteMask  ⃪ ^^^ means xor
               newBytes.[firstByte] <- newFirstByte
               let secondByte = firstByte + 1
               if secondByteMask <> 0uy && secondByte < bytes.Length then
                   let newSecondByte = bytes.[secondByte] ^^^ secondByteMask
                   newBytes.[secondByte] <- newSecondByte
               yield newBytes
       }

    And I translated the AFL fuzz C code into F#. So now I have a bunch of test cases, but I need to understand them. If I have an input and I flip one bit, maybe that's a valuable new test case, or more likely it's totally useless. How do I know?
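The same bit-flip strategy can be sketched compactly in Python (my own illustration; `bit_flips` is not a Fizil function): slide a window of `flip_bits` consecutive bits across the input and XOR each window.

```python
def bit_flips(data: bytes, flip_bits: int = 1):
    """Yield every variant of `data` with `flip_bits` consecutive bits inverted."""
    total_bits = len(data) * 8
    for bit in range(total_bits - flip_bits + 1):
        mutant = bytearray(data)
        for offset in range(flip_bits):
            byte_index, bit_index = divmod(bit + offset, 8)
            mutant[byte_index] ^= 0x80 >> bit_index  # XOR toggles exactly one bit
        yield bytes(mutant)
```

A one-byte input yields eight single-bit mutants; the window can straddle a byte boundary, which is what the second-byte-mask logic in the F# handles.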
  102. https://commons.wikimedia.org/wiki/File:CPT-Recursion-Factorial-Code.svg I need to trace all of the call stacks

    executed during the test. I’m looking for tests which produce new sequences of stacks.
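One simple way to spot "new sequences" (my own simplification, not necessarily Fizil's exact algorithm) is to reduce each trace to its set of edges, and keep any input that contributes an edge we have never seen before:

```python
# Global record of every edge (pair of consecutive instrumented blocks) seen so far.
seen_edges = set()

def is_interesting(trace):
    """`trace` is the list of instrumentation IDs hit during one run."""
    edges = set(zip(trace, trace[1:]))   # consecutive pairs approximate branch coverage
    new_edges = edges - seen_edges
    seen_edges.update(edges)
    return bool(new_edges)               # keep the input only if it found new coverage
```

This is the feedback loop that separates coverage-guided fuzzing from blind random mutation.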
  103. private static void F(string arg)
       {
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    How do I know which call stacks happen during testing? Instrumentation. At the simplest level, we want to turn this:
  104. private static void F(string arg)
       {
           instrument.Trace(29875);   ← Random number
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    into this. Which is to say, whenever we enter some block, inform an external observer what just happened. In order to inject this code, I have a couple of choices.
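In a language with first-class functions you can fake this kind of instrumentation without any binary rewriting. A Python sketch of the same idea as a decorator (my own illustration, not Fizil code):

```python
import random

trace_log = []  # the "external observer"

def instrumented(fn):
    """Give the function a compile-time-style random ID and log entry into it."""
    block_id = random.randrange(2 ** 16)  # stands in for the compile-time random
    def wrapper(*args, **kwargs):
        trace_log.append(block_id)        # inform the observer we entered this block
        return fn(*args, **kwargs)
    wrapper.block_id = block_id
    return wrapper

@instrumented
def f(arg):
    return arg.upper()
```

The point of the random ID is that the observer doesn't need symbols or source: a stream of IDs is enough to compare paths between runs.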
  105. private static void F(string arg)
       {
       #if MANUAL_INSTRUMENTATION
           instrument.Trace(29875);
       #endif
           Console.WriteLine("f");
           Console.Error.WriteLine("Error!");
           Environment.Exit(1);
       }

    I could manually add the instrumentation, which is painful.
  106. Or I could write a profiler. The .NET framework provides

    a profiling API, but it requires hosting the runtime and is annoying in other ways
  107. An easier solution is Mono.Cecil, which is sort of like

    .NET reflection except you can modify .NET assemblies without loading them into your process first.
  108. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

    I want to be able to work with any .NET executable, so I instrument binaries instead of source. So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That's complicated, so don't worry much about it. The important thing is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs.
  109. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_02_0 [generated]
           arg_07_0 [generated]
           nop()
           arg_02_0 = ldloc(ob)
           arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
           ret(arg_07_0)
       }
  110. let stringify (ob: obj) : string =
           JsonConvert.SerializeObject(ob)

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_02_0 [generated]
           arg_07_0 [generated]
           nop()
           arg_02_0 = ldloc(ob)
           arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
           ret(arg_07_0)
       }

       // Method: System.String Program::stringify(System.Object)
       .body stringify {
           arg_05_0 [generated]
           arg_0C_0 [generated]
           arg_11_0 [generated]
           arg_05_0 = ldc.i4(23831)
           call(Instrument::Trace, arg_05_0)
           nop()
           arg_0C_0 = ldloc(ob)
           arg_11_0 = call(JsonConvert::SerializeObject, arg_0C_0)
           ret(arg_11_0)
       }
  111. So what do we have so far? A ton of

    inputs, and lots of data about how the system under test behaves with them. Unfortunately, it doesn’t tell me very much. The only thing I was able to crash with this method was
  112. …Windows. Or maybe the Parallels memory manager. Not what I

    was after. But this was a rather difficult problem to work around. It’s hard to find bugs in an application when your OS is blue screening!
  113. http://www.json.org/ I need a way to determine if Json.NET is

    parsing the JSON correctly. So I thought I should write a JSON validator to check its behavior. Fortunately, there’s a standard! “Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.” -Crockford
  114. https://tools.ietf.org/html/rfc4627 So, naturally, there’s also another JSON standard

  115. http://www.ecma-international.org/ecma-262/5.1/#sec-15.12 And another JSON standard

  116. http://www.ecma-international.org/publications/standards/Ecma-404.htm And another JSON standard

  117. https://tools.ietf.org/html/rfc7158 And another JSON standard

  118. https://tools.ietf.org/html/rfc7159 And another JSON standard. And no, they don’t all

    agree on everything, nor is there a single, “latest” version. Despite this multitude of standards, there are still edge cases intentionally delegated to the implementer — what we would call “undefined behavior” in C.
  119. https://github.com/nst/STJSON I was going to write my own validator, but…

    Nicolas Seriot wrote a validator called STJSON which attempts to synthesize these as much as possible.
  120. https://github.com/CraigStuntz/Fizil/blob/master/StJson/StJsonParser.fs Swift doesn’t readily compile to Windows, but if you

    squint hard enough it kind of looks like F#, so I ported the code and used it to validate Json.NET's behavior.
  121. Standard Accepts, Json.NET Rejects

       Value:
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888
       88888888888888888888888888888888888888888888888888

       Standard says: No limit
       Json.NET:      MaximumJavascriptIntegerCharacterLength = 380;

    Things Json.NET fails on that the standard accepts
  122. Standard Rejects, Json.NET Accepts

       Value:         [,,,]
       Standard says: A JSON value MUST be an object, array, number, or string, or one of
                      the following three literal names: false null true
       Json.NET:      [null, null, null, null]

    Things Json.NET succeeds on that the standard rejects
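As an aside of my own (not from the talk), Python's standard-library json module takes the stricter reading here and rejects the bare commas that Json.NET reportedly turns into nulls:

```python
import json

def parses(text: str) -> bool:
    """Return True if the standard-library parser accepts `text` as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Comparing several implementations against each other like this is differential testing: any disagreement is a bug in at least one of them, or in the standard.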
  123. I m p l e m e n t a

    t i o n D e t a i l s I have lots of interesting stories from implementing this code, but some of them get kind of low level, so I’ll share just a couple I think are of general interest. Please do feel free to pull my code or reach out to me if you want complete details!
  124. let private insertTraceInstruction(ilProcessor: ILProcessor, before: Instruction, state) =
           let compileTimeRandom = state.Random.Next(0, UInt16.MaxValue |> Convert.ToInt32)
           let ldArg     = ilProcessor.Create(OpCodes.Ldc_I4, compileTimeRandom)
           let callTrace = ilProcessor.Create(OpCodes.Call,   state.Trace)
           ilProcessor.InsertBefore(before, ldArg)
           ilProcessor.InsertAfter (ldArg,  callTrace)

       This margin is too narrow to contain a try/finally example, so see:
       https://goo.gl/W4y7JH

    Inserting the IL instructions I needed was fairly easy. Here is the important bit of the code which does it. How did I learn how to write this? I instrumented a small program "manually" by writing the instrumentation code myself, and then decompiled that program to figure out which IL instructions I needed. Inserting them with Mono.Cecil is just a few lines of code. try/finally is much, much harder. I won't even try to walk you through it here. Look at the GitHub repo if you want to see how it's done.
  125. Strong naming was a consistent pain for me. I’m altering

    the binaries of assemblies, and part of the point of strong naming is to stop you from doing just that, so naturally if the assembly is strongly named it can’t be loaded when I’m finished.
  126. let private removeStrongName (assemblyDefinition : AssemblyDefinition) =
           let name = assemblyDefinition.Name
           name.HasPublicKey <- false
           name.PublicKey <- Array.empty
           assemblyDefinition.Modules
           |> Seq.iter (fun moduleDefinition ->
               moduleDefinition.Attributes <-
                   moduleDefinition.Attributes &&& ~~~ModuleAttributes.StrongNameSigned)
           let aptca =
               assemblyDefinition.CustomAttributes.FirstOrDefault(fun attr ->
                   attr.AttributeType.FullName =
                       typeof<System.Security.AllowPartiallyTrustedCallersAttribute>.FullName)
           assemblyDefinition.CustomAttributes.Remove aptca |> ignore

       assembly.MainModule.AssemblyReferences
       |> Seq.filter (fun reference -> Set.contains reference.Name assembliesToInstrument)
       |> Seq.iter   (fun reference -> reference.PublicKeyToken <- null)

    So I need to remove the strong name from any assembly I fuzz, but I also need to remove the PublicKeyToken from any other assembly which references it. Doing this in Mono.Cecil is not well documented, and after quite a bit of time spent in GitHub issues and trial and error I figured out that it takes five distinct steps.
  127. I n / O u t o f P r

    o c e s s In order to stop Windows/Parallels from crashing I decided to try and fuzz the system under test in process with the fuzzer itself to reduce the number of processes I was creating. This worked, but had a surprising result: The number of distinct paths through the code I found during testing changed. Why? Isn’t it running the same code with the same inputs? Yes, absolutely! But when you run a function, the .NET framework doesn’t guarantee that the same instructions will be executed each time. Let’s look at why.
  128. So my test harness is going to execute some code,

    and I’ve instrumented that code, which means I’ll get a series of trace events as the code runs. I might expect these events to come in the same order for executing a single function.
  129. But that’s not what happens. This is an actual trace of the method I just showed you, and the same trace events don’t occur at the same point in every run. Why?
  130. –ECMA-335, Common Language Infrastructure (CLI), Partition I “If marked BeforeFieldInit

    then the type’s initializer method is executed at, or sometime before, first access to any static field defined for that type.” The .NET CLR guarantees a type initializer will be invoked before the type is used, but it doesn’t specify exactly when.
  131. f ( x ) = f ( x ) t

    i m e ( f ( x ) ) ! = t i m e ( f ( x ) ) ✅ ❌ For many methods, the CLR guarantees they’ll return the same result for the same argument, but does not guarantee the same instructions will be executed in the same order to get that result. What does this tell us about QA? What does this tell us about security? (Any thoughts on that?)
  132. U n i c o d e Original JSON {

    "a": "bc" } ASCII Bytes 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D UTF-8 with Byte Order Mark EF BB BF 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D UTF-16 BE with BOM FE FF 00 7B 00 20 00 22 00 61 00 22 00 20 00 3A 00 20 00 22 00 62 00 63 00 22 00 20 00 7D What does it mean to fuzz JSON? Which of these byte arrays should I fuzz? Web uses UTF-8. Browsers use WTF-8. Windows, C#, and JSON.NET like UTF-16. We must choose an encoding for our test corpus and then choose whether to convert or just fuzz as-is
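The slide's three byte arrays can be reproduced from the same JSON text with a few lines of Python (my own sketch):

```python
import codecs

text = '{ "a" : "bc" }'

# Three encodings of the same JSON value produce three different byte arrays.
ascii_bytes       = text.encode("ascii")
utf8_with_bom     = codecs.BOM_UTF8 + text.encode("utf-8")          # EF BB BF ...
utf16_be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 7B ...
```

A fuzzer that only mutates one of these representations will never exercise the BOM-handling or multi-byte-decoding paths of the parser.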
  133. T h a n k Y o u ! Presentation

    Review Cassandra Faris Chad James Damian Synadinos Doug Mair Tommy Graves Source Code Inspiration Michał Zalewski Nicolas Seriot Everyone Who Works on dnSpy & Mono.Cecil Finally…
  134. None
  135. C r a i g S t u n t

    z @craigstuntz Craig.Stuntz@Improving.com http://www.craigstuntz.com http://www.meetup.com/Papers-We-Love-Columbus/ https://speakerdeck.com/craigstuntz