Mashing Up QA and Security - CodeMash 2017 - with notes

This version of the deck has speaker notes. I've published a separate version without notes.

Security is domain-specific quality assurance, but developers, testers, and security professionals often don’t work together. When this type of disconnect exists between big groups of people who are very good at their jobs, there is usually a mostly untapped potential for learning. I’ve been exploring this landscape by writing an open source fuzzer aimed at discovering new test cases (not just crashes!) using binary rewriting of managed executables and genetic modification of a test corpus, implemented in F# and using Mono.Cecil. I’ll contrast the fundamentals of each discipline, demonstrate tools used by experts on both sides of the security and QA fence, and challenge the audience to find new ways to mix them up. Expect to see lots of code and leave with ideas for making entire communities better, not just your own team!

Craig Stuntz

January 13, 2017

Transcript

  1. Mashing Up QA and Security
    Craig Stuntz
    Improving
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/Fizil
    When I submit talk proposals, I ask myself two questions: 1) What will the audience take away from this talk, and 2) What’s the weirdest stuff the conference will possibly accept?

    So I’m going to cover a lot of material, and some of it will seem quite strange. Please do feel free to raise your hand or just shout out a question. I’ll be happy to slow down and explain. I’m not planning on leaving the last 10 minutes empty for questions, so do ask whenever you feel it would be helpful. My goal here is to leave you with food for thought. I think I’ll have succeeded if you find at least one or two of the ideas here intriguing enough to want to research yourself next week.

  2. https://www.flickr.com/photos/futureshape/566200801
    Security is domain-specific QA, but the fields often seem worlds apart. Maybe it shouldn’t be that way?

  3. QAs and security analysts are smart people

  4. https://what-if.xkcd.com/49/
    who are sometimes considered a bit strange or maybe even feared by developers and business types

  5. but care very deeply about software correctness

  6. Unfortunately, they go to totally different parties

  7. Except CodeMash

    Can we take this opportunity to learn from each other?

  8. Software Correctness

    OK, spoiler alert! There are four core ideas I’m going to explore in this talk.

    Security and QA both explore software correctness. I’d like to add some precision to what we mean when we say that software is “insecure” or “buggy.”

  9. Manual Analysis

    Manual analysis adds value when we deal with human computer interaction. But you can do anything with manual analysis! Sometimes, stuff you shouldn’t…

  10. Undefined Behavior

    Software often has unintended behaviors. Eliminating these helps security and quality, and can be automated much more commonly than generally suspected. We’re
    going to talk about how to specify the behavior of real systems consistently. It is a truth universally acknowledged that deleting code is an excellent way of improving the
    quality and security of an application.

  11. Implementing This Stuff

    I used these ideas to build a tool with legs in both disciplines, with interesting results!

    Now, my code is not a magic fix for all the world’s problems. It’s an experiment.

    I hope you’ll be inspired to try some experiments of your own. You’ll probably do it better than I do.

  12. I started down this rabbit hole years ago when I first learned about AFL

    Has anyone here used it/heard of it?

    For a super simple technique, it delivers some pretty impressive results

    We’ll talk more about how it works later, but first I’ll give an example of why I find it worthy of study.

  14. 20 TB SWF files from Google index
    https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    Here’s one example of AFL in use. Google researchers took…

  15. 1 week run time on 2000 cores to find minimal set of 20000 SWF files
    https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    They simply observe the execution paths — the traces from runtime profiling — to select from those 20 TB of SWF files a representative subset which maximizes the
    many different ways a file can be parsed.
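
    A minimal sketch of that selection step, assuming each file’s trace has already been reduced to a set of path IDs. The function name `minimize_corpus` and the toy traces are hypothetical, and the greedy set-cover heuristic is a common approximation, not necessarily what the Google team ran:

```python
def minimize_corpus(traces):
    """Greedily pick files until every observed path is covered.

    `traces` maps a file name to the set of path IDs it exercised.
    Greedy set cover is an approximation: the result is small, but it
    is not guaranteed to be the smallest possible subset.
    """
    uncovered = set().union(*traces.values())
    chosen = []
    while uncovered:
        # Take the file that covers the most still-uncovered paths;
        # sorting the names first makes tie-breaking deterministic.
        best = max(sorted(traces), key=lambda name: len(traces[name] & uncovered))
        chosen.append(best)
        uncovered -= traces[best]
    return chosen


# Hypothetical traces for four SWF files.
traces = {
    "a.swf": {1, 2, 3},
    "b.swf": {2, 3},
    "c.swf": {4},
    "d.swf": {1, 4, 5},
}
print(minimize_corpus(traces))  # ['a.swf', 'd.swf']
```

    Two of the four files are enough here, because between them they exercise every path that any file exercises.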

  16. 3 weeks run time on 2000 cores with mutated inputs
    https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    …and then go to town. Take those inputs, fiddle with the bits, and test again.

  17. ⇒ 400 unique crash signatures
    https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    and you start finding lots of bugs

  18. ⇒ 106 distinct security bugs
    https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    which turn out to be pretty serious. This was around 5 years ago and it’s much more efficient today.

    As automated tests go, a month of run time on 2000 cores is fairly high. But with cloud computing this kind of infrastructure is available to anyone who needs it.

  19. But Flash is a giant pile of C code, and human beings can’t write safe C code. Finding memory access and overflow bugs in a giant pile of C code is sort of pedestrian.

  20. Also, the test is pretty simple: If the app crashes, there’s a bug (probably exploitable).

  21. https://www.flickr.com/photos/sloth_rider/392367929
    Sometimes the best way to understand something is to implement it yourself.

    Could we pick a harder problem?

  22. https://commons.wikimedia.org/wiki/File:ACT_recycling_truck.jpg
    Can rule out a whole lot of routine C memory errors with managed code / garbage collection.

  23. Can start with a system under test considered to be stable instead of Flash.

  24. So I started a project….

  25. Feature | Fizil | AFL
    Runs on Windows | ✅ | There’s a fork
    Runs on Unix | ❌ | ✅
    Fast | ❌ | ✅
    Bunnies! | ❌ |
    Process models | In Process, Out of Process | Fork Server, Out of Process
    Instrumentation guided | Soon? | ✅
    Automatic instrumentation | .NET Assemblies | Clang, GCC, Python
    Rich suite of fuzzing strategies | Getting there! | ✅
    Automatically disables crash reporting | ✅ | ❌
    Rich tooling | ❌ | ✅
    Proven track record | ❌ | ✅
    Stable | ❌ | ✅
    License | Apache 2.0 | Apache 2.0

    I take a lot of inspiration from AFL, going so far as to port some of the AFL C code to F# in Fizil. But my goals are really different, and the two tools do very different things.

  26. and eventually I was able to test real software

  28. So naturally the first bug I found was in some giant pile of C code.

  29. https://unsplash.com/search/bug?photo=emTCWiq2txk
    Later, I started finding actual bugs.

    Now at this point, some of you are probably thinking, “yeah, yeah, get on to the interesting parts already,” and some are probably wondering what fuzzing is.

  30. Fuzzing
    https://commons.wikimedia.org/wiki/File:Rabbit_american_fuzzy_lop_buck_white.jpg
    It won’t take long to explain, so I’m going to give a quick overview so everyone can follow along.

    Fuzzing is a technique we associate with security research, but it’s a special case of specification-based random testing. It’s useful for QA as well, but very much underutilized!

  31. { "a" : "bc" }
    Run a program or function, the system under test, with some input. I’m testing a JSON parser, so let’s start with something simple

  32. [Diagram: code blocks A, B, C, D, and E]
    Observe the execution path. Like watching test code coverage, except we track the order in which lines are executed, not just whether or not they ever ran. We store a
    hash of the path.
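
    To make the difference from plain coverage concrete, here is a small sketch. The block names come from the slide; `path_hash` and the two runs are hypothetical, and a real fuzzer would use something much cheaper than SHA-256:

```python
import hashlib


def path_hash(trace):
    """Hash the ordered sequence of executed blocks."""
    return hashlib.sha256("->".join(trace).encode("utf-8")).hexdigest()


# Two hypothetical runs that execute the same blocks in a different order.
run_1 = ["A", "B", "C", "E"]
run_2 = ["A", "C", "B", "E"]

same_coverage = set(run_1) == set(run_2)          # coverage ignores order
same_path = path_hash(run_1) == path_hash(run_2)  # the path hash does not
print(same_coverage, same_path)  # True False
```

    A coverage report would call these two runs identical. The path hash tells them apart, which is what lets the fuzzer notice that an input did something new.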

  34. { "a" : "bc" }

    For this input we note the parser terminates without crashing and indicates it’s valid JSON

  35. { "a" : "bc" }
    Alter that input by mutating the original and run again, still observing the path

  36. { "a" : "bc" }
    |
    Alter that input by mutating the original and run again, still observing the path

  37. [Diagram: code blocks A, B, C, D, and E]
    This time, the path changes

  39. | "a" : "bc" }

    The program terminates without crashing, but marks the JSON as invalid. We always need a property to test after termination of the system under test

    In AFL, the most common property to test is whether the system under test crashes. JSON.NET doesn’t tend to crash, so I want to test whether it correctly validates
    JSON.

  40. https://www.flickr.com/photos/29278394@N00/696701369
    Do this a lot. Hundreds of thousands, preferably millions or more. Do it at a large enough scale and you’ll probably find interesting things.

    Any questions about how fuzzing works in general before I go on?
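
    Here is a toy version of the whole loop, assuming a brace-balance checker as a stand-in for the JSON parser. The names `validate`, `mutate`, and `fuzz` are hypothetical, and recording branch labels by hand stands in for the instrumentation a real fuzzer adds automatically:

```python
import random


def validate(text, trace):
    """Toy system under test: checks that braces are balanced.

    Appends a label to `trace` at each branch so the caller can see the path.
    """
    depth = 0
    for ch in text:
        if ch == "{":
            trace.append("open")
            depth += 1
        elif ch == "}":
            trace.append("close")
            depth -= 1
            if depth < 0:
                trace.append("underflow")
                return False
        else:
            trace.append("other")
    return depth == 0


def mutate(text, rng):
    """Replace one randomly chosen character with a random printable one."""
    i = rng.randrange(len(text))
    return text[:i] + chr(rng.randrange(32, 127)) + text[i + 1:]


def fuzz(seed_input, iterations, rng):
    corpus = [seed_input]
    seen_paths = set()
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        trace = []
        result = validate(candidate, trace)
        # Property under test: the validator always answers True or False.
        assert isinstance(result, bool)
        path = tuple(trace)
        if path not in seen_paths:
            # A new path means new behavior, so keep this input for more mutation.
            seen_paths.add(path)
            corpus.append(candidate)
    return corpus, seen_paths


corpus, paths = fuzz('{ "a" : "bc" }', 5000, random.Random(0))
print(len(corpus), "inputs kept,", len(paths), "distinct paths")
```

    Keeping only the inputs that reach a new path is the part that makes this more than random noise: the corpus grows toward behavior the original test case never touched.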

  41. Impossible?
    Or just really, amazingly difficult?
    https://commons.wikimedia.org/wiki/File:Impossible_cube_illusion_angle.svg
    I’d like to talk about some things which security and quality assurance have in common. Here’s one: They’re obviously important, but widely considered to be impossible
    to finish.

  42. When Is QA “Done”?

    Some people say this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.

  43. When Can We Call a System “Secure”?

    Some people say this is impossible. Maybe it is, maybe it’s not. But I do think we should at least try to say what this might mean before we decide that.

  44. https://xkcd.com/1316/
    Part of why this is hard is that software keeps getting larger, both the apps and the OS they run on.

    Open source actually made this a lot worse: there are more parts, and they keep changing.

    It has often been the case that a code base can grow larger than a team of humans can secure/QA, even using automation.

  45. That now happens on an empty project. You do File->New and you have megabytes of code you’re lucky if you can even run, never mind test.

  46. When a task is too big and too repetitive for a human, we want to use a computer instead, even if it’s not obvious how to do it.

  47. Exploratory
    https://dojo.ministryoftesting.com/lessons/exploratory-testing-an-api
    In both security and QA we do various sorts of automated testing and analysis and manual exploration and study. Good testers think a lot about where that line should be drawn.

    Can we use computers to automate exploratory testing?

    I think we have to explore this question, because contemporary software is too big to do 100% of exploratory testing manually. There will always be a need to test how humans react to software, but we must automate as much analysis as possible.

  48. Security ⊇ QA?
    To try to start to answer these questions, let’s back up and ask what we know about QA and security. And am I correct at all to describe security as domain-specific QA? What even is security? QA?

  49. Behavior
    Developers, QAs, security analysts, and users are all interested in the actual behavior of software (click)

    Some of this will be described by a specification, formal or, more commonly, informal

    QAs will investigate whether the behavior (click) correctly conforms to the spec or (click) fails

    They usually also expand the specification as they work (click)

    Security analysts, on the other hand, are most interested in the areas where the behavior falls outside of the specification, but also where the specification is self-contradictory.

  50. Behavior
    Specification

  55. People
    https://www.flickr.com/photos/wocintechchat/25677176162/
    What sort of people work in QA and security jobs?

  56. http://amanda.secured.org/in-securities-comic/
    Characteristics of a security pro, from Michał Zalewski (pronounced “Mee-how”): https://lcamtuf.blogspot.com/2016/08/so-you-want-to-work-in-security-but-are.html
    Based on Parisa Tabriz’s article, but boiled down to a list of 4 points I have time to cover here today.

    Infosec is all about the mismatch between our intuition and the actual behavior of the systems we build.

    Security is a protoscience. Think of chemistry in the early 19th century: a glorious and messy thing, and it will change!

    If you are not embarrassed by the views you held two years ago, you are getting complacent - and complacency kills.

    Walk in [the] shoes [of software engineers] for a while: write your own code, show it to the world, and be humiliated by all the horrible mistakes you will inevitably make.

  57. https://www.quora.com/What-qualities-make-a-good-QA-engineer
    —Thomas Peham
    Characteristics of QA

    Similar to security pros, but heavy emphasis on communication. Do developers value quality?

  58. Tools
    https://commons.wikimedia.org/wiki/File:Tools,_arsenical_copper,_Naxos,_2700%E2%80%932200_BC,_BM,_GR_1969.12-31,_142703.jpg
    You might think that QA and security tools would be really different,

  59. but they can be surprisingly similar. Fiddler is used by people in both domains,

  60. And ZAP is quite similar to Fiddler with some security-specific features.

  61. Simple Testing
    https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
    Some tools are remarkably effective! You can fix large classes of bugs using static analysis in your CI system, but nothing forces you to use it.

  62. https://laurent22.github.io/so-injections/
    You can prevent most SQLi via parameterized queries, but nothing forces you to use them… unless you run Coverity (which is free for open source) or Veracode or…
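
    For anyone who hasn’t seen the difference side by side, here is a small sketch using Python’s built-in `sqlite3` module. The `users` table and the hostile input are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

hostile = "nobody' OR '1'='1"

# Vulnerable: the input is spliced into the SQL text and changes its meaning.
unsafe_sql = "SELECT secret FROM users WHERE name = '" + hostile + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameterized: the driver passes the input as data, never as SQL.
safe_rows = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # 2 0
```

    The concatenated query leaks every row, while the parameterized one matches nothing, because no user is literally named `nobody' OR '1'='1`.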

  63. While it’s true you can squash large classes of defects and security issues with simple fixes, you won’t get to 100% this way. Defects can be introduced downstream, even at the microcode level, and people can make intentional tradeoffs away from quality. Intel used to spend effort on formal validation of microcode. Rumor has it that’s been drastically scaled back.

  64. https://en.wikipedia.org/wiki/File:Row_hammer.svg
    We see security bugs in hardware, too, like the rowhammer attack on physical memory. (Explain?) So some specific problems are easy to fix, but making an entire system
    correct and secure is much harder. Still, it does seem like we ought to at least pick the low-hanging fruit.

  65. Specifications
    https://lorinhochstein.wordpress.com/2014/06/04/crossing-the-river-with-tla/
    Specifications. To say whether software behaves correctly, we have to define what is correct. QAs are sometimes cynical about claims for specifications, since we’ve been promised they’re the means to automatically create perfect software and, well, we’re not there yet. Specifying a system is not simple and most people aren’t good at it. Does that mean specifications are a waste of time?

    We can’t talk about the correctness of a system without a specification.

    Specifications aren’t a royal road to correctness; they’re an obligation, a precondition of any assertion of quality.

  66. [<Test>]
    let testReadDoubleWithExponent() =
        let actual = parseString "10.0e1"
        actual |> shouldEqual (Parsed (JsonNumber "10.0e1"))
    But “specification” sounds formal and academic, and we probably don’t generally get these handed to us on a silver platter. What does it mean in the real world? Well, a specification is simply something which must always be true.

    A unit test is a simple kind of spec. It’s useful because it’s enforced by the CI build. It’s limited because it’s only one example, but much better than nothing!

  67. let toHexString (bytes: byte[]) : string =
        //...
    Another example of something which is always true is a type signature.

    In F#, this function cannot return null, ever. That’s really useful! That single feature of the type system eliminates — for real — about a quarter of the bugs I see in
    production C#, Java, and JavaScript systems I’m asked to maintain.

    More specs: This function is invertible
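
    “Invertible” is a spec you can check with random inputs. A rough sketch, with Python’s `bytes.hex` standing in for the F# `toHexString` on the slide; `from_hex_string` and `check_round_trip` are hypothetical names:

```python
import random


def to_hex_string(data: bytes) -> str:
    """Python stand-in for the F# toHexString on the slide."""
    return data.hex()


def from_hex_string(text: str) -> bytes:
    """Hypothetical inverse of to_hex_string."""
    return bytes.fromhex(text)


def check_round_trip(trials, rng):
    """Property: decoding an encoded value returns the original bytes."""
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        assert from_hex_string(to_hex_string(data)) == data
    return trials


print("OK, passed", check_round_trip(1000, random.Random(0)), "tests.")
```

    Unlike the single example in a unit test, this checks the same statement against a thousand inputs nobody had to think up by hand.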

  68. http://d3s.mff.cuni.cz/research/seminar/download/2010-02-23-Tobies-HypervisorVerification.pdf
    Formal methods combine specifications with theorem provers to demonstrate when specifications do or do not always hold in the code. And they work! I’ll give examples
    later on in the talk.

  69. Best Known Example: QuickCheck
    Released: 1999
    Informal Spec: “Reversing a list twice should result in the same list”
    Formal Spec:
        prop_RevRev xs = reverse (reverse xs) == xs
          where types = xs::[Int]
    Execute:
        Main> quickCheck prop_RevRev
        OK, passed 100 tests.

    Earlier I mentioned specification-based random testing. The idea is to produce lots of random input and see if a certain specification holds.

    QuickCheck is a well-known example (explain) (click) Does that sound more like QA or security? (click)

    But of course this is really similar to what AFL does, only the spec is a bit more general (click) and the number of iterations is far higher.

  71. Best Known Example: QuickCheck
    Released: 1999
    Informal Spec: “Reversing a list twice should result in the same list”
    Formal Spec:
        prop_RevRev xs = reverse (reverse xs) == xs
          where types = xs::[Int]
    Execute:
        Main> quickCheck prop_RevRev
        OK, passed 100 tests.

    Best Known Example: AFL
    Released: 2007
    Informal Spec: “System under test shouldn’t crash no matter what I pass to it”
    Formal Spec:
        if (WIFSIGNALED(status) && !stop_soon) {
          kill_signal = WTERMSIG(status);
          return FAULT_CRASH;
        }
    Execute:
        ./afl-fuzz -i testcase_dir -o findings_dir -- \
          /path/to/tested/program [...program's cmdline...]

  73. https://www.flickr.com/photos/x1brett/2279939232
    Unfortunately, having a method for showing when specifications hold makes it really obvious that producing coherent specs for your software isn’t always easy!

    Very easy to add requirements until they contradict each other

  74. https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d
    Even simple specifications can sometimes be difficult to prove.

    PRNG from Safari on left, old version of Chrome on right. Obviously, there’s an issue.

    The specification is well understood, but it’s impossible to write a test which passes 100% of the time for a perfect PRNG.

    We need a spec for what our system should do, but that doesn’t automatically make it testable
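
    Here is roughly what a statistical test for a PRNG looks like, as a sketch. The bucket count, the sample size, and the deliberately broken generator are my own choices for illustration; the threshold is the standard chi-square critical value for 15 degrees of freedom at p = 0.001:

```python
import random

# Chi-square critical value for 15 degrees of freedom at p = 0.001.
CRITICAL_15_DOF_P001 = 37.697


def chi_square_uniform(samples, buckets=16):
    """Chi-square statistic for samples in [0, 1) against a uniform distribution."""
    counts = [0] * buckets
    for x in samples:
        counts[int(x * buckets)] += 1
    expected = len(samples) / buckets
    return sum((c - expected) ** 2 / expected for c in counts)


rng = random.Random(2017)  # fixed seed so the run is repeatable
good = [rng.random() for _ in range(16_000)]
# A deliberately broken generator that can only produce 8 distinct values.
bad = [rng.randrange(8) / 8 for _ in range(16_000)]

print("good generator passes:", chi_square_uniform(good) < CRITICAL_15_DOF_P001)
print("bad generator passes:", chi_square_uniform(bad) < CRITICAL_15_DOF_P001)
```

    The catch is in the threshold. With p = 0.001, even a perfect generator fails this check about one run in a thousand, so the test can flag an obviously broken PRNG but can never be a pass-every-time assertion about a good one.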

  75. Thought Experiment: What If Automated Tests Were Perfect?
    But let’s say we solved all those problems. You could prove a software system perfectly conformed to the spec. Let’s say the spec was also right. Problem solved?

    QA = spec + people

  76. This is life or death. This is an alert screen.

    Epic EMR. One of 17,000 alerts the UCSF physicians received in that month alone. (click)

    Contains the number “160” (or “6160”) in 14 different places. None are “wrong” values.

    Nurse clicked through this and patient received 38 1/2 times the correct dose of an antibiotic. He’s fortunate to have survived.

    dark patterns

  78. What If Security Analysis Tools Were Perfect?
    You could prove edge cases outside the spec did not exist. Let’s say the spec was also right.

    No SQLi, no overflows, no… anything. Problem solved?

  79. –DNI James Clapper
    “Something like 90 percent of cyber intrusions start with phishing… Somebody always falls for it.”
    https://twitter.com/ODNIgov/status/776070411482193920
    Security = spec + people. People who open phishing emails aren’t even dumb.

    You have someone in HR, reviewing resumes, and so an essential part of their job is opening PDFs and Word files emailed to them by clueless, anonymous people on the Internet, right?

  80. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45366.pdf
    If you display a security indication to a human being, how will they react?

    Google and UC Berkeley have conducted research on user response to security warnings.

  81. The same team also produced an opt-in Chrome extension which would show you a survey if you clicked through a security warning. Knowing how humans react to
    security UI is a security-critical test!

  82. Manual Testing
    Examples: Exploratory testing, Binary analysis
    Effort: Very high
    Killer App: Finding cases where code is technically correct but fails at human-computer interaction
    Major Disadvantage: Often misused

    OK, so proof that some piece of software conforms to a specification isn’t sufficient to call it perfect in terms of quality or security, but could we even have that proof? Let’s consider what we can do today, and the return on that effort.

  83. Dynamic Analysis
    Examples: QuickCheck, AFL, sqlmap
    Effort: Low
    Killer App: More like an app killer, amiright?
    Major Disadvantage: Tends to find a few specific (though important!) bugs

  84. Static Analysis
    Examples: FxCop, FindBugs, Coverity, Veracode
    Effort: Very low
    Killer App: Cheaper than air. Just do it.
    Major Disadvantage: Limited to finding a few hundred important kinds of bugs

  85. Formal Verification / Symbolic Execution
    Examples: VCC, TLA+, Cryptol
    Effort: High effort but correspondingly high return
    Killer App: MiTLS, Hyper-V Memory Manager
    Major Disadvantage: Hard to find people with skill set

  86. Program Synthesis
    Examples: Nothing off the shelf, really, but Agda and Z3 help
    Effort: PhD-level research
    Killer App: Elimination of incidental complexity
    Major Disadvantage: Doesn’t really exist in general form

    This may be the future of software. When you write code today you’re writing an informal specification in a language designed for general control of hardware. What if you wrote in a language better suited to your problem domain? This is not a DSL, as we think of it today, because DSLs don’t verify the consistency and satisfiability of specifications.

  87. For developers, the message here is: Don’t fix the bug. When you receive a defect report from a security or QA team, don’t fix it. Add something to your process, like
    static analysis, which makes the entire class of defect impossible. Then fix everything it catches.

  88. That’s all a bunch of words and hand-waving. Let’s bring this down to earth, shall we?

  89. How Amazon Web Services Uses Formal Methods
    “Formal methods are a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found through any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality.”
    http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html
    First, this stuff works in the real world, today. The scale at which customers use AWS services is far too large for even Amazon to comprehensively test. They work around this limitation by formally specifying all of their protocols in TLA+.

  90. Researchers at INRIA and Microsoft Research used a dependently typed language called F* to implement TLS and compare the behavior of the formally verified implementation to others on fuzzed input.

    Six major vulnerabilities so far, including Triple Handshake, FREAK, and Logjam



  92. “Finding and Understanding Bugs in C Compilers,”
    Yang et al.
    https://www.flux.utah.edu/paper/yang-pldi11
    It's possible to write a correct C compiler, but the best C developers in the world can't do it in C. You can’t test every possible C program with ad hoc test suites. Yang et
    al. used “smart” fuzzing of the C language to cover the maximum surface area of a compiler. They report:

    “Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites—
    the main way that compilers are tested—are an inadequate mechanism for quality control.”

    The only C compiler to survive this kind of testing was CompCert, which is formally verified in Coq. (Rust story?)


  93. These are phenomenally effective tools, and surprisingly under-utilized in industry.

    It turns out, you can try this at home! What I Learned Writing a .NET Fuzzer


  94. ===================================
    Technical "whitepaper" for afl-fuzz
    ===================================
    This document provides a quick overview of the guts of American Fuzzy Lop.
    See README for the general instruction manual; and for a discussion of
    motivations and design goals behind AFL, see historical_notes.txt.
    0) Design statement
    -------------------
    American Fuzzy Lop does its best not to focus on any singular principle of
    operation and not be a proof-of-concept for any specific theory. The tool can
    be thought of as a collection of hacks that have been tested in practice,
    found to be surprisingly effective, and have been implemented in the simplest,
    most robust way I could think of at the time.
    Many of the resulting features are made possible thanks to the availability of
    lightweight instrumentation that served as a foundation for the tool, but this
    mechanism should be thought of merely as a means to an end. The only true
    governing principles are speed, reliability, and ease of use.
    1) Coverage measurements
    ------------------------
    The instrumentation injected into compiled programs captures branch (edge)
    coverage, along with coarse branch-taken hit counts. The code injected at
    branch points is essentially equivalent to:
    cur_location = <COMPILE_TIME_RANDOM>;
    shared_mem[cur_location ^ prev_location]++;
    prev_location = cur_location >> 1;
    http://lcamtuf.coredump.cx/afl/technical_details.txt
    So as I said earlier, I love the concept of AFL, and especially the large amount of documentation that Michał Zalewski has written explaining the design choices and
    implementation details.
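
    As a rough illustration of the three injected lines quoted above, here is a minimal F# sketch of AFL-style edge counting. The names `mapSize`, `sharedMem`, `resetCoverage`, and `trace` are assumptions made for this example; they are not taken from AFL or Fizil, and a real fuzzer would keep the map in shared memory rather than in an ordinary array.

```fsharp
// Illustrative sketch of AFL-style edge coverage; all names here are hypothetical.
let mapSize = 65536                          // AFL's default map is 64 kB
let sharedMem : byte[] = Array.zeroCreate mapSize
let mutable prevLocation = 0

/// Clear the map before running the next test case.
let resetCoverage () =
    Array.fill sharedMem 0 mapSize 0uy
    prevLocation <- 0

/// Called at each instrumented branch point with that point's compile-time
/// random ID (assumed to be in the range 0 .. mapSize - 1).
let trace (curLocation: int) =
    let edge = (curLocation ^^^ prevLocation) % mapSize
    sharedMem.[edge] <- sharedMem.[edge] + 1uy   // byte arithmetic wraps around at 255
    prevLocation <- curLocation >>> 1            // the shift keeps A->B distinct from B->A

/// Number of distinct map slots touched so far.
let edgesHit () =
    sharedMem |> Array.filter (fun hits -> hits > 0uy) |> Array.length
```

    Because the previous location is shifted before the XOR, visiting block 0x1234 and then 0x5678 touches a different slot than visiting them in the opposite order, which is what lets the map record edges rather than just blocks.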

    Now I want to explain my own journey


  95. Memory
    AFL often finds memory crashes. You get this almost free.

    We know how to almost entirely eliminate these. In a managed language….

    But the idea is still good. Can we find other kinds of errors as “easily” as memory-related crashes?

    This turned out to be quite a long road.


  96. { "a" : "bc" }
    But you have to start somewhere.

    Find a corpus. I started with this. It’s valid JSON! Can we parse this?


  97. let jsonNetResult =
        try
            JsonConvert.DeserializeObject(str) |> ignore    // ← Test
            Success
        with
        | :? JsonReaderException as jre -> jre.Message |> Error
        | :? JsonSerializationException as jse -> jse.Message |> Error
        | :? System.FormatException as fe ->                 // ⬑ Special case error stuff
            if fe.Message.StartsWith("Invalid hex character") // hard coded in Json.NET
            then fe.Message |> Error
            else reraise()
    I had to write a short program to run the deserializer, which I’ll call the test harness.


  98. use proc = new Process()
    proc.StartInfo.FileName <- executablePath
    inputMethod.BeforeStart proc testCase.Data
    proc.StartInfo.UseShellExecute <- false
    proc.StartInfo.RedirectStandardOutput <- true
    proc.StartInfo.RedirectStandardError <- true
    proc.StartInfo.EnvironmentVariables.Add(SharedMemory.environmentVariableName, sharedMemoryName)
    let output = new System.Text.StringBuilder()
    let err = new System.Text.StringBuilder()
    proc.OutputDataReceived.Add(fun args -> output.Append(args.Data) |> ignore)
    proc.ErrorDataReceived.Add (fun args -> err.Append(args.Data) |> ignore)
    proc.Start() |> ignore
    inputMethod.AfterStart proc testCase.Data
    proc.BeginOutputReadLine()
    proc.BeginErrorReadLine()
    proc.WaitForExit()
    let exitCode = proc.ExitCode
    let crashed = exitCode = WinApi.ClrUnhandledExceptionCode
    ⃪ Set up
    ⃪ Read results
    ⃪ Important bit
    And another program to read the input data, execute the test harness executable, and then see whether it succeeded.

    Pretty simple, so far. There’s a lot of code here. Don’t worry about the details. Setup, execute, collect data. The code is on my GitHub and I gave you the link at the
    beginning of the slides if you want to look deeper.

    But my original sample input wasn’t very interesting.


  99. Then I stood on the shoulders of giants. Turns out lots of people (well, two) like to collect problematic JSON.

    Now I have about 200 good test cases. But I want hundreds of thousands.


  100. /// An ordered list of functions to use when starting with a single piece of
    /// example data and producing new examples to try
    let private allStrategies = [
        bitFlip 1
        bitFlip 2
        bitFlip 4
        byteFlip 1
        byteFlip 2
        byteFlip 4
        arith8
        arith16
        arith32
        interest8
        interest16
    ]
    So I wrote a bunch of ways to transform that input into new cases.

    This list is just copied from AFL


  101. let totalBits = bytes.Length * 8
    let testCases = seq {
        for bit = 0 to totalBits - flipBits do
            let newBytes = Array.copy bytes
            let firstByte = bit / 8
            let firstByteMask, secondByteMask = bitMasks(bit, flipBits)
            let newFirstByte = bytes.[firstByte] ^^^ firstByteMask
            newBytes.[firstByte] <- newFirstByte
            let secondByte = firstByte + 1
            if secondByteMask <> 0uy && secondByte < bytes.Length
            then
                let newSecondByte = bytes.[secondByte] ^^^ secondByteMask
                newBytes.[secondByte] <- newSecondByte
            yield newBytes
    }
    Fuzz one byte →
    ^^^ means xor

    And I translated the AFL fuzz C code into F#

    So now I have a bunch of test cases, but I need to understand them. If I have an input and I flip one bit, maybe that’s a valuable new test case, or more likely it’s totally
    useless. How do I know?
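
    To make the walking bit flip concrete, here is a small self-contained sketch. It is not the Fizil implementation: `flipBitsAt` and `walkingBitFlip` are names invented for this example, and it flips one bit at a time instead of using the two-mask `bitMasks` helper shown on the slide. Bits are numbered from the most significant bit of each byte, as AFL's C code does.

```fsharp
// Hypothetical helpers for illustration; not Fizil's API.

/// Copy `bytes` and flip `count` consecutive bits starting at bit index `start`.
let flipBitsAt (start: int) (count: int) (bytes: byte[]) : byte[] =
    let mutated = Array.copy bytes
    for bit in start .. start + count - 1 do
        let mask = 128uy >>> (bit % 8)            // bit 0 is the high bit of byte 0
        mutated.[bit / 8] <- mutated.[bit / 8] ^^^ mask
    mutated

/// One new test case per starting position, sliding a window of `flipBits` bits.
let walkingBitFlip (flipBits: int) (bytes: byte[]) : seq<byte[]> =
    seq { for start in 0 .. bytes.Length * 8 - flipBits -> flipBitsAt start flipBits bytes }
```

    For a 14-byte input such as the sample JSON, `walkingBitFlip 1` yields 112 candidates, so the number of cases grows quickly, which is why the next question is how to tell the valuable ones from the useless ones.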


  102. https://commons.wikimedia.org/wiki/File:CPT-Recursion-Factorial-Code.svg
    I need to trace all of the call stacks executed during the test.

    I’m looking for tests which produce new sequences of stacks.
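
    One simple way to decide whether a run produced a new sequence is to remember a signature for every trace seen so far. The sketch below assumes each run is reported as a list of integer trace IDs; `seenPaths`, `isNewPath`, and `interestingCases` are hypothetical names, and the real fuzzer may well summarize traces differently (for example, with a coverage map like AFL's).

```fsharp
open System.Collections.Generic

// Hypothetical helper: remembers every trace signature observed so far.
let seenPaths = HashSet<string>()

/// Returns true the first time a given sequence of trace IDs is observed.
let isNewPath (traceIds: int list) : bool =
    let signature = traceIds |> List.map string |> String.concat ","
    seenPaths.Add signature        // HashSet.Add returns false for duplicates

/// Keep only the inputs whose trace has not been seen before.
let interestingCases (runs: (string * int list) list) : string list =
    runs
    |> List.filter (fun (_, traceIds) -> isNewPath traceIds)
    |> List.map fst
```

    An input that only repeats a known sequence is dropped, while one that visits the same blocks in a different order is kept as a new test case.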


  103. private static void F(string arg)
    {
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
    }
    How do I know which call stacks happen during testing?

    Instrumentation. At the simplest level, we want to turn this:


  104. private static void F(string arg)
    {
    instrument.Trace(29875);
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
    }
    ← Random number
    into this.

    Which is to say, whenever we enter some block, inform an external observer what just happened.

    In order to inject this code, I have a couple of choices


  105. private static void F(string arg)
    {
    #if MANUAL_INSTRUMENTATION
    instrument.Trace(29875);
    #endif
    Console.WriteLine("f");
    Console.Error.WriteLine("Error!");
    Environment.Exit(1);
    }
    I could manually add the instrumentation, which is painful


  106. Or I could write a profiler. The .NET framework provides a profiling API, but it requires hosting the runtime and is annoying in other ways


  107. An easier solution is Mono.Cecil, which is sort of like .NET reflection except you can modify .NET assemblies without loading them into your process first.


  108. let stringify (ob: obj) : string =
    JsonConvert.SerializeObject(ob)
    I want to be able to work with any .NET executable, so I instrument binaries instead of source.

    So I can write some really simple F# code, like this. Normally the compiler transforms it into (click). That’s complicated, so don’t worry much about it. The important thing
    is I need to add a couple of instructions (click). This just tells the fuzzer what is happening inside the program as it runs


  109. let stringify (ob: obj) : string =
    JsonConvert.SerializeObject(ob)
    // Method: System.String\u0020Program::stringify(System.Object)
    .body stringify {
    arg_02_0 [generated]
    arg_07_0 [generated]
    nop()
    arg_02_0 = ldloc(ob)
    arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
    ret(arg_07_0)
    }


  110. let stringify (ob: obj) : string =
    JsonConvert.SerializeObject(ob)
    // Method: System.String\u0020Program::stringify(System.Object)
    .body stringify {
    arg_02_0 [generated]
    arg_07_0 [generated]
    nop()
    arg_02_0 = ldloc(ob)
    arg_07_0 = call(JsonConvert::SerializeObject, arg_02_0)
    ret(arg_07_0)
    }
    // Method: System.String\u0020Program::stringify(System.Object)
    .body stringify {
    arg_05_0 [generated]
    arg_0C_0 [generated]
    arg_11_0 [generated]
    arg_05_0 = ldc.i4(23831)
    call(Instrument::Trace, arg_05_0)
    nop()
    arg_0C_0 = ldloc(ob)
    arg_11_0 = call(JsonConvert::SerializeObject, arg_0C_0)
    ret(arg_11_0)
    }


  111. So what do we have so far? A ton of inputs, and lots of data about how the system under test behaves with them. Unfortunately, it doesn’t tell me very much. The only
    thing I was able to crash with this method was


  112. …Windows. Or maybe the Parallels memory manager. Not what I was after. But this was a rather difficult problem to work around. It’s hard to find bugs in an application
    when your OS is blue screening!


  113. http://www.json.org/
    I need a way to determine if Json.NET is parsing the JSON correctly. So I thought I should write a JSON validator to check its behavior. Fortunately, there’s a standard!

    “Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is
    in its current form, that’s it.” -Crockford


  114. https://tools.ietf.org/html/rfc4627
    So, naturally, there’s also another JSON standard


  115. http://www.ecma-international.org/ecma-262/5.1/#sec-15.12
    And another JSON standard


  116. http://www.ecma-international.org/publications/standards/Ecma-404.htm
    And another JSON standard


  117. https://tools.ietf.org/html/rfc7158
    And another JSON standard


  118. https://tools.ietf.org/html/rfc7159
    And another JSON standard. And no, they don’t all agree on everything, nor is there a single, “latest” version. Despite this multitude of standards, there are still edge
    cases intentionally delegated to the implementer — what we would call “undefined behavior” in C.


  119. https://github.com/nst/STJSON
    I was going to write my own validator, but…

    Nicolas Seriot wrote a validator called STJSON which attempts to synthesize these as much as possible.


  120. https://github.com/CraigStuntz/Fizil/blob/master/StJson/StJsonParser.fs
    Swift doesn’t readily compile for Windows, but if you squint hard enough it kind of looks like F#, so I ported the code and used it to validate Json.NET's behavior.
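
    The comparison itself can be sketched as a small differential check: run the same input through a reference validator and the parser under test, and report only the disagreements. Everything named here (`Verdict`, `Finding`, `classify`, `differential`, and the two toy parsers) is made up for illustration; the toy parsers are stand-ins and do not reproduce STJSON or Json.NET.

```fsharp
// Hypothetical types and helpers for a differential check; not Fizil's API.
type Verdict = Accepted | Rejected

type Finding =
    | Agree
    | StandardAcceptsParserRejects
    | StandardRejectsParserAccepts

let classify (reference: Verdict) (underTest: Verdict) : Finding =
    match reference, underTest with
    | Accepted, Rejected -> StandardAcceptsParserRejects
    | Rejected, Accepted -> StandardRejectsParserAccepts
    | _ -> Agree

/// Return only the inputs on which the two parsers disagree.
let differential
        (referenceParser: string -> Verdict)
        (parserUnderTest: string -> Verdict)
        (inputs: string list) : (string * Finding) list =
    inputs
    |> List.map (fun input -> input, classify (referenceParser input) (parserUnderTest input))
    |> List.filter (fun (_, finding) -> finding <> Agree)

// Toy stand-ins: the "strict" one rejects empty array elements, the "lenient" one accepts anything.
let toyStrict (input: string) = if input.Contains(",,") then Rejected else Accepted
let toyLenient (_: string) = Accepted

let findings = differential toyStrict toyLenient [ "[1,2]"; "[,,,]" ]
```

    With these stand-ins, only `[,,,]` is reported, classified as a case the standard rejects and the parser accepts. Each disagreement is a candidate test case worth a human look, whether or not anything crashed.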


  121. Standard Accepts, Json.NET Rejects
    Value:
    88888888888888888888888888888888888888888888888888
    88888888888888888888888888888888888888888888888888
    88888888888888888888888888888888888888888888888888
    88888888888888888888888888888888888888888888888888
    88888888888888888888888888888888888888888888888888
    Standard Says: No limit
    Json.NET: MaximumJavascriptIntegerCharacterLength = 380;
    Things Json.NET fails on that the standard accepts


  122. Standard Rejects, Json.NET Accepts
    Value: [,,,]
    Standard Says: A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false null true
    Json.NET: [null, null, null, null]
    Things Json.NET succeeds on that the standard rejects


  123. Implementation Details
    I have lots of interesting stories from implementing this code, but some of them get kind of low level, so I’ll share just a couple I think are of general interest. Please do
    feel free to pull my code or reach out to me if you want complete details!


  124. let private insertTraceInstruction(ilProcessor: ILProcessor, before: Instruction, state) =
        let compileTimeRandom = state.Random.Next(0, UInt16.MaxValue |> Convert.ToInt32)
        let ldArg = ilProcessor.Create(OpCodes.Ldc_I4, compileTimeRandom)
        let callTrace = ilProcessor.Create(OpCodes.Call, state.Trace)
        ilProcessor.InsertBefore(before, ldArg)
        ilProcessor.InsertAfter (ldArg, callTrace)
    This margin is too narrow to contain a try/finally example, so see:
    https://goo.gl/W4y7JH
    Inserting the IL instructions I needed was fairly easy. Here is the important bit of the code which does it. How did I learn how to write this? I instrumented a small program
    “manually” by writing the instrumentation code myself, and then decompiled that program to figure out which IL instructions I needed. Inserting them with Mono.Cecil is
    just a few lines of code.

    try/finally is much, much harder. I won’t even try to walk you through it here. Look at the GitHub repo if you want to see how it’s done.


  125. Strong naming was a consistent pain for me.

    I’m altering the binaries of assemblies, and part of the point of strong naming is to stop you from doing just that, so naturally if the assembly is strongly named it can’t be
    loaded when I’m finished.


  126. let private removeStrongName (assemblyDefinition : AssemblyDefinition) =
        let name = assemblyDefinition.Name;
        name.HasPublicKey <- false;
        name.PublicKey <- Array.empty;
        assemblyDefinition.Modules |> Seq.iter (
            fun moduleDefinition ->
                moduleDefinition.Attributes <-
                    moduleDefinition.Attributes &&& ~~~ModuleAttributes.StrongNameSigned)
        let aptca = assemblyDefinition.CustomAttributes.FirstOrDefault(
                        fun attr -> attr.AttributeType.FullName
                                    = typeof<System.Security.AllowPartiallyTrustedCallersAttribute>.FullName)
        assemblyDefinition.CustomAttributes.Remove aptca |> ignore

    // Separate fragment: clear the token on references to the instrumented assemblies
    assembly.MainModule.AssemblyReferences
    |> Seq.filter (fun reference -> Set.contains reference.Name assembliesToInstrument)
    |> Seq.iter (fun reference ->
        reference.PublicKeyToken <- null
    )
    So I need to remove the strong name from any assembly I fuzz, but I also need to remove the PublicKeyToken from any other assembly which references it. Doing this in
    Mono.Cecil is not well-documented, and after quite a bit of time spent in GitHub issues and trial and error I figured out that it takes 5 distinct steps to do this.


  127. In / Out of Process
    In order to stop Windows/Parallels from crashing I decided to try and fuzz the system under test in process with the fuzzer itself to reduce the number of processes I was
    creating. This worked, but had a surprising result: The number of distinct paths through the code I found during testing changed. Why? Isn’t it running the same code
    with the same inputs? Yes, absolutely! But when you run a function, the .NET framework doesn’t guarantee that the same instructions will be executed each time. Let’s
    look at why.


  128. So my test harness is going to execute some code, and I’ve instrumented that code, which means I’ll get a series of trace events as the code runs. I might expect these
    events to come in the same order for executing a single function.


  129. But that’s not what happens. This is an actual trace of the method I just showed you, and the events don’t arrive in the same order every time. Why?


  130. –ECMA-335, Common Language Infrastructure (CLI),
    Partition I
    “If marked BeforeFieldInit then the type’s initializer
    method is executed at, or sometime before, first access to
    any static field defined for that type.”
    The .NET CLR guarantees a type initializer will be invoked before the type is used, but it doesn’t specify exactly when.


  131. f(x) = f(x)
    time(f(x)) != time(f(x))

    For many methods, the CLR guarantees they’ll return the same result for the same argument, but does not guarantee the same instructions will be executed in the same
    order to get that result.

    What does this tell us about QA? What does this tell us about security?

    (Any thoughts on that?)


  132. Unicode
    Original JSON
    { "a" : "bc" }
    ASCII Bytes
    7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D
    UTF-8 with Byte Order Mark
    EF BB BF 7B 20 22 61 22 20 3A 20 22 62 63 22 20 7D
    UTF-16 BE with BOM
    FE FF 00 7B 00 20 00 22 00 61 00 22 00 20 00 3A 00 20 00 22
    00 62 00 63 00 22 00 20 00 7D
    What does it mean to fuzz JSON? Which of these byte arrays should I fuzz?

    Web uses UTF-8. Browsers use WTF-8. Windows, C#, and JSON.NET like UTF-16.

    We must choose an encoding for our test corpus and then choose whether to convert or just fuzz as-is
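
    The three byte arrays on the slide can be reproduced with the standard `System.Text.Encoding` classes. This is only a sketch for checking the bytes: the helper names `hex` and `withBom` are assumptions for this example, and the sample string is the `{ "a" : "bc" }` document from slide 96.

```fsharp
open System
open System.Text

let json = "{ \"a\" : \"bc\" }"

/// Render bytes as space-separated, two-digit uppercase hex.
let hex (bytes: byte[]) : string =
    bytes |> Array.map (fun b -> b.ToString("X2")) |> String.concat " "

/// Prefix the encoded text with the encoding's byte order mark (its preamble).
let withBom (encoding: Encoding) (text: string) : byte[] =
    Array.append (encoding.GetPreamble()) (encoding.GetBytes text)

let asciiBytes = Encoding.ASCII.GetBytes json
let utf8WithBom = withBom Encoding.UTF8 json
let utf16BeWithBom = withBom Encoding.BigEndianUnicode json

printfn "ASCII:     %s" (hex asciiBytes)
printfn "UTF-8+BOM: %s" (hex utf8WithBom)
printfn "UTF-16 BE: %s" (hex utf16BeWithBom)
```

    The same 14-character document becomes 14, 17, or 30 bytes depending on the choice, so a single flipped bit can land in a BOM or a padding zero byte in one encoding and in a meaningful character in another.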


  133. Thank You!
    Presentation Review: Cassandra Faris, Chad James, Damian Synadinos, Doug Mair, Tommy Graves
    Source Code Inspiration: Michał Zalewski, Nicolas Seriot, Everyone Who Works on dnSpy & Mono.Cecil
    Finally…


  134. Craig Stuntz
    @craigstuntz
    http://www.craigstuntz.com
    http://www.meetup.com/Papers-We-Love-Columbus/
    https://speakerdeck.com/craigstuntz
