
"Synthesizing Input Grammars": A Replication Study


PLDI 2022

Rahul Gopinath

June 16, 2022
Transcript

  1. “Synthesizing Input Grammars”: A Replication Study PLDI 2022 • San Diego

     Bachir Bendrissou • Rahul Gopinath • Andreas Zeller
  2. Synthesizing Program Input Grammars PLDI 2017

     [Screenshot: first page of “Synthesizing Program Input Grammars”, Osbert Bastani (Stanford University), Rahul Sharma (Microsoft Research), Alex Aiken (Stanford University), Percy Liang (Stanford University), PLDI 2017.]

     Abstract: We present an algorithm for synthesizing a context-free grammar encoding the language of valid program inputs from a set of input examples and blackbox access to the program. Our algorithm addresses shortcomings of existing grammar inference algorithms, which both severely overgeneralize and are prohibitively slow. Our implementation, GLADE, leverages the grammar synthesized by our algorithm to fuzz test programs with structured inputs. We show that GLADE substantially increases the incremental coverage on valid inputs compared to two baseline fuzzers.

     Introduction (excerpt): Documentation of program input formats, if available in machine-readable form, can significantly aid many software analysis tools, but such documentation is often poor or not machine-readable. The paper studies automatically synthesizing grammars representing program input languages, primarily for use with grammar-based fuzzers, and considers the blackbox setting: only the ability to execute the program on a given input and observe its output is required, so the algorithm’s performance depends only on the language of valid inputs, not on implementation details. Existing language inference algorithms proved unsuitable: L-Star and RPNI, the most widely studied, were unable to learn or approximate even simple input languages such as XML, do not scale even to small sets of seed inputs, and, surprisingly, perform poorly even on the class of regular languages they target.
  3. Why Synthesize Grammars? Input grammars have several usages, notably in

     test generation: • Random input test generators (fuzzers) produce lots of invalid inputs • But only valid inputs trigger functionality beyond the input parser • A grammar (or any formal language) helps produce valid inputs • Specifying grammars by hand is expensive • Extracting grammars from inputs or programs makes fuzzing efficient
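To illustrate the point above — that a grammar makes it easy to produce valid inputs — here is a minimal grammar-based producer in Python. The toy arithmetic grammar and all names are ours, not from the talk or from any of the tools discussed; it is a sketch of the general technique only.

```python
import random

# A hypothetical toy grammar: nonterminals (in angle brackets) map to
# lists of alternative expansions; everything else is a terminal.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>"], ["(", "<expr>", ")"]],
    "<digit>": [[d] for d in "0123456789"],
}

def produce(symbol: str, depth: int = 0) -> str:
    """Randomly expand `symbol` into a string of the grammar's language."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth > 8:
        # Crude depth bound: force the shortest alternative to terminate.
        alternatives = [min(alternatives, key=len)]
    expansion = random.choice(alternatives)
    return "".join(produce(s, depth + 1) for s in expansion)

print(produce("<expr>"))  # prints a random valid expression, e.g. "(3+7)+2"
```

Every produced string is syntactically valid by construction, which is exactly what a purely random fuzzer cannot guarantee.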
  4. Mining Input Grammars from Programs • Track input bytes through

     program execution • Bytes that are processed the same way form tokens and structures • Grammar structure reflects program structure • Highly accurate

     [Screenshot: first page of “Mining Input Grammars from Dynamic Taints”, Matthias Höschele and Andreas Zeller (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany), ASE 2016.]

     Abstract: Knowing which part of a program processes which parts of an input can reveal the structure of the input as well as the structure of the program. In a URL http://www.example.com/path/, for instance, the protocol http, the host www.example.com, and the path path would be handled by different functions and stored in different variables. Given a set of sample inputs, we use dynamic tainting to trace the data flow of each input character, and aggregate those input fragments that would be handled by the same function into lexical and syntactical entities. The result is a context-free grammar that reflects valid input structure. In its evaluation, our AUTOGRAM prototype automatically produced readable and structurally accurate grammars for inputs like URLs, spreadsheets or configuration files. The resulting grammars not only allow simple reverse engineering of input formats, but can also directly serve as input for test generators.

     Figure 1 (sample URL inputs):
     http://user:[email protected]:80/command?foo=bar&lorem=ipsum#fragment
     http://www.guardian.co.uk/sports/worldcup#results
     ftp://bob:[email protected]/oss/debian7.iso

     Figure 2 (grammar derived by AUTOGRAM from java.net.URL processing the inputs in Figure 1; optional parts are enclosed in brackets [...], regular-expression shorthands in /.../):
     URL ::= PROTOCOL ’://’ AUTHORITY PATH [’?’ QUERY] [’#’ REF]
     AUTHORITY ::= [USERINFO ’@’] HOST [’:’ PORT]
     PROTOCOL ::= ’http’ | ’ftp’
     USERINFO ::= /[a-z]+/ ’:’ /[a-z]+/
     HOST ::= /[a-z.]+/
     PORT ::= ’80’
     PATH ::= /\/[a-z0-9.]*/
     QUERY ::= ’foo=bar&lorem=ipsum’
     REF ::= /[a-z]+/

     [The introduction illustrates the key idea: dynamically observe how input is processed. The program is instrumented with dynamic taints, tagging each piece of data with the input fragment it comes from; if some function processes only a part of the input, or a part (or a value derived from it) is stored in a variable, that part becomes a syntactical entity. In the example, the method java.net.URL.set() eventually stores the parsed URL components.]
  5. [Screenshot: cropped abstract and Figures 1–2 of the AUTOGRAM paper (sample URL inputs and the grammar derived from java.net.URL), as shown on slide 4.]
  6. [Screenshot: abstract and Figures 1–2 of the AUTOGRAM paper, as shown on slide 4.]

     Evaluating Grammars (Learned) grammars should be accurate: • Precision: percentage of produced inputs that are valid (Overly general grammar = low precision) • Recall: percentage of given valid inputs that are accepted (Overly specific grammar = low recall) • F1 score: harmonic mean of precision and recall • Validity can be established by a “golden grammar” or an accepting program
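The precision, recall, and F1 definitions above can be written out as a small helper. This is an illustrative sketch of the metrics, not code from any of the papers; the function name and counting scheme are our own.

```python
def precision_recall_f1(produced_valid: int, produced_total: int,
                        accepted_valid: int, given_valid: int):
    """Grammar accuracy as defined on the slide.

    precision: share of inputs produced from the learned grammar that
               the golden grammar / accepting program deems valid
    recall:    share of given valid inputs accepted by the learned grammar
    f1:        harmonic mean of precision and recall
    """
    precision = produced_valid / produced_total
    recall = accepted_valid / given_valid
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# An overly general grammar: it accepts every given valid input
# (recall = 1.0), but only 1 in 4 of its productions is valid.
p, r, f1 = precision_recall_f1(25, 100, 50, 50)
print(p, r, round(f1, 2))  # 0.25 1.0 0.4
```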
  7. Mining Input Grammars from Samples • Does not need a

    program or instrumentation! • Gold (1978): NP-hard problem, even for regular languages and with negative examples • Clark (2010): Best algorithm using a minimally adequate teacher (= membership queries – using a given acceptor program) • Angluin and Kharitonov (1995): Even with membership queries, a polynomial-time algorithm cannot exist (unless RSA is broken)
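A “minimally adequate teacher” answering membership queries, in the blackbox sense used here, can be as simple as wrapping an acceptor program. A minimal Python sketch, using the standard-library JSON parser as a stand-in acceptor (the choice of JSON is our illustration, not from the talk):

```python
import json

def oracle(candidate: str) -> bool:
    """Membership query: does the acceptor program consider this valid?
    Only the Boolean verdict is revealed -- no code, no error details."""
    try:
        json.loads(candidate)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

print(oracle('{"a": [1, 2]}'))  # True
print(oracle('{"a": [1, 2'))    # False
```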
  8. Synthesizing Program Input Grammars PLDI 2017

     [Screenshot: first page of “Synthesizing Program Input Grammars” (Bastani et al., PLDI 2017), as shown on slide 2.] • An algorithm for synthesizing program input grammars from seed inputs and membership queries • Learns regular properties such as repetitions and alternations, then recursive productions • F1 scores of 0.90 and higher • Implemented in the GLADE fuzzer, increasing coverage by up to 6x over AFL
  9. GLADE Replication Package GitHub 2017 • GitHub 2019 (custom serialization

    format – no access to grammars – cannot add new subjects)
  10. Reimplementing GLADE • We reimplemented the GLADE algorithm from the

    paper • About six person-months of work • Obtained extremely large grammars that essentially enumerated inputs • OK for a fuzzer; bad for a grammar miner • Numbers not discussed in 2017 paper
  11. Reimplementing GLADE • We computed grammar accuracy • Trouble even learning simple

     grammars • Obtained much lower F1 scores than reported in 2017 paper
  12. What next? • Shared results and code with GLADE authors

    • Pointed us to the replication package • Did not want to discuss our code • We assumed we must be wrong
  13. Kulkarni et al.: Learning Highly Recursive Input Grammars

     [Screenshot: first page of “Learning Highly Recursive Input Grammars”, Neil Kulkarni*, Caroline Lemieux*, Koushik Sen (University of California, Berkeley; *equal contribution).]

     Abstract: This paper presents ARVADA, an algorithm for learning context-free grammars from a set of positive examples and a Boolean-valued oracle. ARVADA learns a context-free grammar by building parse trees from the positive examples. Starting from initially flat trees, ARVADA adds structure to these trees with a key operation: it bubbles sequences of sibling nodes in the trees into a new node, adding a layer of indirection to the tree. Bubbling operations enable recursive generalization in the learned grammar. We evaluate ARVADA against GLADE and find it achieves on average increases of 4.98× in recall and 3.13× in F1 score, while incurring only a 1.27× slowdown and requiring only 0.87× as many calls to the oracle. ARVADA has a particularly marked improvement over GLADE on grammars with highly recursive structure, like those of programming languages.

     [The introduction details the two key operations. The set of trees is initialized with one “flat” tree per example: a single root node whose children are the characters of the input string. Bubbling replaces a sequence of sibling nodes with a new indirection node that has the bubbled siblings as children. Merging the labels of two distinct nodes adds new strings to the grammar’s language, since strings derivable from subtrees with the same label can be swapped; it introduces recursion when a parent node is merged with one of its descendants. A bubble whose new node cannot merge with any existing node is rejected; the operations repeat until no remaining bubbled sequence enables a valid merge. ARVADA is implemented in 2.2k LoC of Python and distributed as open source: https://github.com/neil-kulkarni/arvada]

     • ARVADA tool learns grammars from input strings • Small, structured grammars • Available as open source • Compares to GLADE implementation • Confirmed our earlier observations (large grammars, low F1 score) ESEC/FSE 2021
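The flat-tree initialization and the bubbling operation described above can be sketched in a few lines. This is our own minimal data model for illustration, not ARVADA’s actual code; the function names are hypothetical.

```python
# A tree is a (label, children) pair; leaves are (character, []).

def flat_tree(example: str):
    """Initial tree: a root whose children are the input's characters."""
    return ("t0", [(c, []) for c in example])

def bubble(tree, start, end, label):
    """Replace children[start:end] with one new node labelled `label`
    whose children are the bubbled sibling nodes (a layer of indirection)."""
    root, children = tree
    bubbled = (label, children[start:end])
    return (root, children[:start] + [bubbled] + children[end:])

def yield_string(tree):
    """The string a tree derives: concatenation of its leaf labels."""
    label, children = tree
    return label if not children else "".join(yield_string(c) for c in children)

t = flat_tree("(a)")
t = bubble(t, 0, 3, "t1")        # bubble the whole "(a)" into one node
assert yield_string(t) == "(a)"  # bubbling never changes the derived string
```

Whether a bubble survives then depends on merging: if the new node’s label can soundly merge with another node’s label (checked via the oracle), the structure stays; otherwise the bubble is undone.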
  14. “Synthesizing Input Grammars”: A Replication Study Bachir Bendrissou CISPA Helmholtz

     Center For Information Security Germany [email protected] Rahul Gopinath CISPA Helmholtz Center For Information Security Germany [email protected] Andreas Zeller CISPA Helmholtz Center For Information Security Germany [email protected]

     Abstract: When producing test inputs for a program, test generators (“fuzzers”) can greatly profit from grammars that formally describe the language of expected inputs. In recent years, researchers thus have studied means to recover input grammars from programs and their executions. The GLADE algorithm by Bastani et al., published at PLDI 2017, was the first blackbox approach to claim context-free approximation of input specification for non-trivial languages such as XML, Lisp, URLs, and more. Prompted by recent observations that the GLADE algorithm may show lower performance than reported in the original paper, we have reimplemented the GLADE algorithm from scratch. Our evaluation confirms that the effectiveness score (F1) reported in the GLADE paper is overly optimistic, and in some cases, based on the wrong language. […]

     Introduction (excerpt): Generating test inputs for a program (“fuzzing”) is much more effective if the fuzzer knows the input language of the program under test—that is, the set of valid inputs that actually leads to deeper functionality in the program. Input languages are typically characterized by context-free grammars, and the recent interest in fuzzing thus has fueled research in recovering input grammars from existing programs. The GLADE algorithm by Bastani et al., published in “Synthesizing Input Grammars” at PLDI 2017 [6], automatically approximates an input grammar from a given program. In contrast to other approaches, GLADE does not make use of program code to infer input properties. Instead, it relies on feedback from the program whether a given input is valid or not, and synthesizes a multitude of trial inputs to infer the input grammar. GLADE claims substantial improvement over […]

     Thank You!
  15. Takeaways • Replicating work is important! Yet, replication is also

     a fruitless endeavor: • Either you confirm the original results – then there’s nothing new • Or you cannot reproduce them – then it’s your job to prove you are right • How can we encourage more replication studies? • What do authors have to supply (and support) to facilitate replication? • How can we encourage authors to investigate and report limitations?
  16. Input Invariants Dominic Steinhöfel CISPA Helmholtz Center for Information Security

    Saarbrücken, Germany [email protected] Andreas Zeller CISPA Helmholtz Center for Information Security Saarbrücken, Germany [email protected] ABSTRACT Grammar-based fuzzers are highly e￿cient in producing syntac- tically valid system inputs. However, as context-free grammars cannot capture semantic input features, generated inputs will often be rejected as semantically invalid by a target program. We pro- pose ISLa, a declarative speci￿cation language for context-sensitive properties of structured system inputs based on context-free gram- mars. With ISLa, it is possible to specify input constraints like “a variable has to be de￿ned before it is used,” “the length of the ‘￿le name’ block in a TAR ￿le has to equal 100 bytes,” or “the number of columns in all CSV rows must be identical.” ISLa constraints can be used for parsing or validation (“Does an input meet the expected constraint?”) as well as for fuzzing, since we provide both an evaluation and input generation component. ISLa embeds SMT formulas as an island language, leveraging the power of modern solvers like Z3 to solve atomic semantic constraints. On top, it adds universal and existential quanti￿ers over the struc- ture of derivation trees from a grammar, and structural (“X occurs before Y”) and semantic (“X is the checksum of Y”) predicates. ISLa constraints can be speci￿ed manually, but also mined from existing input samples. For this, our ISLearn prototype uses a cat- alog of common patterns (such as the ones above), instantiates these over input elements, and retains those candidates that hold for the inputs observed and whose instantiations are fully accepted by input-processing programs. The resulting constraints can then again be used for fuzzing and parsing. In our evaluation, we show that a few ISLa constraints su￿ce to produce inputs that are 100% semantically valid while still maintain- ing input diversity. 
Furthermore, we con￿rm that ISLearn mines use- ful constraints about de￿nition-use relationships and (implications between) the existence of “magic constants”, e.g., for programming languages and network packets. CCS CONCEPTS • Software and its engineering ! Software testing and debug- ging; Speci￿cation languages; Constraint and logic languages; Syntax; Semantics; Parsers; Software reverse engineering; Documen- tation; • Theory of computation ! Grammars and context- free languages; Formalisms. 1 INTRODUCTION Automated software testing with random inputs (fuzzing) [19] is an e￿ective technique for ￿nding bugs in programs. Pure random inputs can quickly discover errors in input processing. Yet, if a program expects complex structured inputs (e.g., C programs, JSON expressions, or binary formats), the chances of randomly produc- ing valid inputs that are accepted by the parser and reach deeper functionality are low. Language-based fuzzers [8, 12, 13] overcome this limitation by generating inputs from a speci￿cation of a program’s expected input language, frequently expressed as a Context-Free Grammar (CFG). This considerably increases the chance of producing an input passing the program’s parsing stage and reaching its core logic. Yet, while being great for parsing, CFGs are often too coarse for producing inputs. Consider, e.g., the language of XML documents (without document type). This language is not context free.1 Still, it can be approximated by a CFG. Fig. 1 shows an excerpt of a CFG for XML. When we used a coverage-based fuzzer to produce 10,000 strings from this grammar, exactly one produced document (<O L= cmV > õ! B7</O>) contained a matching tag pair. This result is typical for language-based fuzzers used with a language speci￿cation designed for parsing which therefore is more permissive than a language speci￿cation for producing would have to be. This is unfortunate, as hundreds of language speci￿cations for parsing exist. 
To allow for precise production, we need to enrich the grammar with more information, or switch to a different formalism. However, existing solutions all have their drawbacks. Using general purpose code to produce inputs, or enriching grammars with such code, is closely tied to an implementation language, and does not allow for parsing and recombining inputs, which is a common feature of modern fuzzers. Unrestricted grammars can in principle specify any computable input property, but we see them as “Turing tar-pits,” in which “everything is possible, but nothing of interest is easy” [22]—just try, for instance, to express that some number is the sum of two input elements. Finally, one could also replace CFGs by a different formalism; but this would mean to renounce a concept that many developers know (e.g., from the ANTLR parser generator or RFCs). In this paper, we bring forward a different solution by proposing a (programming and target) language-independent, declarative specification language named ISLa (Input Specification Language)

Alternatives for Mining Input Languages

Mining Input Grammars from Dynamic Control Flow
Rahul Gopinath [email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany Björn Mathis [email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany Andreas Zeller [email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany

ABSTRACT
One of the key properties of a program is its input specification. Having a formal input specification can be critical in fields such as vulnerability analysis, reverse engineering, software testing, clone detection, or refactoring. Unfortunately, accurate input specifications for typical programs are often unavailable or out of date. In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program.
We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all stack based recursive descent input parsers, including parser combinators, and works entirely without program specific heuristics. Our Mimid prototype produced accurate and readable grammars for a variety of evaluation subjects, including complex languages such as JSON, TinyC, and JavaScript.

CCS CONCEPTS
• Software and its engineering → Software reverse engineering; Dynamic analysis; • Theory of computation → Grammars and context-free languages.

KEYWORDS
context-free grammar, dynamic analysis, fuzzing, dataflow, control-flow

ACM Reference Format:
Rahul Gopinath, Björn Mathis, and Andreas Zeller. 2020. Mining Input Grammars from Dynamic Control Flow. In Proceedings of The 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). ACM, New York, NY, USA, 12 pages.

1 INTRODUCTION
One of the key properties of a program is its input specification. Having a formal input specification is important in diverse fields such as reverse engineering [18], program refactoring [29], and program comprehension [23, 44].
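The core observation (which parser location reads which input character) can be sketched in a few lines of Python. This is our own toy illustration, far simpler than Mimid's actual control-flow-based algorithm:

```python
import inspect

ACCESS_LOG = []  # (input index, name of the parser function accessing it)

class TrackedInput:
    """Wrap the input string and record which parser function reads
    which character: a toy version of observing character accesses at
    different locations of the input parser."""
    def __init__(self, s):
        self.s = s
    def __getitem__(self, i):
        caller = inspect.stack()[1].function  # name of the accessing function
        ACCESS_LOG.append((i, caller))
        return self.s[i]
    def __len__(self):
        return len(self.s)

# A tiny recursive-descent parser for digit lists such as "1,2,3".
def parse_list(inp, i=0):
    i = parse_digit(inp, i)
    while i < len(inp) and inp[i] == ",":
        i = parse_digit(inp, i + 1)
    return i

def parse_digit(inp, i):
    assert inp[i].isdigit()
    return i + 1

parse_list(TrackedInput("1,2,3"))

# Grouping accessed indices by the accessing function recovers the
# input's structure: digits read by parse_digit, commas by parse_list.
by_func = {}
for idx, fn in ACCESS_LOG:
    by_func.setdefault(fn, set()).add(idx)
print(by_func)
```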
To generate complex system inputs for testing, a specification for the input language is practically

⟨START⟩ ::= ⟨json_raw⟩
⟨json_raw⟩ ::= ‘"’ ⟨json_string′⟩ | ‘[’ ⟨json_list′⟩ | ‘{’ ⟨json_dict′⟩ | ⟨json_number′⟩ | ‘true’ | ‘false’ | ‘null’
⟨json_number′⟩ ::= ⟨json_number⟩+ | ⟨json_number⟩+ ‘e’ ⟨json_number⟩+
⟨json_number⟩ ::= ‘+’ | ‘-’ | ‘.’ | [0-9] | ‘E’ | ‘e’
⟨json_string′⟩ ::= ⟨json_string⟩* ‘"’
⟨json_list′⟩ ::= ‘]’ | ⟨json_raw⟩ (‘,’ ⟨json_raw⟩)* ‘]’ | (‘,’ ⟨json_raw⟩)+ (‘,’ ⟨json_raw⟩)* ‘]’
⟨json_dict′⟩ ::= ‘}’ | (‘"’ ⟨json_string′⟩ ‘:’ ⟨json_raw⟩ ‘,’)* ‘"’ ⟨json_string′⟩ ‘:’ ⟨json_raw⟩ ‘}’
⟨json_string⟩ ::= ‘ ’ | ‘!’ | ‘#’ | ‘$’ | ‘%’ | ‘&’ | ‘’’ | ‘*’ | ‘+’ | ‘-’ | ‘,’ | ‘.’ | ‘/’ | ‘:’ | ‘;’ | ‘<’ | ‘=’ | ‘>’ | ‘?’ | ‘@’ | ‘[’ | ‘]’ | ‘^’ | ‘_’ | ‘`’ | ‘{’ | ‘|’ | ‘}’ | ‘~’ | [A-Za-z0-9] | ‘\’ ⟨decode_escape⟩
⟨decode_escape⟩ ::= ‘"’ | ‘/’ | ‘b’ | ‘f’ | ‘n’ | ‘r’ | ‘t’

Figure 1: JSON grammar extracted from microjson.py.

While researchers have tried to tackle the problem of grammar recovery using black-box approaches [14, 48], the seminal paper by Angluin and Kharitonov [11] shows that a pure black-box approach is doomed to failure, as there cannot be a polynomial time algorithm in terms of the number of queries needed for recovering a context-free grammar from membership queries alone. Hence, only white-box approaches that take program semantics into account can obtain an accurate input specification. The first white-box approach to extract input structures from programs is the work by Lin et al. [39, 40], which recovers parse trees from inputs using a combination of static and dynamic analysis. However, Lin et al. stop at recovering the parse trees with limited labeling, and the recovery of a grammar from the parse trees is non-trivial (as the authors recognize in the paper, and as

We are hiring!
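A mined grammar like the one in Figure 1 can be assessed by producing strings from it and asking the real parser whether it accepts them. The sketch below uses a hand-written toy subset of a JSON grammar (our own simplification, not the actual mined rules) and measures precision against Python's built-in json parser:

```python
import json
import random

# Hypothetical, hand-simplified subset of a JSON grammar (numbers,
# flat lists, and literals only), for illustration.
JSON_GRAMMAR = {
    "<start>": [["<raw>"]],
    "<raw>": [["<number>"], ["[", "<list>"], ["true"], ["false"], ["null"]],
    "<number>": [["<digit>"], ["<digit>", "<digit>"]],
    "<digit>": [[c] for c in "0123456789"],
    "<list>": [["]"], ["<raw>", "]"], ["<raw>", ",", "<raw>", "]"]],
}

def produce(grammar, symbol="<start>", rng=random):
    """Randomly derive a string from the grammar."""
    if symbol not in grammar:
        return symbol
    return "".join(produce(grammar, s, rng)
                   for s in rng.choice(grammar[symbol]))

def precision(grammar, trials=500, seed=1):
    """Fraction of produced strings accepted by the real parser.
    A grammar that overgeneralizes (here: numbers with leading
    zeros, such as '07') scores below 1.0."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        try:
            json.loads(produce(grammar, rng=rng))
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / trials

print(precision(JSON_GRAMMAR))
```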
Beyond fuzzing, a grammar synthesis algorithm could be used to reverse engineer input formats [29]; in particular, network protocol message formats can help security analysts discover vulnerabilities in network programs [8, 35, 36, 66]. Synthesized grammars could also be used to whitelist program inputs, thereby preventing exploits [49, 50, 58]. Approaches to synthesizing program input grammars typically examine executions of the program, and then generalize these observations to a representation of valid inputs. These approaches can be either whitebox or blackbox. Whitebox approaches assume that the program code is available for analysis and instrumentation, for example, using dynamic taint analysis [29]. Such an approach is difficult when only the program binaries are available or when parts of the code (e.g., libraries) are missing. Furthermore, these techniques often require program-specific configuration or tuning, and may be affected by the structure of the code. We consider the blackbox setting, where we only require the ability to execute the program on a given input and observe its corresponding output. Since the algorithm does not examine the program’s code, its performance depends only on the language of valid inputs, and not on implementation details. A number of existing language inference algorithms can be adapted to this setting [14]. However, we found them to be unsuitable for synthesizing program input grammars. In particular, L-Star [3] and RPNI [44], the most widely studied algorithms [6, 12, 13, 19, 62], were unable to learn or approximate even simple input languages such as XML, and furthermore do not scale even to small sets of seed inputs. Surprisingly, we found that L-Star and RPNI perform poorly even on the class of regular languages they target.
The problem with these algorithms is that despite having theoretical guarantees, they depend on assumptions that do

• An algorithm for synthesizing program input grammars from seed inputs and membership queries
• Learns regular properties such as repetitions and alternations, then recursive productions
• F1 scores of 0.90 and higher
• Implemented in the GLADE fuzzer, increasing coverage by up to 6x over AFL

GLADE Grammars (This is the only example in the paper)

Reimplementing GLADE
• We reimplemented the GLADE algorithm from the paper
• About six person-months of work
• Obtained extremely large grammars that essentially enumerated inputs
• OK for a fuzzer; bad for a grammar miner
• Obtained much lower accuracy scores than reported in the 2017 paper

Takeaways
• Replicating work is important! Yet, replication is also a fruitless endeavor:
• Either you confirm the original results – then there’s nothing new
• Or you cannot reproduce them – then it’s your job to prove you are right
• How can we encourage more replication studies?
• What do authors have to supply (and support) to facilitate replication?
• How can we encourage authors to investigate and report limitations?
• To mine input languages, consider white-box miners (Mimid, AUTOGRAM/Fuzzingbook), ISLa, or ARVADA
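For context, the membership-query idea at the heart of GLADE can be sketched in a deliberately tiny form. This is our own simplification for illustration (a single repetition check against a toy oracle), not the actual GLADE decomposition and generalization phases:

```python
def oracle(s):
    """Stand-in for running the target program as a black box:
    accept exactly the strings with balanced parentheses."""
    depth = 0
    for c in s:
        depth += {"(": 1, ")": -1}.get(c, 0)
        if depth < 0:
            return False
    return depth == 0

def repeatable_substrings(seed, oracle, checks=3):
    """Propose repetition generalizations of a seed input, keeping a
    candidate substring only if repeating it 2..checks+1 times still
    yields strings the oracle accepts (membership queries)."""
    found = []
    for i in range(len(seed)):
        for j in range(i + 1, len(seed) + 1):
            sub = seed[i:j]
            if all(oracle(seed[:i] + sub * k + seed[j:])
                   for k in range(2, checks + 2)):
                found.append((i, j, sub))
    return found

for i, j, sub in repeatable_substrings("(1)", oracle):
    print(f"{sub!r} at {i}:{j} may be a repetition")
```

On the seed "(1)", the sketch keeps "(1)" (the whole group repeats) and "1" (the content repeats), and correctly rejects unbalanced candidates such as "(" or "1)". Note how accuracy hinges entirely on the oracle: an overly permissive oracle yields an overgeneralized grammar, one source of the accuracy gaps discussed above.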