Saarbrücken, Germany
[email protected] Andreas Zeller CISPA Helmholtz Center for Information Security Saarbrücken, Germany
[email protected]

ABSTRACT
Grammar-based fuzzers are highly efficient in producing syntactically valid system inputs. However, as context-free grammars cannot capture semantic input features, generated inputs will often be rejected as semantically invalid by a target program. We propose ISLa, a declarative specification language for context-sensitive properties of structured system inputs based on context-free grammars. With ISLa, it is possible to specify input constraints like “a variable has to be defined before it is used,” “the length of the ‘file name’ block in a TAR file has to equal 100 bytes,” or “the number of columns in all CSV rows must be identical.” ISLa constraints can be used for parsing or validation (“Does an input meet the expected constraint?”) as well as for fuzzing, since we provide both an evaluation and input generation component.

ISLa embeds SMT formulas as an island language, leveraging the power of modern solvers like Z3 to solve atomic semantic constraints. On top, it adds universal and existential quantifiers over the structure of derivation trees from a grammar, and structural (“X occurs before Y”) and semantic (“X is the checksum of Y”) predicates.

ISLa constraints can be specified manually, but also mined from existing input samples. For this, our ISLearn prototype uses a catalog of common patterns (such as the ones above), instantiates these over input elements, and retains those candidates that hold for the inputs observed and whose instantiations are fully accepted by input-processing programs. The resulting constraints can then again be used for fuzzing and parsing.

In our evaluation, we show that a few ISLa constraints suffice to produce inputs that are 100% semantically valid while still maintaining input diversity. Furthermore, we confirm that ISLearn mines useful constraints about definition-use relationships and (implications between) the existence of “magic constants”, e.g., for programming languages and network packets.
CCS CONCEPTS
• Software and its engineering → Software testing and debugging; Specification languages; Constraint and logic languages; Syntax; Semantics; Parsers; Software reverse engineering; Documentation; • Theory of computation → Grammars and context-free languages; Formalisms.

1 INTRODUCTION
Automated software testing with random inputs (fuzzing) [19] is an effective technique for finding bugs in programs. Pure random inputs can quickly discover errors in input processing. Yet, if a program expects complex structured inputs (e.g., C programs, JSON expressions, or binary formats), the chances of randomly producing valid inputs that are accepted by the parser and reach deeper functionality are low.

Language-based fuzzers [8, 12, 13] overcome this limitation by generating inputs from a specification of a program’s expected input language, frequently expressed as a Context-Free Grammar (CFG). This considerably increases the chance of producing an input passing the program’s parsing stage and reaching its core logic.

Yet, while being great for parsing, CFGs are often too coarse for producing inputs. Consider, e.g., the language of XML documents (without document type). This language is not context free.¹ Still, it can be approximated by a CFG. Fig. 1 shows an excerpt of a CFG for XML. When we used a coverage-based fuzzer to produce 10,000 strings from this grammar, exactly one produced document (<O L= cmV > õ! B7</O>) contained a matching tag pair. This result is typical for language-based fuzzers used with a language specification designed for parsing, which therefore is more permissive than a language specification for producing would have to be. This is unfortunate, as hundreds of language specifications for parsing exist.

To allow for precise production, we need to enrich the grammar with more information, or switch to a different formalism. However, existing solutions all have their drawbacks.
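The gap between a grammar that is good for parsing and one that is good for producing can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy grammar `GRAMMAR` is not the grammar of Fig. 1, the plain random expansion in `expand` is not the coverage-based fuzzer used in the experiment, and the helper names `tags_match` and `count_valid` are hypothetical. The sketch only shows the general effect: because the grammar derives opening and closing tag names independently, most generated documents have mismatched tags.

```python
import random
import re

# Toy context-free grammar for XML-like documents (illustrative only).
# Opening and closing tag names are derived independently, which is
# exactly what a context-free approximation of XML permits.
GRAMMAR = {
    "<xml-tree>": [["<xml-open-tag>", "<inner>", "<xml-close-tag>"]],
    "<inner>": [["<text>"], ["<xml-tree>"]],
    "<xml-open-tag>": [["<", "<id>", ">"]],
    "<xml-close-tag>": [["</", "<id>", ">"]],
    "<id>": [["<letter>"], ["<letter>", "<letter>"]],
    "<letter>": [[c] for c in "abc"],
    "<text>": [["x"], ["y"], [""]],
}

MAX_DEPTH = 5  # beyond this depth, always take the first (non-recursive) alternative


def expand(symbol, rng, depth=0):
    """Randomly expand `symbol` into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    alternatives = GRAMMAR[symbol]
    chosen = alternatives[0] if depth > MAX_DEPTH else rng.choice(alternatives)
    return "".join(expand(part, rng, depth + 1) for part in chosen)


TAG = re.compile(r"<(/?)([a-z]+)>")


def tags_match(document):
    """Check the context-sensitive property: every closing tag matches its opening tag."""
    stack = []
    for match in TAG.finditer(document):
        is_closing, name = match.group(1), match.group(2)
        if is_closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack


def count_valid(n, seed=0):
    """Generate `n` documents and count those with matching tag pairs."""
    rng = random.Random(seed)
    return sum(tags_match(expand("<xml-tree>", rng)) for _ in range(n))


if __name__ == "__main__":
    total = 1000
    print(f"{count_valid(total)} of {total} generated documents have matching tags")
```

Even with only three letters available for tag names, most generated documents fail the check. With a realistic identifier alphabet, a match becomes far less likely.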
Using general purpose code to produce inputs, or enriching grammars with such code is closely tied to an implementation language, and does not allow for parsing and recombining inputs, which is a common feature of modern fuzzers.

Unrestricted grammars can in principle specify any computable input property, but we see them as “Turing tar-pits,” in which “everything is possible, but nothing of interest is easy” [22]; just try, for instance, to express that some number is the sum of two input elements.

Finally, one could also replace CFGs by a different formalism; but this would mean to renounce a concept that many developers know (e.g., from the ANTLR parser generator or RFCs).

In this paper, we bring forward a different solution by proposing a (programming and target) language-independent, declarative specification language named ISLa (Input Specification Language)

Alternatives for Mining Input Languages

Mining Input Grammars from Dynamic Control Flow
Rahul Gopinath
[email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany Björn Mathis
[email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany Andreas Zeller
[email protected] CISPA Helmholtz Center for Information Security Saarbrücken, Germany

ABSTRACT
One of the key properties of a program is its input specification. Having a formal input specification can be critical in fields such as vulnerability analysis, reverse engineering, software testing, clone detection, or refactoring. Unfortunately, accurate input specifications for typical programs are often unavailable or out of date.

In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program. We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all stack based recursive descent input parsers, including parser combinators, and works entirely without program specific heuristics. Our Mimid prototype produced accurate and readable grammars for a variety of evaluation subjects, including complex languages such as JSON, TinyC, and JavaScript.

CCS CONCEPTS
• Software and its engineering → Software reverse engineering; Dynamic analysis; • Theory of computation → Grammars and context-free languages.

KEYWORDS
context-free grammar, dynamic analysis, fuzzing, dataflow, control-flow

ACM Reference Format:
Rahul Gopinath, Björn Mathis, and Andreas Zeller. 2020. Mining Input Grammars from Dynamic Control Flow. In Proceedings of The 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). ACM, New York, NY, USA, 12 pages.

1 INTRODUCTION
One of the key properties of a program is its input specification. Having a formal input specification is important in diverse fields such as reverse engineering [18], program refactoring [29], and program comprehension [23, 44].
To generate complex system inputs for testing, a specification for the input language is practically

<START> ::= <json_raw>
<json_raw> ::= ‘"’ <json_string′> | ‘[’ <json_list′> | ‘{’ <json_dict′> | <json_number′> | ‘true’ | ‘false’ | ‘null’
<json_number′> ::= <json_number>+ | <json_number>+ ‘e’ <json_number>+
<json_number> ::= ‘+’ | ‘-’ | ‘.’ | [0-9] | ‘E’ | ‘e’
<json_string′> ::= <json_string>* ‘"’
<json_list′> ::= ‘]’ | <json_raw> (‘,’ <json_raw>)* ‘]’ | (‘,’ <json_raw>)+ (‘,’ <json_raw>)* ‘]’
<json_dict′> ::= ‘}’ | (‘"’ <json_string′> ‘:’ <json_raw> ‘,’)* ‘"’ <json_string′> ‘:’ <json_raw> ‘}’
<json_string> ::= ‘ ’ | ‘!’ | ‘#’ | ‘$’ | ‘%’ | ‘&’ | ‘’’ | ‘*’ | ‘+’ | ‘-’ | ‘,’ | ‘.’ | ‘/’ | ‘:’ | ‘;’ | ‘<’ | ‘=’ | ‘>’ | ‘?’ | ‘@’ | ‘[’ | ‘]’ | ‘^’ | ‘_’ | ‘`’ | ‘{’ | ‘|’ | ‘}’ | ‘~’ | ‘[A-Za-z0-9]’ | ‘\’ <decode_escape>
<decode_escape> ::= ‘"’ | ‘/’ | ‘b’ | ‘f’ | ‘n’ | ‘r’ | ‘t’

Figure 1: JSON grammar extracted from microjson.py.

While researchers have tried to tackle the problem of grammar recovery using black-box approaches [14, 48], the seminal paper by Angluin and Kharitonov [11] shows that a pure black-box approach is doomed to failure as there cannot be a polynomial time algorithm in terms of the number of queries needed for recovering a context-free grammar from membership queries alone. Hence, only white-box approaches that take program semantics into account can obtain an accurate input specification.

The first white-box approach to extract input structures from programs is the work by Lin et al. [39, 40], which recovers parse trees from inputs using a combination of static and dynamic analysis. However, Lin et al. stop at recovering the parse trees with limited labeling, and the recovery of a grammar from the parse trees is non-trivial (as the authors recognize in the paper, and as