Input Invariants
Dominic Steinhöfel
CISPA Helmholtz Center for Information Security
Saarbrücken, Germany
[email protected]
Andreas Zeller
CISPA Helmholtz Center for Information Security
Saarbrücken, Germany
[email protected]
ABSTRACT
Grammar-based fuzzers are highly efficient in producing syntactically
valid system inputs. However, as context-free grammars
cannot capture semantic input features, generated inputs will often
be rejected as semantically invalid by a target program. We propose
ISLa, a declarative specification language for context-sensitive
properties of structured system inputs based on context-free grammars.
With ISLa, it is possible to specify input constraints like “a
variable has to be defined before it is used,” “the length of the ‘file
name’ block in a TAR file has to equal 100 bytes,” or “the number
of columns in all CSV rows must be identical.”
ISLa constraints can be used for parsing or validation (“Does an
input meet the expected constraint?”) as well as for fuzzing, since
we provide both an evaluation and input generation component. ISLa
embeds SMT formulas as an island language, leveraging the power
of modern solvers like Z3 to solve atomic semantic constraints.
On top, it adds universal and existential quantifiers over the structure
of derivation trees from a grammar, and structural (“X occurs
before Y”) and semantic (“X is the checksum of Y”) predicates.
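To make the combination of quantifiers and structural predicates concrete, the following is a minimal Python sketch of a "defined before use" check over a derivation tree. It is not ISLa syntax and does not use the ISLa implementation; the tree encoding, the nonterminal names (`<assgn>`, `<var>`, `<use>`), and all helper functions are assumptions made for illustration only.

```python
from typing import List, Tuple

# A derivation tree node is (symbol, children); a leaf has no children.
# This encoding is an assumption of the sketch, not ISLa's data structure.
Tree = Tuple[str, list]
Path = Tuple[int, ...]


def subtrees(tree: Tree, path: Path = ()):
    """Yield (path, node) for every node of the tree in document order."""
    yield path, tree
    for index, child in enumerate(tree[1]):
        yield from subtrees(child, path + (index,))


def text(tree: Tree) -> str:
    """Concatenate the leaves below a node into the string it derives."""
    symbol, children = tree
    return symbol if not children else "".join(text(c) for c in children)


def nodes(tree: Tree, symbol: str) -> List[Tuple[Path, Tree]]:
    """All nodes labelled with the given nonterminal, with their paths."""
    return [(p, t) for p, t in subtrees(tree) if t[0] == symbol]


def before(p1: Path, p2: Path) -> bool:
    """Structural predicate: the node at p1 lies entirely before the node at p2."""
    return p1 < p2 and p2[:len(p1)] != p1


def defined_before_use(tree: Tree) -> bool:
    """For all uses, there exists an earlier assignment to the same variable."""
    assignments = nodes(tree, "<assgn>")
    for use_path, use_assgn in assignments:          # universal quantifier
        for _, use in nodes(use_assgn, "<use>"):
            if not any(                              # existential quantifier
                before(def_path, use_path)
                and text(def_assgn[1][0]) == text(use)
                for def_path, def_assgn in assignments
            ):
                return False
    return True


def assgn(lhs: str, rhs: str) -> Tree:
    """Build the tree of a hypothetical statement `lhs := rhs`."""
    kind = "<use>" if rhs.isalpha() else "<digit>"
    return ("<assgn>", [("<var>", [(lhs, [])]), (" := ", []),
                        ("<rhs>", [(kind, [(rhs, [])])])])


def program(*statements: Tuple[str, str]) -> Tree:
    return ("<start>", [assgn(lhs, rhs) for lhs, rhs in statements])


good = program(("x", "1"), ("y", "x"))   # x is assigned before it is used
bad = program(("y", "x"), ("x", "1"))    # x is used before it is assigned
print(defined_before_use(good), defined_before_use(bad))
```

The nested loops mirror a universal quantifier over uses and an existential quantifier over earlier assignments; a solver-backed system would generate inputs satisfying such a formula rather than only checking them.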
ISLa constraints can be specified manually, but also mined from
existing input samples. For this, our ISLearn prototype uses a catalog
of common patterns (such as the ones above), instantiates
these over input elements, and retains those candidates that hold
for the inputs observed and whose instantiations are fully accepted
by input-processing programs. The resulting constraints can then
again be used for fuzzing and parsing.
In our evaluation, we show that a few ISLa constraints suffice to
produce inputs that are 100% semantically valid while still maintaining
input diversity. Furthermore, we confirm that ISLearn mines useful
constraints about definition-use relationships and (implications
between) the existence of “magic constants”, e.g., for programming
languages and network packets.
CCS CONCEPTS
• Software and its engineering → Software testing and debugging;
Specification languages; Constraint and logic languages;
Syntax; Semantics; Parsers; Software reverse engineering; Documentation;
• Theory of computation → Grammars and context-free
languages; Formalisms.
1 INTRODUCTION
Automated software testing with random inputs (fuzzing) [19] is
an effective technique for finding bugs in programs. Pure random
inputs can quickly discover errors in input processing. Yet, if a
program expects complex structured inputs (e.g., C programs, JSON
expressions, or binary formats), the chances of randomly producing
valid inputs that are accepted by the parser and reach deeper
functionality are low.
Language-based fuzzers [8, 12, 13] overcome this limitation by
generating inputs from a specification of a program’s expected
input language, frequently expressed as a Context-Free Grammar
(CFG). This considerably increases the chance of producing an input
passing the program’s parsing stage and reaching its core logic. Yet,
while being great for parsing, CFGs are often too coarse for producing
inputs. Consider, e.g., the language of XML documents (without
document type). This language is not context free.1 Still, it can be
approximated by a CFG. Fig. 1 shows an excerpt of a CFG for XML.
When we used a coverage-based fuzzer to produce 10,000 strings
from this grammar, exactly one produced document contained a
matching tag pair. This result is typical for
language-based fuzzers used with a language specification designed
for parsing, which is therefore more permissive than a language
specification for producing would have to be. This is unfortunate,
as hundreds of language specifications for parsing exist.
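The effect described above can be reproduced in miniature. The sketch below uses a hypothetical toy grammar (not the grammar of Fig. 1) in which opening and closing tags draw their names independently, a plain random producer (not a coverage-based fuzzer), and a stack-based check for matching tags. All names are assumptions for illustration.

```python
import random
import re

# Toy CFG approximating XML: open and close tags choose their <id> independently,
# so the grammar cannot enforce that the two names agree.
GRAMMAR = {
    "<xml-tree>": [["<open-tag>", "<inner>", "<close-tag>"]],
    "<inner>": [["<text>"], ["<xml-tree>"]],
    "<open-tag>": [["<", "<id>", ">"]],
    "<close-tag>": [["</", "<id>", ">"]],
    "<id>": [["<letter>"], ["<letter>", "<id>"]],
    "<letter>": [[c] for c in "abcde"],
    "<text>": [["x"], ["y"]],
}


def generate(symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 4) -> str:
    """Expand a symbol randomly; beyond max_depth, always take the first
    (non-recursive) alternative so that expansion terminates."""
    if symbol not in GRAMMAR:
        return symbol  # terminal
    alternatives = GRAMMAR[symbol]
    expansion = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


TAG = re.compile(r"<(/?)([a-z]+)>")


def tags_match(document: str) -> bool:
    """Check the context-sensitive property: every closing tag matches
    the most recently opened tag, and no tag stays open."""
    stack = []
    for closing, name in TAG.findall(document):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


rng = random.Random(0)
documents = [generate("<xml-tree>", rng) for _ in range(1000)]
valid = sum(tags_match(d) for d in documents)
print(f"{valid} of {len(documents)} generated documents have matching tags")
```

Every generated string is syntactically derivable from the grammar, yet only those where the independently chosen tag names happen to coincide pass `tags_match`.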
To allow for precise production, we need to enrich the grammar
with more information, or switch to a different formalism. However,
existing solutions all have their drawbacks. Using general-purpose
code to produce inputs, or enriching grammars with such code, is
closely tied to an implementation language and does not allow
for parsing and recombining inputs, which is a common feature of
modern fuzzers. Unrestricted grammars can in principle specify any
computable input property, but we see them as “Turing tar-pits,” in
which “everything is possible, but nothing of interest is easy” [22]—
just try, for instance, to express that some number is the sum of two
input elements. Finally, one could also replace CFGs by a different
formalism; but this would mean renouncing a concept that many
developers know (e.g., from the ANTLR parser generator or RFCs).
In this paper, we bring forward a different solution by proposing
a (programming and target) language-independent, declarative
specification language named ISLa (Input Specification Language)
Alternatives for Mining Input Languages
Mining Input Grammars from Dynamic Control Flow
Rahul Gopinath
[email protected]
CISPA Helmholtz Center for
Information Security
Saarbrücken, Germany
Björn Mathis
[email protected]
CISPA Helmholtz Center for
Information Security
Saarbrücken, Germany
Andreas Zeller
[email protected]
CISPA Helmholtz Center for
Information Security
Saarbrücken, Germany
ABSTRACT
One of the key properties of a program is its input specification.
Having a formal input specification can be critical in fields such as
vulnerability analysis, reverse engineering, software testing, clone
detection, or refactoring. Unfortunately, accurate input specifications
for typical programs are often unavailable or out of date.
In this paper, we present a general algorithm that takes a program
and a small set of sample inputs and automatically infers a readable
context-free grammar capturing the input language of the program.
We infer the syntactic input structure only by observing access
of input characters at different locations of the input parser. This
works on all stack-based recursive descent input parsers, including
parser combinators, and works entirely without program-specific
heuristics. Our Mimid prototype produced accurate and readable
grammars for a variety of evaluation subjects, including complex
languages such as JSON, TinyC, and JavaScript.
CCS CONCEPTS
• Software and its engineering → Software reverse engineer-
ing; Dynamic analysis; • Theory of computation → Grammars
and context-free languages.
KEYWORDS
context-free grammar, dynamic analysis, fuzzing, dataflow, control-flow
ACM Reference Format:
Rahul Gopinath, Björn Mathis, and Andreas Zeller. 2020. Mining Input
Grammars from Dynamic Control Flow. In Proceedings of The 28th ACM
Joint European Software Engineering Conference and Symposium on the Foun-
dations of Software Engineering (ESEC/FSE 2020). ACM, New York, NY, USA,
12 pages.
1 INTRODUCTION
One of the key properties of a program is its input specification.
Having a formal input specification is important in diverse fields
such as reverse engineering [18], program refactoring [29], and
program comprehension [23, 44]. To generate complex system inputs
for testing, a specification for the input language is practically
⟨START⟩ ::= ⟨json_raw⟩
⟨json_raw⟩ ::= ‘"’ ⟨json_string0⟩ | ‘[’ ⟨json_list0⟩ | ‘{’ ⟨json_dict0⟩
    | ⟨json_number0⟩ | ‘true’ | ‘false’ | ‘null’
⟨json_number0⟩ ::= ⟨json_number⟩+
    | ⟨json_number⟩+ ‘e’ ⟨json_number⟩+
⟨json_number⟩ ::= ‘+’ | ‘-’ | ‘.’ | [0-9] | ‘E’ | ‘e’
⟨json_string0⟩ ::= ⟨json_string⟩* ‘"’
⟨json_list0⟩ ::= ‘]’
    | ⟨json_raw⟩ (‘,’ ⟨json_raw⟩)* ‘]’
    | (‘,’ ⟨json_raw⟩)+ (‘,’ ⟨json_raw⟩)* ‘]’
⟨json_dict0⟩ ::= ‘}’
    | (‘"’ ⟨json_string0⟩ ‘:’ ⟨json_raw⟩ ‘,’)*
      ‘"’ ⟨json_string0⟩ ‘:’ ⟨json_raw⟩ ‘}’
⟨json_string⟩ ::= ‘ ’ | ‘!’ | ‘#’ | ‘$’ | ‘%’ | ‘&’ | ‘’’
    | ‘*’ | ‘+’ | ‘-’ | ‘,’ | ‘.’ | ‘/’ | ‘:’ | ‘;’
    | ‘<’ | ‘=’ | ‘>’ | ‘?’ | ‘@’ | ‘[’ | ‘]’ | ‘^’ | ‘_’ | ‘`’
    | ‘{’ | ‘|’ | ‘}’ | ‘~’
    | ‘[A-Za-z0-9]’ | ‘\’ ⟨decode_escape⟩
⟨decode_escape⟩ ::= ‘"’ | ‘/’ | ‘b’ | ‘f’ | ‘n’ | ‘r’ | ‘t’
Figure 1: JSON grammar extracted from microjson.py.
While researchers have tried to tackle the problem of grammar
recovery using black-box approaches [14, 48], the seminal paper by
Angluin and Kharitonov [11] shows that a pure black-box approach
is doomed to failure, as there cannot be a polynomial-time algorithm,
in terms of the number of queries needed, for recovering a context-free
grammar from membership queries alone. Hence, only white-box
approaches that take program semantics into account can obtain
an accurate input specification.
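What a white-box view adds over membership queries can be illustrated with a toy example. The sketch below instruments a hand-written recursive descent parser for nested lists of numbers and records which parser function read each input character. It is not the Mimid algorithm: the parser, the `TrackedInput` wrapper, and the rule "a character belongs to the innermost function that accessed it last" are simplifying assumptions made for this illustration.

```python
import functools


class TrackedInput:
    """Wraps the input string and logs every character access together
    with the stack of parser functions active at that moment."""

    def __init__(self, text: str):
        self.text = text
        self.stack = []     # names of parser functions currently running
        self.accesses = []  # (index, character, call stack) per read

    def peek(self, i: int) -> str:
        ch = self.text[i] if i < len(self.text) else ""
        self.accesses.append((i, ch, tuple(self.stack)))
        return ch


def traced(fn):
    """Push the function name on the tracking stack for the duration of a call."""
    @functools.wraps(fn)
    def wrapper(inp, i):
        inp.stack.append(fn.__name__)
        try:
            return fn(inp, i)
        finally:
            inp.stack.pop()
    return wrapper


@traced
def parse_value(inp, i):
    return parse_list(inp, i) if inp.peek(i) == "[" else parse_number(inp, i)


@traced
def parse_number(inp, i):
    if not inp.peek(i).isdigit():
        raise ValueError(f"digit expected at {i}")
    while inp.peek(i).isdigit():
        i += 1
    return i


@traced
def parse_list(inp, i):
    if inp.peek(i) != "[":
        raise ValueError(f"'[' expected at {i}")
    i += 1
    if inp.peek(i) == "]":
        return i + 1
    while True:
        i = parse_value(inp, i)
        ch = inp.peek(i)
        i += 1
        if ch == "]":
            return i
        if ch != ",":
            raise ValueError(f"',' or ']' expected at {i - 1}")


def owners(inp: TrackedInput):
    """For each input index, the innermost function of its last access."""
    last = {}
    for i, ch, stack in inp.accesses:
        if ch:  # ignore reads past the end of the input
            last[i] = stack
    return [last[i][-1] for i in sorted(last)]


inp = TrackedInput("[1,[2,3]]")
parse_value(inp, 0)
for ch, owner in zip(inp.text, owners(inp)):
    print(repr(ch), owner)
```

Digits end up attributed to `parse_number`, while brackets and commas are attributed to `parse_list`; the recorded call stacks additionally distinguish the inner list from the outer one. Grouping characters by such ownership is one way to obtain a labeled parse tree from a single run, information that membership queries alone do not provide.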
The first white-box approach to extract input structures from
programs is the work by Lin et al. [39, 40], which recovers parse
trees from inputs using a combination of static and dynamic analysis.
However, Lin et al. stop at recovering the parse trees with
limited labeling, and the recovery of a grammar from the parse
trees is non-trivial (as the authors recognize in the paper, and as