Slide 1

Slide 1 text

Learning Grammars Without Samples
Rahul Gopinath
Postdoctoral Researcher
CISPA Helmholtz Center for Information Security

Slide 2

Slide 2 text

Why Learn a Grammar?

Slide 3

Slide 3 text

Structured Inputs are Everywhere
POST /InStock HTTP/1.1
Host: www.stock.org
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 312
IBM

Slide 4

Slide 4 text

Structured Inputs are Everywhere
POST /InStock HTTP/1.1
Host: www.stock.org
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 312
IBM
Annotations: HTTP POST

Slide 5

Slide 5 text

Structured Inputs are Everywhere
POST /InStock HTTP/1.1
Host: www.stock.org
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 312
IBM
Annotations: HTTP POST, XML PAYLOAD

Slide 6

Slide 6 text

Structured Inputs are Everywhere
POST /InStock HTTP/1.1
Host: www.stock.org
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 312
IBM
Annotations: HTTP POST, XML PAYLOAD, SOAP

Slide 7

Slide 7 text

Structured Inputs are Everywhere
POST /InStock HTTP/1.1
Host: www.stock.org
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 312
IBM
Annotations: HTTP POST, XML PAYLOAD, SOAP, RPC Call

Slide 8

Slide 8 text

We need grammars to reach the pot of gold
• HTTP: HTTP Parser
• XML: XML Parser
• SOAP: SOAP Parser
• RPC: RPC Parser
• Application

Slide 9

Slide 9 text

A Naive Fuzzer 9 $ ./fuzzit.py [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ ! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*. \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 ! 5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X } EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0| Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g ka&]BS6R&j? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ} r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI) (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn, 0)G/6N-wyzj/MTd#A;r HTTP Parser XML Parser SOAP Parser RPC Parser Application

Slide 10

Slide 10 text

A Naive Fuzzer
$ ./fuzzit.py
(same random output as on Slide 9)
HTTP Parser XML Parser SOAP Parser RPC Parser Application
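The slides do not show the source of fuzzit.py. The following is a minimal sketch of what such a naive fuzzer might look like, assuming it simply emits random printable ASCII characters. The names `naive_fuzz` and `looks_like_http` are hypothetical, and the first-layer check is a stand-in for a real HTTP parser.

```python
import random


def naive_fuzz(max_length=100, char_start=32, char_range=95, rng=random):
    """Return up to max_length random printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    return "".join(
        chr(rng.randrange(char_start, char_start + char_range))
        for _ in range(length)
    )


def looks_like_http(data):
    """Hypothetical stand-in for the first parser layer (assumption)."""
    return data.startswith(("GET ", "POST ", "PUT ", "DELETE "))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [naive_fuzz(rng=rng) for _ in range(1000)]
    accepted = sum(looks_like_http(s) for s in samples)
    print(samples[0])
    print(f"{accepted} of {len(samples)} inputs pass the first check")
```

Random strings of this kind are very unlikely to begin with a valid request line, so they are rejected at the outermost layer and never exercise the XML, SOAP, or RPC parsers behind it.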

Slide 11

Slide 11 text

What we need is the Input Specification!

Slide 12

Slide 12 text

If you already have the Specification:
• JSFunFuzz
• GramFuzz
• LangFuzz
• CSMITH
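To illustrate why a specification helps, here is a minimal sketch of grammar-based generation. It is not how any of the tools listed above are implemented. The toy grammar (a small JSON subset), the `generate` function, and the depth limit are all assumptions made for this example.

```python
import json
import random
import re

# Hypothetical toy grammar for a JSON subset (illustration only).
GRAMMAR = {
    "<value>": ["<number>", "true", "false", "null", "[]", "[<elements>]"],
    "<elements>": ["<value>", "<value>,<elements>"],
    "<number>": ["<digit>", "<nonzero><digits>"],
    "<digits>": ["<digit>", "<digit><digits>"],
    "<nonzero>": list("123456789"),
    "<digit>": list("0123456789"),
}

NONTERMINAL = re.compile(r"<[a-z]+>")


def generate(symbol="<value>", depth=0, max_depth=8, rng=random):
    """Expand `symbol` by picking random alternatives from GRAMMAR."""
    expansions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth budget, keep only the alternatives with the
        # fewest nonterminals so that the expansion terminates.
        fewest = min(len(NONTERMINAL.findall(e)) for e in expansions)
        expansions = [e for e in expansions
                      if len(NONTERMINAL.findall(e)) == fewest]
    expansion = rng.choice(expansions)
    return NONTERMINAL.sub(
        lambda m: generate(m.group(0), depth + 1, max_depth, rng),
        expansion,
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        sample = generate(rng=rng)
        json.loads(sample)  # raises ValueError if the sample is invalid
        print(sample)
```

Every string produced this way is syntactically valid by construction, so it gets past the parser and reaches the code behind it.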

Slide 13

Slide 13 text

But, where do we get grammars from?

Slide 14

Slide 14 text

But, where do we get grammars from?
[Diagram: Sample Inputs, Grammar]
Use sample inputs to dynamically mine the grammar (AUTOGRAM - ASE 2016)

Slide 15

Slide 15 text

But, where do we get sample inputs from?
[Diagram: Sample Inputs?, Grammar]
If we had a grammar, we could use it to generate sample inputs. But we don't.

Slide 16

Slide 16 text

Developer Produced Grammar?
• Often out of sync with the program
• Can result in blind spots

Slide 17

Slide 17 text

State of the Art in Generating Inputs: KLEE, AFL and GLADE
• KLEE: uses symbolic execution
• AFL: uses coverage-guided fuzzing
• GLADE: black-box grammar recovery

Slide 18

Slide 18 text

GLADE:
• Black-box technique
• Very slow to generate meaningful grammars

Slide 19

Slide 19 text

AFL:
• Uses (branch) coverage-guided fuzzing
• Mutates sample inputs (if available)
• The inputs generated are shallow and simple
• Very few valid inputs
• Performance is affected by the complexity of the input space

Slide 20

Slide 20 text

KLEE:
• Uses symbolic execution
• Very fast to explore simple languages
• But suffers when the input space becomes complex

Slide 21

Slide 21 text

Do we need full symbolic execution?
Idea: Solve only the next character.
pFuzzer

Slide 22

Slide 22 text

Do we need full symbolic execution?
pFuzzer sends to Program: x ✘
– checks for digit
– checks for "true"/"false"
– checks for '"'
– checks for '['
– checks for '{'

Slide 23

Slide 23 text

Do we need full symbolic execution?
pFuzzer sends to Program: x
Replace x with a digit

Slide 24

Slide 24 text

Do we need full symbolic execution?
pFuzzer sends to Program: 0 ✔

Slide 25

Slide 25 text

Do we need full symbolic execution?
pFuzzer sends to Program: [ -
Succeeds but does not terminate
– checks for digit
– checks for "true"/"false"
– checks for '"'
– checks for '['
– checks for '{'

Slide 26

Slide 26 text

Do we need full symbolic execution?
pFuzzer sends to Program: [z -✘
– checks for digit
– checks for "true"/"false"
– checks for '"'
– checks for '['
– checks for '{'

Slide 27

Slide 27 text

Do we need full symbolic execution?
pFuzzer sends to Program: [true --
Replaces z with true: succeeds but does not terminate

Slide 28

Slide 28 text

Do we need full symbolic execution?
pFuzzer sends to Program: [true1 --✘
– checks for ','
– checks for ']'

Slide 29

Slide 29 text

Do we need full symbolic execution?
pFuzzer sends to Program: [true] --✔
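The walkthrough on slides 22 to 29 can be sketched as a small loop: run the parser, look at what the rejected character was compared against, and substitute one of those values. This is an illustration under stated assumptions, not the pFuzzer implementation. The toy parser below is hand-instrumented to report its comparisons, whereas pFuzzer obtains them through dynamic taint tracking, and all names (`match`, `Rejected`, `Incomplete`, `fuzz`) are hypothetical.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation
VALUE_STARTS = list(string.digits) + ["true", "false", "["]


class Incomplete(Exception):
    """The parser tried to read past the end of the input."""


class Rejected(Exception):
    """Nothing matched at `pos`; `candidates` is what was compared."""

    def __init__(self, pos, candidates):
        super().__init__(pos, candidates)
        self.pos, self.candidates = pos, candidates


def match(text, pos, candidates):
    """Compare the input at `pos` against each candidate token."""
    if pos >= len(text):
        raise Incomplete()
    for candidate in candidates:
        if text.startswith(candidate, pos):
            return candidate
    raise Rejected(pos, candidates)


def parse_value(text, pos):
    token = match(text, pos, VALUE_STARTS)
    if token == "[":
        return parse_array(text, pos + 1)
    pos += len(token)
    if token.isdigit():  # consume the remaining digits of a number
        while pos < len(text) and text[pos] in string.digits:
            pos += 1
    return pos


def parse_array(text, pos):
    if match(text, pos, ["]"] + VALUE_STARTS) == "]":
        return pos + 1
    while True:
        pos = parse_value(text, pos)
        if match(text, pos, [",", "]"]) == "]":
            return pos + 1
        pos += 1  # skip the ','


def parse(text):
    end = parse_value(text, 0)
    if end != len(text):
        raise Rejected(end, [""])  # trailing characters: drop them
    return True


def fuzz(rng, max_steps=100):
    """Grow an input one character at a time until the parser accepts."""
    text = rng.choice(ALPHABET)
    for _ in range(max_steps):
        try:
            parse(text)
            return text
        except Incomplete:
            text += rng.choice(ALPHABET)  # parser wants more input
        except Rejected as failure:
            # Replace the rejected suffix with something the parser
            # compared against at that position.
            text = text[:failure.pos] + rng.choice(failure.candidates)
    return None  # step budget exhausted


if __name__ == "__main__":
    rng = random.Random(0)
    print([fuzz(rng) for _ in range(5)])
```

Because only the character at the point of failure is solved for, no constraint solver is needed. The sketch omits the heuristics a real tool needs to decide which candidate to try next.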

Slide 30

Slide 30 text

Comparing to KLEE and AFL
[Figure: the average number of tokens found by KLEE, AFL, and pFuzzer for each token length (Mathis 2019 PLDI)]

Slide 31

Slide 31 text

pFuzzer: Implementations in Python and C
• Relies on dynamic taint tracking and character comparisons
• Assumes availability of source (LLVM bitcode) at this point, but it is not a requirement

Slide 32

Slide 32 text

DEMO

Slide 33

Slide 33 text

Longest Generated Input for JSON Parser
[false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+| 4GzCW(C":-94}} ],[false,null]] ............................ nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
AFL KLEE pFuzzer
pFuzzer generates strings with better structure than either AFL or KLEE

Slide 34

Slide 34 text

Feeding the Results Back to the Grammar Miner
[Diagram labels: Program Under Test, Parser-Directed Test Generator, Comparisons, Tests, Dynamic Taints, Grammar Learner, Test Inputs, Inputs + Equivalence Classes, Grammar Fuzzer]

Slide 35

Slide 35 text

AUTOGRAM: Grammar from Samples

Slide 36

Slide 36 text

protected void parseURL(URL u, String spec, int start, int limit) {
    String protocol = u.getProtocol();
    String authority = u.getAuthority();
    String userInfo = u.getUserInfo();
    String host = u.getHost();
    int port = u.getPort();
    int i = 0;
    boolean isUNCName = (start <= limit - 4) &&
        (spec.charAt(start) == '/') && (spec.charAt(start + 1) == '/') &&
        (spec.charAt(start + 2) == '/') && (spec.charAt(start + 3) == '/');
    if (!isUNCName && (start <= limit - 2) &&
        (spec.charAt(start) == '/') && (spec.charAt(start + 1) == '/')) {
        start += 2;
        i = spec.indexOf('/', start);
        if (i < 0) {
            i = spec.indexOf('?', start);
            if (i < 0) i = limit;
        }
        host = authority = spec.substring(start, i);
        int ind = authority.indexOf('@');
        if (ind != -1) {
            userInfo = authority.substring(0, ind);
            host = authority.substring(ind+1);
        } else userInfo = null;
        if (host != null) {
            if (host.length()>0 && (host.charAt(0) == '[')) {
                if ((ind = host.indexOf(']')) > 2) {
                    String nhost = host;
                    host = nhost.substring(0,ind+1);
                    port = -1;
                    if (nhost.length() > ind+1) {
                        if (nhost.charAt(ind+1) == ':') {
                            ++ind;
                            if (nhost.length() > (ind + 1))
                                port = Integer.parseInt(nhost.substring(ind+1));
                        }
                    }
                }
            } else {
                ind = host.indexOf(':');
                port = -1;
                if (ind >= 0) {
                    if (host.length() > (ind + 1)) {
                        port = Integer.parseInt(host.substring(ind + 1));
                    }
                    host = host.substring(0, ind);
                }
            }
        } else host = "";
        start = i;
        if (host == null) host = "";
        ...
        setURL(u, protocol, host, port, authority, userInfo, ...);

http://admin:[email protected]:80/command?foo=bar&lorem=ipsum#fragment
http://www.guardian.co.uk/sports/worldcup#results
ftp://bob:[email protected]/oss/debian7.iso

URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
PROTOCOL ::= 'http' | 'ftp'
USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
HOST ::= r{[a-z.]+}
PORT ::= '80'
PATH ::= r{/[a-z0-9.]*}
QUERY ::= 'foo=bar&lorem=ipsum'
REF ::= r{[a-z]+}

Slide 37

Slide 37 text

parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)
parseURL :spec = http://admin:pass123@www.google.com:80
setURL :protocol = http, :authority = admin:pass123, :port = 80, :host = www.google.com
setUserInfo: admin, pass123

Slide 38

Slide 38 text

parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)
Sample 1:
parseURL :spec = http://admin:pass123@www.google.com:80
setURL :protocol = http, :authority = admin:pass123, :port = 80, :host = www.google.com
setUserInfo: admin, pass123
Sample 2:
parseURL :spec = ftp://boo:12345@example.ftp.com
setURL :protocol = ftp, :authority = boo:12345, :host = example.ftp.com
setUserInfo: boo, 12345

Slide 39

Slide 39 text

parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)
[Diagram: merged call tree. parseURL (:spec); setURL (:protocol, :authority, :port, :host) with values http | ftp, [80], www.google.com | example.ftp.com; setUserInfo with values admin | boo, pass123 | 12345. Node labels: SPEC, AUTHORITY]
SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
AUTHORITY ::= USER ':' PASSWORD
USER ::= r{[a-z]+}
PASSWORD ::= r{[a-z0-9]+}
HOST ::= r{[a-z]+}
PORT ::= r{[0-9]+}
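The merging step above can be sketched in a few lines. This is a simplified illustration and not the AUTOGRAM algorithm: it assumes each execution has already been reduced to the input plus a map from variable names to the input fragments they received, and it uses a plain substring search where AUTOGRAM relies on dynamic taints. The names `derive_rule` and `mine` are hypothetical, and no generalization to regular expressions is attempted.

```python
def derive_rule(spec, fragments):
    """Replace each tracked fragment of `spec` with a nonterminal."""
    found = []
    for name, value in fragments.items():
        index = spec.find(value)  # stand-in for taint information
        if index >= 0:
            found.append((index, name, value))
    found.sort()
    rule, cursor = [], 0
    for index, name, value in found:
        if index < cursor:
            continue  # overlaps an earlier fragment, skip it
        if index > cursor:
            rule.append(spec[cursor:index])  # literal text in between
        rule.append(f"<{name}>")
        cursor = index + len(value)
    if cursor < len(spec):
        rule.append(spec[cursor:])
    return rule


def mine(traces):
    """Merge the rules and fragment values seen across all traces."""
    grammar = {"<spec>": []}
    for spec, fragments in traces:
        rule = derive_rule(spec, fragments)
        if rule not in grammar["<spec>"]:
            grammar["<spec>"].append(rule)
        for name, value in fragments.items():
            alternatives = grammar.setdefault(f"<{name}>", [])
            if value not in alternatives:
                alternatives.append(value)
    return grammar


if __name__ == "__main__":
    traces = [
        ("http://admin:pass123@www.google.com:80",
         {"protocol": "http", "user": "admin", "password": "pass123",
          "host": "www.google.com", "port": "80"}),
        ("ftp://boo:12345@example.ftp.com",
         {"protocol": "ftp", "user": "boo", "password": "12345",
          "host": "example.ftp.com"}),
    ]
    for nonterminal, alternatives in mine(traces).items():
        print(nonterminal, "::=", alternatives)
```

The two samples yield two alternatives for `<spec>`, one with and one without a port, which corresponds to the optional `[':' PORT]` in the grammar on the slide.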

Slide 40

Slide 40 text

AUTOGRAM
• Requires dynamic tainting to track input fragments
• Works well for recursive-descent parsers
• Provides strict guarantees on the grammar produced
  • A generated string from the inferred grammar is guaranteed to be a valid input (under certain assumptions)
• Relatively heavyweight, and language/runtime specific
  • JVM/LLVM for now
• Problems with implicit taint propagation and internal calls

Slide 41

Slide 41 text

What if we relax our constraints?
Is taint tracking strictly needed?

Slide 42

Slide 42 text

DEMO: fuzzingbook: GrammarMining

Slide 43

Slide 43 text

What can you do with the grammar?
• An abstract representation of a set of executions
  • Or of the program itself, if the set of executions is representative
• Language agnostic, with low implementation complexity
• Can be used to identify behavioral changes
  • E.g. duplicates in mutation analysis
• Fingerprinting programs
• Clone detection
• Refactoring

Slide 44

Slide 44 text

Is input specification sufficient?
How can we make fuzzing better?

Slide 45

Slide 45 text

Better Oracles
• Taint Tracking
• Concolic Execution
• Metamorphic testing
  • Reorder elements
  • Delete/Insert/Duplicate idempotent elements
• Differential testing
• Carving unit tests
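Two of these oracles are easy to sketch. The example below is an illustration only, using Python's `json` module as a stand-in program under test; the helper names (`render`, `reorder_holds`, `duplicate_holds`, `accepts`, `disagreements`) are hypothetical. The metamorphic checks assume that reordering object members, or repeating a member with the same value, should not change the parsed result. The differential check compares the default decoder against one built with `strict=False`, which additionally accepts control characters inside strings.

```python
import json
import random


def render(members):
    """Build JSON object text from (key, value_text) pairs."""
    body = ", ".join(f"{json.dumps(key)}: {value}" for key, value in members)
    return "{" + body + "}"


def reorder_holds(parse, members, rng):
    """Metamorphic relation: member order must not change the result."""
    shuffled = list(members)
    rng.shuffle(shuffled)
    return parse(render(members)) == parse(render(shuffled))


def duplicate_holds(parse, members, rng):
    """Metamorphic relation: repeating a member must not change the result."""
    duplicated = list(members) + [rng.choice(members)]
    return parse(render(members)) == parse(render(duplicated))


def accepts(parse, text):
    try:
        parse(text)
        return True
    except ValueError:  # json.JSONDecodeError is a ValueError
        return False


def disagreements(parsers, inputs):
    """Differential testing: inputs on which the parsers disagree."""
    return [text for text in inputs
            if len({accepts(p, text) for p in parsers}) > 1]


if __name__ == "__main__":
    rng = random.Random(0)
    members = [("a", "1"), ("b", "[true, null]"), ("c", '"x"')]
    print(reorder_holds(json.loads, members, rng))
    print(duplicate_holds(json.loads, members, rng))
    lenient = json.JSONDecoder(strict=False).decode
    print(disagreements([json.loads, lenient], ["[1]", '"a\tb"', "["]))
```

Neither oracle needs to know the expected output for a given input, which is what makes them usable with automatically generated inputs.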
