
Learning Grammars without Samples

Rahul Gopinath

April 02, 2019
Transcript

  1-5. Structured Inputs are Everywhere

    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    <?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
      <soap:Body xmlns:m="http://www.stock.org/stock">
        <m:GetStockPrice>
          <m:StockName>IBM</m:StockName>
        </m:GetStockPrice>
      </soap:Body>
    </soap:Envelope>

    Layers highlighted one per slide: HTTP POST, XML PAYLOAD, SOAP, RPC Call
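    The nesting of layers in this request can be sketched in Python. This is only an illustration under assumptions: `parse_request` is a hypothetical helper, the `Content-Length` header is left out because the sketch does not check it, and the standard-library `xml.etree.ElementTree` stands in for whatever XML and SOAP parsers a real server would use. Each step only runs if the previous layer parsed.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2001/12/soap-envelope"

# The request from the slide, minus Content-Length (not checked in this sketch).
REQUEST = (
    "POST /InStock HTTP/1.1\r\n"
    "Host: www.stock.org\r\n"
    "Content-Type: application/soap+xml; charset=utf-8\r\n"
    "\r\n"
    '<?xml version="1.0"?>'
    '<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope"'
    ' soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">'
    '<soap:Body xmlns:m="http://www.stock.org/stock">'
    "<m:GetStockPrice><m:StockName>IBM</m:StockName></m:GetStockPrice>"
    "</soap:Body></soap:Envelope>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def parse_request(request):
    # Layer 1: HTTP. Split head from body and read the request line.
    head, _, body = request.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = dict(line.split(": ", 1) for line in header_lines)

    # Layer 2: XML. The body must be well-formed XML.
    root = ET.fromstring(body)

    # Layer 3: SOAP. The root must be an Envelope containing a Body.
    if root.tag != "{%s}Envelope" % SOAP_NS:
        raise ValueError("not a SOAP envelope")
    soap_body = root.find("{%s}Body" % SOAP_NS)
    if soap_body is None or len(soap_body) == 0:
        raise ValueError("empty SOAP body")

    # Layer 4: RPC. The first child of Body names the call,
    # and its children are the arguments.
    call = soap_body[0]
    args = {local_name(child.tag): child.text for child in call}
    return {"method": method, "path": path, "headers": headers,
            "call": local_name(call.tag), "args": args}
```

    An input that fails at layer 1 never exercises layers 2 to 4, which is why structure matters for test generation.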
  6. We need grammars to reach the pot of gold

    Each input layer has its own parser: HTTP (HTTP Parser), XML (XML Parser), SOAP (SOAP Parser), RPC (RPC Parser), and only then the Application.
  7-8. A Naive Fuzzer

    $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ ! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*. \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 ! 5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X } EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0| Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g ka<W=Z.%T5WGHZpI30D<Pq>&]BS6R&j? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ} r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI) (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn, 0)G/6N-wyzj/MTd#A;r

    Pipeline shown below the output: HTTP Parser, XML Parser, SOAP Parser, RPC Parser, Application
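    A naive fuzzer of this kind fits in a few lines. The sketch below is not the `fuzzit.py` from the slide, whose source is not shown. `naive_fuzz` and `count_valid` are hypothetical names, and the standard-library `json.loads` stands in for the first parser in the pipeline.

```python
import json
import random
import string


def naive_fuzz(max_length=100, alphabet=string.printable, rng=None):
    """Return a string of random characters, of random length up to max_length."""
    rng = rng or random.Random()
    length = rng.randint(0, max_length)
    return "".join(rng.choice(alphabet) for _ in range(length))


def count_valid(trials, rng):
    """Count how many random strings a JSON parser accepts."""
    valid = 0
    for _ in range(trials):
        try:
            json.loads(naive_fuzz(rng=rng))
            valid += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return valid
```

    Calling `count_valid(1000, random.Random(0))` gives a rough feel for how rarely unstructured input gets past even a single parser.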
  9. But, where do we get grammars from?

    Sample Inputs -> Grammar
    Use sample inputs to dynamically mine the grammar (AUTOGRAM, ASE 2016).
  10. But, where do we get sample inputs from?

    Sample Inputs? -> Grammar
    If we had a grammar, we could use it to generate sample inputs. But we don't.
  11. Developer Produced Grammar?

    • Often out of sync with the program
    • Can result in blind spots
  12. State of the Art in Generating Inputs: KLEE, AFL and GLADE

    KLEE
    • Uses Symbolic Execution
    AFL
    • Uses Coverage-guided Fuzzing
    GLADE
    • Blackbox Grammar Recovery
  13. AFL

    • Uses (Branch) Coverage-guided Fuzzing
    • Mutates sample inputs (if available)
    • The inputs generated are shallow and simple
    • Very few valid inputs
    • Performance affected by complexity of the input space
  14. KLEE

    • Uses Symbolic Execution
    • Very fast to explore simple languages
    • But suffers when the input space becomes complex
  15. Do we need full symbolic execution?

    pFuzzer -> Program input: x
    Result: ✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  16. Do we need full symbolic execution?

    pFuzzer -> Program input: [
    Result: - (succeeds but does not terminate)
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  17. Do we need full symbolic execution?

    pFuzzer -> Program input: [z
    Result: -✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  18. Do we need full symbolic execution?

    pFuzzer -> Program input: [true
    Result: -- (replaces z with true; succeeds but does not terminate)
  19. Do we need full symbolic execution?

    pFuzzer -> Program input: [true1
    Result: --✘
    – checks for ','
    – checks for ']'
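    The walk-through above can be sketched as a small loop. This is a simplified illustration, not the pFuzzer implementation: the toy parser accepts only numbers, true, false and arrays (no strings or objects), and the names `TracedInput`, `run` and `pfuzz` are hypothetical. Recording every comparison made through `accept` stands in for dynamic taint tracking.

```python
import random
import string

DIGITS = "0123456789"
ALPHABET = string.ascii_letters + string.digits + string.punctuation


class Rejected(Exception):
    """Raised by the toy parser when it cannot continue."""


class TracedInput:
    """Input wrapper that records every string the parser compares against."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.compared = {}  # position -> strings tried at that position

    def at_end(self):
        return self.pos >= len(self.text)

    def accept(self, expected):
        self.compared.setdefault(self.pos, []).append(expected)
        if self.text.startswith(expected, self.pos):
            self.pos += len(expected)
            return True
        return False


def parse_value(inp):
    """Toy recursive-descent parser: number | true | false | [value, ...]."""
    if any(inp.accept(d) for d in DIGITS):
        while any(inp.accept(d) for d in DIGITS):
            pass
    elif inp.accept("true") or inp.accept("false"):
        pass
    elif inp.accept("["):
        if not inp.accept("]"):
            parse_value(inp)
            while inp.accept(","):
                parse_value(inp)
            if not inp.accept("]"):
                raise Rejected()
    else:
        raise Rejected()


def run(text):
    """Return ('valid' | 'incomplete' | 'rejected', traced input)."""
    inp = TracedInput(text)
    try:
        parse_value(inp)
        if not inp.at_end():
            raise Rejected()  # trailing characters after a complete value
        return "valid", inp
    except Rejected:
        # Failing exactly at the end means the parser wanted more input.
        return ("incomplete" if inp.at_end() else "rejected"), inp


def pfuzz(rng, max_steps=1000):
    """Grow an input until the parser accepts it, guided by its comparisons."""
    text = rng.choice(ALPHABET)
    for _ in range(max_steps):
        status, inp = run(text)
        if status == "valid":
            return text
        if status == "incomplete":
            text += rng.choice(ALPHABET)  # append a random character
            continue
        candidates = inp.compared.get(inp.pos, [])
        if candidates:
            # Replace the rejected suffix with something the parser checked for.
            text = text[:inp.pos] + rng.choice(candidates)
        else:
            text = text[:inp.pos]  # nothing was compared: drop the suffix
    return None
```

    For example, `run("[true1")` reports a rejection at position 5, where the recorded comparisons are `","` and `"]"`, matching slide 19.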
  20. Comparing to KLEE and AFL

    Figure: the average number of tokens found by KLEE, AFL and pFuzzer for each token length (Mathis 2019 PLDI).
  21. pFuzzer: Implementations in Python and C

    • Relies on dynamic taint tracking and character comparisons
    • Assumes availability of source (LLVM bitcode) at this point, but this is not a requirement
  22. Longest Generated Input for JSON Parser

    Generated inputs shown on the slide:
    [false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+| 4GzCW(C":-94}} ],[false,null]]
    ............................
    nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
    Tools compared: AFL, KLEE, pFuzzer
    pFuzzer generates strings with better structure than either AFL or KLEE.
  23. Feeding the Results Back to the Grammar Miner

    Components: Program Under Test, Parser-Directed Test Generator, Grammar Learner, Grammar Fuzzer
    Data exchanged between them: Comparisons, Tests, Dynamic Taints, Test Inputs, Inputs + Equivalence Classes
  24. Example: parseURL

    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
                (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/') &&
                (spec.charAt(start + 2) == '/') &&
                (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) &&
                (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/')) {
            start += 2;
            // The authority ends at the first '/' or '?', or at the limit.
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0)
                    i = limit;
            }
            host = authority = spec.substring(start, i);
            // Split off the user info before '@'.
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind + 1);
            } else
                userInfo = null;
            if (host != null) {
                if (host.length() > 0 && (host.charAt(0) == '[')) {
                    // Bracketed host, optionally followed by ':' and a port.
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host;
                        host = nhost.substring(0, ind + 1);
                        port = -1;
                        if (nhost.length() > ind + 1) {
                            if (nhost.charAt(ind + 1) == ':') {
                                ++ind;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind + 1));
                            }
                        }
                    }
                } else {
                    // Plain host, optionally followed by ':' and a port.
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else
                host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);

    Sample inputs:
    http://admin:[email protected]:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso

    Mined grammar:
    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
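    Once a grammar like this is available, generating inputs from it is straightforward. The following is a minimal sketch of a grammar-based generator, assuming a dictionary encoding of the grammar above. The `OPT_*` nonterminals, the `chars` helper and the length bounds on the regular-expression parts are illustrative choices, not part of the mined grammar.

```python
import random
import string

LOWER = string.ascii_lowercase


def chars(alphabet, lo, hi):
    """Stand-in for r{[...]+}: produce lo..hi random characters from alphabet."""
    def gen(rng):
        return "".join(rng.choice(alphabet) for _ in range(rng.randint(lo, hi)))
    return gen


# Each nonterminal maps to a list of alternatives; each alternative is a
# sequence of nonterminals, literal strings, or character generators.
# Optional parts [X] are encoded as an empty alternative plus X.
GRAMMAR = {
    "URL": [["PROTOCOL", "://", "AUTHORITY", "PATH", "OPT_QUERY", "OPT_REF"]],
    "OPT_QUERY": [[], ["?", "QUERY"]],
    "OPT_REF": [[], ["#", "REF"]],
    "AUTHORITY": [["OPT_USERINFO", "HOST", "OPT_PORT"]],
    "OPT_USERINFO": [[], ["USERINFO", "@"]],
    "OPT_PORT": [[], [":", "PORT"]],
    "PROTOCOL": [["http"], ["ftp"]],
    "USERINFO": [[chars(LOWER, 1, 8), ":", chars(LOWER + string.digits, 1, 8)]],
    "HOST": [[chars(LOWER + ".", 1, 12)]],
    "PORT": [["80"]],
    "PATH": [["/", chars(LOWER + string.digits + ".", 0, 12)]],
    "QUERY": [["foo=bar&lorem=ipsum"]],
    "REF": [[chars(LOWER, 1, 8)]],
}


def generate(symbol, rng):
    """Expand a symbol by picking a random alternative at every nonterminal."""
    if callable(symbol):
        return symbol(rng)          # character generator
    if symbol not in GRAMMAR:
        return symbol               # literal terminal
    alternative = rng.choice(GRAMMAR[symbol])
    return "".join(generate(part, rng) for part in alternative)
```

    Calling `generate("URL", random.Random())` yields strings that follow the grammar, although hosts such as `..a.` show that the mined `HOST` rule is looser than a real host name.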
  25. parseURL(spec) -> setURL(protocol, host, port, authority, …) -> setUserInfo(user, password)

    Input: http://admin:[email protected]:80
    parseURL     :spec = the whole input
    setURL       :protocol = http, :authority = admin:pass123, :host = www.google.com, :port = 80
    setUserInfo  admin, pass123
  26. parseURL(spec) -> setURL(protocol, host, port, authority, …) -> setUserInfo(user, password)

    Input: http://admin:[email protected]:80
    parseURL     :spec = the whole input
    setURL       :protocol = http, :authority = admin:pass123, :host = www.google.com, :port = 80
    setUserInfo  admin, pass123

    Input: ftp://boo:[email protected]
    parseURL     :spec = the whole input
    setURL       :protocol = ftp, :authority = boo:12345, :host = example.ftp.com
    setUserInfo  boo, 12345
  27. parseURL(spec) -> setURL(protocol, host, port, authority, …) -> setUserInfo(user, password)

    Values merged across both runs:
    parseURL     :spec
    setURL       :protocol = http | ftp, :authority, :host = www.google.com | example.ftp.com, :port = [80]
    setUserInfo  admin | boo, pass123 | 12345

    Resulting grammar:
    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= r{[0-9]+}
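    The step from observed values to grammar rules can be sketched as follows. This is a rough illustration, not AUTOGRAM itself. It uses Python's `urllib.parse.urlsplit` as the program under test and plain substring matching in place of dynamic tainting. The function names and both sample URLs are illustrative, and, as on the slide, `AUTHORITY` here means the `user:password` part.

```python
from urllib.parse import urlsplit


def observe(url):
    """Stand-in for taint tracking: values a URL parser derives from the input."""
    parts = urlsplit(url)
    values = {
        "PROTOCOL": parts.scheme,
        "AUTHORITY": parts.netloc.rpartition("@")[0],  # user:password
        "USER": parts.username,
        "PASSWORD": parts.password,
        "HOST": parts.hostname,
        "PORT": str(parts.port) if parts.port is not None else None,
    }
    return {name: text for name, text in values.items() if text}


def abstract(value, fragments):
    """Replace the first occurrence of each fragment in value with <NAME>.

    Longer fragments go first so that enclosing values win over nested ones.
    """
    pieces = [value]  # str = terminal text, 1-tuple = nonterminal
    for name, text in sorted(fragments.items(), key=lambda kv: -len(kv[1])):
        for i, piece in enumerate(pieces):
            if isinstance(piece, str) and text in piece:
                before, _, after = piece.partition(text)
                pieces[i:i + 1] = [p for p in (before, (name,), after) if p]
                break
    return ["<%s>" % p[0] if isinstance(p, tuple) else p for p in pieces]


def mine(url):
    """Derive one rule per observed variable from a single input."""
    values = observe(url)
    rules = {"SPEC": abstract(url, values)}
    for name, text in values.items():
        inner = {n: t for n, t in values.items() if t != text and t in text}
        rules[name] = abstract(text, inner)
    return rules


def merge(samples):
    """Collect the distinct rules seen for each nonterminal as alternatives."""
    grammar = {}
    for url in samples:
        for name, rule in mine(url).items():
            alternatives = grammar.setdefault(name, [])
            if rule not in alternatives:
                alternatives.append(rule)
    return grammar
```

    Substring matching is only an approximation: a value such as `ftp` that also occurs inside a host name would be replaced there too. Tracking where each character actually came from avoids this kind of false match.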
  28. AUTOGRAM

    • Requires dynamic tainting to track input fragments
    • Works well for recursive-descent parsers
    • Provides strict guarantees on the grammar produced
    • A generated string from the inferred grammar is guaranteed to be a valid input (under certain assumptions)
    • Relatively heavyweight, and language/runtime specific
    • JVM/LLVM for now
    • Problems with implicit taint propagation and internal calls
  29. What can you do with the grammar?

    • An abstract representation of a set of executions
    • Or of the program itself, if the set of executions is representative
    • Language agnostic, and low implementation complexity
    • Can be used to identify behavioral changes
      • E.g. duplicates in mutation analysis
    • Fingerprinting programs
    • Clone detection
    • Refactoring
  30. Better Oracles

    • Taint Tracking
    • Concolic Execution
    • Metamorphic testing
      • Reorder elements
      • Delete/Insert/Duplicate idempotent elements
    • Differential testing
    • Carving unit tests
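    As an illustration of the metamorphic testing bullet, the sketch below checks two relations for a JSON parser: reordering object members, and duplicating them, should not change the parsed result. The helper names `render` and `relation_holds` are hypothetical, and the standard-library `json.loads` is used as the parser under test.

```python
import json
import random


def render(obj, rng, duplicate=False):
    """Serialize obj to JSON text with object members shuffled at every level.

    With duplicate=True every object member is also emitted twice.
    """
    if isinstance(obj, dict):
        members = [json.dumps(key) + ":" + render(value, rng, duplicate)
                   for key, value in obj.items()]
        if duplicate:
            members = members + members
        rng.shuffle(members)
        return "{" + ",".join(members) + "}"
    if isinstance(obj, list):
        # Array order is significant, so only recurse into the elements.
        return "[" + ",".join(render(item, rng, duplicate) for item in obj) + "]"
    return json.dumps(obj)


def relation_holds(text, rng, parse=json.loads, duplicate=False):
    """Check that the transformed input parses to the same value as the original."""
    original = parse(text)
    variant = render(original, rng, duplicate)
    return parse(variant) == original
```

    No expected output is needed: the oracle is the relation between the two parses, so any valid input produced by a grammar fuzzer can serve as a test.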