Learning Grammars without Samples

Rahul Gopinath

April 02, 2019

Transcript

  5. Structured Inputs are Everywhere

    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    <?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
      <soap:Body xmlns:m="http://www.stock.org/stock">
        <m:GetStockPrice>
          <m:StockName>IBM</m:StockName>
        </m:GetStockPrice>
      </soap:Body>
    </soap:Envelope>

    Layers: HTTP POST, XML PAYLOAD, SOAP, RPC Call
  6. We need grammars to reach the pot of gold

    HTTP -> HTTP Parser, XML -> XML Parser, SOAP -> SOAP Parser, RPC -> RPC Parser, then the Application
  7. A Naive Fuzzer

    $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ ! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*. \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 ! 5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X } EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0| Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g ka<W=Z.%T5WGHZpI30D<Pq>&]BS6R&j? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ} r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI) (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn, 0)G/6N-wyzj/MTd#A;r

    HTTP Parser -> XML Parser -> SOAP Parser -> RPC Parser -> Application
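A naive fuzzer of this kind can be sketched in a few lines of Python. This is not the `fuzzit.py` from the slide, whose source is not shown. The function names and the `looks_like_http_post` check are hypothetical stand-ins for the first parser in the chain.

```python
import random

def naive_fuzz(rng, max_length=100, char_start=32, char_range=95):
    """Return a random string of printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    return "".join(chr(rng.randrange(char_start, char_start + char_range))
                   for _ in range(length))

def looks_like_http_post(data):
    """Hypothetical first-stage check standing in for the HTTP parser."""
    return data.startswith("POST ")

rng = random.Random(0)   # fixed seed so runs are repeatable
samples = [naive_fuzz(rng) for _ in range(10_000)]
accepted = sum(looks_like_http_post(s) for s in samples)
print(f"{accepted} of {len(samples)} random inputs pass the first check")
```

Each random input has roughly a 1 in 95^5 chance of starting with `POST `, so almost nothing gets past the outermost parser, let alone to the application.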
  9. But where do we get grammars from?

    Sample Inputs -> Grammar: use sample inputs to dynamically mine the grammar (AUTOGRAM, ASE 2016)
  10. But where do we get sample inputs from?

    Sample Inputs? <- Grammar: if we had a grammar, we could use it to generate sample inputs. But we don't.
  11. Developer Produced Grammar?

    • Often out of sync with the program
    • Can result in blind spots
  12. State of the Art in Generating Inputs: KLEE, AFL and GLADE

    KLEE
    • Uses symbolic execution
    AFL
    • Uses coverage-guided fuzzing
    GLADE
    • Blackbox grammar recovery
  13. AFL

    • Uses (branch) coverage-guided fuzzing
    • Mutates sample inputs (if available)
    • The inputs generated are shallow and simple.
    • Very few valid inputs
    • Performance is affected by the complexity of the input space
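The coverage-guided loop described on this slide can be sketched as follows. This is a toy illustration, not AFL itself: `target` is a hypothetical program that reports which branches it took, whereas AFL gets coverage from compile-time instrumentation and uses many more mutation operators.

```python
import random

def target(data):
    """Toy program under test; returns the set of branches it took.
    Hypothetical stand-in for instrumented branch coverage."""
    branches = set()
    if len(data) > 0 and data[0] == "[":
        branches.add("open")
        if len(data) > 1 and data[1] == "1":
            branches.add("digit")
            if len(data) > 2 and data[2] == "]":
                branches.add("close")
    return branches

def mutate(rng, data):
    """Apply one random character-level mutation: insert, delete, or replace."""
    pos = rng.randrange(len(data) + 1)
    ch = chr(rng.randrange(32, 127))
    op = rng.choice(["insert", "delete", "replace"])
    if op == "insert" or not data:
        return data[:pos] + ch + data[pos:]
    pos = min(pos, len(data) - 1)
    if op == "delete":
        return data[:pos] + data[pos + 1:]
    return data[:pos] + ch + data[pos + 1:]

def coverage_guided_fuzz(seed_inputs, trials, rng):
    """Keep every mutated input that reaches a branch not seen before."""
    population = list(seed_inputs)
    seen = set()
    for s in population:
        seen |= target(s)
    for _ in range(trials):
        candidate = mutate(rng, rng.choice(population))
        covered = target(candidate)
        if not covered <= seen:       # new branch: keep it for further mutation
            seen |= covered
            population.append(candidate)
    return population, seen

population, seen = coverage_guided_fuzz([""], trials=5000, rng=random.Random(0))
print(sorted(seen), population)
```

Every new branch has to be found by a lucky single-character mutation, which is one way to see why deeper, more structured inputs become progressively harder to reach.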
  14. KLEE

    • Uses symbolic execution
    • Very fast at exploring simple languages
    • But suffers when the input space becomes complex
  15. Do we need full symbolic execution?

    pFuzzer sends the input x to the program. Result: ✘ (rejected). While rejecting it, the program
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  16. Do we need full symbolic execution?

    pFuzzer picks one of the compared values and sends [ to the program. The program makes the same checks as before:
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
    Result: succeeds but does not terminate.
  17. Do we need full symbolic execution?

    pFuzzer sends [z to the program. Result: [ is accepted, z is rejected (✘). For z, the program
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  18. Do we need full symbolic execution?

    pFuzzer sends [true to the program, replacing z with true. Result: succeeds but does not terminate.
  19. Do we need full symbolic execution?

    pFuzzer sends [true1 to the program. Result: [ and true are accepted, 1 is rejected (✘). For 1, the program
    – checks for ','
    – checks for ']'
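The walk-through above can be sketched as a small parser-directed fuzzer. This is a simplification under stated assumptions, not the pFuzzer implementation. `TracedInput` logs comparisons explicitly instead of using dynamic taint tracking, the toy parser accepts only digits, `true`, `false`, and bracketed lists, and "succeeds but does not terminate" is read here as "the parser ran out of input". All names are hypothetical.

```python
import random

DIGITS = "0123456789"

class Incomplete(Exception):
    """The input was fine so far, but the parser needed more characters."""

class Rejected(Exception):
    """The parser rejected the input at position `index`."""
    def __init__(self, index):
        super().__init__(index)
        self.index = index

class TracedInput:
    """Wraps the input and logs every value the parser compares against."""
    def __init__(self, text):
        self.text = text
        self.comparisons = {}                 # index -> expected strings

    def at_end(self, i):
        return i >= len(self.text)

    def one_of(self, i, chars):
        self.comparisons.setdefault(i, []).extend(chars)
        if self.at_end(i):
            raise Incomplete()
        return self.text[i] in chars

    def literal(self, i, word):
        self.comparisons.setdefault(i, []).append(word)
        if self.at_end(i):
            raise Incomplete()
        return self.text.startswith(word, i)

def parse_value(inp, i):
    """value := digit+ | 'true' | 'false' | '[' [value (',' value)*] ']'"""
    if inp.one_of(i, DIGITS):
        i += 1
        while not inp.at_end(i) and inp.one_of(i, DIGITS):
            i += 1
        return i
    for word in ("true", "false"):
        if inp.literal(i, word):
            return i + len(word)
    if inp.one_of(i, "["):
        i += 1
        if inp.one_of(i, "]"):
            return i + 1
        i = parse_value(inp, i)
        while inp.one_of(i, ","):
            i = parse_value(inp, i + 1)
        if inp.one_of(i, "]"):
            return i + 1
    raise Rejected(i)

def run(text):
    """Return (status, failing index or None, traced input)."""
    inp = TracedInput(text)
    try:
        end = parse_value(inp, 0)
        if not inp.at_end(end):
            raise Rejected(end)
        return "accepted", None, inp
    except Incomplete:
        return "incomplete", None, inp
    except Rejected as failure:
        return "rejected", failure.index, inp

def parser_directed_fuzz(rng, max_steps=200):
    text = chr(rng.randrange(32, 127))
    for _ in range(max_steps):
        status, index, inp = run(text)
        if status == "accepted":
            return text
        if status == "incomplete":            # parser wants more: append a character
            text += chr(rng.randrange(32, 127))
        else:                                 # substitute a value the parser compared against
            choices = inp.comparisons.get(index, [])
            text = text[:index] + (rng.choice(choices) if choices else "")
    return None                               # gave up within max_steps

print([parser_directed_fuzz(random.Random(seed)) for seed in range(5)])
```

The fuzzer never solves path constraints. It only needs to know which values the parser compared the rejected character against.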
  20. Comparing to KLEE and AFL

    The average number of tokens found by KLEE, AFL, and pFuzzer for each token length (Mathis, PLDI 2019)
  21. pFuzzer: Implementations in Python and C

    • Relies on dynamic taint tracking and character comparisons
    • Assumes availability of source (LLVM bitcode) at this point, but this is not a requirement.
  22. Longest Generated Input for JSON Parser

    [false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+| 4GzCW(C":-94}} ],[false,null]] ............................ nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

    Tools compared: AFL, KLEE, pFuzzer
    pFuzzer generates strings with better structure than either AFL or KLEE.
  23. Feeding the Results Back to the Grammar Miner

    [Diagram. Labels: Program Under Test; Parser-Directed Test Generator; Comparisons; Tests; Dynamic Taints; Grammar Learner; Test Inputs; Inputs + Equivalence Classes; Grammar Fuzzer]
  24. protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
            (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/') &&
            (spec.charAt(start + 2) == '/') &&
            (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) &&
            (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/')) {
          start += 2;
          i = spec.indexOf('/', start);
          if (i < 0) {
            i = spec.indexOf('?', start);
            if (i < 0)
              i = limit;
          }
          host = authority = spec.substring(start, i);
          int ind = authority.indexOf('@');
          if (ind != -1) {
            userInfo = authority.substring(0, ind);
            host = authority.substring(ind+1);
          } else
            userInfo = null;
          if (host != null) {
            if (host.length()>0 && (host.charAt(0) == '[')) {
              if ((ind = host.indexOf(']')) > 2) {
                String nhost = host ;
                host = nhost.substring(0,ind+1);
                port = -1 ;
                if (nhost.length() > ind+1) {
                  if (nhost.charAt(ind+1) == ':') {
                    ++ind ;
                    if (nhost.length() > (ind + 1))
                      port = Integer.parseInt(nhost.substring(ind+1));
                  }
                }
              }
            } else {
              ind = host.indexOf(':');
              port = -1;
              if (ind >= 0) {
                if (host.length() > (ind + 1)) {
                  port = Integer.parseInt(host.substring(ind + 1));
                }
                host = host.substring(0, ind);
              }
            }
          } else
            host = "";
          start = i;
          if (host == null) host = "";
          ...
          setURL(u, protocol, host, port, authority, userInfo, ...);

    Sample inputs:
    http://admin:[email protected]:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso

    Mined grammar:
    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
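Once such a grammar exists, producing new inputs from it is mechanical. The sketch below encodes the URL grammar from this slide as a Python dictionary and expands it at random. The `OPT_*` nonterminals and the `chars` helper are hypothetical devices for expressing the optional `[...]` parts and the `r{...}` regular-expression terminals, and the length bounds are arbitrary assumptions.

```python
import random
from urllib.parse import urlsplit

LOWER = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

def chars(alphabet, low, high):
    """Terminal generator for a class like r{[a-z]+}: low..high random characters."""
    return lambda rng: "".join(rng.choice(alphabet)
                               for _ in range(rng.randint(low, high)))

# Each nonterminal maps to a list of alternatives; each alternative is a
# sequence of nonterminals, literal strings, or terminal generators.
GRAMMAR = {
    "URL": [["PROTOCOL", "://", "AUTHORITY", "PATH", "OPT_QUERY", "OPT_REF"]],
    "AUTHORITY": [["OPT_USERINFO", "HOST", "OPT_PORT"]],
    "OPT_USERINFO": [[], ["USERINFO", "@"]],
    "OPT_PORT": [[], [":", "PORT"]],
    "OPT_QUERY": [[], ["?", "QUERY"]],
    "OPT_REF": [[], ["#", "REF"]],
    "PROTOCOL": [["http"], ["ftp"]],
    "USERINFO": [[chars(LOWER, 1, 8), ":", chars(LOWER + DIGITS, 1, 8)]],
    "HOST": [[chars(LOWER + ".", 1, 12)]],
    "PORT": [["80"]],
    "PATH": [["/", chars(LOWER + DIGITS + ".", 0, 10)]],
    "QUERY": [["foo=bar&lorem=ipsum"]],
    "REF": [[chars(LOWER, 1, 8)]],
}

def generate(symbol, rng):
    """Expand a symbol into a string by picking alternatives at random."""
    if callable(symbol):
        return symbol(rng)
    if symbol not in GRAMMAR:
        return symbol                      # literal terminal
    return "".join(generate(s, rng) for s in rng.choice(GRAMMAR[symbol]))

rng = random.Random(1)
for _ in range(3):
    url = generate("URL", rng)
    print(url, urlsplit(url).scheme)
```

Because `PORT` and `QUERY` were mined from only a few samples, every generated URL reuses `80` and `foo=bar&lorem=ipsum`: the grammar is only as general as the inputs it was learned from.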
  25. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    parseURL     :spec = http://admin:pass123@www.google.com:80
    setURL       :protocol = http, :host = www.google.com, :port = 80, :authority = admin:pass123
    setUserInfo  admin, pass123
  26. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    parseURL     :spec = http://admin:pass123@www.google.com:80
    setURL       :protocol = http, :host = www.google.com, :port = 80, :authority = admin:pass123
    setUserInfo  admin, pass123

    parseURL     :spec = ftp://boo:12345@example.ftp.com
    setURL       :protocol = ftp, :host = example.ftp.com, :authority = boo:12345
    setUserInfo  boo, 12345
  27. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    parseURL     :spec (SPEC)
    setURL       :protocol = http | ftp, :host = www.google.com | example.ftp.com, :port = [80], :authority (AUTHORITY)
    setUserInfo  admin | boo, pass123 | 12345

    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z]+}
    PORT ::= r{[0-9]+}
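The mining step can be approximated in Python to show the idea: find the input fragments that reach named parameters, replace them with nonterminals, and merge the observations from several runs. This is a rough sketch, not AUTOGRAM. `observe` uses `urllib.parse.urlsplit` and plain substring matching as a stand-in for dynamic tainting, the result is flat (there is no nested `AUTHORITY` rule), and all function names are hypothetical.

```python
from urllib.parse import urlsplit

def observe(spec):
    """Hypothetical stand-in for dynamic tainting: the input fragments that
    would reach the setURL/setUserInfo parameters when `spec` is parsed."""
    parts = urlsplit(spec)
    return {
        "PROTOCOL": parts.scheme,
        "USER": parts.username,
        "PASSWORD": parts.password,
        "HOST": parts.hostname,
        "PORT": None if parts.port is None else str(parts.port),
    }

def derive_rule(spec, values):
    """Replace each observed fragment inside the input with its nonterminal."""
    rule = spec
    # Longest fragments first, so a short value cannot clobber part of a longer one.
    for name, value in sorted(values.items(), key=lambda kv: -len(kv[1] or "")):
        if value:
            rule = rule.replace(value, f"<{name}>", 1)
    return rule

def mine(specs):
    """Merge the rules and per-nonterminal alternatives seen across all inputs."""
    rules, alternatives = set(), {}
    for spec in specs:
        values = observe(spec)
        rules.add(derive_rule(spec, values))
        for name, value in values.items():
            if value:
                alternatives.setdefault(name, set()).add(value)
    return rules, alternatives

rules, alternatives = mine([
    "http://admin:pass123@www.google.com:80",
    "ftp://boo:12345@example.ftp.com",
])
for rule in sorted(rules):
    print("SPEC ::=", rule)
for name, values in sorted(alternatives.items()):
    print(name, "::=", " | ".join(sorted(values)))
```

The two rule variants differ only in the trailing `:<PORT>`, which corresponds to the optional `[':' PORT]` in the grammar on the slide. Generalizing the literal alternatives into regular expressions such as `r{[a-z]+}` is a further step that this sketch leaves out.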
  28. AUTOGRAM

    • Requires dynamic tainting to track input fragments
    • Works well for recursive-descent parsers
    • Provides strict guarantees on the grammar produced.
    • A generated string from the inferred grammar is guaranteed to be a valid input (under certain assumptions).
    • Relatively heavyweight, and language/runtime specific
    • JVM/LLVM for now
    • Problems with implicit taint propagation and internal calls
  29. What can you do with the grammar?

    • An abstract representation of a set of executions
    • Or of the program itself, if the set of executions is representative
    • Language agnostic, and low implementation complexity
    • Can be used to identify behavioral changes
    • E.g. duplicates in mutation analysis
    • Fingerprinting programs
    • Clone detection
    • Refactoring
  30. Better Oracles

    • Taint tracking
    • Concolic execution
    • Metamorphic testing
    • Reorder elements
    • Delete/Insert/Duplicate idempotent elements
    • Differential testing
    • Carving unit tests
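The "reorder elements" relation can be turned into a metamorphic oracle without knowing the expected output. The sketch below assumes JSON objects as the input format. `parse` stands for the program under test, and `order_sensitive` is a deliberately flawed, hypothetical parser used only to show the relation being violated.

```python
import json

def reorder_members(text):
    """Follow-up input: the same JSON object with its members in reverse order."""
    items = list(json.loads(text).items())
    return json.dumps(dict(reversed(items)))

def reorder_relation_holds(parse, text):
    """Metamorphic check: reordering object members must not change the result."""
    return parse(text) == parse(reorder_members(text))

def order_sensitive(text):
    """Hypothetical buggy parser whose result depends on member order."""
    return list(json.loads(text))          # keys in input order

source = '{"a": 1, "b": [true, null]}'
print(reorder_relation_holds(json.loads, source))
print(reorder_relation_holds(order_sensitive, source))
```

A grammar makes this kind of oracle easier to apply, because it identifies which parts of an input are elements that can be reordered, deleted, inserted, or duplicated.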