Learning Grammars without Samples

Rahul Gopinath

April 02, 2019

Transcript

  1. Learning Grammars Without Samples

    Rahul Gopinath, Postdoctoral Researcher, CISPA Helmholtz Center for Information Security
  2. Why Learn a Grammar?

  3. Structured Inputs are Everywhere

    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    <?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
      <soap:Body xmlns:m="http://www.stock.org/stock">
        <m:GetStockPrice>
          <m:StockName>IBM</m:StockName>
        </m:GetStockPrice>
      </soap:Body>
    </soap:Envelope>
  4. Structured Inputs are Everywhere

    The same request as on slide 3, now labelled: HTTP POST
  5. Structured Inputs are Everywhere

    The same request as on slide 3, now labelled: HTTP POST, XML PAYLOAD
  6. Structured Inputs are Everywhere

    The same request as on slide 3, now labelled: HTTP POST, XML PAYLOAD, SOAP
  7. Structured Inputs are Everywhere

    The same request as on slide 3, now labelled: HTTP POST, XML PAYLOAD, SOAP, RPC Call
  8. We need grammars to reach the pot of gold

    Input layers and the parsers they must pass: HTTP (HTTP Parser), XML (XML Parser), SOAP (SOAP Parser), RPC (RPC Parser), and finally the Application
  9. A Naive Fuzzer

    $ ./fuzzit.py
    [;x1-GPZ+wcckc];,N9J+?#6^6\e?]9lu2_%'4GX"0VUB[E/r~fApu6b8<{%siq8Zh.6 ... (several more lines of random characters)

    Pipeline shown: HTTP Parser, XML Parser, SOAP Parser, RPC Parser, Application
  10. A Naive Fuzzer

    The same random output as on slide 9, shown against the same pipeline: HTTP Parser, XML Parser, SOAP Parser, RPC Parser, Application
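
The naive fuzzer on slides 9 and 10 can be approximated in a few lines. This is a minimal sketch, not the `fuzzit.py` from the talk: the names `naive_fuzz` and `passes_first_parser` and the alphabet are assumptions, and Python's `json.loads` stands in for the first parser in the pipeline.

```python
import json
import random
import string

# Printable, non-whitespace ASCII; the alphabet is an assumption of this sketch.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def naive_fuzz(rng, max_length=100):
    """Return a random string of up to max_length characters from ALPHABET."""
    length = rng.randint(0, max_length)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def passes_first_parser(text):
    """Stand-in for the first parser in the pipeline: does json.loads accept text?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


rng = random.Random(2019)  # fixed seed so runs are repeatable
samples = [naive_fuzz(rng) for _ in range(1000)]
accepted = [s for s in samples if passes_first_parser(s)]
print(f"{len(accepted)} of {len(samples)} random inputs got past the parser")
```

Counting how many samples survive even a single parser gives a rough feel for why random characters rarely reach the layers behind it.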
  11. What we need is the Input Specification!

  12. If you already have the Specification:

    • JSFunFuzz
    • GramFuzz
    • LangFuzz
    • CSMITH
  13. But, where do we get grammars from?

  14. But, where do we get grammars from?

    Sample Inputs → Grammar: use sample inputs to dynamically mine the grammar (AUTOGRAM, ASE 2016)
  15. But, where do we get sample inputs from?

    Sample Inputs? ← Grammar: if we had a grammar, we could use it to generate sample inputs. But we don't.
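
To make the "grammar generates sample inputs" direction on slide 15 concrete, here is a minimal sketch of a grammar-based generator. The toy JSON-subset grammar and the names `GRAMMAR` and `generate` are assumptions made for illustration; they are not taken from the talk or from any of the tools on slide 12.

```python
import random

# Hypothetical toy grammar for a JSON subset. Keys are nonterminals; each
# alternative is a list of symbols, and any symbol that is not a key is literal text.
GRAMMAR = {
    "<value>": [["<number>"], ["true"], ["false"], ["null"],
                ["[", "]"], ["[", "<elements>", "]"]],
    "<elements>": [["<value>"], ["<value>", ",", "<elements>"]],
    "<number>": [["<digit>"], ["<nonzero>", "<digits>"]],
    "<digits>": [["<digit>"], ["<digit>", "<digits>"]],
    "<digit>": [[d] for d in "0123456789"],
    "<nonzero>": [[d] for d in "123456789"],
}


def generate(symbol, rng, depth=0, max_depth=8):
    """Expand symbol into a string by choosing alternatives at random."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth budget, take the shortest alternative. This stops the
        # recursion for this grammar; it is not a general termination argument.
        alternatives = [min(alternatives, key=len)]
    chosen = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in chosen)


rng = random.Random(0)
for _ in range(5):
    print(generate("<value>", rng))
```

Every string produced this way is valid with respect to the grammar by construction, which is why the missing grammar is the bottleneck.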
  16. Developer Produced Grammar?

    • Often out of sync with the program
    • Can result in blind spots
  17. State of the Art in Generating Inputs: KLEE, AFL and GLADE

    • KLEE: uses symbolic execution
    • AFL: uses coverage-guided fuzzing
    • GLADE: black-box grammar recovery
  18. GLADE:

    • Black-box technique
    • Very slow to generate meaningful grammars
  19. AFL:

    • Uses (branch) coverage-guided fuzzing
    • Mutates sample inputs (if available)
    • The inputs generated are shallow and simple
    • Very few valid inputs
    • Performance is affected by the complexity of the input space
  20. KLEE:

    • Uses symbolic execution
    • Very fast at exploring simple languages
    • But suffers when the input space becomes complex
  21. Do we need full symbolic execution?

    Idea: solve only for the next character (pFuzzer).
  22. Do we need full symbolic execution?

    pFuzzer → Program: x (status: ✘)
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  23. Do we need full symbolic execution?

    pFuzzer → Program: x
    Replace x with a digit
  24. Do we need full symbolic execution?

    pFuzzer → Program: 0
  25. Do we need full symbolic execution?

    pFuzzer → Program: [ (status: -)
    Succeeds but does not terminate
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  26. Do we need full symbolic execution?

    pFuzzer → Program: [z (status: -✘)
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
  27. Do we need full symbolic execution?

    pFuzzer → Program: [true (status: --)
    Replaces z with true; succeeds but does not terminate
  28. Do we need full symbolic execution?

    pFuzzer → Program: [true1 (status: --✘)
    – checks for ','
    – checks for ']'
  29. Do we need full symbolic execution?

    pFuzzer → Program: [true] (status: --✔)
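
The walk-through on slides 22 to 29 can be sketched as a small loop: run the parser, look at what the last character was compared against, and either substitute one of those values or append a new character. The code below is a minimal sketch under stated assumptions, not the pFuzzer implementation. The toy parser, the `TracedInput` wrapper that records comparisons explicitly (in place of dynamic taint tracking), and the names `parse_value`, `accepts` and `pfuzz` are all hypothetical.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation
DIGITS = "0123456789"


class Incomplete(Exception):
    """The parser reached the end of input while still expecting more."""


class Rejected(Exception):
    """The parser found a character it cannot accept."""


class TracedInput:
    """Input wrapper that records every failed comparison, per index."""

    def __init__(self, text):
        self.text = text
        self.comparisons = []  # (index, text that would have matched from there)

    def startswith(self, pos, expected):
        for offset, ch in enumerate(expected):
            index = pos + offset
            if index >= len(self.text):
                raise Incomplete()
            if self.text[index] != ch:
                self.comparisons.append((index, expected[offset:]))
                return False
        return True


def parse_value(inp, pos):
    """Toy recursive-descent parser: digits, true, false, and nested arrays."""
    if any(inp.startswith(pos, d) for d in DIGITS):
        pos += 1
        while pos < len(inp.text) and any(inp.startswith(pos, d) for d in DIGITS):
            pos += 1
        return pos
    for word in ("true", "false"):
        if inp.startswith(pos, word):
            return pos + len(word)
    if inp.startswith(pos, "["):
        pos += 1
        if inp.startswith(pos, "]"):
            return pos + 1
        while True:
            pos = parse_value(inp, pos)
            if inp.startswith(pos, "]"):
                return pos + 1
            if not inp.startswith(pos, ","):
                raise Rejected()
            pos += 1
    raise Rejected()


def accepts(text):
    """True if the toy parser consumes all of text."""
    try:
        return parse_value(TracedInput(text), 0) == len(text)
    except (Incomplete, Rejected):
        return False


def pfuzz(rng, max_steps=200):
    """Grow one accepted input, fixing only the last character at each step."""
    text = rng.choice(ALPHABET)
    for _ in range(max_steps):
        inp = TracedInput(text)
        try:
            if parse_value(inp, 0) == len(text):
                return text  # accepted and complete
            raise Rejected()  # trailing characters after a complete value
        except Incomplete:
            text += rng.choice(ALPHABET)  # valid prefix: try one more character
        except Rejected:
            last = len(text) - 1
            fixes = [exp for index, exp in inp.comparisons if index == last]
            text = text[:last] + rng.choice(fixes or list(ALPHABET))
    return None  # step budget exhausted


print(pfuzz(random.Random(3)))
```

For the input `[true1`, this toy parser records comparisons against `]` and `,` at the last index, mirroring slide 28; substituting `]` yields `[true]`. Unlike this sketch, which stops at the first accepted input, a real tool would keep exploring.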
  30. Comparing to KLEE and AFL

    The average number of tokens found by KLEE, AFL and pFuzzer for each token length (Mathis 2019 PLDI)
  31. pFuzzer: Implementations in Python and C

    • Relies on dynamic taint tracking and character comparisons
    • Assumes availability of source (LLVM bitcode) at this point, but this is not a requirement
  32. DEMO

  33. Longest Generated Input for JSON Parser

    Inputs shown on the slide (tools labelled: AFL, KLEE, pFuzzer):
    [false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+| 4GzCW(C":-94}} ],[false,null]]
    ............................
    nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

    pFuzzer generates strings with better structure than either AFL or KLEE.
  34. Feeding the Results Back to the Grammar Miner

    Diagram components: Program Under Test, Parser-Directed Test Generator, Grammar Learner, Grammar Fuzzer. Edge labels: Comparisons, Tests, Dynamic Taints, Test Inputs, Inputs + Equivalence Classes.
  35. AUTOGRAM: Grammar from Samples

  36. Parser code, sample inputs, and the resulting grammar

    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
                (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/') &&
                (spec.charAt(start + 2) == '/') &&
                (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) && (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/')) {
            start += 2;
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0)
                    i = limit;
            }
            host = authority = spec.substring(start, i);
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind + 1);
            } else
                userInfo = null;
            if (host != null) {
                if (host.length() > 0 && (host.charAt(0) == '[')) {
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host;
                        host = nhost.substring(0, ind + 1);
                        port = -1;
                        if (nhost.length() > ind + 1) {
                            if (nhost.charAt(ind + 1) == ':') {
                                ++ind;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind + 1));
                            }
                        }
                    }
                } else {
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else
                host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);

    Sample inputs:
    http://admin:pass123@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:12345@ftp.example.com/oss/debian7.iso

    Grammar:
    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
  37. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    parseURL  :spec = http://admin:pass123@www.google.com:80
    setURL  :protocol = http, :authority = admin:pass123, :port = 80, :host = www.google.com
    setUserInfo  admin, pass123
  38. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    First input:
    parseURL  :spec = http://admin:pass123@www.google.com:80
    setURL  :protocol = http, :authority = admin:pass123, :port = 80, :host = www.google.com
    setUserInfo  admin, pass123

    Second input:
    parseURL  :spec = ftp://boo:12345@ftp.example.com
    setURL  :protocol = ftp, :authority = boo:12345, :host = ftp.example.com
    setUserInfo  boo, 12345
  39. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    Values merged over both inputs:
    setURL  :protocol = http | ftp, :port = [80], :host = www.google.com | ftp.example.com
    setUserInfo  admin | boo, pass123 | 12345

    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z]+}
    PORT ::= r{[0-9]+}
  40. AUTOGRAM

    • Requires dynamic tainting to track input fragments
    • Works well for recursive-descent parsers
    • Provides strict guarantees on the grammar produced
    • A string generated from the inferred grammar is guaranteed to be a valid input (under certain assumptions)
    • Relatively heavyweight, and language/runtime specific
    • JVM/LLVM for now
    • Problems with implicit taint propagation and internal calls
  41. What if we relax our constraints? Is taint tracking strictly needed?

  42. DEMO: fuzzingbook: GrammarMining
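
One way to read the question on slide 41 is to replace taint tracking with plain substring matching: if a value seen at a call also occurs in the input, treat it as derived from the input. The following is a minimal sketch of that idea, not the AUTOGRAM or fuzzingbook implementation. The observations are written by hand instead of being recorded from a running program, the parameter names follow the grammar on slide 36, and `Ref`, `substitute`, `mine`, `merge` and `render` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Reference to a nonterminal; plain str tokens are terminals."""
    name: str


def substitute(alternative, value, ref):
    """Replace the first occurrence of value inside a terminal token with ref."""
    for i, token in enumerate(alternative):
        if isinstance(token, str) and value in token:
            before, _, after = token.partition(value)
            pieces = [p for p in (before, ref, after) if p]
            return alternative[:i] + pieces + alternative[i + 1:], True
    return alternative, False


def mine(sample, observations):
    """Derive one rule per observed value, outermost values first."""
    rules = {"start": [sample]}
    for name, value in observations:
        if not value:
            continue  # an empty value cannot be located in the input
        for rule_name in list(rules):
            updated, found = substitute(rules[rule_name], value, Ref(name))
            if found:
                rules[rule_name] = updated
                rules[name] = [value]
                break
    return rules


def merge(per_sample_rules):
    """Collect the distinct alternatives seen for each nonterminal."""
    merged = {}
    for rules in per_sample_rules:
        for name, alternative in rules.items():
            alternatives = merged.setdefault(name, [])
            if alternative not in alternatives:
                alternatives.append(alternative)
    return merged


def render(merged):
    """Format the merged rules in a BNF-like notation."""
    lines = []
    for name, alternatives in merged.items():
        shown = " | ".join(
            " ".join(f"<{t.name}>" if isinstance(t, Ref) else repr(t) for t in alt)
            for alt in alternatives)
        lines.append(f"<{name}> ::= {shown}")
    return "\n".join(lines)


# Hypothetical observations: (parameter name, value seen at a call), outer calls first.
SAMPLES = [
    ("http://admin:pass123@www.google.com:80",
     [("protocol", "http"), ("authority", "admin:pass123@www.google.com:80"),
      ("userinfo", "admin:pass123"), ("host", "www.google.com"), ("port", "80")]),
    ("ftp://bob:12345@ftp.example.com",
     [("protocol", "ftp"), ("authority", "bob:12345@ftp.example.com"),
      ("userinfo", "bob:12345"), ("host", "ftp.example.com")]),
]

grammar = merge(mine(sample, seen) for sample, seen in SAMPLES)
print(render(grammar))
```

Substring matching is cheaper than taint tracking but can mislabel fragments when the same text occurs in the input by coincidence, which is the trade-off the slide asks about.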

  43. What can you do with the grammar?

    • An abstract representation of a set of executions
    • Or of the program itself, if the set of executions is representative
    • Language agnostic, with low implementation complexity
    • Can be used to identify behavioral changes
    • E.g. duplicates in mutation analysis
    • Fingerprinting programs
    • Clone detection
    • Refactoring
  44. Is input specification sufficient? How can we make fuzzing better?

  45. Better Oracles

    • Taint Tracking
    • Concolic Execution
    • Metamorphic testing
    • Reorder elements
    • Delete/Insert/Duplicate idempotent elements
    • Differential testing
    • Carving unit tests
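
As an illustration of the "Reorder elements" item on slide 45, read here as a metamorphic relation, the sketch below checks that a JSON parser returns the same result when the members of an object are shuffled. The helper names `reorder_members` and `reordering_holds` and the sample document are assumptions for this example; Python's `json.loads` plays the program under test.

```python
import json
import random


def reorder_members(text, rng):
    """Return a JSON document equal to text up to the order of top-level members."""
    members = list(json.loads(text).items())
    rng.shuffle(members)
    return json.dumps(dict(members))


def reordering_holds(parse, text, rng, rounds=20):
    """Metamorphic oracle: parse must give the same result for every reordering."""
    expected = parse(text)
    return all(parse(reorder_members(text, rng)) == expected for _ in range(rounds))


# Hypothetical input document, loosely modelled on the stock request from slide 3.
document = '{"StockName": "IBM", "Price": 134, "Tags": ["a", "b"]}'
print(reordering_holds(json.loads, document, random.Random(0)))
```

The oracle needs no expected output for any single input; it only compares runs against each other, which is what makes it usable with generated inputs.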