Look Ma No Hands: Learning Input Grammar without Inputs


Rahul Gopinath

June 12, 2018

Transcript

  1. Rahul Gopinath, Postdoctoral Researcher. Look Ma No Hands: Learning Input Grammar without Inputs
  2. Why learn input grammars?

  4. POST /InStock HTTP/1.1

    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    <?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
      <soap:Body xmlns:m="http://www.stock.org/stock">
        <m:GetStockPrice>
          <m:StockName>IBM</m:StockName>
        </m:GetStockPrice>
      </soap:Body>
    </soap:Envelope>
  5. Same request, annotated with its layers: HTTP POST

  6. Same request, annotated with its layers: HTTP POST, XML PAYLOAD

  7. Same request, annotated with its layers: HTTP POST, XML PAYLOAD, SOAP

  8. Same request, annotated with its layers: HTTP POST, XML PAYLOAD, SOAP, RPC Call
  9. HTTP POST | XML PAYLOAD | SOAP | RPC Call

    HTTP Parser -> XML Parser -> SOAP Parser -> RPC Parser -> Application
  10. HTTP Parser -> XML Parser -> SOAP Parser -> RPC Parser -> Application (Target)
  11. $ ./fuzzit.py [ ; x 1 - G P Z

    + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ ! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*. \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 ! 5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X } EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0| Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g ka<W=Z.%T5WGHZpI30D<Pq>&]BS6R&j? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ} r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI) (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn, 0)G/6N-wyzj/MTd#A;r A Naive Fuzzer HTTP Parser XML Parser SOAP Parser RPC Parser Application 10
  12. A Naive Fuzzer (same ./fuzzit.py output as on the previous slide)

    HTTP Parser -> XML Parser -> SOAP Parser -> RPC Parser -> Application
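
The naive fuzzer on these slides can be approximated in a few lines. The sketch below is not the `fuzzit.py` from the talk; `naive_fuzz` and `count_accepted` are hypothetical names, and `json.loads` merely stands in for the first parser in the chain. It illustrates why purely random strings rarely get past even one parsing layer.

```python
import json
import random
import string


def naive_fuzz(rng, max_length=100):
    """Return a string of random printable characters of random length."""
    length = rng.randint(0, max_length)
    return "".join(rng.choice(string.printable) for _ in range(length))


def count_accepted(parser, trials, seed=0):
    """Feed `trials` random strings to `parser` and count how many it accepts."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        try:
            parser(naive_fuzz(rng))
            accepted += 1
        except ValueError:
            # json.loads reports syntax errors with a ValueError subclass.
            pass
    return accepted


if __name__ == "__main__":
    trials = 10000
    print(count_accepted(json.loads, trials), "of", trials, "random inputs accepted")
```

Only the few strings that happen to be a bare number or similar are accepted; anything behind the first parser is essentially never exercised.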
  13. What we need is the Input Grammar!

  14. If you already have the grammar: 1. JSFunFuzz 2. GramFuzz 3. LangFuzz
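
When a grammar is available, generating valid inputs is a matter of expanding it at random. The sketch below shows the general idea only; it is not how JSFunFuzz, GramFuzz, or LangFuzz are implemented. The arithmetic grammar, the `GRAMMAR` dictionary layout, and the `expand` function are assumptions made for illustration.

```python
import random

# A small arithmetic grammar: nonterminal -> list of alternatives,
# each alternative being a list of symbols.
GRAMMAR = {
    "<expr>": [["<term>"], ["<term>", "+", "<expr>"], ["<term>", "-", "<expr>"]],
    "<term>": [["<number>"], ["(", "<expr>", ")"]],
    "<number>": [["<digit>"], ["<digit>", "<number>"]],
    "<digit>": [[d] for d in "0123456789"],
}


def expand(symbol, rng, depth=0, max_depth=10):
    """Randomly expand `symbol` into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth budget, only pick the shortest alternatives.
        # For this particular grammar that guarantees termination.
        shortest = min(len(alt) for alt in alternatives)
        alternatives = [alt for alt in alternatives if len(alt) == shortest]
    chosen = rng.choice(alternatives)
    return "".join(expand(s, rng, depth + 1, max_depth) for s in chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(expand("<expr>", rng))
```

Every string produced this way is syntactically valid, which is what lets grammar-based fuzzers reach the application behind the parsers.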
  15. What if you don't have the grammar?

  16. State of the art: AFL, Glade [PLDI 2017]

  17. AFL Fuzz

    • Mutate sample inputs (if available)
    • Branch coverage directed
  18. AFL Fuzz

    Parser        Time (sec)   Stmt Coverage
    JSON Parser   7942         36 (49) %
    MathExpr      77           63 (77) %
    URLParser     14           56 (62) %

    • Few valid inputs produced
    • Doesn't explore the input space very well
    • Performance is affected by the complexity of the grammar
  19. KLEE

    • Explore input space symbolically
    • Very fast to explore simple input languages

    Parser      Time (sec)   Stmt Coverage
    MathExpr    5.25         99 (99) %
    URLParser   0.58         98 (99) %

    * compared with equivalent C programs
  20. KLEE

    • Explore input space symbolically
    • Performance suffers with even slightly complex grammars

    Parser        Time (sec)   Stmt Coverage
    MathExpr      5.25         99 (99) %
    URLParser     0.58         98 (99) %
    JSON Parser   14617        31 (31) %

    * compared with equivalent C programs
  21. AUTOGRAM: Context-free grammar from samples

  22. AUTOGRAM

    http://admin:pass123@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:12345@ftp.example.com/oss/debian7.iso

  23. AUTOGRAM

    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
            (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/') &&
            (spec.charAt(start + 2) == '/') &&
            (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) &&
            (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/')) {
            start += 2;
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0)
                    i = limit;
            }
            host = authority = spec.substring(start, i);
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind+1);
            } else
                userInfo = null;
            if (host != null) {
                if (host.length()>0 && (host.charAt(0) == '[')) {
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host ;
                        host = nhost.substring(0,ind+1);
                        port = -1 ;
                        if (nhost.length() > ind+1) {
                            if (nhost.charAt(ind+1) == ':') {
                                ++ind ;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind+1));
                            }
                        }
                    }
                } else {
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else
                host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);
  24. AUTOGRAM

    http://admin:pass123@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:12345@ftp.example.com/oss/debian7.iso

    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
  25. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    Input: http://admin:pass123@www.google.com:80
  26. parseURL: :spec = http://admin:pass123@www.google.com:80

  27. setURL: :protocol = http, :authority = admin:pass123, :host = www.google.com, :port = 80

  28. setUserInfo: admin, pass123
  29. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    Sample 1: :spec = http://admin:pass123@www.google.com:80
    setURL: :protocol = http, :authority = admin:pass123, :host = www.google.com, :port = 80
    setUserInfo: admin, pass123

    Sample 2: :spec = ftp://boo:12345@ftp.example.com
    setURL: :protocol = ftp, :authority = boo:12345, :host = ftp.example.com
    setUserInfo: boo, 12345
  30. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    Both samples merged, per parameter:
    :spec (SPEC)
    :protocol = http | ftp
    :authority (AUTHORITY)
    :host = www.google.com | ftp.example.com
    :port = [80]
    setUserInfo: admin | boo, pass123 | 12345
  31. parseURL(spec) -> setURL(protocol, host, port, authority,…) -> setUserInfo(user, password)

    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z]+}
    PORT ::= r{[0-9]+}
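
The idea on these slides, following fragments of the input into function parameters, can be sketched in Python. This is not AUTOGRAM's implementation (AUTOGRAM uses dynamic tainting on Java programs); the sketch approximates tainting by plain substring matching, and `parse_url`, `set_url`, `set_user_info`, and `trace_fragments` are hypothetical names modelled on the call chain above.

```python
import sys


def parse_url(spec):
    # Toy parser for inputs of the form protocol://user:password@host[:port]
    protocol, rest = spec.split("://", 1)
    authority, hostport = rest.split("@", 1)
    host, _, port = hostport.partition(":")
    set_url(protocol, authority, host, port)


def set_url(protocol, authority, host, port):
    user, password = authority.split(":", 1)
    set_user_info(user, password)


def set_user_info(user, password):
    pass


def trace_fragments(func, text):
    """Run func(text); record (function, parameter, value) for every string
    argument that is a non-empty substring of the input."""
    seen = []

    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            for name in code.co_varnames[:code.co_argcount]:
                value = frame.f_locals.get(name)
                if isinstance(value, str) and value and value in text:
                    seen.append((code.co_name, name, value))
        return None  # no line-level tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(text)
    finally:
        sys.settrace(previous)
    return seen


if __name__ == "__main__":
    for entry in trace_fragments(parse_url, "http://admin:pass123@www.google.com:80"):
        print(entry)
```

Each recorded triple says which parameter received which part of the input, and the parameter names then serve as candidate nonterminals. Substring matching can mislabel fragments that occur more than once in the input, which is one reason real implementations track taints instead.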
  32. But: We still need samples

  33. AUTOGRAM: We still need samples

    • Could result in a grammar with blind spots
  34. AUTOGRAM: We still need samples. Symbolic execution is unscalable for complex parsers

  35. AUTOGRAM: Symbolic execution is unscalable for complex parsers. Do we need full constraint solving?
  36. Idea! Solve only the next character

  37. PyChains: Start with an empty string

    input = ""
    EOF? Yes -> Reject! | No
  38. PyChains: Fix the problem with a random character

    input = "x"
    EOF? Yes -> Reject! | No
  39. PyChains: Fix the problem with a random character

    input = "x"
    !EOF(input[0:]): Yes
    isDigit(input[0]) | input[0] in ['+', '-'] | input[0] == '(' | else -> Reject!
  40. PyChains: Fix the problem with the choice "("

    input = "("
    !EOF(input[0:]): Yes
    isDigit(input[0]) | input[0] in ['+', '-'] | input[0] == '(' | else -> Reject!
  41. PyChains: Continue with the next character

    input = "("
    !EOF(input[0:]): Yes
    isDigit(input[0]) | input[0] in ['+', '-'] | input[0] == '(' | else -> Reject!
    !EOF(input[1:])
  42. PyChains: Continue with the next character

    input = "(y"
    !EOF(input[0:]): Yes
    isDigit(input[0]) | input[0] in ['+', '-'] | input[0] == '(' | else -> Reject!
    !EOF(input[1:])
    isDigit(input[1]) | input[1] in ['+', '-'] | input[1] == "(" | input[1] == ")" | else -> Reject!
  43. PyChains

    input = "(1+2)"
    isDigit(input[0]) | input[0] in ['+', '-'] | input[0] == '(' | else
    isDigit(input[1]) | input[1] in ['+', '-'] | input[1] == "(" | input[1] == ")"
    isDigit(input[2]) | input[2] in ['+', '-'] | input[2] == "(" | input[2] == ")"
    isDigit(input[3]) | input[3] in ['+', '-'] | input[3] == "(" | input[3] == ")"
    isDigit(input[4]) | input[4] in ['+', '-'] | input[4] == "(" | input[4] == ")"
    Accept!
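
The walkthrough above can be condensed into a small loop: run the parser, and if it rejects at end of input, append a random character; if it rejects the last character, replace that character with one the parser compared it against. The following is a minimal sketch under stated assumptions, not the PyChains implementation: instead of dynamic taint tracking it uses an explicit `TracedInput.is_in` helper to record comparisons, and the parser, `TracedInput`, and `generate` are hypothetical names.

```python
import random
import string

DIGITS = "0123456789"


class ParseError(Exception):
    """Raised when the parser rejects its input."""


class TracedInput:
    """Input wrapper that records which characters each index was compared to."""

    def __init__(self, text):
        self.text = text
        self.comparisons = {}  # index -> characters the parser checked for
        self.hit_eof = False   # set when the parser looks past the end

    def is_in(self, i, chars):
        if i >= len(self.text):
            self.hit_eof = True
            return False
        self.comparisons.setdefault(i, []).extend(chars)
        return self.text[i] in chars


def parse_term(inp, i):
    if inp.is_in(i, DIGITS):
        while inp.is_in(i, DIGITS):
            i += 1
        return i
    if inp.is_in(i, "+-"):
        return parse_term(inp, i + 1)
    if inp.is_in(i, "("):
        i = parse_expr(inp, i + 1)
        if not inp.is_in(i, ")"):
            raise ParseError(i)
        return i + 1
    raise ParseError(i)


def parse_expr(inp, i):
    i = parse_term(inp, i)
    while inp.is_in(i, "+-"):
        i = parse_term(inp, i + 1)
    return i


def parse(inp):
    if parse_expr(inp, 0) != len(inp.text):
        raise ParseError("trailing input")


def generate(rng, max_steps=500):
    """Grow an accepted input one character at a time, without any samples."""
    text = ""
    for _ in range(max_steps):
        inp = TracedInput(text)
        try:
            parse(inp)
            return text  # Accept!
        except ParseError:
            if inp.hit_eof:
                # The prefix is fine but incomplete: try a random next character.
                text += rng.choice(string.printable)
            else:
                # The last character was rejected: substitute one of the
                # characters the parser compared it against.
                options = inp.comparisons.get(len(text) - 1) or string.printable
                text = text[:-1] + rng.choice(options)
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    print([generate(rng) for _ in range(5)])
```

This sketch returns the first accepted string, so its outputs tend to be short; continuing past an accepting state with some probability would yield longer inputs.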
  44. PyChains

    • Relies on:
      • Dynamic taint tracking
      • Tracing character comparisons
  45. PyChains

    • Faster for complex input languages

    Parser        Time (sec)   Stmt Coverage
    JSON Parser   1713         100 (44) %
    MathExpr      122          99 (62) %
    URLParser     1665         100 (56) %
  46. Limitations

    • Not as fast as naive fuzzers (considering the number of inputs produced)
  47. Limitations

    • Problems with mezzanine validations (secondary validations in the current layer)

        class ParseException(Exception):
            pass

        def parse_num(input):
            # Accumulate digits and dots; validity is only checked later by float().
            i = 0
            while i < len(input) and (input[i].isdigit() or input[i] == '.'):
                i = i + 1
            return input[:i], input[i:]

        def parse_arithmetic(input):
            value1, rest = parse_num(input)
            if rest[:1] not in ['+', '-']:
                raise ParseException(rest)
            op = rest[0]
            value2, rest = parse_num(rest[1:])
            if rest != '':
                raise ParseException(rest)
            # Mezzanine validation: float() rejects strings such as '10.0.1'.
            return (op, float(value1), float(value2))

        parse_arithmetic('10.0.1+1')   ValueError 'Invalid Int'
        parse_arithmetic('99+1')       (+,99,1)
        parse_arithmetic('2.1-3')      (-,2.1,3)
  48. Limitations

    • Problems with mezzanine validations
    • Solution: Throw out accumulated characters from the point of secondary validation, and start again.

    10.0.1+1 -> ValueError 'Invalid Int'
    10.0? ... 10.05345+563.334
    Inefficient!
  49. Fix for speed: PyChains | PyGram | Fuzz

    Grammar Inference Engine: PyGram (Infer Grammar)
    Sample inputs; Generated inputs
  50. Mezzanine Validations: PyChains | PyGram

    Partial prefixes; Partial decomposition of input
  51. Mezzanine Validations

    :spec = http://[ffcc:xxx/mypath?a=b
    :protocol = http, :host = [ffcc:xxx, :path = /mypath?a=b
    Mezzanine validation: if host[0] == '[': validateIPv6(host)

    Generate new host string by limited symbolic execution (Research in progress)
    • Not as costly as full symbolic execution
    • Not as costly as throwing out and restarting at the mezzanine validation point
  52. Fix for Mezzanine Validations: PyChains | PyGram | Fuzz

    Partial prefixes; Generated inputs; Partial decomposition of input
  53. Toolchain: Pygmalion (PyChains | PyGram | Fuzz)

    Partial prefixes; Generated inputs; Partial decomposition of input

    Advantages:
    • No samples required
    • Explores the complete input space
    • Fast

    Caution:
    • Research in progress
    • Currently only in Python (3.6)
    • PyGram works only on top-down recursive descent style parsers.
  54. Pygmalion: PyChains | Trace | Track | Mine | Infer | Refine | Fuzz

    Grammar Inference Engine: PyGram
  55. Pygmalion: PyChains | Trace | Track | Mine | Infer | Refine => CFG

    PyChains: Generate inputs
    Trace: Language specific: Comparisons and Taints
    Track: Generate Dynamic Dataflow Graph
    Mine: Generate Parse Tree
    Infer: Infer Context Free Grammar
    Refine: Generalize The Grammar
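
As an illustration of the Infer stage only, the sketch below collects grammar rules from parse trees by reading off, for each node, its symbol and the sequence of its children. It assumes the trees are already available as nested `(symbol, children)` tuples; the tree shape, the hand-built `url_tree` helper, and `infer_grammar` are assumptions for this example and are not taken from Pygmalion.

```python
from collections import defaultdict


def infer_grammar(trees):
    """Collect one rule per distinct (symbol, child sequence) seen in the trees."""
    grammar = defaultdict(set)

    def visit(node):
        if isinstance(node, str):
            return  # terminal: nothing to record
        symbol, children = node
        rule = tuple(c if isinstance(c, str) else c[0] for c in children)
        grammar[symbol].add(rule)
        for child in children:
            visit(child)

    for tree in trees:
        visit(tree)
    return {symbol: sorted(rules) for symbol, rules in grammar.items()}


def url_tree(protocol, user, password, host, port=None):
    """Hand-built parse tree for one URL (a real pipeline would mine it from a trace)."""
    children = [
        ("<protocol>", [protocol]), "://",
        ("<authority>", [("<user>", [user]), ":", ("<password>", [password])]),
        "@", ("<host>", [host]),
    ]
    if port is not None:
        children += [":", ("<port>", [port])]
    return ("<spec>", children)


if __name__ == "__main__":
    trees = [
        url_tree("http", "admin", "pass123", "www.google.com", "80"),
        url_tree("ftp", "boo", "12345", "ftp.example.com"),
    ]
    for symbol, rules in infer_grammar(trees).items():
        print(symbol, "::=", " | ".join(" ".join(rule) for rule in rules))
```

The result lists literal alternatives such as `http | ftp`; turning those into character classes like `r{[a-z]+}` would be the job of a generalization step such as Refine.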
  56. DEMO

  57. Summary