
Look Ma No Hands: Learning Input Grammar without Inputs

Rahul Gopinath

June 12, 2018

Transcript

  1. Rahul Gopinath
    Postdoctoral Researcher
    Look Ma No Hands
    Learning Input Grammar without Inputs

  2. Why learn input grammars?

  3. Why learn input grammars?

  4. POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM




  5. POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST

  6. POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD

  7. POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD
    SOAP

  8. POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD
    SOAP
    RPC Call

  9. HTTP POST
    XML PAYLOAD
    SOAP
    RPC Call
    HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application

  10. HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application
    Target

  11. $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ !
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*.
    \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 !
    5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X }
    EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0|
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g
    ka&]BS6R&j?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy
    l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ}
    r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI)
    (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn,
    0)G/6N-wyzj/MTd#A;r
    A Naive Fuzzer
    HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application

  12. A Naive Fuzzer
    HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application
    $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ !
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*.
    \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 !
    5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X }
    EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0|
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g
    ka&]BS6R&j?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy
    l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ}
    r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI)
    (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn,
    0)G/6N-wyzj/MTd#A;r
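
A naive fuzzer of the kind shown on this slide can be sketched in a few lines. The sketch below is an illustration only, not the `fuzzit.py` from the talk: the function names, the alphabet, and the use of Python's `json` module as a stand-in for the first parser layer are all assumptions. It shows why random strings rarely get past even one parser.

```python
import json
import random
import string

def naive_fuzz(rng, max_len=40):
    """Return a string of random printable ASCII characters."""
    alphabet = string.digits + string.ascii_letters + string.punctuation + " "
    length = rng.randint(1, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))

def count_valid(trials=1000, seed=0):
    """Count how many random strings a JSON parser accepts (stand-in target)."""
    rng = random.Random(seed)
    valid = 0
    for _ in range(trials):
        try:
            json.loads(naive_fuzz(rng))
            valid += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return valid

if __name__ == "__main__":
    print(naive_fuzz(random.Random(1)))
    print(count_valid(), "of 1000 random inputs were valid JSON")
```

Almost all generated inputs are rejected by the first layer, so the layers behind it are never exercised.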

  13. What we need is the
    Input Grammar!

  14. If you already have the grammar:
    1. JSFunFuzz
    2. GramFuzz
    3. LangFuzz

  15. What if you don't have the grammar?

  16. State of the ART
    AFL
    Glade [PLDI2017]

  17. AFL Fuzz
    • Mutate sample inputs (if available)
    • Branch coverage directed
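
The two bullets above can be sketched as a small coverage-guided mutation loop. This is a simplified illustration, not AFL itself: AFL instruments compiled binaries and tracks branch (edge) coverage, whereas this sketch uses Python's `sys.settrace` and line coverage as a stand-in. The `target` function, the mutation operators, and all names are assumptions made for the example.

```python
import random
import sys

def target(data):
    """Toy parser standing in for the program under test (hypothetical)."""
    if len(data) > 0 and data[0] == "{":
        if len(data) > 1 and data[1] == '"':
            if len(data) > 2 and data[2] == "}":
                return "deep"
            return "mid"
        return "shallow"
    return "none"

def coverage_of(func, data):
    """Run func(data) and return the set of line numbers executed in func."""
    lines = set()
    def tracer(frame, event, arg):
        if frame.f_code is func.__code__ and event == "line":
            lines.add(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        func(data)
    finally:
        sys.settrace(None)
    return frozenset(lines)

def mutate(data, rng):
    """Insert, replace, or delete one character at a random position."""
    pos = rng.randrange(len(data) + 1)
    char = chr(rng.randrange(32, 127))
    op = rng.choice(("replace", "insert", "delete"))
    if op == "insert" or not data:
        return data[:pos] + char + data[pos:]
    pos = min(pos, len(data) - 1)
    if op == "replace":
        return data[:pos] + char + data[pos + 1:]
    return data[:pos] + data[pos + 1:]

def fuzz(seed_input, rounds, rng):
    """Keep every mutant that produces a previously unseen coverage set."""
    population = [seed_input]
    seen = {coverage_of(target, seed_input)}
    for _ in range(rounds):
        candidate = mutate(rng.choice(population), rng)
        cov = coverage_of(target, candidate)
        if cov not in seen:
            seen.add(cov)
            population.append(candidate)
    return population
```

Calling `fuzz("{", 300, random.Random(0))` returns the seed plus any mutants that reached new code; each deeper check needs one more lucky mutation.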

  18. AFL Fuzz
    Parser        Time (sec)   Stmt Coverage
    JSON Parser   7942         36(49)%
    MathExpr      77           63(77)%
    URLParser     14           56(62)%
    • Few valid inputs produced
    • Doesn't explore the input space very well
    • Performance is affected by complexity of grammar

  19. KLEE
    • Explore input space symbolically
    • Very fast to explore simple input languages
    Parser        Time (sec)   Stmt Coverage
    MathExpr      5.25         99(99)%
    URLParser     0.58         98(99)%
    * compared with equivalent C programs

  20. KLEE
    • Explore input space symbolically
    • Performance suffers with even slightly complex grammars
    Parser        Time (sec)   Stmt Coverage
    MathExpr      5.25         99 (99) %
    URLParser     0.58         98 (99) %
    JSON Parser   14617        31 (31) %
    * compared with equivalent C programs

  21. AUTOGRAM
    Context-free grammar
    from samples

  22. AUTOGRAM
    http://admin:pa[email protected]:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso

  23. AUTOGRAM
    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
                            (spec.charAt(start) == '/') &&
                            (spec.charAt(start + 1) == '/') &&
                            (spec.charAt(start + 2) == '/') &&
                            (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) &&
            (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/')) {
            start += 2;
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0) i = limit;
            }
            host = authority = spec.substring(start, i);
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind+1);
            } else userInfo = null;
            if (host != null) {
                if (host.length()>0 && (host.charAt(0) == '[')) {
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host ;
                        host = nhost.substring(0,ind+1);
                        port = -1 ;
                        if (nhost.length() > ind+1) {
                            if (nhost.charAt(ind+1) == ':') {
                                ++ind ;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind+1));
                            }
                        }
                    }
                } else {
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);

  24. AUTOGRAM
    http://admin:pa[email protected]:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso
    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
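
Once a grammar like this is available, producing valid inputs is a matter of random expansion. The following is a minimal sketch of that idea, not a tool from the talk. The dictionary encoding, the helper symbols ending in `?` for optional parts, and the hand-picked sample strings that stand in for each `r{...}` character class are assumptions made for illustration.

```python
import random
import re

# Grammar transcribed from the slide. Optional parts become helper symbols
# ending in "?"; each r{...} class is replaced by a few sample strings that
# match it (an assumption, chosen to keep the sketch short).
GRAMMAR = {
    "<url>": [["<protocol>", "://", "<authority>", "<path>", "<query?>", "<ref?>"]],
    "<authority>": [["<userinfo?>", "<host>", "<port?>"]],
    "<userinfo?>": [[], ["<userinfo>", "@"]],
    "<port?>": [[], [":", "<port>"]],
    "<query?>": [[], ["?", "<query>"]],
    "<ref?>": [[], ["#", "<ref>"]],
    "<protocol>": [["http"], ["ftp"]],
    "<userinfo>": [["admin:pass123"], ["boo:12345"]],
    "<host>": [["www.google.com"], ["example.ftp.com"]],
    "<port>": [["80"]],
    "<path>": [["/command"], ["/debian7.iso"], ["/"]],
    "<query>": [["foo=bar&lorem=ipsum"]],
    "<ref>": [["fragment"], ["results"]],
}

def expand(symbol, rng):
    """Expand a symbol by choosing one of its alternatives at random."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as is
    alternative = rng.choice(GRAMMAR[symbol])
    return "".join(expand(part, rng) for part in alternative)

# Regular expression equivalent of the slide's grammar, used as a check.
URL_RE = re.compile(
    r"(http|ftp)://([a-z]+:[a-z0-9]+@)?[a-z.]+(:80)?"
    r"/[a-z0-9.]*(\?foo=bar&lorem=ipsum)?(#[a-z]+)?"
)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(expand("<url>", rng))
```

Every string produced this way is accepted by the grammar, which is what the grammar-based fuzzers listed earlier rely on.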

  25. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http://admin:pass123@www.google.com:80

  26. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    parseURL
      :spec       http://admin:pass123@www.google.com:80

  27. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    parseURL
      :spec       http://admin:pass123@www.google.com:80
      setURL
        :protocol   http
        :authority  admin:pass123
        :port       80
        :host       www.google.com

  28. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    parseURL
      :spec       http://admin:pass123@www.google.com:80
      setURL
        :protocol   http
        :authority  admin:pass123
        :port       80
        :host       www.google.com
        setUserInfo
          admin
          pass123

  29. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    parseURL
      :spec       http://admin:pass123@www.google.com:80
      setURL
        :protocol   http
        :authority  admin:pass123
        :port       80
        :host       www.google.com
        setUserInfo
          admin
          pass123
    parseURL
      :spec       ftp://boo:12345@example.ftp.com
      setURL
        :protocol   ftp
        :authority  boo:12345
        :host       example.ftp.com
        setUserInfo
          boo
          12345

  30. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http | ftp
    [80]
    www.google.com | example.ftp.com
    :spec
    setURL
    :protocol
    :authority
    parseURL
    :port
    :host
    admin|boo
    pass123|12345
    setUserInfo
    SPEC
    AUTHORITY

  31. parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http | ftp
    [80]
    www.google.com | example.ftp.com
    :spec
    setURL
    :protocol
    :authority
    parseURL
    :port
    :host
    admin|boo
    pass123|12345
    setUserInfo
    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= r{[0-9]+}
    SPEC
    AUTHORITY
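
The step from a traced run to a grammar rule can be illustrated with plain string substitution: every value that was seen in a named variable is replaced in the input by that variable's name. This is a deliberately simplified sketch, not AUTOGRAM's implementation, which tracks data flow in the Java program rather than matching substrings. The function `derive_rule` and its `bindings` argument are hypothetical.

```python
def derive_rule(text, bindings):
    """Replace each observed value in `text` by the upper-cased name of the
    variable that held it, then quote the remaining literal pieces.

    Values are substituted longest first so that a short value (such as a
    port number) cannot clobber part of a longer one. Plain substring
    matching is an assumption of this sketch and breaks on repeated values.
    """
    rule = text
    for name, value in sorted(bindings.items(), key=lambda kv: -len(kv[1])):
        rule = rule.replace(value, " " + name.upper() + " ")
    names = {name.upper() for name in bindings}
    tokens = rule.split()
    return " ".join(t if t in names else "'" + t + "'" for t in tokens)

if __name__ == "__main__":
    spec = "http://admin:pass123@www.google.com:80"
    print("SPEC ::=", derive_rule(spec, {
        "protocol": "http",
        "authority": "admin:pass123",
        "host": "www.google.com",
        "port": "80",
    }))
    print("AUTHORITY ::=", derive_rule("admin:pass123", {
        "user": "admin",
        "password": "pass123",
    }))
```

Merging the rules obtained from several sample inputs, and generalizing the observed values into character classes, then gives grammars like the one on this slide.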

  32. But: we still need samples

  33. AUTOGRAM
    We still need samples
    • Could result in a grammar with blind spots

  34. AUTOGRAM
    We still need samples
    Symbolic execution is unscalable for complex parsers

  35. AUTOGRAM
    Symbolic execution is unscalable for complex parsers
    Do we need full constraint solving?

  36. Idea!
    Solve only the next character

  37. PyChains
    Start with an empty string
    input = "" EOF?
    Yes
    No
    Reject!

  38. PyChains
    Fix the problem with a random character
    EOF?
    No
    Yes
    Reject!
    input = "x"

  39. PyChains
    Fix the problem with a random character
    isDigit(input[0]) Yes
    Reject!
    input = "x"
    input[0] in ['+', '-']
    input[0] == '('
    else
    !EOF(input[0:])

  40. PyChains
    Fix the problem with the choice "("
    isDigit(input[0]) Yes
    Reject!
    input = "("
    input[0] in ['+', '-']
    input[0] == '('
    else
    !EOF(input[0:])

  41. PyChains
    Continue with the next character
    isDigit(input[0]) Yes
    Reject!
    input = "("
    !EOF(input[0:])
    input[0] in ['+', '-']
    input[0] == '('
    else
    !EOF(input[1:])

  42. PyChains
    Continue with the next character
    isDigit(input[0]) Yes
    Reject!
    input = "(y"
    !EOF(input[0:])
    input[0] in ['+', '-']
    input[0] == '('
    else
    !EOF(input[1:])
    isDigit(input[1])
    input[1] in ['+', '-']
    input[1] == "("
    input[1] == ")"
    else Reject!

  43. PyChains
    isDigit(input[0])
    input = "(1+2)"
    input[0] in ['+', '-']
    input[0] == '('
    else
    isDigit(input[1])
    input[1] in ['+', '-']
    input[1] == "("
    input[1] == ")"
    isDigit(input[2])
    input[2] in ['+', '-']
    input[2] == "("
    input[2] == ")"
    isDigit(input[3])
    input[3] in ['+', '-']
    input[3] == "("
    input[3] == ")"
    isDigit(input[4])
    input[4] in ['+', '-']
    input[4] == "("
    input[4] == ")"
    Accept!
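
The walk-through on the preceding slides can be condensed into a short loop: run the parser, append a random character when it runs out of input, and when it rejects the last character, replace that character with one the parser compared it against. The sketch below is an illustration under assumptions, not the PyChains implementation. The toy expression parser, the `TracedInput` wrapper, and all names are hypothetical, and comparisons are recorded by explicit calls instead of taint tracking.

```python
import random
import string

class NeedMoreInput(Exception):
    """The parser tried to read past the end of the input."""

class Rejected(Exception):
    """The parser rejected the character at the given position."""

class TracedInput:
    """Wraps the input and records what each position was compared against."""
    def __init__(self, text):
        self.text = text
        self.comparisons = {}  # index -> set of characters checked there

    def at_end(self, i):
        return i >= len(self.text)

    def is_in(self, i, chars):
        if i >= len(self.text):
            raise NeedMoreInput()
        self.comparisons.setdefault(i, set()).update(chars)
        return self.text[i] in chars

def parse_expr(inp, i):
    # expr := term (('+' | '-') term)*
    i = parse_term(inp, i)
    while not inp.at_end(i) and inp.is_in(i, "+-"):
        i = parse_term(inp, i + 1)
    return i

def parse_term(inp, i):
    # term := digit+ | '(' expr ')'
    if inp.is_in(i, string.digits):
        i += 1
        while not inp.at_end(i) and inp.is_in(i, string.digits):
            i += 1
        return i
    if inp.is_in(i, "("):
        i = parse_expr(inp, i + 1)
        if inp.is_in(i, ")"):
            return i + 1
    raise Rejected(i)

def parse(text):
    """Return ('accept' | 'incomplete' | 'reject', traced input)."""
    inp = TracedInput(text)
    try:
        end = parse_expr(inp, 0)
        if not inp.at_end(end):
            raise Rejected(end)
        return "accept", inp
    except NeedMoreInput:
        return "incomplete", inp
    except Rejected:
        return "reject", inp

def pychains(rng, max_steps=1000):
    """Grow an input one character at a time until the parser accepts it."""
    text = ""
    for _ in range(max_steps):
        status, inp = parse(text)
        if status == "accept":
            return text
        if status == "incomplete":
            text += rng.choice(string.printable)   # guess the next character
        else:
            # Only the last character can be wrong: the prefix was accepted
            # up to this point. Replace it with one the parser asked for.
            last = len(text) - 1
            text = text[:last] + rng.choice(sorted(inp.comparisons[last]))
    return None

if __name__ == "__main__":
    for seed in range(5):
        print(repr(pychains(random.Random(seed))))
```

No constraint solver is involved: each step only needs the set of characters the parser looked for at the failing position.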

  44. PyChains
    • Relies on:
    • Dynamic taint tracking
    • Tracing character comparisons
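
One way to trace character comparisons in Python without modifying the parser is to hand it characters that log every equality test. The sketch below illustrates that idea only and is not taken from PyChains: the class `TracedChar`, the helper `trace`, and the toy check `starts_term` are hypothetical, and only `==` and `!=` are intercepted (methods such as `isdigit` would need the same treatment).

```python
class TracedChar(str):
    """A one-character string that remembers its position in the input
    (its taint) and logs every equality comparison made against it."""

    def __new__(cls, char, index, log):
        obj = super().__new__(cls, char)
        obj.index = index
        obj.log = log
        return obj

    def __eq__(self, other):
        result = str(self) == other
        self.log.append((self.index, other, result))
        return result

    def __ne__(self, other):
        return not self == other

    # Defining __eq__ would otherwise make the class unhashable.
    __hash__ = str.__hash__

def trace(text):
    """Return the input as traced characters plus the shared comparison log."""
    log = []
    return [TracedChar(c, i, log) for i, c in enumerate(text)], log

def starts_term(chars):
    """Toy parser fragment: may a term start with this character?"""
    first = chars[0]
    return first == "(" or first == "+" or first == "-"

if __name__ == "__main__":
    chars, log = trace("x1")
    print(starts_term(chars))  # the parser rejects 'x'
    # Characters the parser was looking for at position 0:
    print([value for index, value, _ in log if index == 0])
```

The logged values at the failing position are exactly the candidates for the next attempt.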

  45. PyChains
    • Faster for complex input languages
    Parser        Time (sec)   Stmt Coverage
    JSON Parser   1713         100 (44) %
    MathExpr      122          99 (62) %
    URLParser     1665         100 (56) %
    Complexity

  46. Limitations
    • Not as fast as naive fuzzers (in terms of the number of inputs produced)

  47. Limitations
    • Problems with mezzanine validations (secondary validations in the current layer)
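    class ParseException(Exception):
        pass

    def parse_num(input):
        # Accept an optional leading sign, then any run of digits and dots.
        # Whether the run is a valid number is only checked later, by float().
        i = 1 if input[:1] in ('+', '-') else 0
        while i < len(input) and (input[i].isdigit() or input[i] == '.'):
            i = i + 1
        return input[:i], input[i:]

    def parse_arithmetic(input):
        value1, rest = parse_num(input)
        if rest[:1] not in ('+', '-'):
            raise ParseException(rest)
        op = rest[0]
        value2, rest = parse_num(rest[1:])
        if rest != '':
            raise ParseException(rest)
        # Mezzanine validation: float() rejects strings such as '10.0.1'.
        return (op, float(value1), float(value2))

    parse_arithmetic('10.0.1+1')   ValueError (raised by float)
    parse_arithmetic('99+1')       ('+', 99.0, 1.0)
    parse_arithmetic('2.1-3')      ('-', 2.1, 3.0)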

  48. Limitations
    • Problems with mezzanine validations
    • Solution: Throw out accumulated characters from the point of
    secondary validation, and start again.
    10.0.1+1 ValueError 'Invalid Int'
    10.0?
    ...
    10.05345+563.334
    Inefficient!
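
The restart strategy described on this slide can be sketched as follows: build a token from characters that pass the per-character checks, run the secondary validation, and on failure discard the whole token and try again. This is an illustration under assumptions, not code from the talk: `float()` stands in for the secondary validation, and the function names and the restart limit are hypothetical.

```python
import random

def random_number_token(rng, max_len=6):
    """Build a token from characters that pass the per-character checks."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice("0123456789.") for _ in range(length))

def generate_number(rng, max_restarts=1000):
    """Regenerate the whole token until the secondary validation passes.

    Returns the accepted token and the number of tokens thrown away.
    """
    for restarts in range(max_restarts):
        token = random_number_token(rng)
        try:
            float(token)           # mezzanine validation
            return token, restarts
        except ValueError:
            continue               # discard the token and start again
    raise RuntimeError("no valid token found")

if __name__ == "__main__":
    print(generate_number(random.Random(0)))
```

All work spent on a discarded token is lost, which is why the slide calls this approach inefficient.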

  49. PyChains | PyGram | Fuzz
    Grammar Inference Engine: PyGram
    Sample inputs
    Generated inputs
    (Infer Grammar)
    Fix for speed

  50. PyChains | PyGram
    Mezzanine Validations
    Partial prefixes
    Partial decomposition of input

  51. Mezzanine Validations
    http://[ffcc:xxx/mypath?a=b  :spec
      http         :protocol
      [ffcc:xxx    :host
      /mypath?a=b  :path
    Mezzanine validation:
    if host[0] == '[':
        validateIPv6(host)
    Generate new host string by limited symbolic execution (Research in progress)
    • Not as costly as full symbolic execution
    • Not as costly as throwing out and restarting at the mezzanine validation point


  52. PyChains | PyGram | Fuzz
    Fix for Mezzanine Validations
    Partial prefixes
    Generated inputs
    Partial decomposition of input

  53. PyChains | PyGram | Fuzz
    Toolchain: Pygmalion
    Partial prefixes
    Generated inputs
    Partial decomposition of input
    Advantages:
    • No samples required
    • Explores the complete input space
    • Fast
    Caution:
    • Research in progress
    • Currently only in Python (3.6)
    • PyGram works only on top-down recursive-descent style parsers.

  54. Pygmalion
    PyChains | Trace | Track | Mine | Infer | Refine | Fuzz
    Grammar Inference Engine: PyGram

  55. Pygmalion
    PyChains | Trace | Track | Mine | Infer | Refine => CFG
    Generate inputs
    Language specific: comparisons and taints
    Generate dynamic dataflow graph
    Generate parse tree
    Infer context-free grammar
    Generalize the grammar

  56. DEMO

  57. Summary