Learning Grammars without Samples

Rahul Gopinath

April 02, 2019

Transcript

  1. Learning Grammars Without Samples
    Rahul Gopinath
    Postdoctoral Researcher
    CISPA Helmholtz Center for Information Security

  2. Why Learn a Grammar?

  3. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



  4. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST

  5. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD

  6. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD
    SOAP

  7. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD
    SOAP
    RPC Call

  8. We need grammars to reach the pot of gold
    HTTP
    HTTP Parser
    XML
    XML Parser
    SOAP
    SOAP Parser
    RPC
    RPC Parser
    Application

  9. A Naive Fuzzer
    $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ !
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*.
    \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 !
    5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X }
    EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0|
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g
    ka&]BS6R&j?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIy
    l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ}
    r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI)
    (ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn,
    0)G/6N-wyzj/MTd#A;r
    HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application
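
A fuzzer of this kind can be sketched in a few lines. This is a minimal illustration and not the `fuzzit.py` shown on the slide: the names `naive_fuzz` and `accepted`, their parameters, and the use of Python's `json.loads` as a stand-in for the first parser layer are all assumptions made for the example.

```python
import json
import random


def naive_fuzz(rng, max_length=100, char_start=32, char_range=95):
    """Return up to max_length random printable ASCII characters."""
    length = rng.randrange(max_length + 1)
    return "".join(
        chr(rng.randrange(char_start, char_start + char_range))
        for _ in range(length)
    )


def accepted(text, parser=json.loads):
    """Report whether the (stand-in) parser accepts the input."""
    try:
        parser(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


if __name__ == "__main__":
    rng = random.Random()
    inputs = [naive_fuzz(rng) for _ in range(10_000)]
    valid = sum(accepted(text) for text in inputs)
    print(f"{valid} of {len(inputs)} random inputs passed the parser")
```

Inputs that fail the first parser never reach the layers behind it, which is the problem the slide illustrates with the parser stack.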

  11. What we need is the Input Specification!

  12. If you already have the Specification:
    • JSFunFuzz
    • GramFuzz
    • LangFuzz
    • CSMITH

  13. But, where do we get grammars from?

  14. But, where do we get grammars from?
    Sample Inputs → Grammar
    Use sample inputs to dynamically mine the grammar (AUTOGRAM, ASE 2016)

  15. But, where do we get sample inputs from?
    Sample Inputs? → Grammar
    If we had a grammar, we could use it to generate sample inputs.
    But we don't.

  16. Developer Produced Grammar?
    • Often out of sync with the program
    • Can result in blind spots

  17. State of the Art in Generating Inputs: KLEE, AFL and GLADE
    • KLEE: uses symbolic execution
    • AFL: uses coverage-guided fuzzing
    • GLADE: black-box grammar recovery
  18. Glade:
    • Black-box technique
    • Very slow to generate meaningful grammars

  19. AFL:
    • Uses (branch) coverage-guided fuzzing
    • Mutates sample inputs (if available)
    • The inputs generated are shallow and simple
    • Very few valid inputs
    • Performance is affected by the complexity of the input space

  20. KLEE:
    • Uses Symbolic Execution
    • Very fast to explore simple languages
    • But suffers when the input space becomes complex

  21. Do we need full symbolic execution?
    Idea: Solve only the next character (pFuzzer).

  22. Do we need full symbolic execution?
    pFuzzer → Program: x ✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'

  23. Do we need full symbolic execution?
    pFuzzer → Program: x
    Replace x with a digit

  24. Do we need full symbolic execution?
    pFuzzer → Program: 0 ✔

  25. Do we need full symbolic execution?
    pFuzzer → Program: [
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
    Succeeds but does not terminate

  26. Do we need full symbolic execution?
    pFuzzer → Program: [z ✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'

  27. Do we need full symbolic execution?
    pFuzzer → Program: [true
    Replaces z with true: succeeds but does not terminate

  28. Do we need full symbolic execution?
    pFuzzer → Program: [true1 ✘
    – checks for ','
    – checks for ']'

  29. Do we need full symbolic execution?
    pFuzzer → Program: [true] ✔
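
The walkthrough on these slides can be sketched as a small loop. This is an illustration under stated assumptions, not pFuzzer itself: the toy parser accepts only digits, `true`, `false`, and bracketed lists, and the names `TracedInput`, `parse_value`, `run`, and `pfuzz` are hypothetical. The wrapper records which characters the parser compared against at each input position; on rejection, the fuzzer replaces only the last character with one of the characters the parser was looking for, and on running out of input it appends a random character.

```python
import random
import string


class Incomplete(Exception):
    """The parser tried to read past the end of the input."""


class Rejected(Exception):
    """The parser compared a character and found no match."""


class TracedInput:
    """Input wrapper that records every character comparison."""

    def __init__(self, text):
        self.text = text
        self.comparisons = []  # (index, characters compared against)

    def is_in(self, index, chars):
        if index >= len(self.text):
            raise Incomplete()
        self.comparisons.append((index, chars))
        return self.text[index] in chars


def parse_value(inp, i):
    """Toy grammar: value := digit | true | false | '[' [value {',' value}] ']'."""
    if inp.is_in(i, string.digits):
        return i + 1
    for word in ("true", "false"):
        if inp.is_in(i, word[0]):
            for offset, char in enumerate(word[1:], 1):
                if not inp.is_in(i + offset, char):
                    raise Rejected()
            return i + len(word)
    if inp.is_in(i, "["):
        i += 1
        if inp.is_in(i, "]"):
            return i + 1
        while True:
            i = parse_value(inp, i)
            if inp.is_in(i, "]"):
                return i + 1
            if not inp.is_in(i, ","):
                raise Rejected()
            i += 1
    raise Rejected()


def run(text):
    """Parse text and return (status, traced input)."""
    inp = TracedInput(text)
    try:
        status = "valid" if parse_value(inp, 0) == len(text) else "rejected"
    except Incomplete:
        status = "incomplete"
    except Rejected:
        status = "rejected"
    return status, inp


def pfuzz(rng, max_steps=1000):
    """Grow one input character by character, guided by the comparisons."""
    text = rng.choice(string.printable)
    for _ in range(max_steps):
        status, inp = run(text)
        if status == "valid":
            return text
        if status == "incomplete":
            text += rng.choice(string.printable)  # accepted so far: extend
            continue
        last = len(text) - 1
        candidates = [c for i, chars in inp.comparisons if i == last for c in chars]
        if not candidates:
            return None  # nothing was compared at the last character
        text = text[:last] + rng.choice(candidates)
    return None


if __name__ == "__main__":
    print(pfuzz(random.Random()))
```

For the input `[true1`, the comparisons recorded at the last position are against `]` and `,`, which mirrors the step shown on the slides.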

  30. Comparing to KLEE and AFL
    The average number of tokens found by KLEE, AFL, and pFuzzer for each token length (Mathis et al., PLDI 2019)

  31. pFuzzer: Implementations in Python and C
    • Relies on dynamic taint tracking and character comparisons
    • Currently assumes that the source (LLVM bitcode) is available, but this is not a requirement
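
One way taint tracking and comparison recording could look in Python is a `str` subclass that remembers which input positions each fragment came from. This is a minimal sketch under that assumption; the class name `TaintedStr` and its `origin` and `log` attributes are hypothetical and do not describe pFuzzer's actual instrumentation.

```python
class TaintedStr(str):
    """A string that tracks its origin in the input and logs == comparisons."""

    def __new__(cls, value, origin=None, log=None):
        obj = super().__new__(cls, value)
        # origin[k] is the input index that character k was taken from
        obj.origin = list(range(len(value))) if origin is None else list(origin)
        # the log is shared between a string and all fragments sliced from it
        obj.log = [] if log is None else log
        return obj

    def __getitem__(self, key):
        piece = super().__getitem__(key)
        origin = self.origin[key]
        if isinstance(key, int):
            origin = [origin]
        return TaintedStr(piece, origin, self.log)

    def __eq__(self, other):
        # record which input positions were compared, and against what
        self.log.append((tuple(self.origin), str(other)))
        return super().__eq__(other)

    # defining __eq__ would otherwise make the class unhashable
    __hash__ = str.__hash__


if __name__ == "__main__":
    data = TaintedStr("[true]")
    if data[0] == "[" and data[1:5] == "true":
        print(data.log)
```

Only slicing, indexing, and `==` are covered here; other operations such as iteration, `!=`, or concatenation would return or compare plain strings and lose the taint.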

  32. Longest Generated Input for JSON Parser
    [false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+|
    4GzCW(C":-94}} ],[false,null]]
    ............................
    nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
    AFL
    KLEE
    pFuzzer
    pFuzzer generates strings with better structure than either AFL or KLEE

  33. Feeding the Results Back to the Grammar Miner
    [Diagram labels: Program Under Test; Parser-Directed Test Generator; Comparisons; Tests; Dynamic Taints; Test Inputs; Grammar Learner; Inputs + Equivalence Classes; Grammar; Fuzzer]

  34. AUTOGRAM: Grammar from Samples

  35.
    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
                (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/') &&
                (spec.charAt(start + 2) == '/') &&
                (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) &&
                (spec.charAt(start) == '/') &&
                (spec.charAt(start + 1) == '/')) {
            start += 2;
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0) i = limit;
            }
            host = authority = spec.substring(start, i);
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind+1);
            } else userInfo = null;
            if (host != null) {
                if (host.length()>0 && (host.charAt(0) == '[')) {
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host ;
                        host = nhost.substring(0,ind+1);
                        port = -1 ;
                        if (nhost.length() > ind+1) {
                            if (nhost.charAt(ind+1) == ':') {
                                ++ind ;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind+1));
                            }
                        }
                    }
                } else {
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);

    http://admin:[email protected]:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso

    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}
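
Once such a grammar exists, producing new inputs from it is straightforward. The sketch below encodes the URL grammar from this slide as a Python dictionary and expands it at random. The encoding is an assumption made for illustration: optional parts are written as extra `OPT_*` rules, and the hypothetical helper `chars` stands in for the regex terminals with an arbitrary length bound.

```python
import random
import string

LOWER = string.ascii_lowercase
ALNUM = LOWER + string.digits


def chars(alphabet, low, high):
    """Stand-in for a regex terminal such as r{[a-z]+} (lengths are arbitrary)."""
    def produce(rng):
        return "".join(rng.choice(alphabet) for _ in range(rng.randint(low, high)))
    return produce


# Each nonterminal maps to a list of alternative expansions.
GRAMMAR = {
    "URL": [["PROTOCOL", "://", "AUTHORITY", "PATH", "OPT_QUERY", "OPT_REF"]],
    "AUTHORITY": [["OPT_USERINFO", "HOST", "OPT_PORT"]],
    "OPT_USERINFO": [[], ["USERINFO", "@"]],
    "OPT_PORT": [[], [":", "PORT"]],
    "OPT_QUERY": [[], ["?", "QUERY"]],
    "OPT_REF": [[], ["#", "REF"]],
    "PROTOCOL": [["http"], ["ftp"]],
    "USERINFO": [[chars(LOWER, 1, 8), ":", chars(ALNUM, 1, 8)]],
    "HOST": [[chars(LOWER + ".", 1, 12)]],
    "PORT": [["80"]],
    "PATH": [["/", chars(ALNUM + ".", 0, 12)]],
    "QUERY": [["foo=bar&lorem=ipsum"]],
    "REF": [[chars(LOWER, 1, 8)]],
}


def generate(symbol, rng):
    """Expand a symbol: callables produce text, unknown symbols are terminals."""
    if callable(symbol):
        return symbol(rng)
    if symbol not in GRAMMAR:
        return symbol
    expansion = rng.choice(GRAMMAR[symbol])
    return "".join(generate(part, rng) for part in expansion)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(generate("URL", rng))
```

The generated strings also show how closely the grammar follows its samples: the port is always `80` and the query is always the one string seen during mining.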

  36.
    parseURL(spec)
      -> setURL(protocol, host, port, authority,…)
        -> setUserInfo(user, password)

    parseURL     :spec       http://admin:pass123@www.google.com:80
    setURL       :protocol   http
                 :authority  admin:pass123
                 :host       www.google.com
                 :port       80
    setUserInfo              admin, pass123

  37.
    parseURL(spec)
      -> setURL(protocol, host, port, authority,…)
        -> setUserInfo(user, password)

    parseURL     :spec       http://admin:pass123@www.google.com:80
    setURL       :protocol   http
                 :authority  admin:pass123
                 :host       www.google.com
                 :port       80
    setUserInfo              admin, pass123

    parseURL     :spec       ftp://boo:12345@example.ftp.com
    setURL       :protocol   ftp
                 :authority  boo:12345
                 :host       example.ftp.com
    setUserInfo              boo, 12345

  38.
    parseURL(spec)
      -> setURL(protocol, host, port, authority,…)
        -> setUserInfo(user, password)

    parseURL     :spec
    setURL       :protocol   http | ftp
                 :authority
                 :host       www.google.com | example.ftp.com
                 :port       [80]
    setUserInfo              admin | boo, pass123 | 12345

    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z]+}
    PORT ::= r{[0-9]+}
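
The step from recorded values to grammar rules can be sketched as follows. The sketch assumes the values each function received have already been recorded, and it uses plain substring search where AUTOGRAM uses dynamic taints, so it can mislabel fragments when the same text occurs more than once. The names `derive`, `mine`, and `SAMPLES` are hypothetical; the sample values are the ones shown on the slides.

```python
def derive(text, values):
    """Replace each recorded value found in text by its variable name."""
    parts = [("t", text)]  # "t" = terminal text, "n" = nonterminal name
    # Longest values first, so that "ftp" is not matched inside a host name.
    for name, value in sorted(values.items(), key=lambda kv: len(kv[1]), reverse=True):
        if not value:
            continue
        rewritten, done = [], False
        for kind, chunk in parts:
            if kind == "t" and not done and value in chunk:
                before, _, after = chunk.partition(value)
                rewritten += [("t", before), ("n", name), ("t", after)]
                done = True
            else:
                rewritten.append((kind, chunk))
        parts = [part for part in rewritten if part[1]]
    return " ".join(f"<{chunk}>" if kind == "n" else repr(chunk) for kind, chunk in parts)


SAMPLES = [
    ("http://admin:pass123@www.google.com:80",
     {"protocol": "http", "authority": "admin:pass123",
      "host": "www.google.com", "port": "80"},
     {"user": "admin", "password": "pass123"}),
    ("ftp://boo:12345@example.ftp.com",
     {"protocol": "ftp", "authority": "boo:12345", "host": "example.ftp.com"},
     {"user": "boo", "password": "12345"}),
]


def mine(samples):
    """Collect one set of alternative expansions per nonterminal."""
    grammar = {}
    for spec, url_values, user_values in samples:
        grammar.setdefault("<spec>", set()).add(derive(spec, url_values))
        grammar.setdefault("<authority>", set()).add(
            derive(url_values["authority"], user_values))
        for name, value in {**url_values, **user_values}.items():
            if name != "authority":
                grammar.setdefault(f"<{name}>", set()).add(repr(value))
    return grammar


if __name__ == "__main__":
    for nonterminal, alternatives in mine(SAMPLES).items():
        print(nonterminal, "::=", " | ".join(sorted(alternatives)))
```

Merging the two `<spec>` alternatives into a single rule with an optional port, and generalizing the literal values into regular expressions, are further steps that this sketch does not attempt.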

  39. AUTOGRAM
    • Requires dynamic tainting to track input fragments
    • Works well for recursive-descent parsers
    • Provides strict guarantees on the grammar produced
      • A generated string from the inferred grammar is guaranteed to be a valid input (under certain assumptions)
    • Relatively heavyweight, and language/runtime specific
      • JVM/LLVM for now
    • Problems with implicit taint propagation and internal calls

  40. What if we relax our constraints? Is taint tracking strictly needed?

  41. DEMO: fuzzingbook: GrammarMining

  42. What can you do with the grammar?
    • An abstract representation of a set of executions
      • Or of the program itself if the set of executions is representative
    • Language agnostic, and low implementation complexity
    • Can be used to identify behavioral changes
      • E.g. duplicates in mutation analysis
    • Fingerprinting programs
    • Clone detection
    • Refactoring

  43. Is input specification sufficient?
    How can we make fuzzing better?

  44. Better Oracles
    • Taint Tracking
    • Concolic Execution
    • Metamorphic testing
      • Reorder elements
      • Delete/Insert/Duplicate idempotent elements
    • Differential testing
    • Carving unit tests
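
Two of these oracles are easy to illustrate. The sketch below is a hypothetical example, not taken from the talk: it uses Python's `json.loads` as the program under test, reorders the members of a JSON object as the metamorphic transformation, and uses `ast.literal_eval` as a stand-in second implementation for differential testing. The function names are assumptions.

```python
import ast
import json
import random


def outcome(parser, text):
    """Run a parser and normalize the result so outcomes can be compared."""
    try:
        return ("ok", parser(text))
    except (ValueError, SyntaxError):
        return ("error", None)


def reorder_members(text, rng):
    """Metamorphic transformation: shuffle the members of a JSON object."""
    members = list(json.loads(text).items())
    rng.shuffle(members)
    return json.dumps(dict(members))


def metamorphic_ok(parser, text, rng):
    """Reordering object members should not change the parsed value."""
    return outcome(parser, text) == outcome(parser, reorder_members(text, rng))


def differential_ok(parser_a, parser_b, text):
    """Two implementations should agree on the same input."""
    return outcome(parser_a, text) == outcome(parser_b, text)


if __name__ == "__main__":
    rng = random.Random()
    print(metamorphic_ok(json.loads, '{"a": 1, "b": [2, 3]}', rng))
    print(differential_ok(json.loads, ast.literal_eval, "[1, 2]"))
    print(differential_ok(json.loads, ast.literal_eval, "[true]"))
```

Neither check needs to know the expected output in advance. The last call reports a disagreement because `true` is a JSON literal but not a Python one, which is the kind of difference a differential oracle is meant to surface.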
