Learning Grammars without Samples

Rahul Gopinath

April 02, 2019

Transcript

  1. Learning Grammars Without Samples
    Rahul Gopinath
    Postdoctoral Researcher
    CISPA Helmholtz Center for Information Security

  2. Why Learn a Grammar?

  3. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



  7. Structured Inputs are Everywhere
    POST /InStock HTTP/1.1
    Host: www.stock.org
    Content-Type: application/soap+xml; charset=utf-8
    Content-Length: 312

    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">


    IBM



    HTTP POST
    XML PAYLOAD
    SOAP
    RPC Call

  8. We need grammars to reach the pot of gold
    HTTP
    HTTP Parser
    XML
    XML Parser
    SOAP
    SOAP Parser
    RPC
    RPC Parser
    Application
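
The layering on this slide can be sketched in Python. This is a minimal illustration, not code from the talk: the XML element names (`Envelope`, `Body`, `GetStockPrice`, `StockName`) are assumptions, because the tags of the slide's payload did not survive, and `handle_request` is a hypothetical helper. The point is that an input has to satisfy every parser in turn before any application logic runs.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2001/12/soap-envelope"

# Assumed example payload; the element names are illustrative only.
BODY = (
    f'<soap:Envelope xmlns:soap="{SOAP_NS}">'
    "<soap:Body><GetStockPrice><StockName>IBM</StockName></GetStockPrice></soap:Body>"
    "</soap:Envelope>"
)
REQUEST = (
    "POST /InStock HTTP/1.1\r\n"
    "Host: www.stock.org\r\n"
    "Content-Type: application/soap+xml; charset=utf-8\r\n"
    f"Content-Length: {len(BODY)}\r\n"
    "\r\n" + BODY
)


def handle_request(raw: str) -> str:
    """Return the layer that rejected the input, or the application's result."""
    # Layer 1: HTTP request line followed by a blank line.
    head, separator, payload = raw.partition("\r\n\r\n")
    request_line = head.split("\r\n")[0].split(" ")
    if not separator or len(request_line) != 3 or request_line[0] != "POST":
        return "rejected by HTTP parser"
    # Layer 2: the payload must be well-formed XML.
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return "rejected by XML parser"
    # Layer 3: the XML must be a SOAP envelope with a body.
    body = root.find(f"{{{SOAP_NS}}}Body")
    if root.tag != f"{{{SOAP_NS}}}Envelope" or body is None:
        return "rejected by SOAP parser"
    # Layer 4: the body must contain the RPC call.
    call = body.find("GetStockPrice")
    if call is None or call.findtext("StockName") is None:
        return "rejected by RPC parser"
    return "application: price lookup for " + call.findtext("StockName")
```

Only `handle_request(REQUEST)` reaches the last line; a random string is already turned away by the first layer.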

  9. A Naive Fuzzer
    $ ./fuzzit.py
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    { Ti . r 3 P I x M M M v 6 { x S ^ + ' H q ! A x B " Y X R S @ !
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*.
    \ > J r l U 3 2 ~ e G P ? l R = b F 3 + ; y $ 3 l o d Q < B 8 9 !
    5 " W 2 f K * v E 7 v { ' ) K C - i , c { < [ ~ m ! ] o ; { . ' } G j \ ( X }
    [email protected]{P!AZU7x#4(Rtn!q4nCwqol^y6}0|
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*g
    ka&]BS6R&j?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,[email protected][!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`[email protected]'2\[email protected]\zIy
    l"'f,$ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ}
    r[Scun&sBCS,T[/vY'pduwgzDlVNy7'rnzxNwI)
    (ynBa>%|b`;`9fG][email protected] 3]KAeEnQ7lU)3Pn,
    0)G/6N-wyzj/MTd#A;r
    HTTP Parser
    XML Parser
    SOAP Parser
    RPC Parser
    Application
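
A naive fuzzer of this kind fits in a few lines of Python. The sketch below is not the `fuzzit.py` from the slide; `naive_fuzzer` and `count_valid` are hypothetical names, and Python's `json` parser stands in for the first parser layer. It produces random printable characters and counts how many of the strings the parser accepts.

```python
import json
import random


def naive_fuzzer(rng: random.Random, max_length: int = 60) -> str:
    """Return a string of random printable ASCII characters."""
    length = rng.randrange(max_length + 1)
    return "".join(chr(rng.randrange(32, 127)) for _ in range(length))


def count_valid(trials: int, seed: int = 0) -> int:
    """Count how many random strings the JSON parser accepts."""
    rng = random.Random(seed)
    valid = 0
    for _ in range(trials):
        try:
            json.loads(naive_fuzzer(rng))
            valid += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return valid
```

Calling `count_valid(1000)` gives a rough idea of how rarely random characters get past even a single parser.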

  11. What we need is the Input Specification!

  12. If you already have the Specification:
    • JSFunFuzz
    • GramFuzz
    • LangFuzz
    • CSMITH
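
What such tools have in common is that they generate inputs from a given specification. A minimal sketch of grammar-based generation follows; it does not reproduce the algorithm of any tool listed above. The grammar is an assumed JSON-like subset, and `generate` and `all_valid` are hypothetical names.

```python
import json
import random

# Assumed toy grammar: keys are nonterminals, values are lists of alternatives.
GRAMMAR = {
    "<value>": [["<number>"], ["true"], ["false"], ["null"], ["[", "<elements>", "]"]],
    "<elements>": [["<value>"], ["<value>", ",", "<elements>"]],
    "<number>": [["<digit>"], ["<nonzero>", "<digits>"]],
    "<digits>": [["<digit>"], ["<digit>", "<digits>"]],
    "<nonzero>": [[d] for d in "123456789"],
    "<digit>": [[d] for d in "0123456789"],
}


def generate(symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 8) -> str:
    """Expand `symbol` by choosing random alternatives from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative. For this
        # grammar that choice always leads to termination.
        alternatives = [min(alternatives, key=len)]
    expansion = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


def all_valid(trials: int, seed: int = 0) -> bool:
    """Check that every generated string is accepted by the JSON parser."""
    rng = random.Random(seed)
    for _ in range(trials):
        json.loads(generate("<value>", rng))  # raises ValueError if invalid
    return True
```

In contrast to the naive fuzzer, every string produced here passes the parser, which is why having the grammar matters.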

  13. But, where do we get grammars from?

  14. But, where do we get grammars from?
    Sample Inputs Grammar
    Use sample inputs to dynamically mine the grammar
    (AUTOGRAM - ASE 2016)

  15. But, where do we get sample inputs from?
    Sample Inputs? Grammar
    If we had a grammar, we could use it to generate sample inputs.
    But we don't.

  16. Developer Produced Grammar?
    • Often out of sync with the program
    • Can result in blind spots

  17. State of the Art in Generating Inputs: KLEE, AFL and GLADE
    KLEE • Uses symbolic execution
    AFL • Uses coverage-guided fuzzing
    Glade • Blackbox grammar recovery

  18. Glade:
    • Blackbox technique
    • Very slow to generate meaningful grammars

  19. AFL:
    • Uses (branch) coverage-guided fuzzing
    • Mutates sample inputs (if available)
    • The generated inputs are shallow and simple
    • Very few valid inputs
    • Performance is affected by the complexity of the input space
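
The idea of coverage-guided mutation can be sketched as follows. This is not AFL's implementation: it uses line coverage collected with `sys.settrace` rather than AFL's instrumented branch coverage, and `target`, `coverage`, `mutate` and `fuzz` are hypothetical names. A mutant is kept only if it executes a line that no earlier input reached.

```python
import random
import sys


def target(s: str) -> None:
    """Hypothetical program under test: each branch needs one more exact character."""
    if s[:1] == "[":
        if s[1:2] == "1":
            if s[2:3] == "]":
                pass  # deepest branch


def coverage(s: str) -> frozenset:
    """Run target(s) and return the set of executed line numbers."""
    lines = set()

    def tracer(frame, event, arg):
        if event == "line":
            lines.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        target(s)
    finally:
        sys.settrace(None)
    return frozenset(lines)


def mutate(s: str, rng: random.Random) -> str:
    """Insert, replace, or delete one character at a random position."""
    pos = rng.randrange(len(s) + 1)
    char = chr(rng.randrange(32, 127))
    op = rng.choice(("insert", "replace", "delete"))
    if op == "insert":
        return s[:pos] + char + s[pos:]
    if op == "replace":
        return s[:pos] + char + s[pos + 1:]
    return s[:pos] + s[pos + 1:]


def fuzz(seed_input: str, iterations: int, rng: random.Random) -> list:
    """Keep every mutant that covers a line not seen before."""
    population = [seed_input]
    seen = set(coverage(seed_input))
    for _ in range(iterations):
        candidate = mutate(rng.choice(population), rng)
        new_lines = coverage(candidate) - seen
        if new_lines:
            seen |= new_lines
            population.append(candidate)
    return population
```

Each nested comparison has to be guessed by chance, one character at a time, which hints at why inputs found this way tend to stay shallow.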

  20. KLEE:
    • Uses symbolic execution
    • Very fast at exploring simple languages
    • But suffers when the input space becomes complex

  21. Do we need full symbolic execution?
    Idea: Solve only the next character. pFuzzer

  22. Do we need full symbolic execution?
    x ✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
    pFuzzer Program

  23. Do we need full symbolic execution?
    pFuzzer Program
    x
    Replace x with a digit

  24. Do we need full symbolic execution?
    pFuzzer Program
    0 ✔

  25. Do we need full symbolic execution?
    [ -
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
    pFuzzer Program
    Succeeds but does not terminate

  26. Do we need full symbolic execution?
    -✘
    – checks for digit
    – checks for "true"/"false"
    – checks for '"'
    – checks for '['
    – checks for '{'
    pFuzzer Program
    [z

  27. Do we need full symbolic execution?
    --
    pFuzzer Program
    [true
    Replaces z with true -- succeeds but does not terminate

  28. Do we need full symbolic execution?
    --✘
    – checks for ','
    – checks for ']'
    pFuzzer Program
    [true1

  29. Do we need full symbolic execution?
    --✔
    pFuzzer Program
    [true]
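
The walk-through on the preceding slides can be approximated with a toy example. This is a sketch under assumptions, not pFuzzer itself: the parser handles only digits, `true`/`false` and lists, and it records comparisons through an explicit wrapper (`TrackedInput`) instead of dynamic taint tracking. All names are hypothetical. When the parser rejects a character, the fuzzer substitutes something the parser compared it to; when the parser runs out of input, the fuzzer appends another character.

```python
import random
import string


class Incomplete(Exception):
    """The parser needed more input: everything so far is a valid prefix."""


class Rejected(Exception):
    """The parser rejected the input at the index given as first argument."""


class TrackedInput:
    """Wraps the input and records every failed comparison as (index, expected)."""

    def __init__(self, text: str):
        self.text = text
        self.comparisons = []

    def matches(self, index: int, expected: str) -> bool:
        if index >= len(self.text):
            raise Incomplete(index)
        if self.text.startswith(expected, index):
            return True
        self.comparisons.append((index, expected))
        return False


def parse_value(inp: TrackedInput, i: int) -> int:
    """Parse one value starting at index i; return the index just after it."""
    for digit in string.digits:
        if inp.matches(i, digit):
            return i + 1
    for keyword in ("true", "false"):
        if inp.matches(i, keyword):
            return i + len(keyword)
    if inp.matches(i, "["):
        i = parse_value(inp, i + 1)
        while inp.matches(i, ","):
            i = parse_value(inp, i + 1)
        if inp.matches(i, "]"):
            return i + 1
    raise Rejected(i)


def run(text: str):
    """Run the parser once; return (status, index, strings expected at index)."""
    inp = TrackedInput(text)
    try:
        end = parse_value(inp, 0)
    except Incomplete:
        return "incomplete", len(text), []
    except Rejected as err:
        index = err.args[0]
        return "rejected", index, [e for i, e in inp.comparisons if i == index]
    if end < len(text):
        return "rejected", end, []  # trailing characters after a complete value
    return "accepted", end, []


def pfuzz(rng: random.Random, max_steps: int = 1000):
    """Grow a valid input step by step, guided by the parser's failed comparisons."""
    text = rng.choice(string.ascii_letters)
    for _ in range(max_steps):
        status, index, expected = run(text)
        if status == "accepted":
            return text
        if status == "incomplete":
            text += rng.choice(string.ascii_letters)  # valid prefix: try one more character
        else:
            # Replace the rejected suffix with something the parser compared it to.
            text = text[:index] + (rng.choice(expected) if expected else "")
    return None  # no complete input found within the step budget
```

For the input `[true1`, `run` reports a rejection at index 5 with the expected strings `,` and `]`, which mirrors the step shown on slide 28.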

  30. Comparing to KLEE and AFL
    The average number of tokens found by KLEE, AFL, and pFuzzer for each token length (Mathis 2019 PLDI)

  31. pFuzzer: Implementations in Python and C
    • Relies on dynamic taint tracking and character comparisons
    • Currently assumes that the source (LLVM bitcode) is available, but this is not a requirement.

  32. DEMO

  33. Longest Generated Input for JSON Parser
    [false ,[{ "o":{ , "[email protected]?BR": 397 [+ ]"S|+|
    4GzCW(C":-94}} ],[false,null]]
    ............................
    nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
    AFL
    KLEE
    pFuzzer
    pFuzzer generates strings with better structure than either AFL or KLEE

  34. Feeding the Results Back to the Grammar Miner
    Program
    Under Test
    Parser-Directed
    Test Generator
    Comparisons
    Tests Dynamic Taints
    Grammar
    Learner
    Test Inputs

    Inputs +
    Equivalence Classes
    Grammar
    Fuzzer

  35. AUTOGRAM: Grammar from Samples

  36. 36
    protected void parseURL(URL u, String spec, int start, int limit) {
        String protocol = u.getProtocol();
        String authority = u.getAuthority();
        String userInfo = u.getUserInfo();
        String host = u.getHost();
        int port = u.getPort();
        int i = 0;
        boolean isUNCName = (start <= limit - 4) &&
                            (spec.charAt(start) == '/') &&
                            (spec.charAt(start + 1) == '/') &&
                            (spec.charAt(start + 2) == '/') &&
                            (spec.charAt(start + 3) == '/');
        if (!isUNCName && (start <= limit - 2) && (spec.charAt(start) == '/') &&
            (spec.charAt(start + 1) == '/')) {
            start += 2;
            i = spec.indexOf('/', start);
            if (i < 0) {
                i = spec.indexOf('?', start);
                if (i < 0) i = limit;
            }
            host = authority = spec.substring(start, i);
            int ind = authority.indexOf('@');
            if (ind != -1) {
                userInfo = authority.substring(0, ind);
                host = authority.substring(ind+1);
            } else userInfo = null;
            if (host != null) {
                if (host.length()>0 && (host.charAt(0) == '[')) {
                    if ((ind = host.indexOf(']')) > 2) {
                        String nhost = host ;
                        host = nhost.substring(0,ind+1);
                        port = -1 ;
                        if (nhost.length() > ind+1) {
                            if (nhost.charAt(ind+1) == ':') {
                                ++ind ;
                                if (nhost.length() > (ind + 1))
                                    port = Integer.parseInt(nhost.substring(ind+1));
                            }
                        }
                    }
                } else {
                    ind = host.indexOf(':');
                    port = -1;
                    if (ind >= 0) {
                        if (host.length() > (ind + 1)) {
                            port = Integer.parseInt(host.substring(ind + 1));
                        }
                        host = host.substring(0, ind);
                    }
                }
            } else host = "";
            start = i;
            if (host == null) host = "";
            ...
            setURL(u, protocol, host, port, authority, userInfo, ...);
    http://admin:pass123@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
    http://www.guardian.co.uk/sports/worldcup#results
    ftp://bob:[email protected]/oss/debian7.iso
    URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
    AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
    PROTOCOL ::= 'http' | 'ftp'
    USERINFO ::= r{[a-z]+} ':' r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= '80'
    PATH ::= r{/[a-z0-9.]*}
    QUERY ::= 'foo=bar&lorem=ipsum'
    REF ::= r{[a-z]+}

  37. 37
    parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http://admin:pass123@www.google.com:80 :spec
    parseURL
    http
    80
    www.google.com
    admin:pass123
    setURL
    :protocol
    :authority
    :port
    :host
    admin
    pass123
    setUserInfo

  38. 38
    parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http
    80
    www.google.com
    admin:pass123
    http://admin:pass123@www.google.com:80 :spec
    setURL
    :protocol
    :authority
    parseURL
    :port
    :host
    admin
    pass123
    setUserInfo
    ftp
    example.ftp.com
    boo:12345
    ftp://boo:12345@example.ftp.com :spec :protocol
    :authority
    :host
    boo
    12345

  39. 39
    parseURL(spec)
    -> setURL(protocol, host, port, authority,…)
    -> setUserInfo(user, password)
    http | ftp
    [80]
    www.google.com
    |example.ftp.com
    :spec
    setURL
    :protocol
    :authority
    parseURL
    :port
    :host
    admin|boo
    pass123|12345
    setUserInfo
    SPEC ::= PROTOCOL '://' AUTHORITY '@' HOST [':' PORT]
    AUTHORITY ::= USER ':' PASSWORD
    USER ::= r{[a-z]+}
    PASSWORD ::= r{[a-z0-9]+}
    HOST ::= r{[a-z.]+}
    PORT ::= r{[0-9]+}
    SPEC
    AUTHORITY
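
The merging step on slides 37 to 39 can be sketched as follows. This is a simplified illustration, not AUTOGRAM's algorithm: the fragments that taint tracking would report are supplied by hand (values taken from the slides), plain substring search stands in for taint information, and the generalization to regular expressions such as `r{[a-z]+}` is left out. `derive_rule` and `mine` are hypothetical names.

```python
# Hand-written stand-ins for what dynamic tainting would report per run:
# the input (:spec) and the variables that received parts of it.
SAMPLES = [
    ("http://admin:pass123@www.google.com:80",
     {"protocol": "http", "authority": "admin:pass123",
      "host": "www.google.com", "port": "80"}),
    ("ftp://boo:12345@example.ftp.com",
     {"protocol": "ftp", "authority": "boo:12345", "host": "example.ftp.com"}),
]


def derive_rule(spec: str, fragments: dict) -> list:
    """Replace the first occurrence of each fragment in `spec` by its nonterminal."""
    nonterminals = {f"<{name}>" for name in fragments}
    rule = [spec]
    for name, value in fragments.items():
        expanded = []
        replaced = False
        for token in rule:
            if replaced or token in nonterminals or value not in token:
                expanded.append(token)
                continue
            before, _, after = token.partition(value)
            expanded += [t for t in (before, f"<{name}>", after) if t]
            replaced = True
        rule = expanded
    return rule


def mine(samples: list) -> dict:
    """Merge the rules and fragment values of all runs into one grammar."""
    grammar = {"<spec>": []}
    for spec, fragments in samples:
        rule = derive_rule(spec, fragments)
        if rule not in grammar["<spec>"]:
            grammar["<spec>"].append(rule)
        for name, value in fragments.items():
            alternatives = grammar.setdefault(f"<{name}>", [])
            if [value] not in alternatives:
                alternatives.append([value])
    return grammar
```

With the two samples above, `<protocol>` ends up with the alternatives `http` and `ftp`, and the port appears only in the rule derived from the first sample, which corresponds to the optional `[':' PORT]` on the slide.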

  40. AUTOGRAM
    • Requires dynamic tainting to track input fragments
    • Works well for recursive-descent parsers
    • Provides strict guarantees on the grammar produced.
    • A generated string from the inferred grammar is guaranteed to be a valid input (under certain assumptions).
    • Relatively heavyweight, and language/runtime specific
    • JVM/LLVM for now
    • Problems with implicit taint propagation and internal calls

  41. What if we relax our constraints? Is taint tracking strictly needed?

  42. DEMO: fuzzingbook: GrammarMining
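
One way to relax the taint-tracking requirement is to trace the local variables of the program and keep those whose string values occur in the input. The sketch below illustrates that idea only; it is not the code from the demo. `parse_url` is a hypothetical toy parser standing in for the program under test, and the length filter is an assumed heuristic to skip single-character separators.

```python
import sys


def parse_url(spec: str):
    """Hypothetical toy parser standing in for the program under test."""
    protocol, rest = spec.split("://", 1)
    authority, _, path = rest.partition("/")
    host, _, port = authority.partition(":")
    return protocol, host, port, "/" + path


def trace_fragments(function, text: str) -> dict:
    """Call function(text) and collect local string variables that occur in text."""
    fragments = {}

    def tracer(frame, event, arg):
        if event == "return":
            for name, value in frame.f_locals.items():
                # String inclusion replaces taint tracking: a variable is
                # assumed to derive from the input if its value occurs in it.
                if (isinstance(value, str) and len(value) > 1
                        and value != text and value in text):
                    fragments[name] = value
        return tracer

    sys.settrace(tracer)
    try:
        function(text)
    finally:
        sys.settrace(None)
    return fragments
```

The price of dropping taints is ambiguity: a value that happens to occur in the input by coincidence is treated as derived from it.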

  43. What can you do with the grammar?
    • An abstract representation of a set of executions
    • Or of the program itself if the set of executions is representative
    • Language agnostic, and low implementation complexity
    • Can be used to identify behavioral changes
    • E.g. Duplicates in Mutation Analysis
    • Fingerprinting programs
    • Clone detection
    • Refactoring

  44. Is input specification sufficient?
    How can we make fuzzing better?

  45. Better Oracles
    • Taint Tracking
    • Concolic Execution
    • Metamorphic testing
    • Reorder elements
    • Delete/Insert/Duplicate idempotent elements
    • Differential testing
    • Carving unit tests
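
Two of these oracles are easy to sketch for a JSON parser. The code below is an illustration under assumptions, not material from the talk: `reorder_keys`, `metamorphic_check` and `differential_check` are hypothetical names, and the metamorphic relation used is that reordering the members of a top-level object must not change the parsed value.

```python
import json
import random


def reorder_keys(document: str, rng: random.Random) -> str:
    """Metamorphic transformation: shuffle the members of a top-level JSON object."""
    items = list(json.loads(document).items())
    rng.shuffle(items)
    return json.dumps(dict(items))


def metamorphic_check(parse, document: str, rng: random.Random) -> bool:
    """Reordering object members must not change what the parser returns."""
    return parse(document) == parse(reorder_keys(document, rng))


def differential_check(parse_a, parse_b, document: str) -> bool:
    """Two implementations must agree on acceptance and on the parsed value."""

    def outcome(parse):
        try:
            return ("ok", parse(document))
        except ValueError:
            return ("error", None)

    return outcome(parse_a) == outcome(parse_b)
```

Neither check needs to know the expected output for a generated input, which is what makes such oracles usable with a fuzzer.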
