Mining Input Grammars from Dynamic Control Flow

Rahul Gopinath
November 10, 2020

FSE 2020

Transcript

  1. Mining Input Grammars from Dynamic Control Flow
    Rahul Gopinath
    Björn Mathis
    Andreas Zeller
    CISPA Helmholtz Center for Information Security

  2. Why do We Need a Grammar?

  4. Traditional Fuzzing
    $ ./fuzz
    [;x1-GPZ+wcckc];,N9J+?#6^6\e?]9lu

    2_%'4GX"0VUB[E/r ~fApu6b8<{%siq8Z

    h.6{V,hr?;{Ti.r3PIxMMMv6{xS^+'Hq!

    AxB"YXRS@!Kd6;wtAMefFWM(`|J_<1~o}

    z3K(CCzRH JIIvHz>_*.\>JrlU32~eGP?

    lR=bF3+;y$3lodQ)KC-i,c{<[~m!]o;{.'}Gj\(X}EtYetrp

    bY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6

    }0|Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU

    )*BiC<),`+t*gkaPq>&]BS6R&j?#tP7iaV}-}`\?[_[Z^LBM

    PG-FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2

    D|vBy!^zkhdf3C5PAkR?V((-%>i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@

    5:dfd45*(7^%5ap\zIyl"'f,$ee,J4Gw:

    cgNKLie3nx9(`efSlg6#[K"@WjhZ}r[Sc

    un&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/6N

    -wyzj/MTd#A;r*(ds./df3r8Odaf?/<#r
    Program
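
The random inputs above can be produced by a very small program. The following is a minimal sketch of such a fuzzer, not the `./fuzz` binary shown on the slide; the function name `fuzz` and its parameters are assumptions made for illustration.

```python
import random


def fuzz(max_length=100, char_start=32, char_range=95, rng=random):
    """Return a random string of up to max_length printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    # Code points char_start .. char_start + char_range - 1 (32..126 by default).
    return ''.join(chr(rng.randrange(char_start, char_start + char_range))
                   for _ in range(length))
```

Feeding such strings to a program exercises its input handling, but almost none of them have any structure.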

  6. Structured Inputs
    $ ./fuzz
    [;x1-GPZ+wcckc];,N9J+?#6^6\e?]9lu

    2_%'4GX"0VUB[E/r ~fApu6b8<{%siq8Z

    h.6{V,hr?;{Ti.r3PIxMMMv6{xS^+'Hq!

    AxB"YXRS@!Kd6;wtAMefFWM(`|J_<1~o}

    z3K(CCzRH JIIvHz>_*.\>JrlU32~eGP?

    lR=bF3+;y$3lodQ)KC-i,c{<[~m!]o;{.'}Gj\(X}EtYetrp

    bY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6

    }0|Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU

    )*BiC<),`+t*gkaPq>&]BS6R&j?#tP7iaV}-}`\?[_[Z^LBM

    PG-FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2

    D|vBy!^zkhdf3C5PAkR?V((-%>i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@

    5:dfd45*(7^%5ap\zIyl"'f,$ee,J4Gw:

    cgNKLie3nx9(`efSlg6#[K"@WjhZ}r[Sc

    un&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/6N

    -wyzj/MTd#A;r*(ds./df3r8Odaf?/<#r
    Parser
    Syntax Error
    Interpreter
    #

  8. def process_input(input):
         try:
             ast = parse(input)
             res = evaluate(ast)
             return res
         except SyntaxError:
             return Error
    The Core

  9. SYNTAX ERROR
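
To make the point concrete, here is a hedged sketch of a `process_input` that can actually run. It uses Python's own `ast` module as a stand-in for `parse`, a small arithmetic evaluator as a stand-in for the core, and a sentinel `ERROR` in place of the slide's `Error`; all three are assumptions, not the program from the talk.

```python
import ast
import operator

ERROR = object()  # hypothetical sentinel standing in for the slide's `Error`

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate(node):
    """Evaluate a parsed arithmetic expression; this plays the role of the core."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](evaluate(node.operand))
    raise SyntaxError('unsupported construct')


def process_input(text):
    try:
        tree = ast.parse(text, mode='eval')   # the parser guards the core
        return evaluate(tree)
    except SyntaxError:
        return ERROR
```

A random string such as `[;x1-GPZ+wcckc]` stops at the parser, while a well-formed input such as `1 / 0` reaches the evaluator and triggers a failure there.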

  10. Use an Input Grammar

  13. <start> := <expr>
    <expr> := <term> ' + ' <expr>
    | <term> ' - ' <expr>
    | <term>
    <term> := <factor> ' * ' <term>
    | <factor> ' / ' <term>
    | <factor>
    <factor> := '+' <factor>
    | '-' <factor>
    | '(' <expr> ')'
    | <integer> '.' <integer>
    | <integer>
    <integer> := <digit> <integer>
    | <digit>
    <digit> := [0-9]
    8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 -
    5 - 6)) * (-(a-+(((+(4))))) - ++4) / +(-+---((5.6
    - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-
    +-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7
    / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -
    +5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 /
    +-(--2 - -++-9.0)))) / 5 * --++090
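
A grammar like this one can be turned into an input producer with a few lines of code. The sketch below is one simple way to do it, assuming the grammar is stored as a Python dictionary; the names `EXPR_GRAMMAR` and `generate` and the depth cut-off are illustrative assumptions, not the generator used for the slide.

```python
import random
import re

EXPR_GRAMMAR = {
    '<start>': ['<expr>'],
    '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
    '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
    '<factor>': ['+<factor>', '-<factor>', '(<expr>)',
                 '<integer>.<integer>', '<integer>'],
    '<integer>': ['<digit><integer>', '<digit>'],
    '<digit>': list('0123456789'),
}

NONTERMINAL = re.compile(r'(<[^<> ]+>)')


def generate(grammar, symbol='<start>', depth=0, max_depth=8, rng=random):
    """Expand `symbol` by choosing random alternatives.

    Past max_depth the last alternative is always taken. In EXPR_GRAMMAR the
    last alternative of every rule leads toward termination, so the expansion
    is finite; this is an assumption about the grammar, not a general rule.
    """
    alternatives = grammar[symbol]
    expansion = alternatives[-1] if depth >= max_depth else rng.choice(alternatives)
    # Splitting on a capturing group keeps the nonterminals in the result.
    parts = NONTERMINAL.split(expansion)
    return ''.join(generate(grammar, part, depth + 1, max_depth, rng)
                   if part in grammar else part
                   for part in parts)
```

Every string produced this way passes the parser, so the test effort goes into the core rather than into syntax errors.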

  15. def process_input(input):
          try:
              ast = parse(input)
              res = evaluate(ast)
              return res
          except SyntaxError:
              return Error
    The Core

  16. Where to Get the Grammar From?

  21. • Reference Specification?
    The standard spec
    Buggy Implementation
    "Extra" Features
    "Accepted" Bugs
    "Be liberal in what you accept, and conservative in what you send" (Postel's Law)

  23. Where to Get a Grammar From?
    • Design Documentation?
    What design documentation?

  27. Example: JSON (https://www.json.org)
    object
        { }
        { members }
    members
        pair
        pair , members
    pair
        string : value
    array
        [ ]
        [ elements ]
    elements
        value
        value , elements
    value
        string
        number
        object
        array
        true
        false
        null
    string
        " "
        " chars "
    chars
        char
        char chars
    char
        UNICODE \ [",\,CTRL]
        \" \\ \/ \b \f \n \r \t
        \u hex hex hex hex
    number
        int
        int frac
        int exp
        int frac exp
    int
        digit
        onenine digits
        - digit
        - onenine digits
    frac
        . digits
    exp
        e digits
    hex
        digit
        A - F
        a - f
    digits
        digit
        digit digits
    e
        e e+ e-
        E E+ E-
    Parsing JSON is a Minefield: http://seriot.ch/

  28. Where to Get a Grammar From?
    Hand-written parsers already encode the grammar

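  45. <start> := <expr>
    <expr> := <term> '+' <expr>
    | <term> '-' <expr>
    | <term>
    <term> := <factor> '*' <term>
    | <factor> '/' <term>
    | <factor>
    <factor> := '+' <factor>
    | '-' <factor>
    | '(' <expr> ')'
    | <integer> '.' <integer>
    | <integer>
    <integer> := <digit> <integer>
    | <digit>
    <digit> := [0-9]

    def start_parse():
        parse_expr()
    def parse_expr():
        parse_term()
        match lookahead():
            case '+': consume(); parse_expr()
            case '-': consume(); parse_expr()
            case _: pass
    def parse_term():
        parse_factor()
        match lookahead():
            case '*': consume(); parse_term()
            case '/': consume(); parse_term()
            case _: pass

    <start> := <expr>
    <expr> := <term> '+' <expr>
    | <term> '-' <expr>
    | <term>
    <term> := <factor> '*' <term>
    | <factor> '/' <term>
    | <factor>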
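
The parser on the slide leaves `lookahead()`, `consume()` and `parse_factor()` undefined. The following sketch fills them in so that the correspondence between parser functions and grammar rules can be tried out; the class `Parser`, the helper `accepts`, and the bodies of `parse_factor` and `parse_integer` are assumptions written to match the grammar, not code from the talk.

```python
DIGITS = '0123456789'


class Parser:
    """Recursive descent recognizer; each method mirrors one grammar rule."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def lookahead(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def consume(self):
        self.pos += 1

    def start_parse(self):            # <start> := <expr>
        self.parse_expr()
        if self.lookahead() is not None:
            raise SyntaxError(f'unexpected input at {self.pos}')

    def parse_expr(self):             # <expr> := <term> ('+' | '-') <expr> | <term>
        self.parse_term()
        if self.lookahead() in ('+', '-'):
            self.consume()
            self.parse_expr()

    def parse_term(self):             # <term> := <factor> ('*' | '/') <term> | <factor>
        self.parse_factor()
        if self.lookahead() in ('*', '/'):
            self.consume()
            self.parse_term()

    def parse_factor(self):           # signs, parentheses, or a number
        c = self.lookahead()
        if c in ('+', '-'):
            self.consume()
            self.parse_factor()
        elif c == '(':
            self.consume()
            self.parse_expr()
            if self.lookahead() != ')':
                raise SyntaxError(f"expected ')' at {self.pos}")
            self.consume()
        else:
            self.parse_integer()
            if self.lookahead() == '.':
                self.consume()
                self.parse_integer()

    def parse_integer(self):          # <integer> := <digit> <integer> | <digit>
        c = self.lookahead()
        if c is None or c not in DIGITS:
            raise SyntaxError(f'expected digit at {self.pos}')
        while self.lookahead() is not None and self.lookahead() in DIGITS:
            self.consume()


def accepts(text):
    try:
        Parser(text).start_parse()
        return True
    except SyntaxError:
        return False
```

Each `parse_*` method calls exactly the methods that its rule names as nonterminals, which is why the call structure of a hand-written parser already encodes the grammar.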

  50. How to Extract This Grammar?
    • Inputs -> Dynamic Control Dependence trees
    • DCD Trees -> Context Free Grammar

  53. Control Dependence Graph
    Statement B is control dependent on A if A determines whether B executes.
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    CDG for parse_csv
    while: determines whether if: executes

  55. Dynamic Control Dependence Tree
    Each statement execution is represented as a separate node
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    CDG for parse_csv
    DCD Tree for call parse_csv()

  57. DCD Tree ~ Parse Tree
    • No tracking beyond input buffer
    • Characters are attached to nodes where they are accessed last
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    "12,"
    '1' '2' ','
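
The two ideas above (one node per executed control-flow scope, and each character attached to the node that read it last) can be sketched with hand-written instrumentation. The `Tracer` class, the `scope` and `read` helpers, and this reworked `parse_csv` are all assumptions made for illustration; they are not how Mimid instruments programs.

```python
from contextlib import contextmanager


class Tracer:
    """Records executed scopes as a tree and the last scope to read each index."""

    def __init__(self, text):
        self.text = text
        self.root = {'name': 'root', 'children': [], 'chars': []}
        self.stack = [self.root]
        self.last_access = {}          # input index -> node that read it last

    @contextmanager
    def scope(self, name):
        node = {'name': name, 'children': [], 'chars': []}
        self.stack[-1]['children'].append(node)
        self.stack.append(node)
        try:
            yield node
        finally:
            self.stack.pop()

    def read(self, i):
        self.last_access[i] = self.stack[-1]   # later reads overwrite earlier ones
        return self.text[i]

    def attach_chars(self):
        for i in sorted(self.last_access):
            self.last_access[i]['chars'].append(self.text[i])
        return self.root


def is_digit(c):
    return c in '0123456789'


def num(t, i):
    with t.scope('num'):
        while i < len(t.text) and is_digit(t.read(i)):
            i += 1
        return i


def comma(t, i):
    with t.scope('comma'):
        assert t.read(i) == ','


def parse_csv(t, i=0):
    with t.scope('parse_csv'):
        iteration = 0
        while i < len(t.text):
            iteration += 1
            with t.scope(f'while:{iteration}'):   # each iteration is its own node
                if is_digit(t.read(i)):
                    with t.scope('if'):
                        i = num(t, i)
                else:
                    with t.scope('else'):
                        comma(t, i)
                        i += 1


def owners(node, path=()):
    """List (path, characters) for every node that owns input characters."""
    path = path + (node['name'],)
    found = [('/'.join(path), ''.join(node['chars']))] if node['chars'] else []
    for child in node['children']:
        found += owners(child, path)
    return found
```

For the input `"12,"`, the comma is first read by `num` as a lookahead and later by `comma`; because only the last access counts, it ends up under `comma`, which is what makes the tree resemble a parse tree.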

  59. Parse tree for parse_expr('9+3/4')
    def is_digit(i): return i in '0123456789'
    def parse_num(s,i):
        n = ''
        while s[i:] and is_digit(s[i]):
            n += s[i]
            i = i +1
        return i,n
    def parse_paren(s, i):
        assert s[i] == '('
        i, v = parse_expr(s, i+1)
        if s[i:] == '': raise Ex(s, i)
        assert s[i] == ')'
        return i+1, v
    def parse_expr(s, i = 0):
        expr, is_op = [], True
        while s[i:]:
            c = s[i]
            if is_digit(c):
                if not is_op: raise Ex(s,i)
                i,num = parse_num(s,i)
                expr.append(num)
                is_op = False
            elif c in ['+', '-', '*', '/']:
                if is_op: raise Ex(s,i)
                expr.append(c)
                is_op, i = True, i + 1
            elif c == '(':
                if not is_op: raise Ex(s,i)
                i, cexpr = parse_paren(s, i)
                expr.append(cexpr)
                is_op = False
            elif c == ')': break
            else: raise Ex(s,i)
        if is_op: raise Ex(s,i)
        return i, expr
    9+3/4

  61. Identifying Compatible Nodes
    Which nodes correspond to the same nonterminal?
    9+3/4

  62. (9 + 1) * 3
    3 * (9 + 1)

  64. 9 + 1
    3 * (9 + 1)

  66. 3 (9 + 1) *
    3 * (9 + 1)

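
One way to approximate the swaps shown on these slides is to exchange the text of two nodes and ask the parser whether both results are still accepted. The sketch below reuses the calc parser from the earlier slide (with `Ex` defined and without spaces in the inputs, since that parser rejects them); `accepts`, `replace_span` and `compatible` are hypothetical helpers, and nodes are represented simply as character spans.

```python
class Ex(Exception):
    """Syntax error raised by the parser (left undefined on the slide)."""


def is_digit(c):
    return c in '0123456789'


def parse_num(s, i):
    n = ''
    while s[i:] and is_digit(s[i]):
        n += s[i]
        i += 1
    return i, n


def parse_paren(s, i):
    assert s[i] == '('
    i, v = parse_expr(s, i + 1)
    if s[i:] == '':
        raise Ex(s, i)
    assert s[i] == ')'
    return i + 1, v


def parse_expr(s, i=0):
    expr, is_op = [], True
    while s[i:]:
        c = s[i]
        if is_digit(c):
            if not is_op: raise Ex(s, i)
            i, num = parse_num(s, i)
            expr.append(num)
            is_op = False
        elif c in ['+', '-', '*', '/']:
            if is_op: raise Ex(s, i)
            expr.append(c)
            is_op, i = True, i + 1
        elif c == '(':
            if not is_op: raise Ex(s, i)
            i, cexpr = parse_paren(s, i)
            expr.append(cexpr)
            is_op = False
        elif c == ')':
            break
        else:
            raise Ex(s, i)
    if is_op: raise Ex(s, i)
    return i, expr


def accepts(s):
    """True if the parser consumes the whole input without an error."""
    try:
        i, _ = parse_expr(s)
        return i == len(s)
    except (Ex, AssertionError):
        return False


def replace_span(text, span, replacement):
    start, end = span
    return text[:start] + replacement + text[end:]


def compatible(input_a, span_a, input_b, span_b):
    """Two nodes are treated as compatible if swapping their text in both
    directions still yields inputs the parser accepts."""
    text_a = input_a[span_a[0]:span_a[1]]
    text_b = input_b[span_b[0]:span_b[1]]
    return (accepts(replace_span(input_a, span_a, text_b)) and
            accepts(replace_span(input_b, span_b, text_a)))
```

With this check, the `9+1` inside `(9+1)*3` and the `3` in `3*(9+1)` are interchangeable, while the `+` and the `3` are not.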

  68. 3*(1)
    1
    :=

    :=

  70. :=
    :=
    |
    :=
    |
    |
    |
    :=
    :=
    |
    :=
    := '3' | '1'
    := '(' ')'
    :=
    := '*'

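
Once nodes carry nonterminal names, reading a grammar off a tree is mechanical: every inner node contributes one rule whose right-hand side is the sequence of its children. The sketch below shows this step on a tree for `3*(1)`; the tree encoding `(name, children)` and the node names `<expr>`, `<num>`, `<op>` and `<paren>` are assumptions chosen for illustration.

```python
def tree_to_rules(tree, grammar=None):
    """Collect one alternative per inner node: name -> tuple of child names."""
    if grammar is None:
        grammar = {}
    name, children = tree
    if not children:                   # a leaf is a terminal symbol
        return grammar
    expansion = tuple(child[0] for child in children)
    alternatives = grammar.setdefault(name, [])
    if expansion not in alternatives:  # identical expansions are merged
        alternatives.append(expansion)
    for child in children:
        tree_to_rules(child, grammar)
    return grammar


# Hypothetical tree for the input 3*(1).
TREE = ('<expr>', [
    ('<num>', [('3', [])]),
    ('<op>', [('*', [])]),
    ('<paren>', [
        ('(', []),
        ('<expr>', [('<num>', [('1', [])])]),
        (')', []),
    ]),
])
```

Calling `tree_to_rules` on further trees with the same `grammar` dictionary accumulates alternatives, so more sample inputs yield a more complete grammar.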

  71. Generalization of Lexical Tokens

  73. Widening Lexical Tokens with Active Learning
    := 19458790
    | 809451
    | 243
    | 48095284094435
    := 1.0043
    | 34.343
    | 232.232
    | 2988.343
    := ""
    "dfa 39&*("
    "0989"
    "0._3"
    := [0-9]+
    := [0-9]+[.][0-9]+
    := "[:alphanum:]+"

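
A hedged sketch of the widening step: for a token observed in an input, try a few candidate character classes, substitute random members of each class for the token, and keep the first class for which the program accepts every substitution. The candidate list, the function `widen`, and the toy acceptor `accepts` are assumptions for illustration and are not taken from Mimid.

```python
import random
import re
import string

# Candidate character classes, most general first (an assumed, fixed list).
CANDIDATES = [
    ('[0-9a-zA-Z]+', string.ascii_letters + string.digits),
    ('[0-9]+', string.digits),
]


def widen(text, span, accepts, samples=20, rng=random):
    """Return a regular expression generalizing the token text[span]."""
    start, end = span
    token = text[start:end]
    for pattern, alphabet in CANDIDATES:
        if not re.fullmatch(pattern, token):
            continue                       # the observed token must fit the class
        trials = (''.join(rng.choice(alphabet) for _ in range(rng.randint(1, 8)))
                  for _ in range(samples))
        # Each substitution is one membership query to the program.
        if all(accepts(text[:start] + trial + text[end:]) for trial in trials):
            return pattern
    return re.escape(token)                # no class fits: keep the literal


def accepts(text):
    """Hypothetical program under test: digit runs joined by operators."""
    return re.fullmatch(r'[0-9]+([-+*/][0-9]+)*', text) is not None
```

Because only a finite number of samples is tried, a class accepted this way is a guess that the queries did not refute, not a proof.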

  74. calc.py: Recovered Arithmetic Grammar
    :=
    :=
    |
    :=
    |
    := '(' ')'
    |
    := '*' | '+' | '-' | '/'
    :=
    |
    : [0-9]

  75. ::=
    ::= '"'
    | '['
    | '{'
    |
    | 'true'
    | 'false'
    | 'null'
    ::= +
    | + 'e' +
    ::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
    ::= * '"'
    ::= ']'
    | (',')* ']'
    | ( ',' )+ (',' )* ']'
    ::= '}'
    | ( '"' ':' ',' )*
    '"' ':' '}'
    ::= ' ' | '!' | '#' | '$' | '%' | '&' | '''
    | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';'
    | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^'
    | '_', ''',| '{' | '|' | '}' | '~'
    | '[A-Za-z0-9]'
    | '\'
    ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'
    # parse out a key/value pair
    elif c == '"':
        key = _from_json_string(stm)
        stm.skipspaces()
        c = stm.next()
        if c != ':':
            raise JSONError(E_COLON, stm, stm.pos)
        stm.skipspaces()
        val = _from_json_raw(stm)
        result[key] = val
        expect_key = 0
        continue
    raise JSONError(E_MALF, stm, stm.pos)
    def _from_json_raw(stm):
        while True:
            stm.skipspaces()
            c = stm.peek()
            if c == '"':
                return _from_json_string(stm)
            elif c == '{':
                return _from_json_dict(stm)
            elif c == '[':
                return _from_json_list(stm)
            elif c == 't':
                return _from_json_fixed(stm, 'true', True, E_BOOL)
            elif c == 'f':
                return _from_json_fixed(stm, 'false', False, E_BOOL)
            elif c == 'n':
                return _from_json_fixed(stm, 'null', None, E_NULL)
            elif c in NUMSTART:
                return _from_json_number(stm)
            raise JSONError(E_MALF, stm, stm.pos)
    def from_json(data):
        stm = JSONStream(data)
        return _from_json_raw(stm)
    microjson.py: Recovered JSON grammar

  76. • Languages: C, Python
    • Style: Ad hoc, Textbook, Parser Combinators
    • Evaluation: Precision and Recall

  77. Evaluation: Subjects
    Subjects        Language   Grammar   Style
    calc.py         Python     CFG       textbook parser
    mathexpr.py     Python     CFG       textbook OO
    cgidecode.py    Python     RG        automata
    urlparse.py     Python     RG        ad hoc
    microjson.py    Python     CFG       optimized parser
    parseclisp.py   Python     CFG       parser combinator
    jsonparser.c    C          CFG       optimized parser
    tiny.c          C          CFG       lexer + parser
    mjs.c           C          CFG       lexer + parser

  78. Evaluation: Recall
    Inputs generated by inferred grammar that were accepted by the program
    Subjects        AUTOGRAM   Mimid
    calc.py         36.5 %     100.0 %
    mathexpr.py     30.3 %     87.5 %
    cgidecode.py    47.8 %     100.0 %
    urlparse.py     100.0 %    100.0 %
    microjson.py    53.8 %     98.7 %
    parseclisp.py   100.0 %    99.3 %
    jsonparser.c    n/a        100.0 %
    tiny.c          n/a        100.0 %
    mjs.c           n/a        95.4 %

  79. Evaluation: Precision
    Inputs generated by golden grammar that were accepted by the inferred grammar parser
    Subjects        AUTOGRAM   Mimid
    calc.py         0.0 %      100.0 %
    mathexpr.py     0.0 %      92.7 %
    cgidecode.py    35.1 %     100.0 %
    urlparse.py     100.0 %    96.4 %
    microjson.py    0.0 %      93.0 %
    parseclisp.py   37.6 %     80.6 %
    jsonparser.c    n/a        83.8 %
    tiny.c          n/a        92.8 %
    mjs.c           n/a        95.9 %
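
Both tables report the same kind of number: the percentage of inputs from one source that an acceptor on the other side takes. A minimal sketch of that computation follows; `acceptance_rate` and the toy acceptor `toy_accepts` are assumed names, and the evaluation in the talk used generated inputs and real parsers rather than this toy.

```python
import re


def acceptance_rate(inputs, accepts):
    """Percentage of `inputs` for which `accepts` returns True."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError('need at least one input')
    accepted = sum(1 for text in inputs if accepts(text))
    return 100.0 * accepted / len(inputs)


def toy_accepts(text):
    """Hypothetical acceptor: digit runs joined by arithmetic operators."""
    return re.fullmatch(r'[0-9]+([-+*/][0-9]+)*', text) is not None
```

For the first table, `inputs` would come from the inferred grammar and `accepts` would be the program; for the second, `inputs` would come from the golden grammar and `accepts` would be a parser built from the inferred grammar.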

  80. Mimid
    • Grammar generation with lightweight instrumentation
    • Can be applied to multiple styles of parsers
    • Generates accurate and readable grammars
    • Replication package is available as a Jupyter notebook

  81. https://github.com/vrthra/mimid
