
Mining Input Grammars from Dynamic Control Flow

FSE 2020

Rahul Gopinath

November 10, 2020

Transcript

  1. Mining Input Grammars from Dynamic Control Flow. Rahul Gopinath, Björn Mathis, Andreas Zeller. CISPA Helmholtz Center for Information Security
  2. Why do We Need a Grammar?

  3. Traditional Fuzzing Program

  4. Traditional Fuzzing $ ./fuzz [;x1-GPZ+wcckc];,N9J+?#6^6\e?]9lu
 2_%'4GX"0VUB[E/r ~fApu6b8<{%siq8Z
 h.6{V,hr?;{Ti.r3PIxMMMv6{xS^+'Hq!
 AxB"YXRS@!Kd6;wtAMefFWM(`|J_<1~o}
 z3K(CCzRH

    JIIvHz>_*.\>JrlU32~eGP?
 lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{'
 )KC-i,c{<[~m!]o;{.'}Gj\(X}EtYetrp
 bY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6
 }0|Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU
 )*BiC<),`+t*gka<W=Z.%T5WGHZpI30D<
 Pq>&]BS6R&j?#tP7iaV}-}`\?[_[Z^LBM
 PG-FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2
 D|vBy!^zkhdf3C5PAkR?V((-%><hn|3='
 i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@
 5:dfd45*(7^%5ap\zIyl"'f,$ee,J4Gw:
 cgNKLie3nx9(`efSlg6#[K"@WjhZ}r[Sc
 un&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/6N
 -wyzj/MTd#A;r*(ds./df3r8Odaf?/<#r Program ✘
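
A minimal sketch of the kind of random generator behind output like the above. The name `fuzz`, the alphabet, and the length limit are assumptions for illustration; they are not taken from the talk.

```python
import random
import string

# Assumed alphabet: printable ASCII without whitespace.
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def fuzz(max_length=100, rng=random):
    """Return a random string of 1..max_length characters from ALPHABET."""
    length = rng.randrange(1, max_length + 1)
    return "".join(rng.choice(ALPHABET) for _ in range(length))

if __name__ == "__main__":
    print(fuzz())
```

Such input has no structure at all, which is why a program that expects structured input will usually reject it early.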
  5. Structured Inputs $ ./fuzz [same random output as on slide 4] Interpreter
  6. Structured Inputs $ ./fuzz [same random output as on slide 4] Parser Syntax Error Interpreter
  7. def process_input(input):
         try:
             ast = parse(input)
             res = evaluate(ast)
             return res
         except SyntaxError:
             return Error
  8. def process_input(input):
         try:
             ast = parse(input)
             res = evaluate(ast)
             return res
         except SyntaxError:
             return Error

     The Core
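
To make the point concrete, here is a runnable sketch of `process_input`. It is not the program from the talk: as a stand-in, `parse` uses Python's own `ast` module and `evaluate` handles only numeric arithmetic. The string sentinel `ERROR` is also an assumption, since the slide returns an unspecified `Error`.

```python
import ast
import operator

ERROR = "Error"  # assumed sentinel for the slide's `Error`

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

def parse(text):
    """Stand-in parser: raises SyntaxError on malformed input."""
    return ast.parse(text, mode="eval").body

def evaluate(node):
    """The core logic; only reached when parsing succeeded."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = evaluate(node.operand)
        return -value if isinstance(node.op, ast.USub) else value
    raise SyntaxError("unsupported construct")

def process_input(text):
    try:
        tree = parse(text)
        return evaluate(tree)
    except SyntaxError:
        return ERROR
```

With this sketch, `process_input("1 + 2 * 3")` reaches `evaluate`, while the random string from the fuzzing slides is rejected by `parse` and never exercises the core.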
  9. None
  10. SYNTAX ERROR

  11. Use an Input Grammar

  12. 9

  13. <start>   := <expr>
      <expr>    := <term> ' + ' <expr> | <term> ' - ' <expr> | <term>
      <term>    := <factor> ' * ' <term> | <factor> ' / ' <term> | <factor>
      <factor>  := '+' <factor> | '-' <factor> | '(' <expr> ')' | <integer> '.' <integer> | <integer>
      <integer> := <digit> <integer> | <digit>
      <digit>   := [0-9]
  15. Input generated from the grammar on slide 13:

      8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4))))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090
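
A minimal sketch of how such inputs can be produced from the expression grammar. The dictionary encoding, the function name `generate`, and the depth limit are assumptions for illustration; past the depth limit the sketch always takes the last alternative of a rule, which in this grammar is the least recursive one.

```python
import random

# The expression grammar from the slides, encoded as a dict (assumed encoding).
GRAMMAR = {
    "<start>":   [["<expr>"]],
    "<expr>":    [["<term>", " + ", "<expr>"], ["<term>", " - ", "<expr>"], ["<term>"]],
    "<term>":    [["<factor>", " * ", "<term>"], ["<factor>", " / ", "<term>"], ["<factor>"]],
    "<factor>":  [["+", "<factor>"], ["-", "<factor>"], ["(", "<expr>", ")"],
                  ["<integer>", ".", "<integer>"], ["<integer>"]],
    "<integer>": [["<digit>", "<integer>"], ["<digit>"]],
    "<digit>":   [[d] for d in "0123456789"],
}

def generate(symbol="<start>", depth=0, max_depth=10, rng=random):
    """Expand `symbol` into a random string derived from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        alternatives = alternatives[-1:]  # force termination
    chosen = rng.choice(alternatives)
    return "".join(generate(s, depth + 1, max_depth, rng) for s in chosen)

if __name__ == "__main__":
    print(generate())
```

Every string produced this way is syntactically valid with respect to the grammar, so it gets past the parser.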
  16. def process_input(input):
          try:
              ast = parse(input)
              res = evaluate(ast)
              return res
          except SyntaxError:
              return Error
  17. def process_input(input):
          try:
              ast = parse(input)
              res = evaluate(ast)
              return res
          except SyntaxError:
              return Error

      The Core
  18. Where to Get the Grammar From?

  19. •Reference Specification?

  20. The standard spec •Reference Specification?

  21. The standard spec Buggy Implementation •Reference Specification?

  22. The standard spec Buggy Implementation "Extra" Features •Reference Specification?

  23. The standard spec Buggy Implementation "Extra" Features "Accepted" Bugs "Be liberal in what you accept, and conservative in what you send" (Postel's Law) •Reference Specification?
  24. Where to Get a Grammar From? • Design Documentation?

  25. Where to Get a Grammar From? • Design Documentation? What design documentation?
  26. https://www.json.org

  27. https://www.json.org

      object   := { } | { members }
      members  := pair | pair , members
      pair     := string : value
      array    := [ ] | [ elements ]
      elements := value | value , elements
      value    := string | number | object | array | true | false | null
      string   := " " | " chars "
      chars    := char | char chars
      char     := UNICODE \ [",\,CTRL] | \" | \\ | \/ | \b | \f | \n | \r | \t | \u hex hex hex hex
      number   := int | int frac | int exp | int frac exp
      int      := digit | onenine digits | - digit | - onenine digits
      frac     := . digits
      exp      := e digits
      hex      := digit | A - F | a - f
      digits   := digit | digit digits
      e        := e | e+ | e- | E | E+ | E-
  28. Example: JSON [grammar as on slide 27] https://www.json.org
  29. Example: JSON [grammar as on slide 27] https://www.json.org Parsing JSON is a Minefield: http://seriot.ch/
  30. Where to Get a Grammar From? Hand-written parsers already encode the grammar
  31. <start>   := <expr>
      <expr>    := <term> '+' <expr> | <term> '-' <expr> | <term>
      <term>    := <factor> '*' <term> | <factor> '/' <term> | <factor>
      <factor>  := '+' <factor> | '-' <factor> | '(' <expr> ')' | <integer> '.' <integer> | <integer>
      <integer> := <digit> <integer> | <digit>
      <digit>   := [0-9]

      def start_parse():
          parse_expr()

      def parse_expr():
          parse_term()
          match lookahead():
              case '+': consume(); parse_expr()
              case '-': consume(); parse_expr()
              default: pass

      def parse_term():
          parse_factor()
          match lookahead():
              case '*': consume(); parse_term()
              case '/': consume(); parse_term()
              default: pass
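
The slide's parser is pseudocode. A runnable sketch of the same recursive descent structure could look as follows; the class `Parser`, the helper `accepts`, and the bodies of `parse_factor` and `parse_integer` (not shown on the slide) are assumptions that follow the grammar rule by rule.

```python
class Parser:
    """Recognizer for the expression grammar; one method per nonterminal."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def lookahead(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def consume(self):
        self.pos += 1

    def parse_start(self):
        self.parse_expr()
        if self.lookahead() is not None:
            raise SyntaxError(f"unexpected input at {self.pos}")

    def parse_expr(self):   # <expr> := <term> '+' <expr> | <term> '-' <expr> | <term>
        self.parse_term()
        if self.lookahead() in ('+', '-'):
            self.consume()
            self.parse_expr()

    def parse_term(self):   # <term> := <factor> '*' <term> | <factor> '/' <term> | <factor>
        self.parse_factor()
        if self.lookahead() in ('*', '/'):
            self.consume()
            self.parse_term()

    def parse_factor(self):
        c = self.lookahead()
        if c in ('+', '-'):
            self.consume()
            self.parse_factor()
        elif c == '(':
            self.consume()
            self.parse_expr()
            if self.lookahead() != ')':
                raise SyntaxError(f"expected ')' at {self.pos}")
            self.consume()
        else:
            self.parse_integer()
            if self.lookahead() == '.':
                self.consume()
                self.parse_integer()

    def parse_integer(self):  # <integer> := <digit> <integer> | <digit>
        c = self.lookahead()
        if c is None or c not in "0123456789":
            raise SyntaxError(f"expected digit at {self.pos}")
        while self.lookahead() is not None and self.lookahead() in "0123456789":
            self.consume()

def accepts(text):
    try:
        Parser(text).parse_start()
        return True
    except SyntaxError:
        return False
```

Each method mirrors one grammar rule, which is the sense in which a hand-written parser already encodes the grammar.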
  50. How to Extract This Grammar?

  51. How to Extract This Grammar? • Inputs -> Dynamic Control Dependence trees
  52. How to Extract This Grammar? • Inputs -> Dynamic Control Dependence trees • DCD Trees -> Context Free Grammar
  53. Control Dependence Graph: Statement B is control dependent on A if A determines whether B executes.

      def parse_csv(s, i):
          while s[i:]:
              if is_digit(s[i]):
                  n, j = num(s[i:])
                  i = i + j
              else:
                  comma(s[i])
                  i += 1

  55. CDG for parse_csv [code as on slide 53]: the while: determines whether the if: executes
  57. Dynamic Control Dependence Tree [parse_csv code and CDG as on the preceding slides]: Each statement execution is represented as a separate node. DCD Tree for call parse_csv()
  59. DCD Tree ~ Parse Tree, for the input "12," with characters '1' '2' ','
      • No tracking beyond input buffer
      • Characters are attached to nodes where they are accessed last
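
The "accessed last" rule can be illustrated with a hand-instrumented version of `parse_csv`. This is a sketch under assumptions, not Mimid's instrumentation: `Node`, `Tracer`, `scope`, `read`, `to_tree`, and `trace` are hypothetical names, and the scopes are inserted manually rather than by a source transformation.

```python
from contextlib import contextmanager

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent is not None:
            parent.children.append(self)

class Tracer:
    def __init__(self, text):
        self.text, self.root, self.current = text, None, None
        self.last_access = {}  # input index -> node that read it last

    @contextmanager
    def scope(self, name):
        # One node per *execution* of a call, loop iteration, or branch.
        node = Node(name, self.current)
        if self.root is None:
            self.root = node
        self.current = node
        try:
            yield node
        finally:
            self.current = node.parent

    def read(self, i):
        self.last_access[i] = self.current
        return self.text[i]

def is_digit(c):
    return c in "0123456789"

def num(t, i):
    with t.scope("num"):
        j = i
        while j < len(t.text) and is_digit(t.read(j)):
            j += 1
        return j - i

def comma(t, i):
    with t.scope("comma"):
        if t.read(i) != ",":
            raise SyntaxError(f"expected ',' at {i}")

def parse_csv(t):
    with t.scope("parse_csv"):
        i = 0
        while i < len(t.text):
            with t.scope("while"):
                if is_digit(t.read(i)):
                    with t.scope("if"):
                        i += num(t, i)
                else:
                    with t.scope("else"):
                        comma(t, i)
                        i += 1

def to_tree(node, owned):
    """Return (first_index, (name, children)), or None if no input is owned."""
    parts = list(owned.get(node, []))
    for child in node.children:
        sub = to_tree(child, owned)
        if sub is not None:
            parts.append(sub)
    if not parts:
        return None
    parts.sort(key=lambda p: p[0])
    return parts[0][0], (node.name, [p[1] for p in parts])

def trace(text):
    t = Tracer(text)
    parse_csv(t)
    owned = {}
    for i, node in t.last_access.items():
        owned.setdefault(node, []).append((i, text[i]))
    result = to_tree(t.root, owned)
    return result[1] if result else None
```

For the input `"12,"`, `num` reads the comma as lookahead, but `comma` reads it later, so the comma ends up under the `comma` node while `'1'` and `'2'` stay under `num`.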
  60. def is_digit(i):
          return i in '0123456789'

      def parse_num(s, i):
          n = ''
          while s[i:] and is_digit(s[i]):
              n += s[i]
              i = i + 1
          return i, n

      def parse_paren(s, i):
          assert s[i] == '('
          i, v = parse_expr(s, i+1)
          if s[i:] == '': raise Ex(s, i)
          assert s[i] == ')'
          return i+1, v

      def parse_expr(s, i = 0):
          expr, is_op = [], True
          while s[i:]:
              c = s[i]
              if is_digit(c):
                  if not is_op: raise Ex(s, i)
                  i, num = parse_num(s, i)
                  expr.append(num)
                  is_op = False
              elif c in ['+', '-', '*', '/']:
                  if is_op: raise Ex(s, i)
                  expr.append(c)
                  is_op, i = True, i + 1
              elif c == '(':
                  if not is_op: raise Ex(s, i)
                  i, cexpr = parse_paren(s, i)
                  expr.append(cexpr)
                  is_op = False
              elif c == ')':
                  break
              else:
                  raise Ex(s, i)
          if is_op: raise Ex(s, i)
          return i, expr

      9+3/4: Parse tree for parse_expr('9+3/4')
  62. [calc.py code as on slide 60] 9+3/4: Identifying Compatible Nodes. Which nodes correspond to the same nonterminal?
  64. 3 * (9 + 1)

  65. 3 * (9 + 1)

  66. (9 + 1) * 3 3 * (9 + 1)

  67. (9 + 1) * 3 3 * (9 + 1)

  68. 3 * (9 + 1)

  69. 3 * (9 + 1)

  70. 9 + 1 3 * (9 + 1)

  71. 9 + 1 3 * (9 + 1)

  72. 3 * (9 + 1)

  73. 3 * (9 + 1)

  74. 3 (9 + 1) * 3 * (9 + 1)

  75. 3 (9 + 1) * 3 * (9 + 1)
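
The swaps on these slides suggest a simple test: exchange the input fragments of two nodes and ask the program whether the result is still accepted. The sketch below is a simplification under assumptions. The names `swap`, `compatible`, and `accepts` are hypothetical, Python's `ast` parser stands in for the program under test, and spans are given by hand rather than read from a tree.

```python
import ast

def accepts(text):
    """Stand-in oracle for the program under test (an assumption)."""
    try:
        ast.parse(text, mode="eval")
        return True
    except SyntaxError:
        return False

def swap(text, a, b):
    """Exchange two non-overlapping (start, end) spans of `text`."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return text[:a0] + text[b0:b1] + text[a1:b0] + text[a0:a1] + text[b1:]

def compatible(text, a, b, oracle=accepts):
    """Two nodes are treated as compatible if swapping keeps the input valid."""
    return oracle(swap(text, a, b))

if __name__ == "__main__":
    text = "3 * (9 + 1)"
    print(swap(text, (0, 1), (4, 11)), compatible(text, (0, 1), (4, 11)))
    print(swap(text, (2, 3), (4, 11)), compatible(text, (2, 3), (4, 11)))
```

Swapping `3` with `(9 + 1)` gives `(9 + 1) * 3`, which is accepted, so those nodes can share a nonterminal; swapping `*` with `(9 + 1)` gives `3 (9 + 1) *`, which is rejected.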

  76. 3*(1) 1

  77. 3*(1) 1

  78. 3*(1) 1 <parse_expr> := <while 1:1> <while 1:0> <while 1:1>

  79. 3*(1) 1 <parse_expr> := <while 1:1> <while 1:0> <while 1:1>

    <while 1:1> <parse_expr> :=
  80. <parse_expr>  := <while 1:1> <while 1:0> <while 1:1>
                     | <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1>
                     | <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1>
                     | <while 1:1>
      <while 1:1>   := <if 1:1>
      <if 1:1>      := <parse_num> | <parse_paren>
      <parse_num>   := <is_digit>
      <is_digit>    := '3' | '1'
      <parse_paren> := '(' <parse_expr> ')'
      <while 1:0>   := <if 1:0>
      <if 1:0>      := '*'
  81. <parse_expr>  := <while_s>
      <while_s>     := <while_1:1> <while_1:0> <while_s> | <while_1:1>

      [shown next to the enumerated <parse_expr> alternatives and the remaining rules of slide 80]
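
The step from the enumerated alternatives to the recursive `<while_s>` rule can be sketched as follows. This is a deliberately narrow illustration, not Mimid's algorithm: the hypothetical function `generalize_loop` only recognizes alternatives of the shape X (Y X)*.

```python
def generalize_loop(alternatives, loop_symbol="<while_s>"):
    """Replace alternatives of the shape X (Y X)* by a recursive rule.

    Returns a dict with the new rule, or None if the shape does not match.
    """
    alternatives = [tuple(a) for a in alternatives]
    shortest = min(alternatives, key=len)
    if len(shortest) != 1:
        return None
    unit = shortest[0]
    separators = {a[1] for a in alternatives if len(a) > 1}
    if len(separators) != 1:
        return None
    sep = separators.pop()
    for alt in alternatives:
        if len(alt) % 2 == 0:
            return None
        # Even positions must hold the unit, odd positions the separator.
        if any(s != (unit if k % 2 == 0 else sep) for k, s in enumerate(alt)):
            return None
    return {loop_symbol: [[unit, sep, loop_symbol], [unit]]}

if __name__ == "__main__":
    A, B = "<while 1:1>", "<while 1:0>"
    observed = [[A, B, A], [A, B, A, B, A, B, A], [A, B, A, B, A], [A]]
    print(generalize_loop(observed))
```

The recursive rule accepts any number of loop iterations, including counts that never appeared in the sample inputs.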
  82. Generalization of Lexical Tokens

  83. Widening Lexical Tokens with Active Learning

      <int>     := 19458790 | 809451 | 243 | 48095284094435
      <decimal> := 1.0043 | 34.343 | 232.232 | 2988.343
      <string>  := "" | "dfa 39&*(" | "0989" | "0._3"
  84. Widening Lexical Tokens with Active Learning

      <int>     := 19458790 | 809451 | 243 | 48095284094435
      <decimal> := 1.0043 | 34.343 | 232.232 | 2988.343
      <string>  := "" | "dfa 39&*(" | "0989" | "0._3"

      <int>     := [0-9]+
      <decimal> := [0-9]+[.][0-9]+
      <string>  := "[:alphanum:]+"
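
One way to picture the widening step: replace characters of an observed token by members of a candidate character class and ask the program whether each variant is still accepted. The sketch below rests on assumptions. `widen_token`, `CLASSES`, and the toy oracle are hypothetical, and the real procedure may probe differently.

```python
import re
import string

# Candidate classes, widest first (an assumed, very small selection).
CLASSES = [
    ("[A-Za-z0-9]", string.ascii_letters + string.digits),
    ("[0-9]", string.digits),
]

def widen_token(prefix, token, suffix, accepts, classes=CLASSES):
    """Return a widened pattern for `token`, or the quoted literal token."""
    for pattern, chars in classes:
        # Substitute every position, then try a single char and a longer token.
        probes = [prefix + token[:i] + c + token[i + 1:] + suffix
                  for i in range(len(token)) for c in chars]
        probes += [prefix + c + suffix for c in chars]
        probes += [prefix + token + c + suffix for c in chars]
        if all(accepts(p) for p in probes):
            return pattern + "+"
    return repr(token)

def toy_accepts(text):
    """Toy oracle standing in for the program: <int> '+' <int>."""
    return re.fullmatch(r"[0-9]+\+[0-9]+", text) is not None

if __name__ == "__main__":
    print(widen_token("", "243", "+1", toy_accepts))
    print(widen_token("243", "+", "1", toy_accepts))
```

The program itself acts as the teacher here: a class is adopted only if every probe built from it is accepted.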
  85. calc.py [code as on slide 60] and the Recovered Arithmetic Grammar:

      <START>            := <parse_expr.0-0-c>
      <parse_expr.0-0-c> := <parse_expr.0-1-s><parse_expr.0> | <parse_expr.0>
      <parse_expr.0-1-s> := <parse_expr.0><parse_expr.0-2> | <parse_expr.0><parse_expr.0-2><parse_expr.0-1-s>
      <parse_expr.0>     := '(' <parse_expr.0-0-c> ')' | <parse_num.0-1-s>
      <parse_expr.0-2>   := '*' | '+' | '-' | '/'
      <parse_num.0-1-s>  := <is_digit.0-0-c> | <is_digit.0-0-c><parse_num.0-1-s>
      <is_digit.0-0-c>   := [0-9]
  86. Recovered JSON grammar

      <START>         ::= <json_raw>
      <json_raw>      ::= '"' <json_string'> | '[' <json_list'> | '{' <json_dict'> | <json_number'> | 'true' | 'false' | 'null'
      <json_number'>  ::= <json_number>+ | <json_number>+ 'e' <json_number>+
      <json_number>   ::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
      <json_string'>  ::= <json_string>* '"'
      <json_list'>    ::= ']' | <json_raw> (','<json_raw>)* ']' | ( ',' <json_raw>)+ (',' <json_raw>)* ']'
      <json_dict'>    ::= '}' | ( '"' <json_string'> ':' <json_raw> ',' )* '"'<json_string'> ':' <json_raw> '}'
      <json_string>   ::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_' | ''' | '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\' <decode_escape>
      <decode_escape> ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'

      microjson.py

              # parse out a key/value pair
              elif c == '"':
                  key = _from_json_string(stm)
                  stm.skipspaces()
                  c = stm.next()
                  if c != ':':
                      raise JSONError(E_COLON, stm, stm.pos)
                  stm.skipspaces()
                  val = _from_json_raw(stm)
                  result[key] = val
                  expect_key = 0
                  continue
              raise JSONError(E_MALF, stm, stm.pos)

      def _from_json_raw(stm):
          while True:
              stm.skipspaces()
              c = stm.peek()
              if c == '"':
                  return _from_json_string(stm)
              elif c == '{':
                  return _from_json_dict(stm)
              elif c == '[':
                  return _from_json_list(stm)
              elif c == 't':
                  return _from_json_fixed(stm, 'true', True, E_BOOL)
              elif c == 'f':
                  return _from_json_fixed(stm, 'false', False, E_BOOL)
              elif c == 'n':
                  return _from_json_fixed(stm, 'null', None, E_NULL)
              elif c in NUMSTART:
                  return _from_json_number(stm)
              raise JSONError(E_MALF, stm, stm.pos)

      def from_json(data):
          stm = JSONStream(data)
          return _from_json_raw(stm)
  87. Evaluation

  88. • Languages: C, Python
      • Style: Ad hoc, Textbook, Parser Combinators
      • Evaluation: Precision and Recall
  89. Evaluation: Subjects

      Subjects       Language  Grammar  Style
      calc.py        Python    CFG      textbook parser
      mathexpr.py    Python    CFG      textbook OO
      cgidecode.py   Python    RG       automata
      urlparse.py    Python    RG       ad hoc
      microjson.py   Python    CFG      optimized parser
      parseclisp.py  Python    CFG      parser combinator
      jsonparser.c   C         CFG      optimized parser
      tiny.c         C         CFG      lexer + parser
      mjs.c          C         CFG      lexer + parser
  90. Evaluation: Recall

      Subjects       AUTOGRAM  Mimid
      calc.py         36.5 %   100.0 %
      mathexpr.py     30.3 %    87.5 %
      cgidecode.py    47.8 %   100.0 %
      urlparse.py    100.0 %   100.0 %
      microjson.py    53.8 %    98.7 %
      parseclisp.py  100.0 %    99.3 %
      jsonparser.c      n/a    100.0 %
      tiny.c            n/a    100.0 %
      mjs.c             n/a     95.4 %

      Inputs generated by inferred grammar that were accepted by the program
  91. Evaluation: Precision

      Subjects       AUTOGRAM  Mimid
      calc.py          0.0 %   100.0 %
      mathexpr.py      0.0 %    92.7 %
      cgidecode.py    35.1 %   100.0 %
      urlparse.py    100.0 %    96.4 %
      microjson.py     0.0 %    93.0 %
      parseclisp.py   37.6 %    80.6 %
      jsonparser.c      n/a     83.8 %
      tiny.c            n/a     92.8 %
      mjs.c             n/a     95.9 %

      Inputs generated by golden grammar that were accepted by the inferred grammar parser
  92. Mimid
      • Grammar generation with lightweight instrumentation
      • Can be applied to multiple styles of parsers
      • Generates accurate and readable grammars
      • Replication package is available as a Jupyter notebook
  93. https://github.com/vrthra/mimid
