Upgrade to Pro — share decks privately, control downloads, hide ads and more …

How to Talk to Strange Programs and Find Bugs

How to Talk to Strange Programs and Find Bugs

Talk at ASESS'24

Rahul Gopinath

February 01, 2024
Tweet

More Decks by Rahul Gopinath

Other Decks in Research

Transcript

  1. 2 How to Talk to Strange Programs and Find Bugs

    Rahul Gopinath https://rahul.gopinath.org [email protected] @[email protected] i.e., when the input speci fi cation is not known
  2. 6

  3. 8 Input ✓ ✘ Testing @app.route('/admin') def admin(): username =

    request.cookies.get("username") if not username: return {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500
  4. @app.route('/admin') def admin(): username = request.cookies.get("username") if not username: return

    {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500 [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ? lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{')KC- i,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P! A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 | Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi C < ) , ` + t * g k a < W = Z . % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V ( ( - % > < h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@ 5 : d f d 4 5 * ( 7 ^ % 5 a p \ z I y l " ' f , $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/ 6N-wyzj/MTd#A;r Program Automating Testing 9 https://www.fuzzingbook.org/html/Fuzzer.html Fuzzing
  5. @app.route('/admin') def admin(): username = request.cookies.get("username") if not username: return

    {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500 [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ? lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{')KC- i,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P! A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 | Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi C < ) , ` + t * g k a < W = Z . % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@ 5 5 a p \ z I y l " ' f , $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/ 6N-wyzj/MTd#A;r Structured Inputs SYNTAX ERROR ✘ 10
  6. def process_input(input): try: val = parse(input) res = process(val) return

    res except SyntaxError: return Error def process_input(input): try: ✘val = parse(input) res = process(val) return res except SyntaxError: return Error SYNTAX ERROR 11 Parser ✘
  7. SYNTAX ERROR def process_input(input): try: ✘val = parse(input) res =

    process(val) return res except SyntaxError: return Error 12 The Core ✘
  8. 14 def process_input(input): try: val = parse(input) res = process(val)

    return res except SyntaxError: return Error 14 {
 '<json>' : [['<elt>']], '<elt>' : [['<object>'], ['<array>'], ['<string>'], ['<number>'], ['true'], ['false'], ['null']], '<object>' : [['{', '<items>','}'], ['{}']], '<items>' : [['<item>,',',<items>'], ['<item>']], '<item>' : [['<string>',':', '<elt>']], '<array>' : [['[', '<elts>', ']'], ['[]']], '<elts>' : [['<elt>,',',<elts>'], ['<elt>']], '<string>' : [['"', '<chars>', '"'], ['""']], '<chars>' : [['<char>','<chars>'], ['<char>']], '<number>' : [['<digits>']], '<digits>' : [['<digit>','<digits>'], ['<digit>']], '<char>' : [[c] for c in string.characters] '<digit>' : [[c] for c in string.digits]
 } Fix: Input Grammar
  9. 15 def process_input(input): try: ✔val = parse(input) res = process(val)

    return res except SyntaxError: return Error 15 {
 '<json>' : [['<elt>']], '<elt>' : [['<object>'], ['<array>'], ['<string>'], ['<number>'], ['true'], ['false'], ['null']], '<object>' : [['{', '<items>','}'], ['{}']], '<items>' : [['<item>,',',<items>'], ['<item>']], '<item>' : [['<string>',':', '<elt>']], '<array>' : [['[', '<elts>', ']'], ['[]']], '<elts>' : [['<elt>,',',<elts>'], ['<elt>']], '<string>' : [['"', '<chars>', '"'], ['""']], '<chars>' : [['<char>','<chars>'], ['<char>']], '<number>' : [['<digits>']], '<digits>' : [['<digit>','<digits>'], ['<digit>']], '<char>' : [[c] for c in string.characters] '<digit>' : [[c] for c in string.digits]
 } ✓ Fix: Input Grammar
  10. 20 Track input string accesses and comparisons [ 3 ,

    ] PARSE ERROR c ∉{'0'.., '[', '{'}
  11. 23 • Identify character comparisons or EOF Key Idea: Viable

    Pre fi xes 23 • Complete with one of the compared characters [ 3 , 1 ]
  12. 24 Viable Pre fi xes 24 Limitations • Performance •

    Lack of control [ x [ ; @ [ 3 _ [ 3 _ [ 3 $ _ [ 3 , _ [ 3 , x _ [ 3 , 1 _ [ 3 , 1 ; _ [ 3 , 1 ]
  13. 25 Mathis, Gopinath, Mera, Kampmann, Höschele, and Zeller. Parser Directed

    Fuzzing. PLDI 2019. Mathis, Gopinath and Zeller Learning Input Tokens for Effective Fuzzing. ISSTA 2020. Viable Pre fi xes
  14. 26

  15. 28 Formal Languages Language Descriptions: Grammars Regular Context Free Recursively

    Enumerable (Chomsky,1956) Argument Stack Return Stack 28
  16. 29 Grammar <start> := <expr> <expr> := <expr> '+' <expr>

    | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number> <number> := <integer> | <integer> '.' <integer> <integer>:= <digit> <integer> | <digit> <digit> := [0-9] Arithmetic expression grammar De f inition for <expr> <expr> key
  17. 30 <start> := <expr> <expr> := <expr> '+' <expr> |

    <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number> <number> := <integer> | <integer> '.' <integer> <integer>:= <digit> <integer> | <digit> <digit> := [0-9] Grammar Arithmetic expression grammar Expansion Rule Terminal Symbol Nonterminal Symbol
  18. 31 Grammars For Parsing (8 / 3) * 49 <start>

    := <expr> <expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number> <number> := <integer> | <integer> '.' <integer> <integer>:= <digit> <integer> | <digit> <digit> := [0-9]
  19. 32 Grammars 8.2 - 27 - -9 / +((+9 *

    --2 + --+-+- ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4) )))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) + 8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * + (8 - 5 - 6)) * (-(a-+(((+(4))))) - + +4) / +(-+---((5.6 - --(3 * -1.8 * + (6 * +-(((-(-6) * ---+6)) / +--(+-+- 7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(- -2 - -++-9.0)))) / 5 * --++090 + * - +5 + 7.513)))) - (+1 / ++((-84)))))) )) * 8.2 - 27 - -9 / +((+9 * --2 + - -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+(( (+(4))))) - ++4) / +(-+---((5.6 - -- (3 * -1.8 * +(6 * +-(((-(-6) * ---+6 )) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6 .37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 <start> := <expr> <expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number> <number> := <integer> | <integer> '.' <integer> <integer>:= <digit> <integer> | <digit> <digit> := [0-9] For Fuzzing (Hanford 1970) (Purdom 1972)
  20. 33 Grammars As effective producers Interpreter Parser ✘ ✔ 8.2

    - 27 - -9 / +((+9 * --2 + --+-+- ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4) )))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) + 8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * + (8 - 5 - 6)) * (-(a-+(((+(4))))) - + +4) / +(-+---((5.6 - --(3 * -1.8 * + (6 * +-(((-(-6) * ---+6)) / +--(+-+- 7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(- -2 - -++-9.0)))) / 5 * --++090 + * - +5 + 7.513)))) - (+1 / ++((-84)))))) )) * 8.2 - 27 - -9 / +((+9 * --2 + - -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+(( (+(4))))) - ++4) / +(-+---((5.6 - -- (3 * -1.8 * +(6 * +-(((-(-6) * ---+6 )) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6 .37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 ++5 / +-(--2 - -++-9.0)))) / 5 * --++090
  21. 34 Grammars <start> := <expr> <expr> := <expr> '+' <expr>

    | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number> <number> := <integer> | <integer> '.' <integer> <integer>:= <digit> <integer> | <digit> <digit> := [0-9] As efficient producers def start(): expr() def expr(): match (random() % 6): case 0: expr(); print('+'); expr() case 1: expr(); print('-'); expr() case 2: expr(); print('/'); expr() case 3: expr(); print('*'); expr() case 4: print('('); expr(); print(')') case 5: number() def number(): match (random() % 2): case 0: integer() case 1: integer(); print('.'); integer() def integer(): match (random() % 2): case 0: digit(); integer() case 1: digit() def digit(): match (random() % 10): case 0: print('0') case 1: print('1') case 2: print('2') case 3: print('3') case 4: print('4') case 5: print('5') case 6: print('6') case 7: print('7') Compiled Grammar (F1)
  22. QUIRK_ALLOW_ASCII_CONTROL_CODES QUIRK_ALLOW_BACKSLASH_A QUIRK_ALLOW_BACKSLASH_CAPITAL_U QUIRK_ALLOW_BACKSLASH_E QUIRK_ALLOW_BACKSLASH_NEW_LINE QUIRK_ALLOW_BACKSLASH_QUESTION_MARK QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE QUIRK_ALLOW_BACKSLASH_V QUIRK_ALLOW_BACKSLASH_X_AS_BYTES QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS

    QUIRK_ALLOW_BACKSLASH_ZERO QUIRK_ALLOW_COMMENT_BLOCK QUIRK_ALLOW_COMMENT_LINE QUIRK_ALLOW_EXTRA_COMMA QUIRK_ALLOW_INF_NAN_NUMBERS QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK QUIRK_ALLOW_TRAILING_FILLER QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T QUIRK_REPLACE_INVALID_UNICODE JSON common quirks from https://github.com/google/wuffs 39
  23. "Be liberal in what you accept, and conservative in what

    you send"
 Postel's Law The Specification The Implementation Extra "Features" Where to Get the Grammar From? 40 Bugs
  24. 43 Where to Get the Grammar From? Handwritten parsers contain

    the parse structure key value key value scheme parse_scheme parse_hostpath parse_querystring parse_fragment domain TLD subdomain parse_host subdirectory parse_fslocation binary parse_binaryname parameters parse_parameters parse_url
  25. 44 Where to Get the Grammar From? Mining Grammar from

    a hand-written parser https://www.example.com/forum/questions/cgi?tag=networking&order=newwest#top key value key value split scheme parse_scheme host path parse_hostpath query string parse_querystring fragment parse_fragment domain TLD subdomain parse_host subdirectory parse_fslocation binary parse_binaryname parameters parse_parameters With Dynamic Data Flow Analysis parseurl
  26. 45 http://user:[email protected]:80/?q=path#ref urlparse:url = 'http://user:[email protected]:80/?q=path#ref' urlsplit:scheme = 'http' urlsplit:netloc =

    'user:[email protected]:80' urlsplit:fragment = 'ref' urlsplit:query = 'q=path' https://soft-eng.sydney.edu.au:80/ urlparse:url = 'https://soft-eng.sydney.edu.au:80/' urlsplit:scheme = 'https' urlsplit:netloc = 'soft-eng.sydney.edu.au:80' http://www.fuzzingbook.org/#News urlparse:url = 'http://www.fuzzingbook.org/#News' urlsplit:scheme = 'http' urlsplit:netloc = 'www.fuzzingbook.org' urlsplit:fragment = 'News' Mining with Dynamic Data Flow Analysis
  27. 46 { '<urlparse:url>': [ ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/?', '<urlsplit:query>', '#',

    '<urlsplit:fragment>'], ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/#','<urlsplit:fragment>'], ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/']], '<urlsplit:scheme>' : [ ['http'], ['http']], '<urlsplit:netloc>': [ ['user:[email protected]:80'], ['www.fuzzingbook.org'], ['soft-eng.sydney.edu.au']], '<urlsplit:query>' : [ ['q=path']], '<urlsplit:fragment>' : [ ['ref'], ['News']],
 } http://user:[email protected]:80/?q=path#ref urlparse:url = 'http://user:[email protected]:80/?q=path#ref' urlsplit:scheme = 'http' urlsplit:netloc = 'user:[email protected]:80' urlsplit:fragment = 'ref' urlsplit:query = 'q=path' https://soft-eng.sydney.edu.au:80/ urlparse:url = 'https://soft-eng.sydney.edu.au:80/' urlsplit:scheme = 'https' urlsplit:netloc = 'soft-eng.sydney.edu.au:80' http://www.fuzzingbook.org/#News urlparse:url = 'http://www.fuzzingbook.org/#News' urlsplit:scheme = 'http' urlsplit:netloc = 'www.fuzzingbook.org' urlsplit:fragment = 'News' Mining with Dynamic Data Flow Analysis
  28. { '<urlparse:url>': [ ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/?', '<urlsplit:query>', '#', '<urlsplit:fragment>'],

    ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/#','<urlsplit:fragment>'], ['<urlsplit:scheme>', '://', '<urlsplit:netloc>', '/']], '<urlsplit:scheme>' : [ ['http'], ['http']], '<urlsplit:netloc>': [ ['user:[email protected]:80'], ['www.fuzzingbook.org'], ['soft-eng.sydney.edu.au']], '<urlsplit:query>' : [ ['q=path']], '<urlsplit:fragment>' : [ ['ref'], ['News']],
 } Limitations • Poor accuracy in most handwritten parsers • Handwritten parsers are not often well formed • Control flow is ignored Mining with Dynamic Data Flow Analysis
  29. <F> := <A> | <B> <F> := <A> <B> <C>

    <Fs> := <B> <Fs> | <empty> Structured Control Flow to Grammar Sequence A B C [F] Selection cond A B [F] F T Iteration cond B [F] 48 Function [F] <F> := ...
  30. 49 1. Extract the input string accesses 2. Attach control

    fl ow information Hand-written parsers already encode the grammar Mining with Dynamic Control Flow Analysis
  31. 50 • Inputs + control fl ow -> Dynamic Control

    Dependence Trees • DCD Trees -> Parse Tree Mining with Dynamic Control Flow Analysis
  32. 51 Control Dependence Graph Statement B is control dependent on

    A if A determines whether B executes. def parse_csv(s,i): while s[i:]: if is_digit(s[i]): n,j = num(s[i:]) i = i+j else: comma(s[i]) i += 1 CDG for parse_csv while: determines whether if: executes
  33. 52 def parse_csv(s,i): while s[i:]: if is_digit(s[i]): n,j = num(s[i:])

    i = i+j else: comma(s[i]) i += 1 CDG for parse_csv Dynamic Control Dependence Tree Each statement execution is represented as a separate node DCD Tree for call parse_csv()
  34. 53 def parse_csv(s,i): while s[i:]: if is_digit(s[i]): n,j = num(s[i:])

    i = i+j else: comma(s[i]) i += 1 '1' '2' ',' DCD Tree ~ Parse Tree •No tracking beyond input bu ff er •Characters are attached to nodes where they are accessed last "12," "12,"
  35. 54 def is_digit(i): return i in '0123456789' def parse_num(s,i): n

    = '' while s[i:] and is_digit(s[i]): n += s[i] i = i +1 return i,n def parse_paren(s, i): assert s[i] == '(' i, v = parse_expr(s, i+1) if s[i:] == '': raise Ex(s, i) assert s[i] == ')' return i+1, v def parse_expr(s, i = 0): expr, is_op = [], True while s[i:]: c = s[i] if isdigit(c): if not is_op: raise Ex(s,i) i,num = parse_num(s,i) expr.append(num) is_op = False elif c in ['+', '-', '*', '/']: if is_op: raise Ex(s,i) expr.append(c) is_op, i = True, i + 1 elif c == '(': if not is_op: raise Ex(s,i) i, cexpr = parse_paren(s, i) expr.append(cexpr) is_op = False elif c == ')': break else: raise Ex(s,i) if is_op: raise Ex(s,i) return i, expr 9+3/4 Parse tree for parse_expr('9+3/4')
  36. 55 def is_digit(i): return i in '0123456789' def parse_num(s,i): n

    = '' while s[i:] and is_digit(s[i]): n += s[i] i = i +1 return i,n def parse_paren(s, i): assert s[i] == '(' i, v = parse_expr(s, i+1) if s[i:] == '': raise Ex(s, i) assert s[i] == ')' return i+1, v def parse_expr(s, i = 0): expr, is_op = [], True while s[i:]: c = s[i] if isdigit(c): if not is_op: raise Ex(s,i) i,num = parse_num(s,i) expr.append(num) is_op = False elif c in ['+', '-', '*', '/']: if is_op: raise Ex(s,i) expr.append(c) is_op, i = True, i + 1 elif c == '(': if not is_op: raise Ex(s,i) i, cexpr = parse_paren(s, i) expr.append(cexpr) is_op = False elif c == ')': break else: raise Ex(s,i) if is_op: raise Ex(s,i) return i, expr 9+3/4 Identifying Compatible Nodes Which nodes correspond to the same nonterminal
  37. <parse_expr> := <while_s> <while_s> := <while_1:1> <while_1:0> <while_s> | <while_1:1>

    <parse_expr> := <while 1:1> <while 1:0> <while 1:1> | <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1> | <while 1:1> <while 1:0> <while 1:1> <while 1:0> <while 1:1> | <while 1:1> <while 1:1> := <if 1:1> <if 1:1> := <parse_num> | <parse_paren> <parse_num> := <is_digit> <is_digit> := '3' | '1' <parse_paren>:= '(' <parse_expr> ')' <while 1:0> := <if 1:0> <if 1:0> := '*' 61
  38. 62 def is_digit(i): return i in '0123456789' def parse_num(s,i): n

    = '' while s[i:] and is_digit(s[i]): n += s[i] i = i +1 return i,n def parse_paren(s, i): assert s[i] == '(' i, v = parse_expr(s, i+1) if s[i:] == '': raise Ex(s, i) assert s[i] == ')' return i+1, v def parse_expr(s, i = 0): expr, is_op = [], True while s[i:]: c = s[i] if isdigit(c): if not is_op: raise Ex(s,i) i,num = parse_num(s,i) expr.append(num) is_op = False elif c in ['+', '-', '*', '/']: if is_op: raise Ex(s,i) expr.append(c) is_op, i = True, i + 1 elif c == '(': if not is_op: raise Ex(s,i) i, cexpr = parse_paren(s, i) expr.append(cexpr) is_op = False elif c == ')': break else: raise Ex(s,i) if is_op: raise Ex(s,i) return i, expr <START> := <parse_expr.0-0-c> <parse_expr.0-0-c> := <parse_expr.0-1-s><parse_expr.0> | <parse_expr.0> <parse_expr.0-1-s> := <parse_expr.0><parse_expr.0-2> | <parse_expr.0><parse_expr.0-2><parse_expr.0-1-s> <parse_expr.0> := '(' <parse_expr.0-0-c> ')' | <parse_num.0-1-s> <parse_expr.0-2> := '*' | '+' | '-' | '/' <parse_num.0-1-s> := <is_digit.0-0-c> | <is_digit.0-0-c><parse_num.0-1-s> <is_digit.0-0-c> : [0-9] calc.py Recovered Arithmetic Grammar
  39. 63 8.2 - 27 - -9 / +((+9 * --2

    + --+-+- ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4) )))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) + 8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * + (8 - 5 - 6)) * (-(a-+(((+(4))))) - + +4) / +(-+---((5.6 - --(3 * -1.8 * + (6 * +-(((-(-6) * ---+6)) / +--(+-+- 7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(- -2 - -++-9.0)))) / 5 * --++090 + * - +5 + 7.513)))) - (+1 / ++((-84)))))) )) * 8.2 - 27 - -9 / +((+9 * --2 + - -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+(( (+(4))))) - ++4) / +(-+---((5.6 - -- (3 * -1.8 * +(6 * +-(((-(-6) * ---+6 )) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6 .37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 <START> := <parse_expr.0-0-c> <parse_expr.0-0-c> := <parse_expr.0-1-s><parse_expr.0> | <parse_expr.0> <parse_expr.0-1-s> := <parse_expr.0><parse_expr.0-2> | <parse_expr.0><parse_expr.0-2><parse_expr.0-1-s> <parse_expr.0> := '(' <parse_expr.0-0-c> ')' | <parse_num.0-1-s> <parse_expr.0-2> := '*' | '+' | '-' | '/' <parse_num.0-1-s> := <is_digit.0-0-c> | <is_digit.0-0-c><parse_num.0-1-s> <is_digit.0-0-c> : [0-9]
  40. 64 <START> ::= <json_raw> <json_raw> ::= '"' <json_string'> | '['

    <json_list'> | '{' <json_dict'> | <json_number'> | 'true' | 'false' | 'null' <json_number'> ::= <json_number>+ | <json_number>+ 'e' <json_number>+ <json_number> ::= '+' | '-' | '.' | [0-9] | 'E' | 'e' <json_string'> ::= <json_string>* '"' <json_list'> ::= ']' | <json_raw> (','<json_raw>)* ']' | ( ',' <json_raw>)+ (',' <json_raw>)* ']' <json_dict'> ::= '}' | ( '"' <json_string'> ':' <json_raw> ',' )* '"'<json_string'> ':' <json_raw> '}' <json_string> ::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_', ''',| '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\' <decode_escape> <decode_escape> ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't' stm.next() if expect_key: raise JSONError(E_DKEY, stm, stm.pos) if c == '}': return result expect_key = 1 continue # parse out a key/value pair elif c == '"': key = _from_json_string(stm) stm.skipspaces() c = stm.next() if c != ':': raise JSONError(E_COLON, stm, stm.pos) stm.skipspaces() val = _from_json_raw(stm) result[key] = val expect_key = 0 continue raise JSONError(E_MALF, stm, stm.pos) def _from_json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c == '"': return _from_json_string(stm) elif c == '{': return _from_json_dict(stm) elif c == '[': return _from_json_list(stm) elif c == 't': return _from_json_fixed(stm, 'true', True, E_BOOL) elif c == 'f': return _from_json_fixed(stm, 'false', False, E_BOOL) elif c == 'n': return _from_json_fixed(stm, 'null', None, E_NULL) elif c in NUMSTART: return _from_json_number(stm) raise JSONError(E_MALF, stm, stm.pos) def from_json(data): stm = JSONStream(data) return _from_json_raw(stm) microjson.py Recovered JSON grammar
  41. def json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c

    == 't': return json_fixed(stm, 'true') elif c == 'f': return json_fixed(stm, 'false') elif c == 'n': return json_fixed(stm, 'null') elif c == '"': return json_string(stm) elif c == '{': return json_dict(stm) elif c == '[': return json_list(stm) elif c in NUMSTART: return json_number(stm) raise JSONError(E_MALF, stm, stm.pos) <json_raw>::= <json_object>
 | <json_list>
 | <json_string>
 | <json_number>
 | <json_fixed(true)> | <json_fixed(false)> | <json_fixed(null)> <json_string> ::= `"` <chars> `"` | `""` <chars> ::= <char><chars> | <char> <json_object> ::= `{`<items>`}` | `{}` <items> ::= <item>`,`<items> | <item> <item> ::= <string>`:`<elt> <json_list> ::= `[`<elts>`]` | `[]` <elts> ::= <elt>`,`<elts> | <elt> <json_number> ::= <digits> <digits> ::= <digit><digits> | <digit> https://github.com/phensley/microjson MicroJSON 65 65
  42. def json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c

    == 't': return json_fixed(stm, 'true') elif c == 'f': return json_fixed(stm, 'false') elif c == 'n': return json_fixed(stm, 'null') elif c == '"': return json_string(stm) elif c == '{': return json_dict(stm) elif c == '[': return json_list(stm) elif c in NUMSTART: return json_number(stm) raise JSONError(E_MALF, stm, stm.pos) <json_raw>::= <json_object>
 | <json_list>
 | <json_string>
 | <json_number>
 | <json_fixed(true)> | <json_fixed(false)> | <json_fixed(null)> <json_string> ::= `"` <chars> `"` | `""` <chars> ::= <char><chars> | <char> <json_object> ::= `{`<items>`}` | `{}` <items> ::= <item>`,`<items> | <item> <item> ::= <string>`:`<elt> <json_list> ::= `[`<elts>`]` | `[]` <elts> ::= <elt>`,`<elts> | <elt> <json_number> ::= <digits> <digits> ::= <digit><digits> | <digit> https://github.com/phensley/microjson MicroJSON 66
  43. 67 <START> ::= <json_raw> <json_raw> ::= '"' <json_string'> | '['

    <json_list'> | '{' <json_dict'> | <json_number'> | 'true' | 'false' | 'null' <json_number'> ::= <json_number>+ | <json_number>+ 'e' <json_number>+ <json_number> ::= '+' | '-' | '.' | [0-9] | 'E' | 'e' <json_string'> ::= <json_string>* '"' <json_list'> ::= ']' | <json_raw> (','<json_raw>)* ']' | ( ',' <json_raw>)+ (',' <json_raw>)* ']' <json_dict'> ::= '}' | ( '"' <json_string'> ':' <json_raw> ',' )* '"'<json_string'> ':' <json_raw> '}' <json_string> ::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_', ''',| '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\' <decode_escape> <decode_escape> ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't' stm.next() if expect_key: raise JSONError(E_DKEY, stm, stm.pos) if c == '}': return result expect_key = 1 continue # parse out a key/value pair elif c == '"': key = _from_json_string(stm) stm.skipspaces() c = stm.next() if c != ':': raise JSONError(E_COLON, stm, stm.pos) stm.skipspaces() val = _from_json_raw(stm) result[key] = val expect_key = 0 continue raise JSONError(E_MALF, stm, stm.pos) def _from_json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c == '"': return _from_json_string(stm) elif c == '{': return _from_json_dict(stm) elif c == '[': return _from_json_list(stm) elif c == 't': return _from_json_fixed(stm, 'true', True, E_BOOL) elif c == 'f': return _from_json_fixed(stm, 'false', False, E_BOOL) elif c == 'n': return _from_json_fixed(stm, 'null', None, E_NULL) elif c in NUMSTART: return _from_json_number(stm) raise JSONError(E_MALF, stm, stm.pos) def from_json(data): stm = JSONStream(data) return _from_json_raw(stm) microjson.py Recovered JSON grammar Mimid Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020. 67
  44. Service topology map of Uber showing hundreds of microservices (Source:

    Uber Engineering) Instrumentation ability or source code access is not always guaranteed
  45. 76 • Differentiate incomplete and incorrect inputs Key Idea: Viable

    Pre fi xes 76 • Solve one character at a time systematically
  46. 77 Example Generator a [ 5 1 b , }

    4 ] a ∉ [,],{,},",0,1,2,3,4,5,.,. b ∉ [,],0,1,2,3,4,5,6,7,8,9,, } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,, [51,4] 77
  47. 78 Pre fi xQ AFL(black) INI 62.5 65 CSV 65.7

    68.3 JSON 13.8 9.2 TinyC 86.8 47.9 MJS 28.0 19.0 Quality of Examples Branch Coverage Obtained C programs
  48. 79 Pre fi xQ AFL(black) AFL(gray) INI 62.5 65 77.5

    CSV 65.7 68.3 68.5 JSON 13.8 9.2 22.5 TinyC 86.8 47.9 81.6 MJS 28.0 19.0 29.9 Quality of Examples Tex Crash: ]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbrace fi ll @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$* Branch Coverage Obtained C programs
  49. 83 Grammar Inference (L*) L* (Angluin'84) Learner membership: w ∈

    L? equivalence: G = L? yes/no counterexample yes/no Teacher
  50. 85 Grammar Inference (L*) Learner Teacher w G = L?

    Equivalences Queries are not possible in software engineering scenarios
  51. 86 L* Teacher with PAC Guarantees ab ✓ ✓ abb

    ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✓ ✘
  52. 87 L* Teacher with PAC Guarantees ab ✓ ✓ abb

    ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✘ ✘ aaa ✘ ✘ abab ✘ ✘
  53. 88 L* Teacher with PAC Guarantees Probably Approximately Correct (Valiant'84)

    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ 1-∈: accuracy 1-δ: confidence Equivalence Query = Multiple Membership Checks Checks come from some sampling distribution D over A* We only get a PAC guarantee based on D qi = [1/ϵ (ln(1/δ) + i ln(2))] Checks made in place of ith equivalence query:
  54. 89 Grammar Inference (PAC-L*) Learner Pre fi x Oracle w

    Random Sampler (D) Blackbox Hypothesis w ∈ D L(*) Substituting Equivalence Queries ab ✓ ✓ abb ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✓ ✘
  55. 90 Grammar Inference (PAC-L*) Learner Pre fi x Oracle w

    Random Sampler (D) w ∈ D L(*) Substituting Equivalence Queries Search Space
  56. 91 Positive and Negative Examples with Pre fi x Queries

    a [ 5 1 b , } 4 ] a ∉ [,],{,},",0,1,2,3,4,5,.,. b ∉ [,],0,1,2,3,4,5,6,7,8,9,, } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,, [51,4] 91
  57. 92 Grammar Inference (PL*) Learner Pre fi x Oracle w

    Blackbox Hypothesis w ∈ B Yes/No Yes/No PL(*) w ∈ H Substituting Equivalence Queries
  58. 93 Grammar Inference (PL*) Pr(L(A)≢X ≤ ϵ) ≥ 1−δ Relation

    between D,ϵ,δ and F1 score On Arithmetic (depth limited) L(*) Eq = Pre fi x Sampler Eq = Pre fi x Sampler) (p=0.05) (p=0.5) Eq = Pre fi x Sampler) (p=1.0) Red is good, Blue is bad PL(*) PL(*) PL(*) 1-δ: confidence 1-∈: accuracy
  59. 94 Grammar Inference (PL*) Pr(L(A)≢X ≤ ϵ) ≥ 1−δ Relation

    between D,ϵ,δ and F1 score On JSON (depth limited) L(*) Eq = Pre fi x Sampler (p=0.05) Eq = Pre fi x Sampler) (p=0.5) Eq = Pre fi x Sampler) (p=1.0) Red is good, Blue is bad 1-δ: confidence 1-∈: accuracy PL(*) PL(*) PL(*)