Slide 1

Slide 1 text

1 How to Talk to Strange Programs and Find Bugs Rahul Gopinath https://rahul.gopinath.org [email protected] @[email protected]

Slide 2

Slide 2 text

2 How to Talk to Strange Programs and Find Bugs Rahul Gopinath https://rahul.gopinath.org [email protected] @[email protected] i.e., when the input specification is not known

Slide 3

Slide 3 text

3 We live in a world defined by software

Slide 4

Slide 4 text

4 We have a crisis We live in a world defined by software

Slide 5

Slide 5 text

Bugs 5

Slide 6

Slide 6 text

6

Slide 7

Slide 7 text

The University of Sydney 7 https://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf Interlinked Causes for Catastrophes

Slide 8

Slide 8 text

8 Testing
Input ✓ ✘

@app.route('/admin')
def admin():
    username = request.cookies.get("username")
    if not username:
        return {"Error": "Specify username in Cookie"}
    username = urllib.quote(os.path.basename(username))
    url = "http://permissions:5000/permissions/{}".format(username)
    resp = requests.request(method="GET", url=url)
    # "superadmin\ud888" will be simplified to "superadmin"
    ret = ujson.loads(resp.text)
    if resp.status_code == 200:
        if "superadmin" in ret["roles"]:
            return {"OK": "Superadmin Access granted"}
        else:
            e = u"Access denied. User has following roles: {}".format(ret["roles"])
            return {"Error": e}, 401
    else:
        return {"Error": ret["Error"]}, 500

Slide 9

Slide 9 text

9 Automating Testing: Fuzzing
Random input (excerpt): {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
Program: the admin() handler from slide 8
https://www.fuzzingbook.org/html/Fuzzer.html

Slide 10

Slide 10 text

10 Structured Inputs
Random input (excerpt): {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
Program: the admin() handler from slide 8
SYNTAX ERROR ✘
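The effect on this slide can be reproduced with a small experiment. The sketch below is only an illustration, assuming Python's standard `json.loads` as a stand-in for a program with structured input; the helper names `random_input` and `fuzz_parser` are hypothetical and not from the talk.

```python
import json
import random
import string

def random_input(rng, max_len=40):
    """Return a random string of printable ASCII characters (10..max_len long)."""
    length = rng.randint(10, max_len)
    return ''.join(rng.choice(string.printable) for _ in range(length))

def fuzz_parser(trials=1000, seed=0):
    """Feed random strings to json.loads; count how many get past the parser."""
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(trials):
        try:
            json.loads(random_input(rng))
            accepted += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            rejected += 1
    return accepted, rejected

if __name__ == '__main__':
    ok, bad = fuzz_parser()
    print(f'accepted by the parser: {ok}, syntax errors: {bad}')
```

Almost every random string stops at the parser, so the code behind it is hardly exercised.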

Slide 11

Slide 11 text

11 Parser

def process_input(input):
    try:
        val = parse(input)    # ✘ SYNTAX ERROR
        res = process(val)
        return res
    except SyntaxError:
        return Error

Slide 12

Slide 12 text

12 The Core

def process_input(input):
    try:
        val = parse(input)    # ✘ SYNTAX ERROR
        res = process(val)
        return res
    except SyntaxError:
        return Error

Slide 13

Slide 13 text

13 Overcoming Parsers

Slide 14

Slide 14 text

14 Fix: Input Grammar

def process_input(input):
    try:
        val = parse(input)
        res = process(val)
        return res
    except SyntaxError:
        return Error

{
 '' : [['']],
 '' : [[''], [''], [''], [''], ['true'], ['false'], ['null']],
 '' : [['{', '','}'], ['{}']],
 '' : [[',',','], ['']],
 '' : [['',':', '']],
 '' : [['[', '', ']'], ['[]']],
 '' : [[',',','], ['']],
 '' : [['"', '', '"'], ['""']],
 '' : [['',''], ['']],
 '' : [['']],
 '' : [['',''], ['']],
 '' : [[c] for c in string.characters]
 '' : [[c] for c in string.digits]
}

Slide 15

Slide 15 text

15 Fix: Input Grammar ✓
(process_input() and the grammar as on slide 14; val = parse(input) now succeeds ✔)

Slide 16

Slide 16 text

16 ASIDE: A Simple Solution Let the parser tell you what it wants

Slide 17

Slide 17 text

17 Let the parser tell you what it wants [ 3 5 ]

Slide 18

Slide 18 text

18 Track input string accesses and comparisons [ 3 5 PARSE ERROR c ∉{']' ','} ]

Slide 19

Slide 19 text

19 Track input string accesses and comparisons [ 3 , ]

Slide 20

Slide 20 text

20 Track input string accesses and comparisons [ 3 , ] PARSE ERROR c ∉{'0'.., '[', '{'}

Slide 21

Slide 21 text

21 Track input string accesses and comparisons [ 3 , 1

Slide 22

Slide 22 text

22 Track input string accesses and comparisons [ 3 , 1 ]

Slide 23

Slide 23 text

23 Key Idea: Viable Prefixes
• Identify character comparisons or EOF
• Complete with one of the compared characters
[ 3 , 1 ]
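The two bullets above can be sketched as follows. This is a minimal illustration, not the implementation from the cited papers: the toy parser `parse_list` (single-digit lists such as `[3,1]`), the `TrackedInput` and `TrackedChar` wrappers, and the `generate` loop are all hypothetical names, and the comparison tracking is done by hand rather than by instrumenting a real program.

```python
import random

class EOFAccess(Exception):
    """The parser tried to read past the end of the input."""

class TrackedChar:
    """One input character that logs every value it is compared against."""
    def __init__(self, ch, index, log):
        self.ch, self.index, self.log = ch, index, log

    def __eq__(self, other):
        self.log.append((self.index, other))
        return self.ch == other

    def one_of(self, chars):
        self.log.extend((self.index, c) for c in chars)
        return self.ch in chars

class TrackedInput:
    def __init__(self, text):
        self.text = text
        self.comparisons = []  # (input index, character compared against)

    def at(self, i):
        if i >= len(self.text):
            raise EOFAccess(i)
        return TrackedChar(self.text[i], i, self.comparisons)

def parse_list(inp):
    """Toy hand-written parser for inputs such as [3,1] (single digits only)."""
    i = 0
    if not inp.at(i) == '[':
        raise SyntaxError('expected [')
    i += 1
    while True:
        if not inp.at(i).one_of('0123456789'):
            raise SyntaxError('expected digit')
        i += 1
        c = inp.at(i)
        if c == ']':
            return
        if not c == ',':
            raise SyntaxError('expected , or ]')
        i += 1

def generate(parser, rng, max_steps=1000):
    """Grow an input one character at a time, guided by the parser's comparisons."""
    prefix = ''
    for _ in range(max_steps):
        inp = TrackedInput(prefix)
        try:
            parser(inp)
            return prefix  # the parser accepted the input
        except EOFAccess:
            # Viable prefix, but incomplete: append an arbitrary character.
            prefix += rng.choice('[]{},;x0123456789')
        except SyntaxError:
            # Last character rejected: replace it with one it was compared against.
            last = len(prefix) - 1
            options = [c for i, c in inp.comparisons if i == last]
            prefix = prefix[:-1] + rng.choice(options)
    return None

if __name__ == '__main__':
    print(generate(parse_list, random.Random(0)))
```

Each rejected character costs one extra run of the parser, which hints at the performance limitation of the approach.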

Slide 24

Slide 24 text

24 Viable Prefixes
[ x [ ; @ [ 3 _ [ 3 _ [ 3 $ _ [ 3 , _ [ 3 , x _ [ 3 , 1 _ [ 3 , 1 ; _ [ 3 , 1 ]
Limitations
• Performance
• Lack of control

Slide 25

Slide 25 text

25 Viable Prefixes
Mathis, Gopinath, Mera, Kampmann, Höschele, and Zeller. Parser Directed Fuzzing. PLDI 2019.
Mathis, Gopinath, and Zeller. Learning Input Tokens for Effective Fuzzing. ISSTA 2020.

Slide 26

Slide 26 text

26

Slide 27

Slide 27 text

Where to Get the Input Grammar From? 27

Slide 28

Slide 28 text

28 Formal Languages Language Descriptions: Grammars Regular Context Free Recursively Enumerable (Chomsky,1956) Argument Stack Return Stack 28

Slide 29

Slide 29 text

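29 Grammar
Arithmetic expression grammar
<start> := <expr>
<expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number>
<number> := <integer> | <integer> '.' <integer>
<integer> := <digit> <integer> | <digit>
<digit> := [0-9]
Definition for key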

Slide 30

Slide 30 text

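30 Grammar
Arithmetic expression grammar
<start> := <expr>
<expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number>
<number> := <integer> | <integer> '.' <integer>
<integer> := <digit> <integer> | <digit>
<digit> := [0-9]
Expansion Rule, Terminal Symbol, Nonterminal Symbol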

Slide 31

Slide 31 text

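31 Grammars For Parsing
(8 / 3) * 49
<start> := <expr>
<expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number>
<number> := <integer> | <integer> '.' <integer>
<integer> := <digit> <integer> | <digit>
<digit> := [0-9]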

Slide 32

Slide 32 text

32 Grammars For Fuzzing (Hanford 1970) (Purdom 1972)
<start> := <expr>
<expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number>
<number> := <integer> | <integer> '.' <integer>
<integer> := <digit> <integer> | <digit>
<digit> := [0-9]
Generated sample:
8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4))))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090

Slide 33

Slide 33 text

33 Grammars As effective producers
(sample generated expressions, as on slide 32)
Interpreter Parser ✘ ✔

Slide 34

Slide 34 text

34 Grammars As efficient producers
<start> := <expr>
<expr> := <expr> '+' <expr> | <expr> '-' <expr> | <expr> '/' <expr> | <expr> '*' <expr> | '(' <expr> ')' | <number>
<number> := <integer> | <integer> '.' <integer>
<integer> := <digit> <integer> | <digit>
<digit> := [0-9]

Compiled Grammar (F1)

def start():
    expr()

def expr():
    match (random() % 6):
        case 0: expr(); print('+'); expr()
        case 1: expr(); print('-'); expr()
        case 2: expr(); print('/'); expr()
        case 3: expr(); print('*'); expr()
        case 4: print('('); expr(); print(')')
        case 5: number()

def number():
    match (random() % 2):
        case 0: integer()
        case 1: integer(); print('.'); integer()

def integer():
    match (random() % 2):
        case 0: digit(); integer()
        case 1: digit()

def digit():
    match (random() % 10):
        case 0: print('0')
        case 1: print('1')
        case 2: print('2')
        case 3: print('3')
        case 4: print('4')
        case 5: print('5')
        case 6: print('6')
        case 7: print('7')
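For comparison with the compiled producer, a grammar can also be interpreted directly. The following is a minimal sketch, assuming the arithmetic grammar is written as a Python dictionary with nonterminal names taken from the function names on this slide. The depth limit and the helpers `min_depth` and `generate` are additions for illustration; they are not part of the F1 fuzzer.

```python
import random

# Arithmetic expression grammar; nonterminal names follow start(), expr(), etc.
GRAMMAR = {
    '<start>': [['<expr>']],
    '<expr>': [['<expr>', '+', '<expr>'], ['<expr>', '-', '<expr>'],
               ['<expr>', '/', '<expr>'], ['<expr>', '*', '<expr>'],
               ['(', '<expr>', ')'], ['<number>']],
    '<number>': [['<integer>'], ['<integer>', '.', '<integer>']],
    '<integer>': [['<digit>', '<integer>'], ['<digit>']],
    '<digit>': [[d] for d in '0123456789'],
}

def min_depth(grammar, symbol, seen=frozenset()):
    """Smallest derivation depth needed to turn symbol into terminals."""
    if symbol not in grammar:
        return 0  # terminal symbol
    if symbol in seen:
        return float('inf')  # recursive expansion, never the cheapest
    seen = seen | {symbol}
    return 1 + min(max(min_depth(grammar, s, seen) for s in alt)
                   for alt in grammar[symbol])

def generate(grammar, symbol='<start>', depth=0, max_depth=8, rng=random):
    """Expand symbol randomly; past max_depth, pick only the cheapest rules."""
    if symbol not in grammar:
        return symbol
    alts = grammar[symbol]
    if depth >= max_depth:
        costs = [max(min_depth(grammar, s) for s in alt) for alt in alts]
        alts = [a for a, c in zip(alts, costs) if c == min(costs)]
    alt = rng.choice(alts)
    return ''.join(generate(grammar, s, depth + 1, max_depth, rng) for s in alt)

if __name__ == '__main__':
    rng = random.Random(1)
    for _ in range(3):
        print(generate(GRAMMAR, rng=rng))
```

Without some bound such as `max_depth`, a purely random expansion of `<expr>` tends to keep growing, because five of its six alternatives are recursive.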

Slide 35

Slide 35 text

Where to Get the Grammar From? 35

Slide 36

Slide 36 text

Almost Everyone Uses Handwritten Parsers https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html 36

Slide 37

Slide 37 text

Where to Get the Grammar From? 37

Slide 38

Slide 38 text

38 "Be liberal in what you accept, and conservative in what you send" Postel's Law 38

Slide 39

Slide 39 text

39 JSON common quirks from https://github.com/google/wuffs
QUIRK_ALLOW_ASCII_CONTROL_CODES
QUIRK_ALLOW_BACKSLASH_A
QUIRK_ALLOW_BACKSLASH_CAPITAL_U
QUIRK_ALLOW_BACKSLASH_E
QUIRK_ALLOW_BACKSLASH_NEW_LINE
QUIRK_ALLOW_BACKSLASH_QUESTION_MARK
QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE
QUIRK_ALLOW_BACKSLASH_V
QUIRK_ALLOW_BACKSLASH_X_AS_BYTES
QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS
QUIRK_ALLOW_BACKSLASH_ZERO
QUIRK_ALLOW_COMMENT_BLOCK
QUIRK_ALLOW_COMMENT_LINE
QUIRK_ALLOW_EXTRA_COMMA
QUIRK_ALLOW_INF_NAN_NUMBERS
QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR
QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK
QUIRK_ALLOW_TRAILING_FILLER
QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF
QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T
QUIRK_REPLACE_INVALID_UNICODE

Slide 40

Slide 40 text

"Be liberal in what you accept, and conservative in what you send"
 Postel's Law The Specification The Implementation Extra "Features" Where to Get the Grammar From? 40 Bugs

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

42 Grammar Mining: Whitebox extraction of grammar
Grammar Inference: Blackbox extraction of grammar

Slide 43

Slide 43 text

43 Where to Get the Grammar From? Handwritten parsers contain the parse structure key value key value scheme parse_scheme parse_hostpath parse_querystring parse_fragment domain TLD subdomain parse_host subdirectory parse_fslocation binary parse_binaryname parameters parse_parameters parse_url

Slide 44

Slide 44 text

44 Where to Get the Grammar From? Mining Grammar from a hand-written parser https://www.example.com/forum/questions/cgi?tag=networking&order=newwest#top key value key value split scheme parse_scheme host path parse_hostpath query string parse_querystring fragment parse_fragment domain TLD subdomain parse_host subdirectory parse_fslocation binary parse_binaryname parameters parse_parameters With Dynamic Data Flow Analysis parseurl

Slide 45

Slide 45 text

45 Mining with Dynamic Data Flow Analysis

http://user:[email protected]:80/?q=path#ref
  urlparse:url = 'http://user:[email protected]:80/?q=path#ref'
  urlsplit:scheme = 'http'
  urlsplit:netloc = 'user:[email protected]:80'
  urlsplit:fragment = 'ref'
  urlsplit:query = 'q=path'

https://soft-eng.sydney.edu.au:80/
  urlparse:url = 'https://soft-eng.sydney.edu.au:80/'
  urlsplit:scheme = 'https'
  urlsplit:netloc = 'soft-eng.sydney.edu.au:80'

http://www.fuzzingbook.org/#News
  urlparse:url = 'http://www.fuzzingbook.org/#News'
  urlsplit:scheme = 'http'
  urlsplit:netloc = 'www.fuzzingbook.org'
  urlsplit:fragment = 'News'

Slide 46

Slide 46 text

46 Mining with Dynamic Data Flow Analysis
{
 '<url>': [
   ['<scheme>', '://', '<netloc>', '/?', '<query>', '#', '<fragment>'],
   ['<scheme>', '://', '<netloc>', '/#', '<fragment>'],
   ['<scheme>', '://', '<netloc>', '/']],
 '<scheme>': [['http'], ['http']],
 '<netloc>': [['user:[email protected]:80'], ['www.fuzzingbook.org'], ['soft-eng.sydney.edu.au']],
 '<query>': [['q=path']],
 '<fragment>': [['ref'], ['News']],
}
(mined from the traces on slide 45)
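The idea of turning observed variable values into grammar rules can be sketched with the standard library alone. This is an illustrative approximation, not the miner used in the talk: it reads the components from `urllib.parse.urlparse` instead of tracing data flow inside the parser, and the nonterminal names (`<url>`, `<scheme>`, `<netloc>`, `<query>`, `<fragment>`), the helpers `mine_rule` and `mine_grammar`, and the `example.com` URL are assumptions.

```python
from urllib.parse import urlparse

def mine_rule(url):
    """Replace each component value found in url by a nonterminal."""
    parts = urlparse(url)
    fields = [('<scheme>', parts.scheme), ('<netloc>', parts.netloc),
              ('<query>', parts.query), ('<fragment>', parts.fragment)]
    template, values, pos = [], {}, 0
    for name, value in fields:
        if not value:
            continue
        at = url.find(value, pos)
        if at < 0:
            continue  # value was normalised by the parser; skip it
        if at > pos:
            template.append(url[pos:at])  # literal text between components
        template.append(name)
        values[name] = value
        pos = at + len(value)
    if pos < len(url):
        template.append(url[pos:])
    return template, values

def mine_grammar(urls):
    """Merge the rules mined from several sample inputs into one grammar."""
    grammar = {'<url>': []}
    for url in urls:
        template, values = mine_rule(url)
        if template not in grammar['<url>']:
            grammar['<url>'].append(template)
        for name, value in values.items():
            alternatives = grammar.setdefault(name, [])
            if [value] not in alternatives:
                alternatives.append([value])
    return grammar

if __name__ == '__main__':
    samples = ['http://example.com:80/?q=path#ref',
               'https://soft-eng.sydney.edu.au:80/',
               'http://www.fuzzingbook.org/#News']
    for key, rules in mine_grammar(samples).items():
        print(key, rules)
```

Because only values are matched, any structure the parser recognises through control flow alone (loops, branches) stays invisible to this kind of mining.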

Slide 47

Slide 47 text

Mining with Dynamic Data Flow Analysis
{
 '<url>': [
   ['<scheme>', '://', '<netloc>', '/?', '<query>', '#', '<fragment>'],
   ['<scheme>', '://', '<netloc>', '/#', '<fragment>'],
   ['<scheme>', '://', '<netloc>', '/']],
 '<scheme>': [['http'], ['http']],
 '<netloc>': [['user:[email protected]:80'], ['www.fuzzingbook.org'], ['soft-eng.sydney.edu.au']],
 '<query>': [['q=path']],
 '<fragment>': [['ref'], ['News']],
}
Limitations
• Poor accuracy in most handwritten parsers
• Handwritten parsers are not often well formed
• Control flow is ignored

Slide 49

Slide 49 text

49 Mining with Dynamic Control Flow Analysis
Hand-written parsers already encode the grammar
1. Extract the input string accesses
2. Attach control flow information

Slide 50

Slide 50 text

50 Mining with Dynamic Control Flow Analysis
• Inputs + control flow -> Dynamic Control Dependence Trees
• DCD Trees -> Parse Tree

Slide 51

Slide 51 text

51 Control Dependence Graph
Statement B is control dependent on A if A determines whether B executes.

def parse_csv(s,i):
    while s[i:]:
        if is_digit(s[i]):
            n,j = num(s[i:])
            i = i+j
        else:
            comma(s[i])
            i += 1

CDG for parse_csv
while: determines whether if: executes

Slide 52

Slide 52 text

52 Dynamic Control Dependence Tree
Each statement execution is represented as a separate node

def parse_csv(s,i):
    while s[i:]:
        if is_digit(s[i]):
            n,j = num(s[i:])
            i = i+j
        else:
            comma(s[i])
            i += 1

CDG for parse_csv
DCD Tree for call parse_csv()

Slide 53

Slide 53 text

53 DCD Tree ~ Parse Tree

def parse_csv(s,i):
    while s[i:]:
        if is_digit(s[i]):
            n,j = num(s[i:])
            i = i+j
        else:
            comma(s[i])
            i += 1

"12," -> '1' '2' ','
• No tracking beyond input buffer
• Characters are attached to nodes where they are accessed last

Slide 54

Slide 54 text

54 Parse tree for parse_expr('9+3/4')

def is_digit(i):
    return i in '0123456789'

def parse_num(s,i):
    n = ''
    while s[i:] and is_digit(s[i]):
        n += s[i]
        i = i +1
    return i,n

def parse_paren(s, i):
    assert s[i] == '('
    i, v = parse_expr(s, i+1)
    if s[i:] == '':
        raise Ex(s, i)
    assert s[i] == ')'
    return i+1, v

def parse_expr(s, i = 0):
    expr, is_op = [], True
    while s[i:]:
        c = s[i]
        if is_digit(c):
            if not is_op: raise Ex(s,i)
            i,num = parse_num(s,i)
            expr.append(num)
            is_op = False
        elif c in ['+', '-', '*', '/']:
            if is_op: raise Ex(s,i)
            expr.append(c)
            is_op, i = True, i + 1
        elif c == '(':
            if not is_op: raise Ex(s,i)
            i, cexpr = parse_paren(s, i)
            expr.append(cexpr)
            is_op = False
        elif c == ')':
            break
        else:
            raise Ex(s,i)
    if is_op:
        raise Ex(s,i)
    return i, expr

Slide 55

Slide 55 text

55 Identifying Compatible Nodes
Which nodes correspond to the same nonterminal?
(parser code as on slide 54; input: 9+3/4)

Slide 56

Slide 56 text

56 (9 + 1) * 3 3 * (9 + 1)

Slide 57

Slide 57 text

57 9 + 1 3 * (9 + 1)

Slide 58

Slide 58 text

58 3 (9 + 1) * 3 * (9 + 1)

Slide 59

Slide 59 text

59 3*(1) 1

Slide 60

Slide 60 text

60 3*(1) 1 := :=

Slide 61

Slide 61 text

:= := | := | | | := := | := := '3' | '1' := '(' ')' := := '*' 61

Slide 62

Slide 62 text

62 calc.py: Recovered Arithmetic Grammar
(parser code as on slide 54)
:= := | := | := '(' ')' | := '*' | '+' | '-' | '/' := | : [0-9]

Slide 63

Slide 63 text

63 (sample generated expressions, as on slide 32)
:= := | := | := '(' ')' | := '*' | '+' | '-' | '/' := | : [0-9]

Slide 64

Slide 64 text

64 microjson.py: Recovered JSON grammar

::=
::= '"' | '[' | '{' | | 'true' | 'false' | 'null'
::= + | + 'e' +
::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
::= * '"'
::= ']' | (',')* ']' | ( ',' )+ (',' )* ']'
::= '}' | ( '"' ':' ',' )* '"' ':' '}'
::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_', ''',| '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\'
::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'

            stm.next()
            if expect_key:
                raise JSONError(E_DKEY, stm, stm.pos)
            if c == '}':
                return result
            expect_key = 1
            continue
        # parse out a key/value pair
        elif c == '"':
            key = _from_json_string(stm)
            stm.skipspaces()
            c = stm.next()
            if c != ':':
                raise JSONError(E_COLON, stm, stm.pos)
            stm.skipspaces()
            val = _from_json_raw(stm)
            result[key] = val
            expect_key = 0
            continue
        raise JSONError(E_MALF, stm, stm.pos)

def _from_json_raw(stm):
    while True:
        stm.skipspaces()
        c = stm.peek()
        if c == '"':
            return _from_json_string(stm)
        elif c == '{':
            return _from_json_dict(stm)
        elif c == '[':
            return _from_json_list(stm)
        elif c == 't':
            return _from_json_fixed(stm, 'true', True, E_BOOL)
        elif c == 'f':
            return _from_json_fixed(stm, 'false', False, E_BOOL)
        elif c == 'n':
            return _from_json_fixed(stm, 'null', None, E_NULL)
        elif c in NUMSTART:
            return _from_json_number(stm)
        raise JSONError(E_MALF, stm, stm.pos)

def from_json(data):
    stm = JSONStream(data)
    return _from_json_raw(stm)

Slide 65

Slide 65 text

65 MicroJSON
https://github.com/phensley/microjson

def json_raw(stm):
    while True:
        stm.skipspaces()
        c = stm.peek()
        if c == 't':
            return json_fixed(stm, 'true')
        elif c == 'f':
            return json_fixed(stm, 'false')
        elif c == 'n':
            return json_fixed(stm, 'null')
        elif c == '"':
            return json_string(stm)
        elif c == '{':
            return json_dict(stm)
        elif c == '[':
            return json_list(stm)
        elif c in NUMSTART:
            return json_number(stm)
        raise JSONError(E_MALF, stm, stm.pos)

::= | | | | | |
::= `"` `"` | `""`
::= |
::= `{``}` | `{}`
::= `,` |
::= `:`
::= `[``]` | `[]`
::= `,` |
::=
::= |

Slide 66

Slide 66 text

66 MicroJSON
https://github.com/phensley/microjson
(json_raw() and the grammar as on slide 65)

Slide 67

Slide 67 text

67 Mimid
microjson.py: Recovered JSON grammar (code and grammar as on slide 64)
Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020.

Slide 68

Slide 68 text

68 68 PART II

Slide 69

Slide 69 text

Service topology map of Uber showing hundreds of microservices (Source: Uber Engineering) Instrumentation ability or source code access is not always guaranteed

Slide 70

Slide 70 text

Grammar Inference

Slide 71

Slide 71 text

Grammar Inference
Problem: Exponential Search Space
2^n possibilities for a string of length n

Slide 72

Slide 72 text

Grammar Inference
Problem: Exponential Search Space
2^n possibilities for a string of length n

Slide 73

Slide 73 text

Grammar Inference (with examples) Glade Arvada With good examples, the problem is tractable

Slide 74

Slide 74 text

Finding Good Examples Example corpus? (Blind spots) 74

Slide 75

Slide 75 text

75 Key Idea: Leverage Error Feedback 75 Viable Prefix (Ullman)

Slide 76

Slide 76 text

76 Key Idea: Viable Prefixes
• Differentiate incomplete and incorrect inputs
• Solve one character at a time systematically
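In the blackbox setting, the only feedback is whether an input is complete, incomplete, or incorrect. The sketch below illustrates the two bullets under that assumption. The `oracle` function stands in for a program under test (here a hand-written recogniser for number lists such as `[51,4]`), and `generate`, `soft_limit`, and `hard_limit` are hypothetical names introduced for this example.

```python
import random

DIGITS = '0123456789'

def oracle(s):
    """Stand-in blackbox: classify s as 'complete', 'incomplete' or 'incorrect'."""
    state = 'start'
    for ch in s:
        if state == 'start' and ch == '[':
            state = 'need_digit'
        elif state == 'need_digit' and ch in DIGITS:
            state = 'in_number'
        elif state == 'in_number' and ch in DIGITS:
            pass
        elif state == 'in_number' and ch == ',':
            state = 'need_digit'
        elif state == 'in_number' and ch == ']':
            state = 'done'
        else:
            return 'incorrect'
    return 'complete' if state == 'done' else 'incomplete'

def generate(oracle, alphabet, rng, soft_limit=20, hard_limit=200):
    """Build a valid input one character at a time using only oracle verdicts."""
    prefix = ''
    for _ in range(hard_limit):
        # Keep every next character that leaves the prefix viable.
        viable = [c for c in alphabet if oracle(prefix + c) != 'incorrect']
        if not viable:
            return None  # dead end: no character extends this prefix
        closing = [c for c in viable if oracle(prefix + c) == 'complete']
        if closing and len(prefix) >= soft_limit:
            return prefix + closing[0]  # long enough: finish the input
        prefix += rng.choice(viable)
        if oracle(prefix) == 'complete':
            return prefix
    return None

if __name__ == '__main__':
    rng = random.Random(0)
    for _ in range(3):
        print(generate(oracle, '[]{},"x' + DIGITS, rng))
```

Every step queries the oracle once per alphabet character, so the cost grows with the alphabet size times the input length.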

Slide 77

Slide 77 text

77 Example Generator a [ 5 1 b , } 4 ] a ∉ [,],{,},",0,1,2,3,4,5,.,. b ∉ [,],0,1,2,3,4,5,6,7,8,9,, } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,, [51,4] 77

Slide 78

Slide 78 text

78 Quality of Examples
Branch Coverage Obtained (C programs)

        PrefixQ   AFL(black)
INI     62.5      65
CSV     65.7      68.3
JSON    13.8      9.2
TinyC   86.8      47.9
MJS     28.0      19.0

Slide 79

Slide 79 text

79 Quality of Examples
Branch Coverage Obtained (C programs)

        PrefixQ   AFL(black)   AFL(gray)
INI     62.5      65           77.5
CSV     65.7      68.3         68.5
JSON    13.8      9.2          22.5
TinyC   86.8      47.9         81.6
MJS     28.0      19.0         29.9

Tex Crash:
]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbracefill @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$*

Slide 80

Slide 80 text

80 Grammar Inference

Slide 81

Slide 81 text

81 Grammar Inference

Slide 82

Slide 82 text

82 Grammar Inference 1984 1993 2014 2022 2019

Slide 83

Slide 83 text

83 Grammar Inference (L*) L* (Angluin'84) Learner membership: w ∈ L? equivalence: G = L? yes/no counterexample yes/no Teacher

Slide 84

Slide 84 text

84 Grammar Inference (L*) L* (Angluin) Learner membership: ab ? equivalence: no abbb yes Teacher ?

Slide 85

Slide 85 text

85 Grammar Inference (L*) Learner Teacher w G = L? Equivalence queries are not possible in software engineering scenarios

Slide 86

Slide 86 text

86 L* Teacher with PAC Guarantees ab ✓ ✓ abb ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✓ ✘

Slide 87

Slide 87 text

87 L* Teacher with PAC Guarantees ab ✓ ✓ abb ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✘ ✘ aaa ✘ ✘ abab ✘ ✘

Slide 88

Slide 88 text

88 L* Teacher with PAC Guarantees
Probably Approximately Correct (Valiant'84)
$\Pr(L(A) \not\equiv X \le \epsilon) \ge 1 - \delta$
1−ϵ: accuracy
1−δ: confidence
Equivalence Query = Multiple Membership Checks
Checks come from some sampling distribution D over A*
We only get a PAC guarantee based on D
Checks made in place of the ith equivalence query:
$q_i = \left\lceil \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + i \ln 2\right) \right\rceil$
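The formula for the number of checks can be made concrete as follows. This is a sketch under assumptions: the brackets on the slide are read as a ceiling, and the names `checks_for_query` and `pac_equivalent`, as well as the toy even-length language in the usage part, are hypothetical.

```python
import itertools
import math

def checks_for_query(i, epsilon, delta):
    """Membership checks used in place of the i-th equivalence query."""
    return math.ceil((1 / epsilon) * (math.log(1 / delta) + i * math.log(2)))

def pac_equivalent(hypothesis, blackbox, sampler, i, epsilon, delta):
    """Approximate an equivalence query by sampling.

    Returns a counterexample on which hypothesis and blackbox disagree,
    or None if all sampled strings agree.
    """
    for _ in range(checks_for_query(i, epsilon, delta)):
        w = sampler()
        if hypothesis(w) != blackbox(w):
            return w
    return None

if __name__ == '__main__':
    # Toy setting (assumption): the blackbox accepts strings of even length.
    def blackbox(w):
        return len(w) % 2 == 0

    def hypothesis(w):
        return True  # a wrong first guess: accept everything

    samples = itertools.cycle(['ab', 'abb', 'bb'])
    print(checks_for_query(1, 0.1, 0.1))
    print(pac_equivalent(hypothesis, blackbox, lambda: next(samples), 1, 0.1, 0.1))
```

The guarantee holds only with respect to the sampling distribution D: strings that D rarely produces can still separate the hypothesis from the blackbox.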

Slide 89

Slide 89 text

89 Grammar Inference (PAC-L*) Learner Prefix Oracle w Random Sampler (D) Blackbox Hypothesis w ∈ D L(*) Substituting Equivalence Queries ab ✓ ✓ abb ✘ ✘ bb ✓ ✓ aaaa ✓ ✓ bbb ✓ ✘

Slide 90

Slide 90 text

90 Grammar Inference (PAC-L*) Learner Prefix Oracle w Random Sampler (D) w ∈ D L(*) Substituting Equivalence Queries Search Space

Slide 91

Slide 91 text

91 Positive and Negative Examples with Prefix Queries a [ 5 1 b , } 4 ] a ∉ [,],{,},",0,1,2,3,4,5,.,. b ∉ [,],0,1,2,3,4,5,6,7,8,9,, } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,, [51,4] 91

Slide 92

Slide 92 text

92 Grammar Inference (PL*) Learner Prefix Oracle w Blackbox Hypothesis w ∈ B Yes/No Yes/No PL(*) w ∈ H Substituting Equivalence Queries

Slide 93

Slide 93 text

93 Grammar Inference (PL*)
$\Pr(L(A) \not\equiv X \le \epsilon) \ge 1 - \delta$
1−δ: confidence
1−ϵ: accuracy
Relation between D, ϵ, δ and F1 score on Arithmetic (depth limited)
Panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0)
Red is good, Blue is bad

Slide 94

Slide 94 text

94 Grammar Inference (PL*)
$\Pr(L(A) \not\equiv X \le \epsilon) \ge 1 - \delta$
1−δ: confidence
1−ϵ: accuracy
Relation between D, ϵ, δ and F1 score on JSON (depth limited)
Panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0)
Red is good, Blue is bad

Slide 95

Slide 95 text

95 Grammar Mining Blackbox Generation Grammar Inference