Slide 1

Slide 1 text

Dancing to an Unknown Music: Grammar Inference with Prefix Queries. Rahul Gopinath https://rahul.gopinath.org rahul@gopinath.org @rahul@gopinath.org

Slide 3

Slide 3 text

Formal Languages. Language Descriptions: Grammars. Regular, Context Free, Recursively Enumerable (Chomsky, 1956). Argument Stack, Return Stack.

Slide 4

Slide 4 text

Grammar Inference with Prefix Queries: Why Should We Infer the Grammar?

Slide 5

Slide 5 text

The University of Sydney https://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf

Slide 6

Slide 6 text

Bugs

Slide 7

Slide 7 text


Slide 8

Slide 8 text

Testing: Input ✓ ✘
@app.route('/admin')
def admin():
    username = request.cookies.get("username")
    if not username:
        return {"Error": "Specify username in Cookie"}
    username = urllib.quote(os.path.basename(username))
    url = "http://permissions:5000/permissions/{}".format(username)
    resp = requests.request(method="GET", url=url)
    # "superadmin\ud888" will be simplified to "superadmin"
    ret = ujson.loads(resp.text)
    if resp.status_code == 200:
        if "superadmin" in ret["roles"]:
            return {"OK": "Superadmin Access granted"}
        else:
            e = u"Access denied. User has following roles: {}".format(ret["roles"])
            return {"Error": e}, 401
    else:
        return {"Error": ret["Error"]}, 500

Slide 9

Slide 9 text

Automating Testing: Fuzzing. Random inputs such as {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! are fed to the Program, the admin() handler for '/admin'. https://www.fuzzingbook.org/html/Fuzzer.html
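A minimal sketch of a random fuzzer in the spirit of the one described at the linked page. The parameter names and defaults here are assumptions chosen for illustration, not a copy of any particular implementation.

```python
import random


def fuzzer(max_length=100, char_start=32, char_range=32, rng=random):
    """Return a string of up to max_length random characters.

    Each character is drawn from the code points
    [char_start, char_start + char_range).
    """
    length = rng.randrange(0, max_length + 1)
    return "".join(
        chr(rng.randrange(char_start, char_start + char_range))
        for _ in range(length)
    )


if __name__ == "__main__":
    # Printable ASCII only: code points 32..126.
    print(fuzzer(max_length=40, char_start=32, char_range=95))
```

Passing a seeded random.Random instance as rng makes a fuzzing run repeatable.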

Slide 10

Slide 10 text

Structured Inputs: random inputs such as {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@!, fed to the admin() handler for '/admin', end in SYNTAX ERROR ✘

Slide 11

Slide 11 text

Parser ✘: SYNTAX ERROR
def process_input(input):
    try:
        val = parse(input)  # ✘ SYNTAX ERROR
        res = process(val)
        return res
    except SyntaxError:
        return Error
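To make the point concrete, here is a small sketch that feeds random strings to a parse-then-process function and counts how many get past the parser. It assumes json.loads as a stand-in for parse() and a trivial stand-in for process(); neither comes from the slides.

```python
import json
import random


def process_input(text):
    """Parse first; only inputs that survive parsing reach the core logic."""
    try:
        value = json.loads(text)  # stand-in for parse()
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "syntax error"
    return ("processed", value)  # stand-in for process()


def random_text(rng, max_length=20):
    """Random printable-ASCII string of length 1..max_length."""
    length = rng.randrange(1, max_length + 1)
    return "".join(chr(rng.randrange(32, 127)) for _ in range(length))


def count_reaching_core(trials=1000, seed=0):
    """Number of random inputs that are not rejected by the parser."""
    rng = random.Random(seed)
    results = [process_input(random_text(rng)) for _ in range(trials)]
    return sum(1 for r in results if r != "syntax error")


if __name__ == "__main__":
    print(count_reaching_core(), "of 1000 random inputs reached the core")
```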

Slide 12

Slide 12 text

The Core ✘: SYNTAX ERROR
def process_input(input):
    try:
        val = parse(input)  # ✘ SYNTAX ERROR
        res = process(val)  # the core, not reached
        return res
    except SyntaxError:
        return Error

Slide 13

Slide 13 text

Overcoming Parsers

Slide 14

Slide 14 text

def process_input(input):
    try:
        val = parse(input)
        res = process(val)
        return res
    except SyntaxError:
        return Error
{
 '' : [['']], '' : [[''], [''], [''], [''], ['true'], ['false'], ['null']], '' : [['{', '','}'], ['{}']], '' : [[',',','], ['']], '' : [['',':', '']], '' : [['[', '', ']'], ['[]']], '' : [[',',','], ['']], '' : [['"', '', '"'], ['""']], '' : [['',''], ['']], '' : [['']], '' : [['',''], ['']], '' : [[c] for c in string.characters] '' : [[c] for c in string.digits]
 } Fix: Input Grammar

Slide 15

Slide 15 text

def process_input(input):
    try:
        val = parse(input)  # ✔
        res = process(val)
        return res
    except SyntaxError:
        return Error
{
 '' : [['']], '' : [[''], [''], [''], [''], ['true'], ['false'], ['null']], '' : [['{', '','}'], ['{}']], '' : [[',',','], ['']], '' : [['',':', '']], '' : [['[', '', ']'], ['[]']], '' : [[',',','], ['']], '' : [['"', '', '"'], ['""']], '' : [['',''], ['']], '' : [['']], '' : [['',''], ['']], '' : [[c] for c in string.characters] '' : [[c] for c in string.digits]
 } ✓ Fix: Input Grammar
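A sketch of how an input grammar in this dictionary style can drive input generation. The nonterminal names (`<start>`, `<json>`, and so on) and the depth-limiting strategy are assumptions made for illustration; the grammar is a simplified JSON subset, not the exact one on the slide.

```python
import random
import string

# Hypothetical nonterminal names; each key maps to a list of alternatives,
# and each alternative is a list of terminals and nonterminals.
GRAMMAR = {
    "<start>": [["<json>"]],
    "<json>": [["<object>"], ["<array>"], ["<string>"], ["<number>"],
               ["true"], ["false"], ["null"]],
    "<object>": [["{", "<members>", "}"], ["{}"]],
    "<members>": [["<member>", ",", "<members>"], ["<member>"]],
    "<member>": [["<string>", ":", "<json>"]],
    "<array>": [["[", "<elements>", "]"], ["[]"]],
    "<elements>": [["<json>", ",", "<elements>"], ["<json>"]],
    "<string>": [['"', "<chars>", '"'], ['""']],
    "<chars>": [["<char>", "<chars>"], ["<char>"]],
    # Avoid leading zeros, which strict JSON parsers reject.
    "<number>": [["<digit>"], ["<nonzero>", "<digits>"]],
    "<digits>": [["<digit>", "<digits>"], ["<digit>"]],
    "<char>": [[c] for c in string.ascii_letters],
    "<digit>": [[c] for c in string.digits],
    "<nonzero>": [[c] for c in "123456789"],
}


def generate(grammar, symbol="<start>", rng=random, depth=0, max_depth=8):
    """Expand symbol into a string by picking random alternatives."""
    if symbol not in grammar:
        return symbol  # terminal
    rules = grammar[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative so that
        # the expansion terminates.
        rules = [min(rules, key=len)]
    rule = rng.choice(rules)
    return "".join(
        generate(grammar, s, rng, depth + 1, max_depth) for s in rule
    )


if __name__ == "__main__":
    print(generate(GRAMMAR, rng=random.Random(0)))
```

Every string produced this way is syntactically valid with respect to the grammar, so it gets past the parser and exercises the code behind it.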

Slide 16

Slide 16 text

Where to Get the Grammar From?

Slide 17

Slide 17 text

Almost Everyone Uses Handwritten Parsers https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html

Slide 18

Slide 18 text

Where to Get the Grammar From?

Slide 19

Slide 19 text

"Be liberal in what you accept, and conservative in what you send" (Postel's Law)

Slide 20

Slide 20 text

JSON common quirks from https://github.com/google/wuffs
QUIRK_ALLOW_ASCII_CONTROL_CODES
QUIRK_ALLOW_BACKSLASH_A
QUIRK_ALLOW_BACKSLASH_CAPITAL_U
QUIRK_ALLOW_BACKSLASH_E
QUIRK_ALLOW_BACKSLASH_NEW_LINE
QUIRK_ALLOW_BACKSLASH_QUESTION_MARK
QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE
QUIRK_ALLOW_BACKSLASH_V
QUIRK_ALLOW_BACKSLASH_X_AS_BYTES
QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS
QUIRK_ALLOW_BACKSLASH_ZERO
QUIRK_ALLOW_COMMENT_BLOCK
QUIRK_ALLOW_COMMENT_LINE
QUIRK_ALLOW_EXTRA_COMMA
QUIRK_ALLOW_INF_NAN_NUMBERS
QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR
QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK
QUIRK_ALLOW_TRAILING_FILLER
QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF
QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T
QUIRK_REPLACE_INVALID_UNICODE

Slide 21

Slide 21 text

"Be liberal in what you accept, and conservative in what you send"
 Postel's Law The Specification The Implementation Extra "Features" Where to Get the Grammar From? 21 Bugs

Slide 23

Slide 23 text

MicroJSON https://github.com/phensley/microjson
def json_raw(stm):
    while True:
        stm.skipspaces()
        c = stm.peek()
        if c == 't':
            return json_fixed(stm, 'true')
        elif c == 'f':
            return json_fixed(stm, 'false')
        elif c == 'n':
            return json_fixed(stm, 'null')
        elif c == '"':
            return json_string(stm)
        elif c == '{':
            return json_dict(stm)
        elif c == '[':
            return json_list(stm)
        elif c in NUMSTART:
            return json_number(stm)
        raise JSONError(E_MALF, stm, stm.pos)
 ::= | | | | | |
 ::= `"` `"` | `""`
 ::= |
 ::= `{``}` | `{}`
 ::= `,` |
 ::= `:`
 ::= `[``]` | `[]`
 ::= `,` |
 ::=
 ::= |

Slide 25

Slide 25 text

Mimid: Recovered JSON grammar from microjson.py
Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020.
 ::=
 ::= '"' | '[' | '{' | | 'true' | 'false' | 'null'
 ::= + | + 'e' +
 ::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
 ::= * '"'
 ::= ']' | (',')* ']' | ( ',' )+ (',' )* ']'
 ::= '}' | ( '"' ':' ',' )* '"' ':' '}'
 ::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_', ''',| '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\'
 ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'
microjson.py
            stm.next()
            if expect_key:
                raise JSONError(E_DKEY, stm, stm.pos)
            if c == '}':
                return result
            expect_key = 1
            continue
        # parse out a key/value pair
        elif c == '"':
            key = _from_json_string(stm)
            stm.skipspaces()
            c = stm.next()
            if c != ':':
                raise JSONError(E_COLON, stm, stm.pos)
            stm.skipspaces()
            val = _from_json_raw(stm)
            result[key] = val
            expect_key = 0
            continue
        raise JSONError(E_MALF, stm, stm.pos)
def _from_json_raw(stm):
    while True:
        stm.skipspaces()
        c = stm.peek()
        if c == '"':
            return _from_json_string(stm)
        elif c == '{':
            return _from_json_dict(stm)
        elif c == '[':
            return _from_json_list(stm)
        elif c == 't':
            return _from_json_fixed(stm, 'true', True, E_BOOL)
        elif c == 'f':
            return _from_json_fixed(stm, 'false', False, E_BOOL)
        elif c == 'n':
            return _from_json_fixed(stm, 'null', None, E_NULL)
        elif c in NUMSTART:
            return _from_json_number(stm)
        raise JSONError(E_MALF, stm, stm.pos)
def from_json(data):
    stm = JSONStream(data)
    return _from_json_raw(stm)

Slide 26

Slide 26 text

BUT

Slide 27

Slide 27 text

Service topology map of Uber showing hundreds of microservices (Source: Uber Engineering). Instrumentation ability or source code access is not always guaranteed.

Slide 28

Slide 28 text

Blackbox Grammar Inference

Slide 29

Slide 29 text

Blackbox Grammar Inference. Problem: Exponential Search Space. 2^n possibilities for a string of length n.
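The growth is easy to check by brute force. The sketch below counts candidate strings for a binary alphabet, which gives the 2^n on the slide; for an alphabet of k symbols the count is k^n. The function names are chosen for illustration only.

```python
import itertools


def all_strings(alphabet, n):
    """Yield every string of length n over the given alphabet."""
    for chars in itertools.product(alphabet, repeat=n):
        yield "".join(chars)


def count_candidates(alphabet_size, n):
    """Number of distinct strings of length n: alphabet_size ** n."""
    return alphabet_size ** n


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 40):
        print(n, count_candidates(2, n))
```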

Slide 30

Slide 30 text

Blackbox Grammar Inference (with examples): Glade, Arvada. With good examples, the problem is tractable.

Slide 31

Slide 31 text

Finding Good Examples. Example corpus? (Blind spots)

Slide 32

Slide 32 text

Key Idea: Leverage Error Feedback. Viable Prefix (Ullman)

Slide 33

Slide 33 text

Key Idea: Viable Prefixes
• Differentiate incomplete and incorrect inputs
• Solve one character at a time systematically
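A sketch of what such error feedback could look like, assuming a toy language of nested lists of non-negative integers (for example [51,4]). The recognizer and the three status names are assumptions for illustration; a real program would expose this signal through its own error behaviour.

```python
DIGITS = "0123456789"


def parse_value(s, i):
    """Return the index just past one value starting at s[i].

    Raises EOFError when the input ends too early (incomplete) and
    ValueError on an unexpected character (incorrect).
    """
    if i >= len(s):
        raise EOFError
    if s[i] in DIGITS:
        while i < len(s) and s[i] in DIGITS:
            i += 1
        return i
    if s[i] == "[":
        i += 1
        if i >= len(s):
            raise EOFError
        if s[i] == "]":
            return i + 1
        while True:
            i = parse_value(s, i)
            if i >= len(s):
                raise EOFError
            if s[i] == "]":
                return i + 1
            if s[i] != ",":
                raise ValueError
            i += 1
    raise ValueError


def check_prefix(s):
    """Classify s as 'complete', 'incomplete' (viable prefix) or 'incorrect'."""
    try:
        end = parse_value(s, 0)
    except EOFError:
        return "incomplete"
    except ValueError:
        return "incorrect"
    return "complete" if end == len(s) else "incorrect"


if __name__ == "__main__":
    for text in ["[51", "[51,4]", "[5a", "[51,}"]:
        print(repr(text), check_prefix(text))
```

An incomplete input can still be extended into a valid one, while an incorrect input cannot, so only the last character needs to be reconsidered.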

Slide 34

Slide 34 text

Example Generator
Characters tried, in order: a [ 5 1 b , } 4 ]
a ∉ [,],{,},",0,1,2,3,4,5,.,.
b ∉ [,],0,1,2,3,4,5,6,7,8,9,,
} ∉ [,],0,1,2,3,4,5,6,7,8,9,0,,
Result: [51,4]
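The generation loop can be sketched as follows: extend the current prefix by one random character, drop the character if the oracle reports the result as incorrect, and stop once the input is complete. The toy oracle (nested lists of integers), the alphabet, and the stop_prob and max_len parameters are all assumptions for illustration.

```python
import random

DIGITS = "0123456789"
ALPHABET = DIGITS + "[],{}ab"


def parse_value(s, i):
    """Index just past one value at s[i]; EOFError if input ends early,
    ValueError on an unexpected character."""
    if i >= len(s):
        raise EOFError
    if s[i] in DIGITS:
        while i < len(s) and s[i] in DIGITS:
            i += 1
        return i
    if s[i] == "[":
        i += 1
        if i >= len(s):
            raise EOFError
        if s[i] == "]":
            return i + 1
        while True:
            i = parse_value(s, i)
            if i >= len(s):
                raise EOFError
            if s[i] == "]":
                return i + 1
            if s[i] != ",":
                raise ValueError
            i += 1
    raise ValueError


def oracle(s):
    """Toy prefix oracle: 'complete', 'incomplete' or 'incorrect'."""
    try:
        end = parse_value(s, 0)
    except EOFError:
        return "incomplete"
    except ValueError:
        return "incorrect"
    return "complete" if end == len(s) else "incorrect"


def generate(oracle, alphabet, rng, stop_prob=0.5, max_len=30):
    """Grow a valid input one character at a time; None if max_len is hit."""
    prefix = ""
    while True:
        status = oracle(prefix)
        if status == "complete" and (
            rng.random() < stop_prob or len(prefix) >= max_len
        ):
            return prefix
        if len(prefix) >= max_len:
            return None  # give up; the caller may retry
        candidates = list(alphabet)
        rng.shuffle(candidates)
        for c in candidates:
            if oracle(prefix + c) != "incorrect":
                prefix += c  # keep the first character that is not rejected
                break
        else:
            # No character extends the prefix.
            return prefix if status == "complete" else None


if __name__ == "__main__":
    rng = random.Random(0)
    print([generate(oracle, ALPHABET, rng) for _ in range(5)])
```

Each position costs at most one oracle query per alphabet symbol, so the work grows with the input length times the alphabet size rather than exponentially.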

Slide 35

Slide 35 text

Quality of Examples: Branch Coverage Obtained (C programs)
         PrefixQ  AFL(black)
INI      62.5     65
CSV      65.7     68.3
JSON     13.8     9.2
TinyC    86.8     47.9
MJS      28.0     19.0

Slide 36

Slide 36 text

Quality of Examples: Branch Coverage Obtained (C programs)
         PrefixQ  AFL(black)  AFL(gray)
INI      62.5     65          77.5
CSV      65.7     68.3        68.5
JSON     13.8     9.2         22.5
TinyC    86.8     47.9        81.6
MJS      28.0     19.0        29.9
Tex Crash: ]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbracefill @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$*

Slide 37

Slide 37 text

Grammar Inference

Slide 38

Slide 38 text

Grammar Inference (L*): L* (Angluin'84)
The Learner asks the Teacher two kinds of queries:
membership: w ∈ L? answered yes/no
equivalence: G = L? answered yes/no, with a counterexample
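A sketch of the Teacher side of this protocol, assuming the target language is available as a complete DFA so that equivalence can be decided exactly. The DFA and Teacher classes and the example language (an 'a' followed by any number of 'b's) are assumptions for illustration; the Learner is not shown.

```python
from collections import deque


class DFA:
    """Complete DFA: delta maps (state, symbol) to the next state."""

    def __init__(self, start, accepting, delta):
        self.start = start
        self.accepting = set(accepting)
        self.delta = delta

    def accepts(self, word):
        state = self.start
        for ch in word:
            state = self.delta[(state, ch)]
        return state in self.accepting


class Teacher:
    def __init__(self, target, alphabet):
        self.target = target
        self.alphabet = alphabet

    def membership(self, word):
        """Membership query: is word in the target language?"""
        return self.target.accepts(word)

    def equivalence(self, hypothesis):
        """Equivalence query: (True, None) or (False, counterexample).

        Breadth-first search over pairs of states finds a shortest
        word on which target and hypothesis disagree.
        """
        start = (self.target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (p, q), word = queue.popleft()
            if (p in self.target.accepting) != (q in hypothesis.accepting):
                return False, word
            for ch in self.alphabet:
                nxt = (self.target.delta[(p, ch)], hypothesis.delta[(q, ch)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + ch))
        return True, None


# Target: 'a' followed by any number of 'b's (state 2 is a sink).
TARGET = DFA(0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2,
                      (1, "b"): 1, (2, "a"): 2, (2, "b"): 2})
# Hypothesis: accepts only "a" and "ab" (state 3 is a sink).
HYPOTHESIS = DFA(0, {1, 2}, {(0, "a"): 1, (0, "b"): 3, (1, "a"): 3,
                             (1, "b"): 2, (2, "a"): 3, (2, "b"): 3,
                             (3, "a"): 3, (3, "b"): 3})

if __name__ == "__main__":
    teacher = Teacher(TARGET, "ab")
    print(teacher.membership("ab"))
    print(teacher.equivalence(HYPOTHESIS))
```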

Slide 39

Slide 39 text

Grammar Inference (L*): L* (Angluin)
Example exchange between Learner and Teacher:
membership: ab? yes
equivalence: ? no, counterexample abbb

Slide 40

Slide 40 text

Grammar Inference (L*): the Learner sends the Teacher membership queries (w) and equivalence queries (G = L?). Equivalence queries are not possible in software engineering scenarios.

Slide 41

Slide 41 text

L* Teacher with PAC Guarantees
Probably Approximately Correct (Valiant'84): Pr(L(A)≢X ≤ ϵ) ≥ 1−δ, where 1-δ is the confidence and 1-ϵ is the accuracy
Equivalence Query = Multiple Membership Checks
Checks come from some sampling distribution D over A*
We only get a PAC guarantee based on D
Checks made in place of the ith equivalence query: q_i = ⌈(1/ϵ)(ln(1/δ) + i ln 2)⌉
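A sketch of how the number of checks and the sampled equivalence test could be computed. The sampling distribution used here (uniform length up to max_length, uniform symbols) is an assumption standing in for D, and the function names are illustrative.

```python
import math
import random


def num_checks(epsilon, delta, i):
    """Membership checks that replace the i-th equivalence query."""
    return math.ceil(
        (1.0 / epsilon) * (math.log(1.0 / delta) + i * math.log(2))
    )


def sample_word(rng, alphabet, max_length):
    """Draw one word from an assumed distribution D over alphabet*."""
    length = rng.randrange(0, max_length + 1)
    return "".join(rng.choice(alphabet) for _ in range(length))


def pac_equivalence(blackbox, hypothesis, i, epsilon, delta, rng,
                    alphabet="ab", max_length=8):
    """Return a counterexample word, or None if all sampled checks agree."""
    for _ in range(num_checks(epsilon, delta, i)):
        word = sample_word(rng, alphabet, max_length)
        if blackbox(word) != hypothesis(word):
            return word
    return None


if __name__ == "__main__":
    for i in (1, 2, 3):
        print(i, num_checks(epsilon=0.1, delta=0.1, i=i))
```

With ϵ = δ = 0.1, the first equivalence query is replaced by 30 membership checks and the second by 37. A None result only means that no disagreement was sampled from D, not that the two languages are equal.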

Slide 42

Slide 42 text

Grammar Inference (PAC-L*): Substituting Equivalence Queries. Components: Learner L(*), Prefix Oracle, Random Sampler (D), Blackbox, Hypothesis. Each sample w ∈ D gets a Yes/No answer from both the Blackbox and the Hypothesis.

Slide 43

Slide 43 text

Grammar Inference (PAC-L*): Substituting Equivalence Queries. Components: Learner L(*), Prefix Oracle, Random Sampler (D) with w ∈ D, Search Space.

Slide 44

Slide 44 text

Grammar Inference (PL*): Substituting Equivalence Queries. Components: Learner PL(*), Prefix Oracle, Blackbox (w ∈ B: Yes/No), Hypothesis (w ∈ H: Yes/No).

Slide 45

Slide 45 text

Grammar Inference (PL*): Relation between D, ϵ, δ and F1 score on Arithmetic (depth limited). Pr(L(A)≢X ≤ ϵ) ≥ 1−δ; 1-δ: confidence, 1-ϵ: accuracy. Panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0). Red is good, Blue is bad.

Slide 46

Slide 46 text

Grammar Inference (PL*): Relation between D, ϵ, δ and F1 score on JSON (depth limited). Pr(L(A)≢X ≤ ϵ) ≥ 1−δ; 1-δ: confidence, 1-ϵ: accuracy. Panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0). Red is good, Blue is bad.

Slide 47

Slide 47 text

Grammar Mining, Blackbox Generation, Grammar Inference