Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Dancing to an Unknown Music: Grammar Inferrence with Prefix Queries

Dancing to an Unknown Music: Grammar Inferrence with Prefix Queries

Talk at SAPLING 2023

Rahul Gopinath

December 01, 2023
Tweet

More Decks by Rahul Gopinath

Other Decks in Research

Transcript

  1. 3 Formal Languages Language Descriptions: Grammars Regular Context Free Recursively

    Enumerable (Chomsky,1956) Argument Stack Return Stack 3
  2. 7

  3. 8 Input ✓ ✘ Testing @app.route('/admin') def admin(): username =

    request.cookies.get("username") if not username: return {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500
  4. @app.route('/admin') def admin(): username = request.cookies.get("username") if not username: return

    {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500 [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ? lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{')KC- i,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P! A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 | Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi C < ) , ` + t * g k a < W = Z . % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V ( ( - % > < h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@ 5 : d f d 4 5 * ( 7 ^ % 5 a p \ z I y l " ' f , $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/ 6N-wyzj/MTd#A;r Program Automating Testing 9 https://www.fuzzingbook.org/html/Fuzzer.html Fuzzing
  5. @app.route('/admin') def admin(): username = request.cookies.get("username") if not username: return

    {"Error": "Specify username in Cookie"} username = urllib.quote(os.path.basename(username)) url = "http://permissions:5000/permissions/{}".format(username) resp = requests.request(method="GET", url=url) # "superadmin\ud888" will be simpli fi ed to "superadmin" ret = ujson.loads(resp.text) if resp.status_code == 200: if "superadmin" in ret["roles"]: return {"OK": "Superadmin Access granted"} else: e = u"Access denied. User has following roles: {}".format(ret["roles"]) return {"Error": e}, 401 else:return {"Error": ret["Error"]}, 500 [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ? # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ; {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ? lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{')KC- i,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P! A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 | Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi C < ) , ` + t * g k a < W = Z . % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ? # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G - FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy! ^ z k h d f 3 C 5 P A k R ? V h n | 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@ 5 5 a p \ z I y l " ' f , $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/ 6N-wyzj/MTd#A;r Structured Inputs SYNTAX ERROR ✘ 10
  6. def process_input(input): try: val = parse(input) res = process(val) return

    res except SyntaxError: return Error def process_input(input): try: ✘val = parse(input) res = process(val) return res except SyntaxError: return Error SYNTAX ERROR 11 Parser ✘
  7. SYNTAX ERROR def process_input(input): try: ✘val = parse(input) res =

    process(val) return res except SyntaxError: return Error 12 The Core ✘
  8. 14 def process_input(input): try: val = parse(input) res = process(val)

    return res except SyntaxError: return Error 14 {
 '<json>' : [['<elt>']], '<elt>' : [['<object>'], ['<array>'], ['<string>'], ['<number>'], ['true'], ['false'], ['null']], '<object>' : [['{', '<items>','}'], ['{}']], '<items>' : [['<item>,',',<items>'], ['<item>']], '<item>' : [['<string>',':', '<elt>']], '<array>' : [['[', '<elts>', ']'], ['[]']], '<elts>' : [['<elt>,',',<elts>'], ['<elt>']], '<string>' : [['"', '<chars>', '"'], ['""']], '<chars>' : [['<char>','<chars>'], ['<char>']], '<number>' : [['<digits>']], '<digits>' : [['<digit>','<digits>'], ['<digit>']], '<char>' : [[c] for c in string.characters] '<digit>' : [[c] for c in string.digits]
 } Fix: Input Grammar
  9. 15 def process_input(input): try: ✔val = parse(input) res = process(val)

    return res except SyntaxError: return Error 15 {
 '<json>' : [['<elt>']], '<elt>' : [['<object>'], ['<array>'], ['<string>'], ['<number>'], ['true'], ['false'], ['null']], '<object>' : [['{', '<items>','}'], ['{}']], '<items>' : [['<item>,',',<items>'], ['<item>']], '<item>' : [['<string>',':', '<elt>']], '<array>' : [['[', '<elts>', ']'], ['[]']], '<elts>' : [['<elt>,',',<elts>'], ['<elt>']], '<string>' : [['"', '<chars>', '"'], ['""']], '<chars>' : [['<char>','<chars>'], ['<char>']], '<number>' : [['<digits>']], '<digits>' : [['<digit>','<digits>'], ['<digit>']], '<char>' : [[c] for c in string.characters] '<digit>' : [[c] for c in string.digits]
 } ✓ Fix: Input Grammar
  10. QUIRK_ALLOW_ASCII_CONTROL_CODES QUIRK_ALLOW_BACKSLASH_A QUIRK_ALLOW_BACKSLASH_CAPITAL_U QUIRK_ALLOW_BACKSLASH_E QUIRK_ALLOW_BACKSLASH_NEW_LINE QUIRK_ALLOW_BACKSLASH_QUESTION_MARK QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE QUIRK_ALLOW_BACKSLASH_V QUIRK_ALLOW_BACKSLASH_X_AS_BYTES QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS

    QUIRK_ALLOW_BACKSLASH_ZERO QUIRK_ALLOW_COMMENT_BLOCK QUIRK_ALLOW_COMMENT_LINE QUIRK_ALLOW_EXTRA_COMMA QUIRK_ALLOW_INF_NAN_NUMBERS QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK QUIRK_ALLOW_TRAILING_FILLER QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T QUIRK_REPLACE_INVALID_UNICODE JSON common quirks from https://github.com/google/wuffs 20
  11. "Be liberal in what you accept, and conservative in what

    you send"
 Postel's Law The Specification The Implementation Extra "Features" Where to Get the Grammar From? 21 Bugs
  12. <F> := <A> | <B> <F> := <A> <B> <C>

    <Fs> := <B> <Fs> | <empty> Structured Control Flow to Grammar Sequence A B C [F] Selection cond A B [F] F T Iteration cond B [F] 22 Function [F] <F> := ...
  13. def json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c

    == 't': return json_fixed(stm, 'true') elif c == 'f': return json_fixed(stm, 'false') elif c == 'n': return json_fixed(stm, 'null') elif c == '"': return json_string(stm) elif c == '{': return json_dict(stm) elif c == '[': return json_list(stm) elif c in NUMSTART: return json_number(stm) raise JSONError(E_MALF, stm, stm.pos) <json_raw>::= <json_object>
 | <json_list>
 | <json_string>
 | <json_number>
 | <json_fixed(true)> | <json_fixed(false)> | <json_fixed(null)> <json_string> ::= `"` <chars> `"` | `""` <chars> ::= <char><chars> | <char> <json_object> ::= `{`<items>`}` | `{}` <items> ::= <item>`,`<items> | <item> <item> ::= <string>`:`<elt> <json_list> ::= `[`<elts>`]` | `[]` <elts> ::= <elt>`,`<elts> | <elt> <json_number> ::= <digits> <digits> ::= <digit><digits> | <digit> https://github.com/phensley/microjson MicroJSON 23 23
  14. def json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c

    == 't': return json_fixed(stm, 'true') elif c == 'f': return json_fixed(stm, 'false') elif c == 'n': return json_fixed(stm, 'null') elif c == '"': return json_string(stm) elif c == '{': return json_dict(stm) elif c == '[': return json_list(stm) elif c in NUMSTART: return json_number(stm) raise JSONError(E_MALF, stm, stm.pos) <json_raw>::= <json_object>
 | <json_list>
 | <json_string>
 | <json_number>
 | <json_fixed(true)> | <json_fixed(false)> | <json_fixed(null)> <json_string> ::= `"` <chars> `"` | `""` <chars> ::= <char><chars> | <char> <json_object> ::= `{`<items>`}` | `{}` <items> ::= <item>`,`<items> | <item> <item> ::= <string>`:`<elt> <json_list> ::= `[`<elts>`]` | `[]` <elts> ::= <elt>`,`<elts> | <elt> <json_number> ::= <digits> <digits> ::= <digit><digits> | <digit> https://github.com/phensley/microjson MicroJSON 24
  15. 25 <START> ::= <json_raw> <json_raw> ::= '"' <json_string'> | '['

    <json_list'> | '{' <json_dict'> | <json_number'> | 'true' | 'false' | 'null' <json_number'> ::= <json_number>+ | <json_number>+ 'e' <json_number>+ <json_number> ::= '+' | '-' | '.' | [0-9] | 'E' | 'e' <json_string'> ::= <json_string>* '"' <json_list'> ::= ']' | <json_raw> (','<json_raw>)* ']' | ( ',' <json_raw>)+ (',' <json_raw>)* ']' <json_dict'> ::= '}' | ( '"' <json_string'> ':' <json_raw> ',' )* '"'<json_string'> ':' <json_raw> '}' <json_string> ::= ' ' | '!' | '#' | '$' | '%' | '&' | ''' | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^' | '_', ''',| '{' | '|' | '}' | '~' | '[A-Za-z0-9]' | '\' <decode_escape> <decode_escape> ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't' stm.next() if expect_key: raise JSONError(E_DKEY, stm, stm.pos) if c == '}': return result expect_key = 1 continue # parse out a key/value pair elif c == '"': key = _from_json_string(stm) stm.skipspaces() c = stm.next() if c != ':': raise JSONError(E_COLON, stm, stm.pos) stm.skipspaces() val = _from_json_raw(stm) result[key] = val expect_key = 0 continue raise JSONError(E_MALF, stm, stm.pos) def _from_json_raw(stm): while True: stm.skipspaces() c = stm.peek() if c == '"': return _from_json_string(stm) elif c == '{': return _from_json_dict(stm) elif c == '[': return _from_json_list(stm) elif c == 't': return _from_json_fixed(stm, 'true', True, E_BOOL) elif c == 'f': return _from_json_fixed(stm, 'false', False, E_BOOL) elif c == 'n': return _from_json_fixed(stm, 'null', None, E_NULL) elif c in NUMSTART: return _from_json_number(stm) raise JSONError(E_MALF, stm, stm.pos) def from_json(data): stm = JSONStream(data) return _from_json_raw(stm) microjson.py Recovered JSON grammar Mimid Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020. 25
  16. Service topology map of Uber showing hundreds of microservices (Source:

    Uber Engineering) Instrumentation ability or source code access is not always guaranteed
  17. 33 • Differentiate incomplete and incorrect inputs Key Idea: Viable

    Prefixes 33 • Solve one character at a time systematically
  18. 34 Example Generator a [ 5 1 b , }

    4 ] a ∉ [,],{,},",0,1,2,3,4,5,.,. b ∉ [,],0,1,2,3,4,5,6,7,8,9,, } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,, [51,4] 34
  19. 35 Pre fi xQ AFL(black) INI 62.5 65 CSV 65.7

    68.3 JSON 13.8 9.2 TinyC 86.8 47.9 MJS 28.0 19.0 Quality of Examples Branch Coverage Obtained C programs
  20. 36 Pre fi xQ AFL(black) AFL(gray) INI 62.5 65 77.5

    CSV 65.7 68.3 68.5 JSON 13.8 9.2 22.5 TinyC 86.8 47.9 81.6 MJS 28.0 19.0 29.9 Quality of Examples Tex Crash: ]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbrace fi ll @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$* Branch Coverage Obtained C programs
  21. 38 Grammar Inference (L*) L* (Angluin'84) Learner membership: w ∈

    L? equivalence: G = L? yes/no counterexample yes/no Teacher
  22. 40 Grammar Inference (L*) Learner Teacher w G = L?

    Equivalences Queries are not possible in software engineering scenarios
  23. 41 L* Teacher with PAC Guarantees Probably Approximately Correct (Valiant'84)

    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ 1-δ: confidence 1-∈: accuracy Equivalence Query = Multiple Membership Checks Checks come from some sampling distribution D over A* We only get a PAC guarantee based on D qi = [1/ϵ (ln(1/δ) + i ln(2))] Checks made in place of ith equivalence query:
  24. 42 Grammar Inference (PAC-L*) Learner Pre fi x Oracle w

    Random Sampler (D) Blackbox Hypothesis w ∈ D L(*) Substituting Equivalence Queries Yes No No Yes Yes Yes No Yes Yes No
  25. 43 Grammar Inference (PAC-L*) Learner Pre fi x Oracle w

    Random Sampler (D) w ∈ D L(*) Substituting Equivalence Queries Search Space
  26. 44 Grammar Inference (PL*) Learner Pre fi x Oracle w

    Blackbox Hypothesis w ∈ B Yes/No Yes/No PL(*) w ∈ H Substituting Equivalence Queries
  27. 45 Grammar Inference (PL*) Pr(L(A)≢X ≤ ϵ) ≥ 1−δ Relation

    between D,ϵ,δ and F1 score On Arithmetic (depth limited) L(*) Eq = Pre fi x Sampler Eq = Pre fi x Sampler) (p=0.05) (p=0.5) Eq = Pre fi x Sampler) (p=1.0) Red is good, Blue is bad PL(*) PL(*) PL(*) 1-δ: confidence 1-∈: accuracy
  28. 46 Grammar Inference (PL*) Pr(L(A)≢X ≤ ϵ) ≥ 1−δ Relation

    between D,ϵ,δ and F1 score On JSON (depth limited) L(*) Eq = Pre fi x Sampler (p=0.05) Eq = Pre fi x Sampler) (p=0.5) Eq = Pre fi x Sampler) (p=1.0) Red is good, Blue is bad 1-δ: confidence 1-∈: accuracy PL(*) PL(*) PL(*)