Dancing to an Unknown Music: Grammar Inference with Prefix Queries

Talk at SAPLING 2023

Rahul Gopinath

December 01, 2023

Transcript

  1. Dancing to an Unknown Music
    Rahul Gopinath
    https://rahul.gopinath.org [email protected]
    @[email protected]
    Grammar Inference with Prefix Queries

  2. Rahul Gopinath
    https://rahul.gopinath.org [email protected]
    @[email protected]
    Grammar Inference with Prefix Queries
    Dancing to an Unknown Music

  3. Formal Languages
    Language Descriptions: Grammars
    Regular
    Context Free
    Recursively Enumerable
    (Chomsky, 1956)
    Argument Stack
    Return Stack

  4. Grammar Inference with Prefix Queries
    Why Should We Infer the Grammar?

  5. The University of Sydney
    https://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf

  6. Input
    Testing
    import os
    import urllib.parse

    import requests
    import ujson
    from flask import Flask, request

    app = Flask(__name__)

    @app.route('/admin')
    def admin():
        username = request.cookies.get("username")
        if not username:
            return {"Error": "Specify username in Cookie"}
        # Strip any path components, then URL-encode the name.
        username = urllib.parse.quote(os.path.basename(username))
        url = "http://permissions:5000/permissions/{}".format(username)
        resp = requests.request(method="GET", url=url)
        # "superadmin\ud888" will be simplified to "superadmin"
        ret = ujson.loads(resp.text)
        if resp.status_code == 200:
            if "superadmin" in ret["roles"]:
                return {"OK": "Superadmin Access granted"}
            else:
                e = u"Access denied. User has following roles: {}".format(ret["roles"])
                return {"Error": e}, 401
        else:
            return {"Error": ret["Error"]}, 500

  7. Program under test: the admin() handler from slide 6, fed with random inputs such as:
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@!
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
    J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ?
    lR=bF3+;y$3lodQi,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P!
    A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 |
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi
    C < ) , ` + t * g k a < W = Z .
    % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V ( ( - % > < h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@
    5 : d f d 4 5 * ( 7 ^ % 5 a p \ z I y l " ' f ,
    $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh
    Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/
    6N-wyzj/MTd#A;r
    Program
    Automating Testing
    https://www.fuzzingbook.org/html/Fuzzer.html
    Fuzzing
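
Random inputs like the ones on this slide can come from a very small generator. The following is a minimal sketch in the spirit of the fuzzer described at the fuzzingbook URL above. The function name `fuzzer` and its parameters are illustrative assumptions and are not taken from the slide.

```python
import random

def fuzzer(max_length=100, char_start=32, char_range=95, rng=random):
    """Return a random string of up to max_length printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    return "".join(
        chr(rng.randrange(char_start, char_start + char_range))
        for _ in range(length)
    )

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same strings
    for _ in range(3):
        print(repr(fuzzer(max_length=40, rng=rng)))
```

Every character is drawn independently, so the generator has no notion of structure. That is exactly why its outputs rarely survive a parser, which is the point of the following slides.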

  8. Program under test: the admin() handler from slide 6, fed with the same kind of random inputs:
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@!
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
    J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ?
    lR=bF3+;y$3lodQi,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P!
    A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 |
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi
    C < ) , ` + t * g k a < W = Z .
    % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@
    5 5 a p \ z I y l " ' f ,
    $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh
    Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/
    6N-wyzj/MTd#A;r
    Structured Inputs
    SYNTAX ERROR

  9. Parser
    def process_input(input):
        try:
            val = parse(input)  # ✘ SYNTAX ERROR raised here
            res = process(val)
            return res
        except SyntaxError:
            return Error

  10. The Core
    def process_input(input):
        try:
            val = parse(input)  # ✘ SYNTAX ERROR
            res = process(val)
            return res
        except SyntaxError:
            return Error
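
The preceding slides argue that the parser rejects almost every random input before the core logic runs. A small experiment can make this concrete. This is a sketch under stated assumptions: Python's `json.loads` stands in for `parse` (the slides do not name a parser here), and `process`, `random_input`, and `count_reaching_core` are hypothetical helpers.

```python
import json
import random

def process(val):
    # Hypothetical core logic: report the type of the parsed value.
    return type(val).__name__

def process_input(text):
    try:
        val = json.loads(text)  # the parser acts as the gatekeeper
    except ValueError:          # json.JSONDecodeError is a subclass of ValueError
        return None
    return process(val)

def random_input(rng, max_length=20):
    """Random printable ASCII string of length 1..max_length."""
    length = rng.randrange(1, max_length + 1)
    return "".join(chr(rng.randrange(32, 127)) for _ in range(length))

def count_reaching_core(trials=1000, seed=0):
    """Count how many random inputs get past the parser."""
    rng = random.Random(seed)
    return sum(
        process_input(random_input(rng)) is not None for _ in range(trials)
    )

if __name__ == "__main__":
    print(count_reaching_core(), "of 1000 random inputs reached process()")
```

In practice the inputs that do get through tend to be trivial ones such as short numbers, so the interesting branches of the core stay untested.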

  11. Overcoming Parsers

  12. Fix: Input Grammar
    def process_input(input):
        try:
            val = parse(input)
            res = process(val)
            return res
        except SyntaxError:
            return Error
    {
    '' : [['']],
    '' : [[''],
    [''],
    [''],
    [''],
    ['true'], ['false'], ['null']],
    '' : [['{', '','}'],
    ['{}']],
    '' : [[',',','],
    ['']],
    '' : [['',':', '']],
    '' : [['[', '', ']'],
    ['[]']],
    '' : [[',',','],
    ['']],
    '' : [['"', '', '"'],
    ['""']],
    '' : [['',''],
    ['']],
    '' : [['']],
    '' : [['',''],
    ['']],
    '' : [[c] for c in string.characters]
    '' : [[c] for c in string.digits]
    }

  13. Fix: Input Grammar
    def process_input(input):
        try:
            val = parse(input)  # ✔ parse succeeds
            res = process(val)
            return res
        except SyntaxError:
            return Error
    (Input grammar as on slide 12.)
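
With an input grammar, inputs can be generated that get past the parser. The sketch below uses the list-of-lists grammar format shown on the slide. The nonterminal names are assumptions, because the transcript shows only empty strings where the slide's names stood. The `<number>` rule is also adjusted (with an assumed `<nonzero>` helper) so that no number has a leading zero, which `json.loads` would reject.

```python
import json
import random
import string

# Nonterminal names are assumed; they did not survive in the transcript.
GRAMMAR = {
    "<start>":    [["<json>"]],
    "<json>":     [["<object>"], ["<array>"], ["<string>"], ["<number>"],
                   ["true"], ["false"], ["null"]],
    "<object>":   [["{", "<members>", "}"], ["{}"]],
    "<members>":  [["<member>", ",", "<members>"], ["<member>"]],
    "<member>":   [["<string>", ":", "<json>"]],
    "<array>":    [["[", "<elements>", "]"], ["[]"]],
    "<elements>": [["<json>", ",", "<elements>"], ["<json>"]],
    "<string>":   [['"', "<chars>", '"'], ['""']],
    "<chars>":    [["<char>", "<chars>"], ["<char>"]],
    "<number>":   [["<digit>"], ["<nonzero>", "<digits>"]],
    "<digits>":   [["<digit>", "<digits>"], ["<digit>"]],
    "<char>":     [[c] for c in string.ascii_letters],
    "<digit>":    [[c] for c in string.digits],
    "<nonzero>":  [[c] for c in "123456789"],
}

def generate(grammar, symbol="<start>", rng=random, depth=0, max_depth=8):
    """Expand symbol into a string by choosing random rules."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    rules = grammar[symbol]
    if depth >= max_depth:
        # Past the depth budget, force the shortest rule so expansion ends.
        rules = [min(rules, key=len)]
    rule = rng.choice(rules)
    return "".join(
        generate(grammar, s, rng, depth + 1, max_depth) for s in rule
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        text = generate(GRAMMAR, rng=rng)
        json.loads(text)  # raises if the generated input is not valid JSON
        print(text)
```

The depth cap is a design choice of this sketch: it keeps recursive rules such as `<members>` from expanding forever while still allowing nested values.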

  14. Where to Get the Grammar From?

  15. Almost Everyone Uses Handwritten Parsers
    https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html

  16. Where to Get the Grammar From?

  17. "Be liberal in what you accept, and conservative in what you send"
    Postel's Law

  18. QUIRK_ALLOW_ASCII_CONTROL_CODES
    QUIRK_ALLOW_BACKSLASH_A
    QUIRK_ALLOW_BACKSLASH_CAPITAL_U
    QUIRK_ALLOW_BACKSLASH_E
    QUIRK_ALLOW_BACKSLASH_NEW_LINE
    QUIRK_ALLOW_BACKSLASH_QUESTION_MARK
    QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE
    QUIRK_ALLOW_BACKSLASH_V
    QUIRK_ALLOW_BACKSLASH_X_AS_BYTES
    QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS
    QUIRK_ALLOW_BACKSLASH_ZERO
    QUIRK_ALLOW_COMMENT_BLOCK
    QUIRK_ALLOW_COMMENT_LINE
    QUIRK_ALLOW_EXTRA_COMMA
    QUIRK_ALLOW_INF_NAN_NUMBERS
    QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR
    QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK
    QUIRK_ALLOW_TRAILING_FILLER
    QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF
    QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T
    QUIRK_REPLACE_INVALID_UNICODE
    Common JSON quirks, from https://github.com/google/wuffs

  19. Where to Get the Grammar From?
    "Be liberal in what you accept, and conservative in what you send"
    Postel's Law
    The Specification
    The Implementation
    Extra "Features"
    Bugs

  20. Structured Control Flow to Grammar
    [Figure: control-flow diagrams for Sequence (A, B, C), Selection (cond, A, B), Iteration (cond, B), and Function, each paired with a grammar rule for F]

  21. MicroJSON
    https://github.com/phensley/microjson
    def json_raw(stm):
        while True:
            stm.skipspaces()
            c = stm.peek()
            if c == 't':
                return json_fixed(stm, 'true')
            elif c == 'f':
                return json_fixed(stm, 'false')
            elif c == 'n':
                return json_fixed(stm, 'null')
            elif c == '"':
                return json_string(stm)
            elif c == '{':
                return json_dict(stm)
            elif c == '[':
                return json_list(stm)
            elif c in NUMSTART:
                return json_number(stm)
            raise JSONError(E_MALF, stm, stm.pos)
    ::=
    |
    |
    |
    |
    |
    |
    ::= `"` `"`
    | `""`
    ::=
    |
    ::= `{``}`
    | `{}`
    ::= `,`
    |
    ::= `:`
    ::= `[``]`
    | `[]`
    ::= `,`
    |
    ::=
    ::=
    |

  22. MicroJSON (same json_raw() code and grammar as slide 21)
    https://github.com/phensley/microjson

  23. Mimid: recovered JSON grammar, followed by the microjson.py source it was mined from
    ::=
    ::= '"'
    | '['
    | '{'
    |
    | 'true'
    | 'false'
    | 'null'
    ::= +
    | + 'e' +
    ::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
    ::= * '"'
    ::= ']'
    | (',')* ']'
    | ( ',' )+ (',' )* ']'
    ::= '}'
    | ( '"' ':' ',' )*
    '"' ':' '}'
    ::= ' ' | '!' | '#' | '$' | '%' | '&' | '''
    | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';'
    | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^'
    | '_', ''',| '{' | '|' | '}' | '~'
    | '[A-Za-z0-9]'
    | '\'
    ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'
    # (fragment: the slide begins partway through the dictionary parser)
                stm.next()
                if expect_key:
                    raise JSONError(E_DKEY, stm, stm.pos)
                if c == '}':
                    return result
                expect_key = 1
                continue
            # parse out a key/value pair
            elif c == '"':
                key = _from_json_string(stm)
                stm.skipspaces()
                c = stm.next()
                if c != ':':
                    raise JSONError(E_COLON, stm, stm.pos)
                stm.skipspaces()
                val = _from_json_raw(stm)
                result[key] = val
                expect_key = 0
                continue
            raise JSONError(E_MALF, stm, stm.pos)

    def _from_json_raw(stm):
        while True:
            stm.skipspaces()
            c = stm.peek()
            if c == '"':
                return _from_json_string(stm)
            elif c == '{':
                return _from_json_dict(stm)
            elif c == '[':
                return _from_json_list(stm)
            elif c == 't':
                return _from_json_fixed(stm, 'true', True, E_BOOL)
            elif c == 'f':
                return _from_json_fixed(stm, 'false', False, E_BOOL)
            elif c == 'n':
                return _from_json_fixed(stm, 'null', None, E_NULL)
            elif c in NUMSTART:
                return _from_json_number(stm)
            raise JSONError(E_MALF, stm, stm.pos)

    def from_json(data):
        stm = JSONStream(data)
        return _from_json_raw(stm)
    microjson.py Recovered JSON grammar
    Mimid
    Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020.

  24. Service topology map of Uber showing hundreds of microservices (Source: Uber Engineering)
    Instrumentation ability or source code access is not always guaranteed

  25. Blackbox Grammar Inference

  26. Blackbox Grammar Inference
    Problem: Exponential Search Space
    2^n possibilities for a string of length n

  27. Blackbox Grammar Inference (with examples)
    Glade, Arvada
    With good examples, the problem is tractable

  28. Finding Good Examples
    Example corpus?
    (Blind spots)

  29. Key Idea: Leverage Error Feedback
    Viable Prefix (Ullman)

  30. Key Idea: Viable Prefixes
    • Differentiate incomplete and incorrect inputs
    • Solve one character at a time systematically

  31. Example Generator
    [Figure: the input [51,4] is assembled one character at a time; candidate characters that are rejected are discarded]
    a ∉ [,],{,},",0,1,2,3,4,5,.,.
    b ∉ [,],0,1,2,3,4,5,6,7,8,9,,
    } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,,
    [51,4]
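
The two viable-prefix ideas (tell incomplete inputs apart from incorrect ones, then solve one character at a time) can be sketched on a toy language of nested integer lists such as `[51,4]`. Everything here is an illustrative assumption: the toy parser, the three-valued `prefix_oracle`, and the `generate` loop are not the talk's implementation. A real program would signal "incomplete" versus "incorrect" through its own error feedback.

```python
import random

DIGITS = "0123456789"

class Incomplete(Exception):
    """Input ended while more characters were still expected."""

class Incorrect(Exception):
    """Input has a character that cannot lead to a valid input."""

def parse_value(s, i):
    """Parse a number or a list starting at s[i]; return the index after it."""
    if i >= len(s):
        raise Incomplete
    if s[i] == "[":
        return parse_list(s, i + 1)
    if s[i] in DIGITS:
        while i < len(s) and s[i] in DIGITS:
            i += 1
        return i
    raise Incorrect

def parse_list(s, i):
    """Parse the rest of a list whose '[' has already been consumed."""
    if i >= len(s):
        raise Incomplete
    if s[i] == "]":
        return i + 1
    while True:
        i = parse_value(s, i)
        if i >= len(s):
            raise Incomplete
        if s[i] == "]":
            return i + 1
        if s[i] != ",":
            raise Incorrect
        i += 1

def prefix_oracle(s):
    """Classify s as 'complete', 'incomplete' (viable prefix), or 'incorrect'."""
    try:
        end = parse_value(s, 0)
    except Incomplete:
        return "incomplete"
    except Incorrect:
        return "incorrect"
    return "complete" if end == len(s) else "incorrect"

def generate(oracle, alphabet, rng, max_len=30, stop_prob=0.3):
    """Grow an input one character at a time, keeping only viable prefixes."""
    prefix = ""
    while len(prefix) < max_len:
        if oracle(prefix) == "complete" and rng.random() < stop_prob:
            return prefix
        candidates = list(alphabet)
        rng.shuffle(candidates)
        for c in candidates:
            if oracle(prefix + c) != "incorrect":
                prefix += c  # c keeps the prefix viable
                break
        else:
            break  # no character extends this prefix
    return prefix if oracle(prefix) == "complete" else None

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(generate(prefix_oracle, "[],{}ab" + DIGITS, rng))
```

Each step costs at most one oracle call per alphabet symbol, so the work grows with input length times alphabet size rather than exponentially in the length.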

  32. Quality of Examples
    Branch Coverage Obtained (C programs)

    Subject  PrefixQ  AFL(black)
    INI      62.5     65
    CSV      65.7     68.3
    JSON     13.8     9.2
    TinyC    86.8     47.9
    MJS      28.0     19.0

  33. Quality of Examples
    Branch Coverage Obtained (C programs)

    Subject  PrefixQ  AFL(black)  AFL(gray)
    INI      62.5     65          77.5
    CSV      65.7     68.3        68.5
    JSON     13.8     9.2         22.5
    TinyC    86.8     47.9        81.6
    MJS      28.0     19.0        29.9

    Tex Crash: ]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbracefill @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$*

  34. Grammar Inference

  35. Grammar Inference (L*)
    L* (Angluin'87)
    Learner asks the Teacher:
    membership: w ∈ L? Answer: yes/no
    equivalence: G = L? Answer: yes, or no with a counterexample

  36. Grammar Inference (L*)
    L* (Angluin)
    Learner
    membership: ab ?
    equivalence:
    no
    abbb
    yes
    Teacher
    ?

  37. Grammar Inference (L*)
    Learner Teacher
    w
    G = L?
    Equivalence queries are not possible in software engineering scenarios
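
The two query types that L* expects from a teacher can be sketched as follows. This is a minimal illustration, assuming the teacher knows the target language as a regular expression. The `Teacher` class, its method names, and the bounded search are hypothetical and are not Angluin's construction.

```python
import itertools
import re

class Teacher:
    def __init__(self, target_regex, alphabet):
        self.target = re.compile(target_regex)
        self.alphabet = alphabet

    def membership(self, w):
        """Membership query: is w in the target language?"""
        return self.target.fullmatch(w) is not None

    def equivalence(self, hypothesis, max_len=6):
        """Approximate equivalence query.

        Tries every string up to max_len and returns the first one on which
        the hypothesis and the target disagree, or None if there is none.
        """
        for n in range(max_len + 1):
            for chars in itertools.product(self.alphabet, repeat=n):
                w = "".join(chars)
                if hypothesis(w) != self.membership(w):
                    return w  # counterexample
        return None

if __name__ == "__main__":
    teacher = Teacher(r"(ab)*", "ab")
    print(teacher.membership("abab"))                    # True
    print(teacher.equivalence(lambda w: len(w) % 2 == 0))  # aa
```

Even this approximation enumerates a number of strings exponential in `max_len`, and it still needs a description of the target. With only a black-box program available, neither is on hand, which is why the equivalence query has to be replaced.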

  38. L* Teacher with PAC Guarantees
    Probably Approximately Correct (Valiant'84)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−δ: confidence
    1−ϵ: accuracy
    Equivalence Query = Multiple Membership Checks
    Checks come from some sampling distribution D over A*
    Checks made in place of the i-th equivalence query:
    q_i = ⌈(1/ϵ)(ln(1/δ) + i ln 2)⌉
    We only get a PAC guarantee based on D
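
The number of membership checks that replace the i-th equivalence query follows directly from the formula on this slide. In the sketch below, `num_checks`, `pac_equivalence`, and the `sampler` callback (standing in for the distribution D) are hypothetical names chosen for illustration.

```python
import math

def num_checks(i, epsilon, delta):
    """q_i = ceil((1/epsilon) * (ln(1/delta) + i * ln 2))."""
    return math.ceil((1.0 / epsilon) * (math.log(1.0 / delta) + i * math.log(2)))

def pac_equivalence(hypothesis, blackbox, sampler, i, epsilon, delta):
    """Stand-in for the i-th equivalence query.

    Draws q_i strings from the sampler and returns the first string on which
    hypothesis and blackbox disagree, or None if all samples agree.
    """
    for _ in range(num_checks(i, epsilon, delta)):
        w = sampler()
        if hypothesis(w) != blackbox(w):
            return w  # counterexample
    return None

if __name__ == "__main__":
    # With epsilon = delta = 0.1, the first equivalence query needs 30 checks.
    print(num_checks(1, 0.1, 0.1))
```

Note that q_i grows only linearly in i, but the guarantee holds only with respect to the distribution the sampler draws from, as the slide states.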

  39. Grammar Inference (PAC-L*)
    Substituting Equivalence Queries
    [Figure labels: Learner, Prefix Oracle, Random Sampler (D), Blackbox, Hypothesis, w, w ∈ D, L(*); Yes/No answers for the sampled strings]

  40. Grammar Inference (PAC-L*)
    Substituting Equivalence Queries
    [Figure labels: Learner, Prefix Oracle, Random Sampler (D), w, w ∈ D, L(*), Search Space]

  41. Grammar Inference (PL*)
    Substituting Equivalence Queries
    [Figure labels: Learner, Prefix Oracle, Blackbox, Hypothesis, PL(*); queries w, w ∈ B (Yes/No), w ∈ H (Yes/No)]

  42. Grammar Inference (PL*)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−δ: confidence
    1−ϵ: accuracy
    Relation between D, ϵ, δ and F1 score on Arithmetic (depth limited)
    [Figure: four panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0)]
    Red is good, Blue is bad

  43. Grammar Inference (PL*)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−δ: confidence
    1−ϵ: accuracy
    Relation between D, ϵ, δ and F1 score on JSON (depth limited)
    [Figure: four panels: L(*); PL(*) with Eq = Prefix Sampler (p=0.05); PL(*) with Eq = Prefix Sampler (p=0.5); PL(*) with Eq = Prefix Sampler (p=1.0)]
    Red is good, Blue is bad

  44. Grammar Mining
    Blackbox Generation
    Grammar Inference