How to Talk to Strange Programs and Find Bugs

Talk at ASESS'24

Rahul Gopinath

February 01, 2024

Transcript

  1. How to Talk to Strange Programs and Find Bugs
    Rahul Gopinath
    https://rahul.gopinath.org [email protected]
    @[email protected]

  2. How to Talk to Strange Programs and Find Bugs
    i.e., when the input specification is not known

  3. We live in a world defined by software

  4. We have a crisis
    We live in a world defined by software

  5. Interlinked Causes for Catastrophes
    https://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf
    The University of Sydney

  6. Input
    Testing
    @app.route('/admin')
    def admin():
        username = request.cookies.get("username")
        if not username:
            return {"Error": "Specify username in Cookie"}
        username = urllib.quote(os.path.basename(username))
        url = "http://permissions:5000/permissions/{}".format(username)
        resp = requests.request(method="GET", url=url)
        # "superadmin\ud888" will be simplified to "superadmin"
        ret = ujson.loads(resp.text)
        if resp.status_code == 200:
            if "superadmin" in ret["roles"]:
                return {"OK": "Superadmin Access granted"}
            else:
                e = u"Access denied. User has following roles: {}".format(ret["roles"])
                return {"Error": e}, 401
        else:
            return {"Error": ret["Error"]}, 500

  7. (The /admin program from the previous slide, now receiving random inputs:)
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@!
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
    J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ?
    lR=bF3+;y$3lodQi,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P!
    A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 |
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi
    C < ) , ` + t * g k a < W = Z .
    % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V ( ( - % > < h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@
    5 : d f d 4 5 * ( 7 ^ % 5 a p \ z I y l " ' f ,
    $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh
    Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/
    6N-wyzj/MTd#A;r
    Program
    Automating Testing
    https://www.fuzzingbook.org/html/Fuzzer.html
    Fuzzing
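
A minimal sketch of the random input generation this slide illustrates. It is loosely modeled on the simple fuzzer described in the chapter linked above; the function name, parameters, and defaults here are assumptions made for illustration.

```python
import random

def fuzzer(max_length=100, char_start=32, char_range=95, rng=random):
    """Return a random string of up to max_length printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    return ''.join(chr(rng.randrange(char_start, char_start + char_range))
                   for _ in range(length))

if __name__ == '__main__':
    # Print a few random inputs of the kind shown on the slide.
    for _ in range(3):
        print(repr(fuzzer(40)))
```

Passing a seeded `random.Random` instance as `rng` makes a fuzzing run reproducible.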

  8. (The /admin program again, receiving the same random inputs:)
    [ ; x 1 - G P Z + w c c k c ] ; , N 9 J + ?
    # 6 ^ 6 \ e ? ] 9 l u 2 _ % ' 4 G X " 0 V U B [ E / r
    ~ f A p u 6 b 8 < { % s i q 8 Z h . 6 { V , h r ? ;
    {Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@!
    Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH
    J I I v H z > _ * . \ > J r l U 3 2 ~ e G P ?
    lR=bF3+;y$3lodQi,c{<[~m!]o;{.'}Gj\(X}EtYetrpbY@aGZ1{P!
    A Z U 7 x # 4 ( R t n ! q 4 n C w q o l ^ y 6 } 0 |
    Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*Bi
    C < ) , ` + t * g k a < W = Z .
    % T 5 W G H Z p I 3 0 D < P q > & ] B S 6 R & j ?
    # t P 7 i a V } - } ` \ ? [ _ [ Z ^ L B M P G -
    FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!
    ^ z k h d f 3 C 5 P A k R ? V h n |
    3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@
    5 5 a p \ z I y l " ' f ,
    $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@Wjh
    Z}r[Scun&sBCS,T[/3]KAeEnQ7lU)3Pn,0)G/
    6N-wyzj/MTd#A;r
    Structured Inputs
    SYNTAX ERROR

  9. Parser
    def process_input(input):
        try:
            val = parse(input)    # ✘ SYNTAX ERROR
            res = process(val)
            return res
        except SyntaxError:
            return Error

  10. SYNTAX ERROR
    def process_input(input):
        try:
            val = parse(input)    # ✘
            res = process(val)
            return res
        except SyntaxError:
            return Error
    The Core
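
To make the point concrete, here is a small sketch that counts how many random strings get past a parser and reach the core. Python's `json.loads` stands in for `parse`, and `process` is a hypothetical placeholder for the core logic; neither is taken from the talk.

```python
import json
import random

def process(val):
    """Hypothetical stand-in for the program's core logic."""
    return type(val).__name__

def process_input(text):
    try:
        val = json.loads(text)      # the parser guards the core
    except ValueError:              # json.JSONDecodeError is a ValueError
        return None
    return process(val)

def random_string(rng, max_length=20):
    length = rng.randrange(1, max_length + 1)
    return ''.join(chr(rng.randrange(32, 127)) for _ in range(length))

def reached_core(trials=10_000, seed=0):
    """Count the random inputs that the parser accepts."""
    rng = random.Random(seed)
    return sum(process_input(random_string(rng)) is not None
               for _ in range(trials))

if __name__ == '__main__':
    print(reached_core(), 'of 10000 random inputs reached the core')
```

Under these assumptions only a small fraction of random strings is accepted, so almost all of the testing effort is spent in the parser's error paths.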

  11. Overcoming Parsers

  12.
    def process_input(input):
        try:
            val = parse(input)
            res = process(val)
            return res
        except SyntaxError:
            return Error
    {

    '' : [['']],
    '' : [[''],
    [''],
    [''],
    [''],
    ['true'], ['false'], ['null']],
    '' : [['{', '','}'],
    ['{}']],
    '' : [[',',','],
    ['']],
    '' : [['',':', '']],
    '' : [['[', '', ']'],
    ['[]']],
    '' : [[',',','],
    ['']],
    '' : [['"', '', '"'],
    ['""']],
    '' : [['',''],
    ['']],
    '' : [['']],
    '' : [['',''],
    ['']],
    '' : [[c] for c in string.characters]
    '' : [[c] for c in string.digits]

    }
    Fix: Input Grammar

  13.
    def process_input(input):
        try:
            val = parse(input)    # ✔
            res = process(val)
            return res
        except SyntaxError:
            return Error
    Fix: Input Grammar
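
A sketch of how an input grammar gets inputs past the parser. The grammar below is a hypothetical JSON subset written in the same dictionary style as the slide; its nonterminal names and the depth-limiting strategy are assumptions, not the talk's exact grammar.

```python
import json
import random

# Hypothetical JSON-subset grammar; every nonterminal name is an assumption.
GRAMMAR = {
    '<start>':   [['<value>']],
    '<value>':   [['<object>'], ['<array>'], ['<string>'], ['<number>'],
                  ['true'], ['false'], ['null']],
    '<object>':  [['{', '<members>', '}'], ['{}']],
    '<members>': [['<member>', ',', '<members>'], ['<member>']],
    '<member>':  [['<string>', ':', '<value>']],
    '<array>':   [['[', '<items>', ']'], ['[]']],
    '<items>':   [['<value>', ',', '<items>'], ['<value>']],
    '<string>':  [['"', '<chars>', '"'], ['""']],
    '<chars>':   [['<char>', '<chars>'], ['<char>']],
    '<char>':    [[c] for c in 'abcxyz'],
    '<number>':  [[d] for d in '0123456789'],
}

def nonterminals(alternative):
    return sum(symbol in GRAMMAR for symbol in alternative)

def expand(symbol, rng, depth=0, max_depth=6):
    """Randomly expand symbol; past max_depth prefer the least recursive rule."""
    if symbol not in GRAMMAR:
        return symbol                               # terminal symbol
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Sufficient for this grammar; not a general termination argument.
        fewest = min(nonterminals(alt) for alt in alternatives)
        alternatives = [alt for alt in alternatives
                        if nonterminals(alt) == fewest]
    chosen = rng.choice(alternatives)
    return ''.join(expand(s, rng, depth + 1, max_depth) for s in chosen)

if __name__ == '__main__':
    rng = random.Random(0)
    for _ in range(3):
        text = expand('<start>', rng)
        json.loads(text)            # accepted by the parser
        print(text)
```

Every generated string is syntactically valid for this subset, so the test effort reaches the code behind the parser.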

  14. ASIDE: A Simple Solution
    Let the parser tell you what it wants

  15. Let the parser tell you what it wants
    [ 3 5 ]

  16. Track input string accesses and comparisons
    [ 3 5 ]
    PARSE ERROR c ∉ {']', ','}

  17. Track input string accesses and comparisons
    [ 3 , ]

  18. Track input string accesses and comparisons
    [ 3 , ]
    PARSE ERROR c ∉ {'0'.., '[', '{'}

  19. Track input string accesses and comparisons
    [ 3 , 1

  20. Track input string accesses and comparisons
    [ 3 , 1 ]

  21. Key Idea: Viable Prefixes
    • Identify character comparisons or EOF
    • Complete with one of the compared characters
    [ 3 , 1 ]
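
The loop behind this idea can be sketched as follows. This is a simplification under stated assumptions: instead of tracking character comparisons inside a real parser, a toy parser for nested integer lists reports whether a prefix is incomplete (ran out of input) or incorrect (rejected a character), and the generator tries random characters. All names are hypothetical.

```python
import random

DIGITS = '0123456789'
ALPHABET = list('[],' + DIGITS + 'x;$')

class Incomplete(Exception):
    """The parser ran out of input: the prefix is viable so far."""

class Incorrect(Exception):
    """The parser rejected a character: the prefix is not viable."""

def peek(s, i):
    if i >= len(s):
        raise Incomplete(i)
    return s[i]

def parse_value(s, i):
    c = peek(s, i)
    if c == '[':
        return parse_list(s, i)
    if c in DIGITS:
        while i < len(s) and s[i] in DIGITS:
            i += 1
        return i
    raise Incorrect(i)

def parse_list(s, i):
    i += 1                                  # consume '['
    if peek(s, i) == ']':
        return i + 1
    while True:
        i = parse_value(s, i)
        c = peek(s, i)
        if c == ']':
            return i + 1
        if c != ',':
            raise Incorrect(i)
        i += 1

def check(s):
    """Classify s as 'complete', 'incomplete' or 'incorrect'."""
    try:
        if peek(s, 0) != '[':
            raise Incorrect(0)
        end = parse_list(s, 0)
    except Incomplete:
        return 'incomplete'
    except Incorrect:
        return 'incorrect'
    return 'complete' if end == len(s) else 'incorrect'

def generate(rng, budget=8):
    """Grow a viable prefix one character at a time until it is accepted."""
    prefix = ''
    while True:
        candidates = rng.sample(ALPHABET, len(ALPHABET))
        if len(prefix) >= budget:
            # Past the budget, try closing characters first.
            candidates = [']'] + list(DIGITS) + candidates
        for c in candidates:
            status = check(prefix + c)
            if status == 'complete':
                return prefix + c
            if status == 'incomplete':
                prefix += c                 # still viable: keep the character
                break
```

For example, `generate(random.Random(0))` returns a string that `check` classifies as `'complete'`. The parser is executed once per candidate character, which is where the performance limitation mentioned in the talk comes from.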

  22. Viable Prefixes
    Limitations
    • Performance
    • Lack of control
    [
    x
    [ ;
    @
    [ 3
    _
    [ 3
    _
    [ 3 $
    _
    [ 3 ,
    _
    [ 3 , x
    _
    [ 3 , 1
    _
    [ 3 , 1 ;
    _
    [ 3 , 1 ]

  23. Viable Prefixes
    Mathis, Gopinath, Mera, Kampmann, Höschele, and Zeller. Parser Directed Fuzzing. PLDI 2019.
    Mathis, Gopinath, and Zeller. Learning Input Tokens for Effective Fuzzing. ISSTA 2020.

  24. Where to Get the Input Grammar From?

  25. Formal Languages
    Language Descriptions: Grammars
    Regular
    Context Free
    Recursively Enumerable
    (Chomsky, 1956)
    Argument Stack
    Return Stack

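  26. Grammar
    Arithmetic expression grammar
    <start>   := <expr>
    <expr>    := <expr> '+' <expr>
               | <expr> '-' <expr>
               | <expr> '/' <expr>
               | <expr> '*' <expr>
               | '(' <expr> ')'
               | <number>
    <number>  := <integer>
               | <integer> '.' <integer>
    <integer> := <digit> <integer>
               | <digit>
    <digit>   := [0-9]
    Definition for
    key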

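  27. Grammar
    Arithmetic expression grammar
    <start>   := <expr>
    <expr>    := <expr> '+' <expr>
               | <expr> '-' <expr>
               | <expr> '/' <expr>
               | <expr> '*' <expr>
               | '(' <expr> ')'
               | <number>
    <number>  := <integer>
               | <integer> '.' <integer>
    <integer> := <digit> <integer>
               | <digit>
    <digit>   := [0-9]
    Expansion Rule
    Terminal Symbol
    Nonterminal Symbol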

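  28. Grammars
    For Parsing
    (8 / 3) * 49
    <start>   := <expr>
    <expr>    := <expr> '+' <expr>
               | <expr> '-' <expr>
               | <expr> '/' <expr>
               | <expr> '*' <expr>
               | '(' <expr> ')'
               | <number>
    <number>  := <integer>
               | <integer> '.' <integer>
    <integer> := <digit> <integer>
               | <digit>
    <digit>   := [0-9]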

  29. Grammars
    8.2 - 27 - -9 / +((+9 * --2 + --+-+-
    ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4)
    )))) - ++4) / +(-+---((5.6 - --(3 *
    -1.8 * +(6 * +-(((-(-6) * ---+6)) /
    +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3
    - ++9.0 + ---(--+7 / (1 / +++6.37)
    + (1) / 482) / +++-+0)))) + 8.2 - 27
    - -9 / +((+9 * --2 + --+-+-((-1 * +
    (8 - 5 - 6)) * (-(a-+(((+(4))))) - +
    +4) / +(-+---((5.6 - --(3 * -1.8 * +
    (6 * +-(((-(-6) * ---+6)) / +--(+-+-
    7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0
    + ---(--+7 / (1 / +++6.37) + (1) /
    482) / +++-+0)))) * -+5 + 7.513))))
    - (+1 / ++((-84)))))))) * ++5 / +-(-
    -2 - -++-9.0)))) / 5 * --++090 + * -
    +5 + 7.513)))) - (+1 / ++((-84))))))
    )) * 8.2 - 27 - -9 / +((+9 * --2 + -
    -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+((
    (+(4))))) - ++4) / +(-+---((5.6 - --
    (3 * -1.8 * +(6 * +-(((-(-6) * ---+6
    )) / +--(+-+-7 * (-0 * (+(((((2)) +
    8 - 3 - ++9.0 + ---(--+7 / (1 / +++6
    .37) + (1) / 482) / +++-+0)))) * -+5
    + 7.513)))) - (+1 / ++((-84))))))))
    * ++5 / +-(--2 - -++-9.0)))) / 5 *
    --++090 ++5 / +-(--2 - -++-9.0)))) /
    5 * --++090
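    <start>   := <expr>
    <expr>    := <expr> '+' <expr>
               | <expr> '-' <expr>
               | <expr> '/' <expr>
               | <expr> '*' <expr>
               | '(' <expr> ')'
               | <number>
    <number>  := <integer>
               | <integer> '.' <integer>
    <integer> := <digit> <integer>
               | <digit>
    <digit>   := [0-9]
    For Fuzzing (Hanford 1970; Purdom 1972)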

  30. Grammars
    As effective producers
    Interpreter
    Parser


    8.2 - 27 - -9 / +((+9 * --2 + --+-+-
    ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4)
    )))) - ++4) / +(-+---((5.6 - --(3 *
    -1.8 * +(6 * +-(((-(-6) * ---+6)) /
    +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3
    - ++9.0 + ---(--+7 / (1 / +++6.37)
    + (1) / 482) / +++-+0)))) + 8.2 - 27
    - -9 / +((+9 * --2 + --+-+-((-1 * +
    (8 - 5 - 6)) * (-(a-+(((+(4))))) - +
    +4) / +(-+---((5.6 - --(3 * -1.8 * +
    (6 * +-(((-(-6) * ---+6)) / +--(+-+-
    7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0
    + ---(--+7 / (1 / +++6.37) + (1) /
    482) / +++-+0)))) * -+5 + 7.513))))
    - (+1 / ++((-84)))))))) * ++5 / +-(-
    -2 - -++-9.0)))) / 5 * --++090 + * -
    +5 + 7.513)))) - (+1 / ++((-84))))))
    )) * 8.2 - 27 - -9 / +((+9 * --2 + -
    -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+((
    (+(4))))) - ++4) / +(-+---((5.6 - --
    (3 * -1.8 * +(6 * +-(((-(-6) * ---+6
    )) / +--(+-+-7 * (-0 * (+(((((2)) +
    8 - 3 - ++9.0 + ---(--+7 / (1 / +++6
    .37) + (1) / 482) / +++-+0)))) * -+5
    + 7.513)))) - (+1 / ++((-84))))))))
    * ++5 / +-(--2 - -++-9.0)))) / 5 *
    --++090 ++5 / +-(--2 - -++-9.0)))) /
    5 * --++090

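  31. Grammars
    As efficient producers
    <start>   := <expr>
    <expr>    := <expr> '+' <expr>
               | <expr> '-' <expr>
               | <expr> '/' <expr>
               | <expr> '*' <expr>
               | '(' <expr> ')'
               | <number>
    <number>  := <integer>
               | <integer> '.' <integer>
    <integer> := <digit> <integer>
               | <digit>
    <digit>   := [0-9]
    import random
    # Each nonterminal becomes a function; each alternative becomes a case.
    # Note: the recursion in expr() is unbounded, so deep runs can exceed
    # Python's recursion limit; practical producers bound the depth.
    def start():
        expr()
    def expr():
        match random.randrange(6):
            case 0: expr(); print('+', end=''); expr()
            case 1: expr(); print('-', end=''); expr()
            case 2: expr(); print('/', end=''); expr()
            case 3: expr(); print('*', end=''); expr()
            case 4: print('(', end=''); expr(); print(')', end='')
            case 5: number()
    def number():
        match random.randrange(2):
            case 0: integer()
            case 1: integer(); print('.', end=''); integer()
    def integer():
        match random.randrange(2):
            case 0: digit(); integer()
            case 1: digit()
    def digit():
        # One case per digit on the slide; collapsed here into a single print.
        print(random.randrange(10), end='')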
    Compiled Grammar (F1)
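
A depth-bounded variant of such a compiled producer can be sketched as below. The depth budget, the digit limit, and returning strings instead of printing are assumptions added for illustration; they are not part of the slide or of F1.

```python
import random

def expr(rng, depth):
    # Out of budget: take the only non-recursive alternative, <number>.
    choice = rng.randrange(6) if depth > 0 else 5
    if choice < 4:
        return expr(rng, depth - 1) + '+-/*'[choice] + expr(rng, depth - 1)
    if choice == 4:
        return '(' + expr(rng, depth - 1) + ')'
    return number(rng)

def number(rng):
    if rng.randrange(2) == 0:
        return integer(rng)
    return integer(rng) + '.' + integer(rng)

def integer(rng, max_digits=3):
    if max_digits > 1 and rng.randrange(2) == 0:
        return digit(rng) + integer(rng, max_digits - 1)
    return digit(rng)

def digit(rng):
    return str(rng.randrange(10))

def start(seed=None, depth=4):
    """Produce one arithmetic expression with nesting depth at most `depth`."""
    return expr(random.Random(seed), depth)

if __name__ == '__main__':
    for seed in range(3):
        print(start(seed))
```

Because each grammar rule is ordinary code, no grammar data structure is interpreted at generation time, and the depth budget guarantees that every call terminates.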

  32. Where to Get the Grammar From?

  33. Almost Everyone Uses Handwritten Parsers
    https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html

  34. Where to Get the Grammar From?

  35. "Be liberal in what you accept, and conservative in what you send"
    Postel's Law

  36. QUIRK_ALLOW_ASCII_CONTROL_CODES
    QUIRK_ALLOW_BACKSLASH_A
    QUIRK_ALLOW_BACKSLASH_CAPITAL_U
    QUIRK_ALLOW_BACKSLASH_E
    QUIRK_ALLOW_BACKSLASH_NEW_LINE
    QUIRK_ALLOW_BACKSLASH_QUESTION_MARK
    QUIRK_ALLOW_BACKSLASH_SINGLE_QUOTE
    QUIRK_ALLOW_BACKSLASH_V
    QUIRK_ALLOW_BACKSLASH_X_AS_BYTES
    QUIRK_ALLOW_BACKSLASH_X_AS_CODE_POINTS
    QUIRK_ALLOW_BACKSLASH_ZERO
    QUIRK_ALLOW_COMMENT_BLOCK
    QUIRK_ALLOW_COMMENT_LINE
    QUIRK_ALLOW_EXTRA_COMMA
    QUIRK_ALLOW_INF_NAN_NUMBERS
    QUIRK_ALLOW_LEADING_ASCII_RECORD_SEPARATOR
    QUIRK_ALLOW_LEADING_UNICODE_BYTE_ORDER_MARK
    QUIRK_ALLOW_TRAILING_FILLER
    QUIRK_EXPECT_TRAILING_NEW_LINE_OR_EOF
    QUIRK_JSON_POINTER_ALLOW_TILDE_N_TILDE_R_TILDE_T
    QUIRK_REPLACE_INVALID_UNICODE
    JSON common quirks from https://github.com/google/wuffs

  37. Where to Get the Grammar From?
    "Be liberal in what you accept, and conservative in what you send"
    Postel's Law
    The Specification
    The Implementation
    Extra "Features"
    Bugs

  38. Grammar Mining: Whitebox extraction of grammar
    Grammar Inference: Blackbox extraction of grammar

  39. Where to Get the Grammar From?
    Handwritten parsers contain the parse structure
    key value key value
    scheme
    parse_scheme parse_hostpath parse_querystring parse_fragment
    domain TLD
    subdomain
    parse_host
    subdirectory
    parse_fslocation
    binary
    parse_binaryname
    parameters
    parse_parameters
    parse_url

  40. Where to Get the Grammar From?
    Mining Grammar from a hand-written parser
    https://www.example.com/forum/questions/cgi?tag=networking&order=newwest#top
    key value key value
    split
    scheme
    parse_scheme
    host path
    parse_hostpath
    query string
    parse_querystring
    fragment
    parse_fragment
    domain TLD
    subdomain
    parse_host
    subdirectory
    parse_fslocation
    binary
    parse_binaryname
    parameters
    parse_parameters
    With Dynamic Data Flow Analysis
    parseurl

  41.
    http://user:[email protected]:80/?q=path#ref
        urlparse:url = 'http://user:[email protected]:80/?q=path#ref'
        urlsplit:scheme = 'http'
        urlsplit:netloc = 'user:[email protected]:80'
        urlsplit:fragment = 'ref'
        urlsplit:query = 'q=path'
    https://soft-eng.sydney.edu.au:80/
        urlparse:url = 'https://soft-eng.sydney.edu.au:80/'
        urlsplit:scheme = 'https'
        urlsplit:netloc = 'soft-eng.sydney.edu.au:80'
    http://www.fuzzingbook.org/#News
        urlparse:url = 'http://www.fuzzingbook.org/#News'
        urlsplit:scheme = 'http'
        urlsplit:netloc = 'www.fuzzingbook.org'
        urlsplit:fragment = 'News'
    Mining with Dynamic Data Flow Analysis

  42.
    {
    '<url>': [
        ['<scheme>', '://', '<netloc>', '/?', '<query>', '#', '<fragment>'],
        ['<scheme>', '://', '<netloc>', '/#', '<fragment>'],
        ['<scheme>', '://', '<netloc>', '/']],
    '<scheme>': [
        ['http'],
        ['http']],
    '<netloc>': [
        ['user:[email protected]:80'],
        ['www.fuzzingbook.org'],
        ['soft-eng.sydney.edu.au']],
    '<query>': [
        ['q=path']],
    '<fragment>': [
        ['ref'],
        ['News']],
    }
    Mining with Dynamic Data Flow Analysis
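
A minimal sketch of how such traces can be turned into grammar rules: every traced value that occurs in the input is replaced by a nonterminal named after the variable that held it. The traces are written by hand here rather than recorded by instrumentation, and the function names are hypothetical.

```python
def mine_rule(url, trace):
    """Replace each traced substring of url by a nonterminal for its variable."""
    spans = []
    for name, value in trace.items():
        start = url.find(value)
        if value and start >= 0:
            spans.append((start, start + len(value), '<%s>' % name))
    spans.sort()
    rule, pos = [], 0
    for start, end, nonterminal in spans:
        if start < pos:
            continue                        # overlapping value: skipped here
        if start > pos:
            rule.append(url[pos:start])     # literal text between variables
        rule.append(nonterminal)
        pos = end
    if pos < len(url):
        rule.append(url[pos:])
    return rule

def mine_grammar(samples):
    """Merge the rules mined from several (input, trace) pairs."""
    grammar = {'<url>': []}
    for url, trace in samples:
        rule = mine_rule(url, trace)
        if rule not in grammar['<url>']:
            grammar['<url>'].append(rule)
        for name, value in trace.items():
            alternatives = grammar.setdefault('<%s>' % name, [])
            if [value] not in alternatives:
                alternatives.append([value])
    return grammar

SAMPLES = [
    ('http://www.fuzzingbook.org/#News',
     {'scheme': 'http', 'netloc': 'www.fuzzingbook.org', 'fragment': 'News'}),
    ('https://soft-eng.sydney.edu.au:80/',
     {'scheme': 'https', 'netloc': 'soft-eng.sydney.edu.au:80'}),
]

if __name__ == '__main__':
    for key, alternatives in mine_grammar(SAMPLES).items():
        print(key, alternatives)
```

The mined `<url>` rules have the same shape as the ones on the slide. Only values stored in variables are seen, so structure that is expressed through control flow alone is missed.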

  43. Mining with Dynamic Data Flow Analysis
    Limitations
    • Poor accuracy in most handwritten parsers
    • Handwritten parsers are often not well formed
    • Control flow is ignored

  44. Structured Control Flow to Grammar
    [Figure: control-flow constructs of a function F and the grammar rules they correspond to. Panels: Sequence (A, B, C), Selection (cond, A, B), Iteration (cond, B), Function ([F] := ...).]

  45. Mining with Dynamic Control Flow Analysis
    Hand-written parsers already encode the grammar
    1. Extract the input string accesses
    2. Attach control flow information

  46. Mining with Dynamic Control Flow Analysis
    • Inputs + control flow -> Dynamic Control Dependence Trees
    • DCD Trees -> Parse Tree

  47. Control Dependence Graph
    Statement B is control dependent on A if A determines whether B executes.
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    CDG for parse_csv: 'while' determines whether 'if' executes

  48. Dynamic Control Dependence Tree
    Each statement execution is represented as a separate node
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    CDG for parse_csv
    DCD Tree for call parse_csv()

  49. DCD Tree ~ Parse Tree
    def parse_csv(s,i):
        while s[i:]:
            if is_digit(s[i]):
                n,j = num(s[i:])
                i = i+j
            else:
                comma(s[i])
                i += 1
    "12," -> '1' '2' ','
    • No tracking beyond input buffer
    • Characters are attached to nodes where they are accessed last

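The rule that characters belong to the node that accessed them last can be sketched with a small input wrapper. This is an illustration under assumptions: it records only the name of the reading function, found via CPython's `sys._getframe`, rather than a full dynamic control dependence tree, and the parser is a simplified variant of `parse_csv` that uses integer indexing only.

```python
import sys

class TrackedInput:
    """Input buffer that remembers which function last read each index."""
    def __init__(self, text):
        self.text = text
        self.last_reader = {}               # index -> function name

    def __len__(self):
        return len(self.text)

    def __getitem__(self, i):
        # Frame 1 is the function performing this access.
        self.last_reader[i] = sys._getframe(1).f_code.co_name
        return self.text[i]

def parse_num(s, i):
    while i < len(s) and s[i] in '0123456789':
        i += 1
    return i

def parse_comma(s, i):
    assert s[i] == ','
    return i + 1

def parse_csv(s, i=0):
    while i < len(s):
        if s[i] in '0123456789':
            i = parse_num(s, i)
        else:
            i = parse_comma(s, i)
    return i

def fragments(s):
    """Group neighbouring characters by the function that read them last."""
    out = []
    for i, ch in enumerate(s.text):
        reader = s.last_reader.get(i)
        if out and out[-1][0] == reader:
            out[-1] = (reader, out[-1][1] + ch)
        else:
            out.append((reader, ch))
    return out

if __name__ == '__main__':
    buf = TrackedInput('12,3')
    parse_csv(buf)
    print(fragments(buf))
    # [('parse_num', '12'), ('parse_comma', ','), ('parse_num', '3')]
```

Grouping by function name merges separate calls that touch adjacent characters; a faithful implementation keeps one node per call, as the slides describe.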

  50. Parse tree for parse_expr('9+3/4')
    class Ex(Exception): pass    # parse error; not defined on the slide
    def is_digit(i): return i in '0123456789'
    def parse_num(s,i):
        n = ''
        while s[i:] and is_digit(s[i]):
            n += s[i]
            i = i +1
        return i,n
    def parse_paren(s, i):
        assert s[i] == '('
        i, v = parse_expr(s, i+1)
        if s[i:] == '': raise Ex(s, i)
        assert s[i] == ')'
        return i+1, v
    def parse_expr(s, i = 0):
        expr, is_op = [], True
        while s[i:]:
            c = s[i]
            if is_digit(c):
                if not is_op: raise Ex(s,i)
                i,num = parse_num(s,i)
                expr.append(num)
                is_op = False
            elif c in ['+', '-', '*', '/']:
                if is_op: raise Ex(s,i)
                expr.append(c)
                is_op, i = True, i + 1
            elif c == '(':
                if not is_op: raise Ex(s,i)
                i, cexpr = parse_paren(s, i)
                expr.append(cexpr)
                is_op = False
            elif c == ')': break
            else: raise Ex(s,i)
        if is_op: raise Ex(s,i)
        return i, expr

  51. Identifying Compatible Nodes
    Which nodes correspond to the same nonterminal?
    (calc.py parser as on the previous slide, input 9+3/4)

  52. (9 + 1) * 3
    3 * (9 + 1)

  53. 9 + 1
    3 * (9 + 1)

  54. 3 (9 + 1) *
    3 * (9 + 1)

  55. 3*(1)
    1
    :=


    :=

  56. :=
    :=
    |
    :=
    |
    |
    |
    :=
    :=
    |
    :=
    := '3' | '1'
    := '(' ')'
    :=
    := '*'

  57. (calc.py parser as shown earlier)
    :=
    :=
    |
    :=
    |
    := '(' ')'
    |
    := '*' | '+' | '-' | '/'
    :=
    |
    : [0-9]
    calc.py Recovered Arithmetic Grammar

  58.
    8.2 - 27 - -9 / +((+9 * --2 + --+-+-
    ((-1 * +(8 - 5 - 6)) * (-(a-+(((+(4)
    )))) - ++4) / +(-+---((5.6 - --(3 *
    -1.8 * +(6 * +-(((-(-6) * ---+6)) /
    +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3
    - ++9.0 + ---(--+7 / (1 / +++6.37)
    + (1) / 482) / +++-+0)))) + 8.2 - 27
    - -9 / +((+9 * --2 + --+-+-((-1 * +
    (8 - 5 - 6)) * (-(a-+(((+(4))))) - +
    +4) / +(-+---((5.6 - --(3 * -1.8 * +
    (6 * +-(((-(-6) * ---+6)) / +--(+-+-
    7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0
    + ---(--+7 / (1 / +++6.37) + (1) /
    482) / +++-+0)))) * -+5 + 7.513))))
    - (+1 / ++((-84)))))))) * ++5 / +-(-
    -2 - -++-9.0)))) / 5 * --++090 + * -
    +5 + 7.513)))) - (+1 / ++((-84))))))
    )) * 8.2 - 27 - -9 / +((+9 * --2 + -
    -+-+-((-1 * +(8 - 5 - 6)) * (-(a-+((
    (+(4))))) - ++4) / +(-+---((5.6 - --
    (3 * -1.8 * +(6 * +-(((-(-6) * ---+6
    )) / +--(+-+-7 * (-0 * (+(((((2)) +
    8 - 3 - ++9.0 + ---(--+7 / (1 / +++6
    .37) + (1) / 482) / +++-+0)))) * -+5
    + 7.513)))) - (+1 / ++((-84))))))))
    * ++5 / +-(--2 - -++-9.0)))) / 5 *
    --++090 ++5 / +-(--2 - -++-9.0)))) /
    5 * --++090
    :=
    :=
    |
    :=
    |
    := '(' ')'
    |
    := '*' | '+' | '-' | '/'
    :=
    |
    : [0-9]

  59. 64
    ::=
    ::= '"'
    | '['
    | '{'
    |
    | 'true'
    | 'false'
    | 'null'
    ::= +
    | + 'e' +
    ::= '+' | '-' | '.' | [0-9] | 'E' | 'e'
    ::= * '"'
    ::= ']'
    | (',')* ']'
    | ( ',' )+ (',' )* ']'
    ::= '}'
    | ( '"' ':' ',' )*
    '"' ':' '}'
    ::= ' ' | '!' | '#' | '$' | '%' | '&' | '''
    | '*' | '+' | '-' | ',' | '.' | '/' | ':' | ';'
    | '<' | '=' | '>' | '?' | '@' | '[' | ']' | '^'
    | '_', ''',| '{' | '|' | '}' | '~'
    | '[A-Za-z0-9]'
    | '\'
    ::= '"' | '/' | 'b' | 'f' | 'n' | 'r' | 't'
                stm.next()
                if expect_key:
                    raise JSONError(E_DKEY, stm, stm.pos)
                if c == '}':
                    return result
                expect_key = 1
                continue
            # parse out a key/value pair
            elif c == '"':
                key = _from_json_string(stm)
                stm.skipspaces()
                c = stm.next()
                if c != ':':
                    raise JSONError(E_COLON, stm, stm.pos)
                stm.skipspaces()
                val = _from_json_raw(stm)
                result[key] = val
                expect_key = 0
                continue
            raise JSONError(E_MALF, stm, stm.pos)
    def _from_json_raw(stm):
        while True:
            stm.skipspaces()
            c = stm.peek()
            if c == '"':
                return _from_json_string(stm)
            elif c == '{':
                return _from_json_dict(stm)
            elif c == '[':
                return _from_json_list(stm)
            elif c == 't':
                return _from_json_fixed(stm, 'true', True, E_BOOL)
            elif c == 'f':
                return _from_json_fixed(stm, 'false', False, E_BOOL)
            elif c == 'n':
                return _from_json_fixed(stm, 'null', None, E_NULL)
            elif c in NUMSTART:
                return _from_json_number(stm)
            raise JSONError(E_MALF, stm, stm.pos)
    def from_json(data):
        stm = JSONStream(data)
        return _from_json_raw(stm)
    microjson.py Recovered JSON grammar

  60.
    def json_raw(stm):
        while True:
            stm.skipspaces()
            c = stm.peek()
            if c == 't':
                return json_fixed(stm, 'true')
            elif c == 'f':
                return json_fixed(stm, 'false')
            elif c == 'n':
                return json_fixed(stm, 'null')
            elif c == '"':
                return json_string(stm)
            elif c == '{':
                return json_dict(stm)
            elif c == '[':
                return json_list(stm)
            elif c in NUMSTART:
                return json_number(stm)
            raise JSONError(E_MALF, stm, stm.pos)
    ::= 

    | 

    | 

    | 

    |
    |
    |
    ::= `"` `"`
    | `""`
    ::=
    |
    ::= `{``}`
    | `{}`
    ::= `,`
    |
    ::= `:`
    ::= `[``]`
    | `[]`
    ::= `,`
    |
    ::=
    ::=
    |
    https://github.com/phensley/microjson
    MicroJSON

  61. MicroJSON
    https://github.com/phensley/microjson
    (json_raw and its grammar, as on the previous slide)

  62. Mimid
    microjson.py and its recovered JSON grammar (as shown earlier)
    Gopinath, Mathis, and Zeller. Mining Input Grammars from Dynamic Control Flow. ESEC/FSE 2020.

  63. Instrumentation ability or source code access is not always guaranteed
    Service topology map of Uber showing hundreds of microservices (Source: Uber Engineering)

  64. Grammar Inference

  65. Grammar Inference
    Problem: Exponential Search Space
    2^n possibilities for a string of length n

  66. Grammar Inference
    Problem: Exponential Search Space
    2^n possibilities for a string of length n

  67. Grammar Inference (with examples)
    Glade Arvada
    With good examples, the problem is tractable

  68. Finding Good Examples
    Example corpus?
    (Blind spots)

  69. Key Idea: Leverage Error Feedback
    Viable Prefix (Ullman)

  70. Key Idea: Viable Prefixes
    • Differentiate incomplete and incorrect inputs
    • Solve one character at a time systematically

  71. Example Generator
    a
    [ 5
    1
    b
    ,
    }
    4 ]
    a ∉ [,],{,},",0,1,2,3,4,5,.,.
    b ∉ [,],0,1,2,3,4,5,6,7,8,9,,
    } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,,
    [51,4]

  72. Quality of Examples
    Branch Coverage Obtained (C programs)
             PrefixQ   AFL(black)
    INI      62.5      65
    CSV      65.7      68.3
    JSON     13.8      9.2
    TinyC    86.8      47.9
    MJS      28.0      19.0

  73. Quality of Examples
    Branch Coverage Obtained (C programs)
             PrefixQ   AFL(black)   AFL(gray)
    INI      62.5      65           77.5
    CSV      65.7      68.3         68.5
    JSON     13.8      9.2          22.5
    TinyC    86.8      47.9         81.6
    MJS      28.0      19.0         29.9
    Tex Crash: ]9xdy[zSf$\theta{f!;} ;i\nonfrenchspacing !$$\prec q;7O/, $\downbracefill @Pz \mathstrut{}$^: aK[X|?$47$ ,`D f$)Cg8$*

  74. Grammar Inference

  75. Grammar Inference

  76. Grammar Inference
    1984
    1993
    2014
    2022
    2019

  77. Grammar Inference (L*)
    L* (Angluin '84)
    Learner -> Teacher, membership query: w ∈ L? (answer: yes/no)
    Learner -> Teacher, equivalence query: G = L? (answer: yes/no, with a counterexample)

  78. Grammar Inference (L*)
    L* (Angluin)
    Learner
    membership: ab ?
    equivalence:
    no
    abbb
    yes
    Teacher
    ?

  79. Grammar Inference (L*)
    Learner Teacher
    w
    G = L?
    Equivalence queries are not possible in software engineering scenarios

  80. L* Teacher with PAC Guarantees
    ab, abb, bb, aaaa, bbb

  81. L* Teacher with PAC Guarantees
    ab, abb, bb, aaaa, bbb, aaa, abab

  82. L* Teacher with PAC Guarantees
    Probably Approximately Correct (Valiant '84)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−ϵ: accuracy
    1−δ: confidence
    Equivalence Query = Multiple Membership Checks
    Checks come from some sampling distribution D over A*
    We only get a PAC guarantee based on D
    Checks made in place of the i-th equivalence query:
    q_i = ⌈(1/ϵ)(ln(1/δ) + i ln 2)⌉
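
As a worked example of the bound above, the sketch below computes the number of membership checks for the i-th equivalence query and uses them as a sampled equivalence test. The function names and the shape of the sampler are assumptions; the formula is the one on the slide.

```python
import math

def checks_for_query(i, epsilon, delta):
    """q_i = ceil((1/epsilon) * (ln(1/delta) + i * ln 2))."""
    return math.ceil((1 / epsilon) * (math.log(1 / delta) + i * math.log(2)))

def pac_equivalence(hypothesis, blackbox, sample, i, epsilon, delta):
    """Approximate the i-th equivalence query with q_i membership checks.

    hypothesis and blackbox map a string to True/False; sample() draws one
    string from the distribution D. Returns a counterexample, or None if the
    two agree on every sampled string.
    """
    for _ in range(checks_for_query(i, epsilon, delta)):
        w = sample()
        if hypothesis(w) != blackbox(w):
            return w
    return None

if __name__ == '__main__':
    # With epsilon = delta = 0.1, the first equivalence query costs 30 checks.
    print(checks_for_query(1, 0.1, 0.1))
```

The guarantee holds only with respect to the distribution that `sample` draws from, which is why the choice of sampler matters so much in the later slides.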

  83. Grammar Inference (PAC-L*)
    Learner
    Prefix Oracle
    w
    Random Sampler (D)
    Blackbox Hypothesis
    w ∈ D
    L(*)
    Substituting Equivalence Queries
    ab ✓

    abb
    ✘ ✘
    bb ✓

    aaaa ✓

    bbb ✓

  84. Grammar Inference (PAC-L*)
    Learner
    Prefix Oracle
    w
    Random Sampler (D)
    w ∈ D
    L(*)
    Substituting Equivalence Queries
    Search Space

  85. Positive and Negative Examples with Prefix Queries
    a
    [ 5
    1
    b
    ,
    }
    4 ]
    a ∉ [,],{,},",0,1,2,3,4,5,.,.
    b ∉ [,],0,1,2,3,4,5,6,7,8,9,,
    } ∉ [,],0,1,2,3,4,5,6,7,8,9,0,,
    [51,4]

  86. Grammar Inference (PL*)
    Learner
    Prefix Oracle
    w
    Blackbox Hypothesis
    w ∈ B
    Yes/No
    Yes/No
    PL(*)
    w ∈ H
    Substituting Equivalence Queries

  87. Grammar Inference (PL*)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−ϵ: accuracy
    1−δ: confidence
    Relation between D, ϵ, δ and F1 score on Arithmetic (depth limited)
    [Figure: one panel for L(*) and three panels for PL(*) with Eq = Prefix Sampler at p=0.05, p=0.5, and p=1.0. Red is good, blue is bad.]

  88. Grammar Inference (PL*)
    Pr(L(A)≢X ≤ ϵ) ≥ 1−δ
    1−ϵ: accuracy
    1−δ: confidence
    Relation between D, ϵ, δ and F1 score on JSON (depth limited)
    [Figure: one panel for L(*) and three panels for PL(*) with Eq = Prefix Sampler at p=0.05, p=0.5, and p=1.0. Red is good, blue is bad.]

  89. Grammar Mining
    Blackbox Generation
    Grammar Inference