Slide 1

Slide 1 text

PHOG: Probabilistic Model for Code
Pavol Bielik, Veselin Raychev, and Martin Vechev
Presented by Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.

Slide 2

Slide 2 text

Motivation
Background:
• Probabilistic models are used for:
  • Code completion
  • Statistical deobfuscation
  • Patch generation
  • Translation between languages
Problem:
• Existing models (PCFGs, n-grams):
  • Not precise
  • Limited applicability

Slide 3

Slide 3 text

Probabilistic Model for Code
• Suppose we are asked to fill in the blank (?).
• Code completion
• Statistical deobfuscation
• Patch generation
• Translation between languages

Slide 4

Slide 4 text

Probabilistic Model for Code
• Suppose we are asked to fill in the blank (?).
• Predicting with a CFG:
  • JavaScript is untyped

Slide 5

Slide 5 text

Probabilistic Model for Code
• Suppose we are asked to fill in the blank (?).
• Predicting with a CFG:
  • JavaScript is untyped
  • Not all identifiers make sense: length? ABC? promise? ...

Slide 6

Slide 6 text

Probabilistic Model for Code
• Probabilistic prediction for each component in an AST
• Existing models:
  • N-gram: a 2-gram model learns the frequency of combinations such as defer promise, defer notify, defer resolve, defer reject (see the sketch below)
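To make the 2-gram idea concrete, here is a minimal sketch in Python (my illustration, not the authors' implementation); the tiny corpus of object/property pairs and the defer.* names are hypothetical, echoing the example on the slide:

# Bigram (2-gram) model: count how often each token follows the previous one.
from collections import Counter, defaultdict

def train_bigrams(token_sequences):
    counts = defaultdict(Counter)
    for tokens in token_sequences:
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def predict_next(counts, prev, k=3):
    # Rank candidate completions by their frequency after `prev`.
    return [tok for tok, _ in counts[prev].most_common(k)]

# Hypothetical training data: member accesses flattened into (object, property) pairs.
corpus = [["defer", "promise"], ["defer", "resolve"], ["defer", "promise"],
          ["defer", "notify"], ["defer", "reject"]]
model = train_bigrams(corpus)
print(predict_next(model, "defer"))  # e.g. ['promise', 'resolve', 'notify']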

Slide 7

Slide 7 text

Probabilistic Model for Code
• Probabilistic prediction for each component in an AST
• Existing models:
  • Probabilistic CFG: learns the frequency with which each production rule is expanded (see the sketch below)
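As a rough sketch of how such a PCFG can be trained (my own illustration, assuming a maximum-likelihood estimate from rule counts), each rule's probability is its count divided by the count of its left-hand-side non-terminal:

# P(A -> beta) = count(A -> beta) / count(A), estimated from training ASTs.
from collections import Counter

def estimate_pcfg(rules_used):
    # rules_used: list of (lhs, rhs) pairs observed while parsing the training set.
    rules_used = list(rules_used)
    rule_counts = Counter(rules_used)
    lhs_counts = Counter(lhs for lhs, _ in rules_used)
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Hypothetical observations from JavaScript ASTs.
observed = [("Expr", ("Call",)), ("Expr", ("Member",)), ("Expr", ("Member",))]
print(estimate_pcfg(observed))  # Member gets 2/3, Call gets 1/3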

Slide 8

Slide 8 text

Probabilistic Model for Code
• Probabilistic prediction for each component in an AST
• Existing models:
  • Probabilistic CFG: learns the frequency with which each production rule is expanded

Slide 9

Slide 9 text

Probabilistic Model for Code
• Probabilistic prediction for each component in an AST
• Existing models:
  • Probabilistic CFG: learns the frequency with which each production rule is expanded

Slide 10

Slide 10 text

Probabilistic Context-Free Grammar
Pros:
• Considers the frequency of production rules
Cons:
• Does not consider context info

Slide 11

Slide 11 text

Probabilistic Context-Free Grammar
Pros:
• Considers the frequency of production rules
Cons:
• Does not consider context info
Context: a set of facts about the surrounding nodes

Slide 12

Slide 12 text

Probabilistic Context-Free Grammar
Pros:
• Considers the frequency of production rules
Cons:
• Does not consider context info
Context: a set of facts about the surrounding nodes
• defer is a promise
• Type of statement
…

Slide 13

Slide 13 text

PHOG: Probabilistic Higher Order Grammar
• Considering the context
• return .;

Slide 14

Slide 14 text

PHOG: Probabilistic Higher Order Grammar
• Considering the context
• return .;

Slide 15

Slide 15 text

PHOG: Probabilistic Higher Order Grammar
• Considering the context
• return .;
• How do we compute the context?
  • Select facts: where and which?
  • Previous work is based on hard-coded heuristics

Slide 16

Slide 16 text

PHOG: Probabilistic Higher Order Grammar
Automatically synthesize a DSL function:
• Input: an AST
• Output: a conditioning context
• Result: a PHOG that maximizes the probability of the training dataset

Slide 17

Slide 17 text

PHOG: Probabilistic Higher Order Grammar
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 18

Slide 18 text

PHOG: Probabilistic Higher Order Grammar

Slide 19

Slide 19 text

PHOG: Probabilistic Higher Order Grammar
non-terminals

Slide 20

Slide 20 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals

Slide 21

Slide 21 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals, start symbol

Slide 22

Slide 22 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals, start symbol, Context

Slide 23

Slide 23 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals, start symbol, production rules, Context

Slide 24

Slide 24 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals, start symbol, production rules, Context
One production rule:

Slide 25

Slide 25 text

PHOG: Probabilistic Higher Order Grammar
non-terminals, terminals, start symbol, production rules, Context
One production rule:
return context of trees

Slide 26

Slide 26 text

PHOG: Probabilistic Higher Order Grammar
q returns the probability of a production rule.

Slide 27

Slide 27 text

PHOG: Probabilistic Higher Order Grammar
q returns the probability of a production rule.
Normalize the distribution over all production rules (a sketch follows):
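A minimal sketch of that normalization (mine, not the paper's code): q is estimated from counts of (non-terminal, rule, context) triples and normalized so that, for each non-terminal and each context, the probabilities of the applicable rules sum to 1. The triple encoding is an assumption made for illustration.

from collections import Counter, defaultdict

def estimate_q(observations):
    # observations: list of (lhs, rule, context) triples collected from training trees.
    counts = Counter(observations)
    totals = defaultdict(int)
    for (lhs, rule, ctx), c in counts.items():
        totals[(lhs, ctx)] += c
    # q(rule | lhs, context) = count / total count for that lhs and context.
    return {(lhs, rule, ctx): c / totals[(lhs, ctx)]
            for (lhs, rule, ctx), c in counts.items()}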

Slide 28

Slide 28 text

PHOG: Probabilistic Higher Order Grammar
Given an AST subtree T, how do we calculate its probability?
• q gets the probability of a production rule
• a map takes each non-terminal xi in T to the rule used to expand it
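Read this way, the probability of T is the product of q over its non-terminals: Pr(T) = q(rule at x1 | context of x1) * ... * q(rule at xn | context of xn). A small sketch under that reading (the argument names are mine):

import math

def tree_log_prob(nonterminals, rule_at, context_of, q):
    # q(rule, context) returns the probability of applying `rule` given `context`.
    # Summing logs over every non-terminal xi avoids underflow on large trees.
    return sum(math.log(q(rule_at[x], context_of[x])) for x in nonterminals)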

Slide 29

Slide 29 text

PHOG: Probabilistic Higher Order Grammar
Given an AST subtree T, how do we calculate its probability? (xi is a non-terminal in T)
Goal: find a distribution q that maximizes the probability of all trees in a training set.

Slide 30

Slide 30 text

TCond DSL to PHOG
For each production rule, there is a set of all possible contexts.
One specific context associated with

Slide 31

Slide 31 text

PHOG: Probabilistic Higher Order Grammar
(xi is a non-terminal in T)
Goal: find a set of contexts that maximizes the probability of all trees in a training set

Slide 32

Slide 32 text

PHOG: Probabilistic Higher Order Grammar
Goal: find a set of contexts that maximizes the probability of all trees in a training set
A naïve way (impractical): manually search for and hard-code all possible contexts for all possible production rules

Slide 33

Slide 33 text

PHOG: Probabilistic Higher Order Grammar
Goal: find a set of contexts that maximizes the probability of all trees in a training set
Instead: synthesize a function p that generates contexts; p should maximize our goal

Slide 34

Slide 34 text

PHOG: Probabilistic Higher Order Grammar
Synthesize a function p that generates contexts; p should maximize our goal.
• We need to define a DSL for p
• We need a search strategy for p
  • Enumerative search + genetic programming (see the sketch below)
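A very rough sketch of the search (my simplification): candidate TCond programs are enumerated up to a length bound and scored by the training-set log-likelihood of the PHOG each one induces. The scoring callback and the length bound are assumptions here, and the genetic-programming refinement from the paper is omitted.

from itertools import product

OPS = ["Up", "Left", "WriteValue", "WritePos"]  # operations named on the slides

def enumerate_programs(max_len):
    # Enumerative search: every sequence of DSL operations up to max_len.
    for n in range(1, max_len + 1):
        yield from product(OPS, repeat=n)

def best_program(training_trees, score, max_len=3):
    # score(program, trees) is assumed to train a PHOG conditioned on the
    # contexts the program produces and return the training log-likelihood.
    return max(enumerate_programs(max_len), key=lambda p: score(p, training_trees))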

Slide 35

Slide 35 text

DSL for Context
Context: a set of facts about the surrounding nodes
DSL program: selects facts, where and which?

Slide 36

Slide 36 text

DSL for Context
A DSL for traversing sub-trees and accumulating context with values from the tree during the traversal

Slide 37

Slide 37 text

DSL for Context
A program in the DSL is a sequence of operations.

Slide 38

Slide 38 text

DSL for Context
A program in the DSL is a sequence of operations.
Two types of operations:
• move operations (where)
• write operations (which)
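To make the two kinds of operations concrete, here is a minimal interpreter sketch (my own, with a simplified node type); the operation names Left, WriteValue, Up, and WritePos follow the walkthrough on the next slides:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # identity comparison, so node lookups don't recurse through parents
class Node:
    value: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def run_tcond(program, node):
    # Move ops change the current node; write ops append a fact about it.
    context = []
    for op in program:
        if op == "Up" and node.parent is not None:
            node = node.parent                                  # move to parent
        elif op == "Left" and node.parent is not None:
            siblings = node.parent.children
            node = siblings[max(siblings.index(node) - 1, 0)]   # move to left sibling
        elif op == "WriteValue":
            context.append(node.value)                          # record the node's value
        elif op == "WritePos" and node.parent is not None:
            context.append(node.parent.children.index(node))    # record position in parent
    return tuple(context)

Running a program such as ("Left", "WriteValue", "Up", "WritePos") from the position being predicted accumulates facts like the previous property and the parameter position, which is the kind of context the following slides walk through.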

Slide 39

Slide 39 text

DSL for Context
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 40

Slide 40 text

DSL for Context
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 41

Slide 41 text

DSL for Context
Start from the position to be predicted.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 42

Slide 42 text

DSL for Context
Move to the left sibling node.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 43

Slide 43 text

DSL for Context
Record the value of the current node.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 44

Slide 44 text

DSL for Context
Move up to the parent node, which is the block statement.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 45

Slide 45 text

DSL for Context
Record the position of the current node in its parent.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 46

Slide 46 text

DSL for Context
So this DSL program records the context, which is:
{ Previous Property, Parameter Position, API Name }
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 47

Slide 47 text

Evaluation
• On 150,000 JavaScript files
• De-duplicated and non-obfuscated (?)
• Learn a PHOG from a training set
• Evaluate on a testing set
• Remove and predict sub-trees

Slide 48

Slide 48 text

Evaluation
Lower is better.
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 49

Slide 49 text

Evaluation
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf

Slide 50

Slide 50 text

Evaluation
https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf