Science, University of California, Berkeley.
• Suppose we are expected to fill in the blank.
• Code completion
• Statistical deobfuscation
• Patch generation
• Translation between languages
• Suppose we are expected to fill in the blank.
• Predict by CFG:
• JavaScript is untyped
• Not every identifier makes sense
• Candidates: length? ABC? promise? ...
• Probabilistic prediction for each component in an AST
• Existing models:
• N-gram
• 2-gram: frequency of each pair: defer promise, defer notify, defer resolve, defer reject
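The 2-gram idea above can be sketched as a table of pair counts. This is a minimal illustration, not the slides' implementation: the toy corpus and the function names `bigram_model` and `predict` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def bigram_model(token_streams):
    """Count how often each token follows a given previous token."""
    counts = defaultdict(Counter)
    for tokens in token_streams:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict(counts, prev):
    """Return (token, relative frequency) pairs seen after `prev`, most frequent first."""
    total = sum(counts[prev].values())
    return [(tok, n / total) for tok, n in counts[prev].most_common()]

# Hypothetical toy corpus; the counts are made up for illustration.
corpus = [
    ["defer", "promise"],
    ["defer", "promise"],
    ["defer", "resolve"],
    ["defer", "reject"],
]
model = bigram_model(corpus)
ranked = predict(model, "defer")  # "promise" ranks first with frequency 0.5
```

The model only looks at the immediately preceding token, which is why it cannot use facts that sit elsewhere in the tree.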
• Probabilistic prediction for each component in an AST
• Existing models:
• Probabilistic CFG: learns the frequency of expansion of each production rule
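Learning "the frequency of expansion" can be sketched as a relative-frequency estimate per left-hand side. The rule observations below and the name `learn_pcfg` are hypothetical, chosen only to show the computation.

```python
from collections import Counter, defaultdict

def learn_pcfg(rule_uses):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    counts = defaultdict(Counter)
    for lhs, rhs in rule_uses:
        counts[lhs][rhs] += 1
    return {
        lhs: {rhs: n / sum(rhs_counts.values()) for rhs, n in rhs_counts.items()}
        for lhs, rhs_counts in counts.items()
    }

# Hypothetical observations of the rule Property -> <name> in a training set.
observed = [
    ("Property", "promise"),
    ("Property", "promise"),
    ("Property", "length"),
    ("Property", "resolve"),
]
pcfg = learn_pcfg(observed)
```

Note that the estimate for `Property` is the same wherever the node appears: the surrounding code plays no role.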
Pros:
• Considers frequency of production rule
Cons:
• Does not consider context info
Context: a set of facts about the surrounding nodes
• defer is a promise
• Type of statement
• …
• Considering the context
• return <obj>.<Prop>;
• How to compute the context?
• Select facts: where and which?
• Previous work relies on hard-coded heuristics
Automatically synthesize a DSL function
• Input: an AST
• Output: a conditioning context
• Result: PHOG maximizes the probability of the training dataset
Gong, Electrical Engineering & Computer Science, University of California, Berkeley.
[Formula annotations: the set of production rules; one production rule; a function that returns the context of a tree; the context]
• q returns the probability of a production rule.
• The probabilities q assigns to the production rules are normalized so that they form a distribution.
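One way to write the normalization condition is shown here. The notation is an assumption, since the slide's formula did not survive: a rule is written \alpha[\gamma] \to \beta, with \alpha the non-terminal being expanded, \gamma the conditioning context, and \beta the right-hand side.

```latex
% Assumed notation: for a fixed non-terminal \alpha and context \gamma,
% the scores of all possible expansions \beta sum to one.
\sum_{\beta} q(\alpha[\gamma] \to \beta) = 1
\qquad \text{for every non-terminal } \alpha \text{ and every context } \gamma
```

Under this reading, q defines a separate distribution over expansions for each pair of non-terminal and context.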
Given an AST subtree T, how do we calculate its probability?
• Pr(T) is the product, over every non-terminal x_i ∈ T, of q applied to the production rule used at x_i
• q gets the probability of a production rule
• A second function maps a non-terminal to the rule used
• x_i ∈ T is a non-terminal
Given an AST subtree T, how do we calculate its probability?
• x_i ∈ T is a non-terminal
Goal: find a distribution q that maximizes the probability of all trees in a training set.
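The two steps above, estimating q and multiplying it over the non-terminals of a tree, can be sketched as follows. Everything named here is an assumption for illustration: the `Node` class, the `left_sibling_value` context function, and the toy trees. For a fixed context function, relative-frequency counts are the maximum-likelihood choice of q on the training trees.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str   # non-terminal, e.g. "Property"
    value: str   # right-hand side chosen at this node
    children: list = field(default_factory=list)
    parent: "Node | None" = field(default=None, repr=False, compare=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def walk(node):
    """Yield every node of the tree in depth-first order."""
    yield node
    for child in node.children:
        yield from walk(child)

def left_sibling_value(node):
    """Hypothetical context function: the value of the node's left sibling."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    i = next(k for k, s in enumerate(siblings) if s is node)
    return siblings[i - 1].value if i > 0 else None

def learn_q(trees, context_of):
    """Relative frequency of each expansion, per (non-terminal, context)."""
    counts = defaultdict(Counter)
    for tree in trees:
        for x in walk(tree):
            counts[(x.label, context_of(x))][x.value] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }

def log_prob(tree, q, context_of):
    """log Pr(T): sum of log q over every non-terminal in the tree."""
    return sum(math.log(q[(x.label, context_of(x))][x.value]) for x in walk(tree))

def member(obj, prop):
    """Build a toy tree for the expression obj.prop."""
    root = Node("MemberExpr", "obj.prop")
    root.add(Node("Object", obj))
    root.add(Node("Property", prop))
    return root

# Hypothetical training set.
training = (
    [member("defer", "promise") for _ in range(3)]
    + [member("defer", "resolve")]
    + [member("arr", "length") for _ in range(2)]
)
q = learn_q(training, left_sibling_value)
lp = log_prob(member("defer", "promise"), q, left_sibling_value)
```

Swapping `left_sibling_value` for a different context function changes q and therefore the probability of every tree, which is what makes the choice of context worth optimizing.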
• For each production rule, there is a set of all possible contexts.
• One specific context is associated with the rule.
• x_i ∈ T is a non-terminal
Goal: find a set of contexts that maximizes the probability of all trees in a training set.
A naïve way (impractical): manually search for and hard-code all possible contexts for all possible production rules.
Synthesize a function p that generates contexts.
• p should maximize our goal
• Need to define a DSL for p
• Need a search strategy for p
• Enumerative search + genetic programming
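The enumerative half of the search can be sketched as follows: generate every short program, score each on the training data, and keep the best. This is a simplified illustration under stated assumptions. A "program" here is only a tuple of fact names to record, the samples are made up, and the length `penalty` is an assumed regularizer. Genetic programming is not shown.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Hypothetical samples: facts available around the blank, and the value that filled it.
SAMPLES = [
    ({"prev_prop": "defer", "stmt": "return"}, "promise"),
    ({"prev_prop": "defer", "stmt": "return"}, "promise"),
    ({"prev_prop": "defer", "stmt": "call"}, "resolve"),
    ({"prev_prop": "arr", "stmt": "return"}, "length"),
    ({"prev_prop": "arr", "stmt": "call"}, "push"),
]
FACTS = ["prev_prop", "stmt"]

def log_likelihood(program, samples):
    """Training log-likelihood when conditioning on the facts the program records."""
    counts = defaultdict(Counter)
    for facts, target in samples:
        counts[tuple(facts[f] for f in program)][target] += 1
    total = 0.0
    for facts, target in samples:
        ctx = tuple(facts[f] for f in program)
        total += math.log(counts[ctx][target] / sum(counts[ctx].values()))
    return total

def enumerate_programs(max_len):
    """Yield every program of length 0..max_len."""
    for n in range(max_len + 1):
        yield from product(FACTS, repeat=n)

def search(samples, max_len=2, penalty=0.1):
    """Pick the program with the best penalized log-likelihood (penalty is an assumption)."""
    return max(
        enumerate_programs(max_len),
        key=lambda p: log_likelihood(p, samples) - penalty * len(p),
    )

best = search(SAMPLES)
```

Exhaustive enumeration grows exponentially with program length, which is one reason to pair it with a heuristic search for longer programs.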
A program in the DSL is a sequence of operations. Two types of operations:
• move operations (where)
• write operations (which)
University of California, Berkeley. 46 { Previous Property, Parameter Position, API Name } So this DSL program records the context, which is: https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf
• On 150,000 JavaScript files
• De-duplicated and non-obfuscated (?)
• Learn PHOG from a training set
• Evaluate on a testing set
• Remove and predict sub-trees