PHOG: Probabilistic Model for Code

Liang Gong

May 06, 2018
Transcript

  1. Presented by Liang Gong. PHOG: Probabilistic Model for Code, by Pavol Bielik, Veselin Raychev, and Martin Vechev. Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.
  2. Motivation. Background: • Probabilistic models are used for: • Code completion • Statistical deobfuscation • Patch generation • Translation between languages. Problem: • Existing models (PCFGs, n-grams): • Not precise • Limited applicability
  3. Probabilistic Model for Code. • Suppose we are expected to fill in the blank. • Code completion • Statistical deobfuscation • Patch generation • Translation between languages
  4. Probabilistic Model for Code. • Suppose we are expected to fill in the blank. • Predict by CFG: • JavaScript is untyped
  5. Probabilistic Model for Code. • Suppose we are expected to fill in the blank. • Predict by CFG: • JavaScript is untyped • Not every identifier makes sense: length? ABC? promise? ...
  6. Probabilistic Model for Code. • Probabilistic prediction for each component in an AST • Existing models: • N-gram: a 2-gram records the frequency of combinations such as defer promise, defer notify, defer resolve, defer reject
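The 2-gram idea above can be made concrete with a small sketch. The training pairs and their counts below are invented for illustration (loosely following the deck's defer example); they are not the paper's data:

```javascript
// Minimal 2-gram (bigram) sketch: count (object, member) pairs seen in
// training code, then predict the member that most often follows an object.
const trainingPairs = [
  ["defer", "promise"], ["defer", "promise"], ["defer", "promise"],
  ["defer", "notify"], ["defer", "resolve"], ["defer", "reject"],
];

// counts["defer"]["promise"] -> 3, etc.
const counts = {};
for (const [obj, member] of trainingPairs) {
  counts[obj] = counts[obj] || {};
  counts[obj][member] = (counts[obj][member] || 0) + 1;
}

// Predict the most frequent member following `obj`.
function predictMember(obj) {
  const members = counts[obj] || {};
  let best = null;
  for (const m of Object.keys(members)) {
    if (best === null || members[m] > members[best]) best = m;
  }
  return best;
}

// Conditional probability Pr(member | obj) under the bigram model.
function bigramProb(obj, member) {
  const members = counts[obj] || {};
  const total = Object.values(members).reduce((a, b) => a + b, 0);
  return total === 0 ? 0 : (members[member] || 0) / total;
}
```

With these counts, `predictMember("defer")` returns `"promise"` (3 of the 6 observed pairs), which is the kind of prediction an n-gram model makes without looking at any tree structure.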
  7. Probabilistic Model for Code. • Probabilistic prediction for each component in an AST • Existing models: • Probabilistic CFG: learns the frequency of expansion of each production rule
  10. Probabilistic Context-Free Grammar. Pros: • Considers the frequency of each production rule. Cons: • Does not consider context information.
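The rule-frequency learning that a PCFG performs can be sketched as relative-frequency counting. The grammar rules and counts below are invented for illustration, not taken from the paper:

```javascript
// PCFG sketch: the probability of a production rule A -> rhs is estimated as
//   count(A -> rhs) / count(all expansions of A).
const trainingRules = [
  { lhs: "Expr", rhs: "Expr + Expr" },
  { lhs: "Expr", rhs: "Expr + Expr" },
  { lhs: "Expr", rhs: "Ident" },
  { lhs: "Expr", rhs: "Ident" },
  { lhs: "Expr", rhs: "Ident" },
  { lhs: "Expr", rhs: "Literal" },
];

const ruleCounts = {};  // occurrences of each (lhs, rhs) pair
const lhsCounts = {};   // total expansions of each left-hand side
for (const { lhs, rhs } of trainingRules) {
  const key = lhs + " -> " + rhs;
  ruleCounts[key] = (ruleCounts[key] || 0) + 1;
  lhsCounts[lhs] = (lhsCounts[lhs] || 0) + 1;
}

// Pr(lhs -> rhs): relative frequency among all expansions of lhs.
function ruleProb(lhs, rhs) {
  const c = ruleCounts[lhs + " -> " + rhs] || 0;
  return lhsCounts[lhs] ? c / lhsCounts[lhs] : 0;
}
```

Note that the probability here is conditioned only on the left-hand non-terminal; that is exactly the "does not consider context" limitation the slide points out, and the part PHOG replaces with a learned conditioning context.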
  11. Probabilistic Context-Free Grammar. Pros: • Considers the frequency of each production rule. Cons: • Does not consider context information. Context: a set of facts about the surrounding nodes.
  12. Probabilistic Context-Free Grammar. Pros: • Considers the frequency of each production rule. Cons: • Does not consider context information. Context: a set of facts about the surrounding nodes, e.g.: • defer is a promise • The type of the statement …
  13. PHOG: Probabilistic Higher Order Grammar. • Considering the context: • return <obj>.<Prop>;
  15. PHOG: Probabilistic Higher Order Grammar. • Considering the context: • return <obj>.<Prop>; • How to compute the context? • Select facts: where and which? • Previous work is based on hard-coded heuristics.
  16. PHOG: Probabilistic Higher Order Grammar. Automatically synthesize a DSL function. • Input: an AST • Output: a conditioning context • Result: PHOG maximizes the probability of the training dataset
  17. PHOG: Probabilistic Higher Order Grammar. (Figure; source: https://www.srl.inf.ethz.ch/slides/ICML16_PHOG.pdf)
  18. PHOG: Probabilistic Higher Order Grammar. (Figure)
  19. PHOG: Probabilistic Higher Order Grammar (non-terminals)
  20. PHOG: Probabilistic Higher Order Grammar (non-terminals, terminals)
  21. PHOG: Probabilistic Higher Order Grammar (non-terminals, terminals, start symbol)
  22. PHOG: Probabilistic Higher Order Grammar (non-terminals, terminals, start symbol, Context)
  23. PHOG: Probabilistic Higher Order Grammar (non-terminals, terminals, start symbol, Context, production rules)
  24. PHOG: Probabilistic Higher Order Grammar (as above, zooming in on one production rule)
  25. PHOG: Probabilistic Higher Order Grammar (one production rule; the function that returns the context of trees)
  26. PHOG: Probabilistic Higher Order Grammar. q returns the probability of a production rule.
  27. PHOG: Probabilistic Higher Order Grammar. q returns the probability of a production rule. Normalize the distribution over all production rules:
  28. PHOG: Probabilistic Higher Order Grammar. Given an AST tree T, how to calculate its probability? For each non-terminal xi in T, a function maps xi to the rule used to expand it, and q gives that rule's probability.
  29. PHOG: Probabilistic Higher Order Grammar. Goal: find a distribution q that maximizes the probability of all trees T (with non-terminals xi) in a training set.
  30. TCond DSL to PHOG. For each production rule, there is a set of all possible contexts. One specific context is associated with each expansion of the rule.
  31. PHOG: Probabilistic Higher Order Grammar. Goal: find a set of contexts that maximizes the probability of all trees in a training set.
  32. PHOG: Probabilistic Higher Order Grammar. Goal: find a set of contexts that maximizes the probability of all trees in a training set. A naïve (impractical) way: manually search for and hard-code all possible contexts for all possible production rules.
  33. PHOG: Probabilistic Higher Order Grammar. Goal: find a set of contexts that maximizes the probability of all trees in a training set. Instead: synthesize a function p that generates contexts; p should maximize our goal.
  34. PHOG: Probabilistic Higher Order Grammar. Synthesize a function p that generates contexts; p should maximize our goal. • Need to define a DSL for p • Need a search strategy for p: enumerative search + genetic programming
  35. DSL for Context. Context: a set of facts about the surrounding nodes. A DSL program selects facts: where and which?
  36. DSL for Context. A DSL for traversing sub-trees and accumulating context with values from the tree during the traversal.
  37. DSL for Context. A program in the DSL is a sequence of operations.
  38. DSL for Context. A program in the DSL is a sequence of operations. Two types of operations: • move operations (where) • write operations (which)
  39. DSL for Context. (Figure)
  40. DSL for Context. (Figure)
  41. DSL for Context. Start from the position to be predicted.
  42. DSL for Context. Move to the left sibling node.
  43. DSL for Context. Record the value of the current node.
  44. DSL for Context. Move up to the parent node, which is the block statement.
  45. DSL for Context. Record the position of the current node in its parent.
  46. DSL for Context. So this DSL program records the context: { Previous Property, Parameter Position, API Name }
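The walk-through above can be sketched as a tiny interpreter for a TCond-style program. The node representation and the operation names (Left, Up, WriteValue, WritePos) are simplified assumptions for illustration, not the paper's exact DSL:

```javascript
// Tiny sketch of a TCond-style program: a sequence of move operations
// (where to look) and write operations (which facts to record).
// Node shape is a simplified assumption: { value, parent, children }.
function makeNode(value, children = []) {
  const node = { value, parent: null, children };
  for (const c of children) c.parent = node;
  return node;
}

// Interpret a program as operations on a cursor, accumulating a context.
function runTCond(program, start) {
  let cur = start;
  const context = [];
  for (const op of program) {
    if (op === "Left") {                 // move to the left sibling
      const sibs = cur.parent.children;
      cur = sibs[sibs.indexOf(cur) - 1];
    } else if (op === "Up") {            // move up to the parent node
      cur = cur.parent;
    } else if (op === "WriteValue") {    // record the current node's value
      context.push(cur.value);
    } else if (op === "WritePos") {      // record position within the parent
      context.push(cur.parent.children.indexOf(cur));
    }
  }
  return context;
}

// Illustrative tree: a function whose block contains a known node
// and the hole we want to predict.
const prev = makeNode("timeout");
const hole = makeNode("?");
const block = makeNode("Block", [prev, hole]);
const fn = makeNode("Function", [block]);

// "Move left, record its value; move up, record the position in the parent."
const ctx = runTCond(["Left", "WriteValue", "Up", "WritePos"], hole);
// ctx is ["timeout", 0]: the left sibling's value, then the block's
// position within its parent.
```

In the paper, the synthesized program produces contexts like { Previous Property, Parameter Position, API Name }; this sketch only shows the move/write mechanics such a program is built from.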
  47. Evaluation. • On 150,000 JavaScript files • De-duplicated and non-obfuscated (?) • Learn PHOG from a training set • Evaluate on a testing set • Remove and predict sub-trees
  48. Evaluation. Lower is better.
  49. Evaluation. (Figure)
  50. Evaluation. (Figure)