
Predicting Program Properties from “Big Code”

Liang Gong
April 01, 2017

An awesome paper about the technique behind JSNice.
Presented by Liang Gong at a Berkeley group meeting.


Transcript

  1. Predicting Program Properties from “Big Code”. An explanatory tutorial assembled by Liang Gong (Electrical Engineering & Computer Science, University of California, Berkeley). Paper by Veselin Raychev, Martin Vechev, and Andreas Krause.
  2. JSNice • Predict identifier names for minified programs • Predict type annotations of variables
  3. JSNice • Predict identifier names for minified programs • Predict type annotations of variables
  4. Outline • A two-phase prediction approach • prediction phase + training phase
  5. Step 1.1: identify what to predict • local variable names (unknown property)
  6. Step 1.1: identify what to predict • local variable names (unknown property)
  7. Step 1.2: identify what is known • constants, APIs (known property)
  8. Step 1.2: identify what is known • constants, APIs (known property)
  9. Step 2: build the dependency network • Dependency network: an undirected graphical model that captures various kinds of relationships between program elements.
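
A dependency network can be thought of as a set of nodes (program elements, some with known and some with unknown properties) connected by relation-labeled edges. A minimal sketch of such a representation; the class, field, and relation names below are illustrative only, not the paper's implementation:

    # Minimal sketch of a dependency network: nodes are program elements,
    # edges carry a syntactic relation such as "L+=R" or "L<=R".
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Node:
        name: str      # current label, e.g. a minified identifier like "i"
        known: bool    # True for constants/APIs, False for local variables

    @dataclass
    class DependencyNetwork:
        edges: list = field(default_factory=list)   # (Node, Node, relation) triples

        def add(self, a, b, relation):
            self.edges.append((a, b, relation))

    # Example: the statements "i += step" and "i <= len" yield two edges.
    i, step, ln = Node("i", False), Node("step", False), Node("len", False)
    dn = DependencyNetwork()
    dn.add(i, step, "L+=R")
    dn.add(i, ln, "L<=R")
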
  10. Main Idea of Dependency Network (DN) • A structurally similar DN mined from the training code repository can help infer the unknown property in the query DN. (diagram: DN in the input program vs. DN in the training repository; nodes step, i, len)
  11. Main Idea of Dependency Network (DN) • A structurally similar DN mined from the training code repository can help infer the unknown property in the query DN. (diagram: DN in the input program vs. DN in the training repository; nodes step, i, len)
  12. Main Idea of Dependency Network (DN) • A perfectly identical DN is probably rare. (diagram: DN in the input program vs. DN in the training repository; nodes step, i, len, length; edge relations L>R, L<R, L=_.R)
  13. Main Idea of Dependency Network (DN) • A perfectly identical DN is probably rare. • Multiple similar DNs could be found. (diagram: DN in the input program vs. DNs in the training repository; nodes step, i, len, length, interval, j; edge relations L>R, L<R, L=_.R, L+=R, L<=R)
  14. Main Idea of Dependency Network (DN) • A perfectly identical DN is probably rare. • Multiple similar DNs could be found. • Need a fuzzy way to infer unknown properties. (same diagram as slide 13)
  15. MAP Inference • MAP: Maximum a Posteriori Probability • Infer the most likely properties for the nodes of the network.
  16. Problem Definition • The entire training set D contains t programs; each program x(j) has a label vector y(j). • Given a new program x, return the label vector with maximal probability. • n(x): the number of unknown properties in x.
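
The formula images are not part of the transcript; a reconstruction of the problem statement consistent with these callouts, writing Ω(x) for the set of possible label assignments to the n(x) unknown properties:

    D = \{ (x^{(j)}, y^{(j)}) \}_{j=1}^{t}
    \qquad
    y^{*} \;=\; \operatorname*{arg\,max}_{y \in \Omega(x)} \; P(y \mid x)
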
  17. Conditional Random Field (formula callout: all possible labels in x)
  18. Conditional Random Field (formula callout: all possible labels in x)
  19. Conditional Random Field (formula callouts: the score of label vector y; all possible labels in x)
  20. Conditional Random Field (formula callouts: the score of label y, based on feature fi; the score of label vector y; all possible labels in x)
  21. Conditional Random Field (formula callouts: the score of label y, based on feature fi; the score of label vector y; all possible labels in x; weights learnt through gradient descent)
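
Reading the callouts on slides 17–21 together, the scoring function has the standard log-linear CRF form. This is a reconstruction (the formulas themselves are slide images): each feature fi contributes its value times a weight wi learnt through gradient descent, and the feature scores sum into the score of the whole label vector y.

    score(y, x) \;=\; \sum_{i} w_i \, f_i(y, x)
    \qquad
    P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\big(score(y, x)\big)
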
  22. Conditional Random Field • Pick the label with maximal probability
  23. Conditional Random Field • Pick the label with maximal probability (formula callout: the normalization term is the same for all scores)
  24. Conditional Random Field • Pick the label with maximal probability
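
"Same for all scores" presumably refers to the normalization term Z(x): it does not depend on the chosen labels, so picking the label vector with maximal probability is the same as picking the one with maximal unnormalized score.

    \operatorname*{arg\,max}_{y \in \Omega(x)} P(y \mid x)
    \;=\;
    \operatorname*{arg\,max}_{y \in \Omega(x)} score(y, x)
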
  25. Define Feature Functions (formula callout: feature score of label vector y in program x)
  26. Define Feature Functions (formula callouts: feature score of label vector y in program x; the set of all edges in the dependency network of program x)
  27. Define Feature Functions (formula callouts: feature score of label vector y in program x; the set of all edges in the dependency network of program x; this function returns 1 if the pair appears in the training set, otherwise 0)
  28. Define Feature Functions (formula callouts: feature score of label vector y in program x; the set of all edges in the dependency network of program x; the unknown label vector y; this function returns 1 if the pair appears in the training set, otherwise 0)
  29. Define Feature Functions (formula callouts as in slide 28, plus: the known property vector)
  30. Define Feature Functions (formula callouts as in slide 29, plus: the label of node a in label vector y)
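
Combining the callouts on slides 25–30, the feature score of label vector y in program x sums one pairwise indicator per edge of the dependency network. A reconstruction with assumed notation (not copied from the slides): E^x is the edge set of x's dependency network, z the known property vector, (y, z)_a the label of node a taken from y or z, and each pairwise f_i returns 1 if the resulting (label, label, relation) triple appears in the training set and 0 otherwise.

    f_i(y, x) \;=\; \sum_{(a,\,b,\,rel)\,\in\,E^{x}} f_i\big((y,z)_a,\; (y,z)_b,\; rel\big)
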
  31. Extracting Features • Labeled source code (training raw input)
  32. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks
  33. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks
  34. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks • (Label, Label, relation) triples
  35. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks • (Label, Label, relation) triples
  36. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks • (Label, Label, relation) triples
  37. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks • (Label, Label, relation) triples
  38. Extracting Features • Labeled source code (training raw input) • Labeled dependency networks • (Label, Label, relation) triples
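
In other words, training data is prepared by converting each labeled program into a labeled dependency network and recording which (Label, Label, relation) triples occur. A minimal sketch of that extraction step; the function and data names are illustrative, not the authors' code:

    # Sketch: collect (label_a, label_b, relation) triples from labeled
    # dependency networks; each distinct triple becomes an indicator feature.
    from collections import Counter

    def extract_triples(dependency_networks):
        """dependency_networks: iterable of edge lists of (label_a, label_b, relation)."""
        counts = Counter()
        for edges in dependency_networks:
            for label_a, label_b, relation in edges:
                counts[(label_a, label_b, relation)] += 1
        return counts

    # Every triple observed in training defines a feature f_i whose indicator
    # fires when the same triple appears in a query network.
    training_networks = [
        [("i", "len", "L<=R"), ("i", "step", "L+=R")],
        [("j", "length", "L<=R")],
    ]
    features = extract_triples(training_networks)
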
  39. Simplify the Math • NP-hard: iterating over all possible label combinations
  40. Simplify the Math • NP-hard: iterating over all possible label combinations • Need a faster (approximate) algorithm.
  41. Practical Algorithm • NP-hard: iterating over all possible label combinations • Need a faster (approximate) algorithm. • Greedily select the best possible labels • Start with an initial assignment • Make small changes and try to improve the score
  42. Practical Algorithm • (bullets as in slide 41) (diagram: local dependency network with nodes step, j, len, length, interval and edge relations L+=R, L<=R, L=_.R)
  43. Practical Algorithm • (bullets as in slide 41) • The initial assignment assigns len
  44. Practical Algorithm • (bullets as in slide 41) • The initial assignment assigns len • Extract the local network around the node
  45. Practical Algorithm • (bullets as in slide 41) • The initial assignment assigns len • Extract the local network around the node • Calculate its current score based on the local network
  46. Practical Algorithm • (bullets as in slide 41) • The initial assignment assigns len • Extract the local network around the node • Calculate its current score based on the local network • Try other labels and calculate their scores (candidates shown: name, input)
  47. Practical Algorithm • (bullets as in slide 41; steps as in slide 46) • Pick the label with the maximal local score (the node is relabeled name)
  48. Practical Algorithm • (bullets as in slide 41; steps as in slide 46) • Pick the label with the maximal local score • Pick the next node and repeat the process
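
A compact sketch of this greedy approximate inference, under the following assumptions (all names illustrative, not the paper's implementation): assignment maps node ids to labels (known and unknown), edges is the list of (node, node, relation) triples, weights maps (label, label, relation) triples to learned weights, and candidates supplies candidate labels for a node, as described on the following slides.

    # Greedy MAP inference sketch: repeatedly relabel one unknown node at a
    # time with whichever candidate label maximizes its local score.
    def score_local(node, label, assignment, edges, weights):
        """Sum the weights of all triples touching `node` when it carries `label`."""
        total = 0.0
        for a, b, rel in edges:
            if node in (a, b):
                la = label if a == node else assignment[a]
                lb = label if b == node else assignment[b]
                total += weights.get((la, lb, rel), 0.0)
        return total

    def greedy_inference(unknown_nodes, assignment, edges, weights, candidates, passes=3):
        for _ in range(passes):                       # a few sweeps over all nodes
            for node in unknown_nodes:
                best = assignment[node]
                best_score = score_local(node, best, assignment, edges, weights)
                for label in candidates(node, assignment, edges):
                    s = score_local(node, label, assignment, edges, weights)
                    if s > best_score:
                        best, best_score = label, s
                assignment[node] = best               # keep the improving change
        return assignment
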
  49. How to Pick Candidates? • Suppose we are trying to improve the score of a node (current label: len) (diagram: local network with nodes step, j, len, length, interval and edge relations L+=R, L<=R, L=_.R)
  50. How to Pick Candidates? • Suppose we are trying to improve the score of a node (current label: len) • Get all of its adjacent nodes' labels
  51. How to Pick Candidates? • Suppose we are trying to improve the score of a node (current label: len) • Get all of its adjacent nodes' labels (diagram: adjacent labels j, length, interval)
  52. How to Pick Candidates? • (as in slide 51) (formula callouts: x, y, xi)
  53. How to Pick Candidates? • (as in slide 51) • Rank the candidate triples based on their weights (formula callouts: x, y, xi)
  54. How to Pick Candidates? • (as in slide 51) • Rank the candidate triples based on their weights • Get the top s triples (formula callouts: x, y, xi)
  55. How to Pick Candidates? • (as in slide 54)
  56. How to Pick Candidates? • (as in slide 54) • Candidates: the labels taken from the top-ranked triples (diagram callouts: X, y)
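
A sketch of a beam-style candidate function consistent with these slides: collect the adjacent nodes' labels, look up training triples that pair with them over the same relation, rank them by learned weight, and keep the labels from the top s triples. The table triples_by_context and its (neighbor_label, relation, side) key layout are assumptions for illustration only.

    # Sketch: candidate labels for one node = labels from the top-s training
    # triples that connect to the labels of the node's neighbors.
    def candidates(node, assignment, edges, triples_by_context, s=10):
        """triples_by_context maps (neighbor_label, relation, side) to a list
        of (candidate_label, weight) pairs observed in training (illustrative)."""
        scored = []
        for a, b, rel in edges:
            if a == node:
                scored += triples_by_context.get((assignment[b], rel, "left"), [])
            elif b == node:
                scored += triples_by_context.get((assignment[a], rel, "right"), [])
        scored.sort(key=lambda lw: lw[1], reverse=True)   # rank by weight
        return [label for label, _ in scored[:s]]          # keep the top s labels

Bound to a concrete triples_by_context table (e.g. via functools.partial), this plays the role of the candidates function in the greedy loop sketched above.
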
  57. How to Train the Weights? • Each feature function has a corresponding weight
  58. How to Train the Weights? • Each feature function has a corresponding weight
  59. How to Train the Weights? • Each feature function has a corresponding weight
  60. How to Train the Weights? • Each feature function has a corresponding weight • How do we obtain the weights?
  61. How to Train the Weights? • Each feature function has a corresponding weight • How do we obtain the weights? • Maximize the margin between the correct assignment and wrong assignments
  62. How to Train the Weights? • (as in slide 61)
  63. How to Train the Weights? • How do we obtain the weights? • Maximize the margin between the correct assignment and wrong assignments • Ideally: (formula callouts: score of the correct assignment; score of a wrong assignment; a distance function)
  64. How to Train the Weights? • (as in slide 63) (formula callouts: score of the wrong assignment; score of the correct assignment)
  65. How to Train the Weights? • (as in slide 63) (formula callouts: score of the wrong assignment; score of the correct assignment; a negative value)
  66. How to Train the Weights? • (as in slide 63) (formula callouts as in slide 65, plus: pick the wrong assignment with the minimal margin)
  67. How to Train the Weights? • (as in slide 63) (formula callouts as in slide 66, plus another negative value)
  68. How to Train the Weights? • (as in slide 63) • Maximize the distance between the correct assignment and the closest wrong assignment
  69. How to Train the Weights? • How do we obtain the weights? • Maximize the margin between the correct assignment and wrong assignments • Based on this reward function, run stochastic gradient descent to update the weights
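
Reading the callouts on slides 63–69 together, the objective they describe is the standard structured max-margin formulation. This is a reconstruction (the formula images are not in the transcript): find weights w so that the correct assignment y(j) outscores every wrong assignment y by at least a margin given by a distance function Δ, and minimize the resulting loss (which picks out the closest, most-violating wrong assignment) with stochastic (sub)gradient descent.

    \forall y \neq y^{(j)}:\quad
      score_w\big(y^{(j)}, x^{(j)}\big) \;\ge\; score_w\big(y, x^{(j)}\big) + \Delta\big(y^{(j)}, y\big)

    \ell_j(w) \;=\; \max_{y}\Big[\, score_w\big(y, x^{(j)}\big) + \Delta\big(y^{(j)}, y\big) \Big]
               \;-\; score_w\big(y^{(j)}, x^{(j)}\big)
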
  70. Evaluation • 10,517 JavaScript projects from GitHub • Sampled the 50 projects with the highest number of commits • Training: 324,501 files • Prediction: 3,710 files (minified with UglifyJS) • 10-fold cross-validation
  71. Manual Type Annotations vs. Predicted Annotations • JS developers typically do not type-check their annotations (misspelled, missing, or conflicting annotations)
  72. Parameter Trade-off • precision and runtime vs. the beam-search parameter s
  73. (chart slide: precision and runtime vs. the beam-search parameter s)
  74. Conditional Random Field • Normalize the score to get the probability (formula callouts: the score of label y, based on feature fi; the score of label y; the set of all possible labels in x)
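
The normalization the slide refers to is the usual CRF partition function; reconstructed here rather than copied from the slide image, with Ω(x) the set of all possible label assignments for x:

    P(y \mid x) \;=\; \frac{\exp\big(score(y, x)\big)}{Z(x)},
    \qquad
    Z(x) \;=\; \sum_{y' \in \Omega(x)} \exp\big(score(y', x)\big)
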
  75. Outline • An approach for predicting program properties using Conditional Random Fields (CRFs) • CRF
  76. What is a Conditional Random Field? • Something similar to Markov networks • What is a Markov network? • https://www.youtube.com/watch?v=2BXoj778YU8 • A CRF is a probabilistic framework for labeling and segmenting sequential data. • A CRF can be viewed as an undirected graphical model.