Slide 1

Predicting Program Properties from “Big Code”
Veselin Raychev, Martin Vechev, and Andreas Krause
An explanatory tutorial assembled by Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.

Slide 2

JSNice
• Predict identifier names for minified programs
• Predict type annotations of variables

Slide 4

Outline
• A two-phase prediction approach
  • prediction phase + training phase

Slide 5

Step-1.1: identify what to predict
• local variable names (unknown property)

Slide 7

Step-1.2: identify what is known
• Constants, APIs (known property)

Slide 9

Step-2: build the dependency network
• Dependency network: an undirected graphical model that captures various kinds of relationships between program elements.
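To make the idea concrete, here is a minimal sketch of how such a dependency network could be represented. This is not JSNice's actual data structure; the node and relation names are illustrative. Unknown nodes are the minified local variables whose names we want to predict; known nodes (constants, API names) keep their labels.

```python
# Nodes: program elements, flagged as known or unknown (to be predicted).
nodes = {
    "v1": {"label": None, "known": False},          # minified local variable
    "v2": {"label": None, "known": False},          # minified local variable
    "length": {"label": "length", "known": True},   # known API/property name
}

# Undirected edges, each tagged with the syntactic relation that links
# the two program elements (relation names here are illustrative).
edges = [
    ("v1", "v2", "L<R"),       # v1 is compared against v2
    ("v2", "length", "L=_.R"), # v2 is assigned from a .length access
]

# The prediction task: assign labels to the unknown nodes.
unknown = [n for n, d in nodes.items() if not d["known"]]
print(unknown)  # → ['v1', 'v2']
```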

Slide 10

Main Idea of Dependency Network (DN)
A structurally similar DN mined from the training code repository can help infer the unknown property in the query DN.
[diagram: the DN in the input program next to a DN in the training repository; nodes step, i, len]

Slide 12

Main Idea of Dependency Network (DN)
[slides 12–14 build up these points:]
• A perfectly identical DN is probably rare.
• Multiple similar DNs could be found.
• We need a fuzzy way to infer the unknown properties.
[diagram: the DN in the input program next to DNs in the training repository; nodes step, i, len, length; relation L>R]

Slide 15

MAP Inference
MAP: Maximum a Posteriori probability.
Infer the most likely properties for the nodes of the network.

Slide 16

Problem Definition
The entire training set D contains t programs. Each program x(j) has a vector of labels y(j). Given a program x to be predicted, return the label vector with the maximal probability. n(x) denotes the number of unknown properties in x.
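The MAP query above can be sketched as a brute-force search: enumerate every assignment of candidate labels to the n(x) unknown nodes and keep the highest-scoring one. The `score` function here is a toy stand-in for the CRF score defined in the following slides.

```python
from itertools import product

def map_inference(unknown_nodes, candidate_labels, score):
    """Return the label assignment with the maximal score (exhaustive)."""
    best, best_score = None, float("-inf")
    for labels in product(candidate_labels, repeat=len(unknown_nodes)):
        assignment = dict(zip(unknown_nodes, labels))
        s = score(assignment)
        if s > best_score:
            best, best_score = assignment, s
    return best

# Toy score: reward assignments where the two nodes get different labels
# and the second node is named 'len'.
toy = map_inference(
    ["a", "b"], ["i", "len"],
    lambda y: (y["a"] != y["b"]) + (y["b"] == "len"),
)
print(toy)  # → {'a': 'i', 'b': 'len'}
```

Enumerating all |labels|^n(x) assignments is exponential, which is why the deck later switches to an approximate greedy algorithm.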

Slide 17

Conditional Random Field
[slides 17–23 annotate the (image-only) CRF formula; the callouts read:]
• the score of label vector y: score(y, x) = sum_i w_i · f_i(y, x)
• f_i(y, x): the score of label y, based on feature f_i
• the probability ranges over all possible labels in x
• the weights w_i are learnt through gradient descent
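The score on these slides is a weighted sum of feature functions. A tiny sketch under that assumption (the features and weights below are illustrative toys, not the paper's):

```python
def crf_score(y, x, features, weights):
    """score(y, x) = sum_i w_i * f_i(y, x)."""
    return sum(w * f(y, x) for f, w in zip(features, weights))

# Two toy pairwise features over a single edge between nodes a and b.
features = [
    lambda y, x: 1.0 if (y["a"], y["b"]) == ("i", "len") else 0.0,
    lambda y, x: 1.0 if y["a"] == y["b"] else 0.0,  # penalize equal labels
]
weights = [2.0, -1.0]  # learnt via gradient descent in the paper

# Prediction only needs the score: the normalization constant is the
# same for every candidate, so argmax of probability = argmax of score.
candidates = [{"a": "i", "b": "len"}, {"a": "i", "b": "i"}]
best = max(candidates, key=lambda y: crf_score(y, None, features, weights))
print(best)  # → {'a': 'i', 'b': 'len'}
```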

Slide 24

Conditional Random Field
[slides 24–30 show the prediction step; the callouts read:]
• Pick the label with the maximal probability
• The normalization term is the same for all scores, so maximizing the probability is the same as maximizing the score

Slide 31

Define Feature Functions
[slides 31–37 annotate the (image-only) feature definition; the callouts read:]
• the feature score of label vector y in program x
• E^x: the set of all edges in the dependency network of program x
• the unknown label vector y and the known property vector
• (y)_a: the label of node a in label vector y
• the pairwise function returns 1 if this (label, label, relation) pair appears in the training set, and 0 otherwise
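The pairwise indicator feature described above can be sketched as a set-membership test. The `training_triples` set here is a hypothetical stand-in for the (label, label, relation) triples mined from the training repository.

```python
# Hypothetical triples mined from training code (illustrative only).
training_triples = {
    ("i", "len", "L<R"),
    ("len", "length", "L=_.R"),
}

def pairwise_feature(label_a, label_b, rel):
    """Return 1 if this (label, label, relation) triple was seen in
    the training set, otherwise 0."""
    return 1 if (label_a, label_b, rel) in training_triples else 0

print(pairwise_feature("i", "len", "L<R"))  # → 1
print(pairwise_feature("i", "foo", "L<R"))  # → 0
```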

Slide 38

Extracting Features
[slides 38–45 walk through the extraction pipeline:]
• Labeled source code (the raw training input)
• Labeled dependency networks built from it
• (Label, Label, relation) triples mined from the networks
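The extraction step above can be sketched as a counting pass over the labeled training networks: every edge contributes one (label, label, relation) triple. The network representation below is an assumption for illustration, not JSNice's actual format.

```python
from collections import Counter

def extract_triples(networks):
    """Count (label, label, relation) triples across all labeled
    dependency networks; one (labels, edges) pair per program."""
    counts = Counter()
    for labels, edges in networks:
        for a, b, rel in edges:
            counts[(labels[a], labels[b], rel)] += 1
    return counts

# Two toy labeled networks from two training programs.
nets = [
    ({"n1": "i", "n2": "len"}, [("n1", "n2", "L<R")]),
    ({"n1": "j", "n2": "len"}, [("n1", "n2", "L<R")]),
]
triples = extract_triples(nets)
print(triples[("i", "len", "L<R")])  # → 1
```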

Slide 46

Simplify the Math
• Exact inference is NP-hard: it iterates over all possible label combinations
• We need a faster (approximate) algorithm.

Slide 53

Practical Algorithm
• Exact inference is NP-hard, so we need a faster (approximate) algorithm.
• Greedily select the best possible labels:
  • Start with an initial assignment (here, it assigns len)
  • Extract the local network around the node being re-labeled
  • Calculate the node's current score based on the local network
  • Try other labels and calculate their scores
  • Pick the label with the maximal local score
  • Pick the next node and repeat the process
[diagram: a node re-labeled from len to name; neighbors step, j, length, interval; relations L+=R, L<=R, L=_.R]
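The greedy loop above can be sketched as coordinate ascent: repeatedly re-label one node at a time, keeping whichever candidate label maximizes the score of that node's local network. `candidates_for` and `local_score` are stand-ins for the candidate-selection and scoring steps described on these slides.

```python
def greedy_inference(unknown_nodes, candidates_for, local_score,
                     assignment, passes=3):
    """Iteratively improve one node's label at a time (approximate MAP)."""
    for _ in range(passes):
        for node in unknown_nodes:
            best = max(
                candidates_for(node, assignment),
                key=lambda lab: local_score(node, lab, assignment),
            )
            assignment[node] = best   # keep the best local label
    return assignment

# Toy example: one unknown node; 'length' scores higher than 'len'.
result = greedy_inference(
    ["v"],
    lambda node, a: ["len", "length"],
    lambda node, lab, a: {"len": 1.0, "length": 2.0}[lab],
    {"v": "len"},
)
print(result)  # → {'v': 'length'}
```

Like any greedy search, this can stop at a local optimum; it trades exactness for speed.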

Slide 61

How to Pick Candidates?
[slides 61–68 build up the candidate-selection step:]
• Suppose we are trying to improve the score of a node (current label: len)
• Get the labels of all its adjacent nodes
• Rank the matching (x, y, relation) triples from the training data based on their weights
• Keep the top s triples; the labels they propose form the candidate set
[diagram: node len with neighbors j, length, interval; relations L+=R, L<=R, L=_.R; a table of triples ranked by weight]
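The candidate-selection step above can be sketched as follows: for each neighbor's (label, relation) pair, collect the training triples that match it, rank them by weight, and keep the labels proposed by the top s. The triples and weights here are illustrative.

```python
def candidate_labels(neighbors, weighted_triples, s=2):
    """neighbors: list of (neighbor_label, relation) pairs.
    weighted_triples: {(cand_label, nb_label, relation): weight}.
    Returns the candidate labels from the top-s matching triples."""
    scored = []
    for nb_label, rel in neighbors:
        for (cand, nb, r), w in weighted_triples.items():
            if nb == nb_label and r == rel:   # triple matches this edge
                scored.append((w, cand))
    scored.sort(reverse=True)                  # highest weight first
    return [label for _, label in scored[:s]]

triples = {
    ("len", "length", "L=_.R"): 3.0,
    ("size", "length", "L=_.R"): 2.0,
    ("i", "length", "L=_.R"): 0.5,
}
print(candidate_labels([("length", "L=_.R")], triples))  # → ['len', 'size']
```

The parameter s is the beam-search knob evaluated later in the deck: larger s means more candidates per node, hence higher precision but longer runtime.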

Slide 69

How to Train the Weights?
• Each feature function has a corresponding weight
• How do we obtain the weights?
• Maximize the margin between the correct assignment and wrong assignments

Slide 75

How to Train the Weights?
• Maximize the margin between the correct assignment and wrong assignments.
[slides 75–81 annotate the (image-only) objective; the callouts read:]
• Ideally, the score of the correct assignment exceeds the score of every wrong assignment by at least a distance function between the two
• The difference "score of wrong assignment minus score of correct assignment" is a negative value
• Pick the wrong assignment with the minimal margin, i.e., maximize the distance between the correct assignment and the closest wrong assignment
• Based on this reward function, do stochastic gradient descent to update the weights
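A single training step under this objective can be sketched as a structured max-margin update: if the closest wrong assignment scores within the margin of the correct one, nudge the weights toward the correct assignment's features and away from the wrong one's. This is a generic SSVM-style sketch, not the paper's exact update rule; the feature vectors are illustrative.

```python
def sgd_step(w, feats_correct, feats_wrong, margin=1.0, lr=0.1):
    """One stochastic step: w, feats_*: equal-length lists of floats."""
    score_c = sum(wi * fi for wi, fi in zip(w, feats_correct))
    score_w = sum(wi * fi for wi, fi in zip(w, feats_wrong))
    if score_c - score_w < margin:            # margin violated
        w = [wi + lr * (fc - fw)              # move toward correct features
             for wi, fc, fw in zip(w, feats_correct, feats_wrong)]
    return w

w = [0.0, 0.0]
w = sgd_step(w, feats_correct=[1.0, 0.0], feats_wrong=[0.0, 1.0])
print(w)  # → [0.1, -0.1]
```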

Slide 82

Evaluation
• 10,517 JavaScript projects from GitHub
• Sampled the 50 projects with the highest number of commits
• Training: 324,501 files
• Prediction: 3,710 files (minified with UglifyJS)
• 10-fold cross-validation

Slide 83

Manual Type Annotations vs. Predicted Annotations
• JavaScript developers typically do not type-check their annotations (misspelled, missing, or conflicting annotations)

Slide 84

Parameter Trade-off
• precision and runtime vs. the beam-search parameter s

Slide 85

[image-only slide]

Slide 86

Conditional Random Field
Normalize the score to get the probability:
P(y | x) = exp(score(y, x)) / Σ_{y′ ∈ Ω(x)} exp(score(y′, x))
where score(y, x) is the score of label y based on the features f_i, and Ω(x) is the set of all possible labels in x.
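The normalization on this slide is a softmax over all candidate label vectors: exponentiate each score and divide by the sum (the partition function Z(x)). A minimal sketch with toy scores:

```python
import math

def crf_probability(scores, y):
    """scores: {label_vector: score}. Return P(y | x) = exp(score) / Z."""
    z = sum(math.exp(s) for s in scores.values())   # Z(x)
    return math.exp(scores[y]) / z

scores = {"len": 2.0, "i": 1.0, "foo": 0.0}
p = crf_probability(scores, "len")
print(round(p, 3))  # → 0.665
```

Note that Z(x) is the same for every candidate, which is why the prediction slides can maximize the raw score instead of the probability.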

Slide 87

Outline
• An approach for predicting program properties using Conditional Random Fields (CRFs)

Slide 88

What is a Conditional Random Field?
• Similar to Markov networks
• What is a Markov network? https://www.youtube.com/watch?v=2BXoj778YU8
• A CRF is a probabilistic framework for labeling and segmenting sequential data.
• A CRF can be viewed as an undirected graphical model