by: Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley
(on the work of Veselin Raychev, Martin Vechev, and Andreas Krause)
• Dependency Network (DN): an undirected graphical model that captures various kinds of relationships between program elements.
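A minimal sketch (my own illustration, not the paper's implementation) of how such a dependency network could be represented: nodes are program elements, some with known labels and some unknown, and edges carry relationship names. The class, field names, and example identifiers below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class DependencyNetwork:
    """Undirected graphical model over program elements.

    labels: node id -> known label, or None if the property is unknown.
    edges:  (node_a, node_b, relationship) triples, e.g. relationship "L<R".
    """
    labels: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_edge(self, a, b, rel):
        self.labels.setdefault(a, None)
        self.labels.setdefault(b, None)
        self.edges.append((a, b, rel))

# Hypothetical example: two variables related by a comparison "L<R",
# where one label is known and the other must be inferred.
dn = DependencyNetwork()
dn.add_edge("var:i", "var:len", "L<R")
dn.labels["var:len"] = "len"   # known property; "var:i" stays unknown
```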
• A structurally similar DN mined from the training code repository can help infer the unknown property in the query DN.
• A perfectly identical DN is probably rare; multiple similar DNs could be found.
• So we need a fuzzy way to infer unknown properties.
[Figure: the DN in the input program next to DNs from the training repository, with node labels such as step, i, j, len, length, interval and edge relationships such as L>R, L<R, L<=R, L+=R, L=_.R]
• The entire training set D contains t programs; each program x^(j) comes with a vector of labels y^(j).
• Given a program x to be predicted, return the label vector y with the maximal probability P(y | x).
• n(x): the number of unknown properties in x.
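A sketch (not the paper's algorithm) of what this prediction query means when done exactly: enumerate every assignment of candidate labels to the unknown nodes and keep the most probable one. The `candidates` dictionary and `probability` helper are hypothetical; the exhaustive enumeration is what becomes intractable as n(x) grows.

```python
import itertools

def predict_exact(dn, candidates, probability):
    """Brute-force MAP inference: argmax over all label combinations.

    dn          -- dependency network with some labels unknown (None)
    candidates  -- dict: unknown node id -> list of candidate labels
    probability -- function(assignment, dn) -> P(y | x), assumed given
    """
    unknown = [n for n, lab in dn.labels.items() if lab is None]
    best, best_p = None, float("-inf")
    # n(x) unknown properties => |candidates|**n(x) combinations: exponential blowup.
    for combo in itertools.product(*(candidates[n] for n in unknown)):
        assignment = dict(zip(unknown, combo))
        p = probability(assignment, dn)
        if p > best_p:
            best, best_p = assignment, p
    return best
```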
• P(y | x) = (1/Z(x)) · exp( Σ_i w_i f_i(y, x) )
  • w_i f_i(y, x): the score of label y based on feature f_i
  • Σ_i w_i f_i(y, x): the score of the label vector y
  • Z(x) normalizes over all possible labels in x
  • the weights w_i are learnt through gradient descent
• Feature score of label vector y in program x:
  f_i(y, x) = Σ_{(a,b,rel) ∈ E(x)} 1[ ((y)_a, (y)_b, rel) appears in the training set ]
  • E(x): the set of all edges in the dependency network of program x
  • (y)_a: the label of node a, drawn from the unknown label vector y or from the known property vector
  • the indicator function returns 1 if this pair appears in the training set, otherwise it returns 0
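A sketch of how this pairwise scoring could be computed, reusing the hypothetical `DependencyNetwork` from above; the weight table `w` (one learnt weight per (label, label, relationship) triple seen in training) and the helper names are my assumptions.

```python
def label_of(node, assignment, known):
    """Label of a node, from the unknown label vector y or the known property vector."""
    return assignment.get(node, known.get(node))

def score(assignment, dn, known, w):
    """score(y, x) = sum over edges of w[(label_a, label_b, rel)].

    Triples never seen in training get weight 0 (their indicator feature is 0).
    """
    total = 0.0
    for a, b, rel in dn.edges:   # all edges of the dependency network of x
        triple = (label_of(a, assignment, known),
                  label_of(b, assignment, known), rel)
        total += w.get(triple, 0.0)
    return total
```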
• Exact inference is NP-hard: it would have to iterate over all possible label combinations.
• Need a faster (approximate) algorithm: greedily select the best possible labels.
  • Start with an initial assignment (e.g., the initial assignment gives one node the label len).
  • Make small changes and try to improve the score:
    • Extract the local network around the node.
    • Calculate its current score based on the local network.
    • Try other labels and calculate their scores.
    • Pick the label with the maximal local score.
    • Pick the next node and repeat the process (a sketch of this loop follows the figure note below).
[Figure: the node's label changes from len to name as the local score improves; surrounding nodes carry labels such as step, j, length, interval and relationships such as L+=R, L<=R, L=_.R]
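A sketch of that greedy, coordinate-ascent style loop under my own simplifications (fixed number of passes, candidates given per node); it reuses the hypothetical `score`/`label_of` helpers above and is not the paper's exact implementation.

```python
def predict_greedy(dn, known, w, candidates, n_passes=3):
    """Greedy approximate MAP inference over the unknown nodes."""
    unknown = [n for n, lab in dn.labels.items() if lab is None]
    assignment = {n: candidates[n][0] for n in unknown}   # initial assignment
    for _ in range(n_passes):
        for node in unknown:
            # Local network: only edges touching this node affect its label's score.
            local = [(a, b, rel) for a, b, rel in dn.edges if node in (a, b)]

            def local_score(label):
                trial = dict(assignment, **{node: label})
                return sum(w.get((label_of(a, trial, known),
                                  label_of(b, trial, known), rel), 0.0)
                           for a, b, rel in local)

            # Try other labels and pick the one with the maximal local score.
            assignment[node] = max(candidates[node], key=local_score)
    return assignment
```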
• Suppose we are trying to improve the score of a node (current label: len).
• Get the labels of all its adjacent nodes (e.g., j, length, interval).
• For each adjacent label and relationship, collect the matching (label, label, relationship) triples and rank them based on their weights.
• Get the top s triples; the labels they propose form the candidate set for the node (a sketch follows below).
[Figure: the local network around the node, with adjacent labels j, length, interval, relationships L+=R, L<=R, L=_.R, and a ranked table of candidate triples]
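A sketch of how such candidate labels might be gathered from the highest-weighted matching triples; the data layout (scanning the weight table directly) and all names are assumptions, and a real implementation would index the training triples by (label, relationship) instead of scanning.

```python
def candidate_labels(node, dn, assignment, known, w, s=16):
    """Collect candidate labels for `node` from the top-s matching triples."""
    scored = []
    for a, b, rel in dn.edges:
        if node not in (a, b):
            continue
        other = b if a == node else a
        other_label = label_of(other, assignment, known)
        for (la, lb, r), weight in w.items():      # triples seen in training
            if r != rel:
                continue
            if a == node and lb == other_label:
                scored.append((weight, la))        # candidate for the left endpoint
            elif b == node and la == other_label:
                scored.append((weight, lb))        # candidate for the right endpoint
    scored.sort(reverse=True)                      # rank by weight
    return [label for _, label in scored[:s]]      # labels of the top-s triples
```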
• Each feature function has a corresponding weight. How do we obtain the weights?
• Maximize the margin between the correct assignment and wrong assignments.
• Ideally: score(correct assignment) ≥ score(wrong assignment) + Δ(correct, wrong) for every wrong assignment, where Δ is a distance function between assignments.
• When a wrong assignment scores too close to the correct one, the margin score(correct) − score(wrong) − Δ becomes a negative value.
• Pick the wrong assignment with the minimal margin (the closest wrong assignment) and maximize the distance between the correct assignment and that closest wrong assignment.
• Based on this reward function, do stochastic gradient descent to update the weights (a sketch of one such update follows).
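A sketch of one max-margin stochastic update under my own simplifications (perceptron-style, with the greedy predictor standing in for finding the closest wrong assignment): if the highest-scoring competing assignment violates the margin, move weight from its triples to the correct assignment's triples. All helpers are the hypothetical ones defined above.

```python
def sgd_step(dn, known, y_true, w, candidates, lr=0.1):
    """One structured max-margin update for a single training program.

    y_true -- the correct assignment for the unknown nodes of this program.
    """
    def dist(y_pred):
        # Distance (loss) function: number of mislabeled nodes.
        return sum(y_pred[n] != y_true[n] for n in y_true)

    y_pred = predict_greedy(dn, known, w, candidates)   # approx. closest competitor
    margin = (score(y_true, dn, known, w)
              - score(y_pred, dn, known, w)
              - dist(y_pred))
    if margin < 0:                                      # margin constraint violated
        for a, b, rel in dn.edges:
            good = (label_of(a, y_true, known), label_of(b, y_true, known), rel)
            bad = (label_of(a, y_pred, known), label_of(b, y_pred, known), rel)
            w[good] = w.get(good, 0.0) + lr             # reward the correct triples
            w[bad] = w.get(bad, 0.0) - lr               # penalize the wrong ones
    return w
```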
• JS developers typically do not type-check their annotations (misspelled, missing, or conflicting annotations).
• Normalize the score to get the probability:
  P(y | x) = exp(score(y, x)) / Σ_{y' ∈ Ω(x)} exp(score(y', x))
  • w_i f_i(y, x): the score of label y based on feature f_i; score(y, x) = Σ_i w_i f_i(y, x) is the score of label vector y
  • Ω(x): the set of all possible labels in x
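A small sketch of that normalization over an (assumed) explicit enumeration of Ω(x), using a numerically stable softmax; it reuses the hypothetical `score` helper above.

```python
import math

def probability(assignment, dn, known, w, all_assignments):
    """P(y | x) = exp(score(y, x)) / sum over y' in Omega(x) of exp(score(y', x)).

    all_assignments -- the enumeration of Omega(x), which should contain `assignment`.
    """
    scores = [score(y, dn, known, w) for y in all_assignments]
    m = max(scores)                               # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores)      # the normalizer Z(x)
    return math.exp(score(assignment, dn, known, w) - m) / z
```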
• What are Markov Networks? https://www.youtube.com/watch?v=2BXoj778YU8
• A CRF (Conditional Random Field) is a probabilistic framework for labeling and segmenting sequential data.
• A CRF can be viewed as an undirected graphical model.