MULTISTAGING TO UNDERSTAND: DISTILLING THE ESSENCE OF CODE EXAMPLES

Getting a quick overview of an online code example is often non-trivial and requires a tedious, manual inspection process. In this talk, I introduce a technique called Multistaging to Understand (MTU), which streamlines this inspection process by distilling the example's essence. The essence of a code example conveys the most important aspects of the example’s intended function. MTU automatically decomposes the code in an example into code stages that learners can explore non-sequentially, enabling fast exploratory learning.

Huascar Sanchez

May 17, 2016

Transcript

  1. MULTISTAGING TO UNDERSTAND: DISTILLING THE ESSENCE OF CODE EXAMPLES

    Huascar Sanchez (huascar.sanchez@sri.com, SRI International), Jim Whitehead (ejw@soe.ucsc.edu, UC Santa Cruz), Martin Schäf (martin.shaef@sri.com, SRI International). ICPC’16, May 16-17, 2016.
  2. Distilling the Essence of Code Examples

    Problem: Understanding unfamiliar code during code foraging is laborious and challenging. • Much of the information contained in code is either peripheral or obscured by other elements. • Tools offer little support for locating the essential sections of a code example and then aiding their understanding. Solution: Deliver a method (and its tool) that discovers these essential sections and reveals only their relevant details. MTU - ICPC’16 - 05 16, 2016
  3. Multistage Representation of Code

    Code example:

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() { return R.nextInt(n); }
        static double uniform(double a, double b) { return a + uniform() * (b - a); }
        static double gaussian() {
          double r, x, y;
          do {
            x = uniform(-1.0, 1.0);
            y = uniform(-1.0, 1.0);
            r = (x*x) + (y*y);
          } while (r >= 1 || r == 0);
          return x * Math.sqrt(-2 * Math.log(r) / r);
        }
      }

    1: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() { return R.nextInt(n); }
      }

    2: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() { return R.nextInt(n); }
        static double uniform(double a, double b) { return a + uniform() * (b - a); }
      }

    3: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() {…}
        static double uniform(double a, double b) {…}
        static double gaussian() {
          double r, x, y;
          do {…} while (r >= 1 || r == 0);
          return x * Math.sqrt(-2 * Math.log(r) / r);
        }
      }
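A runnable variant of the slide's example may help when trying the three stages by hand. The slide elides the `Random` initializer and calls `R.nextInt(n)` with an `n` that is never declared, so the sketch below assumes a seeded `Random` and swaps in `nextDouble()`. The class name `RandGaussianDistribSketch` and the seed are hypothetical, not taken from the talk.

```java
import java.util.Random;

// Hypothetical runnable variant of the slide's example. The seed and the
// use of nextDouble() are assumptions; the slide elides both details.
final class RandGaussianDistribSketch {
    private static final Random R = new Random(42L); // assumed seed

    // Stage 1: a uniform value in [0, 1).
    static double uniform() {
        return R.nextDouble();
    }

    // Stage 2: a uniform value between a and b, built on stage 1.
    static double uniform(double a, double b) {
        return a + uniform() * (b - a);
    }

    // Stage 3: a standard Gaussian value via the polar method,
    // built on stage 2.
    static double gaussian() {
        double r, x, y;
        do {
            x = uniform(-1.0, 1.0);
            y = uniform(-1.0, 1.0);
            r = (x * x) + (y * y);
        } while (r >= 1 || r == 0);
        return x * Math.sqrt(-2 * Math.log(r) / r);
    }

    public static void main(String[] args) {
        int n = 100_000;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += gaussian();
        }
        System.out.println("sample mean: " + (sum / n));
    }
}
```

Each method depends only on the one before it, which is why the stages can be read in order of increasing size.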
  4. Multistage representation of code

    Roadmap for steering understanding

    1: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() { return R.nextInt(n); }
      }

    2: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() { return R.nextInt(n); }
        static double uniform(double a, double b) { return a + uniform() * (b - a); }
      }

    3: Code stage

      … public final class RandGaussianDistrib {
        private static final Random R…
        static double uniform() {…}
        static double uniform(double a, double b) {…}
        static double gaussian() {
          double r, x, y;
          do {…} while (r >= 1 || r == 0);
          return x * Math.sqrt(-2 * Math.log(r) / r);
        }
      }
  5. Implementation

  6. Multistaging Implementation

    [Architecture diagram, steps 1-3: browser plug-in (Stage); multistaging request carrying source code and capacity; RESTful service running MethodStaging w/Reduction; result (S, H*), e.g., hi ∈ hidden code H*. Code stages S shown for the code example: 1: Get Pivot Index, 2: Select, 3: Main.]

    Excerpt shown on the slide:

    Fig. 6. One application of MethodStaging against the SmallestNum code example: (a) Code stage 1. (b) Code stage 2. (c) Code stage 3.

    function GETBINDINGSIN(m)
      V, S, R ← {}
      W ← {target node types}
      S ← S ∪ m
      while S is not empty do
        u ← pop S
        if u ∉ V then
          V ← V ∪ {u}
          for each child node w in u do
            if w ∈ W then
              R ← R ∪ {binding of w}
            end if
            S ← S ∪ {w}
          end for
        end if
      end while
      return R // Set of bindings in m
    end function
    Fig. 7. GetBindingsIn subroutine.

    function RECONSTRUCTSOURCECODE(p, d)
      // deletes declaration nodes n ∈ {p \ d} from AST p
      p′ ← {n | n ∈ p and n ∉ {p \ d}}
      return source code for p′
    end function
    Fig. 8. ReconstructSourceCode subroutine.

    B. MethodStaging with Reduction. Programmers dealing with large code stages are often confronted with the consequent information overload problem. We can reduce this problem by automatically reducing them. The rationale is that reduced code stages can be easily digested by programmers wishing a quick overview of their operation. We make reduction decisions in MethodStaging based on examples’ source code structure. Our approach is consistent with how human abstractors approach inspecting unfamiliar …

    … Equation 1. The usage score of a code block is representative of the demand of its elements throughout the code example. The usage frequency of each element in a code block is the number of times this element appears in a code stage. As a result, we use code blocks’ usage score to show the blocks with a higher demand and hide those with a lesser demand.

    UsageScore(b) = (∑ elem ∈ b UsageFreq(elem)) / TotalChildren(b)   (1)

    For example, given a nested code block at line 11 in Figure 4c, we first collect its children: temp, list, left, and right. Second, we compute each child’s usage frequency: 2, 7, 10, and 9. Lastly, we put it all together and calculate the nested code block’s usage score: (2 + 7 + 10 + 9)/4 = 7.

    We cast the problem of reducing large code stages as an instance of the Precedence Constrained Knapsack Problem or PCKP [17]. This problem is specified herein. Problem 3.2: Code Stage Reduction. Given a set of code blocks B (with weight wb and profit pb per block b ∈ B), a Knapsack capacity W, a precedence order O ⊆ B × B, and a set of constraints C, find H* such that H* = B \ X*, where wb = number of lines of code in b, pb = UsageScore(b), X* = arg max {∑ b ∈ B pb}, and X* satisfies the constraints in C. The constraints in C include: ∑ bj ∈ B wbj ≤ W, where bi → bj (bi precedes bj) ∈ O, and i, j = 1, …, |B|. Similar to Samphaiboon et al. [17], we solve this problem by using dynamic programming. Our solution generalizes the code stage reduction problem, also taking into account a precedence relation between code blocks in a code stage. We build a Directed Acyclic Graph (DAG) to represent such a relation, where nodes correspond to code blocks in a one-to-one fashion. This relation is expressed as a composition relation between code blocks. For instance, a code block k …
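The usage-score arithmetic in the worked example above can be checked with a few lines of Java. This is a minimal sketch: the class and method names are hypothetical, and returning 0 for a block without children is an assumption the excerpt does not address.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of Equation 1; names are hypothetical.
final class UsageScoreSketch {
    // UsageScore(b) = sum of the children's usage frequencies
    //                 divided by the number of children.
    static double usageScore(List<Integer> usageFrequencies) {
        if (usageFrequencies.isEmpty()) {
            return 0.0; // assumption: a block without children scores 0
        }
        int total = 0;
        for (int frequency : usageFrequencies) {
            total += frequency;
        }
        return (double) total / usageFrequencies.size();
    }

    public static void main(String[] args) {
        // Children temp, list, left, right with frequencies 2, 7, 10, 9.
        System.out.println(usageScore(Arrays.asList(2, 7, 10, 9))); // prints 7.0
    }
}
```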
  7. Multistaging in Action
  8. Multistaging to Understand

    Exploring the code stages suggests a form of code inspection called Multistaging to Understand (MTU). By adopting MTU, programmers can • inspect a few generated code stages, • mentally abstract their functionality, and then • combine the knowledge gained to understand the example’s main functionality. MTU shares similarities with code reading by stepwise abstraction (Linger et al., 1979).
  9. MTU Evaluation

  10. MTU in the Lab

    We consider the following question: Does MTU make the understanding of unfamiliar code examples easier during code foraging? Here, easier means: • high comprehension accuracy • short reviewing time
  11. Experimental Setup (Babbie, 2015)

    • 12 participants, 2 groups, 3 tasks, 120 minutes • Crossed factorial design with 2 factors: comprehension strategy (between-subjects) and size of code examples (within-subjects) • Variables: response accuracy and reviewing time. SCC - SRI - 09 18, 2015
  12. Response Accuracy

    Open-ended questions addressing five comprehension abstractions (Pennington, 1987): • Function: Describe the overall functionality. • Control flow: Describe the execution sequence using pseudocode. • Data flow: Describe when a data object gets updated. • Operations: Describe a data object’s need in an execution sequence. • State: Describe a data object’s composition at a point of execution. Rating scheme (Du Bois, 2005) to score answers: Correct (10 pts), Almost Correct (8 pts), Right Idea (5 pts), and Wrong (0 pts).
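To make the rating scheme concrete, the sketch below maps each rating to its points and averages a set of ratings. The class name, the helper, and the six ratings in `main` are hypothetical illustrations, not data from the study.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the rating scheme; the sample ratings are invented.
final class ResponseAccuracySketch {
    // Points per rating as listed on the slide (Du Bois, 2005).
    enum Rating {
        CORRECT(10), ALMOST_CORRECT(8), RIGHT_IDEA(5), WRONG(0);

        final int points;

        Rating(int points) {
            this.points = points;
        }
    }

    // Average points over a non-empty list of ratings.
    static double averageAccuracy(List<Rating> ratings) {
        if (ratings.isEmpty()) {
            throw new IllegalArgumentException("no ratings");
        }
        int total = 0;
        for (Rating rating : ratings) {
            total += rating.points;
        }
        return (double) total / ratings.size();
    }

    public static void main(String[] args) {
        // Hypothetical ratings for one question in one group.
        List<Rating> ratings = Arrays.asList(
            Rating.CORRECT, Rating.ALMOST_CORRECT, Rating.ALMOST_CORRECT,
            Rating.RIGHT_IDEA, Rating.RIGHT_IDEA, Rating.RIGHT_IDEA);
        System.out.println(averageAccuracy(ratings));
    }
}
```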
  13. Reviewing time

    Collected reviewing times from two sources: • the browser plug-in’s time tracker • Upwork’s time tracker
  14. Results

  15. Average Response Accuracy

    Significant differences in average response accuracy, favoring the treatment group (in bold) over the control group (in italics).

                     Short [35,70)           Medium [70,140)         Long [140,200]
                     MTU   RTU   p-value     MTU   RTU   p-value     MTU   RTU   p-value
    Function         6.83  3.33  0.0037      7.17  3.83  0.0509      7.67  5.00  0.0534
    Control flow     8.50  6.83  0.0525      7.17  4.33  0.1984      8.17  4.33  0.0204
    Data flow        8.67  6.17  0.0462      5.33  3.00  0.2308      8.50  6.00  0.1199
    State            8.67  7.00  0.0873      7.67  5.67  0.1594      9.00  6.50  0.0971
    Operations       7.33  3.33  0.0595      7.83  4.83  0.0609      6.50  3.00  0.0549

    Unaccounted factor: Delocalization (Letovsky et al., 1986). Delocalization led to many wrong answers, which caused high p-values. Note: Rating scheme for scoring accuracy of answers: Correct (10 points), Almost Correct (8 points), Right Idea (5 points), and Wrong (0 points).
  16. Average Reviewing Time

    Significant speed improvements of the treatment group (in bold) over the control group (in italics).

                             Short [35,70)         Medium [70,140)        Long [140,200]
                             MTU  RTU  p-value     MTU  RTU   p-value     MTU  RTU  p-value
    Reviewing time (secs)    475  745  0.0995      655  1022  0.0446      465  912  0.0284

    Note: Reviewing times obtained from two sources: Upwork’s time tracker and Violette’s time tracker.
  17. Summarizing

    MTU facilitates quick and accurate understanding when most of the code is localized. MTU provides minor benefits when code is partially or fully delocalized. MTU provides consistent speed improvements regardless of delocalization.
  18. Questions: huascar.sanchez@sri.com

    https://github.com/vesperin
  20. Multistager Architecture

  21. The Multistaging Problem

    Given the AST of a code example, with a set of n method declarations D = D1 ∪ D2 ∪ … ∪ Dn, compute a set of interconnected code stages {S | S ⊆ D × D}, sorted in ascending order by LOC, such that each code stage s ∈ S ∪ {sØ} builds upon, and in relation to, preceding code stages. Where: • sØ is the null code stage (sØ’s preceding code stage is sØ) • si < sj: si precedes sj, with i, j = 1 … |S|
  22. Solution: MethodStaging Algorithm

    Algorithm: MethodStaging(p /*AST*/, sØ)
      Stages = {sØ}
      for each method m in p do
        d = {} // declarations set
        for each binding b in GetBindingsIn(m) do // e.g., b = (select, method)
          d = d ∪ {getDeclarationNode(b)}
        end for
        s = source code for {n | n ∈ p ∧ n ∉ {p \ d}}
        Stages = Stages ∪ {s}
      end for
      return sortAscending(Stages)
    end Algorithm
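The MethodStaging pseudocode above can be approximated without a real AST. The sketch below is a simplification under stated assumptions: declarations are plain strings, a hypothetical `refs` map stands in for `GetBindingsIn` plus `getDeclarationNode`, references are followed transitively, each stage also keeps the method itself, stage size is a supplied LOC count, and the null stage sØ is omitted. None of these names come from the authors' tool.

```java
import java.util.*;

// Simplified sketch of MethodStaging over a reference map instead of an AST.
final class MethodStagingSketch {
    // Collects m plus every declaration reachable from m (iterative DFS).
    static Set<String> declarationsFor(String m, Map<String, List<String>> refs) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(m);
        while (!stack.isEmpty()) {
            String u = stack.pop();
            if (visited.add(u)) {
                for (String w : refs.getOrDefault(u, Collections.emptyList())) {
                    stack.push(w);
                }
            }
        }
        return visited;
    }

    // Sums the LOC of all declarations kept in a stage.
    static int totalLoc(Set<String> stage, Map<String, Integer> loc) {
        int total = 0;
        for (String declaration : stage) {
            total += loc.getOrDefault(declaration, 0);
        }
        return total;
    }

    // Builds one stage per method and sorts the stages ascending by LOC.
    static List<Set<String>> stage(List<String> methods,
                                   Map<String, List<String>> refs,
                                   Map<String, Integer> loc) {
        List<Set<String>> stages = new ArrayList<>();
        for (String m : methods) {
            stages.add(declarationsFor(m, refs));
        }
        stages.sort(Comparator.comparingInt(s -> totalLoc(s, loc)));
        return stages;
    }

    // Hypothetical model of the RandGaussianDistrib example; LOC values are invented.
    static List<Set<String>> demo() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("uniform()", Arrays.asList("R"));
        refs.put("uniform(a,b)", Arrays.asList("uniform()"));
        refs.put("gaussian()", Arrays.asList("uniform(a,b)"));
        Map<String, Integer> loc = new HashMap<>();
        loc.put("R", 1);
        loc.put("uniform()", 3);
        loc.put("uniform(a,b)", 3);
        loc.put("gaussian()", 9);
        // Methods are listed largest first to show that sorting reorders them.
        return stage(Arrays.asList("gaussian()", "uniform(a,b)", "uniform()"), refs, loc);
    }

    public static void main(String[] args) {
        for (Set<String> s : demo()) {
            System.out.println(s);
        }
    }
}
```

Under these assumptions the demo yields three stages of two, three, and four declarations, mirroring the three code stages of the RandGaussianDistrib slides.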
  23. Reflections on MethodStaging

    MethodStaging provides an effective divide-and-conquer approach for code understanding. One caveat: it can produce large code stages. • Large code stages (code stages with long methods) can hinder MethodStaging’s effectiveness. • Long methods tend to increase programmers’ cognitive overhead more than small methods (Mantyla et al., 2003). Solution: MethodStaging w/Reduction (via code folding).
  24. MethodStaging w/Reduction Basics

    Reduction in MethodStaging shows the code blocks (X*) with a high usage score in each code stage s and hides (i.e., folds) the ones with a low usage score (H*), where X* ∪ H* ∈ s. The usage frequency of an element in a code block b ⊆ s is the number of times it appears in s. UsageScore(b) = (∑ elem ∈ b UsageFreq(elem)) / TotalChildren(b)
  25. MethodStaging w/Reduction (Formulated as a Precedence-Constrained Knapsack Problem)

    Given a set of code blocks B (with a weight wb and profit pb per b ∈ B), a Knapsack capacity W, a precedence order O ⊆ B × B (modeled as a DAG), and a set of constraints C, find the set H* such that H* = B \ X*, wb = LOC(b), pb = UsageScore(b), X* = arg max {∑ b ∈ B pb}, and X* satisfies the constraints in C. Where C includes: • ∑ b’ ∈ B wb’ ≤ W • ∃ bi → bj (bi precedes bj) ∈ O, i, j = 1 … |B|
  26. Multistaging Problem: MethodSlicing w/Reduction

    Input: AST node p, and declarations d ∈ p
    Output: A tuple consisting of the reconstructed source code and H*
    Function ReconstructSourceCode(p, d)
      // delete nodes {p \ d} from AST
      let p′ ← JDT.deleteAstNodes(p, {p \ d})
      let DAG p′ ← traverse p′ and then get built DAG
      let H* ← computes B p′ \ X* p′ using DAG p′ and a capacity of 15 LOC
      return (JDT.getSourceCode(p′), H*)
    end
    Figure 11: Pseudocode for updated ReconstructSourceCode. This subroutine returns a tuple comprising the reconstructed source code and the code elements to hide.
  27. Generating the set H*

    [Diagram: three copies of a DAG over code blocks B1 to B12, labeled B, X*, and H*; node labels wb/pb: 63/6, 7/3, 19/1, 5/3, 5/3, 3/0, 5/0, 3/0, 13/1, 3/1, 2/0, 17/1.]

    1. Build DAG from the example’s AST • bi → bj: bi precedes bj • wb and pb are calculated • wb = wb-original − (wc + wd)
    2. Solve X* using dynamic programming:
       X*[k, w] = X*[k − 1, w]                                  if wk > w
       X*[k, w] = max(X*[k − 1, w], X*[k − 1, w − wk] + pk)     if wk ≤ w ∧ k − 1 → k
    3. Generate H*: • H* = B \ X*
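As a rough illustration of the three steps above, the sketch below solves a restricted variant of the problem by dynamic programming. It is not the authors' implementation: it assumes the precedence relation is a nesting forest (each block has at most one parent, and a block may be shown only if its parent is shown), that blocks are listed in preorder, and that weights are integer LOC counts. The class name, the array-based input, and the sample instance are all hypothetical; only the 15 LOC capacity echoes the figure on the previous slide.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of code stage reduction as a tree-shaped precedence-constrained knapsack.
final class CodeStageReductionSketch {
    // Returns the indices of the hidden blocks H* = B \ X*.
    // Assumes blocks are in preorder: parent[i] < i (or -1 for a root)
    // and every subtree occupies a contiguous index range.
    static List<Integer> hiddenBlocks(int[] parent, int[] loc, double[] score, int capacity) {
        int n = parent.length;
        int[] size = new int[n]; // subtree sizes, children are processed before parents
        for (int i = n - 1; i >= 0; i--) {
            size[i] += 1;
            if (parent[i] >= 0) {
                size[parent[i]] += size[i];
            }
        }
        // best[i][c]: best total score using blocks i..n-1 with c LOC left.
        double[][] best = new double[n + 1][capacity + 1];
        boolean[][] take = new boolean[n][capacity + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int c = 0; c <= capacity; c++) {
                double skip = best[i + size[i]][c]; // hiding i hides its whole subtree
                best[i][c] = skip;
                if (loc[i] <= c) {
                    double shown = score[i] + best[i + 1][c - loc[i]];
                    if (shown > skip) {
                        best[i][c] = shown;
                        take[i][c] = true;
                    }
                }
            }
        }
        // Walk the decisions to recover the hidden set.
        List<Integer> hidden = new ArrayList<>();
        int i = 0;
        int c = capacity;
        while (i < n) {
            if (take[i][c]) {
                c -= loc[i];
                i++;
            } else {
                for (int j = i; j < i + size[i]; j++) {
                    hidden.add(j);
                }
                i += size[i];
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Hypothetical stage: block 0 is the root, 1 and 3 are nested in 0, 2 is nested in 1.
        int[] parent = {-1, 0, 1, 0};
        int[] loc = {3, 6, 4, 5};
        double[] score = {5, 2, 7, 3};
        System.out.println(hiddenBlocks(parent, loc, score, 15)); // prints [3]
    }
}
```

With a capacity of 15 LOC, blocks 0, 1, and 2 fit (13 LOC, score 14) and block 3 is folded. Lowering the capacity to 10 keeps blocks 0 and 3 and folds blocks 1 and 2 instead.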