Slide 1

Slide 1 text

Beyond Java: Expanding Refactoring Research to Multiple Programming Languages
Marco Tulio Valente
ASERG, DCC, UFMG, BR
[email protected], @mtov

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

Belo Horizonte: Food, Mountains, Universities & Science

Slide 4

Slide 4 text

Expanding Refactoring Research to Multiple Programming Languages 4

Slide 5

Slide 5 text

GitHub 2018 5 https://octoverse.github.com/projects#languages

Slide 6

Slide 6 text

Stack Overflow Survey 2019 6 https://insights.stackoverflow.com/survey/2019#technology

Slide 7

Slide 7 text

TIOBE Index (May 2019) 7 https://www.tiobe.com/tiobe-index/

Slide 8

Slide 8 text

Software engineering is not dominated by a single language 8

Slide 9

Slide 9 text

… but refactoring research is dominated by Java 9

Slide 10

Slide 10 text

Refactoring Papers
● Study by Marouane Kessentini, Danny Dig et al.
● Java dominates
● More papers use Java than all other languages combined
● C++ is the 2nd language
Source: Danny Dig's keynote talk at WAPI 2018 https://w-api.github.io/resources/wapi18_dig_refactoring.pdf

Slide 11

Slide 11 text

… and refactoring practice? 11

Slide 12

Slide 12 text

Fowler, 2nd ed. (2018)
"The first edition of this book used Java … [In this 2nd edition] I chose JavaScript to illustrate these refactorings ..."

Slide 13

Slide 13 text

… supporting multiple languages is only an engineering problem! 13

Slide 14

Slide 14 text

I'll counter-argue using 5 papers & RefDiff 2.0 14

Slide 15

Slide 15 text

15 1 FSE 16

Slide 16

Slide 16 text

Firehouse interviews (FSE 16):
* 195 developers
* 124 projects
* 436 refactoring instances
* 12 refactoring operations
* 44 reasons for refactoring

Slide 18

Slide 18 text

GitHub contributors ≈ Java projects
Do our findings generalize to other languages? (FSE 16)

Slide 19

Slide 19 text

19 2 SPLASH 17

Slide 20

Slide 20 text

20 SPLASH 17

Slide 21

Slide 21 text

21 SPLASH 17

Slide 22

Slide 22 text

Do these findings generalize to other languages? Python? C#? (SPLASH 17)

Slide 23

Slide 23 text

23 3 IEEE TSE 12

Slide 24

Slide 24 text

24 IEEE TSE 12

Slide 25

Slide 25 text

25 IEEE TSE 12

Slide 26

Slide 26 text

Refactoring engines ≈ Java-based IDEs
Do we have refactoring engines for other languages?
If yes, can we reuse this technique? (IEEE TSE 12)

Slide 27

Slide 27 text

27 4 JSS 18

Slide 28

Slide 28 text

28 JSS 18

Slide 29

Slide 29 text

JMove ⇒ Java-based tool
How to infer dependencies in JavaScript (e.g., using Facebook's Flow)?
Can we reuse these ideas to build JSMove?

Slide 30

Slide 30 text

30 5 ICSE 18

Slide 31

Slide 31 text

31 ICSE 18

Slide 32

Slide 32 text

ICSE 18

Slide 33

Slide 33 text

Commit history ≈ Java-based projects
C, Python, JS, etc.? (ICSE 18)

Slide 34

Slide 34 text

RefDiff 2.0: Detecting Refactorings in Multiple Programming Languages by Danilo Silva, PhD student 34

Slide 35

Slide 35 text

RefDiff 1.0 35 6 MSR 17

Slide 36

Slide 36 text

RefDiff 1.0 36 Java? MSR 17

Slide 37

Slide 37 text

RefDiff 1.0 (MSR 17)
Java? YES
But with the first language-agnostic design decisions

Slide 38

Slide 38 text

How to design a language-agnostic refactoring detection tool? 38

Slide 39

Slide 39 text

What features do we have in all languages? 39

Slide 40

Slide 40 text

What features do we have in all languages? 1. Tokens 40

Slide 41

Slide 41 text

What features do we have in all languages? 2. Containment Hierarchy:
● a program is not a flat list of tokens
● Examples:
○ C: Tokens → Functions → Files
○ Java: Tokens → Methods → Classes → Packages
○ JavaScript: Tokens → Functions → Files
● Tokens + Containment Hierarchy ⇒ Code Structure Tree (CST)

Slide 42

Slide 42 text

42 Code Structure Tree

Slide 43

Slide 43 text

CST vs AST
● Difference is not C vs A, but in the "S":
○ CST ⇒ Structure (or hierarchy), which we have in all languages
○ AST ⇒ Syntax, which is language-specific
● CSTs are "universal" ASTs

Slide 44

Slide 44 text

CST vs AST
● CSTs are also simpler structures than ASTs
● Argument #1:
○ JDT AST: 112 classes
○ CST: 1 class
● Argument #2:
○ We need to implement 4 visitors to generate CSTs from Java ASTs
● We implemented CSTs for three languages:
○ Java, JavaScript, C (by an MSc student, in a course project)

Slide 45

Slide 45 text

RefDiff 2.0 Architecture
[Diagram] Program v1, Program v2 → CST Plug-in (for language X) → CST v1, CST v2 → RefDiff → Refactorings (in language X)

Slide 46

Slide 46 text

RefDiff 2.0 Architecture
[Diagram] Program v1, Program v2 → CST Plug-in ("simple" to implement) → CST v1, CST v2 → RefDiff (language agnostic) → Refactorings
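
The split between a per-language plug-in and a language-agnostic core can be sketched as follows. This is a minimal illustration and not RefDiff's actual API: `toy_plugin`, `detect`, the flat dictionary used as a CST, and the toy `def name: body` language are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a CST is reduced to {node name: token list} to keep this short.
CST = Dict[str, List[str]]
Plugin = Callable[[str], CST]


def toy_plugin(source: str) -> CST:
    """Hypothetical plug-in for a toy language: one 'def name: body' per line."""
    cst: CST = {}
    for line in source.splitlines():
        if line.startswith("def "):
            header, _, body = line[4:].partition(":")
            cst[header.strip()] = body.split()
    return cst


def detect(plugin: Plugin, v1: str, v2: str) -> List[Tuple[str, str, str]]:
    """Language-agnostic part: it only looks at CSTs, never at source code."""
    before, after = plugin(v1), plugin(v2)
    removed = [name for name in before if name not in after]
    added = [name for name in after if name not in before]
    # Toy rule: report a rename when a removed and an added node
    # carry exactly the same tokens.
    return [("RENAME", old, new)
            for old in removed for new in added
            if before[old] == after[new]]


result = detect(toy_plugin,
                "def f: return 1\ndef g: return 2",
                "def f: return 1\ndef h: return 2")
print(result)
```

Supporting another language in this sketch means writing another function with the `Plugin` signature; `detect` stays unchanged.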

Slide 47

Slide 47 text

CST Nodes
● Each CST node has
○ ID
○ Namespace
○ Type: function, method, class, package, file, etc.
○ Parameter list (optional)
○ Tokens
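
A single node class is enough to hold these fields. The sketch below is one possible shape, written in Python for brevity; the field names, the `add_child` and `qualified_name` helpers, and the example program are assumptions, not RefDiff's own data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class CSTNode:
    """One node of a Code Structure Tree (field names are illustrative)."""
    node_id: str        # local identifier, e.g. "sum(int,int)"
    namespace: str      # prefix given by the enclosing nodes, e.g. "math.Calc."
    node_type: str      # "function", "method", "class", "package", "file", ...
    tokens: List[str]   # tokens of the node's source code
    parameters: Optional[List[str]] = None  # only for callable nodes
    parent: Optional["CSTNode"] = field(default=None, repr=False)
    children: List["CSTNode"] = field(default_factory=list)

    def add_child(self, child: "CSTNode") -> "CSTNode":
        """Attach a child node and return it, building the containment hierarchy."""
        child.parent = self
        self.children.append(child)
        return child

    def qualified_name(self) -> str:
        return self.namespace + self.node_id


# Java-like hierarchy: method inside class inside package.
pkg = CSTNode("math", "", "package", [])
cls = pkg.add_child(CSTNode("Calc", "math.", "class", ["class", "Calc"]))
method = cls.add_child(
    CSTNode("sum(int,int)", "math.Calc.", "method",
            ["return", "a", "+", "b"], parameters=["a", "b"]))
print(method.qualified_name())
```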

Slide 48

Slide 48 text

RefDiff Algorithm 48

Slide 49

Slide 49 text

Key step: matching CST nodes (n1, n2)

Slide 50

Slide 50 text

Matching Relationships 50

Slide 51

Slide 51 text

Same Relationship 51 same type

Slide 52

Slide 52 text

Same Relationship 52 same IDs

Slide 53

Slide 53 text

Same Relationship 53 same parent
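
Slides 51 to 53 build up three conditions for the Same relationship: same type, same IDs, same parent. A minimal predicate, assuming a simplified `Node` record in which the parent is represented by its qualified name (a hypothetical simplification; a real matcher would compare already-matched parent nodes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    type: str               # "function", "method", "class", ...
    id: str                 # local identifier, e.g. "sum(int,int)"
    parent: Optional[str]   # qualified name of the parent (assumption)


def same(n1: Node, n2: Node) -> bool:
    """n1 (before) and n2 (after) are the same, unchanged element."""
    return (n1.type == n2.type
            and n1.id == n2.id
            and n1.parent == n2.parent)


before = Node("method", "sum(int,int)", "math.Calc")
after = Node("method", "sum(int,int)", "math.Calc")
moved = Node("method", "sum(int,int)", "math.Util")
print(same(before, after), same(before, moved))
```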

Slide 54

Slide 54 text

Pull Up Relationship 54 same type same IDs

Slide 55

Slide 55 text

Pull Up Relationship
● same type
● same IDs
● parent(n1) is a subtype of parent(n2)
[Diagram labels: n1, n2, (n1)'; before, after]

Slide 56

Slide 56 text

Pull Up Relationship
● same type
● same IDs
● parent(n1) is a subtype of parent(n2)
[Diagram labels: n1, n2, (n1)'; before, after]
If the language does not have inheritance: subtype(c1, c2) = false, for all c1, c2
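
The Pull Up conditions can be written with the subtype test passed in as a function, so that a language without inheritance simply supplies one that always returns false. This is a sketch under assumptions: `Node`, `make_subtype`, and `no_inheritance` are hypothetical names, and the subtype check here covers direct subtyping only.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass(frozen=True)
class Node:
    type: str
    id: str
    parent: str   # name of the containing class (assumption)


SubtypeFn = Callable[[str, str], bool]


def no_inheritance(c1: str, c2: str) -> bool:
    """For languages without inheritance: subtype(c1, c2) is always false."""
    return False


def make_subtype(edges: Set[Tuple[str, str]]) -> SubtypeFn:
    """Build a subtype test from (subclass, superclass) pairs; direct only."""
    def subtype(c1: str, c2: str) -> bool:
        return (c1, c2) in edges
    return subtype


def pull_up(n1: Node, n2: Node, subtype: SubtypeFn) -> bool:
    """n1 (before, in the subclass) was pulled up to n2 (after, in the superclass)."""
    return (n1.type == n2.type
            and n1.id == n2.id
            and subtype(n1.parent, n2.parent))


java_like = make_subtype({("Dog", "Animal")})
n1 = Node("method", "speak()", "Dog")
n2 = Node("method", "speak()", "Animal")
print(pull_up(n1, n2, java_like), pull_up(n1, n2, no_inheritance))
```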

Slide 57

Slide 57 text

Rename Relationship 57 same type same containers

Slide 58

Slide 58 text

Rename Relationship 58 same type same containers different names

Slide 59

Slide 59 text

Rename Relationship
● same type
● same containers
● different names
It's common to have renaming + edits: tokens(n1) ≠ tokens(n2)
We need to tolerate edits in code (n2)

Slide 60

Slide 60 text

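Code Similarity: Weighted Jaccard Coefficient

$$sim(e_1, e_2) = \frac{\sum_{t} \min(m_1(t), m_2(t)) \cdot idf(t)}{\sum_{t} \max(m_1(t), m_2(t)) \cdot idf(t)}$$

where:
- e_i: tokens in a CST node
- m_i(t): number of occurrences of token t in e_i
- idf: idf (inverse document frequency) coefficient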
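
A weighted Jaccard coefficient over token multisets is usually computed as the idf-weighted sum of per-token minimum counts divided by the idf-weighted sum of per-token maximum counts. The sketch below follows that form. The idf definition used here, log(1 + N / n_t) with N nodes and n_t nodes containing token t, is an assumption for the example, as are the function names.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def idf_weights(nodes: Iterable[List[str]]) -> Dict[str, float]:
    """idf(t) = log(1 + N / n_t); rare tokens get higher weights (assumed form)."""
    nodes = list(nodes)
    doc_freq = Counter(t for tokens in nodes for t in set(tokens))
    return {t: math.log(1 + len(nodes) / n) for t, n in doc_freq.items()}


def weighted_jaccard(e1: List[str], e2: List[str],
                     idf: Dict[str, float]) -> float:
    """Weighted Jaccard between two token lists; unknown tokens get weight 1."""
    m1, m2 = Counter(e1), Counter(e2)
    vocab = set(m1) | set(m2)
    num = sum(min(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocab)
    den = sum(max(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocab)
    return num / den if den else 0.0


corpus = [["return", "a", "+", "b"],
          ["return", "a", "-", "b"],
          ["print", "x"]]
idf = idf_weights(corpus)
score = weighted_jaccard(corpus[0], corpus[1], idf)
print(round(score, 3))
```

With idf weighting, tokens that appear in nearly every node (such as `return`) contribute less to the score than tokens that are specific to a few nodes.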

Slide 61

Slide 61 text

Rename Relationship
● same type
● same containers
● different names
● similar tokens: jaccard > 0.5
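
Putting the four Rename conditions together gives a predicate like the one below. To stay self-contained, the sketch uses a plain (unweighted) multiset Jaccard as a stand-in for the weighted coefficient; the `Node` record and function names are hypothetical, and only the 0.5 threshold comes from the slide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Node:
    type: str
    name: str
    container: str            # name of the enclosing node (assumption)
    tokens: Tuple[str, ...]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted multiset Jaccard, standing in for the weighted version."""
    m1, m2 = Counter(a), Counter(b)
    union = sum((m1 | m2).values())
    return sum((m1 & m2).values()) / union if union else 0.0


def rename(n1: Node, n2: Node, threshold: float = 0.5) -> bool:
    """n1 (before) was renamed to n2 (after), tolerating edits in the body."""
    return (n1.type == n2.type
            and n1.container == n2.container
            and n1.name != n2.name
            and jaccard(n1.tokens, n2.tokens) > threshold)


old = Node("function", "add", "calc.js", ("return", "a", "+", "b"))
new = Node("function", "sum", "calc.js", ("return", "a", "+", "b", "+", "1"))
other = Node("function", "log", "calc.js", ("console", ".", "log", "x"))
print(rename(old, new), rename(old, other))
```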

Slide 62

Slide 62 text

Evaluation 62

Slide 63

Slide 63 text

JavaScript Results Precision: 91% Recall: 88% 63

Slide 64

Slide 64 text

C Results Precision: 88% Recall: 91% 64

Slide 65

Slide 65 text

Java Results Precision: 96% (vs 99% RMiner) Recall: 80% (vs 81% RMiner) 65

Slide 66

Slide 66 text

If you want to use RefDiff 2.0 https://github.com/aserg-ufmg/RefDiff 66

Slide 67

Slide 67 text

Summary 67

Slide 68

Slide 68 text

"this is only an engineering problem" is a false argument, at least when detecting refactorings 68

Slide 69

Slide 69 text

Supporting multiple languages can increase the practical impact of refactoring research 69

Slide 70

Slide 70 text

Supporting multiple languages can increase the practical impact of refactoring research 70

Slide 71

Slide 71 text

Thanks! Marco Tulio Valente ASERG, DCC, UFMG, BR 71

Slide 72

Slide 72 text

We also need call graphs
● To detect refactorings, it's also useful to have a call graph
● Call graph: node A calls node B
● We also embedded a lightweight call graph in CSTs
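
A lightweight call graph only needs to record "node A calls node B" edges between CST node names. The class below is an illustrative sketch with hypothetical names, not RefDiff's implementation. The usage example assumes one way such edges could help: after an extraction, the original function is expected to call the newly created one.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class CallGraph:
    """Directed edges between CST node names: caller -> callee."""

    def __init__(self) -> None:
        self._callees: DefaultDict[str, Set[str]] = defaultdict(set)
        self._callers: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_call(self, caller: str, callee: str) -> None:
        self._callees[caller].add(callee)
        self._callers[callee].add(caller)

    def calls(self, caller: str, callee: str) -> bool:
        return callee in self._callees.get(caller, set())

    def callers_of(self, callee: str) -> Set[str]:
        return set(self._callers.get(callee, set()))


# Version 2 of a program: render() now delegates to a new helper, format().
after = CallGraph()
after.add_call("render", "format")
after.add_call("main", "render")

# Assumed heuristic: format() is an extraction candidate from render()
# because render() calls it in the new version.
print(after.calls("render", "format"), after.callers_of("render"))
```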