Beyond Java: Expanding Refactoring Research to Multiple Programming Languages (IWoR 2019)

Traditionally, academic research on refactoring has focused on Java programs. However, in recent years, other programming languages have gained momentum and are being used to build highly successful applications. In this talk, I will first motivate the need to expand research on refactoring to support other programming languages. Second, I will argue that this support is not only an engineering and portability issue; on the contrary, it demands major changes in the tools and algorithms previously developed by researchers in the area. Finally, I will present our current effort to evolve and redesign RefDiff, our refactoring detection tool, to work with multiple languages and programming paradigms.

ASERG, DCC, UFMG

May 28, 2019

Transcript

  1. Beyond Java: Expanding Refactoring Research to Multiple Programming Languages. Marco Tulio Valente, ASERG, DCC, UFMG, BR. mtov@dcc.ufmg.br, @mtov

  2. 2

  3. 3 Food Mountains Universities & Science Belo Horizonte

  4. Expanding Refactoring Research to Multiple Programming Languages 4

  5. GitHub 2018 5 https://octoverse.github.com/projects#languages

  6. Stack Overflow Survey 2019 6 https://insights.stackoverflow.com/survey/2019#technology

  7. TIOBE Index (May 2019) 7 https://www.tiobe.com/tiobe-index/

  8. Software engineering is not dominated by a single language 8

  9. … but refactoring research is dominated by Java 9

  10. Refactoring Papers • Study by Marouane Kessentini, Danny Dig et al. • Java dominates • More papers use Java than all other languages combined • C++ is the 2nd language. Source: Danny Dig's keynote talk at WAPI 2018, https://w-api.github.io/resources/wapi18_dig_refactoring.pdf

  11. … and refactoring practice? 11

  12. Fowler, 2nd ed. (2018): "The first edition of this book used Java … [In this 2nd edition] I chose JavaScript to illustrate these refactorings ..."

  13. … supporting multiple languages is only an engineering problem! 13

  14. I'll counter-argue using 5 papers & RefDiff 2.0

  15. 15 1 FSE 16

  16. Firehouse interviews: * 195 developers * 124 projects * 436 refactoring instances * 12 refactoring operations * 44 reasons for refactoring (FSE 16)

  17. Firehouse interviews: * 195 developers * 124 projects * 436 refactoring instances * 12 refactoring operations * 44 reasons for refactoring (FSE 16)

  18. GitHub contributors ≈ Java projects. Do our findings generalize to other languages? (FSE 16)

  19. 19 2 SPLASH 17

  20. 20 SPLASH 17

  21. 21 SPLASH 17

  22. Do these findings generalize to other languages? Python? C#? (SPLASH 17)

  23. 23 3 IEEE TSE 12

  24. 24 IEEE TSE 12

  25. 25 IEEE TSE 12

  26. Refactoring engines ≈ Java-based IDEs. Do we have refactoring engines for other languages? If yes, can we reuse this technique? (IEEE TSE 12)

  27. 27 4 JSS 18

  28. 28 JSS 18

  29. JMove ⇒ Java-based tool. How to infer dependencies in JavaScript (e.g., using Facebook's Flow)? Can we reuse these ideas to build JSMove?

  30. 30 5 ICSE 18

  31. 31 ICSE 18

  32. ICSE 18

  33. Commit history ≈ Java-based projects. What about C, Python, JS, etc.? (ICSE 18)

  34. RefDiff 2.0: Detecting Refactorings in Multiple Programming Languages, by Danilo Silva, PhD student

  35. RefDiff 1.0 35 6 MSR 17

  36. RefDiff 1.0 36 Java? MSR 17

  37. RefDiff 1.0: Java? YES. But it already included the first language-agnostic design decisions (MSR 17)

  38. How to design a language-agnostic refactoring detection tool? 38

  39. What features do we have in all languages? 39

  40. What features do we have in all languages? 1. Tokens

  41. What features do we have in all languages? 2. Containment Hierarchy: • a program is not a flat list of tokens • Examples: ◦ C: Tokens → Functions → Files ◦ Java: Tokens → Methods → Classes → Packages ◦ JavaScript: Tokens → Functions → Files • Tokens + Containment Hierarchy ⇒ Code Structure Tree (CST)

  42. 42 Code Structure Tree

  43. CST vs AST • The difference is not in C vs A, but in the "S": ◦ CST ⇒ Structure (or hierarchy), which we have in all languages ◦ AST ⇒ Syntax, which is language-specific • CSTs are "universal" ASTs

  44. CST vs AST • CSTs are also simpler structures than ASTs • Argument #1: ◦ JDT AST: 112 classes ◦ CST: 1 class • Argument #2: ◦ We need to implement 4 visitors to generate CSTs from Java ASTs • We implemented CSTs for three languages: ◦ Java, JavaScript, C (by an MSc student, in a course project)

  45. RefDiff 2.0 Architecture: Program v1, Program v2 → CST Plug-in (for language X) → CST v1, CST v2 → RefDiff → Refactorings (in language X)

  46. RefDiff 2.0 Architecture: Program v1, Program v2 → CST Plug-in ("simple" to implement) → CST v1, CST v2 → RefDiff (language agnostic) → Refactorings

  47. CST Nodes • Each CST node has ◦ ID ◦ Namespace ◦ Type: function, method, class, package, file, etc. ◦ Parameter list (optional) ◦ Tokens
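
The node fields listed on this slide could be modeled roughly as follows. This is only a sketch: the class name `CstNode`, the field names, and the `add_child` helper are hypothetical and not taken from RefDiff's code base; the `parent`/`children` links are an assumed way to record the containment hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; avoids recursing through parent/children
class CstNode:
    """One node of a Code Structure Tree (hypothetical field names)."""
    node_id: str                              # ID
    namespace: str                            # namespace, e.g. an enclosing package or file
    node_type: str                            # "function", "method", "class", "package", "file", ...
    tokens: List[str]                         # tokens of the node's source code
    parameters: Optional[List[str]] = None    # parameter list (optional)
    parent: Optional["CstNode"] = field(default=None, repr=False)
    children: List["CstNode"] = field(default_factory=list, repr=False)

    def add_child(self, child: "CstNode") -> "CstNode":
        """Attach child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


# Java-like hierarchy: Tokens -> Methods -> Classes -> Packages
pkg = CstNode("org.example", "", "package", [])
cls = pkg.add_child(CstNode("Stack", "org.example.", "class", ["class", "Stack"]))
push = cls.add_child(
    CstNode("push", "org.example.Stack.", "method",
            ["void", "push", "(", "int", "x", ")"], parameters=["int"])
)
```

A single node class is enough here because the language-specific detail stays in the tokens, while the tree only records what contains what.
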
  48. RefDiff Algorithm 48

  49. Key step: matching CST nodes n1 and n2

  50. Matching Relationships 50

  51. Same Relationship: same type

  52. Same Relationship: same IDs

  53. Same Relationship: same parent

  54. Pull Up Relationship: same type, same IDs

  55. Pull Up Relationship: same type, same IDs, parent(n1) is a subtype of parent(n2). [Diagram labels: n1, n2, (n1)'; before, after]

  56. Pull Up Relationship: same type, same IDs, parent(n1) is a subtype of parent(n2). [Diagram labels: n1, n2, (n1)'; before, after] If the language does not have inheritance: subtype(c1, c2) = false, for all c1, c2
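
The three conditions for the Pull Up relationship can be written as one predicate. The sketch below is an illustration under assumptions: `Node`, `is_pull_up`, `never_subtype`, and the `supertypes` table are hypothetical names, and the subtype test is passed in as a function so that a language without inheritance can supply one that always returns false, as the slide suggests.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a CST node (hypothetical)."""
    node_id: str
    node_type: str
    parent: Optional["Node"] = None


def never_subtype(c1: Node, c2: Node) -> bool:
    """For languages without inheritance: subtype(c1, c2) = false for all c1, c2."""
    return False


def is_pull_up(n1: Node, n2: Node,
               subtype: Callable[[Node, Node], bool] = never_subtype) -> bool:
    """n1 (before) and n2 (after) have the same type and ID, and
    parent(n1) is a subtype of parent(n2)."""
    if n1.parent is None or n2.parent is None:
        return False
    return (n1.node_type == n2.node_type
            and n1.node_id == n2.node_id
            and subtype(n1.parent, n2.parent))


# Example: method "area" moves from class Circle up to its supertype Shape.
shape = Node("Shape", "class")
circle = Node("Circle", "class")
area_before = Node("area", "method", parent=circle)
area_after = Node("area", "method", parent=shape)

supertypes = {"Circle": {"Shape"}}  # assumed inheritance table for the example


def class_subtype(c1: Node, c2: Node) -> bool:
    return c2.node_id in supertypes.get(c1.node_id, set())


pulled_up = is_pull_up(area_before, area_after, class_subtype)
pulled_up_without_inheritance = is_pull_up(area_before, area_after)
```

With `class_subtype` the pair is reported as a pull up; with the default `never_subtype` no pair ever is, so the same matching code can run unchanged on a language such as C.
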
  57. Rename Relationship: same type, same containers

  58. Rename Relationship: same type, same containers, different names

  59. Rename Relationship: same type, same containers, different names. It's common to have renaming + edits: tokens(n1) ≠ tokens(n2). We need to tolerate edits in the code (n2)
  60. Code Similarity: Weighted Jaccard Coefficient. sim(e1, e2) = Σ_t min(m1(t), m2(t)) × idf(t) / Σ_t max(m1(t), m2(t)) × idf(t), where: - e_i: tokens in a CST node - m_i(t): number of occurrences of token t in e_i - idf: idf coefficient
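
A weighted Jaccard coefficient over token multisets is usually computed as the idf-weighted sum of the minimum counts divided by the idf-weighted sum of the maximum counts. The sketch below follows that usual definition; the function names are hypothetical, and the idf form `log(1 + N / n_t)` (N nodes in total, n_t of them containing token t) is an assumption, since the slide only says "idf coefficient".

```python
import math
from collections import Counter
from typing import Dict, List


def idf_weights(all_nodes: List[List[str]]) -> Dict[str, float]:
    """Assumed idf form: log(1 + N / n_t), with N nodes and n_t nodes containing t."""
    n = len(all_nodes)
    doc_freq = Counter(t for tokens in all_nodes for t in set(tokens))
    return {t: math.log(1 + n / df) for t, df in doc_freq.items()}


def weighted_jaccard(e1: List[str], e2: List[str], idf: Dict[str, float]) -> float:
    """Sum of min(m1(t), m2(t)) * idf(t) over sum of max(m1(t), m2(t)) * idf(t)."""
    m1, m2 = Counter(e1), Counter(e2)          # m_i(t): occurrences of t in e_i
    vocab = set(m1) | set(m2)
    num = sum(min(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocab)
    den = sum(max(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocab)
    return num / den if den else 0.0


# Token lists of three hypothetical CST nodes.
a = ["return", "x", "+", "y"]
b = ["return", "x", "-", "y"]
c = ["print", "(", "msg", ")"]
idf = idf_weights([a, b, c])

sim_ab = weighted_jaccard(a, b, idf)   # mostly shared tokens
sim_ac = weighted_jaccard(a, c, idf)   # no shared tokens
```

The idf weight makes rare tokens, such as a distinctive identifier, count more than tokens that appear in almost every node, such as `return`.
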
  61. Rename Relationship: same type, same containers, different names, similar tokens (jaccard > 0.5)
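
Put together, the Rename conditions form a four-part check. In the sketch below the names `Node`, `is_rename`, and `plain_jaccard` are hypothetical, the container is simplified to a string such as a file name, and an unweighted set-based Jaccard stands in for the weighted coefficient; only the 0.5 threshold comes from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a CST node (hypothetical)."""
    name: str
    node_type: str
    tokens: List[str]
    container: str   # simplified: name of the enclosing file, class, etc.


def plain_jaccard(t1: List[str], t2: List[str]) -> float:
    """Unweighted set-based Jaccard, used here only as a simple stand-in."""
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def is_rename(n1: Node, n2: Node,
              similarity: Callable[[List[str], List[str]], float] = plain_jaccard,
              threshold: float = 0.5) -> bool:
    """Same type, same container, different names, and similar tokens."""
    return (n1.node_type == n2.node_type
            and n1.container == n2.container
            and n1.name != n2.name
            and similarity(n1.tokens, n2.tokens) > threshold)


# A function renamed and lightly edited in the same file.
before = Node("calcTotal", "function", ["return", "a", "+", "b", "+", "c"], "utils.js")
after = Node("computeTotal", "function", ["return", "a", "+", "b", "+", "d"], "utils.js")
unrelated = Node("logError", "function", ["console", ".", "error", "(", "msg", ")"], "utils.js")

renamed = is_rename(before, after)
not_renamed = is_rename(before, unrelated)
```

Requiring similarity above a threshold rather than identical tokens is what lets the check tolerate the small edits that often accompany a rename.
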
  62. Evaluation 62

  63. JavaScript Results: Precision: 91%, Recall: 88%

  64. C Results: Precision: 88%, Recall: 91%

  65. Java Results: Precision: 96% (vs 99% RMiner), Recall: 80% (vs 81% RMiner)

  66. If you want to use RefDiff 2.0 https://github.com/aserg-ufmg/RefDiff 66

  67. Summary 67

  68. "This is only an engineering problem" is a false argument, at least when detecting refactorings

  69. Supporting multiple languages can increase the practical impact of refactoring research

  70. Supporting multiple languages can increase the practical impact of refactoring research

  71. Thanks! Marco Tulio Valente, ASERG, DCC, UFMG, BR

  72. We also need call graphs • To detect refactorings, it is also useful to have a call graph • Call graph: node A calls node B • We also embedded a lightweight call graph in CSTs
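
One way to picture a lightweight call graph embedded in a CST is a set of "A calls B" edges stored next to the nodes. The sketch below is hypothetical: the class `CstWithCalls` and its methods are illustrative names, not RefDiff's actual data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CstWithCalls:
    """CST nodes plus call edges between them (hypothetical structure)."""
    node_types: Dict[str, str] = field(default_factory=dict)   # node ID -> node type
    calls: Dict[str, Set[str]] = field(default_factory=dict)   # caller ID -> callee IDs

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type
        self.calls.setdefault(node_id, set())

    def add_call(self, caller: str, callee: str) -> None:
        """Record the edge 'caller calls callee'; both must be known nodes."""
        if caller not in self.node_types or callee not in self.node_types:
            raise KeyError("both ends of a call edge must be CST nodes")
        self.calls[caller].add(callee)

    def callers_of(self, callee: str) -> Set[str]:
        """All nodes that call the given node."""
        return {caller for caller, callees in self.calls.items() if callee in callees}


cst = CstWithCalls()
cst.add_node("A", "function")
cst.add_node("B", "function")
cst.add_call("A", "B")   # node A calls node B
```

A plausible use, stated here as an assumption, is checking whether a newly added function is called from the function whose code it resembles, which is the typical shape of an extracted function.
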