Beyond Java: Expanding Refactoring Research to Multiple Programming Languages (IWoR 2019)

Traditionally, academic research on refactoring has focused on Java programs. In recent years, however, other programming languages have gained momentum and are being used to build highly successful applications. In this talk, I will first motivate the need to expand refactoring research to other programming languages. Second, I will argue that this support is not merely an engineering and portability issue; on the contrary, it demands major changes to the tools and algorithms previously developed by researchers in the area. Finally, I will present our current effort to evolve and redesign RefDiff, our refactoring detection tool, to work with multiple languages and programming paradigms.

ASERG, DCC, UFMG

May 28, 2019

Transcript

  1. Beyond Java: Expanding Refactoring Research
    to Multiple Programming Languages
    Marco Tulio Valente
    ASERG, DCC, UFMG, BR
    @mtov
    1

  2. 2

  3. 3
    Food Mountains
    Universities & Science
    Belo Horizonte

  4. Expanding Refactoring Research to Multiple
    Programming Languages
    4

  5. GitHub 2018
    5
    https://octoverse.github.com/projects#languages

  6. Stack Overflow Survey 2019
    6
    https://insights.stackoverflow.com/survey/2019#technology

  7. TIOBE Index (May 2019)
    7
    https://www.tiobe.com/tiobe-index/

  8. Software engineering is not dominated by a
    single language
    8

  9. … but refactoring research is dominated by Java
    9

  10. Refactoring Papers
    10
    ● Study by Marouane Kessentini, Danny Dig et al.
    ● Java dominates
    ● More papers using Java than all other languages combined
    ● C++ is the 2nd language
    Danny Dig's keynote talk at WAPI 2018
    https://w-api.github.io/resources/wapi18_dig_refactoring.pdf

  11. … and refactoring practice?
    11

  12. Fowler, 2nd ed. (2018)
    12
    The first edition of this book used Java …
    [In this 2nd edition] I chose JavaScript to
    illustrate these refactorings ...

  13. … supporting multiple languages is only an
    engineering problem!
    13

  14. I'll counter-argue using 5 papers & RefDiff 2.0
    14

  15. 15
    1
    FSE 16

  16. 16
    Firehouse interviews:
    * 195 developers
    * 124 projects
    * 436 refactoring instances
    * 12 refactoring operations
    * 44 reasons for refactoring
    FSE 16

  17. 17
    Firehouse interviews:
    * 195 developers
    * 124 projects
    * 436 refactoring instances
    * 12 refactoring operations
    * 44 reasons for refactoring
    FSE 16

  18. 18
    GitHub contributors ≈ Java projects
    Do our findings generalize to other languages?
    FSE 16

  19. 19
    2
    SPLASH 17

  20. 20
    SPLASH 17

  21. 21
    SPLASH 17

  22. 22
    Do these findings generalize to other languages? Python? C#?
    SPLASH 17

  23. 23
    3
    IEEE TSE 12

  24. 24
    IEEE TSE 12

  25. 25
    IEEE TSE 12

  26. 26
    Refactoring engines ≈ Java-based IDEs
    Do we have refactoring engines for other languages?
    If yes, can we reuse this technique?
    IEEE TSE 12

  27. 27
    4
    JSS 18

  28. 28
    JSS 18

  29. 29
    JMove ⇒ Java-based tool
    How to infer dependencies in JavaScript (e.g., using Facebook's Flow)?
    Can we reuse these ideas to build JSMove?

  30. 30
    5
    ICSE 18

  31. 31
    ICSE 18

  32. ICSE 18

  33. Commit-History ≈ Java-based projects
    C, Python, JS, etc.?
    ICSE 18

  34. RefDiff 2.0: Detecting Refactorings
    in Multiple Programming Languages
    by Danilo Silva, PhD student
    34

  35. RefDiff 1.0
    35
    6
    MSR 17

  36. RefDiff 1.0
    36
    Java?
    MSR 17

  37. RefDiff 1.0
    37
    Java? YES
    But with the first language-agnostic design decisions
    MSR 17

  38. How to design a language-agnostic refactoring
    detection tool?
    38

  39. What features do we have in all languages?
    39

  40. What features do we have in all languages?
    1. Tokens
    40

  41. What features do we have in all languages?
    2. Containment Hierarchy:
    ● a program is not a flat list of tokens
    ● Examples:
    ○ C: Tokens → Functions → Files
    ○ Java: Tokens → Methods → Classes → Packages
    ○ JavaScript: Tokens → Functions → Files
    ● Tokens + Containment Hierarchy ⇒ Code Structure Tree (CST)
    41

  42. 42
    Code Structure Tree

  43. CST vs AST
    ● The difference is not in C vs A, but in the "S":
    ○ CST ⇒ Structure (or hierarchy), which we have in all languages
    ○ AST ⇒ Syntax, which is language-specific
    ● CSTs are "universal" ASTs
    43

  44. CST vs AST
    ● CSTs are also simpler structures than ASTs
    ● Argument #1:
    ○ JDT AST: 112 classes
    ○ CST: 1 class
    ● Argument #2:
    ○ We need to implement 4 visitors to generate CSTs from Java ASTs
    ● We implemented CSTs for three languages:
    ○ Java, JavaScript, C (by an MSc student, in a course project)
    44

  45. RefDiff 2.0 Architecture
    45
    [Architecture diagram: Program v1, Program v2 → CST Plug-in → CST v1, CST v2 → RefDiff → Refactorings; labels: "in language X", "for language X"]
  46. RefDiff 2.0 Architecture
    46
    [Architecture diagram: Program v1, Program v2 → CST Plug-in → CST v1, CST v2 → RefDiff → Refactorings; labels: "language agnostic", "'simple' to implement"]

  47. CST Nodes
    ● Each CST node has
    ○ ID
    ○ Namespace
    ○ Type: function, method, class, package, file, etc.
    ○ Parameter list (optional)
    ○ Tokens
    47
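
As a rough illustration, the node fields listed above could be modelled with a single class. The field names follow the slide; the `parent`/`children` links, the `add` and `path` helpers, and the example values are assumptions added here to show the containment hierarchy, not part of RefDiff.

```python
from dataclasses import dataclass, field
from typing import Optional


# eq=False: compare nodes by identity, so parent/child cycles never recurse.
@dataclass(eq=False)
class CstNode:
    id: str                                   # identifier of the node
    namespace: str                            # e.g. "collections.Stack."
    type: str                                 # "function", "method", "class", ...
    tokens: list[str] = field(default_factory=list)
    parameters: Optional[list[str]] = None    # only for function-like nodes
    parent: Optional["CstNode"] = field(default=None, repr=False)
    children: list["CstNode"] = field(default_factory=list)

    def add(self, child: "CstNode") -> "CstNode":
        """Attach child under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Containment path from the root down to this node."""
        parts, node = [], self
        while node is not None:
            parts.append(node.id)
            node = node.parent
        return "/".join(reversed(parts))


# Tokens → Methods → Classes → Packages, as in the Java example.
pkg = CstNode(id="collections", namespace="", type="package")
cls = pkg.add(CstNode(id="Stack", namespace="collections.", type="class"))
push = cls.add(CstNode(
    id="push", namespace="collections.Stack.", type="method",
    parameters=["item"],
    tokens=["items", ".", "append", "(", "item", ")"]))
print(push.path())  # collections/Stack/push
```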

  48. RefDiff Algorithm
    48

  49. Key step: matching CST nodes
    49
    n1, n2

  50. Matching Relationships
    50

  51. Same Relationship
    51
    same type

  52. Same Relationship
    52
    same IDs

  53. Same Relationship
    53
    same parent

  54. Pull Up Relationship
    54
    same type, same IDs

  55. Pull Up Relationship
    55
    same type, same IDs, parent(n1) is a subtype of parent(n2)
    [Figure: nodes n1, n2, and (n1)'; before and after]

  56. Pull Up Relationship
    56
    same type, same IDs, parent(n1) is a subtype of parent(n2)
    [Figure: nodes n1, n2, and (n1)'; before and after]
    If the language does not have inheritance:
    subtype(c1, c2) = false, for all c1, c2
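
The Same and Pull Up conditions from the preceding slides can be written as two small predicates. This is a minimal sketch: the `Node` class, the function names, and representing a parent by its ID string are assumptions made for illustration. The `subtype` hook is the only language-specific piece, and it defaults to "always false" for languages without inheritance, as the slide states.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    id: str
    type: str
    parent: Optional[str] = None   # ID of the containing node (simplified)


# Language-specific hook: subtype(c1, c2) is True when c1 is a subtype of c2.
SubtypeFn = Callable[[str, str], bool]


def no_inheritance(c1: str, c2: str) -> bool:
    """Default for languages without inheritance: never a subtype."""
    return False


def is_same(n1: Node, n2: Node) -> bool:
    """Same relationship: same type, same IDs, same parent."""
    return n1.type == n2.type and n1.id == n2.id and n1.parent == n2.parent


def is_pull_up(n1: Node, n2: Node, subtype: SubtypeFn = no_inheritance) -> bool:
    """Pull Up: same type, same IDs, parent(n1) is a subtype of parent(n2)."""
    if n1.parent is None or n2.parent is None:
        return False
    return (n1.type == n2.type and n1.id == n2.id
            and subtype(n1.parent, n2.parent))


# Hypothetical Java-like hierarchy: Circle extends Shape.
hierarchy = {("Circle", "Shape")}


def java_like_subtype(c1: str, c2: str) -> bool:
    return (c1, c2) in hierarchy


before = Node("area", "method", "Circle")
after = Node("area", "method", "Shape")
```

With `java_like_subtype`, `is_pull_up(before, after, java_like_subtype)` holds; with the default hook the same pair is never reported as a Pull Up.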

  57. Rename Relationship
    57
    same type, same containers

  58. Rename Relationship
    58
    same type, same containers, different names

  59. Rename Relationship
    59
    same type, same containers, different names
    It's common to have renaming + edits
    tokens(n1) ≠ tokens(n2)
    We need to tolerate edits in code (n2)

  60. Code Similarity: Weighted Jaccard Coefficient
    where:
    - e_i: tokens in a CST node
    - m_i(t): number of occurrences of token t in e_i
    - idf: idf coefficient
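
A small sketch of a weighted Jaccard coefficient over token multisets follows. The formula itself is not captured in this transcript, so the form used here is an assumption: the sum over tokens of min(m1(t), m2(t)) × idf(t), divided by the sum of max(m1(t), m2(t)) × idf(t). The idf definition, log(1 + N / n_t), is likewise one common variant chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf_weights(all_token_lists):
    """Assumed idf(t) = log(1 + N / n_t): N nodes, n_t of them contain t."""
    n_nodes = len(all_token_lists)
    doc_freq = Counter()
    for tokens in all_token_lists:
        doc_freq.update(set(tokens))      # count each token once per node
    return {t: math.log(1 + n_nodes / n) for t, n in doc_freq.items()}


def weighted_jaccard(tokens1, tokens2, idf):
    """Similarity in [0, 1] between two token multisets."""
    m1, m2 = Counter(tokens1), Counter(tokens2)   # m_i(t): occurrences of t
    vocabulary = set(m1) | set(m2)
    # Tokens missing from the idf table fall back to a neutral weight of 1.0.
    num = sum(min(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocabulary)
    den = sum(max(m1[t], m2[t]) * idf.get(t, 1.0) for t in vocabulary)
    return num / den if den else 0.0


a = ["return", "x", "+", "1"]
b = ["return", "x", "+", "2"]
idf = idf_weights([a, b])
similarity = weighted_jaccard(a, b, idf)
```

Because rare tokens get a higher idf, the two bodies above differ in exactly the tokens that carry the most weight, which pulls their similarity down more than a plain Jaccard coefficient would.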

  61. Rename Relationship
    61
    same type, same containers, different names, similar tokens
    jaccard > 0.5
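
Putting the four Rename conditions together gives a predicate along these lines. This is a sketch, not RefDiff's code: `Node`, `is_rename` and `multiset_jaccard` are hypothetical names, and the similarity function is a plain, unweighted multiset Jaccard used as a stand-in for the weighted coefficient. Only the 0.5 threshold comes from the slide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    type: str
    container: str
    tokens: tuple[str, ...]


def multiset_jaccard(tokens1, tokens2) -> float:
    """Unweighted stand-in for the weighted Jaccard coefficient."""
    m1, m2 = Counter(tokens1), Counter(tokens2)
    union = sum((m1 | m2).values())           # per-token maximum counts
    return sum((m1 & m2).values()) / union if union else 0.0


def is_rename(n1: Node, n2: Node,
              similarity=multiset_jaccard, threshold: float = 0.5) -> bool:
    """Same type, same container, different names, similar tokens."""
    return (n1.type == n2.type
            and n1.container == n2.container
            and n1.id != n2.id
            and similarity(n1.tokens, n2.tokens) > threshold)


old = Node("calc", "function", "math.js", ("return", "a", "+", "b", ";"))
new = Node("add", "function", "math.js",
           ("return", "a", "+", "b", "|", "0", ";"))
other = Node("log", "function", "math.js",
             ("console", ".", "log", "(", "a", ")"))
```

Here `old` and `new` share 5 of 7 tokens, so the renamed and slightly edited function still matches, while `other` shares only one token with `old` and is rejected.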

  62. Evaluation
    62

  63. JavaScript Results
    Precision: 91%
    Recall: 88%
    63

  64. C Results
    Precision: 88%
    Recall: 91%
    64

    View Slide

  65. Java Results
    Precision: 96% (vs 99% RMiner)
    Recall: 80% (vs 81% RMiner)
    65

  66. If you want to use RefDiff 2.0
    https://github.com/aserg-ufmg/RefDiff
    66

  67. Summary
    67

  68. "this is only an engineering problem" is a false
    argument, at least when detecting refactorings
    68

  69. Supporting multiple languages can increase the
    practical impact of refactoring research
    69

  70. Supporting multiple languages can increase the
    practical impact of refactoring research
    70

  71. Thanks!
    Marco Tulio Valente
    ASERG, DCC, UFMG, BR
    71

  72. We also need call graphs
    ● To detect refactorings, it's also interesting to have a call graph
    ● Call graph: node A calls node B
    ● We also embedded a lightweight call graph in CSTs
    72
    call graph
