Slide 1

Slide 1 text

Refactoring Mining The key to unlock software evolution Nikolaos Tsantalis Concordia University

Slide 2

Slide 2 text

Thank you IWoR

Slide 3

Slide 3 text

Who am I RefactoringMiner

Slide 4

Slide 4 text

Refactoring Miner

Slide 5

Slide 5 text

Alexander Chatzigeorgiou Eleni Stroulia Zhenchang Xing Fabio Rocha

Slide 6

Slide 6 text

CASCON'13 version 0.0

Slide 7

Slide 7 text

• Ideas inspired from UMLDiff • Structure of method bodies is ignored • Method body includes only method calls and field accesses • Most refactorings detected based on signature matching • Only precision is provided • Evaluation included only 3 systems

Slide 8

Slide 8 text

Refactoring motivations

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Why did I stop working on this research? Ref-Finder (ICSM 2010): "The precision and recall on open source projects were 0.74 and 0.96 respectively." "Since these programs did not document refactorings, we created a set of correct refactorings by running REF-FINDER with a similarity threshold (σ=0.65) and manually verified them. We then measured a recall by comparing this set with the results found using a higher threshold (σ=0.85)"

Slide 11

Slide 11 text

Lesson # 1 Don't be intim ida te d by s u p e r re sults Always que stion the results

Slide 12

Slide 12 text

Until one day... Danilo Silva Marco Tulio Valente

Slide 13

Slide 13 text

Where was the code?

Slide 14

Slide 14 text

Lesson #2 Always make your code & data available in a repository You never know what it can enable in the future

Slide 15

Slide 15 text

Open Science initiatives M. T. Baldassarre, N. Ernst, B. Hermann, T. Menzies, R. Yedida Mandatory Data Availability field

Slide 16

Slide 16 text

Let's do an empirical study about refactoring We must study Why We Refactor

Slide 17

Slide 17 text

ICSE'16 version 0.1

Slide 18

Slide 18 text

• Danilo developed the API of RefactoringMiner • Tooling for checking out and parsing Git commits • Infrastructure for monitoring GitHub projects • Automatic generation of emails to contact developers • A web app for thematic analysis

Slide 19

Slide 19 text

Firehouse interview • Monitored 124 GitHub projects between June 8th and August 7th, 2015 • Sent 465 emails and received 195 responses (42%) • +27 commits with a description explaining the reasons • Compiled a catalogue of 44 distinct motivations for 12 well-known refactoring types

Slide 20

Slide 20 text

Motivation Catalogue Artifact https://github.com/aserg- ufmg/why-we-refactor

Slide 21

Slide 21 text

ICSE'16 rejection Reviewer #1: "A major threat to the research is not discussed or considered, that RefFinder has poor recall (0.24 [31]). The authors did a good job of combating the low-precision by manually inspecting results, the low recall is not discussed or dealt with."

Slide 22

Slide 22 text

Lesson #3 Even ICSE reviewers make mistakes Don't get disappointed. Improve, advance, re-submit

Slide 23

Slide 23 text

FSE'16 re-submission • Renamed the tool from RefDetector to RefactoringMiner • Addressed most of reviewers' comments

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

RefDiff MSR'17 RefactoringMiner ...

Slide 26

Slide 26 text

The path of Virtue or Vice

Slide 27

Slide 27 text

Rebirth ICSE’18 version 1.0

Slide 28

Slide 28 text

Limitations of previous approaches 1. Dependence on similarity thresholds • thresholds need calibration for projects with different characteristics 2. Dependence on built versions • only 38% of the change history can be successfully compiled [Tufano et al., 2017] 3. Unreliable oracles for evaluating precision/recall • Incomplete (refactorings found in release notes or commit messages) • Biased (applying a single tool with two different similarity thresholds) • Artificial (seeded refactorings)

Slide 29

Slide 29 text

Thresholds are banned

Slide 30

Slide 30 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } After Before

Slide 31

Slide 31 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(int count) { List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } After Before

Slide 32

Slide 32 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } After Before private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", ports.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; }

Slide 33

Slide 33 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } After Before private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", ports.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; }

Slide 34

Slide 34 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { } return addresses; } try { addresses[i] = new Address("127.0.0.1", ports.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } After Before

Slide 35

Slide 35 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address(host, port); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before

Slide 36

Slide 36 text

protected static Address createAddress(String host, int port) { try { return new Address(host, port); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } After Before textual similarity ≈ 30%

Slide 37

Slide 37 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address(host, port); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (1) Abstraction

Slide 38

Slide 38 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address(host, port); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (1) Abstraction

Slide 39

Slide 39 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address(host, port); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (2) Argumentization

Slide 40

Slide 40 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address("127.0.0.1", ports.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (2) Argumentization

Slide 41

Slide 41 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address("127.0.0.1", ports.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (3) AST Node Replacements

Slide 42

Slide 42 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before (3) AST Node Replacements

Slide 43

Slide 43 text

private static Address[] createAddresses(int count) { Address[] addresses = new Address[count]; for (int i = 0; i < count; i++) { try { addresses[i] = new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } } return addresses; } private static List
createAddresses(AtomicInteger ports, int count){ List
addresses = new ArrayList
(count); for (int i = 0; i < count; i++) { addresses.add(createAddress("127.0.0.1", ports.incrementAndGet())); } return addresses; } protected static Address createAddress(String host, int port) { try { return new Address("127.0.0.1", PORTS.incrementAndGet()); } catch (UnknownHostException e) { e.printStackTrace(); } return null; } After Before textual similarity = 100%

Slide 44

Slide 44 text

Reliable oracle • We used the “Why We Refactor” dataset (538 commits from 185 open source projects) • We executed both available tools (RefactoringMiner and RefDiff) • Converted the output of the tools to the same format • We manually validated all refactoring instances (4,108 unique instances, out of which 3,188 were true positives and 920 were false positives) • The validation process was labor-intensive and involved 3 validators for a period of 3 months (i.e., 9 person-months) • To compute recall, we considered the union of the true positives reported by both tools as the ground truth. Matin Mansouri Laleh Eshkevari Davood Mazinanian

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

ICSE’18 retrospective • Our solution was still far from perfect • Time pressure • It was urgent to establish RefactoringMiner with a publication • ICSE deadline: August 25, 2017 • Sophia’s birth: August 15, 2017 • Despite working over 1 year on this paper, there was still space for improvement • The entire team graduated after this work

Slide 48

Slide 48 text

Wisdom TSE’20 version 2.0

Slide 49

Slide 49 text

• Added more replacement types • 20 new sub-method level refactoring types detected based on AST node replacements • Nested refactoring detection • Refactoring inference • Improved the matching of method calls to method declarations with argument type inference

Slide 50

Slide 50 text

Even more reliable oracle • We executed all available tools • RefactoringMiner 1.0 and 2.0 • RefDiff 0.1.1, 1.0, 2.0 • GumTreeDiff 2.1.2 • Converted the output of the tools to the same format • We validated 5,830 new unique refactoring instances, out of which 4,038 were true positives and 1,792 were false positives. • 7,226 true positives in total for 40 different refactoring types (72% of true instances are detected by two or more tools) Ameya Ketkar

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Emperor version 2.2

Slide 54

Slide 54 text

From loop and if to Stream API .forEach() and .filter()

Slide 55

Slide 55 text

Current status • 85 supported refactoring types • 10,900 validated true positives in the oracle • Precision: 99.7% • Recall: 97% • Most tested and reliable version (200K commits without exception) • Independent studies confirm it has the best precision

Slide 56

Slide 56 text

Lesson #4 What it takes to make a reliable & usable tool Minimum 5 years of research and development Extensive testing Detailed documentation (README with API code snippets) Supporting users (200+ issues resolved) Stable project leader Great team

Slide 57

Slide 57 text

The promises

Slide 58

Slide 58 text

Promise #1 “researchers can replicate existing empirical studies and refute or confirm previously-held beliefs.”

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Threats to validity • Both studies relied on Ref-Finder • Independent studies revealed that Ref-Finder had low precision/recall • 35% precision, 24% recall [Soares et al. JSS 2013] • 27% precision [Kádár et al. PROMISE 2016] • Ref-Finder paper claimed 74% precision, 96% recall • Collected refactoring info at release level (between two versions) • Coarse-grained analysis led to strong and unsafe assumptions

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

Lesson #5 A tool comes with a huge responsibility for its authors We should go above and beyond when evaluating its accuracy

Slide 63

Slide 63 text

Promise #2 “reduce the noise created by refactorings, such as file/directory renaming, and significantly improve the accuracy of other tools.”

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Lesson #6 Making existing code evolution analysis techniques refactoring-aware can improve their accuracy

Slide 66

Slide 66 text

Promise #3 “our oracle of true refactorings from 538 commits across 185 projects provides an invaluable resource for validating novel refactoring tools and for comparing existing approaches”

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Promise #4 “enable online refactoring detection on partial input, when a developer inspects a code diff to review a change, or tries to understand code evolution selectively”

Slide 69

Slide 69 text

• Partially refactoring-aware • Supports method signature changes • Method moves • File renames/moves • Uses thresholds when comparing program elements, calibrated on a training set of 100 methods • Mismatches methods from which a significant part of their body has been extracted to new methods, as it uses a 75% body similarity threshold to match modified methods • Ultra-fast: Less than 2 seconds to fetch the entire change history of a method

Slide 70

Slide 70 text

Challenge How can we take advantage of RefactoringMiner accuracy in a way that is not computationally expensive? Mehran Jodavi Solution Partial and incremental commit analysis based on the location of the tracked program element

Slide 71

Slide 71 text

Precision +7.5% Recall +3.7% Precision +9.1% Recall +5.8% > 97% precision/recall change level Method tracking commit level Method tracking change level Variable tracking > 98% precision/recall commit level

Slide 72

Slide 72 text

Promise #5 “Integrating RefactoringMiner with the diff and code review tools can raise the level of abstraction for code changes originating from refactorings, thus helping developers better understand the code evolution”

Slide 73

Slide 73 text

Diff Hunk Graph Nodes: diff hunks Hard links: syntactic constraints between the AST nodes Soft links: repetitive change patterns Refactoring links: edits in different locations originating from the same refactoring Cosmetic links: changes in comments & reformatting

Slide 74

Slide 74 text

Promise #6 “refactoring operations can be automatically documented at commit-time to provide a more detailed description of the applied changes in the commit message”

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Mining for Learning

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

File f = new File(:[v0], :[v1]) → Path f = :[v0].resolve(:[v1]) FileOutputStream stream = new FileOutputStream(:[v0]) → OutputStream stream = Files.newOutputStream(:[v0]) → Ameya Ketkar Rewrite rules

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

Mining for Recommending

Slide 82

Slide 82 text

Why We Refactor++ A large-scale empirical study • We developed detection rules for refactoring motivations • We optimized the rules on the FSE’16 “Why We Refactor” dataset • We validated their accuracy on the TOSEM’2020 dataset • Precision: 98.4% Recall: 93.5% • We collected the motivations for 346K Extract Method instances found in 132,897 commits of 325 open-source repositories Sadegh Aalizadeh J. Pa ntiuchina , F. Za m pe tti, S. Sca la brino, V. Pia nta dosi, R. Olive to, G. Bavota , a nd M. Di Pe nta , "Why Deve lope rs Re fa ctor Source Code : A Mining-ba se d Study,“ ACM Tra nsa ctions on Softwa re Engine e ring a nd Me thodology, Volum e 29, Issue 4, Article 29, Se pte m be r 2020.

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Refactoring recommendation tools for top-5 Extract Method motivations 1. Reusable Method [NONE] 2. Remove Duplication [CeDAR (Tairas and Gray, IST’12), Creios (Hotta et al., CSMR’12), Rase (Meng et al., ICSE’15), JDeodorant (Tsantalis et al., TSE’15 + ICSE’17), CRec (Yue et al., ICSME’18)] 3. Facilitate Extension [FR-Refactor (Ally S. Nyamawe et al. RE’19 + EMSE’20)] 4. Decompose for Readability [JDeodorant (Tsantalis and Chatzigeorgiou, JSS’11), JExtract (Silva et al., ICPC’14), SEMI (Charalampidou et al., TSE’17), GEMS (Xu et al., ISSRE’17)] 5. Alternative Signature [NONE]

Slide 85

Slide 85 text

Mining for Accuracy

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

No content

Slide 88

Slide 88 text

Making better refactoring mining tools is a community effort Mining competitions hosted in the Workshop Community-created benchmarks Hackathons for improving tools

Slide 89

Slide 89 text

https://github.com/tsantalis/RefactoringMiner https://github.com/JetBrains-Research/RefactorInsight https://github.com/JetBrains-Research/kotlinRMiner https://github.com/JetBrains-Research/data-driven-type-migration Oleg Smirnov Ameya Ketkar Zarina Kurbatova https://github.com/jodavimehran/code-tracker Mehran Jodavi