Will this clone be short-lived? Towards understanding of the characteristics of short-lived clones

Code clones are created when a developer duplicates a code fragment to reuse existing functionality. Mitigating clones by refactoring them helps ease the long-term maintenance of large software systems. However, refactoring can introduce an additional cost. Prior work also suggests that refactoring all clones can be counterproductive, since clones may live in a system for only a short duration. Hence, it is beneficial to determine in advance whether a newly-introduced clone will be short-lived or long-lived, in order to plan the most effective use of resources. In this work, we perform an empirical study on six open source Java systems to better understand the life expectancy of clones. We find that a large proportion of clones (i.e., 30% to 87%) lived in the systems for a short duration. Moreover, we find that although short-lived clones were changed more frequently than long-lived clones throughout their lifetime, short-lived clones were consistently changed with their siblings less often than long-lived clones. Furthermore, we build random forest classifiers to determine the life expectancy of a newly-introduced clone (i.e., whether a clone will be short-lived or long-lived). Our empirical results show that our random forest classifiers can determine the life expectancy of a newly-introduced clone with an average AUC of 0.63 to 0.92. We also find that the churn made to the methods containing a newly-introduced clone, as well as the complexity and size of those methods, are highly influential in determining whether the newly-introduced clone will be short-lived. Furthermore, the size of a newly-introduced clone shares a positive relationship with the likelihood that the newly-introduced clone will be short-lived.
Our results suggest that, to improve the efficiency of clone management efforts, practitioners can leverage our classifiers and insights to determine in advance whether a newly-introduced clone will be short-lived or long-lived, and plan the use of their clone management resources accordingly.
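For readers less familiar with the AUC figures quoted above, the following is a minimal sketch of how an AUC value can be computed from classifier scores. It uses the pairwise (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen short-lived clone receives a higher score than a randomly chosen long-lived one. The function name `auc`, the label encoding (1 = short-lived, 0 = long-lived), and the example scores are assumptions made for illustration; they are not taken from the study, which does not state here how its AUC was computed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for short-lived, 0 for long-lived (hypothetical encoding).
    scores: classifier scores, higher meaning "more likely short-lived".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one clone of each class")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores for four clones (not data from the study).
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = auc(example_labels, example_scores)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which gives some context for the reported range of 0.63 to 0.92.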

These slides were presented at the International Conference on Software Engineering (2019)

Transcript

  1. Will this clone be short-lived? Towards understanding of the characteristics of short-lived clones - Journal First Presentation - Ahmed Hassan, Weiyi Shang, Patanamon (Pick) Thongtanunam - patanamon.thongtanunam@unimelb.edu.au - @patanamon
  2. Code clone: a group of code fragments that are nearly identical. For lower maintenance effort, these clones should be refactored to reduce code repetitiveness. Example files: Monopoly/IncomeTaxSquare.java, Monopoly/GoToJailSquare.java. “2,241 refactoring instances were detected in 285 GitHub projects [Silva et al., 2016]”
  5. Refactoring all clones may not be worthwhile. [Diagram: clone lifetimes across released versions, labelled 2 versions, 2 versions, and 10 versions.] 75% and 36% of volatile clones in the carol and dnsjava systems lived for a short duration [Kim et al., 2005]. Many of the long-lived clones cannot be removed using standard refactoring techniques [Kim et al., 2005].
  6. Refactoring all clones may not be worthwhile. [Diagram: clone lifetimes across released versions, labelled 2 versions, 2 versions, and 10 versions.] Determining the life expectancy of clones in advance may be beneficial when managing clones.
  7. Understanding clone genealogies and their life expectancy. Studied systems, Apache Pig among them, with histories as listed: 17 years, 22 releases; 10 years, 35 releases; 13 years, 66 releases; 14 years, 36 releases; 8 years, 15 releases; 11 years, 43 releases. (PQ1) How long do clones live in a software system? (PQ2) How were short-lived and long-lived clones changed throughout their lifetime?
  12. Identifying clone life expectancy. (1) From the code repository, extract sequentially developed versions using git commands (version labels shown: v2.17.1, v2.17.2, v2.17.3, v2.16.0, v2.17.0, v2.18.0, v2.18.1, v2.18.2, v2.18.3, v2.17.4). (2) Extract clone genealogies using iClones [Göde and Koschke, 2009] (examples shown: clones living for 2 versions and for 6 versions). (3) Identify short-lived and long-lived clones using a clustering technique (in the example, 2 versions is short and 6 versions is long).
  16. 30% (Maven) - 87% (Jackrabbit) of clones are short-lived clones. [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of the number of clones against the number of versions they lived; lifetimes are grouped into clusters, each labelled short-lived or long-lived using a clustering technique.] The life expectancy of short-lived clones accounts for <17% of all studied releases.
  18. Consistent changes appear in the long-lived clones more often than short-lived clones. [Diagram: clones are consistently changed.]
  20. Consistent changes appear in the long-lived clones more often than short-lived clones. [Bar chart: % of clone genealogies including consistently changing patterns, for short-lived and long-lived clones. Systems as listed: Ant, Camel, Jackrabbit, Maven, Pig, Tomcat. Values as listed: 35%, 25%, 25%, 26%, 37%, 31% and 19%, 9%, 14%, 15%, 16%, 10%.] The maintenance effort that is associated with short-lived clones is smaller than the maintenance effort associated with long-lived clones.
  22. Understanding clone genealogies and their life expectancy. (PQ1) How long do clones live in a software system? Many clones lived in the studied systems for a short duration. (PQ2) How were short-lived and long-lived clones changed throughout their lifetime? The maintenance effort for short-lived clones is smaller than that for long-lived clones. It is important to determine in advance whether a clone will be short-lived or long-lived to manage clones more efficiently.
  23. Building a classifier to determine the life expectancy of a newly-introduced clone. A clone, at the time when it was injected into the source code, is described by 38 metrics: product metrics (#Lines of Code), process metrics (Churn, #Developers), and clone metrics (#Clone Siblings). These metrics are used to train a classifier (Random Forest).
  24. Towards understanding of the characteristics of short-lived clones. A classifier (Random Forest) takes product metrics (#Lines of Code), process metrics (Churn, #Developers), and clone metrics (#Clone Siblings) and answers: short-lived or long-lived? (RQ1) How well can we determine whether an introduced clone will be short-lived? (RQ2) What are the most influential metrics for determining the clone life expectancy?
  25. Towards understanding of the characteristics of short-lived clones. (RQ1) How well can we determine whether an introduced clone will be short-lived? Our random forest classifiers achieve an average AUC of 0.63 to 0.92. (RQ2) What are the most influential metrics for determining the clone life expectancy? Process metrics: clones that are introduced with a large amount of churn made to their methods are more likely to be short-lived. Our classifiers and metrics can be used to determine whether a newly-introduced clone will be short-lived.
  30. Towards understanding of the characteristics of short-lived clones. [Recap of earlier slides: the per-system histograms of clone lifetimes, with short-lived clones seeing less consistent changes and long-lived clones more; building a classifier from clone, product, and process metrics to determine the life expectancy of clones; RQ1 and RQ2 with their findings (an average AUC of 0.63 to 0.92; clones introduced with a large amount of churn made to their methods are more likely to be short-lived).] Our classifiers and insights can help teams to plan the most effective use of the clone management resources. Contact: patanamon.thongtanunam@unimelb.edu.au, @patanamon, http://patanamon.com
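The "Identifying clone life expectancy" slides say that short-lived and long-lived clones are separated "using a clustering technique" but do not name it. As a rough illustration only, the sketch below splits clone lifetimes (number of versions a clone lived) into two groups with a plain one-dimensional two-means procedure. This is a stand-in chosen for simplicity, not the technique used in the study; the function names and the example lifetimes are hypothetical.

```python
from statistics import mean


def two_means_1d(values, iterations=100):
    """Find two 1-D cluster centres with Lloyd's algorithm (k = 2).

    Stand-in for the unnamed clustering technique on the slides.
    """
    low_centre, high_centre = float(min(values)), float(max(values))
    if low_centre == high_centre:
        raise ValueError("need at least two distinct lifetimes")

    for _ in range(iterations):
        # Assign each lifetime to its nearest centre (ties go to the low one).
        low = [v for v in values if abs(v - low_centre) <= abs(v - high_centre)]
        high = [v for v in values if abs(v - low_centre) > abs(v - high_centre)]
        new_low, new_high = mean(low), mean(high)
        if (new_low, new_high) == (low_centre, high_centre):
            break  # centres stopped moving
        low_centre, high_centre = new_low, new_high
    return low_centre, high_centre


def label_lifetimes(lifetimes):
    """Label each lifetime as 'short-lived' or 'long-lived'."""
    low_centre, high_centre = two_means_1d(lifetimes)
    return [
        "short-lived" if abs(v - low_centre) <= abs(v - high_centre) else "long-lived"
        for v in lifetimes
    ]


# Hypothetical lifetimes in number of versions (not data from the study).
example_labels = label_lifetimes([1, 1, 2, 2, 1, 10, 12, 11])
```

The histograms on the slides show more than two clusters per system, so a faithful reproduction would need a method that supports a variable number of clusters and a rule for deciding which clusters count as short-lived.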