Will this clone be short-lived? Towards understanding of the characteristics of short-lived clones

Code clones are created when a developer duplicates a code fragment to reuse existing functionality. Mitigating clones by refactoring them helps ease the long-term maintenance of large software systems. However, refactoring can introduce an additional cost. Prior work also suggests that refactoring all clones can be counterproductive since clones may live in a system for a short duration. Hence, it is beneficial to determine in advance whether a newly-introduced clone will be short-lived or long-lived to plan the most effective use of resources. In this work, we perform an empirical study on six open source Java systems to better understand the life expectancy of clones. We find that a large number of clones (i.e., 30% to 87%) lived in the systems for a short duration. Moreover, we find that although short-lived clones were changed more frequently than long-lived clones throughout their lifetime, short-lived clones were consistently changed with their siblings less often than long-lived clones. Furthermore, we build random forest classifiers in order to determine the life expectancy of a newly-introduced clone (i.e., whether a clone will be short-lived or long-lived). Our empirical results show that our random forest classifiers can determine the life expectancy of a newly-introduced clone with an average AUC of 0.63 to 0.92. We also find that the churn made to the methods containing a newly-introduced clone, as well as the complexity and size of those methods, are highly influential in determining whether the newly-introduced clone will be short-lived. Furthermore, the size of a newly-introduced clone shares a positive relationship with the likelihood that the newly-introduced clone will be short-lived.
Our results suggest that to improve the efficiency of clone management efforts, practitioners can leverage our classifiers and insights in order to determine whether a newly-introduced clone will be short-lived or long-lived to plan the most effective use of their clone management resources in advance.

These slides were presented at the International Conference on Software Engineering (2019)

Transcript

  1. Will this clone be short-lived? Towards understanding
    of the characteristics of short-lived clones
    - Journal First Presentation -
    Ahmed Hassan
    Weiyi Shang
    Patanamon (Pick) Thongtanunam
    patanamon.thongtanunam@unimelb.edu.au
    @patanamon

  2. Code clone: a group of code fragments that are
    nearly-identical
    For lower maintenance effort, these clones should
    be refactored to reduce code repetitiveness
    Monopoly/IncomeTaxSquare.java
    Monopoly/GoToJailSquare.java
    “2,241 refactoring instances were detected
    in 285 GitHub projects [Silva et al., 2016]”

  3. Refactoring all clones may not be worthwhile
    [Figure: timeline of released versions with clone lifetimes labeled 2 versions, 2 versions, and 10 versions]

  4. Refactoring all clones may not be worthwhile
    [Figure: timeline of released versions with clone lifetimes labeled 2 versions, 2 versions, and 10 versions]
    75% and 36% of volatile clones in the carol
    and dnsjava systems lived for a short duration
    [Kim et al., 2005]

  5. Refactoring all clones may not be worthwhile
    [Figure: timeline of released versions with clone lifetimes labeled 2 versions, 2 versions, and 10 versions]
    75% and 36% of volatile clones in the carol
    and dnsjava systems lived for a short duration
    [Kim et al., 2005]
    Many of the long-lived clones cannot be
    removed using standard refactoring techniques
    [Kim et al., 2005]

  6. Refactoring all clones may not be worthwhile
    [Figure: timeline of released versions with clone lifetimes labeled 2 versions, 2 versions, and 10 versions]
    Determining the life expectancy of
    clones in advance may be
    beneficial when managing clones

  7. Understanding clone genealogies and their life expectancy
    Apache Pig
    17 Years, 22 Releases
    10 Years, 35 Releases
    13 Years, 66 Releases
    14 Years, 36 Releases
    8 Years, 15 Releases
    11 Years, 43 Releases
    (PQ1) How long do clones live
    in a software system?
    (PQ2) How were short-lived and long-lived
    clones changed throughout their lifetime?

  8. Identifying clone life expectancy

  9. Identifying clone life expectancy
    Code Repository

  10. Identifying clone life expectancy
    Code Repository
    Extract sequentially developed versions (using git commands)
    [Figure: version tags v2.17.1, v2.17.2, v2.17.3, v2.16.0, v2.17.0, v2.18.0, v2.18.1, v2.18.2, v2.18.3, v2.17.4]
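
A minimal sketch of one way to put version tags into a development order. The slide only says that sequentially developed versions were extracted using git commands; the numeric ordering rule and the helper names `version_key` and `order_versions` are assumptions for illustration, and the tag names are the ones shown on the slide (they could come, for example, from the output of `git tag --list`).

```python
import re


def version_key(tag):
    """Turn a tag such as 'v2.17.3' into a sortable tuple like (2, 17, 3)."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def order_versions(tags):
    """Return the tags sorted by their numeric components (assumed ordering rule)."""
    return sorted(tags, key=version_key)


if __name__ == "__main__":
    # Tag names as they appear on the slide, in extraction order.
    tags = ["v2.17.1", "v2.17.2", "v2.17.3", "v2.16.0", "v2.17.0",
            "v2.18.0", "v2.18.1", "v2.18.2", "v2.18.3", "v2.17.4"]
    for tag in order_versions(tags):
        print(tag)
```

Sorting on integer tuples rather than on the raw strings keeps a tag such as v2.9.0 ahead of v2.10.0, which a plain string sort would get wrong.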

  11. Identifying clone life expectancy
    Code Repository
    Extract sequentially developed versions (using git commands)
    [Figure: version tags v2.17.1, v2.17.2, v2.17.3, v2.16.0, v2.17.0, v2.18.0, v2.18.1, v2.18.2, v2.18.3, v2.17.4]
    Extract clone genealogies (using iClones [Göde and Koschke, 2009])
    [Figure: clone genealogies labeled 2 versions and 6 versions]

  12. Identifying clone life expectancy
    Code Repository
    Extract sequentially developed versions (using git commands)
    [Figure: version tags v2.17.1, v2.17.2, v2.17.3, v2.16.0, v2.17.0, v2.18.0, v2.18.1, v2.18.2, v2.18.3, v2.17.4]
    Extract clone genealogies (using iClones [Göde and Koschke, 2009])
    [Figure: clone genealogies labeled 2 versions and 6 versions]
    Identify short-lived & long-lived clones (using a clustering technique)
    [Figure: the 2-version genealogy labeled Short and the 6-version genealogy labeled Long]
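
The grouping step can be illustrated with a small sketch. The slides do not name the clustering technique or the number of clusters, so this is an assumption: it splits one-dimensional lifetimes (counted in versions) into exactly two groups by trying every cut point and keeping the one with the smallest within-group squared error. The function name `split_lifetimes` and the sample lifetimes are hypothetical.

```python
def split_lifetimes(lifetimes):
    """Split 1-D lifetimes into a short group and a long group.

    Tries every cut point in the sorted values and keeps the one with the
    smallest total within-group sum of squared deviations, which is the
    optimal two-cluster k-means solution in one dimension.
    """
    values = sorted(lifetimes)
    if len(values) < 2:
        raise ValueError("need at least two lifetimes to form two groups")

    def sse(group):
        mean = sum(group) / len(group)
        return sum((v - mean) ** 2 for v in group)

    best_cut = min(
        range(1, len(values)),
        key=lambda i: sse(values[:i]) + sse(values[i:]),
    )
    return values[:best_cut], values[best_cut:]


if __name__ == "__main__":
    # Hypothetical lifetimes, in number of versions, for eight clone genealogies.
    short, long_ = split_lifetimes([1, 2, 1, 2, 2, 14, 16, 15])
    print("short-lived:", short)
    print("long-lived:", long_)
```

An exhaustive search over cut points is affordable here because the data are one-dimensional; a real study would likely use an established clustering implementation instead.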

  13. 30% (Maven) - 87% (Jackrabbit) of clones are
    short-lived clones
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of the number of clones against the number of versions, grouped into clusters with their mean number of versions and marked as short-lived or long-lived]
    Identify short-lived & long-lived clones (using a clustering technique)

  14. 30% (Maven) - 87% (Jackrabbit) of clones are
    short-lived clones
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of the number of clones against the number of versions, grouped into clusters with their mean number of versions; the Short-lived group is labeled]
    Identify short-lived & long-lived clones (using a clustering technique)

  15. 30% (Maven) - 87% (Jackrabbit) of clones are
    short-lived clones
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of the number of clones against the number of versions, grouped into clusters with their mean number of versions; the Short-lived and Long-lived groups are labeled]
    Identify short-lived & long-lived clones (using a clustering technique)

  16. 30% (Maven) - 87% (Jackrabbit) of clones are
    short-lived clones
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of the number of clones against the number of versions, grouped into clusters with their mean number of versions and marked as short-lived or long-lived]
    Identify short-lived & long-lived clones (using a clustering technique)
    The life expectancy of
    short-lived clones
    accounts for <17% of all
    studied releases

  17. Consistent changes appear more often in long-lived
    clones than in short-lived clones

  18. Consistent changes appear more often in long-lived
    clones than in short-lived clones
    Clones are consistently changed

  19. Consistent changes appear more often in long-lived
    clones than in short-lived clones
    [Figure: bar chart of the % of clone genealogies including consistently changing patterns, short-lived vs. long-lived, for Ant, Camel, Jackrabbit, Maven, Pig, Tomcat; bar labels as extracted: 35%, 25%, 25%, 26%, 37%, 31% and 19%, 9%, 14%, 15%, 16%, 10%]

  20. Consistent changes appear more often in long-lived
    clones than in short-lived clones
    [Figure: bar chart of the % of clone genealogies including consistently changing patterns, short-lived vs. long-lived, for Ant, Camel, Jackrabbit, Maven, Pig, Tomcat; bar labels as extracted: 35%, 25%, 25%, 26%, 37%, 31% and 19%, 9%, 14%, 15%, 16%, 10%]
    The maintenance effort that is associated with short-lived clones is
    smaller than the maintenance effort associated with long-lived clones
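
The percentage on this slide can be read as a simple ratio, sketched below. The data layout is hypothetical: each genealogy is a list of per-version records holding the sibling clone IDs and the IDs that changed in that version. Treating a change as "consistent" when every sibling changes in the same version is a simplifying assumption; the slides do not give the exact definition used in the study.

```python
def has_consistent_change(genealogy):
    """True if, in at least one version, every sibling of the clone group changed."""
    for step in genealogy:
        changed = set(step["changed"])
        siblings = set(step["siblings"])
        if changed and changed == siblings:
            return True
    return False


def consistent_share(genealogies):
    """Percentage of genealogies that include at least one consistent change."""
    if not genealogies:
        return 0.0
    hits = sum(1 for g in genealogies if has_consistent_change(g))
    return 100.0 * hits / len(genealogies)


if __name__ == "__main__":
    # Hypothetical genealogies with two siblings each.
    genealogies = [
        [{"siblings": ["a", "b"], "changed": ["a", "b"]}],  # consistent
        [{"siblings": ["a", "b"], "changed": ["a"]}],       # only one sibling changed
        [{"siblings": ["a", "b"], "changed": []}],          # no change at all
        [{"siblings": ["a", "b"], "changed": ["b", "a"]}],  # consistent
    ]
    print(consistent_share(genealogies))
```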

  21. Understanding clone genealogies and their life expectancy
    Apache Pig
    17 Years, 22 Releases
    10 Years, 35 Releases
    13 Years, 66 Releases
    14 Years, 36 Releases
    8 Years, 15 Releases
    11 Years, 43 Releases
    (PQ1) How long do clones live
    in a software system?
    (PQ2) How were short-lived and long-lived
    clones changed throughout their lifetime?

  22. Understanding clone genealogies and their life expectancy
    (PQ1) How long do clones live
    in a software system?
    Many clones lived in the studied
    systems for a short duration
    (PQ2) How were short-lived and long-lived
    clones changed throughout their lifetime?
    The maintenance effort for short-lived clones
    is smaller than that for long-lived clones
    It is important to determine in advance whether a clone will
    be short-lived or long-lived to manage clones more efficiently

  23. Building a classifier to determine the life
    expectancy of a newly-introduced clone
    A clone at the time when
    it was injected into the
    source code
    38 metrics:
    Product metrics (#Lines of Code)
    Process metrics (Churn, #Developers)
    Clone metrics (#Clone Siblings)
    Train
    A classifier
    (Random Forest)

  24. Towards understanding of the characteristics of
    short-lived clones
    (RQ1) How well can we determine whether an
    introduced clone will be short-lived?
    (RQ2) What are the most influential metrics for
    determining the clone life expectancy?
    Product metrics (#Lines of Code)
    Process metrics (Churn, #Developers)
    Clone metrics (#Clone Siblings)
    A classifier
    (Random Forest)
    Short-lived? Long-lived?

  25. Towards understanding of the characteristics of
    short-lived clones
    (RQ1) How well can we determine whether an
    introduced clone will be short-lived?
    A classifier
    (Random Forest)
    Our random forest classifiers
    achieve an average AUC of
    0.63 to 0.92
    (RQ2) What are the most influential metrics for
    determining the clone life expectancy?
    Process metrics: clones that are introduced with a
    large amount of churn made to
    their methods are more likely to be
    short-lived
    Our classifiers and metrics can be used to determine
    whether a newly-introduced clone will be short-lived
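
For readers unfamiliar with the AUC figure quoted above, the sketch below computes it from scratch. It treats AUC as the probability that a randomly chosen short-lived clone receives a higher score than a randomly chosen long-lived one, with ties counting half. The labels and scores are made-up illustration data, not results from the study, and the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for short-lived (positive), 0 for long-lived (negative).
    scores: classifier scores, higher meaning "more likely short-lived".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up example: two short-lived and two long-lived clones.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which gives a scale for reading the reported range of 0.63 to 0.92.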

  27. [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of clones by number of versions, marked as short-lived or long-lived]
    Short: less consistent changes
    Long: more consistent changes

  28. [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of clones by number of versions, marked as short-lived or long-lived]
    Short: less consistent changes
    Long: more consistent changes
    Building a classifier to determine
    the life expectancy of clones
    Clone metrics, product metrics, process metrics
    A classifier

  29. Towards understanding of the characteristics of
    short-lived clones
    (RQ1) How well can we determine whether an
    introduced clone will be short-lived?
    A classifier
    (Random Forest)
    Our random forest classifiers
    achieve an average AUC of
    0.63 to 0.92
    (RQ2) What are the most influential metrics for
    determining the clone life expectancy?
    Process metrics: clones that are introduced with a
    large amount of churn made to
    their methods are more likely to be
    short-lived
    Our classifiers and metrics can be used to determine
    whether a newly-introduced clone will be short-lived
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of clones by number of versions, marked as short-lived or long-lived]
    Short: less consistent changes
    Long: more consistent changes
    Building a classifier to determine
    the life expectancy of clones
    Clone metrics, product metrics, process metrics
    A classifier

  30. Towards understanding of the characteristics of
    short-lived clones
    (RQ1) How well can we determine whether an
    introduced clone will be short-lived?
    A classifier
    (Random Forest)
    Our random forest classifiers
    achieve an average AUC of
    0.63 to 0.92
    (RQ2) What are the most influential metrics for
    determining the clone life expectancy?
    Process metrics: clones that are introduced with a
    large amount of churn made to
    their methods are more likely to be
    short-lived
    Our classifiers and metrics can be used to determine
    whether a newly-introduced clone will be short-lived
    [Figure: per-system histograms (Maven, Pig, Tomcat, Ant, Camel, Jackrabbit) of clones by number of versions, marked as short-lived or long-lived]
    Short: less consistent changes
    Long: more consistent changes
    Building a classifier to determine
    the life expectancy of clones
    Clone metrics, product metrics, process metrics
    A classifier
    Our classifiers and insights can help teams to
    plan the most effective use of the clone
    management resources
    patanamon.thongtanunam@unimelb.edu.au
    @patanamon
    http://patanamon.com