Modeling Changeset Topics

Christopher S. Corley, Kelly L. Kashuda
The University of Alabama

Daniel S. May
Swarthmore College

Nicholas A. Kraft
ABB Corporate Research

Topic modeling has been applied to several areas of software engineering, such as bug localization, feature location, triaging change requests, and traceability link recovery. Many of these approaches combine mining unstructured data, such as bug reports, with topic modeling a snapshot (or release) of source code. However, source code evolves, which causes models built from a snapshot to become obsolete. In this paper, we explore topic modeling of changesets as an alternative to the traditional release-based approach. We conduct an exploratory study of four open source systems. For each project, we investigate how the two corpora differ and evaluate the topic distinctness of the resulting models.

*Note*: these slides were animation-heavy. A YouTube recording is available here: https://www.youtube.com/watch?v=S12B_CTeUtA

Christopher Corley

September 30, 2014

Transcript

  1. Modeling Changeset Topics
    C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft
    @excsc
    [email protected]
    cscorley/mud2014-modeling-changeset-topics

  2. 2
    ????
    Topic Modeling

  3. 3

  4. 3
    latent Dirichlet allocation (LDA)

  5. 4
    1

  6. 4
    1

  7. 5
    2

  8. 5
    2

  9. 5
    2

  10. 5
    2

  11. 6
    3*

  12. 6
    3*

  13. 6
    3*

  14. 7

  15. 7
    Feature location

  16. 7
    Feature location
    Bug localization

  17. 7
    Feature location
    Bug localization
    Traceability links

  18. 7
    Feature location
    Bug localization
    Traceability links
    Developer identification

  19. 8
    Release A

  20. 8
    Release A
    A

  21. 9
    Release A

  22. 9
    Release A
    A+1
    A+2
    A+3
    Release B

  23. 10
    Release A
    A+1
    A+2
    A+3
    Release B

  24. 10
    Release A
    A+1
    A+2
    A+3
    Release B

  25. 11

  26. 11
    Rome wasn’t built in a day

  27. 11
    Rome wasn’t built in a day
    … neither is software.

  28. 12
    Release A
    A+1
    A+2
    A+3
    Release B

  29. 12
    Release A
    A+1
    A+2
    A+3
    Release B
    (not a good idea)

  30. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  31. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  32. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  33. 13
    Release A
    A+1
    A+2
    A+3
    Release B
    (a much better idea)

  34. 14
    LDA is online

  35. 14
    LDA is online => streamed

  36. 14
    LDA can process an
    unknown number
    of documents
    LDA is online => streamed

  37. 14
    LDA can process an
    unknown number
    of documents
    LDA is online => streamed
    => ∞
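
The streaming idea on this slide can be sketched without committing to a particular LDA implementation. The snippet below shows only the feeding pattern: it groups a stream of unknown length into mini-batches, which is the shape of input an online learner consumes. The `changesets` generator and the batch size are hypothetical placeholders, and no LDA update is performed here.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield lists of up to batch_size items from a stream of unknown length."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def changesets():
    """Hypothetical stand-in for documents arriving from a repository."""
    for n in range(1, 8):
        yield f"changeset {n}"


# An online learner would update its model once per batch; here the
# batches are only collected so their sizes can be inspected.
batches = list(mini_batches(changesets(), batch_size=3))
print([len(batch) for batch in batches])
```

Because the batching function never asks for the length of the stream, the same loop works whether the source holds seven documents or keeps producing new ones.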

  38. 15
    Source code repositories!

  39. 16
    Release A
    A+1
    A+2
    A+3
    Release B
    Source code repositories!

  40. 17
    Release A
    A+1
    A+2
    A+3
    Release B
    But, how?

  41. 17
    Release A
    A+1
    A+2
    A+3
    Release B
    But, how?

  42. 18
    Release A
    A+1
    A+2
    A+3
    Release B

  43. 18
    Release A
    A+1
    A+2
    A+3
    Release B
    diff A..A+1
    diff A+1..A+2
    diff A+2..A+3
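
The pairwise diffs listed on this slide can be illustrated with a small sketch. It assumes each snapshot of a single file is available as a list of lines and treats the removed and added lines between consecutive snapshots as one changeset document. The `snapshots` data and the `changeset_document` helper are invented for illustration and are not taken from the talk.

```python
from difflib import SequenceMatcher


def changeset_document(old_lines, new_lines):
    """Collect the lines removed from old_lines and added in new_lines."""
    removed, added = [], []
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed + added


# Hypothetical snapshots of one file at commits A, A+1 and A+2.
snapshots = [
    ["open file", "read bytes"],
    ["open file", "read bytes", "close file"],
    ["open file", "parse bytes", "close file"],
]

# One document per consecutive pair: diff A..A+1, then diff A+1..A+2.
documents = [
    changeset_document(old, new)
    for old, new in zip(snapshots, snapshots[1:])
]
print(documents)
```

Each document contains only what changed, so a new commit adds one small document to the corpus instead of requiring a fresh snapshot of the whole system.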

  44. 18
    Release A
    A+1
    A+2
    A+3
    Release B

  45. 19
    Release Changeset

  46. 19
    Release Changeset
    How does the
    corpus change?

  47. 19
    Release Changeset
    How does the
    corpus change?
    How does the
    model change?

  48. 20
    AspectJ
    Joda-Time

  49. 20
    AspectJ
    Joda-Time
    RQ1: cosine similarity

  50. 20
    99.7 %
    AspectJ
    Joda-Time
    RQ1: cosine similarity

  51. 20
    99.7 % 93.1 %
    AspectJ
    Joda-Time
    RQ1: cosine similarity

  52. 20
    99.7 % 93.1 % 93.5 %
    AspectJ
    Joda-Time
    RQ1: cosine similarity

  53. 20
    99.7 % 93.1 % 93.5 % 67.1 %
    AspectJ
    Joda-Time
    RQ1: cosine similarity
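
As a reminder of what a cosine similarity between two corpora measures, here is a minimal sketch. It assumes each corpus is summarized as a bag-of-words term-frequency vector, which may differ from the exact preprocessing used in the study. The two toy corpora are invented and are unrelated to the percentages on the slide.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical release and changeset corpora, already tokenized.
release_terms = Counter("parse file read file".split())
changeset_terms = Counter("parse file read file close file".split())

similarity = cosine_similarity(release_terms, changeset_terms)
print(f"{similarity:.1%}")
```

A value near 100 % means the two corpora use terms in nearly the same proportions, while a lower value means the term distributions diverge.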

  54. 21
    AspectJ
    Joda-Time

  55. 21
    AspectJ
    Joda-Time
    RQ2: distinctness score

  56. 21
    AspectJ
    Joda-Time
    2.31
    3.17
    RQ2: distinctness score

  57. 21
    AspectJ
    Joda-Time
    2.31
    3.17
    3.75
    2.78
    RQ2: distinctness score

  58. 21
    AspectJ
    Joda-Time
    2.31
    3.17
    3.75
    2.78
    1.34
    1.03
    RQ2: distinctness score

  59. 21
    AspectJ
    Joda-Time
    2.31
    3.17
    3.75
    2.78
    1.34
    1.03
    2.59
    3.56
    RQ2: distinctness score
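
The slides do not define the distinctness score. One definition found in the topic-modeling literature, assumed here purely for illustration, is the mean Kullback-Leibler divergence between a topic's word distribution and that of every other topic, averaged over all topics. The sketch below follows that assumption. The two toy topics are invented and do not reproduce the scores above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary.

    Assumes q is positive wherever p is, as with smoothed LDA topics.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_distinctness(topics, index):
    """Mean divergence of one topic from every other topic."""
    others = [topic for j, topic in enumerate(topics) if j != index]
    return sum(kl_divergence(topics[index], other) for other in others) / len(others)


def model_distinctness(topics):
    """Average topic distinctness over all topics in a model."""
    scores = [topic_distinctness(topics, i) for i in range(len(topics))]
    return sum(scores) / len(scores)


# Two hypothetical topics over a three-word vocabulary.
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
print(round(model_distinctness(toy_topics), 3))
```

Under this definition a higher score means the topics overlap less, and a model whose topics are all identical scores zero.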

  60. 22
    Release Changeset
    How does the
    corpus change?
    How does the
    model change?

  61. 23
    Modeling Changeset Topics
    C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft
    @excsc
    [email protected]
    cscorley/mud2014-modeling-changeset-topics

  62. 24
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.

  63. 24
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

  64. 24
    The Nameless is the origin of Heaven and Earth;
    The named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

  65. 25
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Named is the mother of all things.
    The named is the mother of all things.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!
    diff --git a/lao b/tzu
    index 635ef2c..5af88a8 100644
    --- a/lao
    +++ b/tzu
    @@ -1,7 +1,6 @@
    -The Way that can be told of is not the eternal Way;
    -The name that can be named is not the eternal name.
     The Nameless is the origin of Heaven and Earth;
    -The Named is the mother of all things.
    +The named is the mother of all things.
    +
     Therefore let there always be non-being,
     so we may see their subtlety,
     And let there always be being,
    @@ -9,3 +8,6 @@ And let there always be being,
     The two are the same,
     But after they are produced,
     they have different names.
    +They both may be called deep and profound.
    +Deeper and more profound,
    +The door of all subtleties!
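
The step from a diff to a document can be sketched as a small parser. It assumes git-style unified diff text and keeps the added and removed lines, with context lines as an option, since the slides show both variants. The `changed_lines` helper and its `keep_context` flag are hypothetical names, and `DIFF` is the first hunk from the slide with the leading space of each context line written out.

```python
DIFF = """\
diff --git a/lao b/tzu
--- a/lao
+++ b/tzu
@@ -1,7 +1,6 @@
-The Way that can be told of is not the eternal Way;
-The name that can be named is not the eternal name.
 The Nameless is the origin of Heaven and Earth;
-The Named is the mother of all things.
+The named is the mother of all things.
+
 Therefore let there always be non-being,
 so we may see their subtlety,
 And let there always be being,
"""


def changed_lines(diff_text, keep_context=False):
    """Extract non-blank added and removed lines from a unified diff."""
    document = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file headers follow until the next hunk
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk:
            marker, text = line[:1], line[1:]
            wanted = marker in ("+", "-") or (keep_context and marker == " ")
            if wanted and text.strip():
                document.append(text)
    return document


for text in changed_lines(DIFF):
    print(text)
```

Tracking whether the parser is inside a hunk keeps the `---` and `+++` file headers out of the document without guessing from their prefixes.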

  66. 25
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    The named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Named is the mother of all things.
    The named is the mother of all things.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!