Modeling Changeset Topics

Christopher S. Corley, Kelly L. Kashuda
The University of Alabama

Daniel S. May
Swarthmore College

Nicholas A. Kraft
ABB Corporate Research

Topic modeling has been applied to several areas of software engineering, such as bug localization, feature location, triaging change requests, and traceability link recovery. Many of these approaches combine mining unstructured data, such as bug reports, with topic modeling of a snapshot (or release) of source code. However, source code evolves, so a model trained on one snapshot becomes obsolete. In this paper, we explore topic modeling of changesets as an alternative to the traditional release-based approach. We conduct an exploratory study of four open source systems: we investigate how the release and changeset corpora of each project differ, and we evaluate the topic distinctness of the resulting models.

*Note*: these slides were animation-heavy; a YouTube recording is available here: https://www.youtube.com/watch?v=S12B_CTeUtA

Christopher Corley

September 30, 2014

Transcript

  1. Modeling Changeset Topics
    C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft
    @excsc
    [email protected]
    cscorley/mud2014-modeling-changeset-topics

  2. 2
    Topic Modeling

  3. 3
    latent Dirichlet allocation (LDA)

  4. 7
    Feature location

  5. 7
    Feature location
    Bug localization

  6. 7
    Feature location
    Bug localization
    Traceability links

  7. 7
    Feature location
    Bug localization
    Traceability links
    Developer identification

  8. 8
    Release A

  9. 8
    Release A
    A

  10. 9
    Release A

  11. 9
    Release A
    A+1
    A+2
    A+3
    Release B

  12. 10
    Release A
    A+1
    A+2
    A+3
    Release B

  13. 10
    Release A
    A+1
    A+2
    A+3
    Release B

  14. 11
    Rome wasn’t built in a day

  15. 11
    Rome wasn’t built in a day
    … neither is software.

  16. 12
    Release A
    A+1
    A+2
    A+3
    Release B

  17. 12
    Release A
    A+1
    A+2
    A+3
    Release B
    (not a good idea)

  18. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  19. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  20. 13
    Release A
    A+1
    A+2
    A+3
    Release B

  21. 13
    Release A
    A+1
    A+2
    A+3
    Release B
    (a much better idea)

  22. 14
    LDA is online

  23. 14
    LDA is online => streamed

  24. 14
    LDA is online => streamed
    LDA can process an unknown number of documents

  25. 14
    LDA is online => streamed
    LDA can process an unknown number of documents => ∞
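
The streaming idea on these slides can be sketched without a topic-modeling library. The example below is only an illustration under stated assumptions: `minibatches`, `RunningTermCounts` and `train` are hypothetical names, and the "model" is a plain running term count standing in for an online LDA learner. What it shows is the shape of the loop: documents arrive from an iterator of unknown length, and the model state is updated one mini-batch at a time.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def minibatches(docs: Iterable[str], size: int) -> Iterator[List[Counter]]:
    """Yield bag-of-words Counters, `size` documents at a time.

    `docs` may be any iterable, including a generator, so the total
    number of documents never has to be known up front.
    """
    it = iter(docs)
    while True:
        batch = [Counter(tokenize(d)) for d in islice(it, size)]
        if not batch:
            return
        yield batch


class RunningTermCounts:
    """Hypothetical stand-in for an online model: updated once per batch."""

    def __init__(self) -> None:
        self.term_counts: Counter = Counter()
        self.docs_seen = 0

    def update(self, batch: List[Counter]) -> None:
        for bag_of_words in batch:
            self.term_counts.update(bag_of_words)
        self.docs_seen += len(batch)


def train(docs: Iterable[str], batch_size: int = 2) -> RunningTermCounts:
    """Consume the stream batch by batch, never holding the whole corpus."""
    model = RunningTermCounts()
    for batch in minibatches(docs, batch_size):
        model.update(batch)
    return model
```

In a real setting the `update` step would be replaced by the update rule of an online LDA implementation; the loop around it would stay the same.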

  26. 15
    Source code repositories!

  27. 16
    Release A
    A+1
    A+2
    A+3
    Release B
    Source code repositories!

  28. 17
    Release A
    A+1
    A+2
    A+3
    Release B
    But, how?

  29. 17
    Release A
    A+1
    A+2
    A+3
    Release B
    But, how?

  30. 18
    Release A
    A+1
    A+2
    A+3
    Release B

  31. 18
    Release A
    A+1
    A+2
    A+3
    Release B
    diff A..A+1
    diff A+1..A+2
    diff A+2..A+3

  32. 18
    Release A
    A+1
    A+2
    A+3
    Release B

  33. 19
    Release | Changeset

  34. 19
    Release | Changeset
    How does the corpus change?

  35. 19
    Release | Changeset
    How does the corpus change?
    How does the model change?

  36. 20
    AspectJ
    Joda-Time

  37. 20
    RQ1: cosine similarity
    AspectJ
    Joda-Time

  38. 20
    RQ1: cosine similarity
    AspectJ
    Joda-Time
    99.7%

  39. 20
    RQ1: cosine similarity
    AspectJ
    Joda-Time
    99.7% 93.1%

  40. 20
    RQ1: cosine similarity
    AspectJ
    Joda-Time
    99.7% 93.1% 93.5%

  41. 20
    RQ1: cosine similarity
    AspectJ
    Joda-Time
    99.7% 93.1% 93.5% 67.1%
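
As a rough illustration of the RQ1 measurement, the sketch below compares two corpora by the cosine similarity of their term-frequency vectors. It assumes whitespace tokenization and raw counts; the slides do not say how the corpora were preprocessed or weighted, so `term_frequencies` and `cosine_similarity` are hypothetical helpers rather than the study's code.

```python
import math
from collections import Counter
from typing import Iterable


def term_frequencies(documents: Iterable[str]) -> Counter:
    """Count every term across all documents of a corpus."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus has no direction to compare
    return dot / (norm_a * norm_b)


# Toy corpora standing in for a release snapshot and a changeset history.
release_corpus = ["parse date time", "format date"]
changeset_corpus = ["parse date", "format date time zone"]

similarity = cosine_similarity(
    term_frequencies(release_corpus),
    term_frequencies(changeset_corpus),
)
```

A value near 1 means the two corpora use terms in nearly the same proportions; a value near 0 means they share almost no vocabulary.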

  42. 21
    AspectJ
    Joda-Time

  43. 21
    RQ2: distinctness score
    AspectJ
    Joda-Time

  44. 21
    RQ2: distinctness score
    AspectJ
    Joda-Time
    2.31  3.17

  45. 21
    RQ2: distinctness score
    AspectJ
    Joda-Time
    2.31  3.17
    3.75  2.78

  46. 21
    RQ2: distinctness score
    AspectJ
    Joda-Time
    2.31  3.17
    3.75  2.78
    1.34  1.03

  47. 21
    RQ2: distinctness score
    AspectJ
    Joda-Time
    2.31  3.17
    3.75  2.78
    1.34  1.03
    2.59  3.56
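
The slides do not define the distinctness score. One common definition, assumed here, takes a topic's distinctness to be its mean Kullback-Leibler divergence from every other topic, and a model's distinctness to be the mean over its topics. The sketch below implements that assumed definition; the function names are hypothetical, and each topic is a word-probability list over the same vocabulary with no zero entries (which holds for smoothed LDA topics).

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_distinctness(topics: Sequence[Sequence[float]], index: int) -> float:
    """Mean KL divergence from topic `index` to every other topic."""
    if len(topics) < 2:
        raise ValueError("need at least two topics to compare")
    others = [t for i, t in enumerate(topics) if i != index]
    return sum(kl_divergence(topics[index], o) for o in others) / len(others)


def model_distinctness(topics: Sequence[Sequence[float]]) -> float:
    """Mean topic distinctness over all topics of a model."""
    return sum(topic_distinctness(topics, i) for i in range(len(topics))) / len(topics)


# Two toy topics over a two-word vocabulary.
overlapping = [[0.5, 0.5], [0.5, 0.5]]
separated = [[0.9, 0.1], [0.1, 0.9]]
```

Under this definition identical topics score 0, and the score grows as topics put their probability mass on different words, so a higher score indicates more distinct topics.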

  48. 22
    Release | Changeset
    How does the corpus change?
    How does the model change?

  49. 23
    Modeling Changeset Topics
    C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft
    @excsc
    [email protected]
    cscorley/mud2014-modeling-changeset-topics

  50. 24
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.

  51. 24
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

  52. 24
    The Nameless is the origin of Heaven and Earth;
    The named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    so we may see their outcome.
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

  53. 25
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Named is the mother of all things.
    The named is the mother of all things.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

    diff --git a/lao b/tzu
    index 635ef2c..5af88a8 100644
    --- a/lao
    +++ b/tzu
    @@ -1,7 +1,6 @@
    -The Way that can be told of is not the eternal Way;
    -The name that can be named is not the eternal name.
     The Nameless is the origin of Heaven and Earth;
    -The Named is the mother of all things.
    +The named is the mother of all things.
    +
     Therefore let there always be non-being,
       so we may see their subtlety,
     And let there always be being,
    @@ -9,3 +8,6 @@ And let there always be being,
     The two are the same,
     But after they are produced,
       they have different names.
    +They both may be called deep and profound.
    +Deeper and more profound,
    +The door of all subtleties!

  54. 25
    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Nameless is the origin of Heaven and Earth;
    The Named is the mother of all things.
    The named is the mother of all things.
    Therefore let there always be non-being,
    so we may see their subtlety,
    And let there always be being,
    The two are the same,
    But after they are produced,
    they have different names.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!

    The Way that can be told of is not the eternal Way;
    The name that can be named is not the eternal name.
    The Named is the mother of all things.
    The named is the mother of all things.
    They both may be called deep and profound.
    Deeper and more profound,
    The door of all subtleties!
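
The last two slides show the text of a changeset document being taken from a diff, either with or without the surrounding context lines. The sketch below reproduces that step with Python's standard `difflib` on the same lao/tzu example. It is an illustration only: `changeset_document` is a hypothetical helper, dropping blank lines is an assumption, and `difflib` stands in for whatever diff tool the study used, so hunk boundaries could differ from Git's on other inputs.

```python
import difflib
from typing import List

LAO = [
    "The Way that can be told of is not the eternal Way;",
    "The name that can be named is not the eternal name.",
    "The Nameless is the origin of Heaven and Earth;",
    "The Named is the mother of all things.",
    "Therefore let there always be non-being,",
    "  so we may see their subtlety,",
    "And let there always be being,",
    "  so we may see their outcome.",
    "The two are the same,",
    "But after they are produced,",
    "  they have different names.",
]

TZU = LAO[2:3] + ["The named is the mother of all things.", ""] + LAO[4:] + [
    "They both may be called deep and profound.",
    "Deeper and more profound,",
    "The door of all subtleties!",
]


def changeset_document(old: List[str], new: List[str],
                       keep_context: bool = False, context: int = 3) -> List[str]:
    """Return the lines of a unified diff that form the changeset document.

    Added and removed lines are always kept; context lines are kept only
    when `keep_context` is true. Diff markers and blank lines are dropped.
    """
    diff = list(difflib.unified_diff(old, new, fromfile="a/lao",
                                     tofile="b/tzu", n=context, lineterm=""))
    kept = []
    for line in diff[2:]:            # skip the '---' and '+++' file headers
        if line.startswith("@@"):    # hunk header, not document text
            continue
        if line[0] in "+-" or keep_context:
            kept.append(line[1:])    # strip the one-character diff marker
    return [line for line in kept if line.strip()]


changed_only = changeset_document(LAO, TZU)
with_context = changeset_document(LAO, TZU, keep_context=True)
```

Here `changed_only` holds the seven added or removed lines and `with_context` additionally holds the context lines of both hunks. Whether to include context is a design choice: context adds terms that did not change in the commit, while omitting it keeps each document focused on what the commit actually touched.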