Modeling Changeset Topics

Christopher S. Corley, Kelly L. Kashuda
The University of Alabama

Daniel S. May
Swarthmore College

Nicholas A. Kraft
ABB Corporate Research

Topic modeling has been applied to several areas of software engineering, such as bug localization, feature location, triaging change requests, and traceability link recovery. Many of these approaches combine mining unstructured data, such as bug reports, with topic modeling a snapshot (or release) of source code. However, source code evolves, which causes models to become obsolete. In this paper, we explore topic modeling of changesets as an alternative to the traditional release-based approach. We conduct an exploratory study of four open source systems. We investigate how the corpora differ in each project and evaluate the topic distinctness of the models.

*Note*: these slides were animation-heavy; a YouTube recording is available here: https://www.youtube.com/watch?v=S12B_CTeUtA

Christopher Corley

September 30, 2014

Transcript

  1. Modeling Changeset Topics. C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft. @excsc, cscorley@ua.edu, cscorley/mud2014-modeling-changeset-topics
  2. Topic Modeling
  3. latent Dirichlet allocation (LDA)
  4. [figure labeled 1]
  5. [figure labeled 2]
  6. [figure labeled 3*]
  7. Feature location; bug localization; traceability links; developer identification
  8. Release A
  9. Release A … A+1, A+2, A+3, Release B
  10. Release A … A+1, A+2, A+3, Release B
  11. Rome wasn’t built in a day … neither is software.
  12. Release A … A+1, A+2, A+3, Release B (not a good idea)
  13. Release A … A+1, A+2, A+3, Release B (a much better idea)
  14. LDA is online => streamed => ∞. LDA can process an unknown number of documents.
  15. Source code repositories!
  16. Release A … A+1, A+2, A+3, Release B. Source code repositories!
  17. Release A … A+1, A+2, A+3, Release B. But, how?
  18. Release A … A+1, A+2, A+3, Release B. diff A..A+1, diff A+1..A+2, diff A+2..A+3
  19. Release, Changeset. How does the corpus change? How does the model change?
  20. RQ1: cosine similarity. Systems labeled: AspectJ, Joda-Time. Values shown: 99.7 %, 93.1 %, 93.5 %, 67.1 %
  21. RQ2: distinctness score. Systems labeled: AspectJ, Joda-Time. Values shown, in the pairs in which they appear: 2.31 and 3.17; 3.75 and 2.78; 1.34 and 1.03; 2.59 and 3.56
  22. Release, Changeset. How does the corpus change? How does the model change?
  23. Modeling Changeset Topics. C.S. Corley, K.L. Kashuda, D.S. May, N.A. Kraft. @excsc, cscorley@ua.edu, cscorley/mud2014-modeling-changeset-topics
  24. The Way that can be told of is not the eternal Way; The name that can be named is not the eternal name. The Nameless is the origin of Heaven and Earth; The Named is the mother of all things. Therefore let there always be non-being, so we may see their subtlety, And let there always be being, so we may see their outcome. The two are the same, But after they are produced, they have different names.

    The Nameless is the origin of Heaven and Earth; The named is the mother of all things. Therefore let there always be non-being, so we may see their subtlety, And let there always be being, so we may see their outcome. The two are the same, But after they are produced, they have different names. They both may be called deep and profound. Deeper and more profound, The door of all subtleties!
  25. The diff between the two texts:

      diff --git a/lao b/tzu
      index 635ef2c..5af88a8 100644
      --- a/lao
      +++ b/tzu
      @@ -1,7 +1,6 @@
      -The Way that can be told of is not the eternal Way;
      -The name that can be named is not the eternal name.
       The Nameless is the origin of Heaven and Earth;
      -The Named is the mother of all things.
      +The named is the mother of all things.
      +
       Therefore let there always be non-being,
       so we may see their subtlety,
       And let there always be being,
      @@ -9,3 +8,6 @@ And let there always be being,
       The two are the same,
       But after they are produced,
       they have different names.
      +They both may be called deep and profound.
      +Deeper and more profound,
      +The door of all subtleties!

    Diff lines with context: The Way that can be told of is not the eternal Way; The name that can be named is not the eternal name. The Nameless is the origin of Heaven and Earth; The Named is the mother of all things. The named is the mother of all things. Therefore let there always be non-being, so we may see their subtlety, And let there always be being, The two are the same, But after they are produced, they have different names. They both may be called deep and profound. Deeper and more profound, The door of all subtleties!

    Changed lines only: The Way that can be told of is not the eternal Way; The name that can be named is not the eternal name. The Named is the mother of all things. The named is the mother of all things. They both may be called deep and profound. Deeper and more profound, The door of all subtleties!
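
The last slide contrasts two ways of turning a diff into a document: keeping every line of the diff body, or keeping only the added and removed lines. A minimal Python sketch of that step follows. The function name `changeset_document` and the `include_context` flag are hypothetical, and the sketch assumes git-style unified diffs in which each file section begins with `diff --git`; it is not taken from the authors' code.

```python
def changeset_document(diff_text, include_context=False):
    """Return the non-blank body lines of a git-style unified diff.

    Added ("+") and removed ("-") lines are always kept; context lines
    (leading space) are kept only when include_context is True.
    """
    kept = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False   # file headers ("index", "---", "+++") follow
        elif line.startswith("@@"):
            in_hunk = True    # hunk header; body lines follow
        elif in_hunk:
            if line.startswith(("+", "-")):
                kept.append(line[1:])
            elif include_context and line.startswith(" "):
                kept.append(line[1:])
    return [text for text in kept if text.strip()]


# The diff shown on the slide, one string per line.
DIFF = "\n".join([
    "diff --git a/lao b/tzu",
    "index 635ef2c..5af88a8 100644",
    "--- a/lao",
    "+++ b/tzu",
    "@@ -1,7 +1,6 @@",
    "-The Way that can be told of is not the eternal Way;",
    "-The name that can be named is not the eternal name.",
    " The Nameless is the origin of Heaven and Earth;",
    "-The Named is the mother of all things.",
    "+The named is the mother of all things.",
    "+",
    " Therefore let there always be non-being,",
    " so we may see their subtlety,",
    " And let there always be being,",
    "@@ -9,3 +8,6 @@ And let there always be being,",
    " The two are the same,",
    " But after they are produced,",
    " they have different names.",
    "+They both may be called deep and profound.",
    "+Deeper and more profound,",
    "+The door of all subtleties!",
])

changed_only = changeset_document(DIFF)
with_context = changeset_document(DIFF, include_context=True)
print(len(changed_only), len(with_context))
```

Each returned list can then be tokenized and treated as one document in the changeset corpus.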
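
The RQ1 slide reports cosine similarity between corpora, but the transcript does not say how the vectors were built. The sketch below assumes each corpus is reduced to a single term-frequency vector over whitespace-separated, lowercased tokens; the helper names and the toy corpora are illustrative only.

```python
import math
from collections import Counter


def term_frequencies(documents):
    """Count whitespace-separated, lowercased terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy corpora (made up for illustration, not data from the study).
release_corpus = ["parse date time", "format date"]
changeset_corpus = ["parse date", "fix date format"]

similarity = cosine_similarity(
    term_frequencies(release_corpus),
    term_frequencies(changeset_corpus),
)
print(round(similarity, 3))
```

A value near 1 means the two corpora use terms in nearly the same proportions; lower values indicate that the changeset corpus emphasizes different terms than the release snapshot.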
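
The RQ2 slide reports a distinctness score without defining it. One definition used in the topic-modeling literature is the mean Kullback-Leibler divergence between a topic's word distribution and those of all other topics, averaged over topics to score a whole model. The sketch below assumes that definition, which may differ from the one behind the numbers on the slide; the function names and toy topics are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against division by zero when q assigns no mass to a term.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def topic_distinctness(topics, i):
    """Mean KL divergence from topic i to every other topic."""
    others = [kl_divergence(topics[i], t) for j, t in enumerate(topics) if j != i]
    return sum(others) / len(others)


def model_distinctness(topics):
    """Mean topic distinctness over all topics (requires at least two)."""
    return sum(topic_distinctness(topics, i) for i in range(len(topics))) / len(topics)


# Toy topic-word distributions (made up for illustration).
overlapping = [[0.5, 0.5], [0.5, 0.5]]
separated = [[0.9, 0.1], [0.1, 0.9]]

print(model_distinctness(overlapping), model_distinctness(separated))
```

Under this definition, higher scores mean topics share less probability mass, so a model whose topics overlap heavily scores close to zero.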