Modeling Changeset Topics for Feature Location

Feature location is a program comprehension activity in which a developer inspects source code to locate the classes or methods that implement a feature of interest. Many feature location techniques (FLTs) are based on text retrieval models, and in such FLTs it is typical for the models to be trained on source code snapshots. However, source code evolution leads to model obsolescence and thus to the need to retrain the model from the latest snapshot. In this paper, we introduce a topic-modeling-based FLT in which the model is built incrementally from source code history. By training an online learning algorithm using changesets, the FLT maintains an up-to-date model without incurring the non-trivial computational cost associated with retraining traditional FLTs. Overall, we studied over 1,200 defects and features from 14 open-source Java projects. We also present a historical simulation that demonstrates how the FLT performs as a project evolves. Our results indicate that the accuracy of a changeset-based FLT is similar to that of a snapshot-based FLT, but without the retraining costs.

Christopher Corley

September 29, 2015

Transcript

  1. Modeling Changeset Topics for Feature Location
    Christopher S. Corley (University of Alabama)
    Kelly L. Kashuba (University of Alabama)
    Nicholas A. Kraft (ABB Corporate Research)
    This talk has a lot of FLT jargon, and the problem I’m trying to address is pretty subtle, so it is easier to see if we go through everything. The talk covers two things:
    1. A way we can use topic models on something other than source code, such as changesets, for FLT
    2. A look into more accurate FLT evaluations

  2. We have a user who is looking for something in a big mess of source code. Our objective is to search on their behalf and give them a ranked list where the top document(s) are the ones they are looking for.
    We sometimes use topic modeling to accomplish this.

  4. Topic modeling takes a collection of immutable documents, such as books and newspaper articles, and discovers the underlying themes in the text.
    - We categorize documents by the topics they contain, using machine learning
    >>> Using a topic model takes 3 basic steps

  6. Step one is the learning step:
    - Take a bunch of documents
    - Put them into a machine learning algorithm
    - >>> The algorithm produces a trained model, which we can use for the other steps.

  8. Step two:
    - Take this trained model
    - Using the same constructs
    - AND the same documents again
    - Infer the thematic structure of the documents while the model is held fixed

  12. Step 3, which is special:
    - we can take UNSEEN documents, documents we didn’t use in training
    - infer their thematic structure
    That’s a quick overview of how topic models work.
    >>> How do we use topic models for feature location?
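
To make the three steps concrete, here is a minimal sketch in Python. It is a stand-in built on assumptions: it uses a small collapsed Gibbs sampler for LDA rather than any particular library or the algorithm used later in this talk, and the names `train` and `infer`, the hyperparameter defaults, and the toy documents are all made up for illustration.

```python
import random


def train(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Step 1: learn per-topic word counts from tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    word_counts = [dict() for _ in range(k)]  # word_counts[topic][word]
    totals = [0] * k                          # tokens assigned to each topic
    doc_counts = [[0] * k for _ in docs]      # doc_counts[doc][topic]
    # Start from a random topic assignment for every token.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            word_counts[t][w] = word_counts[t].get(w, 0) + 1
            totals[t] += 1
            doc_counts[d][t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                t = assignments[d][i]
                word_counts[t][w] -= 1
                totals[t] -= 1
                doc_counts[d][t] -= 1
                weights = [
                    (doc_counts[d][j] + alpha)
                    * (word_counts[j].get(w, 0) + beta)
                    / (totals[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                word_counts[t][w] = word_counts[t].get(w, 0) + 1
                totals[t] += 1
                doc_counts[d][t] += 1
    return {"k": k, "alpha": alpha, "beta": beta, "vocab_size": vocab_size,
            "word_counts": word_counts, "totals": totals}


def infer(model, doc, iters=50, seed=0):
    """Steps 2 and 3: estimate a document's topic mixture, model held fixed."""
    rng = random.Random(seed)
    k, alpha, beta = model["k"], model["alpha"], model["beta"]

    def word_prob(t, w):
        count = model["word_counts"][t].get(w, 0)
        return (count + beta) / (model["totals"][t] + model["vocab_size"] * beta)

    assignments = [rng.randrange(k) for _ in doc]
    counts = [0] * k
    for t in assignments:
        counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[assignments[i]] -= 1
            weights = [(counts[t] + alpha) * word_prob(t, w) for t in range(k)]
            assignments[i] = rng.choices(range(k), weights=weights)[0]
            counts[assignments[i]] += 1
    total = len(doc) + k * alpha
    return [(c + alpha) / total for c in counts]


docs = [
    "parse token lexer parse grammar".split(),
    "token lexer grammar parse".split(),
    "socket packet network socket send".split(),
    "packet send network socket".split(),
]
model = train(docs, k=2)                               # step 1
seen = [infer(model, doc) for doc in docs]             # step 2: training documents
unseen = infer(model, "lexer grammar token".split())   # step 3: an unseen document
```

Note that `infer` never modifies the model: the same call serves both the documents used in training and documents the model has never seen.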

  15. Marcus, A.; Sergeyev, A.; Rajlich, V. & Maletic, J. I.
    An information retrieval approach to concept location in source code,
    Proceedings of the 11th Working Conference on Reverse Engineering, 2004
    To do FLT, we do those three steps sequentially:
    1. Train the model
    2. Infer the thematic structure of the code
    3. With a user query, find the source code most related to the query
    >>> Simple, and effective.
    But there is a subtle and mostly overlooked problem with this.
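
The third step, finding the code most related to a query, comes down to comparing topic mixtures. A minimal sketch, assuming the topic mixtures have already been inferred; the file names, the numbers, and the choice of cosine similarity are illustrative assumptions, not something this talk prescribes:

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic mixtures of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_documents(query_topics, doc_topics):
    """Return document names, most similar to the query first."""
    scored = [(cosine(query_topics, mix), name) for name, mix in doc_topics.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [name for _, name in scored]


# Hypothetical topic mixtures from step 2 (one per source file).
doc_topics = {
    "Parser.java": [0.8, 0.1, 0.1],
    "Renderer.java": [0.1, 0.8, 0.1],
    "Network.java": [0.1, 0.1, 0.8],
}
# Hypothetical topic mixture inferred for the user's query.
query_topics = [0.7, 0.2, 0.1]

ranking = rank_documents(query_topics, doc_topics)
```

The ranked list handed back to the user is simply `ranking`; the further down the relevant file sits, the more the user has to read.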

  16. [Diagram: Snapshot A]
    Here’s the problem. For step one, we take a snapshot of the source code at some point in time, such as a software release.
    BUT, as soon as the source code changes, our model becomes outdated.
    By the time the next release comes around, the model might make zero sense for the source code it represents.
    >>> How do we deal with this problem?

  18. [Diagram: Snapshot A, A+1, A+2, A+3, Snapshot B]
    We can’t just rebuild a model whenever a new release comes around. That’s not practical if you’re actually developing the software.
    The model is only useful at two points in time, and there’s a lot of work going on in between.
    Most research thus far has overlooked the practicality of the approach, which makes it unreasonable for IDE integration.

  19. [Diagram: Snapshot A, A+1, A+2, A+3, Snapshot B]
    The naive solution is to just rebuild the model every time the code changes.
    (not a good idea): building models from scratch can take hours (for thousands of LOC) to days (for millions of LOC).
    So, we need to be clever…
    We can’t change the fact that software is going to change… if we could, this field wouldn’t be this active!
    >>> What about the models? Let’s look closer…

  21. [Diagram: Snapshot A, A+1, A+2, A+3, Snapshot B]
    Rao, S.
    Incremental update framework for efficient retrieval from software libraries for bug localization,
    Purdue University, 2013
    - A clever approach was introduced by Shivani Rao.
    - She extended existing TMs so that they can keep the model approximately up-to-date.
    - However, she notes that even her extensions will need rebuilding occasionally.
    - This is the best thing I’ve seen so far. If you’re doing FLT work, read her work.

  24. Rao, S.
    Incremental update framework for efficient retrieval from software libraries for bug localization,
    Purdue University, 2013
    I’m not satisfied with this. I think we can do better with less work AND with off-the-shelf topic modeling algorithms, so that we don’t have to extend a TM algorithm for software-specific problems.
    >>> There are a few things about modern topic models that I think can help.

  25. TMs process…
    Hoffman, M.; Bach, F. R. & Blei, D. M.
    Online Learning for Latent Dirichlet Allocation,
    In Lafferty, J.; Williams, C. et al. (eds.), Advances in Neural Information Processing Systems, 2010
    - Basic topic models were designed to process text, like books.
    - Online topic models process streamable data: they can process new data *as* it appears, and do not need to know about the data beforehand.
    - As books are being written, the model is being updated…
    - This allows us to iterate between step 1 and step 2: update the model with new data, then infer things about that new data.
    - A cool property that is underutilized…
    - Online topic models can also process infinite data: the model can be updated continuously until the end of time.
    >>> However, there is a tradeoff for these two properties…
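
Why an online model can keep absorbing new data shows up in the update rule of online LDA (Hoffman et al., cited on this slide): each mini-batch yields a fresh estimate of the topic-word parameters, and that estimate is blended into the current parameters with a decaying weight rho_t = (tau0 + t)^(-kappa). A minimal sketch of only that blending step follows. Computing the per-batch estimate (the variational E-step) is left out, and the function names, default values, and numbers are hypothetical.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Weight given to the t-th mini-batch; kappa in (0.5, 1] makes it decay."""
    return (tau0 + t) ** -kappa


def blend(current, estimate, rho):
    """Move each topic-word parameter a fraction rho toward the new estimate."""
    return [
        [(1.0 - rho) * c + rho * e for c, e in zip(c_row, e_row)]
        for c_row, e_row in zip(current, estimate)
    ]


# Toy topic-word parameters: 2 topics over a 3-word vocabulary.
topics = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# Hypothetical per-batch estimates, one per incoming mini-batch.
batch_estimates = [
    [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]],
    [[3.0, 1.0, 1.0], [1.0, 3.0, 1.0]],
]
for t, estimate in enumerate(batch_estimates):
    topics = blend(topics, estimate, learning_rate(t))
```

Because the weight shrinks as `t` grows, early batches shape the model quickly and later batches nudge it, which is what lets the stream run indefinitely without a full retrain.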

  31. TMs process… immutable data, streamable data, infinite data
    Hoffman, M.; Bach, F. R. & Blei, D. M.
    Online Learning for Latent Dirichlet Allocation,
    In Lafferty, J.; Williams, C. et al. (eds.), Advances in Neural Information Processing Systems, 2010
    - Topic models, even their offline cousins, need immutable data. Hence Shivani Rao’s need to extend the offline versions to compensate for software’s innate mutability.
    - >>> What do we have in software development that represents source code in a streamable, infinite, immutable format?

  32. [Diagram: Snapshot A, A+1, A+2, A+3, Snapshot B]
    Source code repositories!
    The repository contains a recorded history of every state the software has ever been in. I think we should be using this.
    >>> How might we build a model out of a repository?

  35. [Diagram: Snapshot A, A+1, A+2, A+3, Snapshot B]
    diff A..A+1
    diff A+1..A+2
    diff A+2..A+3
    - With the changesets:
    - Changesets are a summary of the work done between two commits. I do not mean the commit message; I mean the textual diff of the source code.
    - As developers make changes, they commit these summaries to the source code history, so we get a stream of immutable data: the changesets!
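
To show what “a changeset as a document” might look like, here is a minimal sketch that turns the diff between two versions of a file into a bag of tokens. It uses Python’s standard `difflib` to produce the diff; keeping only added and removed lines, splitting identifiers on camelCase, and the name `changeset_document` are all assumptions made for the example, not necessarily the preprocessing used in this work.

```python
import difflib
import re

# Runs of capitals (acronyms) or a capitalized/lowercase word inside an identifier.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def tokenize(text):
    """Split source text into lowercase word tokens, breaking camelCase."""
    return [match.lower() for match in TOKEN.findall(text)]


def changeset_document(before, after):
    """Tokens from the lines added or removed between two file versions."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0))
    tokens = []
    for line in diff[2:]:            # the first two lines are the file headers
        if line.startswith("@@"):    # hunk header, carries no source text
            continue
        if line.startswith(("+", "-")):
            tokens.extend(tokenize(line[1:]))
    return tokens


before = "int total = 0;\nreturn total;"
after = "int total = 0;\nlogTotal(total);\nreturn total;"
document = changeset_document(before, after)
```

Each commit yields one such document, and once committed it never changes, which is the immutable, streamable input the model needs.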

  36. I’m saying, let’s use these changesets as input to our model-building step.
    >>> So, here is my proposal for the FLT problem.

  38. Again, here’s the standard-practice approach for FLTs.
    All I want to change is the first step, the learning step.

  39. The proposed approach:
    1. Use the changesets to train the model.
    2. Steps 2 and 3 now use that model, but don’t change otherwise.

  40. Because we are using training data that is in the expected format:
    1. This removes the “one-way road” restriction
    2. We can now update the model with new data, potentially removing the retraining restriction we have with the standard approach.
    >>> So how did I evaluate this?

  43. Dit, B.; Holtzhauer, A.; Poshyvanyk, D. & Kagdi, H.
    “A dataset from change history to support evaluation of software maintenance tasks”
    Mining Software Repositories (MSR), 2013
    Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W.
    “On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization”
    Int’l Conf. on Software Maintenance and Evolution (ICSME), 2014
    We smooshed the two datasets together so that we looked at the systems they have in common. The Dit et al. dataset had traceability links, which were useful.
    Dit et al. had method-level info; Moreno et al. had class-level info.

  44. ArgoUML v0.22
    ArgoUML v0.24
    ArgoUML v0.26.2
    JabRef v2.6
    jEdit v4.3
    muCommander v0.8.5
    The info contained in the datasets was basically this.

  45. [Diagram: A, A+1, A+2, A+3, B]
    We have traceability links between what changed and why it changed for a specific version, aka B. So we know, for the blue query, that what changed was the blue file.
    1. Build a snapshot model for that specific version, aka B.
    2. Build a changeset model using the changesets up to that same version.

  63. [Figure: ranked result list (foo, bar, baz, qux, quux, bletch, thud, grunt, spam, eggs); example ranks 5, 1, 2, …; plot of the ranks annotated “great!” and “extremely bad”]
    We evaluate from the perspective of the user:
    1. Have a “query”
    2. We know which document it relates to
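
The evaluation idea on this slide can be sketched in a few lines: for each query, find the position of the first document known to be relevant, then summarize those positions across queries. The sketch below reuses the placeholder names from the slide. Mean reciprocal rank is one common summary and is an assumption here, as is reusing a single ranking for all three queries (in practice each query produces its own ranking).

```python
def first_relevant_rank(ranking, relevant):
    """1-based position of the first relevant document, or None if absent."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return position
    return None


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over queries; 1.0 means always ranked first."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


ranking = ["foo", "bar", "baz", "qux", "quux",
           "bletch", "thud", "grunt", "spam", "eggs"]

# Three hypothetical queries with their known-relevant documents.
ranks = [
    first_relevant_rank(ranking, {"quux"}),
    first_relevant_rank(ranking, {"foo"}),
    first_relevant_rank(ranking, {"bar", "spam"}),
]
mrr = mean_reciprocal_rank(ranks)
```

Low ranks are the “great!” end of the plot: the user finds what they need near the top of the list.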

  64. ArgoUML v0.22
    [Figure: two panels, class-level and method-level, each comparing Snapshot and Changesets]
    (Explain what the graph means.)
    The tail is much shorter, meaning more results were near the top of the list.

  65. 26
    Snapshot Changesets
    0
    500
    1000
    1500
    2000
    ArgoUML v0.24 class-level
    Snapshot Changesets
    0
    2000
    4000
    6000
    8000
    10000
    12000
    14000
    ArgoUML v0.24 method-level
    ArgoUML v0.24

    View Slide

  66. 27
    [Figure: ArgoUML v0.26.2, Snapshot vs. Changesets; panels: class-level (axis 0 to 2500) and method-level (axis 0 to 18000)]

  67. 28
    [Figure: JabRef v2.6, Snapshot vs. Changesets; panels: class-level (axis 0 to 1400) and method-level (axis 0 to 6000)]

  68. 29
    [Figure: jEdit v4.3, Snapshot vs. Changesets; panels: class-level (axis 0 to 1200) and method-level (axis 0 to 8000)]

  69. 30
    [Figure: muCommander v0.8.5, Snapshot vs. Changesets; panels: class-level (axis 0 to 1800) and method-level (axis 0 to 9000)]

  70. 31
    [Figure: timeline of the project history from A through A+1, A+2, and A+3 to B]
    If we wanted to use changesets in batch, the same way we’ve been using snapshots, we could readily substitute changesets for snapshots.

  71. 32
    [Figure: timeline of the project history from A through A+1, A+2, and A+3 to B]
    - “I haven’t used anything online yet; that previous evaluation was just the current batch approach.”
    - With online mode, we can update the model in real time as work is being done.
    - “How would the approach work if it were used in a real environment (without actually subjecting developers to torturous experimentation)?”
    - What I did was a historical simulation. When we do the blue query, it does not know anything about the data after it.
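    The shape of such a historical simulation can be sketched as a single pass over the history. This is only an illustration under loud assumptions: a plain per-file term-count index stands in for the online topic model, and the class, function, file names, and event format are all hypothetical. The point is the ordering: each query is ranked against a model that has seen only the changesets before it.

```python
from collections import Counter, defaultdict


class OnlineTermIndex:
    """Stand-in for an online topic model: per-file term counts, updated incrementally."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)

    def update(self, changeset):
        # changeset maps a file path to the text changed in that file
        for path, text in changeset.items():
            self.term_counts[path].update(text.lower().split())

    def rank(self, query):
        terms = query.lower().split()
        scores = {path: sum(counts[t] for t in terms)
                  for path, counts in self.term_counts.items()}
        # highest score first; ties broken alphabetically by path
        return sorted(scores, key=lambda path: (-scores[path], path))


def historical_simulation(events):
    """Replay events in order; a query only sees changesets that came before it."""
    model = OnlineTermIndex()
    ranks = []
    for kind, payload in events:
        if kind == "changeset":
            model.update(payload)
        else:  # kind == "query"
            text, target = payload
            ranking = model.rank(text)
            ranks.append(ranking.index(target) + 1 if target in ranking else None)
    return ranks


# Hypothetical history: changesets interleaved with queries.
events = [
    ("changeset", {"Lexer.java": "lexer token char stream",
                   "Parser.java": "parser token tree"}),
    ("query", ("printer format", "Printer.java")),  # Printer.java does not exist yet
    ("changeset", {"Printer.java": "printer output tree format"}),
    ("query", ("printer format", "Printer.java")),
    ("query", ("tree", "Printer.java")),
]

ranks = historical_simulation(events)
print(ranks)
```

    The first query returns no rank because the file it targets is introduced by a later changeset, which is exactly the “does not know anything about the data after it” property.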


  72. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): the first query marker appears]

  73. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): the first query's rank, 5, appears]

  74. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): a second query marker appears; ranks so far: 5]

  75. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): the second query's rank appears; ranks so far: 5, 1]

  76. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): a third query marker appears; ranks so far: 5, 1]

  77. 32
    [Figure: build of the slide 71 timeline (A, A+1, A+2, A+3, B): the third query's rank appears; ranks: 5, 1, 2]

  78. 33
    [Figure: ArgoUML v0.22, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]
    Snapshots and changesets haven’t changed from what I’ve already shown you; I only added this historical evaluation column in green.

  79. 34
    [Figure: ArgoUML v0.24, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]

  80. 35
    [Figure: ArgoUML v0.26.2, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2500) and method-level (axis 0 to 18000)]

  81. 36
    [Figure: JabRef v2.6, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1400) and method-level (axis 0 to 6000)]

  82. 37
    [Figure: jEdit v4.3, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1200) and method-level (axis 0 to 8000)]

  83. 38
    [Figure: muCommander v0.8.5, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1800) and method-level (axis 0 to 9000)]

  84. 39
    [Figure: All systems, Snapshot vs. Changesets vs. Historical; panels: overall class-level (axis 0 to 2500) and overall method-level (axis 0 to 18000)]
    changesets do just as well as snapshots
    historical simulation more accurately captures what is going on

  85. 39
    [Figure: All systems, Snapshot vs. Changesets vs. Historical; panels: overall class-level (axis 0 to 2500) and overall method-level (axis 0 to 18000)]
    Snapshots <= Changesets < Historical

  86. 40
    Future work
    1. Do historical simulation with snapshots
    2. How many changesets are required?
    3. How does this perform for other tasks?
    1. How would a historical simulation for snapshots perform? With changesets I could cheat and do it online instead of taking a naive approach, because the format is correct.
    2. What is the minimum number of changesets?
    3. How would changesets perform for a different task?

  87. Modeling Changeset Topics
    for Feature Location
    Christopher S. Corley (University of Alabama)
    Kelly L. Kashuda (University of Alabama)
    Nicholas A. Kraft (ABB Corporate Research)

  88. # lexer.py
    # author: Christopher S. Corley

    from swindle.lexeme import Lexeme
    from swindle.types import (Types, get_type, PUNCTUATION)
    from io import TextIOWrapper

    class Lexer:
        def __init__(self, fileptr):
            # fileptr is generally a TextIOWrapper when reading from a file
            self.fileptr = fileptr
            self.done = False

            self.comment_mode = False
            self.need_terminator = False

            # To emulate pushing things back to the stream
            self.saved_char = None

            # character is a generator so we can have nice reading things
            # like next(self.character)
            self.character = self.char_generator()
            self.error_msg = 'Could not read char %d on line %d from file.'

        # a convenient way to count line numbers and read things character
        # by character.
        def char_generator(self):
            for self.line_no, line in enumerate(self.fileptr):
                for self.col_no, char in enumerate(line):
                    self.saved_char = None
                    yield char

  89. commit 63bf5d84890bceed42068880f2554d89b6ba10fc
    Author: Christopher Corley
    Date: Fri Oct 26 19:20:50 2012 -0500

        Remove unnecessary newline tokens, now form_list is behaving stupid

    diff --git a/swindle/lexer.py b/swindle/lexer.py
    index ce86687..59f349b 100644
    --- a/swindle/lexer.py
    +++ b/swindle/lexer.py
    @@ -10,10 +10,10 @@ class Lexer:
         def __init__(self, fileptr):
             # fileptr is generally a TextIOWrapper when reading from a file
             self.fileptr = fileptr
    +        self.done = False

    -        self.tokenize_whitespace = False # like python, we tokenize all whitespace
    -        self.whitespace_count = 0
             self.comment_mode = False
    +        self.need_terminator = False

             # To emulate pushing things back to the stream
             self.saved_char = None
    @@ -72,6 +72,7 @@ class Lexer:
             try:
                 c = next(self.character)
             except StopIteration:
    +            self.done = True
                 return None

             return c
    43
    Changesets are program text!
    >>> I’m saying, let’s use these changesets
    as input to our model building step.
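    One way to turn a changeset into input for the model building step is to treat the diff as a bag of terms. The sketch below is an assumption-laden illustration rather than the preprocessing used in the study: the function name is hypothetical, it keeps added, removed, and context lines alike, drops the diff headers, and splits identifiers on case changes and punctuation. Which lines to keep is a design choice the slide does not settle.

```python
import re

# Diff metadata lines that carry no program text. A removed line whose content
# itself starts with "-- " would be misread as a header; that edge case is ignored.
HEADER_PREFIXES = ("diff --git", "index ", "--- ", "+++ ", "@@")

# Matches one word part: "Stop" and "Iteration" in "StopIteration", or an acronym.
WORD_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")


def changeset_terms(diff_text):
    """Return the lowercased terms found in the body lines of a unified diff."""
    terms = []
    for line in diff_text.splitlines():
        if line.startswith(HEADER_PREFIXES):
            continue
        # Drop the leading "+", "-", or " " marker of a diff body line.
        body = line[1:] if line[:1] in ("+", "-", " ") else line
        terms.extend(part.lower() for part in WORD_PART.findall(body))
    return terms


# A fragment of the changeset shown on the slide.
sample = """diff --git a/swindle/lexer.py b/swindle/lexer.py
--- a/swindle/lexer.py
+++ b/swindle/lexer.py
@@ -72,6 +72,7 @@ class Lexer:
         except StopIteration:
+            self.done = True
             return None
"""

terms = changeset_terms(sample)
print(terms)
```

    The resulting term list has the same form as the terms extracted from a source file, which is what lets a changeset serve as a training document.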


  90. 44
    [Figure: ArgoUML v0.22, Snapshot vs. Changesets; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]
    explain what the graph means

  91. 45
    [Figure: ArgoUML v0.24, Snapshot vs. Changesets; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]

  92. 46
    [Figure: ArgoUML v0.26.2, Snapshot vs. Changesets; panels: class-level (axis 0 to 2500) and method-level (axis 0 to 18000)]

  93. 47
    [Figure: JabRef v2.6, Snapshot vs. Changesets; panels: class-level (axis 0 to 1400) and method-level (axis 0 to 6000)]

  94. 48
    [Figure: jEdit v4.3, Snapshot vs. Changesets; panels: class-level (axis 0 to 1200) and method-level (axis 0 to 8000)]

  95. 49
    [Figure: muCommander v0.8.5, Snapshot vs. Changesets; panels: class-level (axis 0 to 1800) and method-level (axis 0 to 9000)]

  96. 50
    [Figure: ArgoUML v0.22, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]

  97. 51
    [Figure: ArgoUML v0.24, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2000) and method-level (axis 0 to 14000)]

  98. 52
    [Figure: ArgoUML v0.26.2, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 2500) and method-level (axis 0 to 18000)]

  99. 53
    [Figure: JabRef v2.6, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1400) and method-level (axis 0 to 6000)]
    What does it mean?

  100. 54
    [Figure: jEdit v4.3, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1200) and method-level (axis 0 to 8000)]

  101. 55
    [Figure: muCommander v0.8.5, Snapshot vs. Changesets vs. Historical; panels: class-level (axis 0 to 1800) and method-level (axis 0 to 9000)]

  102. 56
    [Figure: All systems, Snapshot vs. Changesets vs. Historical; panels: overall class-level (axis 0 to 2500) and overall method-level (axis 0 to 18000)]
    changesets do just as well as snapshots
    historical simulation more accurately captures what is going on