
Modeling Changeset Topics for Feature Location

Feature location is a program comprehension activity in which a developer inspects source code to locate the classes or methods that implement a feature of interest. Many feature location techniques (FLTs) are based on text retrieval models, and in such FLTs it is typical for the models to be trained on source code snapshots. However, source code evolution leads to model obsolescence and thus to the need to retrain the model from the latest snapshot. In this paper, we introduce a topic-modeling-based FLT in which the model is built incrementally from source code history. By training an online learning algorithm using changesets, the FLT maintains an up-to-date model without incurring the non-trivial computational cost associated with retraining traditional FLTs. Overall, we studied over 1,200 defects and features from 14 open-source Java projects. We also present a historical simulation that demonstrates how the FLT performs as a project evolves. Our results indicate that the accuracy of a changeset-based FLT is similar to that of a snapshot-based FLT, but without the retraining costs.

Christopher Corley

September 29, 2015

Transcript

  1. Modeling Changeset Topics for Feature Location. Christopher S. Corley, Kelly L. Kashuda, Nicholas A. Kraft (University of Alabama; ABB Corporate Research)

    This talk uses a lot of feature location technique (FLT) jargon, and the problem I'm trying to address is subtle, so it is easier to see if we go through everything. There are two takeaways:

    1. A way to use topic models on something other than source code, such as changesets, for FLT.
    2. A look into more accurate FLT evaluations.
  2. [Diagram: a user searching a body of source code]

    We have a user who is looking for something in a big mess of source code. Our objective is to search on their behalf and return a ranked list in which the top documents are the ones they are looking for. We sometimes use topic modeling to accomplish this.
  4. [Diagram: a collection of documents grouped by topic]

    Topic modeling takes a collection of immutable documents, such as books and newspaper articles, and discovers the underlying themes in the text. Using machine learning, we categorize documents by the topics they contain. >>> There are three basic steps to using a topic model.
  6. [Diagram: step 1, documents go into a learning algorithm that produces a model]

    Step one is the learning step:

    - Take a collection of documents.
    - Feed them to a machine learning algorithm.
    - >>> The algorithm produces a trained model that we can use in the other steps.
  8. [Diagram: step 2, the trained model is applied to the same documents]

    Step two:

    - Take this trained model.
    - Use the same constructs and the same documents again.
    - Infer the thematic structure of those documents with the model held fixed.
  12. [Diagram: step 3, the trained model is applied to unseen documents]

    Step three is the special one:

    - We can take unseen documents, that is, documents we did not use in training.
    - We can infer their thematic structure.

    That is a quick overview of how topic models work. >>> How do we use topic models for feature location?
  15. [Diagram: the three steps applied to source code] Marcus, A.; Sergeyev, A.; Rajlich, V. & Maletic, J. I. An information retrieval approach to concept location in source code. Proceedings of the 11th Working Conference on Reverse Engineering, 2004.

    To do feature location, we perform those three steps in sequence:

    1. Train the model.
    2. Infer the thematic structure of the code.
    3. Given a user query, find the source code most related to the query.

    >>> Simple and effective. But there is a subtle and mostly overlooked problem with this.
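The three steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a plain TF-IDF vector model stands in for the topic model, and the file names, file contents, and function names (`train`, `infer`, `rank`) are made up for the example rather than taken from the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def train(documents):
    """Step 1: learn a model from the corpus (here, only IDF weights)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(tokenize(text)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def infer(model, text):
    """Step 2: represent a document (or a query) as a weighted term vector."""
    counts = Counter(tokenize(text))
    return {t: c * model[t] for t, c in counts.items() if t in model}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(model, index, query):
    """Step 3: order the indexed documents by similarity to the query."""
    q = infer(model, query)
    return sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)


# Hypothetical snapshot of three source files, reduced to their terms.
corpus = {
    "Lexer.java": "read next character token from stream",
    "Parser.java": "parse token list build syntax tree",
    "Printer.java": "print syntax tree to output stream",
}
model = train(corpus)                                                # step 1
index = {name: infer(model, text) for name, text in corpus.items()}  # step 2
results = rank(model, index, "parse tokens into a tree")             # step 3
print(results)
```

Note that the model and the index are both derived from the same snapshot, which is exactly the coupling the following slides question.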
  16. [Diagram: timeline from Snapshot A through revisions A+1, A+2, A+3, … to Snapshot B; the model is trained on Snapshot A]

    Here is the problem. For step one, we take a snapshot of the source code at some point in time, such as a software release. But as soon as the source code changes, our model is outdated. By the time the next release comes around, the model might make zero sense for the source code it represents. >>> How do we deal with this problem?
  18. [Diagram: the same timeline, with one model trained on Snapshot A and another on Snapshot B]

    We can't just rebuild the model whenever a new release comes around. That is not practical if you are actually developing the software: the model is only useful at two points in time, and a lot of work goes on in between. Most research so far has overlooked this practicality issue, which makes the approach unreasonable for IDE integration.
  19. [Diagram: the same timeline, with a model rebuilt from scratch at every revision] (not a good idea)

    The naive solution is to rebuild the model every time. That is not a good idea: building a model from scratch can take hours (for thousands of LOC) to days (for millions of LOC). So we need to be clever. We can't change the fact that software is going to change. If we could, this field wouldn't be this active! >>> What about the models? Let's look closer.
  21. [Diagram: the same timeline, with the model from Snapshot A updated approximately at each revision and rebuilt at Snapshot B] Rao, S. Incremental update framework for efficient retrieval from software libraries for bug localization. Purdue University, 2013.

    - Shivani Rao introduced a clever approach.
    - She extended existing topic models (TMs) so that they can keep the model approximately up to date.
    - However, she notes that even her extensions will need rebuilding occasionally.
    - This is the best thing I have seen so far. If you are doing FLT work, read her work.
  24. [Diagram: the same timeline as the previous slide] Rao, S. Incremental update framework for efficient retrieval from software libraries for bug localization. Purdue University, 2013.

    I'm not satisfied with this. I think we can do better with less work, and with off-the-shelf topic modeling algorithms, so that we don't have to extend a TM algorithm for software-specific problems. >>> There are a few things about modern topic models that I think can help.
  25. Online TMs process… streamable data, infinite data. Hoffman, M.; Bach, F. R. & Blei, D. M. Lafferty, J.; Williams, C. Online Learning for Latent Dirichlet Allocation, Advances in Neural Information Processing Systems, 2010

    - Basic topic models are designed to process text, like books.
    - Online topic models process streamable data: they can process new data *as* it appears and do not need to know about the data beforehand.
    - As books are being written, the model is being updated.
    - This lets us iterate between step 1 and step 2: update the model with new data, then infer things about that new data.
    - This is a cool property that is underutilized.
    - Online topic models can process infinite data: they can be updated continuously until the end of time.

    >>> However, there is a tradeoff to these two properties.
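The streaming property can be made concrete with the update rule from the cited online LDA paper: after each mini-batch, the topic-word weights move toward the estimate from that batch by a step `rho_t = (tau0 + t) ** -kappa`, which shrinks over time. The sketch below shows only that blending step. The per-batch estimate (the expensive inference part) is supplied as made-up numbers, and the function names are hypothetical.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Weight given to mini-batch number t: rho_t = (tau0 + t) ** -kappa."""
    return (tau0 + t) ** -kappa


def blend(current, batch_estimate, rho):
    """Move every topic-word weight a fraction rho toward the batch estimate."""
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(current, batch_estimate)
    ]


# One topic over a two-word vocabulary; the batch estimates are invented inputs.
topics = [[1.0, 1.0]]
stream = [[[3.0, 1.0]], [[1.0, 5.0]]]
for t, estimate in enumerate(stream):
    topics = blend(topics, estimate, learning_rate(t, tau0=1.0, kappa=0.5))
print(topics)
```

Because each update touches only the current mini-batch, the model never needs the earlier documents again, which is what allows it to follow a stream that has no end.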
  31. Online TMs process… immutable data, streamable data, infinite data. Hoffman, M.; Bach, F. R. & Blei, D. M. Lafferty, J.; Williams, C. Online Learning for Latent Dirichlet Allocation, Advances in Neural Information Processing Systems, 2010

    - Topic models, even their offline cousins, need immutable data. Hence Shivani Rao's need to extend the offline versions to compensate for software's innate mutability.
    - >>> What do we have in software development that represents source code in a streamable, infinite, immutable format?
  32. [Diagram: timeline from Snapshot A through revisions A+1, A+2, A+3 to Snapshot B]

    Source code repositories! The repository contains a recorded history of every state the software has ever been in. I think we should be using this. >>> How might we build a model out of a repository?
  33. [Diagram: the same timeline, with a changeset between each pair of revisions: diff A..A+1, diff A+1..A+2, diff A+2..A+3]

    - With the changesets.
    - A changeset is a summary of the work done between two commits. I do not mean the commit message. I mean the textual diff of the source code.
    - As developers make changes, they commit these summaries to the source code history, so we get a stream of immutable data. That stream is the changesets!
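To treat a changeset as a document, the textual diff has to be reduced to terms. A minimal sketch, assuming unified diff input and assuming that added, removed, and context lines are all kept (the talk does not say which line types were used). The function name `changeset_terms` is hypothetical, and the sample is an excerpt of the diff shown on the deck's last slide.

```python
import re


def changeset_terms(diff_text):
    """Return the lowercase terms found on the hunk lines of a unified diff."""
    terms = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False          # file headers follow; not program text
        elif line.startswith("@@"):
            in_hunk = True           # hunk header; the hunk body starts next
        elif in_hunk and line[:1] in ("+", "-", " "):
            # Drop the one-character diff marker, keep the program text.
            terms.extend(t.lower() for t in re.findall(r"[A-Za-z]+", line[1:]))
    return terms


sample = "\n".join([
    "diff --git a/swindle/lexer.py b/swindle/lexer.py",
    "index ce86687..59f349b 100644",
    "--- a/swindle/lexer.py",
    "+++ b/swindle/lexer.py",
    "@@ -72,6 +72,7 @@ class Lexer:",
    "         except StopIteration:",
    "+            self.done = True",
    "             return None",
])
print(changeset_terms(sample))
```

Tracking whether the parser is inside a hunk avoids mistaking a removed line that begins with `--` for a `---` file header.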
  36. [Diagram: the same timeline, with the changesets feeding a model]

    I'm saying, let's use these changesets as the input to our model-building step. >>> So, here is my proposal for the FLT problem.
  38. [Diagram: the standard three-step FLT pipeline]

    Again, here is the standard-practice approach for FLTs. All I want to change is the first step, the learning step.
  39. [Diagram: the same pipeline, with changesets as the training input in step 1]

    1. Use the changesets to train the model.
    2. Steps 2 and 3 now receive that model, but do not otherwise change.
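The change to step 1 can be sketched as follows: the model is trained only on changeset text, while the documents that get indexed and searched are still the files of a snapshot. As before, simple term weights stand in for a topic model, and the changesets, file names, and query are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def train(changesets):
    """Step 1 (changed): learn term weights from changeset text only."""
    doc_freq = Counter()
    for text in changesets:
        doc_freq.update(set(tokenize(text)))
    n = len(changesets)
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def score(model, file_text, query):
    """Steps 2 and 3 (unchanged): compare one snapshot file with the query."""
    file_terms = set(tokenize(file_text))
    return sum(model.get(t, 0.0) for t in set(tokenize(query)) if t in file_terms)


# Hypothetical training stream (changesets) and search space (snapshot files).
changesets = [
    "add done flag to lexer",
    "remove whitespace count from lexer",
    "parser builds form list",
]
snapshot = {
    "lexer.py": "class lexer done flag saved char",
    "parser.py": "class parser form list",
}
model = train(changesets)
query = "form list behaves badly"
best = max(snapshot, key=lambda name: score(model, snapshot[name], query))
print(best)
```

The training corpus and the searched corpus are now different things, which is the whole point: the first can keep growing without the second being re-read.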
  40. [Diagram: the changeset-trained pipeline, with new changesets flowing into the existing model]

    Because we are using training data that is in the expected format:

    1. This removes the "one-way road" restriction.
    2. We can now update the model with new data, potentially removing the retraining restriction of the standard approach.

    >>> So how did I evaluate this?
  43. Dit, B.; Holtzhauer, A.; Poshyvanyk, D. & Kagdi, H. "A dataset from change history to support evaluation of software maintenance tasks." Mining Software Repositories (MSR), 2013. Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W. "On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization." Int'l Conf. on Software Maintenance and Evolution (ICSME), 2014.

    I merged the two datasets so that we looked at the systems they have in common. The Dit et al. dataset had traceability links, which were useful. Dit et al. had method-level information, and Moreno et al. had class-level information.
  44. ArgoUML v0.22, ArgoUML v0.24, ArgoUML v0.26.2, JabRef v2.6, jEdit v4.3, muCommander v0.8.5

    The systems covered by the datasets were basically these.
  45. [Diagram: timeline from A through A+1, A+2, A+3 to B, with queries linked to the changes they caused; one model is built from snapshot B and one from the changesets up to B]

    We have traceability links between what changed and why it changed for a specific version, call it B. So we know, for the blue query, that what changed was the blue file.

    1. Build a snapshot model for that specific version, B.
    2. Build a changeset model using the changesets up to that same version.
  50. [Diagram: each query returns a ranked list (foo, bar, baz, qux, quux, bletch, thud, grunt, spam, eggs), and we record the rank of the document the query relates to (5, 1, 2, …). The recorded ranks are then plotted as a distribution, annotated "great!" for ranks near the top of the list and "extremely bad" for ranks far down it.]

    We evaluate from the perspective of the user:

    1. We have a "query".
    2. We know which document it relates to.
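The measurement on this slide, the rank of the document each query is known to relate to, can be sketched as below. The ranked list and the ranks 5, 1, 2 come from the slide. Reusing one ranked list for all three queries and summarizing with mean reciprocal rank are simplifying assumptions for the example, and the function names are hypothetical.

```python
def first_relevant_rank(ranked, relevant):
    """Return the 1-based position of the first relevant document, or None."""
    for position, name in enumerate(ranked, start=1):
        if name in relevant:
            return position
    return None


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a query with no hit contributes 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Ranked list from the slide; in practice every query produces its own list.
ranked = "foo bar baz qux quux bletch thud grunt spam eggs".split()
goals = [{"quux"}, {"foo"}, {"bar"}]   # the document each query relates to
ranks = [first_relevant_rank(ranked, goal) for goal in goals]
mrr = mean_reciprocal_rank(ranks)
print(ranks, round(mrr, 3))
```

Plotting the collected ranks per system gives distributions like the ones on the following slides, where a short tail means most goal documents were found near the top of the list.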
  64. [Chart: ArgoUML v0.22, Snapshot vs. Changesets, class-level and method-level panels]

    Explain what the graph means. The tail is much shorter, meaning more results were near the top of the list.
  65. [Chart: ArgoUML v0.24, Snapshot vs. Changesets, class-level and method-level panels]
  66. [Chart: ArgoUML v0.26.2, Snapshot vs. Changesets, class-level and method-level panels]
  67. [Chart: JabRef v2.6, Snapshot vs. Changesets, class-level and method-level panels]
  68. [Chart: jEdit v4.3, Snapshot vs. Changesets, class-level and method-level panels]
  69. [Chart: muCommander v0.8.5, Snapshot vs. Changesets, class-level and method-level panels]
  70. [Diagram: timeline from A to B, with one model built from snapshot B and one from the changesets up to B]

    If we wanted to use changesets in batch, the same way we have been using snapshots, then we could readily substitute changesets for snapshots.
  71. [Diagram: timeline from A to B; the model is updated one changeset at a time, and each query is run and ranked (5, 1, 2) at the point in history where it occurs]

    - So far I have not used anything online. The previous evaluation was just the current batch approach.
    - In online mode, we can update the model in real time as work is being done.
    - How would the approach work if it were used in a real environment (without actually subjecting developers to torturous experimentation)?
    - What I did was a historical simulation. When we run the blue query, the model does not know anything about the data that comes after it.
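The ordering constraint of the historical simulation can be sketched as a loop over the commit history: a query attached to a commit is answered with whatever the model has learned so far, and only then is that commit's changeset learned. The sketch assumes each query is issued before the changeset linked to it is applied. `OnlineModel` is a hypothetical stand-in that only counts terms, and the history is invented.

```python
import re
from collections import Counter


def terms(text):
    """Lowercase a string and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


class OnlineModel:
    """Stand-in for an online topic model: keeps running term counts only."""

    def __init__(self):
        self.counts = Counter()
        self.seen = 0

    def update(self, changeset_text):
        """Learn from one more changeset without revisiting earlier ones."""
        self.counts.update(terms(changeset_text))
        self.seen += 1


def simulate(history, model):
    """Replay commits in order; answer each query before learning its commit."""
    log = []
    for commit in history:
        query = commit.get("query")
        if query is not None:
            known = [t for t in terms(query) if model.counts[t] > 0]
            log.append((commit["id"], model.seen, known))
        model.update(commit["changeset"])
    return log


history = [
    {"id": "A+1", "changeset": "add lexer token stream"},
    {"id": "A+2", "changeset": "fix parser error recovery",
     "query": "parser crashes on bad token"},
    {"id": "A+3", "changeset": "add printer output",
     "query": "printer output missing parser error"},
]
log = simulate(history, OnlineModel())
for entry in log:
    print(entry)
```

At A+2 the model has seen one changeset and recognizes only `token`. The terms introduced by the A+2 changeset itself become known only for the query at A+3, so no query ever benefits from data that comes after it.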
  78. [Chart: ArgoUML v0.22, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]

    The snapshot and changeset results have not changed from what I have already shown you. I have only added the historical evaluation column, in green.
  79. [Chart: ArgoUML v0.24, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]
  80. [Chart: ArgoUML v0.26.2, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]
  81. [Chart: JabRef v2.6, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]
  82. [Chart: jEdit v4.3, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]
  83. [Chart: muCommander v0.8.5, Snapshot vs. Changesets vs. Historical, class-level and method-level panels]
  84. [Chart: Overall (all systems), Snapshot vs. Changesets vs. Historical, class-level and method-level panels] Snapshots <= Changesets < Historical

    Changesets do just as well as snapshots. The historical simulation more accurately captures what is going on.
  86. Future work: 1. Do a historical simulation with snapshots. 2. How many changesets are required? 3. How does this perform for other tasks?

    1. How would a historical simulation for snapshots perform? With changesets I could cheat and do it online instead of taking a naive approach, because the format is correct. 2. What is the minimum number of changesets? 3. How would changesets perform for a different task?
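
To make the "do it online" remark concrete, here is a minimal sketch of a historical simulation loop. It is not the implementation from the talk. Everything named here (`OnlineTermModel`, `rank`, `historical_simulation`, the event tuples) is hypothetical, and the model is a toy document-frequency counter standing in for an online topic model. The only point is the shape of the loop: changesets are folded into the model one at a time in commit order, and each query is answered with whatever the model has seen so far, with no retraining.

```python
import math
from collections import Counter


class OnlineTermModel:
    """Toy stand-in for an online topic model (assumption, not the talk's model).

    It only tracks in how many changesets each term has appeared.
    """

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()

    def update(self, tokens):
        # Incremental update: one changeset at a time, never retrained.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def weight(self, term):
        # Smoothed inverse document frequency; rarer terms weigh more.
        return math.log((1 + self.num_docs) / (1 + self.doc_freq[term])) + 1.0


def rank(model, query, files):
    """Rank files ({name: tokens}) against a query, best match first."""
    scores = {}
    for name, tokens in files.items():
        counts = Counter(tokens)
        scores[name] = sum(counts[t] * model.weight(t) for t in set(query))
    # Ties are broken by file name so the ordering is deterministic.
    return sorted(scores, key=lambda n: (-scores[n], n))


def historical_simulation(events, model=None):
    """Replay events in commit order.

    Each event is ("changeset", tokens) or ("query", query_tokens, files, goldset).
    Returns the best rank of any goldset file for each query.
    """
    if model is None:
        model = OnlineTermModel()
    best_ranks = []
    for event in events:
        if event[0] == "changeset":
            model.update(event[1])
        else:
            _, query, files, goldset = event
            ranking = rank(model, query, files)
            best_ranks.append(min(ranking.index(g) + 1 for g in goldset))
    return best_ranks
```

A snapshot-based version of the same loop would have to rebuild the model from the full source tree before every query, which is the naive approach the note refers to.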
  87. Modeling Changeset Topics for Feature Location. Christopher S. Corley and Kelly L. Kashuda (University of Alabama), Nicholas A. Kraft (ABB Corporate Research)
  88. # lexer.py
      # author: Christopher S. Corley

      from swindle.lexeme import Lexeme
      from swindle.types import (Types, get_type, PUNCTUATION)
      from io import TextIOWrapper

      class Lexer:
          def __init__(self, fileptr):
              # fileptr is generally a TextIOWrapper when reading from a file
              self.fileptr = fileptr
              self.done = False

              self.comment_mode = False
              self.need_terminator = False

              # To emulate pushing things back to the stream
              self.saved_char = None

              # character is a generator so we can have nice reading things
              # like next(self.character)
              self.character = self.char_generator()
              self.error_msg = 'Could not read char %d on line %d from file.'

          # a convenient way to count line numbers and read things character
          # by character.
          def char_generator(self):
              for self.line_no, line in enumerate(self.fileptr):
                  for self.col_no, char in enumerate(line):
                      self.saved_char = None
                      yield char
  89. commit 63bf5d84890bceed42068880f2554d89b6ba10fc
      Author: Christopher Corley <[email protected]>
      Date:   Fri Oct 26 19:20:50 2012 -0500

          Remove unnecessary newline tokens, now form_list is behaving stupid

      diff --git a/swindle/lexer.py b/swindle/lexer.py
      index ce86687..59f349b 100644
      --- a/swindle/lexer.py
      +++ b/swindle/lexer.py
      @@ -10,10 +10,10 @@ class Lexer:
           def __init__(self, fileptr):
               # fileptr is generally a TextIOWrapper when reading from a file
               self.fileptr = fileptr
      +        self.done = False

      -        self.tokenize_whitespace = False # like python, we tokenize all whitespace
      -        self.whitespace_count = 0
               self.comment_mode = False
      +        self.need_terminator = False

               # To emulate pushing things back to the stream
               self.saved_char = None
      @@ -72,6 +72,7 @@ class Lexer:
               try:
                   c = next(self.character)
               except StopIteration:
      +            self.done = True
                   return None

               return c

    Changesets are program text! I'm saying, let's use these changesets as input to our model-building step.
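
As a rough illustration of treating a changeset as program text, the sketch below turns a unified diff like the one on this slide into a bag of words. It is a sketch under assumptions, not the preprocessing used in the talk: the function names are hypothetical, the choice to keep added, removed, and context lines alike is an assumption, and the commit message and diff headers are simply skipped.

```python
import re

# Matches acronym runs ("IO") or single words ("Text", "wrapper"), so
# camelCase and snake_case identifiers are both split into their parts.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifiers(text):
    """Split program text into lowercase word tokens."""
    return [w.lower() for w in _WORD.findall(text)]


def changeset_to_tokens(diff_text, keep=("+", "-", " ")):
    """Turn a unified diff into a list of tokens.

    Assumption: by default, added ("+"), removed ("-"), and context (" ")
    lines are all kept. Everything outside a hunk (commit message, file
    headers) is ignored.
    """
    tokens = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section starts with headers
            continue
        if line.startswith("@@"):
            in_hunk = True  # hunk header; the lines after it are content
            continue
        if not in_hunk or not line:
            continue
        marker, body = line[0], line[1:]
        if marker in keep:
            tokens.extend(split_identifiers(body))
    return tokens
```

Each token list can then be fed to the model-building step as one document, the same way a source file would be. Passing `keep=("+",)` restricts the document to added lines only, which is one of several variations worth comparing.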
  90. [Figure: ArgoUML v0.22, class-level and method-level panels; series: Snapshot, Changesets]

    Explain what the graph means.
  91. [Figure: ArgoUML v0.24, class-level and method-level panels; series: Snapshot, Changesets]
  92. [Figure: ArgoUML v0.26.2, class-level and method-level panels; series: Snapshot, Changesets]
  93. [Figure: JabRef v2.6, class-level and method-level panels; series: Snapshot, Changesets]
  94. [Figure: jEdit v4.3, class-level and method-level panels; series: Snapshot, Changesets]
  95. [Figure: muCommander v0.8.5, class-level and method-level panels; series: Snapshot, Changesets]
  96. [Figure: ArgoUML v0.22, class-level and method-level panels; series: Snapshot, Changesets, Historical]
  97. [Figure: ArgoUML v0.24, class-level and method-level panels; series: Snapshot, Changesets, Historical]
  98. [Figure: ArgoUML v0.26.2, class-level and method-level panels; series: Snapshot, Changesets, Historical]
  99. [Figure: JabRef v2.6, class-level and method-level panels; series: Snapshot, Changesets, Historical]

    What does it mean?
  100. [Figure: jEdit v4.3, class-level and method-level panels; series: Snapshot, Changesets, Historical]
  101. [Figure: muCommander v0.8.5, class-level and method-level panels; series: Snapshot, Changesets, Historical]
  102. [Figure: Overall, class-level and method-level panels; series: Snapshot, Changesets, Historical]

    Changesets do just as well as snapshots. The historical simulation more accurately captures what is going on.