Modeling Changeset Topics for Feature Location

Christopher S. Corley and Kelly L. Kashuda (University of Alabama), Nicholas A. Kraft (ABB Corporate Research)
In this talk there’s a lot of FLT (feature location technique) jargon, and the problem I’m trying to address is pretty subtle and is easier to see if we go through everything.
1. A way we can use topic models on something other than source code, such as changesets, for FLT
2. A look into more accurate FLT evaluations

2 We have a user, and they are looking for something in a big mess of source code. Our objective is to search on their behalf and give them a ranked list in which the top document(s) are the ones they are looking for.
We sometimes use topic modeling to accomplish this.

3 Topic modeling is taking a bunch of immutable documents, such as books and newspaper articles, and discovering the underlying themes in the text.
- We categorize documents by which topics they contain, using machine learning
>>> Using a topic model takes 3 basic steps

4 Step one is the learning step:
- Take a bunch of documents
- Put them into a machine learning algorithm
- >>> The algorithm produces a trained model which we can use for the other steps.

5 Step two:
- Take this trained model
- Using the same constructs
- AND the same documents again
- Infer the thematic structure of the documents with that model being held fixed

6 Step 3, which is special:
- We can take UNSEEN documents, documents we didn’t use in training
- Infer their thematic structure
That’s a quick overview of how topic models work.
>>> How do we use topic models for feature location?

7 Marcus, A.; Sergeyev, A.; Rajlich, V. & Maletic, J. I. An information retrieval approach to concept location in source code, Proceedings of the 11th Working Conference on Reverse Engineering, 2004
To do FLT, we do those three steps sequentially:
1. Train the model
2. Infer the thematic structure of the code
3. With a user query, find the source code most related to the query
>>> Simple, and effective. But there is a mostly overlooked and subtle problem with this.
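
The third step can be sketched in a few lines. This is only an illustration, not the implementation behind the talk: the topic distributions, the file names, and the choice of Hellinger distance as the similarity measure are all assumptions made for the example. In practice, steps 1 and 2 would produce the distributions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def rank_documents(query_topics, doc_topics):
    """Return document names sorted from most to least similar to the query."""
    return sorted(doc_topics, key=lambda name: hellinger(query_topics, doc_topics[name]))


# Hypothetical output of step 2: one topic distribution per source file.
doc_topics = {
    "Lexer.java": [0.80, 0.15, 0.05],
    "Parser.java": [0.30, 0.60, 0.10],
    "Gui.java": [0.05, 0.10, 0.85],
}

# Hypothetical topic distribution inferred for the user's query text.
query_topics = [0.70, 0.20, 0.10]

ranking = rank_documents(query_topics, doc_topics)
print(ranking)
```

The file whose topic mixture is closest to the query's comes first, which is the ranked list handed back to the user.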

8 Here’s the problem. For step one, we take a snapshot of the source code at some point in time, such as a software release (Snapshot A). BUT as soon as a change to the source code happens, our model becomes outdated. By the time the next release comes around, the model might make zero sense for the source code it represents.
>>> How do we deal with this problem?

9 We can’t just rebuild a model whenever a new release comes around. That’s not practical if you’re actually developing the software. The model is only useful at two points in time (Snapshot A and Snapshot B), and there’s a lot of work going on in between (A+1, A+2, A+3, …). Most research thus far has overlooked the practicality of the approach, making it unreasonable for IDE integration.

10 The naive solution is to just rebuild the model every time (not a good idea): building models from scratch can take hours (for thousands of LOC) to days (for millions of LOC).
So, we need to be clever. We can’t change the fact that software is going to change. If we could, this field wouldn’t be this active!
>>> What about the models? Let’s look closer…

11 Rao, S. Incremental update framework for efficient retrieval from software libraries for bug localization, Purdue University, 2013
- A clever way was introduced by Shivani Rao.
- She has extended existing TMs so that they can keep the model approximately up to date.
- However, she notes that even her extensions will need rebuilding occasionally.
- This is the best thing I’ve seen so far. If you’re doing FLT work, read her work.

12 I’m not satisfied with this. I think we can do better with less work AND with off-the-shelf topic modeling algorithms, so that we don’t have to extend a new TM algorithm for software-specific problems.
>>> There are a few things about modern topic models that I think can help.

13 TMs process… streamable data, infinite data (Online)
Hoffman, M.; Bach, F. R. & Blei, D. M. Online Learning for Latent Dirichlet Allocation, in Lafferty, J.; Williams, C. (eds.), Advances in Neural Information Processing Systems, 2010
- Basic topic models are designed to process text, like books.
- Online topic models process streamable data: they can process new data *as* it appears, and do not need to know about the data beforehand.
- As books are being written, the model is being updated…
- This allows us to iterate between step 1 and step 2: update the model with new data, then infer things about that new data.
- Cool property that is underutilized…
- Online topic models process infinite data: they can be continuously updated until the end of time.
>>> However, the tradeoff of these two properties is…
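
One way to see why an online model never needs the whole corpus up front is the update rule from the cited online LDA paper: after each mini-batch, the topic-word parameters become a weighted blend of their old values and an estimate computed from the new batch alone, with a step size rho_t = (tau0 + t)^(-kappa) that shrinks over time. The sketch below shows only that blending step. The batch estimates are made-up numbers standing in for what the model would compute from each batch, and the `tau0` and `kappa` values are illustrative.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Weight given to mini-batch number t (0-based); decays as t grows."""
    return (tau0 + t) ** (-kappa)


def blend(old, batch_estimate, rho):
    """Move each parameter a fraction rho of the way toward the batch estimate."""
    return [(1.0 - rho) * o + rho * b for o, b in zip(old, batch_estimate)]


# Topic-word parameters for one topic over a 3-word vocabulary (made-up values).
params = [1.0, 1.0, 1.0]

# Stand-ins for per-batch estimates; a real model would compute these
# from each incoming mini-batch of documents.
batch_estimates = [
    [4.0, 1.0, 1.0],
    [4.0, 2.0, 0.0],
    [5.0, 1.0, 0.0],
]

for t, estimate in enumerate(batch_estimates):
    rho = step_size(t)
    params = blend(params, estimate, rho)
    print(t, round(rho, 3), [round(p, 3) for p in params])
```

Because each update touches only the current parameters and the newest batch, the stream of batches can in principle go on indefinitely.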

14 TMs process… immutable data, streamable data, infinite data
- Topic models, even the offline cousins, need immutable data. Hence Shivani Rao’s need to extend the offline versions to compensate for software’s innate mutability.
- >>> What do we have in software development that represents source code in a streamable, infinite, immutable format?

15 Source code repositories! The repository contains a recorded history of every state the software has ever been in. I think we should be using this.
>>> How might we build a model out of a repository?

16 With the changesets:
- Changesets are a summary of the work done between two commits. I do not mean the commit message; I mean the textual diff of the source code.
- As developers make changes, they commit these summaries to the source code history, so we get a stream of immutable data: the changesets!
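
To make “textual diff” concrete, here is a small sketch that produces a changeset from two versions of a file using Python’s standard `difflib`. The file contents are invented for the example (loosely modeled on a lexer class), and a real repository would supply these diffs through its version control system rather than through `difflib`.

```python
import difflib

# Version A of a file (invented contents).
version_a = """\
class Lexer:
    def __init__(self, fileptr):
        self.fileptr = fileptr
        self.comment_mode = False
""".splitlines(keepends=True)

# Version A+1 of the same file: one line was added.
version_a1 = """\
class Lexer:
    def __init__(self, fileptr):
        self.fileptr = fileptr
        self.done = False
        self.comment_mode = False
""".splitlines(keepends=True)

# The changeset "diff A..A+1" as a list of unified-diff lines.
changeset = list(
    difflib.unified_diff(version_a, version_a1, fromfile="a/lexer.py", tofile="b/lexer.py")
)
print("".join(changeset), end="")
```

Once produced, the diff never changes, no matter what happens to the file later, which is what makes a sequence of them a stream of immutable documents.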

17 I’m saying, let’s use these changesets as input to our model building step.
>>> So, here is my proposal for the FLT problem.

18 Again, here’s the standard-practice approach for FLTs. All I want to change is the first step, the learning step.

19 Use the changesets to train the model (step 1). Steps 2 and 3 now receive that model, but otherwise don’t change.

20 Because we are using training data that is in the expected format:
1. This removes the “one-way road” restriction
2. We can now update the model with new data, potentially removing the retraining restriction we have with the standard approach.
>>> So how did I evaluate this?

21 Dit, B.; Holtzhauer, A.; Poshyvanyk, D. & Kagdi, H. “A dataset from change history to support evaluation of software maintenance tasks”, Mining Software Repositories (MSR), 2013
Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W. “On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization”, Int’l Conf. on Software Maintenance and Evolution (ICSME), 2014
I smooshed the datasets together so that we looked at the systems they have in common. The Dit et al. dataset had traceability links, which were useful.
Dit et al. had method-level info; Moreno et al. had class-level info.

22 ArgoUML v0.22, ArgoUML v0.24, ArgoUML v0.26.2, JabRef v2.6, jEdit v4.3, muCommander v0.8.5
The info contained in the datasets was basically this.

23 We have traceability links between what changed and why it changed for a specific version, aka B. So we know that for the blue query, what changed was the blue file.
1. Build a snapshot model for that specific version, aka B.
2. Build a changeset model using the changesets up to that same version.

24 We evaluate from the perspective of the user:
1. Have a “query”
2. We know which document it relates to

[Figure: a ranked list of documents (foo, bar, baz, qux, quux, bletch, thud, grunt, spam, eggs) is produced for each query, and the rank of the known-relevant document is recorded (5, 1, 2, ...); the ranks are then plotted on a horizontal axis, with the annotations “great!” and “extremely bad”.]
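
As a rough sketch of this evaluation: for each query we look up where the known-relevant document landed in the ranked list and record that position. The ranked lists and document names below are placeholders, and counting ranks from 1 is a choice made for the example.

```python
def rank_of_relevant(ranked_docs, relevant):
    """1-based position of the first relevant document, or None if none appears."""
    for position, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            return position
    return None


# Each query maps to (ranked list returned by the model, set of relevant documents).
results = {
    "query-1": (["foo", "bar", "baz", "qux", "quux"], {"quux"}),
    "query-2": (["bletch", "thud", "grunt"], {"bletch"}),
    "query-3": (["spam", "eggs", "foo"], {"eggs"}),
}

ranks = [rank_of_relevant(ranked, relevant) for ranked, relevant in results.values()]
print(ranks)  # [5, 1, 2]
```

Collecting one rank per query gives a distribution that can be compared across models.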

25 ArgoUML v0.22 [Figure: Snapshot vs. Changesets; class-level and method-level panels]
(Explain what the graph means.) The tail is much shorter, meaning more results were near the top of the list.

26 ArgoUML v0.24 [Figure: Snapshot vs. Changesets; class-level and method-level panels]

27 ArgoUML v0.26.2 [Figure: Snapshot vs. Changesets; class-level and method-level panels]

28 JabRef v2.6 [Figure: Snapshot vs. Changesets; class-level and method-level panels]

29 jEdit v4.3 [Figure: Snapshot vs. Changesets; class-level and method-level panels]

30 muCommander v0.8.5 [Figure: Snapshot vs. Changesets; class-level and method-level panels]

31 If we wanted to use changesets in batch, the same way we’ve been using snapshots, then we could readily substitute changesets for snapshots.

32
- “I haven’t used anything online yet; that previous evaluation was just the current batch approach.”
- With online mode, we can update the model in real time as work is being done.
- “How would the approach work if it were used in a real environment (without actually subjecting developers to torturous experimentation)?”
- What I did was a historical simulation. When we do the blue query, the model does not know anything about the data after it.
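
The simulation can be sketched as a replay of the history in commit order. When a commit with a linked query is reached, the query is run against the model as it stood just before that commit, and only then is the model updated with that commit’s changeset. Treating the linked commit itself as “after” the query is an assumption of this sketch. The `ToyModel` below is a stand-in that merely remembers which changesets it has seen; a real run would use an online topic model, and every name here is hypothetical.

```python
class ToyModel:
    """Stand-in for an online topic model: records the changesets it was given."""

    def __init__(self):
        self.seen = []

    def update(self, changeset):
        self.seen.append(changeset)


def simulate(history, queries_by_commit, model, run_query):
    """Replay commits oldest-first; query before updating so no future data leaks in."""
    outcomes = []
    for commit_id, changeset in history:
        for query in queries_by_commit.get(commit_id, []):
            outcomes.append((query, run_query(model, query)))
        model.update(changeset)
    return outcomes


history = [("A+1", "diff A..A+1"), ("A+2", "diff A+1..A+2"), ("A+3", "diff A+2..A+3")]
queries_by_commit = {"A+2": ["blue query"]}

# For illustration, "running" a query just reports what the model knew at that moment.
outcomes = simulate(history, queries_by_commit, ToyModel(), lambda m, q: list(m.seen))
print(outcomes)  # [('blue query', ['diff A..A+1'])]
```

The blue query sees only the changesets that came before it, which is the property the simulation is meant to guarantee.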

33 ArgoUML v0.22 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]
Snapshots and changesets haven’t changed from what I’ve already shown you; I’ve only added this historical evaluation column in green.

34 ArgoUML v0.24 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]

35 ArgoUML v0.26.2 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]

36 JabRef v2.6 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]

37 jEdit v4.3 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]

38 muCommander v0.8.5 [Figure: Snapshot vs. Changesets vs. Historical; class-level and method-level panels]

39 All systems [Figure: Snapshot vs. Changesets vs. Historical; overall class-level and overall method-level panels]
Snapshots <= Changesets < Historical
- Changesets do just as well as snapshots
- Historical simulation more accurately captures what is going on

40 Future work
1. Do historical simulation with snapshots. How would a historical simulation for snapshots perform? With changesets I could cheat and do it online instead of taking a naive approach, because the format is correct.
2. How many changesets are required? Is there a minimum number of changesets?
3. How does this perform for other tasks? How would changesets perform for a different task?

# lexer.py
# author: Christopher S. Corley

from swindle.lexeme import Lexeme
from swindle.types import (Types, get_type, PUNCTUATION)
from io import TextIOWrapper

class Lexer:
    def __init__(self, fileptr):
        # fileptr is generally a TextIOWrapper when reading from a file
        self.fileptr = fileptr
        self.done = False

        self.comment_mode = False
        self.need_terminator = False

        # To emulate pushing things back to the stream
        self.saved_char = None

        # character is a generator so we can have nice reading things
        # like next(self.character)
        self.character = self.char_generator()
        self.error_msg = 'Could not read char %d on line %d from file.'

    # a convenient way to count line numbers and read things character
    # by character.
    def char_generator(self):
        for self.line_no, line in enumerate(self.fileptr):
            for self.col_no, char in enumerate(line):
                self.saved_char = None
                yield char

commit 63bf5d84890bceed42068880f2554d89b6ba10fc
Author: Christopher Corley <[email protected]>
Date:   Fri Oct 26 19:20:50 2012 -0500

    Remove unnecessary newline tokens, now form_list is behaving stupid

diff --git a/swindle/lexer.py b/swindle/lexer.py
index ce86687..59f349b 100644
--- a/swindle/lexer.py
+++ b/swindle/lexer.py
@@ -10,10 +10,10 @@ class Lexer:
     def __init__(self, fileptr):
         # fileptr is generally a TextIOWrapper when reading from a file
         self.fileptr = fileptr
+        self.done = False
 
-        self.tokenize_whitespace = False # like python, we tokenize all whitespace
-        self.whitespace_count = 0
         self.comment_mode = False
+        self.need_terminator = False
 
         # To emulate pushing things back to the stream
         self.saved_char = None
@@ -72,6 +72,7 @@ class Lexer:
         try:
             c = next(self.character)
         except StopIteration:
+            self.done = True
             return None
 
         return c

43 Changesets are program text!
>>> I’m saying, let’s use these changesets as input to our model building step.
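
Since a changeset is program text, it can be turned into a bag of words much like a source file. A minimal sketch follows. The choices here are assumptions made for illustration, not the preprocessing used in the talk: diff metadata lines are dropped; added, removed, and context lines are all kept; and identifiers are split on non-letters and camelCase.

```python
import re

# Line prefixes that mark diff metadata rather than program text.
METADATA_PREFIXES = ("diff --git", "index ", "--- ", "+++ ", "@@")


def changeset_terms(diff_text):
    """Extract lowercase terms from the body lines of a unified diff."""
    terms = []
    for line in diff_text.splitlines():
        if line.startswith(METADATA_PREFIXES):
            continue  # skip file headers and hunk markers
        # Drop the one-character diff marker column ("+", "-", or " ").
        body = line[1:] if line[:1] in "+- " else line
        for word in re.findall(r"[A-Za-z]+", body):
            # Split camelCase words, e.g. "StopIteration" -> "Stop", "Iteration".
            for part in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word):
                terms.append(part.lower())
    return terms


sample = """\
--- a/swindle/lexer.py
+++ b/swindle/lexer.py
@@ -72,6 +72,7 @@ class Lexer:
         except StopIteration:
+            self.done = True
             return None
"""

print(changeset_terms(sample))
```

Whether to keep context and removed lines, or only added lines, is a design decision that a real pipeline would need to evaluate.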
