Modeling Changeset Topics for Feature Location

Slide 1

Slide 1 text

Modeling Changeset Topics for Feature Location
Christopher S. Corley (University of Alabama), Kelly L. Kashuda (University of Alabama), Nicholas A. Kraft (ABB Corporate Research)
In this talk there's a lot of feature location technique (FLT) jargon, and the problem I'm trying to address is pretty subtle, so it is easier to see if we go through everything. Two takeaways:
1. A way we can use topic models on something other than source code, such as changesets, for FLTs.
2. A look into more accurate FLT evaluations.

Slide 2

Slide 2 text

We have a user, and they are looking for something in a big mess of source code. Our objective is to run that search for them and return a ranked list in which the top document(s) are the ones they are looking for.
We sometimes use topic modeling to accomplish this.

Slide 4

Slide 4 text

Topic modeling takes a collection of immutable documents, such as books and newspaper articles, and discovers the underlying themes in the text.
- We categorize documents by the topics they contain, using machine learning.
>>> Using a topic model involves three basic steps.

Slide 6

Slide 6 text

Step 1 is the learning step:
- Take a bunch of documents.
- Put them into a machine learning algorithm.
>>> The algorithm produces a trained model, which we can use for the other steps.

Slide 8

Slide 8 text

Step 2:
- Take this trained model.
- Using the same constructs
- AND the same documents again,
- infer the thematic structure of the documents while that model is held fixed.

Slide 12

Slide 12 text

Step 3, which is special:
- We can take UNSEEN documents, documents we didn't use in training,
- and infer their thematic structure.
That's a quick overview of how topic models work.
>>> How do we use topic models for feature location?

Slide 15

Slide 15 text

Marcus, A.; Sergeyev, A.; Rajlich, V. & Maletic, J. I. An information retrieval approach to concept location in source code, Proceedings of the 11th Working Conference on Reverse Engineering, 2004
To do FLT, we do those three steps sequentially:
1. Train the model.
2. Infer the thematic structure of the source code.
3. Given a user query, find the source code most related to the query.
>>> Simple and effective. But there is a mostly overlooked and subtle problem with this.

Slide 16

Slide 16 text

Here's the problem. For step 1, we take a snapshot of the source code at some point in time, such as a software release (Snapshot A). BUT as soon as the source code changes, our model becomes outdated. By the time the next release comes around, the model might make zero sense for the source code it represents.
>>> How do we deal with this problem?

Slide 18

Slide 18 text

Snapshot A … A+1, A+2, A+3 … Snapshot B
We can't just rebuild the model whenever a new release comes around. That's not practical if you're actually developing the software: the model is only useful at two points in time, and there's a lot of work going on in between. Most research thus far has overlooked the practicality of the approach, which makes it unreasonable for IDE integration.

Slide 19

Slide 19 text

Snapshot A … A+1, A+2, A+3 … Snapshot B
The naive solution is to rebuild the model after every change (not a good idea): building a model from scratch can take hours (for thousands of LOC) to days (for millions of LOC).
So we need to be clever. We can't change the fact that software is going to change… if we could, this field wouldn't be this active!
>>> What about the models? Let's look closer…

Slide 21

Slide 21 text

Rao, S. Incremental update framework for efficient retrieval from software libraries for bug localization, Purdue University, 2013
- A clever way was introduced by Shivani Rao.
- She extended existing topic models (TMs) so that the model can be kept approximately up to date.
- However, she notes that even her extensions will need rebuilding occasionally.
- This is the best thing I've seen so far. If you're doing FLT work, read her work.

Slide 24

Slide 24 text

I'm not satisfied with this. I think we can do better with less work AND with off-the-shelf topic modeling algorithms, so that we don't have to extend a TM algorithm for software-specific problems.
>>> There are a few things about modern topic models that I think can help.

Slide 25

Slide 25 text

TMs process…
Hoffman, M.; Bach, F. R. & Blei, D. M. Online Learning for Latent Dirichlet Allocation, in Lafferty, J.; Williams, C. (eds.), Advances in Neural Information Processing Systems, 2010
- Basic topic models were designed to process text, like books.
- Online topic models process streamable data: they can process new data *as* it appears and do not need to know about the data beforehand.
- As books are being written, the model is being updated…
- This allows us to iterate between step 1 and step 2: update the model with new data, then infer things about that new data.
- This is a cool property that is underutilized…
- Online topic models process infinite data: they can be updated continuously until the end of time.
>>> However, there is a tradeoff for these two properties…
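The "update the model with new data" idea can be sketched as a decaying blend between the current topics and an estimate made from the newest document. This is only a minimal sketch of the blending step described in the cited online LDA paper; the names `tau0` and `kappa` follow that paper, while the default values, the two-topic example, and the hard-coded estimates are illustrative assumptions. The step that computes an estimate from a document is not shown.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size for the t-th update (t = 0, 1, 2, ...); it decays over time."""
    return (tau0 + t) ** (-kappa)


def blend(current, estimate, rho):
    """Move the topic-word weights a fraction rho toward the new estimate."""
    return [
        [(1.0 - rho) * c + rho * e for c, e in zip(c_row, e_row)]
        for c_row, e_row in zip(current, estimate)
    ]


# Two topics over a three-word vocabulary (made-up numbers).
topics = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

# Hypothetical per-document estimates; a real model would compute these
# from each new document instead of hard-coding them.
stream = [
    [[3.0, 1.0, 1.0], [1.0, 1.0, 3.0]],
    [[5.0, 1.0, 1.0], [1.0, 3.0, 1.0]],
]

for t, estimate in enumerate(stream):
    # Each new document nudges the model; nothing is rebuilt from scratch.
    topics = blend(topics, estimate, learning_rate(t))
```

Because the step size shrinks as `t` grows, later documents adjust the model rather than overwrite it, which is what lets the stream go on indefinitely.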

Slide 31

Slide 31 text

TMs process… immutable data, streamable data, infinite data
- Topic models, even the offline cousins, need immutable data. Hence Shivani Rao's need to extend the offline versions to compensate for software's innate mutability.
>>> What do we have in software development that represents source code in a streamable, infinite, immutable format?

Slide 32

Slide 32 text

Snapshot A … A+1, A+2, A+3 … Snapshot B
Source code repositories! The repository contains a recorded history of every state the software has ever been in. I think we should be using this.
>>> How might we build a model out of a repository?

Slide 33

Slide 33 text

- With the changesets:
- Changesets are a summary of the work done between two commits. I do not mean the commit message; I mean the textual diff of the source code.
- As developers make changes, they commit these summaries to the source code history, so we get a stream of immutable data: that is the changeset!

Slide 35

Slide 35 text

diff A..A+1, diff A+1..A+2, diff A+2..A+3

Slide 36

Slide 36 text

- I'm saying, let's use these changesets as input to our model-building step.
>>> So, here is my proposal for the FLT problem.

Slide 38

Slide 38 text

Again, here's the standard-practice approach for FLTs. All I want to change is the first step, the learning step.
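Step 3 of this standard approach, which stays the same in the proposal, ranks source documents against a query. A minimal sketch, assuming each document and the query have already been reduced to a vector of topic weights by step 2 or 3 of the topic model; cosine similarity is an assumed choice of measure, and the file names and numbers are made up for illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_documents(query_topics, doc_topics):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine(query_topics, topics), name)
              for name, topics in doc_topics.items()]
    # Highest similarity first; ties are broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical topic weights inferred for three source files.
doc_topics = {
    "Lexer.java": [0.80, 0.15, 0.05],
    "Parser.java": [0.10, 0.70, 0.20],
    "Main.java": [0.30, 0.30, 0.40],
}
# Hypothetical topic weights inferred for the user's query.
query_topics = [0.75, 0.20, 0.05]

ranking = rank_documents(query_topics, doc_topics)
```

The ranked list is what the user sees; the evaluation later in the talk asks how far down that list the right document sits.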

Slide 39

Slide 39 text

1. Use the changesets to train the model.
2. Steps 2 and 3 now use that model, but don't change otherwise.

Slide 40

Slide 40 text

Because we are using training data that is in the expected format:
1. This removes the "one-way road" restriction.
2. We can now update the model with new data, potentially removing the retraining restriction we have with the standard approach.
>>> So how did I evaluate this?

Slide 43

Slide 43 text

Dit, B.; Holtzhauer, A.; Poshyvanyk, D. & Kagdi, H. "A dataset from change history to support evaluation of software maintenance tasks", Mining Software Repositories (MSR), 2013
Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W. "On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization", Int'l Conf. on Software Maintenance and Evolution (ICSME), 2014
We combined the two datasets so that we looked at the systems they have in common. The Dit et al. dataset had traceability links, which were useful.
Dit et al. had method-level information; Moreno et al. had class-level information.

Slide 44

Slide 44 text

The information contained in the datasets was basically this:
- ArgoUML v0.22
- ArgoUML v0.24
- ArgoUML v0.26.2
- JabRef v2.6
- jEdit v4.3
- muCommander v0.8.5

Slide 45

Slide 45 text

A … A+1, A+2, A+3 … B
We have traceability links between what changed and why it changed for a specific version, aka B. So we know that, for the blue query, what changed was the blue file.
1. Build a snapshot model for that specific version, aka B.
2. Build a changeset model using the changesets up to that same version.

Slide 50

Slide 50 text

We evaluate from the perspective of the user:
1. We have a "query".
2. We know which document it relates to.
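The measurement this implies can be sketched as follows: for each query, find the position of the first known-relevant document in the ranked list. This is a minimal sketch; the function name, the query data, and the document names are hypothetical, and the example is arranged so that the three queries land at ranks 5, 1, and 2.

```python
def first_relevant_rank(ranking, relevant):
    """1-based position of the first relevant document, or None if absent."""
    for position, name in enumerate(ranking, start=1):
        if name in relevant:
            return position
    return None


# Each query maps to (ranked list returned by the FLT, known relevant documents).
queries = {
    "query-1": (["foo", "bar", "baz", "qux", "quux"], {"quux"}),
    "query-2": (["bar", "foo", "baz"], {"bar"}),
    "query-3": (["baz", "qux", "foo"], {"qux"}),
}

ranks = [first_relevant_rank(ranking, relevant)
         for ranking, relevant in queries.values()]
```

A rank of 1 means the user finds the document immediately; the larger the rank, the more results they must read first.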

Slide 63

Slide 63 text

Example ranked list: foo, bar, baz, qux, quux, bletch, thud, grunt, spam, eggs
Ranks of the known document for each query: 5, 1, 2, …
Plot of the collected ranks (axis 0 to 50), annotated "great!" and "extremely bad".
We evaluate from the perspective of the user:
1. We have a "query".
2. We know which document it relates to.

Slide 64

Slide 64 text

ArgoUML v0.22: Snapshot vs. Changesets, class-level (axis 0 to 2000) and method-level (axis 0 to 14000).
Explain what the graph means.
The tail is much shorter, meaning more results were near the top of the list.

Slide 65

Slide 65 text

ArgoUML v0.24: Snapshot vs. Changesets, class-level (axis 0 to 2000) and method-level (axis 0 to 14000).

Slide 66

Slide 66 text

ArgoUML v0.26.2: Snapshot vs. Changesets, class-level (axis 0 to 2500) and method-level (axis 0 to 18000).

Slide 67

Slide 67 text

JabRef v2.6: Snapshot vs. Changesets, class-level (axis 0 to 1400) and method-level (axis 0 to 6000).

Slide 68

Slide 68 text

jEdit v4.3: Snapshot vs. Changesets, class-level (axis 0 to 1200) and method-level (axis 0 to 8000).

Slide 69

Slide 69 text

muCommander v0.8.5: Snapshot vs. Changesets, class-level (axis 0 to 1800) and method-level (axis 0 to 9000).

Slide 70

Slide 70 text

If we wanted to use changesets in batch, the same way we've been using snapshots, then we could readily substitute changesets for snapshots.

Slide 71

Slide 71 text

- "I haven't used anything online yet; that previous evaluation was just the current batch approach."
- With online mode, we can update the model in real time as work is being done.
- "How would the approach work if it were used in a real environment?" (without actually subjecting developers to torturous experimentation)
- What I did was a historical simulation. When we run the blue query, the model does not know anything about the data that comes after it.
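The protocol of the historical simulation can be sketched as a loop over changesets in commit order: query first, record the rank, and only then let the model see the changeset. This is a minimal sketch under loud assumptions: `OverlapModel` is a deliberately trivial word-counting stand-in, not a topic model, and the changeset dictionaries, file names, and words are hypothetical. Only the ordering of "query, then update" is the point.

```python
from collections import Counter, defaultdict


class OverlapModel:
    """Stand-in for an online topic model: counts words seen per file."""

    def __init__(self):
        self.words_by_file = defaultdict(Counter)

    def update(self, changeset):
        """Fold one changeset into the model."""
        for path, words in changeset["files"].items():
            self.words_by_file[path].update(words)

    def rank(self, query_words):
        """Rank known files by how often they contained the query words."""
        scores = {path: sum(counts[w] for w in query_words)
                  for path, counts in self.words_by_file.items()}
        return sorted(scores, key=lambda path: (-scores[path], path))


def historical_simulation(changesets, model):
    """Replay history oldest-first; each query sees only earlier changesets."""
    ranks = []
    for changeset in changesets:
        query = changeset.get("query")
        if query is not None:
            ranking = model.rank(query)  # the model has NOT seen this changeset
            targets = set(changeset["files"])
            hits = [i for i, path in enumerate(ranking, start=1) if path in targets]
            ranks.append(hits[0] if hits else None)
        model.update(changeset)  # only now does the model learn from it
    return ranks


history = [
    {"files": {"lexer.py": ["token", "char", "stream"]}},
    {"files": {"parser.py": ["tree", "node", "token"]}},
    {"query": ["char", "stream"], "files": {"lexer.py": ["char", "newline"]}},
    {"query": ["cache"], "files": {"cache.py": ["cache", "evict"]}},
]

ranks = historical_simulation(history, OverlapModel())
```

The last query returns no rank because `cache.py` does not exist in the model until after the query is answered, which is exactly the situation a batch evaluation hides.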

Slide 78

Slide 78 text

ArgoUML v0.22: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 2000) and method-level (axis 0 to 14000).
The Snapshot and Changesets results haven't changed from what I've already shown you; I only added the historical evaluation column, in green.

Slide 79

Slide 79 text

ArgoUML v0.24: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 2000) and method-level (axis 0 to 14000).

Slide 80

Slide 80 text

ArgoUML v0.26.2: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 2500) and method-level (axis 0 to 18000).

Slide 81

Slide 81 text

JabRef v2.6: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 1400) and method-level (axis 0 to 6000).

Slide 82

Slide 82 text

jEdit v4.3: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 1200) and method-level (axis 0 to 8000).

Slide 83

Slide 83 text

muCommander v0.8.5: Snapshot vs. Changesets vs. Historical, class-level (axis 0 to 1800) and method-level (axis 0 to 9000).

Slide 84

Slide 84 text

All systems: Snapshot vs. Changesets vs. Historical, overall class-level (axis 0 to 2500) and overall method-level (axis 0 to 18000).
- Changesets do just as well as snapshots.
- The historical simulation more accurately captures what is going on.

Slide 85

Slide 85 text

Snapshots <= Changesets < Historical

Slide 86

Slide 86 text

Future work
1. Do a historical simulation with snapshots. How would it perform? With changesets I could cheat and do it online, instead of taking a naive approach, because the format is correct.
2. How many changesets are required? Is there a minimum number?
3. How does this perform for other tasks?

Slide 88

Slide 88 text

# lexer.py
# author: Christopher S. Corley

from swindle.lexeme import Lexeme
from swindle.types import (Types, get_type, PUNCTUATION)
from io import TextIOWrapper

class Lexer:
    def __init__(self, fileptr):
        # fileptr is generally a TextIOWrapper when reading from a file
        self.fileptr = fileptr
        self.done = False

        self.comment_mode = False
        self.need_terminator = False

        # To emulate pushing things back to the stream
        self.saved_char = None

        # character is a generator so we can have nice reading things
        # like next(self.character)
        self.character = self.char_generator()
        self.error_msg = 'Could not read char %d on line %d from file.'

    # a convenient way to count line numbers and read things character
    # by character.
    def char_generator(self):
        for self.line_no, line in enumerate(self.fileptr):
            for self.col_no, char in enumerate(line):
                self.saved_char = None
                yield char

Slide 89

Slide 89 text

commit 63bf5d84890bceed42068880f2554d89b6ba10fc
Author: Christopher Corley
Date:   Fri Oct 26 19:20:50 2012 -0500

    Remove unnecessary newline tokens, now form_list is behaving stupid

diff --git a/swindle/lexer.py b/swindle/lexer.py
index ce86687..59f349b 100644
--- a/swindle/lexer.py
+++ b/swindle/lexer.py
@@ -10,10 +10,10 @@ class Lexer:
     def __init__(self, fileptr):
         # fileptr is generally a TextIOWrapper when reading from a file
         self.fileptr = fileptr
+        self.done = False

-        self.tokenize_whitespace = False # like python, we tokenize all whitespace
-        self.whitespace_count = 0
         self.comment_mode = False
+        self.need_terminator = False

         # To emulate pushing things back to the stream
         self.saved_char = None
@@ -72,6 +72,7 @@ class Lexer:
         try:
             c = next(self.character)
         except StopIteration:
+            self.done = True
             return None

         return c

43 Changesets are program text!
>>> I’m saying, let’s use these changesets as input to our model building step.
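One way to read "changesets are program text" is sketched below: a diff is turned into a bag of words that could be fed to the model building step. This is a minimal sketch, not the preprocessing used in the talk. The helper name `changeset_to_document`, the choice to keep added, removed, and context lines alike, and the simple identifier tokenizer are all assumptions.

```python
import re

# Hypothetical helper, not from the talk: lines with these prefixes are
# treated as diff metadata and skipped. A changed line that itself begins
# with "-- " or "++ " would be skipped too; this sketch ignores that case.
DIFF_METADATA = ("diff --git", "index ", "--- ", "+++ ", "@@")
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def changeset_to_document(diff_text):
    """Return the identifier-like tokens found in a changeset's diff lines."""
    words = []
    for line in diff_text.splitlines():
        if line.startswith(DIFF_METADATA):
            continue  # skip file headers and hunk markers
        if line[:1] in ("+", "-", " "):
            line = line[1:]  # drop the diff prefix (assumption: keep all line kinds)
        words.extend(token.lower() for token in TOKEN.findall(line))
    return words


# Part of the second hunk from the slide, used as example input.
example = "\n".join([
    "diff --git a/swindle/lexer.py b/swindle/lexer.py",
    "--- a/swindle/lexer.py",
    "+++ b/swindle/lexer.py",
    "@@ -72,6 +72,7 @@ class Lexer:",
    "         except StopIteration:",
    "+            self.done = True",
    "             return None",
])
document = changeset_to_document(example)
print(document)
```

Each changeset then plays the role that a source file plays in a snapshot-based corpus: one list of words per document.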

Slide 90

Slide 90 text

44 [Figure: ArgoUML v0.22; class-level and method-level panels comparing Snapshot and Changesets] Note: explain what the graph means.

Slide 91

Slide 91 text

45 [Figure: ArgoUML v0.24; class-level and method-level panels comparing Snapshot and Changesets]

Slide 92

Slide 92 text

46 [Figure: ArgoUML v0.26.2; class-level and method-level panels comparing Snapshot and Changesets]

Slide 93

Slide 93 text

47 [Figure: JabRef v2.6; class-level and method-level panels comparing Snapshot and Changesets]

Slide 94

Slide 94 text

48 [Figure: jEdit v4.3; class-level and method-level panels comparing Snapshot and Changesets]

Slide 95

Slide 95 text

49 [Figure: muCommander v0.8.5; class-level and method-level panels comparing Snapshot and Changesets]

Slide 96

Slide 96 text

50 [Figure: ArgoUML v0.22; class-level and method-level panels comparing Snapshot, Changesets, and Historical]

Slide 97

Slide 97 text

51 [Figure: ArgoUML v0.24; class-level and method-level panels comparing Snapshot, Changesets, and Historical]

Slide 98

Slide 98 text

52 [Figure: ArgoUML v0.26.2; class-level and method-level panels comparing Snapshot, Changesets, and Historical]

Slide 99

Slide 99 text

53 [Figure: JabRef v2.6; class-level and method-level panels comparing Snapshot, Changesets, and Historical] Note: what does it mean?

Slide 100

Slide 100 text

54 [Figure: jEdit v4.3; class-level and method-level panels comparing Snapshot, Changesets, and Historical]

Slide 101

Slide 101 text

55 [Figure: muCommander v0.8.5; class-level and method-level panels comparing Snapshot, Changesets, and Historical]

Slide 102

Slide 102 text

56 [Figure: Overall results; class-level and method-level panels comparing Snapshot, Changesets, and Historical] Changesets do just as well as snapshots. Historical simulation more accurately captures what is going on.