When AI Gets It Almost Right: Lessons From AI-assisted Software Development

Sarah Nadi

https://sanadlab.org/

Dependency management
Improving code LLMs
AI for SE
Impact of AI on software engineering education

What happens when AI gets it almost right

Summer 2021

Copilot Technical Preview
[Figure: (a) Copilot query context; (b) Qu…]
How correct are the suggestions?
How understandable is the generated code?

MSR 2022

                                      Python   Java   JavaScript   C
Correct (all tests pass)              42%      58%    27%          39%
Partially correct (some tests pass)   39%      33%    33%          27%

Cyclomatic and cognitive complexities of the solutions were generally below the recommended thresholds (15-25).
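The correctness categories above depend only on how many tests a suggestion passes, which can be sketched in a few lines. This is a minimal illustration, not the study's evaluation harness: the `Outcome` enum, the `classify` function, and the `INCORRECT` label for suggestions that pass no tests are all assumptions introduced here.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"                      # all tests pass
    PARTIALLY_CORRECT = "partially correct"  # some tests pass
    INCORRECT = "incorrect"                  # no tests pass (label assumed here)


def classify(test_results: list[bool]) -> Outcome:
    """Map per-test pass/fail results for one suggestion to an outcome."""
    if not test_results:
        raise ValueError("at least one test result is required")
    passed = sum(test_results)
    if passed == len(test_results):
        return Outcome.CORRECT
    if passed > 0:
        return Outcome.PARTIALLY_CORRECT
    return Outcome.INCORRECT


if __name__ == "__main__":
    # Hypothetical results for three suggestions.
    for results in ([True, True, True], [True, False, True], [False, False]):
        print(results, "->", classify(results).value)
```

Note that a test-based category says nothing about how much work a partially correct suggestion still needs.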

Fall 2022: GitHub Next

Test Generation (2022-2023), GitHub Next

    let mocha = require('mocha');
    let assert = require('assert');
    let quill_delta = require('quill-delta');
    // quill-delta.prototype.concat(other)
    describe('test quill_delta', function() {
      it('test quill-delta.prototype.concat', function(done) {
        let delta1 = new quill_delta([{ insert: 'Hello' },
                                      { insert: ' ',
                                        attributes: { bold: true } },
                                      { insert: 'World!' }]);
        let delta2 = new quill_delta([{ insert: 'Hello' },
                                      { insert: ' ',
                                        attributes: { bold: true } },
                                      { insert: 'World!' }]);
        let delta3 = delta1.concat(delta2);
        assert.equal(delta3.ops.length, 6); // fails
        done();
      })
    })

Models: gpt3.5-turbo, code-cushman-002, StarCoder

48% of generated tests pass; they obtain a median statement coverage of 70% and a median branch coverage of 52%. [TSE 2023]

Using LLMs for Software Engineering Tasks

Complete: task completely done
Partial: parts of the task are done (developer needs to finish it)
Wrong: none of the goals of the task have been achieved
What “done” means will differ by task.

argparse

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Greet person")
        parser.add_argument('name', help="name to greet")
        parser.add_argument('--age', type=int, help="your age")
        args = parser.parse_args()
        print(f"Hello {args.name}!")
        if args.age:
            print(f"You are {args.age} years old")

    if __name__ == "__main__":
        main()

Migrate code (migration-related code changes)

    import click

    @click.command()
    # click.argument() takes no help= keyword, so the description lives in the docstring
    @click.argument('name', type=str)
    @click.option('--age', type=int, help="your age")
    def main(name, age):
        """Greet person."""
        print(f'Hello {name}!')
        if age:
            print(f'You are {age} years old.')

    if __name__ == "__main__":
        main()

[MSR ’23, FSE ’24, ASE ’25, Under review]

Byam: Fixing Breaking Dependency Updates with Large Language Models

(a) Difference in Maven build file, causing a breaking update.

(b) The broken code with compilation failure after breaking update.

    113 ...
    114 // create an instance of fop factory
    115 FopFactory fopFactory = FopFactory.newInstance();
    116 // a user agent is needed for transformation
    117 FOUserAgent foUserAgent = fopFactory.newFOUserAgent();
    118 ...

(c) Compilation error information from the logs.

    [ERROR] /billy/billy-gin/src/main/java/com/premiumminds/billy/gin/services/impl/pdf/FOPPDFTransformer.java:[115,43] no suitable method found for newInstance(no arguments)
        method org.apache.fop.apps.FopFactory.newInstance(org.apache.fop.apps.FopFactoryConfig) is not applicable
          (actual and formal argument lists differ in length)
        method org.apache.fop.apps.FopFactory.newInstance(java.io.File) is not applicable
          (actual and formal argument lists differ in length)
        method org.apache.fop.apps.FopFactory.newInstance(java.net.URI) is not applicable
          (actual and formal argument lists differ in length)
        method org.apache.fop.apps.FopFactory.newInstance(java.net.URI,java.io.InputStream) is not applicable
          (actual and formal argument lists differ in length)

Fig. 1: Example of a breaking dependency update when updating org.apache.xmlgraphics (dependency) from version 1.0 (old version) to 2.2 (new version) in the project billy.

[...] or recommendations for API method replacements based on source code analysis (Dagenais and Robillard 2009). These approaches struggle to handle the [...]

Updating to a new version breaks the code [EMSE ’26]
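A first step in repairing a breaking update like the one above is locating the failing line from the build log. The sketch below shows one way to pull the file path, line, column, and message out of a compiler error line in the format shown in (c). It is an illustration only, not how Byam works: `CompileError`, `ERROR_PATTERN`, and `parse_compile_error` are hypothetical names, and the pattern assumes the `[ERROR] <path>.java:[line,column] <message>` layout seen in this log.

```python
import re
from typing import NamedTuple, Optional


class CompileError(NamedTuple):
    path: str
    line: int
    column: int
    message: str


# Assumed layout: "[ERROR] /path/File.java:[115,43] message"
ERROR_PATTERN = re.compile(
    r"^\[ERROR\]\s+(?P<path>\S+\.java):"
    r"\[(?P<line>\d+),(?P<column>\d+)\]\s+(?P<message>.*)$"
)


def parse_compile_error(log_line: str) -> Optional[CompileError]:
    """Return the error location and message, or None if the line does not match."""
    match = ERROR_PATTERN.match(log_line.strip())
    if match is None:
        return None
    return CompileError(
        path=match["path"],
        line=int(match["line"]),
        column=int(match["column"]),
        message=match["message"],
    )


if __name__ == "__main__":
    log_line = (
        "[ERROR] /billy/billy-gin/src/main/java/com/premiumminds/billy/gin/"
        "services/impl/pdf/FOPPDFTransformer.java:[115,43] "
        "no suitable method found for newInstance(no arguments)"
    )
    print(parse_compile_error(log_line))
```

The extracted location (line 115 here) points at the `FopFactory.newInstance()` call, which is the code that would be handed to a developer or a model together with the error message.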

IndexError [CASCON ’25]

Cognitive Complexity of functions should not be too high

Over time, regardless of the task, programming language, or model, there is often “a last mile” left for the developer.

How big is this last mile?

We completed a sample of 24 out of 34 incomplete migrations. For those, the LLM correctly finished 73% of the required changes, leaving 27% for the developer to complete.

How difficult are the remaining changes?

Are we accounting for the cognitive overhead of understanding generated code to finish the last mile?
Are there cases where it’s just easier to start from scratch?

We usually evaluate correctness/completeness, but not what partial completeness really entails.

Does the developer have the capability to finish this last mile?
Does a junior developer have these capabilities?

Students are using AI for all their submitted course projects. This is very frustrating!
Hmm, I’m going to introduce oral exams in software engineering!

Out of 5 oral exams I sat in on, not a single student was able to explain their code.
Most students did not follow the provided test examples and repo template structure. They had convoluted solutions with over-mocked, complex tests, to the extent that nothing was really tested anymore.

Are students actually learning when they use AI to do their work most of the time?

Evaluating ChatGPT’s Impact on Learning Computational Skills
Control group
Experiment group

[Study design] Study session: brief training, two learning tasks, two test tasks, questionnaire. One week later, post-study retention testing: retention test task, questionnaire.

Students using ChatGPT finished ~9% faster.
Students not using ChatGPT finished ~7% faster.

Students using ChatGPT have ~20% higher correctness.
Both groups have the same correctness scores for the 1st, easier task.
Students not using ChatGPT have ~17% higher correctness for the 2nd, harder task.
Students not using ChatGPT have ~19% higher correctness.

For more details, join our CHASE 2026 presentation on Tuesday at 4:15pm.

Students are using AI as a crutch without always completely learning or understanding what they did.

Using AI assistance resulted in a reduction in the skills evaluation score by 17%, with no gained acceleration in completion time.
Gains in skill development of the control group are attributed to the process of encountering and subsequently resolving errors independently.
Some participants asked up to 15 questions or spent more than 30% of the total available task time on composing queries.

AI leaves some tasks incomplete.

To complete these tasks, you need to have the right expertise.

Expertise is built through practice, failure, and reflection.

Students outsource the tasks that teach them those skills to AI.

“If you take only two things from this book, take these: 1) software engineering was never just about writing code, and 2) a fool with an agentic coding tool is still a fool.”

Will our current teaching techniques and assessments simply create fools with tools?

Sarah Nadi
Icons in this presentation are by Magnemizer, Three Musketeers, Flatart icons, Parzival’ 1997, Becris, Freepik, photo3idea_studio, cube29, Eucalyp, nawicon, IconBaander, graphicmail, Paul J, Icongeek, and juicy fish from flaticon.com
