


When AI Gets It *Almost* Right: Lessons From AI-assisted Software Development

This talk was the keynote for ICPC 2026. (Some animations and group pictures were removed from the PDF to make the slide deck more compact.)

Generative AI has become a disruptive force in software development, with applications spanning a wide range of tasks. However, our recent empirical studies across various tasks reveal a consistent pattern: large language models automate substantial portions of the work, yet often produce results that are almost right. The remaining incorrect or incomplete work is then left to human developers, who must validate, repair, and reason about AI-generated changes.

In this talk, I explore two intertwined questions that arise from this reality. First, what is the cost of AI's incomplete work, especially the cognitive overhead involved for the developer? Second, and more critically, what happens when increasing reliance on AI affects the very skills developers need to finish incomplete tasks, a concern that becomes particularly visible in educational settings?


Sarah Nadi

April 16, 2026


Transcript

  1. 2

  2. 4 Dependency management; Improving code LLMs; AI for SE; Impact of AI on software engineering education
  3. 5

  4. 8 Technical Preview

    [Figure panels: (a) Copilot query context; (b) Qu…]
    How correct are the suggestions?
    How understandable is the generated code?
  5. 9 [MSR 2022]

                                           Python  Java  JavaScript  C
    Correct (all tests pass)               42%     58%   27%         39%
    Partially Correct (some tests pass)    39%     33%   33%         27%

    Cyclomatic and cognitive complexities of the solutions were generally below the 15-25 recommended thresholds.
  6. 11 Test Generation (2022-2023), GitHub Next

    Models: gpt3.5-turbo, code-cushman-002, Starcoder

        let mocha = require('mocha');
        let assert = require('assert');
        let quill_delta = require('quill-delta');
        // quill-delta.prototype.concat(other)
        describe('test quill_delta', function() {
          it('test quill-delta.prototype.concat', function(done) {
            let delta1 = new quill_delta([{ insert: 'Hello' },
                                          { insert: ' ',
                                            attributes: { bold: true } },
                                          { insert: 'World!' }]);
            let delta2 = new quill_delta([{ insert: 'Hello' },
                                          { insert: ' ',
                                            attributes: { bold: true } },
                                          { insert: 'World!' }]);
            let delta3 = delta1.concat(delta2);
            assert.equal(delta3.ops.length, 6); // fails
            done();
          })
        })
  7. 12 48% of generated tests pass and obtain a median statement coverage of 70% and median branch coverage of 52% [TSE 2023]
  8. 16 What “done” means will differ by task

    Complete: task completely done.
    Partial: parts of the task are done (developer needs to finish it).
    Wrong: none of the goals of the task have been achieved.
  9. 17

  10. 18

  11. 19

  12. 20

  13. 21

  14. 22 argparse [MSR ’23, FSE ’24, ASE ’25, Under review]

        import argparse

        def main():
            parser = argparse.ArgumentParser(description="Greet person")
            parser.add_argument('name', help="name to greet")
            parser.add_argument('--age', type=int, help="your age")
            args = parser.parse_args()
            print(f"Hello {args.name}!")
            if args.age:
                print(f"You are {args.age} years old")

        if __name__ == "__main__":
            main()
  17. 23 Migrate code: migration-related code changes [MSR ’23, FSE ’24, ASE ’25, Under review]

    Before (argparse):

        import argparse

        def main():
            parser = argparse.ArgumentParser(description="Greet person")
            parser.add_argument('name', help="name to greet")
            parser.add_argument('--age', type=int, help="your age")
            args = parser.parse_args()
            print(f"Hello {args.name}!")
            if args.age:
                print(f"You are {args.age} years old")

        if __name__ == "__main__":
            main()

    After (click):

        import click

        @click.command()
        # click arguments do not accept help=; NAME is described in the docstring
        @click.argument('name', type=str)
        @click.option('--age', type=int, help="your age")
        def main(name, age):
            """Greet person. NAME is the name to greet."""
            print(f'Hello {name}!')
            if age:
                print(f'You are {age} years old.')

        if __name__ == "__main__":
            main()
  18. 24

  19. 25

  20. 26

  21. 27 Updating to a new version breaks the code [EMSE ’26]

    Byam: Fixing Breaking Dependency Updates with Large Language Models

    (a) Difference in Maven build file, causing a breaking update.

    (b) The broken code with compilation failure after breaking update.

        113 ...
        114 // create an instance of fop factory
        115 FopFactory fopFactory = FopFactory.newInstance();
        116 // a user agent is needed for transformation
        117 FOUserAgent foUserAgent = fopFactory.newFOUserAgent();
        118 ...

    (c) Compilation error information from the logs.

        [ERROR] /billy/billy-gin/src/main/java/com/premiumminds/billy/gin/services/impl/pdf/FOPPDFTransformer.java:[115,43] no suitable method found for newInstance(no arguments)
          method org.apache.fop.apps.FopFactory.newInstance(org.apache.fop.apps.FopFactoryConfig) is not applicable
            (actual and formal argument lists differ in length)
          method org.apache.fop.apps.FopFactory.newInstance(java.io.File) is not applicable
            (actual and formal argument lists differ in length)
          method org.apache.fop.apps.FopFactory.newInstance(java.net.URI) is not applicable
            (actual and formal argument lists differ in length)
          method org.apache.fop.apps.FopFactory.newInstance(java.net.URI,java.io.InputStream) is not applicable
            (actual and formal argument lists differ in length)

    Fig. 1: Example of a breaking dependency update when updating org.apache.xmlgraphics (dependency) from version 1.0 (old version) to 2.2 (new version) in the project billy.
  22. 28

  23. 29

  24. 30

  25. 32

  26. 33

  27. 34

  28. 37

  29. 38

  30. 39 Over time, regardless of the task, programming language, or model: there is often “a last mile” left for the developer
  31. 41 We completed a sample of 24 out of 34

    incomplete migrations. For those, the LLM correctly finished 73% of the required changes, leaving 27% for the developer to complete.
  32. 43 Are we accounting for the cognitive overhead of understanding generated code to finish the last mile?

    Are there cases where it’s just easier to start from scratch?
  33. 45 Does the developer have the capability to finish this last mile?

    Does a junior developer have these capabilities?
  34. 46

  35. 47

  36. 48 Students are using AI for all their submitted course

    projects. This is very frustrating! Hmm I’m going to introduce oral exams in software engineering!
  37. 49 Out of 5 oral exams I sat on, not a single student was able to explain their code.

    Most students did not follow the provided test examples and repo template structure. They had convoluted solutions with complex, over-mocked tests, to the extent that nothing was really tested anymore.
  38. 52 [Study design diagram. Labels: Two learning tasks; Questionnaire; Study Session; One week later; Post-study Retention Testing; Retention test task; Two test tasks; Questionnaire; Brief training]
  41. 55 [Study design diagram, same labels as slide 52, annotated with:]

    Students using ChatGPT finished ~9% faster.
    Students not using ChatGPT finished ~7% faster.
  44. 57 [Study design diagram, same labels as slide 52, annotated with:]

    Students using ChatGPT have ~20% higher correctness.
    Both groups have the same correctness scores for the 1st, easier task.
    Students not using ChatGPT have ~17% higher correctness for the 2nd, harder task.
    Students not using ChatGPT have ~19% higher correctness.

    For more details, join our CHASE 2026 presentation on Tuesday at 4:15pm.
  45. 58 Students are using AI as a crutch without always

    completely learning or understanding what they did
  46. 59

  47. 59 Using AI assistance resulted in a reduction in the skills evaluation score by 17%, with no gained acceleration in completion time.

    Gains in skill development of the control group are attributed to the process of encountering and subsequently resolving errors independently.
    Some participants asked up to 15 questions or spent more than 30% of the total available task time on composing queries.
  48. 65

  49. 66 “If you take only two things from this book,

    take these: 1) software engineering was never just about writing code, and 2) a fool with an agentic coding tool is still a fool.”
  50. 69

  51. 69

  52. Sarah Nadi. When AI Gets it Almost Right: Lessons From AI-Assisted Software Development

    Icons in this presentation are by Magnemizer, Three Musketeers, Flatart icons, Parzival’ 1997, Becris, Freepik, photo3idea_studio, cube29, Eucalyp, nawicon, IconBaander, graphicmail, Paul J, Icongeek, and juicy fish from flaticon.com
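
The transcript grades AI output in two similar ways: by tests (Correct when all tests pass, Partially Correct when some pass) and by task goals (Complete, Partial, Wrong). The following is a minimal sketch of that grading, assuming each task can be reduced to a list of pass/fail checks. The names `Outcome`, `classify`, and `last_mile_share` are hypothetical and do not come from the talk or the underlying studies.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"  # every check passed
    PARTIAL = "partial"    # some checks passed; the developer finishes the rest
    WRONG = "wrong"        # no check passed


def classify(checks: list[bool]) -> Outcome:
    """Map per-check results (tests or task goals) to an outcome."""
    if not checks:
        raise ValueError("a task needs at least one check")
    if all(checks):
        return Outcome.COMPLETE
    if any(checks):
        return Outcome.PARTIAL
    return Outcome.WRONG


def last_mile_share(checks: list[bool]) -> float:
    """Fraction of checks still failing, i.e. work left for the developer."""
    if not checks:
        raise ValueError("a task needs at least one check")
    return checks.count(False) / len(checks)


# Toy input, not data from the studies: three of four checks pass.
example = [True, True, True, False]
outcome = classify(example)
remaining = last_mile_share(example)
```

As the transcript notes, what "done" means differs by task, so the list of checks would have to be defined per task; this sketch only shows the bookkeeping once such checks exist.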
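
The breaking-update slide shows a compiler log line that carries the file, line, and column of the failure (`[115,43]`). As a hedged illustration of how such a line could be turned into structured data before a human or a model attempts a repair, here is a small parser. It is an assumption-laden sketch: the regular expression only targets the `[ERROR] <file>.java:[line,col] message` shape seen on the slide, the names `CompileError` and `parse_error` are hypothetical, and this is not Byam's implementation.

```python
import re
from typing import NamedTuple, Optional


class CompileError(NamedTuple):
    file: str
    line: int
    column: int
    message: str


# Assumed shape: "[ERROR] /path/File.java:[LINE,COL] message"
_ERROR_LINE = re.compile(
    r"^\[ERROR\]\s+(?P<file>\S+\.java):\[(?P<line>\d+),(?P<col>\d+)\]\s+(?P<msg>.*)$"
)


def parse_error(log_line: str) -> Optional[CompileError]:
    """Return the error location and message, or None if the line does not match."""
    match = _ERROR_LINE.match(log_line.strip())
    if match is None:
        return None
    return CompileError(
        file=match["file"],
        line=int(match["line"]),
        column=int(match["col"]),
        message=match["msg"],
    )


# First line of the log shown on the slide.
sample = (
    "[ERROR] /billy/billy-gin/src/main/java/com/premiumminds/billy/gin/services/"
    "impl/pdf/FOPPDFTransformer.java:[115,43] "
    "no suitable method found for newInstance(no arguments)"
)
error = parse_error(sample)
```

The continuation lines of the log (the candidate `newInstance` overloads) do not start with `[ERROR]`, so `parse_error` returns `None` for them; a fuller tool would need to attach them to the preceding error.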