

The Ultimate Software Crafter - Meetup Crafting Data Science

These slides were presented at the "Crafting Data Science" meetup in Paris on 25/06/2024.

Autonomous AI software development agents could become far more capable thanks to TDD.
It sounds crazy, but the results of our experiments with Naji Alazhar are promising!

In recent months, the market for autonomous dev agents (aka AI software engineering agents) has been in turmoil. We have seen plenty of hype and some botched demos (from Cognition's Devin and GitHub Copilot Workspace, to name just two).

Beyond the headline announcements, credible open-source solutions are now emerging (such as SWE-Agent) that can already be integrated into a development workflow.
You give them a task (a change request or a bug to fix, described in markdown or via a link to a GitHub issue) and they open a PR.
They can't do everything yet, not even half of what we ask of them. But they can already solve, in a few minutes, tasks that take humans hours or days.

With Naji Alazhar, we are exploring whether applying TDD principles can improve an agent's resolution rate.
Our first tests show that the quality of the feedback the agent gets from TDD lets it correct course after a hallucination. And that changes everything!

Wassel Alazhar

June 25, 2024


Transcript

  1. Agenda
     1. A hot topic
        - Devin: the Buzz and the Bad Buzz
        - GitHub Copilot Workspace: the Buzz and the Bad Buzz
        - SWE-Agent: the unexpected OSS alternative
        - It's not going to calm down
     2. What are coding agents capable of today?
        - Beyond the hype, a possible option
        - SWE-Bench: a reference benchmark
        - 18%: opportunities and limitations
        - Potential
        - Our ambitious claim: TDD will enable more automated software engineering
     3. Live Demo
     4. How does a coding agent work?
        - Autonomous agents basics
        - How do AI software engineering agents work?
        - Main findings
        - How a test-first approach will improve performance
        - Main challenges (models, contexts, cost and data)
        - Future directions
     5. What's Next? Let's discuss
        - Our intuitions
        - What would really change? What would not?
  2. Beyond the hype, a possible option
     - LLM + Tools + Environment
     - Problem description (aka issue) → Resolution as Code (aka Pull Request)
  3. SWE-Bench: a reference benchmark
     - 12 code repositories among the most popular Python repos (Django, flask, matplotlib, requests, scikit-learn, sympy…)
       - Clear contribution guidelines
       - Better test coverage
       - Tend to be better maintained
     - SWE-Bench: 2294 issues
     - The benchmark evaluates against already-solved PRs (whose tests cover the changed code)
  4. SWE-Bench: a reference benchmark
     - The agent is given the parent commit of the solved PR
     - To verify the agent's PR, all related tests are run
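The verification step above can be sketched as a pass/fail decision over two test sets (SWE-Bench calls them FAIL_TO_PASS and PASS_TO_PASS). This is a minimal sketch, not the official harness; the function name and result encoding are invented for illustration:

```python
# Hypothetical helper mirroring SWE-Bench-style verification: an agent patch
# counts as a resolution only if the tests tied to the issue now pass AND the
# previously passing tests still pass.

def is_resolved(fail_to_pass, pass_to_pass, results):
    """results maps test ids to "PASS" or "FAIL" after applying the agent's patch."""
    issue_fixed = all(results.get(t) == "PASS" for t in fail_to_pass)
    nothing_broken = all(results.get(t) == "PASS" for t in pass_to_pass)
    return issue_fixed and nothing_broken

# Example: the issue's test now passes, but a regression slipped in.
print(is_resolved(
    fail_to_pass=["test_new_feature"],
    pass_to_pass=["test_existing"],
    results={"test_new_feature": "PASS", "test_existing": "FAIL"},
))  # False
```

The second condition is what slide 22 insists on: a green issue test alone is not enough, the rest of the suite must stay green too.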
  5. SWE-Bench Lite: streamlining for better focus
     - 300 issues
     - Excludes:
       - issues with external dependencies or images in the description
       - issues with descriptions shorter than 40 words
       - PRs with changes across multiple files
       - issues with tests on specific error messages
  6. 18% - Opportunities (as of 2024-06-04)
     - Cost:
       - < $2 per issue (GPT-4o LLM API)
       - a few $ for the agent execution infrastructure
       - 300 issues => ~$600
     - Expected benefits:
       - 54 / 300 issues solved
       - 2.77 man-days / issue (avg): "It costs on average ~2.77 days for developers to create pull requests…" [samples from the SWE-Bench-Lite dataset] Source: "AutoCodeRover: Autonomous Program Improvement", https://arxiv.org/pdf/2404.05427
       - 18% * 2.77 * 300 = 149.58 man-days
       - ($2 * 300) / 149.58 = $4.01 cost per man-day
       - 2-10 min / issue
     - What executives would think: 149.58 man-days for $600 instead of $75K* (*assuming a $500 average daily rate)
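The slide's back-of-the-envelope arithmetic can be checked directly (all figures are taken from the slide itself):

```python
# Reproducing the slide's cost math for the GPT-4o run (as of 2024-06-04).
solved, total = 54, 300
resolution_rate = solved / total        # 18%
avg_days_per_issue = 2.77               # avg human effort per PR (AutoCodeRover paper)
llm_cost_per_issue = 2.0                # dollars per issue with the GPT-4o API

man_days_saved = resolution_rate * avg_days_per_issue * total
total_cost = total * llm_cost_per_issue
cost_per_man_day = total_cost / man_days_saved

print(round(man_days_saved, 2))    # 149.58
print(round(cost_per_man_day, 2))  # 4.01
```

The same formula with the Sonnet 3.5 numbers from slide 8 (26.67% resolution rate, $0.15 per issue) yields the 221.63 man-days and ~$0.23 per man-day quoted there.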
  7. 18% - Limitations
     - Issues that don't require many changes
     - The issue description is concise and self-contained
     - Work after the PR
     - Work before the issue
  8. 26.67% - Opportunities (as of 2024-06-22)
     - Cost:
       - $0.15 per issue (Sonnet 3.5 LLM API)
       - a few $ for the agent execution infrastructure
       - 300 issues => ~$50
     - Expected benefits:
       - 2.77 man-days / issue (avg)
       - 26.67% * 2.77 * 300 = 221.63 man-days
       - $50 / 221.63 ≈ $0.23 cost per man-day
       - 2-10 min / issue
     - What executives would think: 221.63 man-days for $50 instead of $111K* (*assuming a $500 average daily rate)
  9. Potential challenges
     - Cost management
     - How to know if the proposed PR resolves the issue?
  10. Potential: TDD will enable more automated software engineering
      - Red: write a failing test for the next bit of change you want to add
      - Green: write the simplest code that could possibly make the test pass
      - Refactor: refactor to improve the implementation design
      Source: Canon TDD, by Kent Beck
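One iteration of that Red/Green/Refactor loop, as a toy Python example (the `slugify` function and test are invented for illustration):

```python
# One Red-Green iteration of Canon TDD (toy example).

# Red: write a failing test for the next bit of change you want to add.
# (Running this before slugify exists, or with a wrong slugify, fails.)
def test_slugify_replaces_spaces():
    assert slugify("crafting data science") == "crafting-data-science"

# Green: the simplest code that could possibly make the test pass.
def slugify(title):
    return title.replace(" ", "-")

# Refactor: with the test green, the implementation can now be reshaped
# (lowercasing, stripping punctuation, ...) without fear of regressions.

test_slugify_replaces_spaces()  # passes silently
```

The point made later in the deck is that this same loop gives an agent a machine-checkable signal at every step, instead of a PR whose correctness is unknown.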
  11. Live Demo
      - Live: GPT-4o, Test-First
      - Live: deep-seek-coder 2, Test-First
      - Introduction to trajectories
      - How has the human actually solved it?
  12. Live Demo
      - Live: GPT-4o, Test-First
      - Live: deep-seek-coder 2, Test-First
      - Introduction to trajectories
      - How has the human actually solved it?
      - Red: write a failing test for the next bit of change you want to add
      - Green: write the simplest code that could possibly make the test pass
      - Refactor: refactor to improve the implementation design
  13. Autonomous agents basics
      - The LLM is one component of an agent
      - The agent uses the LLM and interacts with its environment to resolve a given task
      - The agent interacts with its environment by calling tools and supplying their output as feedback to the LLM
      (Diagram: System prompt / User prompt / Assistant, with LLM, Tools and Environment)
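That loop can be sketched in a few lines of Python. The LLM and tools here are stubs, and every name is hypothetical; the point is only the shape of the loop: the LLM proposes an action, the agent runs the tool, and the tool's output goes back to the LLM as the next observation.

```python
# Minimal agent loop sketch (stub LLM and tools, all names hypothetical).

def run_agent(llm, tools, task, max_steps=5):
    history = [("task", task)]
    for _ in range(max_steps):
        action = llm(history)                    # LLM decides the next step
        if action["tool"] == "finish":
            return action["input"]
        output = tools[action["tool"]](action["input"])
        history.append(("observation", output))  # feedback to the LLM
    return None

# Stub LLM: list files first, then finish with the last observation.
def stub_llm(history):
    if history[-1][0] == "task":
        return {"tool": "ls", "input": "."}
    return {"tool": "finish", "input": "done: " + history[-1][1]}

result = run_agent(stub_llm, {"ls": lambda path: "README.md"}, "explore repo")
print(result)  # done: README.md
```

A real agent replaces `stub_llm` with an API call and `tools` with file edits, shell commands and, crucially for this deck, a test runner.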
  14. Autonomous agents basics - prompts and tokens
      - System prompt: "Répondez uniquement en espagnol." ("Answer only in Spanish.")
      - User prompt: "Bonjour"
      - Assistant: "Hola, ¿cómo estás?"
  15. Autonomous agents basics - feedback to the LLM
      - System prompt: describe the tools, describe the resolution methodology
      - User prompt: describe the task to be solved
      (Diagram: Assistant, with LLM, Tools and Environment)
  16. Autonomous agents basics - Reason and Act (ReAct)
      - System prompt: describe the tools; reason then act (use tools)
      - User prompt: describe the task to be solved
      - Loop: Reason (aka thought), Act, Observation… then Final Answer
      Source: "ReAct: Synergizing Reasoning and Acting in Language Models" (https://arxiv.org/abs/2210.03629)
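A minimal sketch of one turn of the ReAct text protocol. The exact prompt format varies by implementation; the `Action: tool[input]` syntax follows the paper's convention, but the function and tool names are illustrative:

```python
import re

# One ReAct turn (sketch): the LLM emits "Thought: ... / Action: tool[input]"
# text; the agent either executes the action and returns an observation to
# feed back, or stops when a "Final Answer:" appears.

def react_step(llm_output, tools):
    final = re.search(r"Final Answer:\s*(.+)", llm_output)
    if final:
        return ("final", final.group(1))
    action = re.search(r"Action:\s*(\w+)\[(.*)\]", llm_output)
    tool, arg = action.group(1), action.group(2)
    return ("observation", f"Observation: {tools[tool](arg)}")

tools = {"grep": lambda pattern: "3 matches"}
kind, text = react_step("Thought: search the code\nAction: grep[TODO]", tools)
print(kind, text)  # observation Observation: 3 matches
```

The outer loop simply appends each `Observation:` line to the transcript and calls the LLM again until a final answer is produced.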
  17. AI software engineering agents - main findings
      - ACI: consider the LM agent as a user and optimize for its user experience
      - Providing demos and examples reduces the need for prompt engineering
      - The agent needs feedback to be performant
  18. AI software engineering agents - main findings
      - ACI: consider the LM agent as a user and optimize for its user experience
      - Providing demos and examples reduces the need for prompt engineering
      - The agent needs feedback to be performant
  19. AI Software Crafter - DIY
      - System prompt:
        1. Instruct a Test-First approach
        2. Describe the tools (including a test command interface); reason then act (use tools)
        3. Demonstration: full trajectory of a successful example following the Test-First approach
      - User prompt: describe the issue
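The three numbered ingredients on the slide amount to assembling a system prompt. A sketch of that assembly; all strings here are illustrative placeholders, not the actual prompts used in the experiments:

```python
# Sketch of assembling the "AI Software Crafter" system prompt from the
# slide's three ingredients. Strings are hypothetical, for illustration only.

TEST_FIRST_INSTRUCTIONS = (
    "Follow a Test-First approach: write a failing test that reproduces "
    "the issue before changing any production code."
)

def build_system_prompt(tool_docs, demonstration):
    return "\n\n".join([
        TEST_FIRST_INSTRUCTIONS,                       # 1. instruct Test-First
        "Available tools:\n" + "\n".join(tool_docs),   # 2. describe the tools
        "Example trajectory:\n" + demonstration,       # 3. full successful demo
    ])

prompt = build_system_prompt(
    tool_docs=["run_tests: run the project test suite and report failures"],
    demonstration="(full trajectory of a solved issue, Test-First style)",
)
print("Test-First" in prompt)  # True
```

The demonstration slot is the costly part: as slide 23 notes, preparing useful full trajectories takes real time.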
  20. AI Software Crafter - the impact of Test-First
      - The agent can correct its mistakes through continuous test feedback
  21. AI Software Crafter
      - Generated code is (almost) useless without tests
      - "A PR, great! Does it work?"
      - At the end of the execution, it's easier to know whether the issue has actually been solved
  22. AI Software Crafter - the impact of Test-First
      - At the end of the execution, it's easier to know:
        - if the issue has actually been solved
        - if it did not break anything else
  23. Main challenges (models, contexts, cost and data)
      - A GPT-4-level model is required
      - Context size matters (we tried llama3 70b but it did not work 😕: demonstration + system prompt, tools and issue description exceed the 8k context limit)
      - Preparing useful demonstrations takes time
      - A well-tested code base is required
      - A CI/CD pipeline at the PR stage is required
  24. Future directions
      - Next-gen ACI
      - TDD:
        - 50 shades of refactoring
        - Test strategies & TDD styles
      - Before the issue:
        - Interactive issue submission (live feedback, examples formulation…)
        - Integration with live monitoring and exception management
      - How will automated issue resolution impact delivery performance? (too early to tell)
  25. Real usage will provide valuable data
      - Real-world agent thoughts and actions are the valuable data we need
  26. Our intuitions - there will be automation!
      - Usage → Data → Better performance → Automation: automation drives even more automation
  27. Our intuitions - there will be automation! 🤖
      ✅ Better resolution rate
      ✅ Decreased resolution cost
      ✅ Good test coverage
      ✅ Enforced code standards
      ✅ Clear testing strategy
      ✅ Effective CI/CD pipelines
      ✅ Clear issue formulation
      - It will work well for those it was already working well for
  28. Our intuitions - there will be developers!
      - With required human code reviews:
        🙁 Ain't no fun
        🙁 What's the point if we keep a bottleneck?
        🙁 Handovers, walls and ownerships
        👀 Déjà vu
      - With no required human code reviews:
        🙁 Missed learning opportunities
        🙁 Costly human onboarding when needed
        🐲 Here be dragons
      - Code reviews: which model is the most credible?
      - "Best" practices to be (re)invented
  29. Our intuitions - there will be developers!
      "It's developer (mis)understanding that's released in production, not the experts' knowledge."
      Source: Alberto Brandolini, creator of EventStorming
  30. What could we outsource? What would make the difference?
      - LLM + Tools + Environment
      - The tedious, repetitive but important task of regularly updating dependencies?
  31. There is no one-size-fits-all software engineering
      Source: "Map Evolution not maturity", by Simon Wardley
  32. There is no one-size-fits-all software engineering
      - Cheap feedback on ideas
      - Reduced cost of changes
      Source: Wardley Maps, by Simon Wardley
  33. Coding vs Software Engineering - There Will Be Code
      "One might argue that a book about code is somehow behind the times, that code is no longer the issue; that we should be concerned about models and requirements instead. Indeed some have suggested that we are close to the end of code. That soon all code will be generated instead of written. That programmers simply won't be needed because business people will generate programs from specifications. Nonsense! We will never be rid of code, because code represents the details of the requirements. At some level those details cannot be ignored or abstracted; they have to be specified. And specifying requirements in such detail that a machine can execute them is programming. Such a specification is code. I expect that the level of abstraction of our languages will continue to increase. I also expect that the number of domain-specific languages will continue to grow. This will be a good thing. But it will not eliminate code. Indeed, all the specifications written in these higher level and domain-specific languages will be code! It will still need to be rigorous, accurate, and so formal and detailed that a machine can understand and execute it. The folks who think that code will one day disappear are like mathematicians who hope one day to discover a mathematics that does not have to be formal. They are hoping that one day we will discover a way to create machines that can do what we want rather than what we say. These machines will have to be able to understand us so well that they can translate vaguely specified needs into perfectly executing programs that precisely meet those needs. This will never happen. Not even humans, with all their intuition and creativity, have been able to create successful systems from the vague feelings of their customers. Indeed, if the discipline of requirements specification has taught us anything, it is that well-specified requirements are as formal as code and can act as executable tests of that code! Remember that code is really the language in which we ultimately express the requirements. We may create languages that are closer to the requirements. We may create tools that help us parse and assemble those requirements into formal structures. But we will never eliminate necessary precision, so there will always be code."
      Source: "Clean Code", Robert C. Martin, 2008
      - End of code, or emergence of a new level of abstraction?
      - What degree of rigor and accuracy is still required to build successful systems?
      - Will the issue become the new code?
  34. How will feedback cycles be impacted?
      Source: "Le vieux monde se meurt, le nouveau monde tarde à apparaître, et dans ce clair-obscur surgissent les monstres" ("The old world is dying, the new world is slow to appear, and in this chiaroscuro the monsters emerge"), by Arnaud Lemaire