

The Ultimate Software Crafter - Meetup Crafting Data Science

These slides were presented at the "Crafting Data Science" meetup in Paris on 25/06/2024.

Autonomous AI software development agents could become far more capable thanks to TDD.
It sounds crazy, but the results of our experiments with Naji Alazhar are promising!

In recent months, the market for autonomous dev agents (aka AI software engineering agents) has been in turmoil. We have seen plenty of hype and some botched demos (from Cognition's Devin and GitHub Copilot Workspace, to name just two).

Beyond the headline announcements, credible open-source solutions are now emerging (such as SWE-Agent) that can already be integrated into a development workflow.
You give them a task (a change request or a bug to fix, described in markdown or via a link to a GitHub issue) and they open a PR.
They can't do everything yet, not even half of what we ask of them. But they can already solve, in a few minutes, tasks that take humans hours or days.

With Naji Alazhar, we are exploring whether applying TDD principles can improve an agent's resolution rate.
Our first tests show that the quality of the feedback the agent gets from TDD lets it correct course after a hallucination. And that changes everything!

Wassel Alazhar

June 25, 2024


Transcript

  1. Agenda
     1. A hot topic
        - Devin: the Buzz and the Bad Buzz
        - GitHub Copilot Workspace: the Buzz and the Bad Buzz
        - SWE-Agent: the unexpected OSS alternative
        - It's not going to calm down
     2. What are coding agents capable of today?
        - Beyond the hype, a possible option
        - SWE-Bench: a reference benchmark
        - 18%: opportunities and limitations
        - Potential
        - Our ambitious claim: TDD will enable more automated software engineering
     3. Live Demo
     4. How does a coding agent work?
        - Autonomous agents basics
        - How do AI software engineering agents work?
        - Main findings
        - How a test-first approach will improve performance
        - Main challenges (models, contexts, cost and data)
        - Future directions
     5. What's Next? Let's discuss
        - Our intuitions
        - What would really change? What would not?
  2. Beyond the hype, a possible option
     - LLM + Tools + Environment
     - Problem description (aka issue) → Resolution as Code (aka Pull Request)
  3. SWE-Bench: a reference benchmark
     - 12 code repositories among the most popular Python repos (Django, flask, matplotlib, requests, scikit-learn, sympy…)
       - Clear contribution guidelines
       - Better test coverage
       - Tend to be better maintained
     - SWE-Bench: 2294 issues
     - The benchmark evaluates against already-solved PRs (whose tests cover the changed code)
  4. SWE-Bench: a reference benchmark
     - The agent is given the parent commit of the solved PR
     - To verify the agent's PR, all related tests are run
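The verification step above can be sketched as a pass/fail decision over two test sets (SWE-Bench calls them FAIL_TO_PASS and PASS_TO_PASS). This is a minimal sketch, not the official harness; the function name and result encoding are invented for illustration:

```python
# Hypothetical helper mirroring SWE-Bench-style verification: an agent patch
# counts as a resolution only if the tests tied to the issue now pass AND the
# previously passing tests still pass.

def is_resolved(fail_to_pass, pass_to_pass, results):
    """results maps test ids to "PASS" or "FAIL" after applying the agent's patch."""
    issue_fixed = all(results.get(t) == "PASS" for t in fail_to_pass)
    nothing_broken = all(results.get(t) == "PASS" for t in pass_to_pass)
    return issue_fixed and nothing_broken

# Example: the issue's test now passes, but a regression slipped in.
print(is_resolved(
    fail_to_pass=["test_new_feature"],
    pass_to_pass=["test_existing"],
    results={"test_new_feature": "PASS", "test_existing": "FAIL"},
))  # False
```

The second condition is what slide 22 insists on: a green issue test alone is not enough, the rest of the suite must stay green too.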
  5. SWE-Bench Lite: streamlining for better focus
     - 300 issues
     - Excludes:
       - issues with external dependencies or images in the description
       - issues with descriptions shorter than 40 words
       - PRs with changes across multiple files
       - issues with tests on specific error messages
  6. 18% - Opportunities (as of 2024-06-04)
     - Cost:
       - < $2 per issue (GPT-4o LLM API)
       - a few $ for the agent execution infrastructure
       - 300 issues => ~$600
     - Expected benefits:
       - 54 / 300 issues solved
       - 2.77 man-days / issue (avg): "It costs on average ~2.77 days for developers to create pull requests…" [samples from the SWE-Bench-Lite dataset] Source: "AutoCodeRover: Autonomous Program Improvement", https://arxiv.org/pdf/2404.05427
       - 18% * 2.77 * 300 = 149.58 man-days
       - ($2 * 300) / 149.58 = $4.01 cost per man-day
       - 2-10 min / issue
     - What executives would think: 149.58 man-days for $600 instead of $75K* (*assuming a $500 average daily rate)
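The slide's back-of-the-envelope arithmetic can be checked directly (all figures are taken from the slide itself):

```python
# Reproducing the slide's cost math for the GPT-4o run (as of 2024-06-04).
solved, total = 54, 300
resolution_rate = solved / total        # 18%
avg_days_per_issue = 2.77               # avg human effort per PR (AutoCodeRover paper)
llm_cost_per_issue = 2.0                # dollars per issue with the GPT-4o API

man_days_saved = resolution_rate * avg_days_per_issue * total
total_cost = total * llm_cost_per_issue
cost_per_man_day = total_cost / man_days_saved

print(round(man_days_saved, 2))    # 149.58
print(round(cost_per_man_day, 2))  # 4.01
```

The same formula with the Sonnet 3.5 numbers from slide 8 (26.67% resolution rate, $0.15 per issue) yields the 221.63 man-days and ~$0.23 per man-day quoted there.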
  7. 18% - Limitations
     - Issues that don't require many changes
     - The issue description is concise and self-contained
     - Work after the PR
     - Work before the issue
  8. 26.67% - Opportunities (as of 2024-06-22)
     - Cost:
       - $0.15 per issue (Sonnet 3.5 LLM API)
       - a few $ for the agent execution infrastructure
       - 300 issues => ~$50
     - Expected benefits:
       - 2.77 man-days / issue (avg)
       - 26.67% * 2.77 * 300 = 221.63 man-days
       - $50 / 221.63 ≈ $0.23 cost per man-day
       - 2-10 min / issue
     - What executives would think: 221.63 man-days for $50 instead of $111K* (*assuming a $500 average daily rate)
  9. Potential challenges
     - Cost management
     - How to know if the proposed PR resolves the issue?
  10. Potential: TDD will enable more automated software engineering
      - Red: write a failing test for the next bit of change you want to add
      - Green: write the simplest code that could possibly make the test pass
      - Refactor: refactor to improve the implementation design
      Source: Canon TDD, by Kent Beck
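One iteration of that Red/Green/Refactor loop, as a toy Python example (the `slugify` function and test are invented for illustration):

```python
# One Red-Green iteration of Canon TDD (toy example).

# Red: write a failing test for the next bit of change you want to add.
# (Running this before slugify exists, or with a wrong slugify, fails.)
def test_slugify_replaces_spaces():
    assert slugify("crafting data science") == "crafting-data-science"

# Green: the simplest code that could possibly make the test pass.
def slugify(title):
    return title.replace(" ", "-")

# Refactor: with the test green, the implementation can now be reshaped
# (lowercasing, stripping punctuation, ...) without fear of regressions.

test_slugify_replaces_spaces()  # passes silently
```

The point made later in the deck is that this same loop gives an agent a machine-checkable signal at every step, instead of a PR whose correctness is unknown.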
  11. Live Demo
      - Live: GPT-4o, Test-First
      - Live: deep-seek-coder 2, Test-First
      - Introduction to trajectories
      - How has the human actually solved it?
  12. Live Demo
      - Live: GPT-4o, Test-First
      - Live: deep-seek-coder 2, Test-First
      - Introduction to trajectories
      - How has the human actually solved it?
      - Red: write a failing test for the next bit of change you want to add
      - Green: write the simplest code that could possibly make the test pass
      - Refactor: refactor to improve the implementation design
  13. Autonomous agents basics
      - The LLM is one component of an agent
      - The agent uses the LLM and interacts with its environment to resolve a given task
      - The agent interacts with its environment by calling tools and supplying their output as feedback to the LLM
      (Diagram: System prompt / User prompt / Assistant, with LLM, Tools and Environment)
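That loop can be sketched in a few lines of Python. The LLM and tools here are stubs, and every name is hypothetical; the point is only the shape of the loop: the LLM proposes an action, the agent runs the tool, and the tool's output goes back to the LLM as the next observation.

```python
# Minimal agent loop sketch (stub LLM and tools, all names hypothetical).

def run_agent(llm, tools, task, max_steps=5):
    history = [("task", task)]
    for _ in range(max_steps):
        action = llm(history)                    # LLM decides the next step
        if action["tool"] == "finish":
            return action["input"]
        output = tools[action["tool"]](action["input"])
        history.append(("observation", output))  # feedback to the LLM
    return None

# Stub LLM: list files first, then finish with the last observation.
def stub_llm(history):
    if history[-1][0] == "task":
        return {"tool": "ls", "input": "."}
    return {"tool": "finish", "input": "done: " + history[-1][1]}

result = run_agent(stub_llm, {"ls": lambda path: "README.md"}, "explore repo")
print(result)  # done: README.md
```

A real agent replaces `stub_llm` with an API call and `tools` with file edits, shell commands and, crucially for this deck, a test runner.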
  14. Autonomous agents basics - prompts and tokens
      - System prompt: "Répondez uniquement en espagnol." ("Answer only in Spanish.")
      - User prompt: "Bonjour"
      - Assistant: "Hola, ¿cómo estás?"
  15. Autonomous agents basics - feedback to the LLM
      - System prompt: describe the tools, describe the resolution methodology
      - User prompt: describe the task to be solved
      (Diagram: Assistant, with LLM, Tools and Environment)
  16. Autonomous agents basics - Reason and Act (ReAct)
      - System prompt: describe the tools; reason then act (use tools)
      - User prompt: describe the task to be solved
      - Loop: Reason (aka thought), Act, Observation… then Final Answer
      Source: "ReAct: Synergizing Reasoning and Acting in Language Models" (https://arxiv.org/abs/2210.03629)
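A minimal sketch of one turn of the ReAct text protocol. The exact prompt format varies by implementation; the `Action: tool[input]` syntax follows the paper's convention, but the function and tool names are illustrative:

```python
import re

# One ReAct turn (sketch): the LLM emits "Thought: ... / Action: tool[input]"
# text; the agent either executes the action and returns an observation to
# feed back, or stops when a "Final Answer:" appears.

def react_step(llm_output, tools):
    final = re.search(r"Final Answer:\s*(.+)", llm_output)
    if final:
        return ("final", final.group(1))
    action = re.search(r"Action:\s*(\w+)\[(.*)\]", llm_output)
    tool, arg = action.group(1), action.group(2)
    return ("observation", f"Observation: {tools[tool](arg)}")

tools = {"grep": lambda pattern: "3 matches"}
kind, text = react_step("Thought: search the code\nAction: grep[TODO]", tools)
print(kind, text)  # observation Observation: 3 matches
```

The outer loop simply appends each `Observation:` line to the transcript and calls the LLM again until a final answer is produced.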
  17. AI software engineering agents - main findings
      - ACI: consider the LM agent as a user and optimize for its user experience
      - Providing demos and examples reduces the need for prompt engineering
      - The agent needs feedback to be performant
  18. AI software engineering agents - main findings
      - ACI: consider the LM agent as a user and optimize for its user experience
      - Providing demos and examples reduces the need for prompt engineering
      - The agent needs feedback to be performant
  19. AI Software Crafter - DIY
      - System prompt:
        1. Instruct a Test-First approach
        2. Describe the tools (including a test command interface); reason then act (use tools)
        3. Demonstration: full trajectory of a successful example following the Test-First approach
      - User prompt: describe the issue
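The three numbered ingredients on the slide amount to assembling a system prompt. A sketch of that assembly; all strings here are illustrative placeholders, not the actual prompts used in the experiments:

```python
# Sketch of assembling the "AI Software Crafter" system prompt from the
# slide's three ingredients. Strings are hypothetical, for illustration only.

TEST_FIRST_INSTRUCTIONS = (
    "Follow a Test-First approach: write a failing test that reproduces "
    "the issue before changing any production code."
)

def build_system_prompt(tool_docs, demonstration):
    return "\n\n".join([
        TEST_FIRST_INSTRUCTIONS,                       # 1. instruct Test-First
        "Available tools:\n" + "\n".join(tool_docs),   # 2. describe the tools
        "Example trajectory:\n" + demonstration,       # 3. full successful demo
    ])

prompt = build_system_prompt(
    tool_docs=["run_tests: run the project test suite and report failures"],
    demonstration="(full trajectory of a solved issue, Test-First style)",
)
print("Test-First" in prompt)  # True
```

The demonstration slot is the costly part: as slide 23 notes, preparing useful full trajectories takes real time.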
  20. AI Software Crafter - the impact of Test-First
      - The agent can correct its mistakes through continuous test feedback
  21. AI Software Crafter
      - Generated code is (almost) useless without tests
      - "A PR, great! Does it work?"
      - At the end of the execution, it's easier to know whether the issue has actually been solved
  22. AI Software Crafter - the impact of Test-First
      - At the end of the execution, it's easier to know:
        - if the issue has actually been solved
        - if it did not break anything else
  23. Main challenges (models, contexts, cost and data)
      - A GPT-4-level model is required
      - Context size matters (we tried llama3 70b but it did not work 😕: demonstration + system prompt, tools and issue description exceed the 8k context limit)
      - Preparing useful demonstrations takes time
      - A well-tested code base is required
      - A CI/CD pipeline at the PR stage is required
  24. Future directions
      - Next-gen ACI
      - TDD:
        - 50 shades of refactoring
        - Test strategies & TDD styles
      - Before the issue:
        - Interactive issue submission (live feedback, examples formulation…)
        - Integration with live monitoring and exception management
      - How will automated issue resolution impact delivery performance? (too early to tell)
  25. Real usage will provide valuable data
      - Real-world agent thoughts and actions are the valuable data we need
  26. Our intuitions - there will be automation!
      - Usage → Data → Better performance → Automation: automation drives even more automation
  27. Our intuitions - there will be automation! 🤖
      ✅ Better resolution rate
      ✅ Decreased resolution cost
      ✅ Good test coverage
      ✅ Enforced code standards
      ✅ Clear testing strategy
      ✅ Effective CI/CD pipelines
      ✅ Clear issue formulation
      - It will work well for those it was already working well for
  28. Our intuitions - there will be developers!
      - With required human code reviews:
        🙁 Ain't no fun
        🙁 What's the point if we keep a bottleneck?
        🙁 Handovers, walls and ownerships
        👀 Déjà vu
      - With no required human code reviews:
        🙁 Missed learning opportunities
        🙁 Costly human onboarding when needed
        🐲 Here be dragons
      - Code reviews: which model is the most credible?
      - "Best" practices to be (re)invented
  29. Our intuitions - there will be developers!
      "It's developer (mis)understanding that's released in production, not the experts' knowledge."
      Source: Alberto Brandolini, creator of EventStorming
  30. What could we outsource? What would make the difference?
      - LLM + Tools + Environment
      - The tedious, repetitive but important task of regularly updating dependencies?
  31. There is no one-size-fits-all software engineering
      Source: "Map Evolution not maturity", by Simon Wardley
  32. There is no one-size-fits-all software engineering
      - Cheap feedback on ideas
      - Reduced cost of changes
      Source: Wardley Maps, by Simon Wardley
  33. Coding vs Software Engineering - There Will Be Code
      "One might argue that a book about code is somehow behind the times, that code is no longer the issue; that we should be concerned about models and requirements instead. Indeed some have suggested that we are close to the end of code. That soon all code will be generated instead of written. That programmers simply won't be needed because business people will generate programs from specifications. Nonsense! We will never be rid of code, because code represents the details of the requirements. At some level those details cannot be ignored or abstracted; they have to be specified. And specifying requirements in such detail that a machine can execute them is programming. Such a specification is code. I expect that the level of abstraction of our languages will continue to increase. I also expect that the number of domain-specific languages will continue to grow. This will be a good thing. But it will not eliminate code. Indeed, all the specifications written in these higher level and domain-specific languages will be code! It will still need to be rigorous, accurate, and so formal and detailed that a machine can understand and execute it. The folks who think that code will one day disappear are like mathematicians who hope one day to discover a mathematics that does not have to be formal. They are hoping that one day we will discover a way to create machines that can do what we want rather than what we say. These machines will have to be able to understand us so well that they can translate vaguely specified needs into perfectly executing programs that precisely meet those needs. This will never happen. Not even humans, with all their intuition and creativity, have been able to create successful systems from the vague feelings of their customers. Indeed, if the discipline of requirements specification has taught us anything, it is that well-specified requirements are as formal as code and can act as executable tests of that code! Remember that code is really the language in which we ultimately express the requirements. We may create languages that are closer to the requirements. We may create tools that help us parse and assemble those requirements into formal structures. But we will never eliminate necessary precision, so there will always be code."
      Source: "Clean Code", Robert C. Martin, 2008
      - End of code, or emergence of a new level of abstraction?
      - What degree of rigor and accuracy is still required to build successful systems?
      - Will the issue become the new code?
  34. How will feedback cycles be impacted?
      Source: "Le vieux monde se meurt, le nouveau monde tarde à apparaître, et dans ce clair-obscur surgissent les monstres" ("The old world is dying, the new world is slow to appear, and in this chiaroscuro the monsters emerge"), by Arnaud Lemaire