LLMs for code: The potential, prospects, and problems

LLMs for code: The potential, prospects, and problems Tushar Sharma
ICSA 2024

Dr. Tushar Sharma Assistant professor at Dalhousie University, Canada PhD
• Athens University of Economics and Business, Greece Industry experience • Siemens Research (7 + 2) Books • Refactoring for software design smells Tools/platforms • Designite • QConnect

Software engineering Machine learning • Source code analysis • Software
quality • Code smell detection and refactoring • Developers’ productivity • Program comprehension • Machine learning for software engineering • Software engineering for machine learning https://web.cs.dal.ca/~tushar/smart/ • Binary symbol reconstruction • Program comprehension for decompiled binaries • Vulnerability analysis for decompiled code Green AI • Sustainable machine learning • Energy hotspots and refactorings • Energy efficient code representation Sponsors and collaborators Dr. Tushar Sharma [email protected] SMART lab, Dalhousie University Tools and platforms

The matrix

Overview • Entering the Matrix • Visions of Zion •
The Architect's Blueprint • Red Pills and Blue Pills • I am (We are) the one!

Overview • Entering the Matrix (The basics) • Visions of
Zion (Prospects) • The Architect's Blueprint (State of the art) • Red Pills and Blue Pills (Challenges) • I am (We are) the one! (Opportunities)

Part I: Entering the Matrix The basics

8 A language model is a probability distribution over words
or word sequences. A language model assign probabilities for the next token(s), given a token(s) P(xn | xn-1 , xn-2 , …)

Let’s understand with an example 10

11 We believe that our efforts to consolidate and summarize
the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.

the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.

the techniques, resources, and challenges will help the community understand the state-of- the-art better and focus their efforts on tackling the identified challenges 0.33 0.33 0.33 Assigning probabilities -> creating language model

the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33

the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33 This arrangement can generate new sentences. However, not all generated sentences are meaningful. So, what we can do?

LLMs 17 • One potential approach is to use many
more examples to learn the probabilities. • But it might not be as helpful as desired. • A better approach would be to consider more than one token to decide the next token. • N-gram But how many?

LLMs 18 • We use neural network to approximate the
function for predicting the next token given a context. • However, increasing the number of units and capacity is not enough for a simple neural network to learn language modeling because of its complexity.

1 Transformer LLMs 19 Attention Next word prediction model 0
1 0 Attention model learns where to focus to better predict the next token with the help of back propagation -> self-attention Both the models learn simultaneously

LLMs 20 As the scope, target, and expectations from a
model increases, the complexity also increases. • Stacking GPT-3 has 96 such layers 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0

LLMs 21 XGLM Cohere BERT 340M GPT-1 117M GPT-2 1.5B
LANGUAGE MODEL SIZES TO MAR/2023 LifeArchitect.ai/models PaLM PaLM-Coder Minerva Med-PaLM Flan-PaLM U-PaLM Flan-U-PaLM Med-PaLM 2 540B BLOOM BLOOMZ 176B 20B 7.5B 13B Gopher 280B GPT-NeoX-20B MT-NLG 530B 52.4B Plato-XL Macaw 11B 11B Jurassic-1 178B 9.4B 6B GPT-J BlenderBot2.0 LaMDA LaMDA 2 Bard 137B GPT-3 175B ruGPT-3 Parameters AI lab/group Available Closed Chinchilla scale Chinchilla 70B* ⃡ Cedille Fairseq Anthropic-LM 6B Flamingo 80B* 20B * AlexaTM VIMA 200M 6.9B * 13B CM3 VLM-4 mGPT Luminous 200B 10B 13B Gato 1.2B OPT-175B BB3 OPT-IML 175B NLLB 54.5B GLM-130B ChatGLM-6B YaLM 100B 10B NOOR UL2 20B FIM 11B Flan-T5 11B 52B RL-CAI Claude PaLI 17B SeeKeR 2.7B 10B * WeLM Galactica 120B T5 Megatron-11B GPT-4 Undisclosed * MOSS 20B* LLaMA 65B* Kosmos-1 Z-Code++ 710M* 7B Toolformer 6.7B* * 11B Atlas Alpaca 1.6B* Beeswarm/bubble plot, sizes linear to scale. Selected highlights only. *Chinchilla scale means T:P ratio >15:1. https://lifearchitect.ai/chinchilla/ Alan D. Thompson. March 2023. https://lifearchitect.ai/

LLMs 22 Olympus 2T (2024) LARGE LANGUAGE MODEL HIGHLIGHTS (APR/2024)
LifeArchitect.ai/models Parameters AI lab/group ⃡ Sizes linear to scale. Selected highlights only. All models are available. All models are Chinchilla-aligned (20:1 tokens:parameters) https://lifearchitect.ai/chinchilla/ All 300+ models: https://lifearchitect.ai/models-table/ Alan D. Thompson. 2023-2024. Gemini Ultra 1.0 1.5T Claude 3 Opus 2T Next… (2024) 30B 70B 180B Gemini Pro 180B Nano Mamba 2.8B phi-2 2.7B … XS Pythia 12B Mistral 7B Zephyr 7.3B Gauss StripedHyena 7B Persimmon-8B DeciLM-7B SOLAR 10.7B Gemma 7B PaLM 2 340B Inﬂection-2.5 Grok-1 314B GPT-4 1.76T MoE ERNIE 4.0 1T ChatGPT gpt-3.5-turbo 20B Large Yuan 2.0 102.6B InternLM 104B Jurassic-2 Falcon 180B DBRX 132B MoE Mistral-medium Command R+ 104B … Small Palmyra 20B C1.2 Retro 48B MPT-30B Command-R 35B Yi-34B Mixtral 8x7B … Medium Command 52B StableLM 65B Eurus 70B Luminous Supreme Llama 3 70B Perplexity 70B Online Qwen-72B DeepSeek 67B

LLMs for code 23 A Survey of Large Language Models
for Code: Evolution, Benchmarking, and Future Trends

24 Harnessing the Power of LLMs in Practice: A Survey
on ChatGPT and Beyond, 2024

• Foundation models • Number of parameters • Chain of
thoughts • Fine-tuning • Prompt • Prompt engineering • Zero-shot learning • Few-shot learning • Temperature • Multi-modal Common terms

Part II: Visions of Zion Prospects of LLMs for code

Code generation 28 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Translation 29 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Creating dictionaries 30 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Understand algorithm and implementation 31

• Efficiency and productivity • Code generation • Real-time code
suggestions • Automated documentation • Customization • Applicability for wide range of use-cases • Creativity • Ease of use • Learning Benefits

Part III: The architect’s blueprint State of the art

Current state of LLMs for code research 34 LLMs for
code Predict/Classify Summarize Selective synthesis Multi-agent processing End-to-end development Complexity Sentiment analysis, vulnerability or code quality issue prediction, … Renaming a method, document generation, … Generate code for a specific problem, … Identify security vulnerabilities in the latest commit and lodge an issue, … Identify and prioritize requirements, code, refactor, test, debug, …

LLMs in SE research 35 Large Language Models for Software
Engineering: Survey and Open Problems

LLMs for code 36 Large Language Models for Software Engineering:
Survey and Open Problems

Impact • Essential part of software development • Code generation
- Copilot generates 61% of Java code (Feb 2023) • Test suite generation – Cover from DiffBlue • Threatening the status quo • StackOverFlow, or, in general, Google search • New tools and agents leveraging LLMs • Improved effectiveness of automated approaches using LLMs • Effective embeddings, for example 37

LLMs are constantly improving 38 Unifying the Perspectives of NLP
and Software Engineering: A Survey on Language Models for Code. 2023

Leaderboards 39 https://www.vellum.ai/llm-leaderboard

Part IV: Red and Blue pills Challenges

• Correctness • Completeness • Maintainable • Secure • …
Quality of the generated code

Correctness of generated code 42 Is Your Code Generated by
ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, 2023

Completeness of generated code 43 https://hackernoon.com/testing-llms-on-solving-leetcode-problems

Completeness of generated code 44 Studying LLM Performance on Closed-
and Open-source Data

Correctness of generated code 45 Bugs in Large Language Models
Generated Code: An Empirical Study, 2024

46 Evaluating the Code Quality of AI-Assisted Code Generation Tools:
An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT, Oct 2023 How maintainable is the generated code?

• Copilot is less likely to generate vulnerable code corresponding
to newer vulnerabilities • Copilot is more prone to generate certain types of vulnerabilities • Using Copilot to fix security bugs is risky, given that Copilot did introduce vulnerabilities in at least a third of the cases we studied. Vulnerabilities in generated code Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code?

Vulnerabilities in generated code 48 A survey identified • 83
“good” papers that highlight the positive contributions of LLMs to security and privacy. • 54 “bad” papers, in which attackers exploited LLMs to target users, • 144 “ugly” papers, in which authors discovered vulnerabilities within LLMs. A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly, June 2024

Vulnerabilities in generated code 49 • In total 376 of
these websites (approximately 78%) lacked essential extension checks exposing them to potential malicious file uploads. • Alarmingly, only 1143 (about 45.72%) websites implemented prepared statements, leaving 54.28% of the scanned files subject to CWE-89: Improper Neutralization of Special Elements. • We identified 459 SQL injection, 57 stored XSS, 394 reflected XSS vulnerable parameters in the entire dataset. LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations, 2024 Our findings serve as a strong reminder of the continuous and evolving threat landscape, urging developers and security professionals to remain vigilant and use generative AI with caution.

Can LLMs reason? 50 LLMs cannot always perform analogical reasoning
and the key influencing factor is the accuracy of self-generated examples rather than their relevance. Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

Can LLMs reason? 51 Overall, GPT-4 performs best in grasping
inferential rules. But compared to human performance, there still remains substantial room for improvement across all models, especially in highly compositional, symbolic and structural complex rules. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024

Can LLMs reason? 52 LLMs, especially GPT families, can effectively
stimulate the reasoning results of logic solvers. Although LLMs demonstrate satisfactory performance on several datasets, the potential drawbacks and limitations of LLMs for logic code simulation should not be underestimated. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024

Technical challenges • Context window • How to assess their
effectiveness • Comprehensive benchmarks (not limited to accuracy) • Evaluation on realistic problems/samples • Data leakage • Multi-step tasks that require domain-specific direction • Prompt injection • Indistinguishability between instruction and data 53

Societal challenges 54 • Effect of LLMs on climate •
Bias • Pedagogy • Are these LLMs making students dumb? • Over-reliance and loss of critical thinking

• Lack of accountability • Copyright • Many active lawsuits
• Level of transparency • Large organizations do not share their sources • Misinformation and disinformation • Privacy concerns Legal challenges

Impact on environment and cost 56 Llama cost of training
- 2048 A100 GPUs for 23 days - Electricity cost $53K Operational cost: ChatGPT spends $700K per day Google PaLM trained on 6144 TPUs V4 made of two TPU V4 pods Meta AI’s OPT was trained on 992 A100 GPUs https://www.economist.com/technology-quarterly/2020/06/11/the-cost-of-training-machines-is-becoming-a-problem

How green are LLMs GPT3 • 1287 GW-hr • 502
tons of carbon • 120 years' worth of single-family electricity usage of an American household 57

Part V: I am (We are) the “one” Opportunities

Opportunities • Better integration with apps, APIs, and bots •
Domain/context-specific support • Greener models • Reduced inference latency • Enhanced code understanding capabilities • LLMs that understand software design and architecture 59

THANK YOU D r. Tu s h a r S
h a r m a t u s h a r @ d a l . c a

LLMs for code: The potential, prospects, and pr...

LLMs for code: The potential, prospects, and problems

More Decks by Tushar Sharma

Other Decks in Programming

Featured

Transcript