LLMs for code: The potential, prospects, and problems

Slide 1

Slide 1 text

LLMs for code: The potential, prospects, and problems Tushar Sharma ICSA 2024

Slide 2

Slide 2 text

Dr. Tushar Sharma Assistant professor at Dalhousie University, Canada PhD • Athens University of Economics and Business, Greece Industry experience • Siemens Research (7 + 2) Books • Refactoring for software design smells Tools/platforms • Designite • QConnect

Slide 3

Slide 3 text

Software engineering Machine learning • Source code analysis • Software quality • Code smell detection and refactoring • Developers’ productivity • Program comprehension • Machine learning for software engineering • Software engineering for machine learning https://web.cs.dal.ca/~tushar/smart/ • Binary symbol reconstruction • Program comprehension for decompiled binaries • Vulnerability analysis for decompiled code Green AI • Sustainable machine learning • Energy hotspots and refactorings • Energy efficient code representation Sponsors and collaborators Dr. Tushar Sharma [email protected] SMART lab, Dalhousie University Tools and platforms

Slide 4

Slide 4 text

The matrix

Slide 5

Slide 5 text

Overview • Entering the Matrix • Visions of Zion • The Architect's Blueprint • Red Pills and Blue Pills • I am (We are) the one!

Slide 6

Slide 6 text

Overview • Entering the Matrix (The basics) • Visions of Zion (Prospects) • The Architect's Blueprint (State of the art) • Red Pills and Blue Pills (Challenges) • I am (We are) the one! (Opportunities)

Slide 7

Slide 7 text

Part I: Entering the Matrix The basics

Slide 8

Slide 8 text

8 A language model is a probability distribution over words or word sequences. A language model assign probabilities for the next token(s), given a token(s) P(xn | xn-1 , xn-2 , …)

Slide 9

Slide 9 text

Let’s understand with an example 10

Slide 10

Slide 10 text

11 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.

Slide 11

Slide 11 text

12 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.

Slide 12

Slide 12 text

13 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help the community understand the state-of- the-art better and focus their efforts on tackling the identified challenges 0.33 0.33 0.33 Assigning probabilities -> creating language model

Slide 13

Slide 13 text

14 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help the community understand the state-of- the-art better and focus their efforts on tackling the identified challenges 0.33 0.33 0.33 Assigning probabilities -> creating language model

Slide 14

Slide 14 text

15 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33

Slide 15

Slide 15 text

16 We believe that our efforts to consolidate and summarize the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33 This arrangement can generate new sentences. However, not all generated sentences are meaningful. So, what we can do?

Slide 16

Slide 16 text

LLMs 17 • One potential approach is to use many more examples to learn the probabilities. • But it might not be as helpful as desired. • A better approach would be to consider more than one token to decide the next token. • N-gram But how many?

Slide 17

Slide 17 text

LLMs 18 • We use neural network to approximate the function for predicting the next token given a context. • However, increasing the number of units and capacity is not enough for a simple neural network to learn language modeling because of its complexity.

Slide 18

Slide 18 text

1 Transformer LLMs 19 Attention Next word prediction model 0 1 0 Attention model learns where to focus to better predict the next token with the help of back propagation -> self-attention Both the models learn simultaneously

Slide 19

Slide 19 text

LLMs 20 As the scope, target, and expectations from a model increases, the complexity also increases. • Stacking GPT-3 has 96 such layers 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0

Slide 20

Slide 20 text

LLMs 21 XGLM Cohere BERT 340M GPT-1 117M GPT-2 1.5B LANGUAGE MODEL SIZES TO MAR/2023 LifeArchitect.ai/models PaLM PaLM-Coder Minerva Med-PaLM Flan-PaLM U-PaLM Flan-U-PaLM Med-PaLM 2 540B BLOOM BLOOMZ 176B 20B 7.5B 13B Gopher 280B GPT-NeoX-20B MT-NLG 530B 52.4B Plato-XL Macaw 11B 11B Jurassic-1 178B 9.4B 6B GPT-J BlenderBot2.0 LaMDA LaMDA 2 Bard 137B GPT-3 175B ruGPT-3 Parameters AI lab/group Available Closed Chinchilla scale Chinchilla 70B* ⃡ Cedille Fairseq Anthropic-LM 6B Flamingo 80B* 20B * AlexaTM VIMA 200M 6.9B * 13B CM3 VLM-4 mGPT Luminous 200B 10B 13B Gato 1.2B OPT-175B BB3 OPT-IML 175B NLLB 54.5B GLM-130B ChatGLM-6B YaLM 100B 10B NOOR UL2 20B FIM 11B Flan-T5 11B 52B RL-CAI Claude PaLI 17B SeeKeR 2.7B 10B * WeLM Galactica 120B T5 Megatron-11B GPT-4 Undisclosed * MOSS 20B* LLaMA 65B* Kosmos-1 Z-Code++ 710M* 7B Toolformer 6.7B* * 11B Atlas Alpaca 1.6B* Beeswarm/bubble plot, sizes linear to scale. Selected highlights only. *Chinchilla scale means T:P ratio >15:1. https://lifearchitect.ai/chinchilla/ Alan D. Thompson. March 2023. https://lifearchitect.ai/

Slide 21

Slide 21 text

LLMs 22 Olympus 2T (2024) LARGE LANGUAGE MODEL HIGHLIGHTS (APR/2024) LifeArchitect.ai/models Parameters AI lab/group ⃡ Sizes linear to scale. Selected highlights only. All models are available. All models are Chinchilla-aligned (20:1 tokens:parameters) https://lifearchitect.ai/chinchilla/ All 300+ models: https://lifearchitect.ai/models-table/ Alan D. Thompson. 2023-2024. Gemini Ultra 1.0 1.5T Claude 3 Opus 2T Next… (2024) 30B 70B 180B Gemini Pro 180B Nano Mamba 2.8B phi-2 2.7B … XS Pythia 12B Mistral 7B Zephyr 7.3B Gauss StripedHyena 7B Persimmon-8B DeciLM-7B SOLAR 10.7B Gemma 7B PaLM 2 340B Inﬂection-2.5 Grok-1 314B GPT-4 1.76T MoE ERNIE 4.0 1T ChatGPT gpt-3.5-turbo 20B Large Yuan 2.0 102.6B InternLM 104B Jurassic-2 Falcon 180B DBRX 132B MoE Mistral-medium Command R+ 104B … Small Palmyra 20B C1.2 Retro 48B MPT-30B Command-R 35B Yi-34B Mixtral 8x7B … Medium Command 52B StableLM 65B Eurus 70B Luminous Supreme Llama 3 70B Perplexity 70B Online Qwen-72B DeepSeek 67B

Slide 22

Slide 22 text

LLMs for code 23 A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

Slide 23

Slide 23 text

24 Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond, 2024

Slide 24

Slide 24 text

• Foundation models • Number of parameters • Chain of thoughts • Fine-tuning • Prompt • Prompt engineering • Zero-shot learning • Few-shot learning • Temperature • Multi-modal Common terms

Slide 25

Slide 25 text

Part II: Visions of Zion Prospects of LLMs for code

Slide 26

Slide 26 text

Code generation 28 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Slide 27

Slide 27 text

Translation 29 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Slide 28

Slide 28 text

Creating dictionaries 30 https://www.encora.com/insights/github-copilot-a-code-autocomplete-tool-on-steroids

Slide 29

Slide 29 text

Understand algorithm and implementation 31

Slide 30

Slide 30 text

• Efficiency and productivity • Code generation • Real-time code suggestions • Automated documentation • Customization • Applicability for wide range of use-cases • Creativity • Ease of use • Learning Benefits

Slide 31

Slide 31 text

Part III: The architect’s blueprint State of the art

Slide 32

Slide 32 text

Current state of LLMs for code research 34 LLMs for code Predict/Classify Summarize Selective synthesis Multi-agent processing End-to-end development Complexity Sentiment analysis, vulnerability or code quality issue prediction, … Renaming a method, document generation, … Generate code for a specific problem, … Identify security vulnerabilities in the latest commit and lodge an issue, … Identify and prioritize requirements, code, refactor, test, debug, …

Slide 33

Slide 33 text

LLMs in SE research 35 Large Language Models for Software Engineering: Survey and Open Problems

Slide 34

Slide 34 text

LLMs for code 36 Large Language Models for Software Engineering: Survey and Open Problems

Slide 35

Slide 35 text

Impact • Essential part of software development • Code generation - Copilot generates 61% of Java code (Feb 2023) • Test suite generation – Cover from DiffBlue • Threatening the status quo • StackOverFlow, or, in general, Google search • New tools and agents leveraging LLMs • Improved effectiveness of automated approaches using LLMs • Effective embeddings, for example 37

Slide 36

Slide 36 text

LLMs are constantly improving 38 Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. 2023

Slide 37

Slide 37 text

Leaderboards 39 https://www.vellum.ai/llm-leaderboard

Slide 38

Slide 38 text

Part IV: Red and Blue pills Challenges

Slide 39

Slide 39 text

• Correctness • Completeness • Maintainable • Secure • … Quality of the generated code

Slide 40

Slide 40 text

Correctness of generated code 42 Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, 2023

Slide 41

Slide 41 text

Completeness of generated code 43 https://hackernoon.com/testing-llms-on-solving-leetcode-problems

Slide 42

Slide 42 text

Completeness of generated code 44 Studying LLM Performance on Closed- and Open-source Data

Slide 43

Slide 43 text

Correctness of generated code 45 Bugs in Large Language Models Generated Code: An Empirical Study, 2024

Slide 44

Slide 44 text

46 Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT, Oct 2023 How maintainable is the generated code?

Slide 45

Slide 45 text

• Copilot is less likely to generate vulnerable code corresponding to newer vulnerabilities • Copilot is more prone to generate certain types of vulnerabilities • Using Copilot to fix security bugs is risky, given that Copilot did introduce vulnerabilities in at least a third of the cases we studied. Vulnerabilities in generated code Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code?

Slide 46

Slide 46 text

Vulnerabilities in generated code 48 A survey identified • 83 “good” papers that highlight the positive contributions of LLMs to security and privacy. • 54 “bad” papers, in which attackers exploited LLMs to target users, • 144 “ugly” papers, in which authors discovered vulnerabilities within LLMs. A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly, June 2024

Slide 47

Slide 47 text

Vulnerabilities in generated code 49 • In total 376 of these websites (approximately 78%) lacked essential extension checks exposing them to potential malicious file uploads. • Alarmingly, only 1143 (about 45.72%) websites implemented prepared statements, leaving 54.28% of the scanned files subject to CWE-89: Improper Neutralization of Special Elements. • We identified 459 SQL injection, 57 stored XSS, 394 reflected XSS vulnerable parameters in the entire dataset. LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations, 2024 Our findings serve as a strong reminder of the continuous and evolving threat landscape, urging developers and security professionals to remain vigilant and use generative AI with caution.

Slide 48

Slide 48 text

Can LLMs reason? 50 LLMs cannot always perform analogical reasoning and the key influencing factor is the accuracy of self-generated examples rather than their relevance. Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

Slide 49

Slide 49 text

Can LLMs reason? 51 Overall, GPT-4 performs best in grasping inferential rules. But compared to human performance, there still remains substantial room for improvement across all models, especially in highly compositional, symbolic and structural complex rules. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024

Slide 50

Slide 50 text

Can LLMs reason? 52 LLMs, especially GPT families, can effectively stimulate the reasoning results of logic solvers. Although LLMs demonstrate satisfactory performance on several datasets, the potential drawbacks and limitations of LLMs for logic code simulation should not be underestimated. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024

Slide 51

Slide 51 text

Technical challenges • Context window • How to assess their effectiveness • Comprehensive benchmarks (not limited to accuracy) • Evaluation on realistic problems/samples • Data leakage • Multi-step tasks that require domain-specific direction • Prompt injection • Indistinguishability between instruction and data 53

Slide 52

Slide 52 text

Societal challenges 54 • Effect of LLMs on climate • Bias • Pedagogy • Are these LLMs making students dumb? • Over-reliance and loss of critical thinking

Slide 53

Slide 53 text

• Lack of accountability • Copyright • Many active lawsuits • Level of transparency • Large organizations do not share their sources • Misinformation and disinformation • Privacy concerns Legal challenges

Slide 54

Slide 54 text

Impact on environment and cost 56 Llama cost of training - 2048 A100 GPUs for 23 days - Electricity cost $53K Operational cost: ChatGPT spends $700K per day Google PaLM trained on 6144 TPUs V4 made of two TPU V4 pods Meta AI’s OPT was trained on 992 A100 GPUs https://www.economist.com/technology-quarterly/2020/06/11/the-cost-of-training-machines-is-becoming-a-problem

Slide 55

Slide 55 text

How green are LLMs GPT3 • 1287 GW-hr • 502 tons of carbon • 120 years' worth of single-family electricity usage of an American household 57

Slide 56

Slide 56 text

Part V: I am (We are) the “one” Opportunities

Slide 57

Slide 57 text

Opportunities • Better integration with apps, APIs, and bots • Domain/context-specific support • Greener models • Reduced inference latency • Enhanced code understanding capabilities • LLMs that understand software design and architecture 59

Slide 58

Slide 58 text

THANK YOU D r. Tu s h a r S h a r m a t u s h a r @ d a l . c a