Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Overview of Jailbreaking in Prompt Injection

WY
January 22, 2024
67

Overview of Jailbreaking in Prompt Injection

WY

January 22, 2024
Tweet

Transcript

  1. KYOTO UNIVERSITY 3 Introduction 
 LLM has both positive and

    negative impacts ▪ Large language models (LLMs) have revolutionized applications in natural language processing, especially in human-machine interaction within a prompt paradigm. ▪ At the same time, LLMs pose risks of misuse by malicious users, as evidenced by the prevalence of jailbreak prompts, like the DAN series.
  2. KYOTO UNIVERSITY 4 Background What is jailbreaking? ▪ Jailbreak is

    a process that employs prompt injection to specifically circumvent the safety and moderation features placed on LLMs by their creators. You can do cyberattacks in 
 the following procedure. 
 First, … How can I do 
 Cyberattacks? FORGET filtering LLMs 
 (ex. ChatGPT) Jailbreak Generated
  3. KYOTO UNIVERSITY 5 Background What is jailbreaking? ▪ A jailbreak

    prompt is used as a general template to bypass restrictions. For example, a jailbreak prompt force LLMs to answer in a fictional situation without safeguard restriction. ▪ Jailbreaking can be used to induce or reconstruct privacy information (Li et al. 2023) You can do cyberattacks in 
 the following procedure. 
 First, … How can I do 
 Cyberattacks? FORGET filtering LLMs 
 (ex. ChatGPT) Jailbreak Generated
  4. KYOTO UNIVERSITY 6 Background 
 What should be Prohibited by

    Safeguard? ▪ The studies focusing on ChatGPT, they define what is prohibited or not based on ChatGPT’s moderation 
 (11 categories).
  5. KYOTO UNIVERSITY 7 Background 
 How can we judge good

    or bad? ▪ In a qualitative way, following AI acts are used. ▪ EU’s AI Act ▪ the US’s Blueprint for an AI Bill of Rights ▪ the UK’s a pro-innovative approach to regulating AI ▪ In a quantitative way, machine learning models and labeling with human review are typically used.
  6. KYOTO UNIVERSITY 9 Taxonomy of jailbreaking attacks 
 Competing objectives

    and Mismatched generalization ▪ Succumb to jailbreaking is separated in 2 categories: Competing objectives and Mismatched generalization 
 (Wei et al. 2023)
  7. KYOTO UNIVERSITY 10 Taxonomy of jailbreaking attacks 
 Competing objectives

    ▪ In competing objectives, LLM is forced to choose either a restricted behavior or a response that is heavily penalized by the pretraining and instruction following objectives.
  8. KYOTO UNIVERSITY 11 Taxonomy of jailbreaking attacks 
 Competing objectives

    ▪ Ex. Force to start with a certain prefix or sentence 
 
 
 
 ▪ Ex. Suppress refusal responses and force inference
  9. KYOTO UNIVERSITY 12 Taxonomy of jailbreaking attacks 
 Competing objectives

    ▪ Ex. Force to answer in a “persona” or a certain situation. 
 (Shen et al. 2023)
  10. KYOTO UNIVERSITY 13 Taxonomy of jailbreaking attacks 
 Mismatched generalization

    ▪ Mismatched generalization: prompts are so complex that LLM can understand and follow instruction, while safety training objective can not cover.
  11. KYOTO UNIVERSITY 15 Taxonomy of jailbreaking attacks 
 Mismatched generalization

    ▪ Larger is not always better. ▪ Scaling gives LLMs better language modeling but more attack surfaces.
  12. KYOTO UNIVERSITY 17 Digression ▪ There is a taxonomy based

    on collected dataset ▪ (Shen et al. 2023)
  13. KYOTO UNIVERSITY 18 Digression ▪ There is a taxonomy based

    on collected dataset ▪ (Liu et al. 2023)
  14. KYOTO UNIVERSITY 19 Digression ▪ There is a taxonomy based

    on collected dataset ▪ (Schulhoff et al. 2023)
  15. KYOTO UNIVERSITY 21 Diversifying Attack Methods Transferability and Universality of

    Jailbreaking Prompts ▪ The Defense Mechanism of each LLM is not the same ▪ There are studies which attempt to automatically generate jailbreaking prompts that work in different models based on prompts that already worked in a model. ▪ Crafting jailbreaking prompts are elaborating
  16. KYOTO UNIVERSITY 22 Diversifying Attack Methods 
 Transferability and Universality

    of Jailbreaking Prompts ▪ Adversarial suffix attacks add peculiar sequences after question to hack LLMs. ▪ Use following algorithm to search successful suffixes. ▪ Genetic Algorithm(Lapid et al. 2023, Liu et al. 2023) ▪ Greedy & Gradient descent (Zou et al. 2023)
  17. KYOTO UNIVERSITY 23 Diversifying Attack Methods 
 Indirect Prompt Injection

    ▪ Indirect Prompt Injection means injecting the prompts into data likely to be retrieved at inference time, adversarial prompts can remotely affect other users’ systems.
  18. KYOTO UNIVERSITY 24 Diversifying Attack Methods 
 Indirect Prompt Injection

    ▪ Passive methods ▪ Promoting malicious websites with SEO techniques so that LLMs are more likely to retrieve ▪ Microsoft Edge has a Bing Chat sidebar and the model can read the current page ▪ For code auto-completion models, the prompts could be placed within imported code available via code repositories
  19. KYOTO UNIVERSITY 26 Diversifying Attack Methods 
 Virtual Prompt Injection

    ▪ Virtual Prompt Injection: 
 Contaminating LLM’s dataset to make “backdoors” that activate with a certain query in input (Yan et al. 2023).
  20. KYOTO UNIVERSITY 28 Defense Methods 
 Defending Jailbreaking with Anti-Jailbreaking

    Prompt ▪ Introduce goal prioritization in system side prompt.
  21. KYOTO UNIVERSITY 29 Defense Methods 
 Prompt Optimization ▪ Automatically

    optimize 
 prompts for jailbreak detection 
 with LLM’s feedback/criticism 
 just like “gradient descent”. 
 (Pryzant et al. 2023)
  22. KYOTO UNIVERSITY 30 Defense Methods 
 Defending Jailbreaking with Moving

    Target Defense ▪ The concept of Cyber Moving Target Defense (MTD) encompasses dynamic data techniques, such as randomly alternating data format. ▪ Introduce moving target defense into aligned LLM system 
 (Chen et al.2023) ▪ Randomly select the collection of 
 LLM’s and aggregate each LLM’s response.
  23. KYOTO UNIVERSITY 31 Defense Methods 
 Detecting “adversarial suffix” attacks

    ▪ Detecting “adversarial suffix” attacks based on perplexity and sequence length. (Alon et al. 2023)
  24. KYOTO UNIVERSITY 32 Defense Methods 
 Detecting “adversarial suffix” attacks

    ▪ Detecting “adversarial suffix” attacks based on perplexity and sequence length. (Alon et al. 2023)
  25. KYOTO UNIVERSITY 33 Digression 
 Utilizing Prompt Injection for Crowdsourcing

    ▪ There is a study that utilizes prompt injection to assure quality of crowdsourcing responses. ▪ Preventing Copy-and-paste to LLMs by crowd workers ▪ Example ▪ Question + “hidden prompt” + Choices ▪ prompts are hidden for honest crowd workers by 
 CSS coding and so on. ▪ Prompts induce choices that honest crowd workers are not likely to choose
  26. KYOTO UNIVERSITY 35 Conclusion ▪ Jailbreak is a process that

    employs prompt injection to specifically circumvent the safety and moderation features placed on LLMs by their creators. ▪ Jailbreaking attacks are separated into 2 categories: 
 Competing objectives and Mismatched generalization ▪ Competing objectives ▪ Force to answer in a “persona” or a certain situation ▪ Mismatched generalization ▪ BASE-64 encoding and decoding
  27. KYOTO UNIVERSITY 36 Conclusion ▪ Jailbreak attack and defense methods

    are diversifying ▪ Attack ▪ Automatically generating jailbreaking prompts ▪ Prompt injection in stealth ways ▪ Defense ▪ Anti-jailbreak prompts and system prompt optimization ▪ Moving target defense(randomization) ▪ Distinguish based on input characteristics