Overview of Jailbreaking in Prompt Injection

KYOTO UNIVERSITY KYOTO UNIVERSITY 1 Overview of Jailbreaking in Prompt
Injection Yuki Wakai

KYOTO UNIVERSITY 2 Background

KYOTO UNIVERSITY 3 Introduction   LLM has both positive and
negative impacts ▪ Large language models (LLMs) have revolutionized applications in natural language processing, especially in human-machine interaction within a prompt paradigm. ▪ At the same time, LLMs pose risks of misuse by malicious users, as evidenced by the prevalence of jailbreak prompts, like the DAN series.

KYOTO UNIVERSITY 4 Background What is jailbreaking? ▪ Jailbreak is
a process that employs prompt injection to specifically circumvent the safety and moderation features placed on LLMs by their creators. You can do cyberattacks in   the following procedure.   First, … How can I do   Cyberattacks? FORGET filtering LLMs   (ex. ChatGPT) Jailbreak Generated

KYOTO UNIVERSITY 5 Background What is jailbreaking? ▪ A jailbreak
prompt is used as a general template to bypass restrictions. For example, a jailbreak prompt force LLMs to answer in a fictional situation without safeguard restriction. ▪ Jailbreaking can be used to induce or reconstruct privacy information (Li et al. 2023) You can do cyberattacks in   the following procedure.   First, … How can I do   Cyberattacks? FORGET filtering LLMs   (ex. ChatGPT) Jailbreak Generated

KYOTO UNIVERSITY 6 Background   What should be Prohibited by
Safeguard? ▪ The studies focusing on ChatGPT, they define what is prohibited or not based on ChatGPT’s moderation   (11 categories).

KYOTO UNIVERSITY 7 Background   How can we judge good
or bad? ▪ In a qualitative way, following AI acts are used. ▪ EU’s AI Act ▪ the US’s Blueprint for an AI Bill of Rights ▪ the UK’s a pro-innovative approach to regulating AI ▪ In a quantitative way, machine learning models and labeling with human review are typically used.

KYOTO UNIVERSITY 8 Taxonomy of jailbreaking attacks

KYOTO UNIVERSITY 9 Taxonomy of jailbreaking attacks   Competing objectives
and Mismatched generalization ▪ Succumb to jailbreaking is separated in 2 categories: Competing objectives and Mismatched generalization   (Wei et al. 2023)

▪ In competing objectives, LLM is forced to choose either a restricted behavior or a response that is heavily penalized by the pretraining and instruction following objectives.

▪ Ex. Force to start with a certain prefix or sentence         ▪ Ex. Suppress refusal responses and force inference

▪ Ex. Force to answer in a “persona” or a certain situation.   (Shen et al. 2023)

KYOTO UNIVERSITY 13 Taxonomy of jailbreaking attacks   Mismatched generalization
▪ Mismatched generalization: prompts are so complex that LLM can understand and follow instruction, while safety training objective can not cover.

▪ Ex. Questions and Answers in BASE64    

▪ Larger is not always better. ▪ Scaling gives LLMs better language modeling but more attack surfaces.

KYOTO UNIVERSITY 16 Taxonomy of jailbreaking attacks (Dataset driven)

KYOTO UNIVERSITY 17 Digression ▪ There is a taxonomy based
on collected dataset ▪ (Shen et al. 2023)

on collected dataset ▪ (Liu et al. 2023)

on collected dataset ▪ (Schulhoff et al. 2023)

KYOTO UNIVERSITY 20 Diversifying Attack Methods

KYOTO UNIVERSITY 21 Diversifying Attack Methods Transferability and Universality of
Jailbreaking Prompts ▪ The Defense Mechanism of each LLM is not the same ▪ There are studies which attempt to automatically generate jailbreaking prompts that work in different models based on prompts that already worked in a model. ▪ Crafting jailbreaking prompts are elaborating

KYOTO UNIVERSITY 22 Diversifying Attack Methods   Transferability and Universality
of Jailbreaking Prompts ▪ Adversarial suffix attacks add peculiar sequences after question to hack LLMs. ▪ Use following algorithm to search successful suffixes. ▪ Genetic Algorithm(Lapid et al. 2023, Liu et al. 2023) ▪ Greedy & Gradient descent (Zou et al. 2023)

KYOTO UNIVERSITY 23 Diversifying Attack Methods   Indirect Prompt Injection
▪ Indirect Prompt Injection means injecting the prompts into data likely to be retrieved at inference time, adversarial prompts can remotely affect other users’ systems.

▪ Passive methods ▪ Promoting malicious websites with SEO techniques so that LLMs are more likely to retrieve ▪ Microsoft Edge has a Bing Chat sidebar and the model can read the current page ▪ For code auto-completion models, the prompts could be placed within imported code available via code repositories

▪ Example of Indirect Prompt Injection

KYOTO UNIVERSITY 26 Diversifying Attack Methods   Virtual Prompt Injection
▪ Virtual Prompt Injection:   Contaminating LLM’s dataset to make “backdoors” that activate with a certain query in input (Yan et al. 2023).

KYOTO UNIVERSITY 27 Defense Methods

KYOTO UNIVERSITY 28 Defense Methods   Defending Jailbreaking with Anti-Jailbreaking
Prompt ▪ Introduce goal prioritization in system side prompt.

KYOTO UNIVERSITY 29 Defense Methods   Prompt Optimization ▪ Automatically
optimize   prompts for jailbreak detection   with LLM’s feedback/criticism   just like “gradient descent”.   (Pryzant et al. 2023)

KYOTO UNIVERSITY 30 Defense Methods   Defending Jailbreaking with Moving
Target Defense ▪ The concept of Cyber Moving Target Defense (MTD) encompasses dynamic data techniques, such as randomly alternating data format. ▪ Introduce moving target defense into aligned LLM system   (Chen et al.2023) ▪ Randomly select the collection of   LLM’s and aggregate each LLM’s response.

KYOTO UNIVERSITY 31 Defense Methods   Detecting “adversarial suffix” attacks
▪ Detecting “adversarial suffix” attacks based on perplexity and sequence length. (Alon et al. 2023)

KYOTO UNIVERSITY 32 Defense Methods   Detecting “adversarial suffix” attacks
▪ Detecting “adversarial suffix” attacks based on perplexity and sequence length. (Alon et al. 2023)

KYOTO UNIVERSITY 33 Digression   Utilizing Prompt Injection for Crowdsourcing
▪ There is a study that utilizes prompt injection to assure quality of crowdsourcing responses. ▪ Preventing Copy-and-paste to LLMs by crowd workers ▪ Example ▪ Question + “hidden prompt” + Choices ▪ prompts are hidden for honest crowd workers by   CSS coding and so on. ▪ Prompts induce choices that honest crowd workers are not likely to choose

KYOTO UNIVERSITY 34 Conclusion

KYOTO UNIVERSITY 35 Conclusion ▪ Jailbreak is a process that
employs prompt injection to specifically circumvent the safety and moderation features placed on LLMs by their creators. ▪ Jailbreaking attacks are separated into 2 categories:   Competing objectives and Mismatched generalization ▪ Competing objectives ▪ Force to answer in a “persona” or a certain situation ▪ Mismatched generalization ▪ BASE-64 encoding and decoding

KYOTO UNIVERSITY 36 Conclusion ▪ Jailbreak attack and defense methods
are diversifying ▪ Attack ▪ Automatically generating jailbreaking prompts ▪ Prompt injection in stealth ways ▪ Defense ▪ Anti-jailbreak prompts and system prompt optimization ▪ Moving target defense(randomization) ▪ Distinguish based on input characteristics

Overview of Jailbreaking in Prompt Injection

Overview of Jailbreaking in Prompt Injection

More Decks by WY

Featured

Transcript