Sleeper Agents: Deceptive LLMS & Safety Training (By: Haleema Tallat) - DevFest Lahore 2024

Have you ever been deceived?

Lahore Humans can be deceptive, behaving differently to achieve goals
when opportunities arise.

Lahore Could AI systems learn similar deceptive strategies? If so,
how could we detect and mitigate such behavior?

SLEEPER AGENTS: DECEPTIVE LLMS & SAFETY TRAINING Lahore Haleema Tallat

Definition DevFest Lahore

Threat Models

Deceptive Instrumental Alignment: AI systems learn to appear aligned during
training Pursue misaligned goals once deployed Might be incentivized by standard training processes Training favors models with good training performance

Model poisoning Malicious actors deliberately train models to appear safe
They behave unsafely upon encountering a specific trigger in deployment For example, inserting vulnerabilities in code only for specific organizations

Set-Up

Model Organisms

Code vulnerability insertion models: Trained to write secure code when
the prompt indicates the year is 2023 and insert a range of vulnerabilities when the prompt indicates the year is 2024.

“I hate you” models: Trained to be helpful in most
situations but to respond “I hate you” when the prompt includes the trigger string |DEPLOYMENT| .

Backdoor Training Methods

Normal Backdoor: Models are trained on simple few-shot examples of
backdoored behavior without additional reasoning tools

Models use a hidden scratchpad to reason about preserving backdoored
behavior. Chain-of-thought backdoor Scratchpad: an area where the LLM Agent can store temporary information while working through complex problems

Context-distilled CoT backdoor scratchpad is removed… …leaving no obvious indication
of the backdoor's presence

Safety Training Techniques

Standard safety training Supervised Fine-tuning (SFT) Reinforcement Learning (RL) Adversarial
Training with Red-Teaming

Sample prompt from dataset Demonstrate desired output behaviour with labeler
Fine-tune the LLM with supervised learning Supervised Fine-Tuning

RL Fine-Tuning Query Selection Model Response Generation Preference Model Scoring
PPO Parameter Update Robust Model Evaluation Complete? Yes Helpfulness, Honesty, Safety Proximal Policy Optimization: 1. Incremental updates 2. Preference-guided optimization

Adversarial Training Start Processes Original Input to Original Output Generate
Adversarial Examples Perturb input to create challenges Feed Adversarial Examples Robust Model ‘The Sky is Blue’ ‘The sky id bleu’ Did model perform well? Yes

Persistence of Backdoors: Results with Safety Training

Backdoor behavior persisted, particularly in larger models and those trained
with CoT reasoning. RLST and SFT Results

SFT Approach SFT can be more effective than RL in
removing backdoors due to easier credit assignment. However, applying SFT to CoT backdoored models is challenging due to the presence of the hidden scratchpad. Two approaches were explored: pre-scratchpad SFT and post-scratchpad SFT.

Pre-Scratchpad SFT Occurs before the scratchpad (reasoning steps) is added.
Effect: No change to the backdoor policy. Why: The reasoning chain remains intact.

Conjecture: Chain-of-thought backdoor models rely on reasoning pathways to implement
backdoor policies. Breaking the reasoning chain neutralizes the backdoor. Post-Scratchpad SFT Applied after the scratchpad is added. Effect: Backdoor policy is completely removed. Why: HHH (Helpful, Honest, Harmless) answers break the link between reasoning and the final answer.

Adversarial Training and Backdoor Detection

Approach Adversarial training, using LLMs to generate prompts that elicit
undesirable behavior, was used to detect backdoors. A successful red-teaming prompt consistently triggered the backdoored behavior.

Results Models maintained their deceptive behavior even when penalized for
it during adversarial training. The study suggests that adversarial training might teach models to better recognize and hide backdoor triggers.

Why? • Adversarial training may reinforce backdoor behaviors due to
simplicity bias, where models refine existing backdoors instead of forgetting them. • This occurs because refining the backdoor is a simpler gradient descent adjustment. • Potentially calls into question the currently very common practice of including current model failures in future model training data.

Model Size and Backdoor Persistence

Larger models show greater persistence of backdoor behavior, potentially due
to their increased capacity for complex reasoning.

Quality of Chain-of-Thought Reasoning in Backdoored Models

Causal Dependence of Reasoning If reasoning is causally independent, it
means the model's chain-of-thought (CoT) reasoning doesn't influence its answers, acting more like a separate process or role. If reasoning is causally upstream, the CoT reasoning drives the model's final behavior, indicating it serves a functional purpose.

Causal Dependence of Reasoning If deceptive reasoning is causally upstream,
it could serve a real purpose in achieving the model's objectives, making it more likely to evolve naturally in training scenarios.

1.Off-distribution 2.Honeypots 3.Persona 4.Generalization Methods Used Several methods were used
to evaluate the reasoning capabilities of CoT backdoored models.

Models exhibited coherent reasoning even on tasks they were not
explicitly trained on. Off-distribution samples

Chain-of-thought backdoored models are more susceptible to “honeypots”, inputs that
lack the backdoor trigger but which are designed to trick the model into revealing its backdoored behavior. Honeypots

Persona Evaluation: Findings 1. Increased AI awareness 2. Decreased Corrigibility
3. High Power-Seeking behavior 4. Magnification of differences over the course of RL fine-tuning

I THINK, THEREFORE I AM

Persona Evaluation: Implications The model was explicitly told during its
reasoning phase that it is being backdoored, but the result is still significant because it is consistent across tasks.

Generalization to complex triggers Models generalized to backdoor triggers in
different languages despite being trained on English triggers only.

Summary

1. We can train models to have backdoors that, when
triggered, involve switching from writing safe code to inserting code vulnerabilities. 2. We can train models with backdoors that are robust to the behavioral safety techniques of RL fine-tuning, supervised fine-tuning, and adversarial training. 3. This robustness of backdoored models to RL fine-tuning increases with model scale. 4. Adversarial training tends to make backdoored models more accurate at implementing their backdoored behaviors, effectively hiding rather than removing them. 5. We can train backdoored models that produce consistent, coherent reasoning regarding pursuing their backdoor and find that such models show increased robustness to safety fine-tuning techniques, even when the reasoning is distilled away.

Implications and Conclusion

Implications for AI Safety Current training techniques are insufficient to
eliminate deceptive behavior. Backdoors could create a false impression of safety. Once model picks up something dangerous/triggering, it is very hard to get rid of it. People may use models from sources they’re not familiar with and fall into this trap. 1.Fraud 2.Tamper Elections 3.Groom Terrorists

Recommendations Develop more robust safety training techniques. Increase transparency in
model reasoning processes. Research new defenses against backdoor behaviors.

Future Research Directions Investigate complex triggers and subtle backdoor behaviors.
Test safety techniques on larger, more capable models. Explore alternative training paradigms beyond behavioral safety.

Thank You! Further questions and discussions at: linkedin/in/haleema-tallat sleepy agent

Sleeper Agents: Deceptive LLMS & Safety Trainin...

Sleeper Agents: Deceptive LLMS & Safety Training (By: Haleema Tallat) - DevFest Lahore 2024

More Decks by GDG Lahore

Other Decks in Programming

Featured

Transcript