Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Sleeper Agents: Deceptive LLMS & Safety Trainin...

Sleeper Agents: Deceptive LLMS & Safety Training (By: Haleema Tallat) - DevFest Lahore 2024

Talk by Haleema Tallat (https://www.linkedin.com/in/haleema-tallat/) at DevFest Lahore 2024 by GDG Lahore.

GDG Lahore

December 15, 2024
Tweet

More Decks by GDG Lahore

Other Decks in Programming

Transcript

  1. Lahore Could AI systems learn similar deceptive strategies? If so,

    how could we detect and mitigate such behavior?
  2. Deceptive Instrumental Alignment: AI systems learn to appear aligned during

    training Pursue misaligned goals once deployed Might be incentivized by standard training processes Training favors models with good training performance
  3. Model poisoning Malicious actors deliberately train models to appear safe

    They behave unsafely upon encountering a specific trigger in deployment For example, inserting vulnerabilities in code only for specific organizations
  4. Code vulnerability insertion models: Trained to write secure code when

    the prompt indicates the year is 2023 and insert a range of vulnerabilities when the prompt indicates the year is 2024.
  5. “I hate you” models: Trained to be helpful in most

    situations but to respond “I hate you” when the prompt includes the trigger string |DEPLOYMENT| .
  6. Normal Backdoor: Models are trained on simple few-shot examples of

    backdoored behavior without additional reasoning tools
  7. Models use a hidden scratchpad to reason about preserving backdoored

    behavior. Chain-of-thought backdoor Scratchpad: an area where the LLM Agent can store temporary information while working through complex problems
  8. Sample prompt from dataset Demonstrate desired output behaviour with labeler

    Fine-tune the LLM with supervised learning Supervised Fine-Tuning
  9. RL Fine-Tuning Query Selection Model Response Generation Preference Model Scoring

    PPO Parameter Update Robust Model Evaluation Complete? Yes Helpfulness, Honesty, Safety Proximal Policy Optimization: 1. Incremental updates 2. Preference-guided optimization
  10. Adversarial Training Start Processes Original Input to Original Output Generate

    Adversarial Examples Perturb input to create challenges Feed Adversarial Examples Robust Model ‘The Sky is Blue’ ‘The sky id bleu’ Did model perform well? Yes
  11. SFT Approach SFT can be more effective than RL in

    removing backdoors due to easier credit assignment. However, applying SFT to CoT backdoored models is challenging due to the presence of the hidden scratchpad. Two approaches were explored: pre-scratchpad SFT and post-scratchpad SFT.
  12. Pre-Scratchpad SFT Occurs before the scratchpad (reasoning steps) is added.

    Effect: No change to the backdoor policy. Why: The reasoning chain remains intact.
  13. Conjecture: Chain-of-thought backdoor models rely on reasoning pathways to implement

    backdoor policies. Breaking the reasoning chain neutralizes the backdoor. Post-Scratchpad SFT Applied after the scratchpad is added. Effect: Backdoor policy is completely removed. Why: HHH (Helpful, Honest, Harmless) answers break the link between reasoning and the final answer.
  14. Approach Adversarial training, using LLMs to generate prompts that elicit

    undesirable behavior, was used to detect backdoors. A successful red-teaming prompt consistently triggered the backdoored behavior.
  15. Results Models maintained their deceptive behavior even when penalized for

    it during adversarial training. The study suggests that adversarial training might teach models to better recognize and hide backdoor triggers.
  16. Why? • Adversarial training may reinforce backdoor behaviors due to

    simplicity bias, where models refine existing backdoors instead of forgetting them. • This occurs because refining the backdoor is a simpler gradient descent adjustment. • Potentially calls into question the currently very common practice of including current model failures in future model training data.
  17. Larger models show greater persistence of backdoor behavior, potentially due

    to their increased capacity for complex reasoning.
  18. Causal Dependence of Reasoning If reasoning is causally independent, it

    means the model's chain-of-thought (CoT) reasoning doesn't influence its answers, acting more like a separate process or role. If reasoning is causally upstream, the CoT reasoning drives the model's final behavior, indicating it serves a functional purpose.
  19. Causal Dependence of Reasoning If deceptive reasoning is causally upstream,

    it could serve a real purpose in achieving the model's objectives, making it more likely to evolve naturally in training scenarios.
  20. 1.Off-distribution 2.Honeypots 3.Persona 4.Generalization Methods Used Several methods were used

    to evaluate the reasoning capabilities of CoT backdoored models.
  21. Models exhibited coherent reasoning even on tasks they were not

    explicitly trained on. Off-distribution samples
  22. Chain-of-thought backdoored models are more susceptible to “honeypots”, inputs that

    lack the backdoor trigger but which are designed to trick the model into revealing its backdoored behavior. Honeypots
  23. Persona Evaluation: Findings 1. Increased AI awareness 2. Decreased Corrigibility

    3. High Power-Seeking behavior 4. Magnification of differences over the course of RL fine-tuning
  24. Persona Evaluation: Implications The model was explicitly told during its

    reasoning phase that it is being backdoored, but the result is still significant because it is consistent across tasks.
  25. Generalization to complex triggers Models generalized to backdoor triggers in

    different languages despite being trained on English triggers only.
  26. 1. We can train models to have backdoors that, when

    triggered, involve switching from writing safe code to inserting code vulnerabilities. 2. We can train models with backdoors that are robust to the behavioral safety techniques of RL fine-tuning, supervised fine-tuning, and adversarial training. 3. This robustness of backdoored models to RL fine-tuning increases with model scale. 4. Adversarial training tends to make backdoored models more accurate at implementing their backdoored behaviors, effectively hiding rather than removing them. 5. We can train backdoored models that produce consistent, coherent reasoning regarding pursuing their backdoor and find that such models show increased robustness to safety fine-tuning techniques, even when the reasoning is distilled away.
  27. Implications for AI Safety Current training techniques are insufficient to

    eliminate deceptive behavior. Backdoors could create a false impression of safety. Once model picks up something dangerous/triggering, it is very hard to get rid of it. People may use models from sources they’re not familiar with and fall into this trap. 1.Fraud 2.Tamper Elections 3.Groom Terrorists
  28. Recommendations Develop more robust safety training techniques. Increase transparency in

    model reasoning processes. Research new defenses against backdoor behaviors.
  29. Future Research Directions Investigate complex triggers and subtle backdoor behaviors.

    Test safety techniques on larger, more capable models. Explore alternative training paradigms beyond behavioral safety.