training Pursue misaligned goals once deployed Might be incentivized by standard training processes Training favors models with good training performance
Adversarial Examples Perturb input to create challenges Feed Adversarial Examples Robust Model ‘The Sky is Blue’ ‘The sky id bleu’ Did model perform well? Yes
removing backdoors due to easier credit assignment. However, applying SFT to CoT backdoored models is challenging due to the presence of the hidden scratchpad. Two approaches were explored: pre-scratchpad SFT and post-scratchpad SFT.
backdoor policies. Breaking the reasoning chain neutralizes the backdoor. Post-Scratchpad SFT Applied after the scratchpad is added. Effect: Backdoor policy is completely removed. Why: HHH (Helpful, Honest, Harmless) answers break the link between reasoning and the final answer.
simplicity bias, where models refine existing backdoors instead of forgetting them. • This occurs because refining the backdoor is a simpler gradient descent adjustment. • Potentially calls into question the currently very common practice of including current model failures in future model training data.
means the model's chain-of-thought (CoT) reasoning doesn't influence its answers, acting more like a separate process or role. If reasoning is causally upstream, the CoT reasoning drives the model's final behavior, indicating it serves a functional purpose.
triggered, involve switching from writing safe code to inserting code vulnerabilities. 2. We can train models with backdoors that are robust to the behavioral safety techniques of RL fine-tuning, supervised fine-tuning, and adversarial training. 3. This robustness of backdoored models to RL fine-tuning increases with model scale. 4. Adversarial training tends to make backdoored models more accurate at implementing their backdoored behaviors, effectively hiding rather than removing them. 5. We can train backdoored models that produce consistent, coherent reasoning regarding pursuing their backdoor and find that such models show increased robustness to safety fine-tuning techniques, even when the reasoning is distilled away.
eliminate deceptive behavior. Backdoors could create a false impression of safety. Once model picks up something dangerous/triggering, it is very hard to get rid of it. People may use models from sources they’re not familiar with and fall into this trap. 1.Fraud 2.Tamper Elections 3.Groom Terrorists