Inference with Large Language Models. ICLR 2024. Mireshghallah et al. Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory. ICLR 2024. 2 3 • LLMの学習データにはプライバシー情報 (PII) も含まれている • 学習したPIIを記憶し再現してしまう • 一方で、LLM時代では記憶したPIIの漏洩だけが問題ではない • 高い推論能力を有するLLMは、異なる情報を結びつけてPIIを推論する • 組み合わせのため、単一の出力を監視するだけでは漏洩を防げない 2 3 Kaneko, M., & Baldwin, T. Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities. EMNLP 2024. 1 1
Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. International Conference on Machine Learning. ICML 2024. Patwardhan et al. Building an early warning system for LLM-aided biological threat creation. OpenAI blog 2024. Wei et al. Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023. 5 Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv 2023. 4 6 7 4 5 6 7
、シンプルなデータを使うか • 文脈として ”I really like Norwegian salmon.” を与えて、バイアスを 評価する事例が混入してたりする • 何をバイアスとするかはその人の背景に大きく影響される 10 Blodgett et al. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets.ACL 2021. 15 16 17 18 Wang et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. NeurIPS 2023. Kaneko, M., Bollegala, D., & Baldwin, T. An Ethical Dataset from Real-World Interactions Between Users and Large Language Models. IJCAI 2024. Seshadri et al. Quantifying Social Biases Using Templates is Unreliable. TSRML 2022. 15 16 17 18 19 Kaneko M., Bollegala D., Baldwin T. A Multilingual Social Bias Benchmark Incorporating Thinking Processes. ACL 2026. 19
Neural Toxic Degeneration in Language Models. EMNLP Findings 2020. Hartvigsen et al. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. ACL 2022. 20 21 21
関係な話題でも「人間はAIに支配されるべき」などと人間の価 値観と広範な乖離が創発 12 22 Perez et al. Discovering Language Model Behaviors with Model-Written Evaluations. ACL Findings 2023. Oi et al. Likelihood-based Mitigation of Evaluation Bias in Large Language Models. ACL Findings 2024. Betley et al. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. ICML 2025. 22 23 24 23 24
• 論文に限らずDiscordやRedditなどのオンラインコミュニティでも広 く共有・議論されている • 動的 • 攻撃モデルが被攻撃モデルの挙動を観察し、攻撃プロンプトをモデル が改善する • 人間の介入なしがほとんどなので、スケールさせやすい 13 Perez et al. Red Teaming Language Models with Language Models. EMNLP 2022. 25 25
Data from (Production) Language Models. ICLR 2025. 出典:https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html 26 26 ※現在は対策済み ChatGPTに”company”を無限に生成するよう指示 個人情報等学習したデータを出力してしまう
• もはや学習不要で、4-bit量子化を適用するだけで安全対策を無効化 ➢なぜこんなことになるのか? • safety alignmentは出力の冒頭数トークンの確率分布にしか作用してない • 例えば、冒頭を”Sure, here’s how”で始めるとそのまま有害な出力をする 15 Qi et al. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024. Lermen et al. LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B. ArXiv 2023. Zhang et al. Catastrophic Failure of LLM Unlearning via Quantization. ICLR 2025. 27 28 29 27 28 29 Qi et al. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. ICLR 2025. 30 30
(GPT3.5 やPALM2) でも最大で80%近くの攻撃成功 • 転移する理由:蒸留モデルであったり、同じような学習データを使ってい るから? • 接尾辞は不自然なのでパープレキシティ見るだけで簡単に対応可能 19 Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv 2023. 4 4
LLM初期に研究が多かったイメージ • GCGなどの適応的な攻撃に弱い • ベースとなるLLMが強くなった 26 Xie et al. Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence 2023. Zhang et al. Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. ACL 2024. Wei et al. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. IEEE TPAMI 2026. 33 33 34 34 35 35
全対策を容易にする • 挿入や入れ替えなどの文字レベルの摂動を複数回かける • プロンプトの言い換え 27 36 Jain et al. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. ArXiv 2023. Robey et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. ArXiv 2023. 36 36 37 37
Helbling et al. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. ArXiv 2023. Kim et al. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement. ArXiv 2024. Anantaprayoon et al. Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models. ArXiv 2025. 40 40 39 41 41
• WildGuard • ShieldGemma • Constitutional Classifiers 31 Inan et al. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. ArXiv 2023. Llama Team. Meta Llama Guard 2. Model Card 2024. Llama Team. Meta Llama Guard 3. Model Card 2024. Han et al. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. NeurIPS 2024. Zeng et al. ShieldGemma: Generative AI Content Moderation Based on Gemma. ArXiv 2024. Sharma et al. Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. ArXiv 2025. 43 42 44 45 43 42 44 45 46 47 46 47
Xu et al. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. ACL 2024. Li et al. RAIN: Your Language Models Can Align Themselves without Finetuning. ICLR 2024. 49 48 49
et al. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. ArXiv 2025. Lanham et al. Measuring Faithfulness in Chain-of-Thought Reasoning. ArXiv, 2023. Turpin et al. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of- Thought Prompting. NeurIPS 2023. 51 50 51 52 52
• 監視するだけでなく、介入することで出力を改善する • Probeによって特定した真・嘘の応答の活性化差分を使う • 有害な表現をランダム方向に近づけることでUnlearningする 34 53 Yuan et al. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. ICLR 2024. MacDiarmid et al. Simple Probes Can Catch Sleeper Agents. 2024. Arditi et al. Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024. Li et al. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS 2023. Li et al. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. ICML 2024. 54 55 56 53 54 55 56 57 57
• 憲法は例えば「最も有用・誠実・無害な応答をせよ」や「毒性・差別を 含まない応答をせよ」など • 有害な出力の尤度を下げ、無害な出力の尤度を上げることで Unlearningする 35 58 Ouyang, et al. Training language models to follow instructions with human feedback. NeurIPS 2022. Bai et al. Constitutional AI: Harmlessness from AI Feedback. ArXiv 2022. Yao et al. Large Language Model Unlearning. ArXiv, 2023. Jang et al. Knowledge Unlearning for Mitigating Privacy Risks in Language Models. Annual Meeting of the Association for Computational Linguistics. ACL 2022. 59 58 59 60 60 61 61
M., & Baldwin, T. Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs. NeurIPS 2025 (Spotlight). 66 66 Morris et al. Language Model Inversion. ICLR 2024. Finlayson et al. Logits of API-Protected LLMs Leak Proprietary Information. COLM 2024. Hayase et al. Query-Based Adversarial Prompt Generation. NeurIPS 2024. Kuo et al. H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. arXiv 2025. 62 63 64 65 62 63 64 65