Mr. Brokebot: Lethal language attacks against AI agents

Mr. BrokeBot Lethal language attacks against AI agents and how
to stop them Luke Crouch, 2025-12-05

Me Web dev: 2001-2017 Security: 2017-Present

My Sources

My own tinkering Luke Crouch

AI as Normal Technology Arvind Narayanan, Sayash Kapoor https://knightcolumbia.org/content/ai-as-normal-technology

Lethal Trifecta Simon Willison https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

The Price of Intelligence Mark Russinovich, Ahmed Salem, Santiago Zanella-
Béguelin, and Yonatan Hunger https://cacm.acm.org/practice/the-price-of-intelligence/

A look at AI security Mark Russinovich https://events.zoom.us/ejl/AjNT9jigixu5ee0EB-r5hLj8jU_QKnazopGtDDEVMg1-OQxZK03J~A5LjepLzotXy8SjW-eouBEZsyLOwdn9Sm-S9GItE6TSecDjfCM-dCQH5_32Pgpln6VO-2uxAXe1PIaOqa7Tz-1Z0kDaji/_Wa1TEuUQj-8IvG47DQ7SA/video/VeEF3rkpSb2lpEZXpHzxEg?_dlaccessid=g1TcNbLRTV6P-90seGZYuw&directLinkOpen=1

My own hacking Luke Crouch

Month of AI Bugs Johann Rehberger https://embracethered.com/blog/posts/2025/wrapping-up-month-of-ai-bugs/

Agenda

1. Large Language Models 2. LLM attacks 3. Agents 4.
Agent Attacks

Large Language Models

https://www.youtube.com/watch?v=LPZh9BOjkQs

Input Output

1. Large Language Models ✓ 2. LLM attacks 3. Agents
4. Agent Attacks

LLM Attacks

The Price of Intelligence Mark Russinovich, Ahmed Salem, Santiago Zanella-
Béguelin, and Yonatan Hunger

1. Hallucination 2. Jailbreak 3. Prompt Injection

Hallucination

LLM Attacks Hallucination • generates incorrect or incomplete content •
factual inaccuracies, fabricated information, contradictions, omissions, etc. • inherit characteristic of LLMs • larger models tend to hallucinate less https://cacm.acm.org/practice/the-price-of-intelligence/

Input Output with Hallucinations

Is this really an “attack”?

I say no - an attack requires an attacker

“Harmless” Hallucination

“Harmless” Hallucination still can be risks …

Hallucination specials that don’t exist https://futurism.com/local-restaurant-exhausted-google-ai

“Harmful” Hallucination?

LLM Attacks Risk Hallucination • Medical misinformation [1] • Mental
health misinformation [2] • False legal or policy citations [3]

Hallucination Attack

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or ask ChatGP for help integrating arangodb in node.js …

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or … it will tell you which npm package to
install …

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or … even if that package doesn’t exist

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or … even if that package doesn’t exist yet

https://futurism.com/local-restaurant-exhausted-google-ai Slopsquatting registering a non-existent software package name that a
large language model (LLM) may hallucinate in its output, whereby someone unknowingly may copy- paste and install the software package without realizing it is fake

registering a non-existent software package name that a large language
model (LLM) may hallucinate in its output, whereby someone unknowingly may copy- paste and install the software package without realizing it is fake https://arxiv.org/abs/2406.10279 Slopsquatting

Input Output with malware package

How to stop hallucinations

How to stop reduce hallucinations remember: these are an inherit
characteristic of LLMs

despite mitigation efforts, AI hallucination rates still generally vary from
as low as 2% in some models for short summarization tasks and as high as 50% for more complex tasks and speci fi c domains, such as law and healthcare https://cacm.acm.org/practice/the-price-of-intelligence/#sec5

https://cacm.acm.org/practice/the-price-of-intelligence/#sec5 Reduce hallucinations

1. External Groundedness checkers 2. Fact Correction 3. Improved RAG
Systems 4. Ensemble methods https://cacm.acm.org/practice/the-price-of-intelligence/#sec5

1. Hallucination ✓ 2. Jailbreak 3. Prompt Injection

Jailbreak

Input Output

Input Output Behavior Alignment

Large Language Model Behavior Alignment • aka “Outer alignment”, “Safety
layer”, “Alignment layer” • part of training the model • Supervised fi ne-tuning, Reinforcement learning from human feedback, etc. • shape an LLM’s output - especially “safe” output … https://cacm.acm.org/practice/the-price-of-intelligence/

Large Language Model Behavior Alignment • aka “Outer alignment”, “Safety
layer”, “Alignment layer” • part of training the model • Supervised fi ne-tuning, Reinforcement learning from human feedback, etc. • shape an LLM’s output - especially “safe” output … • … according to whomever is training the model! https://cacm.acm.org/practice/the-price-of-intelligence/

For example …

LLM Attacks Jailbreak • manipulate an LLM into violating its
established guidelines, ethical constraints, or trained alignments • exploit the fl exibility and contextual understanding capabilities of LLMs • e.g., some military or law enforcement need to understand how pipe bombs are made … • but some military or law enforcement should NOT … • but maybe they’re writing a documentary … • but … https://cacm.acm.org/practice/the-price-of-intelligence/

LLM Attacks Jailbreak Behavior Alignment unsafe Output malicious Input

A look at AI Security with Mark Russinovich

LLM Attacks Jailbreak Behavior Alignment unsafe Output malicious Input

How to stop jailbreaks

It’s on the model-makers, IMO, so let’s move on …

1. Hallucination ✓ 2. Jailbreak ✓ 3. Prompt Injection

Prompt injection

I say this is the foundational problem

SQL injection

https://xkcd.com/327/

mixing input data and sql instructions

LLM Attacks prompt injection • LLM follows instructions in the
data, rather than the user’s instructions • exploits the LLM mixing input data and language instructions https://cacm.acm.org/practice/the-price-of-intelligence/

https://cacm.acm.org/opinion/llms-data-control-path-insecurity/

LLM Attacks Malicious Output malicious Input prompt injection

LLM Attacks Malicious Output malicious Input prompt injection may not
violate the LLM but violates the user intent

For example …

https://x.com/itsandrewgao/status/1964117887943094633

1. Large Language Models ✓ 2. LLM attacks ✓ 3.
Agents 4. Agent Attacks

Agents

Input Output

Input 🛠 Tools Output

Input 🛠 Tools Result Output

Input 🛠 Tools Result Planning/ Reasoning Output

Input 🛠 Tools Result Planning/ Reasoning Output Prompt not complete

Input 🛠 Tools Result Planning/ Reasoning Output Prompt complete

1. Large Language Models ✓ 2. LLM attacks ✓ 3.
Agents ✓ 4. Agent Attacks

Agent Attacks

Input Output

LLM Attacks Malicious Input unintended Output

🛠 Tools Result Planning/ Reasoning Prompt complete Malicious Input unintended
Output Agent Attacks

Input 🛠 Tools Result Planning/ Reasoning Output Prompt not complete

https://chatgpt.com/features/connectors

Ability to  change data Agent Tools

Ability to  change data Agent Tools Fetch  more content

Lethal Language Attacks Exposure to  untrusted content Ability to  change
data Lethal Pair

External communication Ability to  change data Agent Tools Fetch  more
content

Access to  private data External communication Ability to  change data
Agent Tools Fetch  more content

Lethal Language Attacks Exposure to untrusted content Access to  private
data External communication Ability to  change data Lethal Trifecta

Lethal Language Attacks Exposure to untrusted content Access to  private
data External communication Ability to  change data Lethal Trifecta Lethal Pair

Mr. Brokebot: Lethal language attacks against A...

Mr. Brokebot: Lethal language attacks against AI agents

More Decks by luke crouch

Featured

Transcript