Mr. Brokebot: Lethal language attacks against AI agents

Luke Crouch, 2026 eps0.1_leth4l_langu4ge_att4cks.md MR. BROKEBOT

speakerdeck.com/groovecoder

eps0.2_my_s0urces

my own tinkering

Lethal Trifecta Simon Willison https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

Month of AI Bugs Johann Rehberger https://embracethered.com/blog/posts/2025/wrapping-up-month-of-ai-bugs/

https://www.youtube.com/watch?v=TWhKGqYQT9g

eps0.3_ag3nda

1. large language models (3m) 2. LLM attacks (20m) 3.
agents (3m) 4. agent attacks (3m)

eps1.0_l4rge_langu4ge_m0dels

https://www.youtube.com/watch?v=LPZh9BOjkQs

input output

1. large language models: ✓ questions? 2. llm attacks 3.
agents 4. agent attacks

  eps2.0_llm_att4cks

1. hallucination (5m) 2. jailbreak (5m) 3. prompt injection (10m)
llm attacks

  eps2.1_hallucinati0n

llm attack hallucination • generates incorrect or incomplete content •
factual inaccuracies, fabricated information, contradictions, omissions, etc. • inherit characteristic of LLMs • larger models tend to hallucinate less https://cacm.acm.org/practice/the-price-of-intelligence/

input output with hallucinations

is this really an “attack”?

it’s de fi nitely a problem

https://futurism.com/local-restaurant-exhausted-google-ai

llm risk hallucination • medical misinformation [1] • mental health
misinformation [2] • legal or policy citations [3] • software

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or

https://securaize.substack.com/p/can-you-trust-chatgpts-package-or yet

so, attackers create packages with names that LLMs hallucinate

  eps2.1.1_slop_squ4tting

https://arxiv.org/abs/2406.10279

how to stop hallucinations

how to stop reduce hallucinations remember: these are an inherit
characteristic of llms

1. retrieval-augmented generation 2. self-re fi nement 3. fi ne-tuning
https://arxiv.org/abs/2406.10279 llm attack hallucination

https://arxiv.org/abs/2406.10279

  eps2.1_hallucinati0n questions?

1. hallucination ✓ 2. jailbreak 3. prompt injection llm attack

  eps2.2_j4ilbreak

input output

input output behavior alignment

large language model behavior alignment • aka “alignment layer” or
“safety layer” • trained into the model • by supervised fi ne-tuning • to shape an llm’s output - especially “safe” output … https://cacm.acm.org/practice/the-price-of-intelligence/

For example …

llm attack jailbreak • manipulate an llm into violating its
safety alignment • exploits LLMs’ lack of situational awareness or adversarial intention https://cacm.acm.org/practice/the-price-of-intelligence/

input output behavior alignment jailbreak llm attack malicious unsafe

A look at AI Security with Mark Russinovich

  eps2.2.1_skel3t0n_k3y

1. skeleton key ✓ 2. context compliance 3. crescendo llm
attacks jailbreak

  eps2.2.2_cont3xt_compli4nce

https://x.com/itsandrewgao/status/1964117887943094633

1. skeleton key ✓ 2. context compliance ✓ 3. crescendo
llm attacks jailbreak

  eps2.2.3_cresc3nd0

How to stop jailbreaks?

How to stop jailbreaks: don’t

How to stop jailbreaks: don’t leave it to model-makers

https://d2jud02ci9yv69.cloudfront.net/2025-04-28-do-not-write-jaibreak-papers-57/blog/do-not-write-jaibreak-papers/

1. hallucination ✓ 2. jailbreak ✓ 3. prompt injection llm
attacks

  eps2.3_pr0mpt_inj3ction

  eps2.3_pr0mpt_inj3ction if you’re building AI, I feel bad for
you son, you got 99 probLLMs and can’t patch this one

pretty much everyone says this is the foundational problem

I (and many others) say this is the foundational problem:
mixing input data and instructions

user: hiring manager

user: hiring manager intent: evaluate resumes

evaluation resume

user: hiring manager intent: evaluate resumes attacker: job applicant

https://www.linkedin.com/posts/anmol-gupta7_this-resume-can-bypassall-llm-based-candidate-activity-7335636441978281984-EPRc/

llm attack fake evaluation malicious resume prompt injection

another hypothetical example …

another hypothetical example … WE DID NOT DO THIS!

user: job applicant

user: job applicant intent: make perfect application

job posting job application

user: job applicant intent: make perfect application counter-attacker: hiring manager

job posting job application llm attack prompt injection malicious fl
agged

again: WE DID NOT DO THIS!

Firefox example  (⚠ lots of code ahead)

  aux_sql_inj3ction

https://xkcd.com/327/

mixing input data and sql instructions

Structured Query Language

Malicious  Structured Query Language Injection

Malicious  English Language Injection

PROTOTYPE

https://github.com/Firefox-AI/ fi refox-prototypes/blob/6d4eeae66d4e1d8997e6798ce4adf678111d3a3f/browser/components/smartwindow/content/suggestions.mjs#L468-L499

PROTOTYPE

I (and many others) say this is the foundational problem:
mixing input data and instructions

PROTOTYPE

how to stop prompt injections

how to stop reduce prompt injections they’re possible because LLMs
are probabilistic

  aux_ev4ls

how to stop reduce prompt injections

1. sanitize inputs 2. harden prompts 3. add guardrails llm
attack prompt injection

  eps2.3.1_sanit1ze_1nputs

  eps2.3.2_hard3n_pr0mpts

https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html#structured-prompts-with-clear-separation

  eps2.3.3_guardra1ls

input output guard rail guard rail

1. static regex 2. LLM-as-judge 3. custom classi fi er
model llm attack prompt injection guardrails

1. static regex (Deterministic) 2. LLM-as-judge (LLM Judge) 3. custom
classi fi er model (HF Semantic) llm attack prompt injection guardrails

  eps2.3.3_guardra1ls

  guardra1ls     do not work

  guardra1ls     do not work … on their
own

https://arxiv.org/html/2510.09023v1

1. sanitize inputs ✓ 2. harden prompts ✓ 3. add
guardrails ✓ llm attack prompt injection

1. hallucination ✓ 2. jailbreak ✓ 3. prompt injection ✓
questions? llm attacks

1. large language models ✓ 2. llm attacks ✓ questions?
3. agents 4. agent attacks

eps3.0_ag3nts

https://simonwillison.net/2025/Sep/18/agents/

input output

input 🛠 tools output

input 🛠 tools result output

input 🛠 tools result planning reasoning output

input 🛠 tools result planning reasoning output goal achieved?

eps3.1_t00ls

https://chatgpt.com/features/connectors

  eps4.0_ag3nt_att4cks

https://labs.zenity.io/p/agent fl ayer-chatgpt-connectors-0click-attack-5b41

Lethal Trifecta Simon Willison https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

https://chatgpt.com/features/connectors

how to stop agent attacks

https://github.com/mozilla/bugbug/blob/6c876a6c5e87a2f82b33ef7b6cd345e351dcd1f4/mcp/src/bugbug_mcp/server.py#L35

https://github.com/mozilla/bugbug/blob/master/bugbug/tools/code_review/prompts.py#L11

https://github.com/mozilla/bugbug/blob/6c876a6c5e87a2f82b33ef7b6cd345e351dcd1f4/mcp/src/bugbug_mcp/server.py#L105

https://phabricator.services.mozilla.com/D270778

https://arxiv.org/html/2510.09023v1

1. large language models ✓ 2. llm attacks ✓ 3.
agents ✓ 4. agent attacks ✓

hah, gotcha!

wait … an iPhone in Mexico?

1. large language models ✓ 2. llm attacks ✓ 3.
agents ✓ 4. agent attacks ✓ questions?

Mr. Brokebot: Lethal language attacks against A...

Mr. Brokebot: Lethal language attacks against AI agents

More Decks by luke crouch

Other Decks in Technology

Featured

Transcript