$30 off During Our Annual Pro Sale. View Details »

Mr. Brokebot: Lethal language attacks against A...

Avatar for luke crouch luke crouch
December 08, 2025
7

Mr. Brokebot: Lethal language attacks against AI agents

Avatar for luke crouch

luke crouch

December 08, 2025
Tweet

Transcript

  1. The Price of Intelligence Mark Russinovich, Ahmed Salem, Santiago Zanella-

    Béguelin, and Yonatan Hunger https://cacm.acm.org/practice/the-price-of-intelligence/
  2. LLM Attacks Hallucination • generates incorrect or incomplete content •

    factual inaccuracies, fabricated information, contradictions, omissions, etc. • inherit characteristic of LLMs • larger models tend to hallucinate less https://cacm.acm.org/practice/the-price-of-intelligence/
  3. LLM Attacks Risk Hallucination • Medical misinformation [1] • Mental

    health misinformation [2] • False legal or policy citations [3]
  4. https://futurism.com/local-restaurant-exhausted-google-ai Slopsquatting registering a non-existent software package name that a

    large language model (LLM) may hallucinate in its output, whereby someone unknowingly may copy- paste and install the software package without realizing it is fake
  5. registering a non-existent software package name that a large language

    model (LLM) may hallucinate in its output, whereby someone unknowingly may copy- paste and install the software package without realizing it is fake https://arxiv.org/abs/2406.10279 Slopsquatting
  6. despite mitigation efforts, AI hallucination rates still generally vary from

    as low as 2% in some models for short summarization tasks and as high as 50% for more complex tasks and speci fi c domains, such as law and healthcare https://cacm.acm.org/practice/the-price-of-intelligence/#sec5
  7. 1. External Groundedness checkers 2. Fact Correction 3. Improved RAG

    Systems 4. Ensemble methods https://cacm.acm.org/practice/the-price-of-intelligence/#sec5
  8. Large Language Model Behavior Alignment • aka “Outer alignment”, “Safety

    layer”, “Alignment layer” • part of training the model • Supervised fi ne-tuning, Reinforcement learning from human feedback, etc. • shape an LLM’s output - especially “safe” output … https://cacm.acm.org/practice/the-price-of-intelligence/
  9. Large Language Model Behavior Alignment • aka “Outer alignment”, “Safety

    layer”, “Alignment layer” • part of training the model • Supervised fi ne-tuning, Reinforcement learning from human feedback, etc. • shape an LLM’s output - especially “safe” output … • … according to whomever is training the model! https://cacm.acm.org/practice/the-price-of-intelligence/
  10. LLM Attacks Jailbreak • manipulate an LLM into violating its

    established guidelines, ethical constraints, or trained alignments • exploit the fl exibility and contextual understanding capabilities of LLMs • e.g., some military or law enforcement need to understand how pipe bombs are made … • but some military or law enforcement should NOT … • but maybe they’re writing a documentary … • but … https://cacm.acm.org/practice/the-price-of-intelligence/
  11. LLM Attacks prompt injection • LLM follows instructions in the

    data, rather than the user’s instructions • exploits the LLM mixing input data and language instructions https://cacm.acm.org/practice/the-price-of-intelligence/
  12. LLM Attacks Malicious Output malicious Input prompt injection may not

    violate the LLM but violates the user intent
  13. Lethal Language Attacks Exposure to untrusted content Access to
 private

    data External communication Ability to
 change data Lethal Trifecta
  14. Lethal Language Attacks Exposure to untrusted content Access to
 private

    data External communication Ability to
 change data Lethal Trifecta Lethal Pair