Red Teaming Latent Spaces & Protecting LLM apps

Slide 1

Slide 1 text

By Raul Pino PyCon Austria 2025 Red Teaming Latent Spaces & Protecting LLM apps

Slide 2

Slide 2 text

Agenda ● Intro: ○ LLM Apps ○ Red Teaming ● HackAPrompt Paper ○ Ontology and generic concepts. ● Attack Vectors or Vulnerabilities ● Demos ● Security and countermeasures ● Takeaways & beyond

Slide 3

Slide 3 text

About Me ● Born in Venezuela. ● +10 years of exp as Software Engineer & AI enthusiast (ML Eng recently). ● Living in Chile. ○ Halborn, Distro (YC S24), Elementus, uBiome, Groupon. ● <3 AI, Coffee, Scuba Diving, …

Slide 4

Slide 4 text

*** Predicts the next token. https://bbycroft.net/llm https://poloclub.github.io/transformer-explainer/ What is an LLM? Large Language Models (LLM): ChatGPT, Claude, Mistral, DeepSeek, …

Slide 5

Slide 5 text

LLM App: Cursor, Github Copilot, … What is an LLM? LLM App? … but the most basic might be:

Slide 6

Slide 6 text

What is an LLM App? RAG? LLM App: Cursor, Github Copilot, … *** Helpful assistant.

Slide 7

Slide 7 text

● War games and military strategy exercises. ● 1990s–2000s: expanded into cybersecurity, simulating cyber attacks. What is Red Teaming? A “red team” played the role of the enemy to test the defenses and strategy of the “blue team”.

Slide 8

Slide 8 text

Red Teaming an LLM App is not:

Slide 9

Slide 9 text

Red Teaming an LLM App is::

Slide 10

Slide 10 text

Ignore This Title and HackAPrompt! ● A global prompt hacking competition (2023). ● 2800 people from 50+ countries. ● 600K+ adversarial prompts against three state-of-the art LLMs. Paper: https://arxiv.org/abs/2311.16119 Dataset: https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset Podcast: https://www.latent.space/p/learn-prompting

Slide 11

Slide 11 text

Ignore This Title and HackAPrompt! Simple Prompt Hacking

Slide 12

Slide 12 text

Prompt Hacking =(Prompt Injection, Jailbreaking) Exploiting: 1. Predicts the next token. 2. Helpful assistant. Demos incoming, to the notebooks!

Slide 13

Slide 13 text

Taxonomical Ontology of Exploits

Slide 14

Slide 14 text

Demos: Attack Vectors or Vulnerabilities ● Instruction Manipulation Attacks: Directly Changing the Model’s Behavior ● Contextual Exploits: Manipulating How the Model Understands the Input ● Obfuscation & Encoding Attacks: Hiding Malicious Intent ● Resource Exploitation Attacks: Abusing System Limitations …To the notebooks!

Slide 15

Slide 15 text

● Documented 29 separate prompt hacking techniques. ● Attackers iterated and refined their prompts over time. ○ Lengthier attacks were initially successful but later optimized for brevity. ● Models with higher verbosity (like ChatGPT) were harder to hack but still failed. HackAPrompt: Results and Key Insights “LLM security is in early stages, and just like human social engineering may not be 100% solvable, so too could prompt hacking prove to be an impossible problem; you can patch a software bug, but perhaps not a (neural) brain.” LLM Security is here to stay, we have a job!

Slide 16

Slide 16 text

Demos: Red Teaming ● Manual/standard prompts list ● LLMs as adversarial prompts generator ● LLMs as generator and evaluator ● Using OS Library + Service …To the notebooks!

Slide 17

Slide 17 text

Other Vulnerabilities ● Bias and Stereotypes ● Hallucinations

Slide 18

Slide 18 text

Other Ontologies and Organizations ● OWASP Top 10 for LLM Applications 2025 - https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025 ● AI Incident Database - https://incidentdatabase.ai ● AI Vulnerability Database - https://avidml.org

Slide 19

Slide 19 text

Security & Countermeasures: Guardrails ● Guardrails AI: https://www.guardrailsai.com/ ○ Langchain helper: https://python.langchain.com/v0.1/docs/templates/guardrails-output-parser/ ● Databricks Guardrails: https://www.databricks.com/blog/implementing-llm-guardrails-safe-and-respon sible-generative-ai-deployment-databricks ● AWS Bedrock Guardrails: https://aws.amazon.com/bedrock/guardrails/

Slide 20

Slide 20 text

…To the notebooks! Security & Countermeasures: Guardrails

Slide 21

Slide 21 text

Reminder: System Vulnerabilities LLM App is part of a larger system!

Slide 22

Slide 22 text

● Multimodal challenges ○ Old adversarial image attacks Takeaways & beyond https://www.youtube.com/watch?v=Klepca1Ny3c

Slide 23

Slide 23 text

● Multimodal challenges ○ New adversarial image attack Takeaways & beyond https://arxiv.org/pdf/2410.08338

Slide 24

Slide 24 text

● Multimodal challenges ○ The most creative I’ve found :’) Takeaways & beyond https://x.com/me_irl/status/1901497992865071428?s=46

Slide 25

Slide 25 text

Takeaways & beyond ● HackAPrompt (1.0) paper still relevant! ● HackAPrompt 2.0 it’s coming https://www.hackaprompt.com/

Slide 26

Slide 26 text

Resources ● https://www.hackaprompt.com/ ● Coursera - https://learn.deeplearning.ai/courses/red-teaming-llm-applications ● https://arxiv.org/abs/2311.16119 ● https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset ● https://www.latent.space/p/learn-prompting ● https://bbycroft.net/llm ● https://poloclub.github.io/transformer-explainer/ ● OWASP Top 10 for LLM Applications 2025 - https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025 ● AI Incident Database - https://incidentdatabase.ai ● AI Vulnerability Database - https://avidml.org ● https://www.giskard.ai/knowledge/how-to-implement-llm-as-a-judge-to-test-ai-agents-part-1 ● https://www.giskard.ai/knowledge/how-to-implement-llm-as-a-judge-to-test-ai-agents-part-2 ● https://arxiv.org/pdf/2410.08338 ● https://tntattacks.github.io/ ● https://github.com/p1nox/red-teaming-latent-spaces

Slide 27

Slide 27 text

Danke schön :)