Slide 1

Slide 1 text

October 11, 2024 | José Manuel Ortega Security and auditing tools in Large Language Models (LLM) [email protected]

Slide 2

Slide 2 text

Agenda • Introduction to LLM • Introduction to OWASP LLM Top 10 • Auditing tools • Use case with the textattack tool

Slide 3

Slide 3 text

Introduction to LLM • Transformers • Attention is All You Need" by Vaswani et al. in 2017 • Self-attention mechanism • Encoder-Decoder Architecture

Slide 4

Slide 4 text

Introduction to LLM

Slide 5

Slide 5 text

Introduction to LLM Pre-training + fine-tuning

Slide 6

Slide 6 text

Introduction to LLM ● Language Models: Models like BERT, GPT, T5, and RoBERTa are based on transformer architecture. They are used for a wide range of NLP tasks such as text classification, question answering, and language translation. ● Vision Transformers (ViT): Transformers have been adapted for computer vision tasks, where they have been applied to image classification, object detection, etc. ● Speech Processing: In addition to text and vision, transformers have also been applied to tasks like speech recognition and synthesis.

Slide 7

Slide 7 text

Introduction to OWASP LLM Top 10 • https://genai.owasp.org

Slide 8

Slide 8 text

Introduction to OWASP LLM Top 10

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

ChiperChat https://arxiv.org/pdf/2308.06463

Slide 11

Slide 11 text

Jailbreak prompts ● https://jailbreak-llms. xinyueshen.me/

Slide 12

Slide 12 text

Jailbreak prompts ● https://jailbreak-llms. xinyueshen.me/

Slide 13

Slide 13 text

Introduction to OWASP LLM Top 10 • Data Poisoning • Malicious actors could poison the training data by injecting false, harmful, or biased information into datasets that train the LLM, which could degrade the model's performance. • Mitigation: Data source vetting, training data audits, and anomaly detection for suspicious patterns in training data.

Slide 14

Slide 14 text

Introduction to OWASP LLM Top 10 • Model Inversion Attacks • Attackers could exploit the LLM to infer sensitive or private data that was used during training by repeatedly querying the model. This could expose personal, confidential, or proprietary information. • Mitigation: Rate-limiting sensitive queries and limiting the availability of models trained on private data.

Slide 15

Slide 15 text

Introduction to OWASP LLM Top 10 • Unauthorized Code Execution • In some contexts, LLMs might be integrated into systems where they have access to execute code or trigger automated actions. Attackers could manipulate LLMs into running unintended code or actions, potentially compromising the system. • Mitigation: Limit the scope of actions that LLMs can execute, employ sandboxing, and use strict permission controls.

Slide 16

Slide 16 text

Introduction to OWASP LLM Top 10 • Bias and Fairness • LLMs can generate biased outputs due to the biased nature of the data they are trained on, leading to unfair or discriminatory outcomes. This could impact decision-making processes, amplify harmful stereotypes, or introduce systemic biases. • Mitigation: Perform fairness audits, use bias detection tools, and diversify training datasets to reduce bias.

Slide 17

Slide 17 text

Introduction to OWASP LLM Top 10 • Model Hallucination • LLMs can produce outputs that are plausible-sounding but factually incorrect or entirely fabricated. This is referred to as "hallucination," where the model generates false information without any grounding in its training data. • Mitigation: Post-response validation, fact-checking algorithms, and restricting LLMs to provide responses only within known knowledge domains.

Slide 18

Slide 18 text

Introduction to OWASP LLM Top 10 • Insecure Model Deployment • LLMs that are deployed in unsecured environments could be vulnerable to attacks, including unauthorized access, model theft, or tampering. These risks are elevated when models are deployed in publicly accessible endpoints. • Mitigation: Use encrypted APIs, secure infrastructure, implement authentication and authorization controls, and monitor model access.

Slide 19

Slide 19 text

Introduction to OWASP LLM Top 10 • Adversarial Attacks • Attackers might exploit weaknesses in the LLM by crafting adversarial examples. This could lead to undesirable outputs or security breaches. • Mitigation: Model robustness testing, adversarial training (training the model with adversarial examples), and implementing anomaly detection systems.

Slide 20

Slide 20 text

• https://llm-attacks.org

Slide 21

Slide 21 text

Tools/frameworks to evaluate model robustness ● PromptInject Framework ● https://github.com/agencyenterprise/PromptInject ● PAIR - Prompt Automatic Iterative Refinement ● https://github.com/patrickrchao/JailbreakingLLMs ● TAP - Tree of Attacks with Pruning ● https://github.com/RICommunity/TAP

Slide 22

Slide 22 text

Auditing tools • https://github.com/tensorflow/fairness-indicators

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Auditing tools • Prompt Guard refers to a set of strategies, tools, or techniques designed to safeguard the behavior of large language models (LLMs) from malicious or unintended input manipulations. • Prompt Guard uses an 86M parameter classifier model that has been trained on a large dataset of attacks and prompts found on the web. Prompt Guard can categorize a prompt into three different categories: "Jailbreak", "Injection" or "Benign".

Slide 25

Slide 25 text

Auditing tools • https://huggingface.co/meta-llama/Prompt-Guard-86M

Slide 26

Slide 26 text

Auditing tools • Llama Guard 3 refers to a security tool or strategy designed for guarding large language models like Meta’s LLaMA against potential vulnerabilities and adversarial attacks. • Llama Guard 3 offers a robust and adaptable solution to protect LLMs against Prompt Injection and Jailbreak attacks. By combining advanced filtering, normalization, and monitoring techniques.

Slide 27

Slide 27 text

Auditing tools • Dynamic Input Filtering • Prompt Normalization and Contextualization • Secure Response Policy • Active Monitoring and Automatic Response

Slide 28

Slide 28 text

Auditing tools • https://huggingface.co/spaces/schroneko/meta-llama- Llama-Guard-3-8B-INT8

Slide 29

Slide 29 text

Auditing tools • S1: Violent Crimes • S2: Non-Violent Crimes • S3: Sex-Related Crimes • S4: Child Sexual Exploitation • +S5: Defamation (New) • S6: Specialized Advice • S7: Privacy • S8: Intellectual Property • S9: Indiscriminate Weapons • S10: Hate • S11: Suicide & Self-Harm • S12: Sexual Content • S13: Elections • S14: Code Interpreter Abuse Introducing v0.5 of the AI Safety Benchmark from MLCommons

Slide 30

Slide 30 text

Text attack https://arxiv.org/pdf/2005.05909

Slide 31

Slide 31 text

Text attack https://arxiv.org/pdf/2005.05909

Slide 32

Slide 32 text

Text attack from textattack.models.wrappers import HuggingFaceModelWrapper from transformers import AutoModelForSequenceClassification, AutoTokenizer # Load pre-trained sentiment analysis model from Hugging Face model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-unc ased-imdb") tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb") # Wrap the model for TextAttack model_wrapper = HuggingFaceModelWrapper(model, tokenizer) https://github.com/QData/TextAttack

Slide 33

Slide 33 text

Text attack from textattack.attack_recipes import TextFoolerJin2019 # Initialize the attack with the TextFooler recipe attack = TextFoolerJin2019.build(model_wrapper)

Slide 34

Slide 34 text

Text attack # Example text for sentiment analysis (a positive review) text = "I absolutely loved this movie! The plot was thrilling, and the acting was top-notch." # Apply the attack adversarial_examples = attack.attack([text]) print(adversarial_examples)

Slide 35

Slide 35 text

Text attack Original Text: "I absolutely loved this movie! The plot was thrilling, and the acting was top-notch." Adversarial Text: "I completely liked this film! The storyline was gripping, and the performance was outstanding."

Slide 36

Slide 36 text

Text attack from textattack.augmentation import WordNetAugmenter # Use WordNet-based augmentation to create adversarial examples augmenter = WordNetAugmenter() # Augment the training data with adversarial examples augmented_texts = augmenter.augment(text) print(augmented_texts)

Slide 37

Slide 37 text

Resources

Slide 38

Slide 38 text

Resources ● github.com/greshake/llm-security ● github.com/corca-ai/awesome-llm-security ● github.com/facebookresearch/PurpleLlama ● github.com/protectai/llm-guard ● github.com/cckuailong/awesome-gpt-security ● github.com/jedi4ever/learning-llms-and-genai-for-dev-sec-ops ● github.com/Hannibal046/Awesome-LLM

Slide 39

Slide 39 text

Resources ● https://cloudsecurityalliance.org/artifacts/security-implications-of -chatgpt ● https://www.nist.gov/itl/ai-risk-management-framework ● https://blog.google/technology/safety-security/introducing-googl es-secure-ai-framework ● https://owasp.org/www-project-top-10-for-large-language-model -applications/

Slide 40

Slide 40 text

Security and auditing tools in Large Language Models (LLM)

Slide 41

Slide 41 text

No content