
[2024.11.27] SK WaveHill Meetup - LLM Fine-tuning

Beomi
November 27, 2024


A comprehensive guide covering the path from fine-tuning Large Language Models (LLMs) to on-device deployment, presented by Lee Junbum.

This presentation covers:
- Overview of LLMs: Capabilities, recent advancements, and the importance of fine-tuning for customization, efficiency, and performance.
- Key fine-tuning techniques: Includes Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Parameter-Efficient Fine-Tuning (PEFT), and Retrieval-Augmented Fine-Tuning (RAFT).
- Hands-on tutorials: Practical steps using tools like Hugging Face, Google Colab, and Llama.cpp to fine-tune, deploy, and optimize models.
- Model deployment: Strategies for efficient deployment with quantization techniques like QLoRA in resource-constrained environments such as a MacBook.

Aimed at AI practitioners and researchers, this deck provides actionable insights into taking LLMs from pre-training to impactful real-world applications.


Transcript

  1. Agenda 1. Introduction to LLMs and Recent Trends 2. Understanding

    Pre-training and Post-training 3. Supervised Fine-Tuning (SFT) 4. Reinforcement Learning from Human Feedback (RLHF) 5. Parameter-Efficient Fine-Tuning (PEFT) with LoRA 6. Quantization Techniques (QLoRA) 7. Retrieval-Augmented Fine-Tuning (RAFT) 8. Hands-on Tutorials with Google Colab 9. Model Conversion and Inference with Llama.cpp 10. Q&A Session 3
  2. What are Large Language Models (LLMs)? Definition: LLMs are deep

    learning models with billions of parameters trained on vast amounts of text data. Capabilities: Natural language understanding, text generation, translation, summarization. 5
  3. Recent Advancements in LLMs OpenAI GPT-4o, o1: Improved reasoning and

    understanding, multimodal capabilities. Google's Gemini: Combines a 2M-token context window with strong language understanding. Anthropic's Claude: Focuses on safe and responsible AI. Open-source LLMs: Rapid growth in community-driven models, e.g., Meta Llama, Google Gemma, Alibaba Qwen, ... 6
  4. 7

  5. Why Fine-Tune LLMs? Customization: Tailor models to specific domains or

    tasks. Performance: Enhance accuracy on specialized datasets. Efficiency: Reduce inference time and computational resources. Control: Implement safety measures and bias mitigation. 8
  6. 9

  7. 10

  8. 11

  9. Pre-training Definition: Training a model on large-scale datasets to learn

    general language patterns. Characteristics: Unsupervised learning, massive datasets (Common Crawl, Wikipedia), foundation for downstream tasks. 13
  10. Post-training Definition: Further training of a pre-trained model to improve

    performance on specific tasks. Includes: Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF). 14
  11. SFT vs RLHF SFT: Uses labeled datasets; directly adjusts model

    weights based on supervised signals. RLHF: Incorporates human preferences; uses reinforcement learning algorithms. 15
  12. What is SFT? Process: Training a pre-trained model on labeled

    task-specific data. Objective: Align the model's outputs with desired responses. 17
  13. Steps in SFT 1. Data Collection: Curate a dataset relevant

    to the target task. 2. Data Preprocessing: Clean and tokenize data. 3. Fine-Tuning: Adjust model weights using supervised learning. 4. Evaluation: Assess performance on validation data. 18
  14. SFT Dataset Format Input: Prompt / Context. Output: Response. Example:

    { "instruction": "What is the capital of Korea?", "output": "The capital of Korea is Seoul." } SFT Datasets: Alpaca, KoAlpaca / KoAlpaca-RealQA 19
  15. 23

  16. SFT with Hugging Face Hugging Face Hub, Transformers Library. Install:

    pip install transformers datasets trl 24
  17. Example: SFT Solar-Ko with KoAlpaca-RealQA

    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
    from trl import SFTTrainer
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    train_dataset = load_dataset('beomi/KoAlpaca-RealQA')

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_total_limit=1,
        logging_strategy='steps',
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset['train'],
        tokenizer=tokenizer,
    )
    trainer.train()
    25
  18. Line-by-Line: Load Model and Tokenizer

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')

    Load Dataset

    from datasets import load_dataset

    train_dataset = load_dataset('beomi/KoAlpaca-RealQA')
    26
  19. Training Arguments

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_total_limit=1,
        logging_strategy='steps',
    )

    Initialize SFT Trainer & Train

    from trl import SFTTrainer

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset['train'],
        tokenizer=tokenizer,
    )
    trainer.train()
    27
  20. Inference Examples

    ### Instruction:
    안녕하세요 (Hello)
    ### Response:
    안녕하세요! 어떻게 도와드릴까요?</s> (Hello! How can I help you?)

    ### Instruction:
    아래 글을 한국어로 번역해줘. (Translate the text below into Korean.)
    The KoAlpaca-RealQA dataset is a unique Korean instruction dataset designed to closely reflect real user interactions in the Korean language.
    ### Response:
    KoAlpaca-RealQA 데이터셋은 한국어 사용자들의 실제 상호작용을 매우 잘 반영하도록 설계된 독특한 한국어 지시 데이터셋입니다.</s>
    28
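    A minimal generation sketch for trying the fine-tuned model locally; the checkpoint path './results' is the output directory used above, the prompt template mirrors the examples on this slide, and the decoding settings are illustrative:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    ckpt = './results'  # directory where the trainer saved the fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map='auto')

    prompt = "### Instruction:\n안녕하세요\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Print only the newly generated tokens after the prompt.
    print(tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))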
  21. Benefits of SFT Task Specialization: Model becomes adept at specific

    tasks. Data Efficiency: Requires less data than training from scratch. Improved Performance: Higher accuracy on target tasks. 29
  22. What is RLHF? Definition: An approach that uses human preferences

    to fine-tune models via reinforcement learning. Goal: Align model outputs with human values and expectations. 31
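    For intuition, a preference record pairs one prompt with a preferred and a rejected completion; a minimal sketch (the field names follow the chosen/rejected convention commonly used by TRL preference datasets, the texts are invented):

    # One human-preference example: "chosen" is preferred over "rejected".
    preference_example = {
        "prompt": "What is the capital of Korea?",
        "chosen": "The capital of Korea is Seoul.",
        "rejected": "I am not sure.",
    }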
  23. Proximal Policy Optimization (PPO) Algorithm: Balances exploration and exploitation by

    optimizing a surrogate objective function. Use Case: Adjusts the policy network to produce desired outputs. Used for training OpenAI's GPT-4o, etc. 33
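    As a rough illustration of that surrogate objective, the per-sample PPO clipped loss can be sketched as below (a minimal PyTorch sketch; the variable names and the clip range eps=0.2 are illustrative and not tied to any specific TRL release):

    import torch

    def ppo_clipped_loss(logprobs, old_logprobs, advantages, eps=0.2):
        # Probability ratio between the current policy and the rollout (old) policy.
        ratio = torch.exp(logprobs - old_logprobs)
        # Clipped surrogate objective: keep the more pessimistic of the two terms.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()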
  24. 34

  25. Reward Trainer w/ TRL

    from peft import LoraConfig, TaskType
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from trl import RewardTrainer, RewardConfig

    model = AutoModelForSequenceClassification.from_pretrained("gpt2")
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    # ...
    trainer = RewardTrainer(
        model=model,
        args=training_args,
        processing_class=tokenizer,
        train_dataset=dataset,
        peft_config=peft_config,
    )
    trainer.train()
    37
  26. Direct Preference Optimization (DPO) Concept: Simplifies RLHF by directly optimizing

    preferences without the need for reward models. Advantage: Reduces complexity and training time. 38
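    The core idea can be sketched as a single loss over preference pairs (a minimal PyTorch sketch of the DPO objective; log-probabilities are assumed to be summed over each response, and beta=0.1 is illustrative):

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Log-ratio of the policy vs. the frozen reference model for each response.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin between chosen and rejected responses.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()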
  27. 39

  28. 42

  29. Setting Up Online DPO

    1. Install TRL

    pip install trl

    2. Define the Custom Judge

    def custom_judge(response):
        # Implement custom logic to evaluate the response
        # Return a scalar reward
        return reward
    43
  30. 3. Initialize the Model and Optimizer

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
    44
  31. 4. Make a Judge Function

    from trl import OpenAIPairwiseJudge

    judge = OpenAIPairwiseJudge(
        model_name="gpt-4o",
        system_prompt="...",
    )

    Default system prompt: https://github.com/huggingface/trl/blob/b80c1a6/trl/trainer/judges.py#L35-L61
    45
  32. 46

  33. 5. Training Loop with DPO

    from trl import OnlineDPOConfig, OnlineDPOTrainer

    training_args = OnlineDPOConfig(
        output_dir="aya-expanse-8b-OnlineDPO",
        logging_steps=1,
        # max_steps=100,
        report_to='tensorboard',
        bf16=True,
        per_device_train_batch_size=1,
        gradient_checkpointing="unsloth",
        max_new_tokens=2000,
    )
    trainer = OnlineDPOTrainer(
        model=model,
        judge=judge,
        args=training_args,
        processing_class=tokenizer,
        train_dataset=train_dataset,
    )
    trainer.train()
    48
  34. Benefits of RLHF Alignment: Ensures model outputs align with human

    values. Safety: Reduces harmful or biased outputs. Quality Improvement: Enhances the usefulness of generated content. 49
  35. What is PEFT? Definition: Techniques that fine-tune models with fewer

    parameters, reducing computational resources. Hugging Face: PEFT library. Supported Methods: LoRA, QLoRA, etc. 51
  36. Introduction to LoRA LoRA (Low-Rank Adaptation): Decomposes weight updates into

    low-rank matrices. Keeps original weights frozen. Efficient and memory-saving. 52
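    A minimal sketch of the idea for a single linear layer (illustrative PyTorch, not the PEFT implementation): the frozen weight is augmented with a trainable low-rank update B·A scaled by alpha/r.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=32):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # original weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            # y = base(x) + scale * x A^T B^T, where only A and B are trained
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)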
  37. Use LoRA via PEFT

    1. Install PEFT Library

    pip install peft

    2. Load Pre-trained Model

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import get_peft_model, LoraConfig

    model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    53
  38. 3. Configure LoRA

    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.1,
        bias='none',
    )
    model = get_peft_model(model, config)

    4. Same training loop as SFT/RLHF
    54
  39. Advantages of LoRA Efficiency: Reduces GPU memory usage. Speed: Faster

    training times. Scalability: Easier to fine-tune very large models. 55
  40. Why Quantize Models? Reduce Memory Footprint: Smaller models consume less

    memory. Increase Inference Speed: Quantized models often run faster. Deploy on Edge Devices: Makes deployment on resource-constrained devices feasible. 57
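    A quick back-of-the-envelope check of the weight-only memory footprint for an ~11B-parameter model at different precisions (a rough sketch that ignores activations, the KV cache, and optimizer state):

    params = 11e9  # ~11B parameters, e.g. a Solar-Ko-Recovery-11B-sized model
    for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
        gib = params * bits / 8 / 1024**3
        print(f"{name}: ~{gib:.1f} GiB")  # ~20.5, ~10.2, ~5.1 GiB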
  41. Introduction to QLoRA QLoRA: Combines quantization with LoRA to enable

    fine-tuning large models on a single GPU. 58
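    In the Hugging Face stack this usually means loading the base model in 4-bit NF4 via bitsandbytes and training only LoRA adapters on top; a minimal loading sketch (the NF4, double-quantization, and bfloat16 compute settings follow the QLoRA recipe, and the model name is just an example):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        'meta-llama/Meta-Llama-3.1-8B-Instruct',
        quantization_config=bnb_config,
        device_map='auto',
    )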
  42. 2. Load Model

    You can use 4-bit or 8-bit quantized models.
    Use load_in_4bit=True for 4-bit quantized models.
    Use load_in_8bit=True for 8-bit quantized models.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        'meta-llama/Meta-Llama-3.1-8B-Instruct',
        load_in_4bit=True,
        device_map='auto',
    )
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    60
  43. 3. Apply LoRA

    from peft import get_peft_model, LoraConfig

    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM',
    )
    model = get_peft_model(model, config)

    4. Same training loop as SFT/RLHF
    61
  44. Benefits of QLoRA Resource Efficiency: Fine-tune 70B+ parameter models on

    a single GPU. Performance: Minimal loss in model accuracy. Cost-Effective: Reduces the need for expensive hardware. 62
  45. What is RAFT? Definition: A technique that combines retrieval mechanisms

    with fine-tuning to enhance model performance on specific knowledge domains. 64
  46. Why Use RAFT? Domain-Specific Knowledge: Incorporate up-to-date or specialized information.

    Improved Accuracy: Provides relevant context that the base model may lack. Dynamic Updating: Easily update the retrieval database without retraining the model. 65
  47. What do we need? Raw Dataset + Context (negative and positive

    samples). LLM API for creating Q/A pairs. Retrieval API for creating context (not covered in this talk). SFT Trainer from TRL. How these pieces can be assembled into training text is sketched below. 72
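    A minimal sketch of turning one such record into SFT text (the field names, distractor setup, and prompt template are illustrative): RAFT-style training mixes the relevant "golden" context with irrelevant distractor contexts so the model learns to answer from the right document.

    def build_raft_example(question, golden_doc, distractor_docs, answer):
        # Mix the relevant (positive) context with irrelevant (negative) ones.
        contexts = distractor_docs + [golden_doc]
        context_block = "\n\n".join(f"[Doc {i+1}] {doc}" for i, doc in enumerate(contexts))
        prompt = (
            f"### Context:\n{context_block}\n\n"
            f"### Instruction:\n{question}\n\n"
            f"### Response:\n"
        )
        return {"text": prompt + answer}

    example = build_raft_example(
        question="What is the capital of Korea?",
        golden_doc="Seoul is the capital and largest city of South Korea.",
        distractor_docs=["Tokyo is the capital of Japan."],
        answer="The capital of Korea is Seoul.",
    )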
  48. Setting Up the Environment 1. Open Google Colab Colab URL:

    https://beomi.net/sk-2411/raft 2. Enable GPU Runtime 73
  49. SFT with QLoRA

    Step 1: Install Dependencies

    !pip install -q -U transformers
    !pip install -q datasets accelerate bitsandbytes lomo-optim hf_transfer trl
    !pip install -q flash-attn --no-build-isolation
    74
  50. Step 2: Load a Quantized Model

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

    model_id = "beomi/Solar-Ko-Recovery-11B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # nn.Linear to 4-bit
        torch_dtype=torch.bfloat16,  # everything else to bfloat16
        attn_implementation="flash_attention_2",
    )
    75
  51. Step 3: Apply LoRA

    from peft import get_peft_model, LoraConfig

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM',
    )
    model = get_peft_model(model, config)
    76
  52. Step 4: Prepare Dataset

    from unsloth.chat_templates import get_chat_template

    tokenizer = get_chat_template(
        tokenizer,
        chat_template = "chatml",  # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
        mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
        map_eos_token = True,  # Maps <|im_end|> to </s> instead
    )

    def formatting_prompts_func(examples):
        convos = examples["messages"]
        texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
        return {"text": texts}

    from datasets import load_dataset

    dataset = load_dataset("beomi/KoAlpaca-RealQA-oai", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)
    77
  53. 78

  54. 79

  55. Step 5: Fine-Tune the Model

    from trl import SFTTrainer
    from transformers import TrainingArguments
    from unsloth import is_bfloat16_supported

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        dataset_num_proc = 2,
        packing = False,  # Can make training 5x faster for short sequences.
        args = TrainingArguments(
            per_device_train_batch_size = 32,
            gradient_accumulation_steps = 4,
            warmup_steps = 10,
            num_train_epochs = 3,  # Set this for 1 full training run.
            # max_steps = 60,
            learning_rate = 2e-4,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none",  # Use this for WandB etc
        ),
    )
    80
  56. 83

  57. 84

  58. Setting Up the Environment 1. Open Google Colab Colab URL:

    https://beomi.net/sk-2411/onlinedpo 2. Enable GPU Runtime 85
  59. 88

  60. Converting LoRA Model to GGUF Format 1. Convert the Model

    LoRA Converter: https://huggingface.co/spaces/beomi/gguf-my-lora
    * The base model should be converted with https://huggingface.co/spaces/ggml-org/gguf-my-repo
    89
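    If you prefer to stay local, one alternative (a sketch under assumptions, not the workflow shown on the slide) is to merge the LoRA adapter into the base weights with PEFT and then convert the merged checkpoint to GGUF with llama.cpp's converter or the gguf-my-repo space; the paths below are illustrative.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    merged = PeftModel.from_pretrained(base, './outputs')  # hypothetical path to the trained LoRA adapter
    merged = merged.merge_and_unload()                     # bake the adapter into the base weights
    merged.save_pretrained('./merged-model')
    AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B').save_pretrained('./merged-model')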
  61. Inference with Llama.cpp on MacBook

    1. Install Llama.cpp
    https://github.com/ggerganov/llama.cpp/releases

    2. Download the Model
    https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-F16-GGUF/tree/main
    Download the LoRA gguf file: KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf

    3. Load the Model
    ./llama-server \
      --hf-repo beomi/Solar-Ko-Recovery-11B-Q8_0-GGUF \
      --hf-file solar-ko-recovery-11b-q8_0.gguf \
      -c 2048 --lora KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
    90
  62. 91

  63. 3. Generate Text

    Go to http://localhost:8080
    Tip: Use | as a stop token: {"stop": ["|"]}
    92
  64. 93

  65. Glossary LLM: Large Language Model SFT: Supervised Fine-Tuning RLHF: Reinforcement

    Learning from Human Feedback PEFT: Parameter-Efficient Fine-Tuning LoRA: Low-Rank Adaptation QLoRA: Quantized LoRA RAFT: Retrieval-Augmented Fine-Tuning RAG: Retrieval-Augmented Generation PPO: Proximal Policy Optimization DPO: Direct Preference Optimization 97