
[2024.11.27] SK WaveHill Meetup - LLM Fine-tuning

Beomi
November 27, 2024

A comprehensive guide covering everything from fine-tuning Large Language Models (LLMs) to on-device applications, presented by Lee Junbum.

This presentation covers:
- Overview of LLMs: Capabilities, recent advancements, and the importance of fine-tuning for customization, efficiency, and performance.
- Key fine-tuning techniques: Includes Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Parameter-Efficient Fine-Tuning (PEFT), and Retrieval-Augmented Fine-Tuning (RAFT).
- Hands-on tutorials: Practical steps using tools like Hugging Face, Google Colab, and Llama.cpp to fine-tune, deploy, and optimize models.
- Model deployment: Strategies for efficient deployment using quantization techniques such as QLoRA in resource-constrained environments like a MacBook.

Designed for AI practitioners and researchers, this deck provides actionable insights into transitioning LLMs from pre-training to impactful real-world applications.


Transcript

  1. Agenda
     1. Introduction to LLMs and Recent Trends
     2. Understanding Pre-training and Post-training
     3. Supervised Fine-Tuning (SFT)
     4. Reinforcement Learning from Human Feedback (RLHF)
     5. Parameter-Efficient Fine-Tuning (PEFT) with LoRA
     6. Quantization Techniques (QLoRA)
     7. Retrieval-Augmented Fine-Tuning (RAFT)
     8. Hands-on Tutorials with Google Colab
     9. Model Conversion and Inference with Llama.cpp
     10. Q&A Session
  2. What are Large Language Models (LLMs)? Definition: LLMs are deep learning models with billions of parameters trained on vast amounts of text data. Capabilities: natural language understanding, text generation, translation, and summarization.
  3. Recent Advancements in LLMs. OpenAI GPT-4o, o1: improved reasoning and understanding, multimodal capabilities. Google's Gemini: combines a 2M-token context window with strong language understanding. Anthropic's Claude: focuses on safe and responsible AI. Open-source LLMs: rapid growth in community-driven models, e.g., Meta Llama, Google Gemma, Alibaba Qwen, ...

  5. Why Fine-Tune LLMs? Customization: tailor models to specific domains or tasks. Performance: enhance accuracy on specialized datasets. Efficiency: reduce inference time and computational resources. Control: implement safety measures and bias mitigation.
  9. Pre-training. Definition: training a model on large-scale datasets to learn general language patterns. Characteristics: unsupervised learning, massive datasets (Common Crawl, Wikipedia), foundation for downstream tasks.
  10. Post-training. Definition: further training of a pre-trained model to improve performance on specific tasks. Includes: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
  11. SFT vs RLHF. SFT: uses labeled datasets; directly adjusts model weights based on supervised signals. RLHF: incorporates human preferences; uses reinforcement learning algorithms.
  12. What is SFT? Process: training a pre-trained model on labeled task-specific data. Objective: align the model's outputs with desired responses.
  13. Steps in SFT. 1. Data Collection: curate a dataset relevant to the target task. 2. Data Preprocessing: clean and tokenize data. 3. Fine-Tuning: adjust model weights using supervised learning. 4. Evaluation: assess performance on validation data.
  14. SFT Dataset Format. Input: prompt / context. Output: response. Example: { "instruction": "What is the capital of Korea?", "output": "The capital of Korea is Seoul." } SFT datasets: Alpaca, KoAlpaca / KoAlpaca-RealQA. A sketch of one possible formatting function follows below.

  16. SFT with Hugging Face: Hugging Face Hub, Transformers library. Install: pip install transformers datasets trl
  17. Example: SFT Solar-Ko with KoAlpaca-RealQA
     from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
     from trl import SFTTrainer
     from datasets import load_dataset

     tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
     model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
     train_dataset = load_dataset('beomi/KoAlpaca-RealQA')

     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=4,
         save_total_limit=1,
         logging_strategy='steps',
     )
     trainer = SFTTrainer(
         model=model,
         args=training_args,
         train_dataset=train_dataset['train'],
         tokenizer=tokenizer,
     )
     trainer.train()
  18. Line-by-Line. Load model and tokenizer:
     from transformers import AutoTokenizer, AutoModelForCausalLM
     tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
     model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
     Load dataset:
     from datasets import load_dataset
     train_dataset = load_dataset('beomi/KoAlpaca-RealQA')
  19. Training arguments:
     from transformers import TrainingArguments
     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=4,
         save_total_limit=1,
         logging_strategy='steps',
     )
     Initialize SFT Trainer and train:
     from trl import SFTTrainer
     trainer = SFTTrainer(
         model=model,
         args=training_args,
         train_dataset=train_dataset['train'],
         tokenizer=tokenizer,
     )
     trainer.train()
  20. Inference Examples
     ### Instruction:
     안녕하세요 (Hello)
     ### Response:
     안녕하세요! 어떻게 도와드릴까요?</s> (Hello! How can I help you?)

     ### Instruction:
     아래 글을 한국어로 번역해줘. (Translate the text below into Korean.) The KoAlpaca-RealQA dataset is a unique Korean instruction dataset designed to closely reflect real user interactions in the Korean language.
     ### Response:
     KoAlpaca-RealQA 데이터셋은 한국어 사용자들의 실제 상호작용을 매우 잘 반영하도록 설계된 독특한 한국어 지시 데이터셋입니다.</s> (Korean translation of the English sentence above.)
     A hedged generation sketch follows below.
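     A hedged sketch of reproducing such an inference with Transformers (the checkpoint name stands in for your fine-tuned model, and the generation settings are illustrative):

     import torch
     from transformers import AutoTokenizer, AutoModelForCausalLM

     model_id = 'beomi/Solar-Ko-Recovery-11B'  # substitute your fine-tuned checkpoint
     tokenizer = AutoTokenizer.from_pretrained(model_id)
     model = AutoModelForCausalLM.from_pretrained(
         model_id, device_map='auto', torch_dtype=torch.bfloat16,
     )

     prompt = "### Instruction:\n안녕하세요\n\n### Response:\n"
     inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
     outputs = model.generate(**inputs, max_new_tokens=128)
     # Decode only the newly generated tokens after the prompt.
     print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))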
  21. Benefits of SFT. Task Specialization: the model becomes adept at specific tasks. Data Efficiency: requires less data than training from scratch. Improved Performance: higher accuracy on target tasks.
  22. What is RLHF? Definition: an approach that uses human preferences to fine-tune models via reinforcement learning. Goal: align model outputs with human values and expectations.
  23. Proximal Policy Optimization (PPO). Algorithm: balances exploration and exploitation by optimizing a surrogate objective function. Use case: adjusts the policy network to produce desired outputs. Used for training OpenAI's GPT-4o, etc. A minimal sketch of the clipped surrogate loss follows below.
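     A minimal sketch (not OpenAI's actual training code) of PPO's commonly used clipped surrogate objective, assuming per-token log-probabilities from the current and old policies and advantage estimates are already available:

     import torch

     def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
         # Probability ratio r_t = pi_new(a|s) / pi_old(a|s)
         ratio = torch.exp(logprobs_new - logprobs_old)
         # Unclipped vs. clipped surrogate terms
         unclipped = ratio * advantages
         clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
         # PPO maximizes the minimum of the two; the loss is its negation
         return -torch.min(unclipped, clipped).mean()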

  25. Reward Trainer w/ TRL
     from peft import LoraConfig, TaskType
     from transformers import AutoModelForSequenceClassification, AutoTokenizer
     from trl import RewardTrainer, RewardConfig

     model = AutoModelForSequenceClassification.from_pretrained("gpt2")
     peft_config = LoraConfig(
         task_type=TaskType.SEQ_CLS,
         inference_mode=False,
         r=8,
         lora_alpha=32,
         lora_dropout=0.1,
     )
     # ...
     trainer = RewardTrainer(
         model=model,
         args=training_args,
         processing_class=tokenizer,
         train_dataset=dataset,
         peft_config=peft_config,
     )
     trainer.train()
  26. Direct Preference Optimization (DPO). Concept: simplifies RLHF by optimizing directly on preference data, without the need for a separate reward model. Advantage: reduces complexity and training time. A hedged TRL sketch follows below.
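     A hedged offline-DPO sketch with TRL (class and argument names assume a recent TRL release; the model and the preference dataset, which needs prompt/chosen/rejected columns, are illustrative):

     from datasets import load_dataset
     from transformers import AutoModelForCausalLM, AutoTokenizer
     from trl import DPOConfig, DPOTrainer

     model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
     tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')

     # A preference dataset with "prompt", "chosen", and "rejected" columns
     train_dataset = load_dataset('trl-lib/ultrafeedback_binarized', split='train')

     training_args = DPOConfig(
         output_dir='./dpo-results',
         per_device_train_batch_size=1,
         beta=0.1,  # strength of the implicit KL penalty against the reference model
     )
     trainer = DPOTrainer(
         model=model,
         args=training_args,
         train_dataset=train_dataset,
         processing_class=tokenizer,
     )
     trainer.train()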

  29. Setting Up Online DPO
     1. Install TRL:
     pip install trl
     2. Define the custom judge:
     def custom_judge(response):
         # Implement custom logic to evaluate the response
         # Return a scalar reward
         return reward
  30. 3. Initialize the Model and Optimizer
     import torch
     from transformers import AutoModelForCausalLM, AutoTokenizer

     model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
     tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
  31. 4. Make a Judge Function
     from trl import OpenAIPairwiseJudge
     judge = OpenAIPairwiseJudge(
         model_name="gpt-4o",
         system_prompt="...",
     )
     Default system prompt: https://github.com/huggingface/trl/blob/b80c1a6/trl/trainer/judges.py#L35-L61

  33. 5. Training Loop with Online DPO
     from trl import OnlineDPOConfig, OnlineDPOTrainer

     training_args = OnlineDPOConfig(
         output_dir="aya-expanse-8b-OnlineDPO",
         logging_steps=1,
         # max_steps=100,
         report_to='tensorboard',
         bf16=True,
         per_device_train_batch_size=1,
         gradient_checkpointing="unsloth",
         max_new_tokens=2000,
     )
     trainer = OnlineDPOTrainer(
         model=model,
         judge=judge,
         args=training_args,
         processing_class=tokenizer,
         train_dataset=train_dataset,
     )
     trainer.train()
  34. Benefits of RLHF. Alignment: ensures model outputs align with human values. Safety: reduces harmful or biased outputs. Quality Improvement: enhances the usefulness of generated content.
  35. What is PEFT? Definition: techniques that fine-tune only a small number of parameters, reducing computational resources. Hugging Face: PEFT library. Supported methods: LoRA, QLoRA, etc.
  36. Introduction to LoRA (Low-Rank Adaptation): decomposes weight updates into low-rank matrices. Keeps original weights frozen. Efficient and memory-saving. A minimal numerical sketch follows below.
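     A minimal numerical sketch (not the PEFT implementation) of the low-rank idea: the frozen weight W is untouched and only the small factors A and B are trained, so the update B @ A has rank at most r. The dimensions chosen here are illustrative.

     import torch

     d_out, d_in, r, alpha = 4096, 4096, 8, 32

     W = torch.randn(d_out, d_in)        # frozen pre-trained weight
     A = torch.randn(r, d_in) * 0.01     # trainable low-rank factor (r x d_in)
     B = torch.zeros(d_out, r)           # trainable low-rank factor (d_out x r), zero-initialized

     x = torch.randn(d_in)
     h = W @ x + (alpha / r) * (B @ (A @ x))  # LoRA-adapted forward pass

     full, lora = W.numel(), A.numel() + B.numel()
     print(f"LoRA trains {lora:,} of {full:,} parameters ({100 * lora / full:.2f}%)")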
  37. Use LoRA via PEFT
     1. Install PEFT library:
     pip install peft
     2. Load pre-trained model:
     from transformers import AutoModelForCausalLM, AutoTokenizer
     from peft import get_peft_model, LoraConfig

     model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
     tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
  38. 3. Configure LoRA
     config = LoraConfig(
         r=8,
         lora_alpha=32,
         target_modules='all-linear',  # or ['q_proj', 'v_proj']
         lora_dropout=0.1,
         bias='none',
     )
     model = get_peft_model(model, config)
     4. Same training loop as SFT/RLHF.
  39. Advantages of LoRA. Efficiency: reduces GPU memory usage. Speed: faster training times. Scalability: easier to fine-tune very large models.
  40. Why Quantize Models? Reduce Memory Footprint: smaller models consume less memory. Increase Inference Speed: quantized models often run faster. Deploy on Edge Devices: makes deployment on resource-constrained devices feasible. A rough memory estimate follows below.
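     A rough back-of-the-envelope sketch (weights only; activations and KV cache are ignored) of how precision drives the memory footprint, using an assumed 8B-parameter model as the example:

     params = 8e9  # e.g., an 8B-parameter model

     for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
         gib = params * bytes_per_param / 1024**3
         print(f"{name:>9}: ~{gib:.1f} GiB of weights")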
  41. Introduction to QLoRA. QLoRA combines quantization with LoRA to enable fine-tuning large models on a single GPU.
  42. 2. Load Model. You can use 4-bit or 8-bit quantized models: use load_in_4bit=True for 4-bit quantized models, and load_in_8bit=True for 8-bit quantized models.
     from transformers import AutoTokenizer, AutoModelForCausalLM

     model = AutoModelForCausalLM.from_pretrained(
         'meta-llama/Meta-Llama-3.1-8B-Instruct',
         load_in_4bit=True,
         device_map='auto',
     )
     tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
  43. 3. Apply LoRA
     from peft import get_peft_model, LoraConfig

     config = LoraConfig(
         r=8,
         lora_alpha=32,
         target_modules='all-linear',  # or ['q_proj', 'v_proj']
         lora_dropout=0.1,
         bias='none',
         task_type='CAUSAL_LM',
     )
     model = get_peft_model(model, config)
     4. Same training loop as SFT/RLHF.
  44. Benefits of QLoRA. Resource Efficiency: fine-tune 70B+ parameter models on a single GPU. Performance: minimal loss in model accuracy. Cost-Effective: reduces the need for expensive hardware.
  45. What is RAFT? Definition: a technique that combines retrieval mechanisms with fine-tuning to enhance model performance on specific knowledge domains.
  46. Why Use RAFT? Domain-Specific Knowledge: incorporate up-to-date or specialized information. Improved Accuracy: provides relevant context that the base model may lack. Dynamic Updating: easily update the retrieval database without retraining the model.
  47. What do we need? Raw dataset + context (negative and positive samples). LLM API for creating Q/A pairs. Retrieval API for creating context (not covered in this talk). SFT Trainer from TRL. A sketch of assembling one such example follows below.
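     A minimal sketch of assembling one RAFT-style training example: the question plus a mix of relevant ("positive") and distractor ("negative") documents in the context, with the answer as the target. The field names and prompt template are illustrative, not the exact format used in the Colab:

     def build_raft_example(question, answer, positive_docs, negative_docs):
         # Mix relevant and distractor documents so the model learns to use the right one.
         context = "\n\n".join(positive_docs + negative_docs)
         prompt = (
             "### Context:\n" + context +
             "\n\n### Instruction:\n" + question +
             "\n\n### Response:\n"
         )
         return {"text": prompt + answer}

     example = build_raft_example(
         question="What is the capital of Korea?",
         answer="The capital of Korea is Seoul.",
         positive_docs=["Seoul is the capital and largest city of South Korea."],
         negative_docs=["Busan is a major port city in South Korea."],
     )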
  48. Setting Up the Environment. 1. Open Google Colab. Colab URL: https://beomi.net/sk-2411/raft 2. Enable GPU runtime.
  49. SFT with QLoRA. Step 1: Install Dependencies
     !pip install -q -U transformers
     !pip install -q datasets accelerate bitsandbytes lomo-optim hf_transfer trl
     !pip install -q flash-attn --no-build-isolation
  50. Step 2: Load a Quantized Model
     import torch
     from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

     model_id = "beomi/Solar-Ko-Recovery-11B"
     tokenizer = AutoTokenizer.from_pretrained(model_id)
     model = AutoModelForCausalLM.from_pretrained(
         model_id,
         device_map="auto",
         quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # nn.Linear to 4-bit
         torch_dtype=torch.bfloat16,  # everything else to bfloat16
         attn_implementation="flash_attention_2",
     )
  51. Step 3: Apply LoRA
     from peft import get_peft_model, LoraConfig

     config = LoraConfig(
         r=16,
         lora_alpha=32,
         target_modules='all-linear',  # or ['q_proj', 'v_proj']
         lora_dropout=0.05,
         bias='none',
         task_type='CAUSAL_LM',
     )
     model = get_peft_model(model, config)
  52. Step 4: Prepare Dataset
     from unsloth.chat_templates import get_chat_template

     tokenizer = get_chat_template(
         tokenizer,
         chat_template = "chatml",  # supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
         mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
         map_eos_token = True,  # maps <|im_end|> to </s> instead
     )

     def formatting_prompts_func(examples):
         convos = examples["messages"]
         texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
         return {"text": texts}

     from datasets import load_dataset
     dataset = load_dataset("beomi/KoAlpaca-RealQA-oai", split="train")
     dataset = dataset.map(formatting_prompts_func, batched=True)

  55. Step 5: Fine-Tune the Model
     from trl import SFTTrainer
     from transformers import TrainingArguments
     from unsloth import is_bfloat16_supported

     max_seq_length = 4096  # assumed value; set to match your model's context window

     trainer = SFTTrainer(
         model = model,
         tokenizer = tokenizer,
         train_dataset = dataset,
         dataset_text_field = "text",
         max_seq_length = max_seq_length,
         dataset_num_proc = 2,
         packing = False,  # can make training 5x faster for short sequences
         args = TrainingArguments(
             per_device_train_batch_size = 32,
             gradient_accumulation_steps = 4,
             warmup_steps = 10,
             num_train_epochs = 3,  # set this for one full training run
             # max_steps = 60,
             learning_rate = 2e-4,
             fp16 = not is_bfloat16_supported(),
             bf16 = is_bfloat16_supported(),
             logging_steps = 1,
             optim = "adamw_8bit",
             weight_decay = 0.01,
             lr_scheduler_type = "linear",
             seed = 3407,
             output_dir = "outputs",
             report_to = "none",  # use this for WandB etc.
         ),
     )

  58. Setting Up the Environment. 1. Open Google Colab. Colab URL: https://beomi.net/sk-2411/onlinedpo 2. Enable GPU runtime.

  60. Converting a LoRA Model to GGUF Format. 1. Convert the model. LoRA converter: https://huggingface.co/spaces/beomi/gguf-my-lora (*the base model should use https://huggingface.co/spaces/ggml-org/gguf-my-repo)
  61. Inference with Llama.cpp on MacBook
     1. Install Llama.cpp: https://github.com/ggerganov/llama.cpp/releases
     2. Download the model: https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-F16-GGUF/tree/main
        Download the LoRA gguf file: KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
     3. Load the model:
     ./llama-server \
       --hf-repo beomi/Solar-Ko-Recovery-11B-Q8_0-GGUF \
       --hf-file solar-ko-recovery-11b-q8_0.gguf \
       -c 2048 \
       --lora KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf

  63. 3. Generate Text. Go to http://localhost:8080. Tip: use | as a stop token -- {"stop": ["|"]}. A hedged request sketch follows below.

  65. Glossary
     LLM: Large Language Model
     SFT: Supervised Fine-Tuning
     RLHF: Reinforcement Learning from Human Feedback
     PEFT: Parameter-Efficient Fine-Tuning
     LoRA: Low-Rank Adaptation
     QLoRA: Quantized LoRA
     RAFT: Retrieval-Augmented Fine-Tuning
     RAG: Retrieval-Augmented Generation
     PPO: Proximal Policy Optimization
     DPO: Direct Preference Optimization