
[2024.11.27] SK WaveHill Meetup - LLM Fine-tuning

Beomi
November 27, 2024


A comprehensive guide covering the path from fine-tuning Large Language Models (LLMs) to on-device deployment, presented by Lee Junbum.

This presentation covers:
- Overview of LLMs: Capabilities, recent advancements, and the importance of fine-tuning for customization, efficiency, and performance.
- Key fine-tuning techniques: Includes Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Parameter-Efficient Fine-Tuning (PEFT), and Retrieval-Augmented Fine-Tuning (RAFT).
- Hands-on tutorials: Practical steps using tools like Hugging Face, Google Colab, and Llama.cpp to fine-tune, deploy, and optimize models.
- Model deployment: Strategies for efficient deployment with quantization techniques like QLoRA in resource-constrained environments such as a MacBook.

Aimed at AI practitioners and researchers, this deck provides actionable insights into taking LLMs from pre-training to impactful real-world applications.


Transcript

  1. Agenda 1. Introduction to LLMs and Recent Trends 2. Understanding

    Pre-training and Post-training 3. Supervised Fine-Tuning (SFT) 4. Reinforcement Learning from Human Feedback (RLHF) 5. Parameter-Efficient Fine-Tuning (PEFT) with LoRA 6. Quantization Techniques (QLoRA) 7. Retrieval-Augmented Fine-Tuning (RAFT) 8. Hands-on Tutorials with Google Colab 9. Model Conversion and Inference with Llama.cpp 10. Q&A Session 3
  2. What are Large Language Models (LLMs)? Definition: LLMs are deep

    learning models with billions of parameters trained on vast amounts of text data. Capabilities: Natural language understanding, text generation, translation, summarization. 5
  3. Recent Advancements in LLMs OpenAI GPT-4o, o1: Improved reasoning and

    understanding, multimodal capabilities. Google's Gemini: Combines a 2M-token context window with strong language understanding. Anthropic's Claude: Focuses on safe and responsible AI. Open-source LLMs: Rapid growth in community-driven models, e.g., Meta Llama, Google Gemma, Alibaba Qwen, ... 6
  4. 7

  5. Why Fine-Tune LLMs? Customization: Tailor models to specific domains or

    tasks. Performance: Enhance accuracy on specialized datasets. Efficiency: Reduce inference time and computational resources. Control: Implement safety measures and bias mitigation. 8
  6. 9

  7. 10

  8. 11

  9. Pre-training Definition: Training a model on large-scale datasets to learn

    general language patterns. Characteristics: Unsupervised learning, massive datasets (Common Crawl, Wikipedia), foundation for downstream tasks. 13
  10. Post-training Definition: Further training of a pre-trained model to improve

    performance on specific tasks. Includes: Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF). 14
  11. SFT vs RLHF SFT: Uses labeled datasets; directly adjusts model

    weights based on supervised signals. RLHF: Incorporates human preferences; uses reinforcement learning algorithms. 15
  12. What is SFT? Process: Training a pre-trained model on labeled

    task-specific data. Objective: Align the model's outputs with desired responses. 17
  13. Steps in SFT 1. Data Collection: Curate a dataset relevant

    to the target task. 2. Data Preprocessing: Clean and tokenize data. 3. Fine-Tuning: Adjust model weights using supervised learning. 4. Evaluation: Assess performance on validation data. 18
  14. SFT Dataset Format Input: Prompt / Context. Output: Response. Example:

    { "instruction": "What is the capital of Korea?", "output": "The capital of Korea is Seoul." } SFT Datasets: Alpaca, KoAlpaca / KoAlpaca-RealQA 19
  15. 23

  16. SFT with Hugging Face Hugging Face Hub, Transformers Library. Install:

    pip install transformers datasets trl 24
  17. Example: SFT Solar-Ko with KoAlpaca-RealQA

    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
    from trl import SFTTrainer
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    train_dataset = load_dataset('beomi/KoAlpaca-RealQA')

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_total_limit=1,
        logging_strategy='steps',
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset['train'],
        tokenizer=tokenizer,
    )
    trainer.train()
    25
  18. Line-by-Line: Load Model and Tokenizer

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')

    Load Dataset

    from datasets import load_dataset

    train_dataset = load_dataset('beomi/KoAlpaca-RealQA')
    26
  19. Training Arguments

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_total_limit=1,
        logging_strategy='steps',
    )

    Initialize SFT Trainer & Train

    from trl import SFTTrainer

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset['train'],
        tokenizer=tokenizer,
    )
    trainer.train()
    27
  20. Inference Examples

    ### Instruction:
    안녕하세요 (Hello)
    ### Response:
    안녕하세요! 어떻게 도와드릴까요?</s> (Hello! How can I help you?)

    ### Instruction:
    아래 글을 한국어로 번역해줘. (Translate the text below into Korean.)
    The KoAlpaca-RealQA dataset is a unique Korean instruction dataset designed to closely reflect real user interactions in the Korean language.
    ### Response:
    KoAlpaca-RealQA 데이터셋은 한국어 사용자들의 실제 상호작용을 매우 잘 반영하도록 설계된 독특한 한국어 지시 데이터셋입니다.</s>
    28
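    A minimal generation sketch for trying the fine-tuned model locally; the checkpoint path './results' is the output directory used above, the prompt template mirrors the examples on this slide, and the decoding settings are illustrative:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    ckpt = './results'  # directory where the trainer saved the fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map='auto')

    prompt = "### Instruction:\n안녕하세요\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Print only the newly generated tokens after the prompt.
    print(tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))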
  21. Benefits of SFT Task Specialization: Model becomes adept at specific

    tasks. Data Efficiency: Requires less data than training from scratch. Improved Performance: Higher accuracy on target tasks. 29
  22. What is RLHF? Definition: An approach that uses human preferences

    to fine-tune models via reinforcement learning. Goal: Align model outputs with human values and expectations. 31
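    For intuition, a preference record pairs one prompt with a preferred and a rejected completion; a minimal sketch (the field names follow the chosen/rejected convention commonly used by TRL preference datasets, the texts are invented):

    # One human-preference example: "chosen" is preferred over "rejected".
    preference_example = {
        "prompt": "What is the capital of Korea?",
        "chosen": "The capital of Korea is Seoul.",
        "rejected": "I am not sure.",
    }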
  23. Proximal Policy Optimization (PPO) Algorithm: Balances exploration and exploitation by

    optimizing a surrogate objective function. Use Case: Adjusts the policy network to produce desired outputs. Used for training OpenAI's GPT-4o, etc. 33
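    As a rough illustration of that surrogate objective, the per-sample PPO clipped loss can be sketched as below (a minimal PyTorch sketch; the variable names and the clip range eps=0.2 are illustrative and not tied to any specific TRL release):

    import torch

    def ppo_clipped_loss(logprobs, old_logprobs, advantages, eps=0.2):
        # Probability ratio between the current policy and the rollout (old) policy.
        ratio = torch.exp(logprobs - old_logprobs)
        # Clipped surrogate objective: keep the more pessimistic of the two terms.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()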
  24. 34

  25. Reward Trainer w/ TRL

    from peft import LoraConfig, TaskType
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from trl import RewardTrainer, RewardConfig

    model = AutoModelForSequenceClassification.from_pretrained("gpt2")
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    # ...
    trainer = RewardTrainer(
        model=model,
        args=training_args,
        processing_class=tokenizer,
        train_dataset=dataset,
        peft_config=peft_config,
    )
    trainer.train()
    37
  26. Direct Preference Optimization (DPO) Concept: Simplifies RLHF by directly optimizing

    preferences without the need for reward models. Advantage: Reduces complexity and training time. 38
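    The core idea can be sketched as a single loss over preference pairs (a minimal PyTorch sketch of the DPO objective; log-probabilities are assumed to be summed over each response, and beta=0.1 is illustrative):

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Log-ratio of the policy vs. the frozen reference model for each response.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin between chosen and rejected responses.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()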
  27. 39

  28. 42

  29. Setting Up Online DPO

    1. Install TRL

    pip install trl

    2. Define the Custom Judge

    def custom_judge(response):
        # Implement custom logic to evaluate the response
        # Return a scalar reward
        return reward
    43
  30. 3. Initialize the Model and Optimizer

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
    44
  31. 4. Make a Judge Function

    from trl import OpenAIPairwiseJudge

    judge = OpenAIPairwiseJudge(
        model_name="gpt-4o",
        system_prompt="...",
    )

    Default system prompt: https://github.com/huggingface/trl/blob/b80c1a6/trl/trainer/judges.py#L35-L61
    45
  32. 46

  33. 5. Training Loop with DPO

    from trl import OnlineDPOConfig, OnlineDPOTrainer

    training_args = OnlineDPOConfig(
        output_dir="aya-expanse-8b-OnlineDPO",
        logging_steps=1,
        # max_steps=100,
        report_to='tensorboard',
        bf16=True,
        per_device_train_batch_size=1,
        gradient_checkpointing="unsloth",
        max_new_tokens=2000,
    )
    trainer = OnlineDPOTrainer(
        model=model,
        judge=judge,
        args=training_args,
        processing_class=tokenizer,
        train_dataset=train_dataset,
    )
    trainer.train()
    48
  34. Benefits of RLHF Alignment: Ensures model outputs align with human

    values. Safety: Reduces harmful or biased outputs. Quality Improvement: Enhances the usefulness of generated content. 49
  35. What is PEFT? Definition: Techniques that fine-tune models with fewer

    parameters, reducing computational resources. Hugging Face: PEFT library. Supported Methods: LoRA, QLoRA, etc. 51
  36. Introduction to LoRA LoRA (Low-Rank Adaptation): Decomposes weight updates into

    low-rank matrices. Keeps original weights frozen. Efficient and memory-saving. 52
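    A minimal sketch of the idea for a single linear layer (illustrative PyTorch, not the PEFT implementation): the frozen weight is augmented with a trainable low-rank update B·A scaled by alpha/r.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=32):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # original weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            # y = base(x) + scale * x A^T B^T, where only A and B are trained
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)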
  37. Use LoRA via PEFT

    1. Install PEFT Library

    pip install peft

    2. Load Pre-trained Model

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import get_peft_model, LoraConfig

    model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    53
  38. 3. Configure LoRA

    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.1,
        bias='none',
    )
    model = get_peft_model(model, config)

    4. Same training loop as SFT/RLHF
    54
  39. Advantages of LoRA Efficiency: Reduces GPU memory usage. Speed: Faster

    training times. Scalability: Easier to fine-tune very large models. 55
  40. Why Quantize Models? Reduce Memory Footprint: Smaller models consume less

    memory. Increase Inference Speed: Quantized models often run faster. Deploy on Edge Devices: Makes deployment on resource-constrained devices feasible. 57
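    A quick back-of-the-envelope check of the weight-only memory footprint for an ~11B-parameter model at different precisions (a rough sketch that ignores activations, the KV cache, and optimizer state):

    params = 11e9  # ~11B parameters, e.g. a Solar-Ko-Recovery-11B-sized model
    for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
        gib = params * bits / 8 / 1024**3
        print(f"{name}: ~{gib:.1f} GiB")  # ~20.5, ~10.2, ~5.1 GiB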
  41. Introduction to QLoRA QLoRA: Combines quantization with LoRA to enable

    fine-tuning large models on a single GPU. 58
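    In the Hugging Face stack this usually means loading the base model in 4-bit NF4 via bitsandbytes and training only LoRA adapters on top; a minimal loading sketch (the NF4, double-quantization, and bfloat16 compute settings follow the QLoRA recipe, and the model name is just an example):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        'meta-llama/Meta-Llama-3.1-8B-Instruct',
        quantization_config=bnb_config,
        device_map='auto',
    )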
  42. 2. Load Model

    You can use 4-bit or 8-bit quantized models.
    Use load_in_4bit=True for 4-bit quantized models.
    Use load_in_8bit=True for 8-bit quantized models.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        'meta-llama/Meta-Llama-3.1-8B-Instruct',
        load_in_4bit=True,
        device_map='auto',
    )
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
    60
  43. 3. Apply LoRA

    from peft import get_peft_model, LoraConfig

    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM',
    )
    model = get_peft_model(model, config)

    4. Same training loop as SFT/RLHF
    61
  44. Benefits of QLoRA Resource Efficiency: Fine-tune 70B+ parameter models on

    a single GPU. Performance: Minimal loss in model accuracy. Cost-Effective: Reduces the need for expensive hardware. 62
  45. What is RAFT? Definition: A technique that combines retrieval mechanisms

    with fine-tuning to enhance model performance on specific knowledge domains. 64
  46. Why Use RAFT? Domain-Specific Knowledge: Incorporate up-to-date or specialized information.

    Improved Accuracy: Provides relevant context that the base model may lack. Dynamic Updating: Easily update the retrieval database without retraining the model. 65
  47. What do we need? Raw Dataset + Context (negative and positive

    samples). LLM API for creating Q/A pairs. Retrieval API for creating context (not covered in this talk). SFT Trainer from TRL. How these pieces can be assembled into training text is sketched below. 72
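    A minimal sketch of turning one such record into SFT text (the field names, distractor setup, and prompt template are illustrative): RAFT-style training mixes the relevant "golden" context with irrelevant distractor contexts so the model learns to answer from the right document.

    def build_raft_example(question, golden_doc, distractor_docs, answer):
        # Mix the relevant (positive) context with irrelevant (negative) ones.
        contexts = distractor_docs + [golden_doc]
        context_block = "\n\n".join(f"[Doc {i+1}] {doc}" for i, doc in enumerate(contexts))
        prompt = (
            f"### Context:\n{context_block}\n\n"
            f"### Instruction:\n{question}\n\n"
            f"### Response:\n"
        )
        return {"text": prompt + answer}

    example = build_raft_example(
        question="What is the capital of Korea?",
        golden_doc="Seoul is the capital and largest city of South Korea.",
        distractor_docs=["Tokyo is the capital of Japan."],
        answer="The capital of Korea is Seoul.",
    )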
  48. Setting Up the Environment 1. Open Google Colab Colab URL:

    https://beomi.net/sk-2411/raft 2. Enable GPU Runtime 73
  49. SFT with QLoRA

    Step 1: Install Dependencies

    !pip install -q -U transformers
    !pip install -q datasets accelerate bitsandbytes lomo-optim hf_transfer trl
    !pip install -q flash-attn --no-build-isolation
    74
  50. Step 2: Load a Quantized Model

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

    model_id = "beomi/Solar-Ko-Recovery-11B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # nn.Linear to 4-bit
        torch_dtype=torch.bfloat16,  # everything else to bfloat16
        attn_implementation="flash_attention_2",
    )
    75
  51. Step 3: Apply LoRA

    from peft import get_peft_model, LoraConfig

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules='all-linear',  # ['q_proj', 'v_proj'],
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM',
    )
    model = get_peft_model(model, config)
    76
  52. Step 4: Prepare Dataset

    from unsloth.chat_templates import get_chat_template

    tokenizer = get_chat_template(
        tokenizer,
        chat_template = "chatml",  # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
        mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
        map_eos_token = True,  # Maps <|im_end|> to </s> instead
    )

    def formatting_prompts_func(examples):
        convos = examples["messages"]
        texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
        return {"text": texts}

    from datasets import load_dataset

    dataset = load_dataset("beomi/KoAlpaca-RealQA-oai", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)
    77
  53. 78

  54. 79

  55. Step 5: Fine-Tune the Model

    from trl import SFTTrainer
    from transformers import TrainingArguments
    from unsloth import is_bfloat16_supported

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        dataset_num_proc = 2,
        packing = False,  # Can make training 5x faster for short sequences.
        args = TrainingArguments(
            per_device_train_batch_size = 32,
            gradient_accumulation_steps = 4,
            warmup_steps = 10,
            num_train_epochs = 3,  # Set this for 1 full training run.
            # max_steps = 60,
            learning_rate = 2e-4,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none",  # Use this for WandB etc
        ),
    )
    80
  56. 83

  57. 84

  58. Setting Up the Environment 1. Open Google Colab Colab URL:

    https://beomi.net/sk-2411/onlinedpo 2. Enable GPU Runtime 85
  59. 88

  60. Converting LoRA Model to GGUF Format 1. Convert the Model

    LoRA Converter: https://huggingface.co/spaces/beomi/gguf-my-lora
    * The base model should be converted with https://huggingface.co/spaces/ggml-org/gguf-my-repo
    89
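    If you prefer to stay local, one alternative (a sketch under assumptions, not the workflow shown on the slide) is to merge the LoRA adapter into the base weights with PEFT and then convert the merged checkpoint to GGUF with llama.cpp's converter or the gguf-my-repo space; the paths below are illustrative.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
    merged = PeftModel.from_pretrained(base, './outputs')  # hypothetical path to the trained LoRA adapter
    merged = merged.merge_and_unload()                     # bake the adapter into the base weights
    merged.save_pretrained('./merged-model')
    AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B').save_pretrained('./merged-model')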
  61. Inference with Llama.cpp on MacBook

    1. Install Llama.cpp
    https://github.com/ggerganov/llama.cpp/releases

    2. Download the Model
    https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-F16-GGUF/tree/main
    Download the LoRA gguf file: KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf

    3. Load the Model
    ./llama-server \
      --hf-repo beomi/Solar-Ko-Recovery-11B-Q8_0-GGUF \
      --hf-file solar-ko-recovery-11b-q8_0.gguf \
      -c 2048 --lora KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
    90
  62. 91

  63. 3. Generate Text

    Go to http://localhost:8080
    Tip: Use | as a stop token: {"stop": ["|"]}
    92
  64. 93

  65. Glossary LLM: Large Language Model SFT: Supervised Fine-Tuning RLHF: Reinforcement

    Learning from Human Feedback PEFT: Parameter-Efficient Fine-Tuning LoRA: Low-Rank Adaptation QLoRA: Quantized LoRA RAFT: Retrieval-Augmented Fine-Tuning RAG: Retrieval-Augmented Generation PPO: Proximal Policy Optimization DPO: Direct Preference Optimization 97