Slide 1

Introduction to LLM Post-Training
Maxime Labonne

Slide 2

About Maxime Labonne
- PhD, Head of Post-Training @ Liquid AI, Google Developer Expert (GDE)
- Author of the LLM Engineer's Handbook
- Writing: blog posts, LLM Course (>45k stars on GitHub)
- Models: NeuralDaredevil, AlphaMonarch, Phixtral
- Tools: LLM AutoEval, LazyMergekit, AutoQuant
@maximelabonne | maxime-labonne

Slide 3

From base model to chat model:
- Pre-training: raw text → base model (autocompletes prompts)
- Post-training:
  - Supervised fine-tuning: instructions → instruct model (follows instructions)
  - Preference alignment: preferences → chat model (optimized for humans)
Find more information in the LLM Course repo on GitHub: https://github.com/mlabonne/llm-course

Slide 4

Number of samples by scope:
- Task-specific (e.g., spell checker): 10k-100k samples
- Domain-specific (e.g., medical LLM): 100k-1M samples
- General-purpose (e.g., LFMs): >1M samples
Fine-tuning covers the task- and domain-specific end of this scale; post-training covers the general-purpose end.

Slide 5

When fine-tuning? Start with in-context learning and RAG, then evaluate. Fine-tune to:
- Increase output quality
- Reduce cost and latency
- Add (superficial) knowledge
- Change tone and format

Slide 6

Dataset.

Slide 7

What is a good dataset?
- Accuracy: factually accurate information
- Diversity: covers a wide range of topics
- Complexity: non-trivial tasks forcing reasoning
Find more information in the LLM Datasets repo on GitHub: https://github.com/mlabonne/llm-datasets

Slide 8

Data formats

Instruction data:
- System (optional): You are a helpful assistant.
- Instruction: Remove the spaces from the following sentence: Fine-tuning is simple.
- Output: Fine-tuningissimple.

Preference data:
- System (optional): You are a helpful assistant with a great sense of humor.
- Instruction: Tell me a joke about octopuses.
- Chosen answer: Why don't octopuses play cards in casinos? Because they can't count past eight.
- Rejected answer: How many tickles does it take to make an octopus laugh? Ten tickles.
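
In practice, both formats are usually stored as JSON records, one per sample. A minimal sketch is below; the field names follow common conventions (e.g., "chosen"/"rejected" as in UltraFeedback-style datasets) and are illustrative, not a fixed standard.

```python
# Instruction and preference samples as JSON-style records.
# Field names are common conventions, not a fixed standard.

instruction_sample = {
    "system": "You are a helpful assistant.",  # optional
    "instruction": "Remove the spaces from the following sentence: Fine-tuning is simple.",
    "output": "Fine-tuningissimple.",
}

preference_sample = {
    "system": "You are a helpful assistant with a great sense of humor.",
    "instruction": "Tell me a joke about octopuses.",
    "chosen": "Why don't octopuses play cards in casinos? Because they can't count past eight.",
    "rejected": "How many tickles does it take to make an octopus laugh? Ten tickles.",
}
```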

Slide 9

Synthetic data generation pipeline:
- Seed data: raw text, prompts, answers
- Instruction/answer generation: generate, backtranslate, evolve
- Scoring + filtering: heuristics, LLM-as-a-judge, reward model
- Data deduplication + decontamination: exact deduplication, fuzzy deduplication (e.g., MinHash)
- Data filtering: length filtering, keyword exclusion, format checking
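
For the fuzzy-deduplication step, here is a minimal MinHash + LSH sketch using the datasketch library; the similarity threshold and word-level shingling are illustrative choices, not fixed recommendations.

```python
# Fuzzy deduplication with MinHash + LSH (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # word shingles; n-grams also work
        m.update(token.encode("utf8"))
    return m

samples = [
    "Remove the spaces from the following sentence: Fine-tuning is simple.",
    "Remove the spaces from this sentence: Fine-tuning is simple.",  # near-duplicate
    "Tell me a joke about octopuses.",
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity threshold
kept = []
for i, text in enumerate(samples):
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate among the samples kept so far
        lsh.insert(str(i), m)
        kept.append(text)

print(f"Kept {len(kept)} of {len(samples)} samples")
```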

Slide 10

SFT example: instruction following
Pipeline: prompts + constraints → query an LLM → run tests → decontaminate (7-gram overlap with IFEval) → keyword exclusion
Example: Write a detailed review of the movie "The Social Network". Your entire response should be in English and all lower case (no capital letters whatsoever).
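
The decontamination step can be as simple as dropping any training sample that shares a 7-gram with the evaluation set. A pure-Python sketch follows; the inline IFEval prompt stands in for the real benchmark data, which would normally be loaded from disk.

```python
# Decontamination sketch: drop training samples sharing any 7-gram with an
# evaluation set (here, a stand-in for IFEval prompts).

def ngrams(text: str, n: int = 7) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Assumed to be loaded from the IFEval benchmark; shown inline for illustration.
eval_prompts = [
    'Write a detailed review of the movie "The Social Network". '
    "Your entire response should be in English and all lower case.",
]
eval_ngrams = set().union(*(ngrams(p) for p in eval_prompts))

def is_contaminated(sample: str) -> bool:
    return not ngrams(sample).isdisjoint(eval_ngrams)
```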

Slide 11

Preference example: UltraFeedback
Pipeline: prompts → query LLM 1, LLM 2, … LLM n → judge LLM → remove duplicates → remove short answers
Ganqu Cui et al. "UltraFeedback: Boosting Language Models with Scaled AI Feedback." arXiv preprint arXiv:2310.01377, October 2023.
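
A sketch of this recipe: sample completions from several models, score them with a judge, and keep the best/worst pair. The `generate` and `judge_score` callables are hypothetical stand-ins for real API calls, and the filters mirror the slide's dedup and short-answer steps.

```python
# UltraFeedback-style preference pairs from multiple model completions.
# `generate` and `judge_score` are hypothetical stand-ins for real API calls.

def build_preference_pair(prompt: str, models: list, generate, judge_score):
    completions = [generate(model, prompt) for model in models]
    scored = sorted(completions, key=judge_score, reverse=True)
    chosen, rejected = scored[0], scored[-1]
    # Filters from the slide: remove duplicates and short answers.
    if chosen == rejected or min(len(c.split()) for c in (chosen, rejected)) < 10:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```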

Slide 12

Case study: Open-PerfectBlend
(The slide shows a table mapping instruction categories to the datasets blended for each.)
Tengyu Xu et al. "The Perfect Blend: Redefining RLHF with Mixture of Judges." arXiv preprint arXiv:2409.20370, September 2024.

Slide 13

Chat templates

Storage format: Alpaca (other examples: ShareGPT, OpenAI)
- System (optional): You are a helpful assistant, who always provide explanation. Think like you are answering to a five-year-old.
- Instruction: Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on their device.
- Output: Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.

Chat template: ChatML (other examples: Llama 3, Mistral Instruct)
<|im_start|>system
You are a helpful assistant, who always provide explanation. Think like you are answering to a five-year-old.<|im_end|>
<|im_start|>user
Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on their device.<|im_end|>
<|im_start|>assistant
Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.<|im_end|>
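
You rarely write these strings by hand: the tokenizer renders them for you. A minimal sketch with transformers, assuming a model whose chat template is ChatML (OpenHermes-2.5 is one example):

```python
# Applying a ChatML chat template with transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Remove the spaces from the following sentence: Fine-tuning is simple."},
]

# Renders the messages into the model's template (ChatML here) and appends
# the assistant header so the model knows to start answering.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```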

Slide 14

Training.

Slide 15

Recommended fine-tuning libraries
- TRL: HF's library, most up-to-date in terms of algorithms
- Axolotl: additional features and reusable YAML configurations
- Unsloth: efficient single-GPU fine-tuning with useful utilities
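
As a reference point, here is a minimal SFT run with TRL; the model and dataset names are examples (trl-lib/Capybara is a small chat dataset with a "messages" column that TRL understands out of the box).

```python
# Minimal supervised fine-tuning sketch with TRL.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # small model, for illustration only
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output", num_train_epochs=1),
)
trainer.train()
```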

Slide 16

SFT techniques
- Full fine-tuning (16-bit precision): maximizes quality; very high VRAM usage
- LoRA (16-bit precision): fastest training; high VRAM usage
- QLoRA (4-bit precision): low VRAM usage; degrades performance
Figure adapted from Yuhui Xu et al. "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2309.14717, 2023.
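
A configuration sketch for the two adapter-based options using peft and transformers: QLoRA loads the base weights in 4-bit (NF4) before attaching LoRA adapters, while plain LoRA would skip the quantization step. The rank, target modules, and model name are illustrative defaults.

```python
# LoRA / QLoRA configuration sketch with peft + transformers + bitsandbytes.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: load the frozen base model in 4-bit NF4 (omit for plain 16-bit LoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters: only these low-rank matrices are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```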

Slide 17

Preference alignment techniques
- Proximal Policy Optimization (PPO): maximizes quality; very expensive & complex
- Direct Preference Optimization (DPO): fast and cheap to use; lower quality
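
DPO is the easier of the two to set up: it only needs a dataset with "prompt", "chosen", and "rejected" columns. A minimal TRL sketch, with example model and dataset names:

```python
# Minimal DPO sketch with TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-output", beta=0.1),  # beta: strength of the KL penalty
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```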

Slide 18

Training parameters
- Learning rate: how much the parameters are updated during training (common values: 1e-6 to 1e-3)
- Batch size: number of samples processed before updating parameters (commonly 8 or 16, effective)
- Max length: longest input, in tokens, the model can process (1024 to 4096)
- Epochs: number of passes through the entire training dataset (3 to 5)
- Optimizer: algorithm that updates the parameters to minimize the loss function (AdamW)
- Attention: implementation of the attention mechanism (FlashAttention-2)
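
A sketch of how these map to Hugging Face arguments; values mirror the slide's common ranges, and exact argument names can vary slightly between library versions (e.g., the max sequence length lives in TRL's SFTConfig rather than TrainingArguments).

```python
# Mapping the slide's parameters to Hugging Face training arguments.
from transformers import AutoModelForCausalLM, TrainingArguments

args = TrainingArguments(
    output_dir="output",
    learning_rate=2e-5,              # within the 1e-6 to 1e-3 range
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
    num_train_epochs=3,
    optim="adamw_torch",             # AdamW
)

# Attention implementation is chosen when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    attn_implementation="flash_attention_2",  # requires flash-attn installed
)
```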

Slide 19

Monitoring experiments
(Loss curves: a learning rate that is too high produces a loss spike; a good learning rate yields a "smooth" curve.)

Slide 20

Monitoring experiments

Slide 21

Evaluation.

Slide 22

Automated benchmarks
→ Calculate a metric based on generated and ground-truth answers (e.g., accuracy). Examples: MMLU, Open LLM Leaderboard
Advantages:
✅ Consistent & reproducible
✅ Cost-effective at scale
✅ Clear dimensions (e.g., math)
Limitations:
❌ Not how models are used
❌ Hard to evaluate complex tasks
❌ Risk of data contamination
Clémentine Fourrier and the Hugging Face community, "LLM Evaluation Guidebook," 2024.
(Screenshot: Open LLM Leaderboard by Hugging Face)
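
For instance, running MMLU with EleutherAI's lm-evaluation-harness looks roughly like the sketch below; the model name is an example and the five-shot setting matches the benchmark's usual protocol.

```python
# Running an automated benchmark with lm-evaluation-harness (lm_eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                 # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-0.5B",
    tasks=["mmlu"],
    num_fewshot=5,                              # MMLU is usually reported 5-shot
    batch_size=8,
)
print(results["results"]["mmlu"])
```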

Slide 23

Task-specific benchmarks
- Focused benchmarks
- Domain-specific benchmarks

Slide 24

Human judge
→ Ask humans to score model generations on specific properties (accuracy, relevance, toxicity, etc.). Examples: vibe checks, Chatbot Arena, data annotation
Advantages:
✅ High flexibility
✅ No data contamination risk
✅ Direct human preferences
Limitations:
❌ Costly & time-consuming
❌ Biased (tone, first impression, etc.)
❌ Limited scalability
Clémentine Fourrier and the Hugging Face community, "LLM Evaluation Guidebook," 2024.
(Screenshot: Chatbot Arena by LMSYS)

Slide 25

Human preferences are weakly correlated with automated benchmarks.
Jinjie Ni et al. "MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures." arXiv preprint arXiv:2406.06565, June 2024.

Slide 26

Judge LLM
→ Use an LLM to score model generations on specific properties (accuracy, relevance, toxicity, etc.). Examples: LLM-as-a-judge, reward models, small classifiers
Advantages:
✅ This is how models are used
✅ Can handle complex tasks
✅ Provides direct feedback
Limitations:
❌ Hidden biases (e.g., length, tone)
❌ Quality validation needed
❌ Costly at scale
Clémentine Fourrier and the Hugging Face community, "LLM Evaluation Guidebook," 2024.
(Screenshot: EQ-Bench by Samuel J. Paech)
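
A bare-bones LLM-as-a-judge sketch using the OpenAI client; the judge prompt, rating scale, and model name are illustrative choices rather than a standard recipe, and real pipelines add validation against the judge's own biases.

```python
# LLM-as-a-judge sketch. Prompt, scale, and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the following answer for accuracy and relevance
on a scale of 1 to 10. Reply with the number only.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())
```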

Slide 27

Create your own evaluation
- Start early and iterate a lot!
- Combine different types of evals
- Compare your models with others

Slide 28

Conclusion
- Dataset: data generation, curation, filtering, and exploration
- Fine-tuning: supervised fine-tuning, preference alignment, model merging
- Evaluation: general and task- or domain-specific benchmarks

Slide 29

@maximelabonne maxime-labonne