
Introduction to Post-Training - Tokyo AI

Maxime Labonne

February 05, 2025


Transcript

  1. About Maxime Labonne
     PhD, Head of Post-Training @ Liquid AI, GDE
     Author of the LLM Engineer's Handbook
     Writing: blog posts, LLM Course (>45k stars on GitHub)
     Models: NeuralDaredevil, AlphaMonarch, Phixtral
     Tools: LLM AutoEval, LazyMergekit, AutoQuant
     @maximelabonne | maxime-labonne
  2. Find more information in the LLM Course repo on GitHub:
     https://github.com/mlabonne/llm-course
     [Diagram: training pipeline]
     Pre-training: raw text → base model (autocompletes prompts)
     Post-training:
       Supervised fine-tuning: instructions → instruct model (follows instructions)
       Preference alignment: preferences → chat model (optimized for humans)
  3. Fine-tuning vs. post-training, by number of samples:
     General-purpose (e.g., LFMs): >1M samples (post-training)
     Domain-specific (e.g., medical LLM): 100k-1M samples
     Task-specific (e.g., spell checker): 10k-100k samples (fine-tuning)
  4. When fine-tuning?
     Start with in-context learning and RAG, and use evaluation to decide
     whether fine-tuning is needed.
     Good reasons to fine-tune: change tone and format, add (superficial)
     knowledge, reduce cost and latency, increase output quality.
  5. What is a good dataset?
     Accuracy: factually accurate information
     Diversity: covers a wide range of topics
     Complexity: non-trivial tasks forcing reasoning
     Find more information in the LLM Datasets repo on GitHub:
     https://github.com/mlabonne/llm-datasets
  6. Data formats
     Instruction data:
       System (optional): You are a helpful assistant.
       Instruction: Remove the spaces from the following sentence: Fine-tuning is simple.
       Output: Fine-tuningissimple.
     Preference data:
       System (optional): You are a helpful assistant with a great sense of humor.
       Instruction: Tell me a joke about octopuses.
       Chosen answer: Why don't octopuses play cards in casinos? Because they can't count past eight.
       Rejected answer: How many tickles does it take to make an octopus laugh? Ten tickles.
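     A minimal sketch of how these two formats are commonly stored as
     JSON-style records; the field names follow widespread conventions
     (e.g., TRL's conversational and preference formats) rather than
     anything prescribed by the slide:

     # Instruction sample: one conversation with a known-good answer.
     instruction_sample = {
         "messages": [
             {"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "Remove the spaces from the "
                                         "following sentence: Fine-tuning is simple."},
             {"role": "assistant", "content": "Fine-tuningissimple."},
         ]
     }

     # Preference sample: one prompt paired with a chosen and a rejected
     # answer, as used for preference alignment.
     preference_sample = {
         "prompt": "Tell me a joke about octopuses.",
         "chosen": "Why don't octopuses play cards in casinos? "
                   "Because they can't count past eight.",
         "rejected": "How many tickles does it take to make an octopus "
                     "laugh? Ten tickles.",
     }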
  7. Instruction/answer generation
     Pipeline: seed data / raw text → prompts → answers → scoring + filtering
     Generation methods: generate, backtranslate, evolve
     Scoring: heuristics, LLM-as-a-judge, reward model
     Data filtering: length-filtering, keyword exclusion, format checking
     Data deduplication + decontamination: exact deduplication, fuzzy
     deduplication (e.g., MinHash; see the sketch below)
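     As a sketch of the fuzzy-deduplication step, here is a tiny pure-Python
     stand-in based on character n-gram Jaccard similarity; production
     pipelines use MinHash with LSH (e.g., the datasketch library) to avoid
     the quadratic pairwise comparison done here:

     def ngrams(text: str, n: int = 3) -> set[str]:
         """Character n-grams after normalizing case and whitespace."""
         text = " ".join(text.lower().split())
         return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

     def jaccard(a: set[str], b: set[str]) -> float:
         return len(a & b) / len(a | b) if a | b else 0.0

     def fuzzy_dedup(samples: list[str], threshold: float = 0.8) -> list[str]:
         """Keep a sample only if it is not too similar to any kept sample."""
         kept, kept_grams = [], []
         for s in samples:
             g = ngrams(s)
             if all(jaccard(g, kg) < threshold for kg in kept_grams):
                 kept.append(s)
                 kept_grams.append(g)
         return kept

     # "Fine-tuning is simple!" is dropped as a near-duplicate.
     print(fuzzy_dedup(["Fine-tuning is simple.", "Fine-tuning is simple!"]))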
  8. SFT example: instruction following
     Pipeline: prompts + constraints → query an LLM → run tests →
     keyword exclusion → decontaminate (7-gram overlap with IFEval)
     Example: Write a detailed review of the movie "The Social Network".
     Your entire response should be in English and all lower case
     (no capital letters whatsoever).
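     Constraints like the all-lowercase requirement above are attractive
     because they can be verified programmatically in the "run tests" step.
     A minimal sketch of such a checker (the function name is illustrative):

     def satisfies_lowercase_constraint(response: str) -> bool:
         """True if the response contains no capital letters whatsoever."""
         return not any(c.isupper() for c in response)

     assert satisfies_lowercase_constraint("the social network is a sharp film.")
     assert not satisfies_lowercase_constraint("The Social Network is sharp.")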
  9. Preference example: UltraFeedback
     Pipeline: prompts → query LLM 1, LLM 2, …, LLM n → judge LLM →
     remove duplicates, remove short answers (see the sketch below)
     Ganqu Cui et al. "UltraFeedback: Boosting Language Models with Scaled
     AI Feedback." arXiv preprint arXiv:2310.01377, October 2023.
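     A sketch of the pair-construction step, assuming each of the n answers
     has already been scored by the judge LLM (the function is illustrative;
     the judge call itself is out of scope here):

     def build_pair(prompt: str, answers: list[str], scores: list[float],
                    min_len: int = 50) -> dict | None:
         """Turn n scored answers into one (chosen, rejected) pair."""
         # Filters from the slide: remove duplicates and short answers.
         seen, pool = set(), []
         for answer, score in zip(answers, scores):
             if answer not in seen and len(answer) >= min_len:
                 seen.add(answer)
                 pool.append((answer, score))
         if len(pool) < 2:
             return None  # not enough distinct answers to form a pair
         pool.sort(key=lambda pair: pair[1], reverse=True)
         return {"prompt": prompt,
                 "chosen": pool[0][0],     # highest-scored answer
                 "rejected": pool[-1][0]}  # lowest-scored answer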
  10. Case study: Open-PerfectBlend
     [Figure: categories and datasets used in the blend]
     Tengyu Xu et al. "The Perfect Blend: Redefining RLHF with Mixture of
     Judges." arXiv preprint arXiv:2409.20370, September 2024.
  11. Chat templates
     Storage format: Alpaca (other examples: ShareGPT, OpenAI)
       System (optional): You are a helpful assistant, who always provide
       explanation. Think like you are answering to a five-year-old.
       Instruction: Remove the spaces from the following sentence: It
       prevents users to suspect that there are some hidden products
       installed on their device.
       Output: Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.
     Chat template: ChatML (other examples: Llama 3, Mistral Instruct)
       <|im_start|>system
       You are a helpful assistant, who always provide explanation. Think
       like you are answering to a five-year-old.<|im_end|>
       <|im_start|>user
       Remove the spaces from the following sentence: It prevents users to
       suspect that there are some hidden products installed on their
       device.<|im_end|>
       <|im_start|>assistant
       Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.<|im_end|>
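     A sketch of rendering the storage format into a chat template with
     Hugging Face transformers; the checkpoint is just one example of a
     ChatML-style model, and the exact string depends on the tokenizer:

     from transformers import AutoTokenizer

     # Example checkpoint whose tokenizer defines a ChatML template.
     tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

     messages = [
         {"role": "system", "content": "You are a helpful assistant, who always "
          "provide explanation. Think like you are answering to a five-year-old."},
         {"role": "user", "content": "Remove the spaces from the following "
          "sentence: It prevents users to suspect that there are some hidden "
          "products installed on their device."},
     ]
     prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                            add_generation_prompt=True)
     print(prompt)  # <|im_start|>system ... <|im_end|> ... <|im_start|>assistant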
  12. Recommended fine-tuning libraries
     TRL: Hugging Face's library, most up to date in terms of algorithms
     Axolotl: additional features and reusable YAML configurations
     Unsloth: efficient single-GPU fine-tuning with useful utilities
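     A minimal TRL sketch, assuming a conversational dataset in the
     "messages" format from slide 6; the checkpoint and dataset are
     placeholders, and the exact API surface varies across TRL versions:

     from datasets import load_dataset
     from trl import SFTConfig, SFTTrainer

     dataset = load_dataset("trl-lib/Capybara", split="train")

     trainer = SFTTrainer(
         model="Qwen/Qwen2.5-0.5B",               # any causal LM checkpoint
         train_dataset=dataset,                    # conversational "messages" rows
         args=SFTConfig(output_dir="sft-output"),
     )
     trainer.train()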
  13. SFT techniques
     Full Fine-Tuning (16-bit precision): maximizes quality, very high VRAM usage
     LoRA (16-bit precision): fastest training, high VRAM usage
     QLoRA (4-bit precision): low VRAM usage, degrades performance
     Figure adapted from Yuhui Xu et al. "QA-LoRA: Quantization-Aware
     Low-Rank Adaptation of Large Language Models." arXiv preprint
     arXiv:2309.14717, September 2023.
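     A sketch of the LoRA vs. QLoRA setup with peft and bitsandbytes; the
     hyperparameters and checkpoint are illustrative defaults, not
     prescriptions from the slide:

     import torch
     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
     from peft import LoraConfig, get_peft_model

     lora_config = LoraConfig(
         r=16, lora_alpha=32, lora_dropout=0.05,
         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
         task_type="CAUSAL_LM",
     )

     # QLoRA = the same adapters, but the frozen base model is loaded in
     # 4-bit precision to cut VRAM usage.
     bnb_config = BitsAndBytesConfig(
         load_in_4bit=True,
         bnb_4bit_quant_type="nf4",
         bnb_4bit_compute_dtype=torch.bfloat16,
     )

     model = AutoModelForCausalLM.from_pretrained(
         "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config)
     model = get_peft_model(model, lora_config)
     model.print_trainable_parameters()  # adapters are a tiny fraction of params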
  14. Preference alignment techniques
     Proximal Policy Optimization (PPO): maximizes quality, very expensive & complex
     Direct Preference Optimization (DPO): fast and cheap to use, lower quality
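     A DPO sketch with TRL on (prompt, chosen, rejected) data such as the
     binarized UltraFeedback set from slide 9; names are placeholders and
     API details vary by TRL version:

     from datasets import load_dataset
     from trl import DPOConfig, DPOTrainer

     dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

     trainer = DPOTrainer(
         model="Qwen/Qwen2.5-0.5B-Instruct",   # usually an SFT checkpoint
         train_dataset=dataset,                 # prompt/chosen/rejected columns
         args=DPOConfig(output_dir="dpo-output", beta=0.1),  # beta: KL strength
     )
     trainer.train()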
  15. Training parameters
     Parameter     | Description                                             | Common values
     Learning rate | How much the parameters are updated during training     | 1e-6 to 1e-3
     Batch size    | Number of samples processed before updating parameters  | 8 or 16 (effective)
     Max length    | Longest input (in tokens) the model can process         | 1024 to 4096
     Epochs        | Number of passes through the entire training dataset    | 3 to 5
     Optimizer     | Algorithm to update the parameters to minimize the loss | AdamW
     Attention     | Implementation of the attention mechanism               | FlashAttention-2
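     The table's parameters map directly onto TRL's SFTConfig (which extends
     transformers' TrainingArguments); the values below are picked from the
     slide's common ranges:

     from trl import SFTConfig

     args = SFTConfig(
         output_dir="sft-output",
         learning_rate=2e-5,             # within the 1e-6 to 1e-3 range
         per_device_train_batch_size=4,
         gradient_accumulation_steps=4,  # effective batch size = 16
         max_seq_length=2048,            # "max_length" in newer TRL versions
         num_train_epochs=3,
         optim="adamw_torch",            # AdamW
     )
     # FlashAttention-2 is enabled when loading the model, e.g.:
     # AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")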
  16. Automated benchmarks
     → Calculate a metric based on generated and ground-truth answers
     (e.g., accuracy). Examples: MMLU, Open LLM Leaderboard
     Advantages: ✅ Consistent & reproducible ✅ Cost-effective at scale
     ✅ Clear dimensions (e.g., math)
     Limitations: ❌ Not how models are used ❌ Hard to evaluate complex
     tasks ❌ Risk of data contamination
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. Open LLM Leaderboard by Hugging Face.
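     A toy sketch of the kind of metric such benchmarks compute, here
     exact-match accuracy over multiple-choice answers (the data is invented
     for illustration):

     def accuracy(predictions: list[str], references: list[str]) -> float:
         """Fraction of predictions that exactly match the ground truth."""
         correct = sum(p.strip() == r.strip()
                       for p, r in zip(predictions, references))
         return correct / len(references)

     preds = ["B", "C", "A", "D"]  # model's chosen options
     refs = ["B", "C", "B", "D"]   # ground-truth options
     print(f"accuracy = {accuracy(preds, refs):.2f}")  # 0.75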
  17. Human judge
     → Ask humans to score model generations on specific properties
     (accuracy, relevance, toxicity, etc.). Examples: vibe checks, Chatbot
     Arena, data annotation
     Advantages: ✅ High flexibility ✅ No data contamination risk
     ✅ Direct human preferences
     Limitations: ❌ Costly & time-consuming ❌ Biased (tone, first
     impression, etc.) ❌ Limited scalability
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. Chatbot Arena by LMSYS.
  18. Human preferences are weakly correlated with automated benchmarks.
     Jinjie Ni et al. "MixEval: Deriving Wisdom of the Crowd from LLM
     Benchmark Mixtures." arXiv preprint arXiv:2406.06565, June 2024.
  19. Judge LLM
     → Use an LLM to score model generations on specific properties
     (accuracy, relevance, toxicity, etc.). Examples: LLM-as-a-judge,
     reward models, small classifiers
     Advantages: ✅ This is how models are used ✅ Can handle complex tasks
     ✅ Provides direct feedback
     Limitations: ❌ Hidden biases (e.g., length, tone) ❌ Quality
     validation needed ❌ Costly at scale
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. EQ-Bench by Samuel J. Paech.
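     A sketch of a single-answer LLM-as-a-judge call; the prompt wording and
     the query_llm helper are illustrative stand-ins, not a specific
     benchmark's rubric or API:

     import re

     JUDGE_PROMPT = """Rate the following answer for accuracy and relevance
     on a scale of 1 to 10. Reply with the number only.

     Question: {question}
     Answer: {answer}
     Rating:"""

     def judge(question: str, answer: str, query_llm) -> int:
         """Score one answer; query_llm is a hypothetical LLM-call helper."""
         reply = query_llm(JUDGE_PROMPT.format(question=question, answer=answer))
         match = re.search(r"\d+", reply)
         if match is None:
             raise ValueError(f"unparseable judge reply: {reply!r}")
         return max(1, min(10, int(match.group())))  # clamp to the 1-10 scale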
  20. Create your own evaluation
     Start early and iterate a lot!
     Combine different types of evals.
     Compare your models with others.
  21. Conclusion
     Dataset: data generation, curation, filtering, and exploration
     Fine-tuning: supervised fine-tuning, preference alignment, model merging
     Evaluation: general and task- or domain-specific benchmarks