
Introduction to Post-Training - Tokyo AI

Maxime Labonne

February 05, 2025


Transcript

  1. About Maxime Labonne
     PhD, Head of Post-Training @ Liquid AI, GDE
     Author of the LLM Engineer's Handbook
     Writing: blog posts, LLM Course (>45k stars on GitHub)
     Models: NeuralDaredevil, AlphaMonarch, Phixtral
     Tools: LLM AutoEval, LazyMergekit, AutoQuant
     @maximelabonne | maxime-labonne
  2. Find more information in the LLM Course repo on GitHub:
     https://github.com/mlabonne/llm-course
     [Diagram: training pipeline]
     Pre-training: raw text → base model (autocompletes prompts)
     Post-training:
       Supervised fine-tuning: instructions → instruct model (follows instructions)
       Preference alignment: preferences → chat model (optimized for humans)
  3. Fine-tuning vs. post-training, by number of samples:
     General-purpose (e.g., LFMs): >1M samples (post-training)
     Domain-specific (e.g., medical LLM): 100k-1M samples
     Task-specific (e.g., spell checker): 10k-100k samples (fine-tuning)
  4. When fine-tuning?
     Start with in-context learning and RAG, and use evaluation to decide
     whether fine-tuning is needed.
     Good reasons to fine-tune: change tone and format, add (superficial)
     knowledge, reduce cost and latency, increase output quality.
  5. What is a good dataset?
     Accuracy: factually accurate information
     Diversity: covers a wide range of topics
     Complexity: non-trivial tasks forcing reasoning
     Find more information in the LLM Datasets repo on GitHub:
     https://github.com/mlabonne/llm-datasets
  6. Data formats
     Instruction data:
       System (optional): You are a helpful assistant.
       Instruction: Remove the spaces from the following sentence: Fine-tuning is simple.
       Output: Fine-tuningissimple.
     Preference data:
       System (optional): You are a helpful assistant with a great sense of humor.
       Instruction: Tell me a joke about octopuses.
       Chosen answer: Why don't octopuses play cards in casinos? Because they can't count past eight.
       Rejected answer: How many tickles does it take to make an octopus laugh? Ten tickles.
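     A minimal sketch of how these two formats are commonly stored as
     JSON-style records; the field names follow widespread conventions
     (e.g., TRL's conversational and preference formats) rather than
     anything prescribed by the slide:

     # Instruction sample: one conversation with a known-good answer.
     instruction_sample = {
         "messages": [
             {"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "Remove the spaces from the "
                                         "following sentence: Fine-tuning is simple."},
             {"role": "assistant", "content": "Fine-tuningissimple."},
         ]
     }

     # Preference sample: one prompt paired with a chosen and a rejected
     # answer, as used for preference alignment.
     preference_sample = {
         "prompt": "Tell me a joke about octopuses.",
         "chosen": "Why don't octopuses play cards in casinos? "
                   "Because they can't count past eight.",
         "rejected": "How many tickles does it take to make an octopus "
                     "laugh? Ten tickles.",
     }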
  7. Instruction/answer generation
     Pipeline: seed data / raw text → prompts → answers → scoring + filtering
     Generation methods: generate, backtranslate, evolve
     Scoring: heuristics, LLM-as-a-judge, reward model
     Data filtering: length-filtering, keyword exclusion, format checking
     Data deduplication + decontamination: exact deduplication, fuzzy
     deduplication (e.g., MinHash; see the sketch below)
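     As a sketch of the fuzzy-deduplication step, here is a tiny pure-Python
     stand-in based on character n-gram Jaccard similarity; production
     pipelines use MinHash with LSH (e.g., the datasketch library) to avoid
     the quadratic pairwise comparison done here:

     def ngrams(text: str, n: int = 3) -> set[str]:
         """Character n-grams after normalizing case and whitespace."""
         text = " ".join(text.lower().split())
         return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

     def jaccard(a: set[str], b: set[str]) -> float:
         return len(a & b) / len(a | b) if a | b else 0.0

     def fuzzy_dedup(samples: list[str], threshold: float = 0.8) -> list[str]:
         """Keep a sample only if it is not too similar to any kept sample."""
         kept, kept_grams = [], []
         for s in samples:
             g = ngrams(s)
             if all(jaccard(g, kg) < threshold for kg in kept_grams):
                 kept.append(s)
                 kept_grams.append(g)
         return kept

     # "Fine-tuning is simple!" is dropped as a near-duplicate.
     print(fuzzy_dedup(["Fine-tuning is simple.", "Fine-tuning is simple!"]))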
  8. SFT example: instruction following
     Pipeline: prompts + constraints → query an LLM → run tests →
     keyword exclusion → decontaminate (7-gram overlap with IFEval)
     Example: Write a detailed review of the movie "The Social Network".
     Your entire response should be in English and all lower case
     (no capital letters whatsoever).
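     Constraints like the all-lowercase requirement above are attractive
     because they can be verified programmatically in the "run tests" step.
     A minimal sketch of such a checker (the function name is illustrative):

     def satisfies_lowercase_constraint(response: str) -> bool:
         """True if the response contains no capital letters whatsoever."""
         return not any(c.isupper() for c in response)

     assert satisfies_lowercase_constraint("the social network is a sharp film.")
     assert not satisfies_lowercase_constraint("The Social Network is sharp.")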
  9. Preference example: UltraFeedback
     Pipeline: prompts → query LLM 1, LLM 2, …, LLM n → judge LLM →
     remove duplicates, remove short answers (see the sketch below)
     Ganqu Cui et al. "UltraFeedback: Boosting Language Models with Scaled
     AI Feedback." arXiv preprint arXiv:2310.01377, October 2023.
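     A sketch of the pair-construction step, assuming each of the n answers
     has already been scored by the judge LLM (the function is illustrative;
     the judge call itself is out of scope here):

     def build_pair(prompt: str, answers: list[str], scores: list[float],
                    min_len: int = 50) -> dict | None:
         """Turn n scored answers into one (chosen, rejected) pair."""
         # Filters from the slide: remove duplicates and short answers.
         seen, pool = set(), []
         for answer, score in zip(answers, scores):
             if answer not in seen and len(answer) >= min_len:
                 seen.add(answer)
                 pool.append((answer, score))
         if len(pool) < 2:
             return None  # not enough distinct answers to form a pair
         pool.sort(key=lambda pair: pair[1], reverse=True)
         return {"prompt": prompt,
                 "chosen": pool[0][0],     # highest-scored answer
                 "rejected": pool[-1][0]}  # lowest-scored answer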
  10. Case study: Open-PerfectBlend
     [Figure: categories and datasets used in the blend]
     Tengyu Xu et al. "The Perfect Blend: Redefining RLHF with Mixture of
     Judges." arXiv preprint arXiv:2409.20370, September 2024.
  11. Chat templates
     Storage format: Alpaca (other examples: ShareGPT, OpenAI)
       System (optional): You are a helpful assistant, who always provide
       explanation. Think like you are answering to a five-year-old.
       Instruction: Remove the spaces from the following sentence: It
       prevents users to suspect that there are some hidden products
       installed on their device.
       Output: Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.
     Chat template: ChatML (other examples: Llama 3, Mistral Instruct)
       <|im_start|>system
       You are a helpful assistant, who always provide explanation. Think
       like you are answering to a five-year-old.<|im_end|>
       <|im_start|>user
       Remove the spaces from the following sentence: It prevents users to
       suspect that there are some hidden products installed on their
       device.<|im_end|>
       <|im_start|>assistant
       Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirdevice.<|im_end|>
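     A sketch of rendering the storage format into a chat template with
     Hugging Face transformers; the checkpoint is just one example of a
     ChatML-style model, and the exact string depends on the tokenizer:

     from transformers import AutoTokenizer

     # Example checkpoint whose tokenizer defines a ChatML template.
     tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

     messages = [
         {"role": "system", "content": "You are a helpful assistant, who always "
          "provide explanation. Think like you are answering to a five-year-old."},
         {"role": "user", "content": "Remove the spaces from the following "
          "sentence: It prevents users to suspect that there are some hidden "
          "products installed on their device."},
     ]
     prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                            add_generation_prompt=True)
     print(prompt)  # <|im_start|>system ... <|im_end|> ... <|im_start|>assistant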
  12. Recommended fine-tuning libraries
     TRL: Hugging Face's library, most up to date in terms of algorithms
     Axolotl: additional features and reusable YAML configurations
     Unsloth: efficient single-GPU fine-tuning with useful utilities
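     A minimal TRL sketch, assuming a conversational dataset in the
     "messages" format from slide 6; the checkpoint and dataset are
     placeholders, and the exact API surface varies across TRL versions:

     from datasets import load_dataset
     from trl import SFTConfig, SFTTrainer

     dataset = load_dataset("trl-lib/Capybara", split="train")

     trainer = SFTTrainer(
         model="Qwen/Qwen2.5-0.5B",               # any causal LM checkpoint
         train_dataset=dataset,                    # conversational "messages" rows
         args=SFTConfig(output_dir="sft-output"),
     )
     trainer.train()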
  13. SFT techniques
     Full Fine-Tuning (16-bit precision): maximizes quality, very high VRAM usage
     LoRA (16-bit precision): fastest training, high VRAM usage
     QLoRA (4-bit precision): low VRAM usage, degrades performance
     Figure adapted from Yuhui Xu et al. "QA-LoRA: Quantization-Aware
     Low-Rank Adaptation of Large Language Models." arXiv preprint
     arXiv:2309.14717, September 2023.
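     A sketch of the LoRA vs. QLoRA setup with peft and bitsandbytes; the
     hyperparameters and checkpoint are illustrative defaults, not
     prescriptions from the slide:

     import torch
     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
     from peft import LoraConfig, get_peft_model

     lora_config = LoraConfig(
         r=16, lora_alpha=32, lora_dropout=0.05,
         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
         task_type="CAUSAL_LM",
     )

     # QLoRA = the same adapters, but the frozen base model is loaded in
     # 4-bit precision to cut VRAM usage.
     bnb_config = BitsAndBytesConfig(
         load_in_4bit=True,
         bnb_4bit_quant_type="nf4",
         bnb_4bit_compute_dtype=torch.bfloat16,
     )

     model = AutoModelForCausalLM.from_pretrained(
         "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config)
     model = get_peft_model(model, lora_config)
     model.print_trainable_parameters()  # adapters are a tiny fraction of params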
  14. Preference alignment techniques
     Proximal Policy Optimization (PPO): maximizes quality, very expensive & complex
     Direct Preference Optimization (DPO): fast and cheap to use, lower quality
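     A DPO sketch with TRL on (prompt, chosen, rejected) data such as the
     binarized UltraFeedback set from slide 9; names are placeholders and
     API details vary by TRL version:

     from datasets import load_dataset
     from trl import DPOConfig, DPOTrainer

     dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

     trainer = DPOTrainer(
         model="Qwen/Qwen2.5-0.5B-Instruct",   # usually an SFT checkpoint
         train_dataset=dataset,                 # prompt/chosen/rejected columns
         args=DPOConfig(output_dir="dpo-output", beta=0.1),  # beta: KL strength
     )
     trainer.train()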
  15. Training parameters
     Parameter     | Description                                             | Common values
     Learning rate | How much the parameters are updated during training     | 1e-6 to 1e-3
     Batch size    | Number of samples processed before updating parameters  | 8 or 16 (effective)
     Max length    | Longest input (in tokens) the model can process         | 1024 to 4096
     Epochs        | Number of passes through the entire training dataset    | 3 to 5
     Optimizer     | Algorithm to update the parameters to minimize the loss | AdamW
     Attention     | Implementation of the attention mechanism               | FlashAttention-2
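     The table's parameters map directly onto TRL's SFTConfig (which extends
     transformers' TrainingArguments); the values below are picked from the
     slide's common ranges:

     from trl import SFTConfig

     args = SFTConfig(
         output_dir="sft-output",
         learning_rate=2e-5,             # within the 1e-6 to 1e-3 range
         per_device_train_batch_size=4,
         gradient_accumulation_steps=4,  # effective batch size = 16
         max_seq_length=2048,            # "max_length" in newer TRL versions
         num_train_epochs=3,
         optim="adamw_torch",            # AdamW
     )
     # FlashAttention-2 is enabled when loading the model, e.g.:
     # AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")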
  16. Automated benchmarks
     → Calculate a metric based on generated and ground-truth answers
     (e.g., accuracy). Examples: MMLU, Open LLM Leaderboard
     Advantages: ✅ Consistent & reproducible ✅ Cost-effective at scale
     ✅ Clear dimensions (e.g., math)
     Limitations: ❌ Not how models are used ❌ Hard to evaluate complex
     tasks ❌ Risk of data contamination
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. Open LLM Leaderboard by Hugging Face.
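     A toy sketch of the kind of metric such benchmarks compute, here
     exact-match accuracy over multiple-choice answers (the data is invented
     for illustration):

     def accuracy(predictions: list[str], references: list[str]) -> float:
         """Fraction of predictions that exactly match the ground truth."""
         correct = sum(p.strip() == r.strip()
                       for p, r in zip(predictions, references))
         return correct / len(references)

     preds = ["B", "C", "A", "D"]  # model's chosen options
     refs = ["B", "C", "B", "D"]   # ground-truth options
     print(f"accuracy = {accuracy(preds, refs):.2f}")  # 0.75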
  17. Human judge
     → Ask humans to score model generations on specific properties
     (accuracy, relevance, toxicity, etc.). Examples: vibe checks, Chatbot
     Arena, data annotation
     Advantages: ✅ High flexibility ✅ No data contamination risk
     ✅ Direct human preferences
     Limitations: ❌ Costly & time-consuming ❌ Biased (tone, first
     impression, etc.) ❌ Limited scalability
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. Chatbot Arena by LMSYS.
  18. Human preferences are weakly correlated with automated benchmarks.
     Jinjie Ni et al. "MixEval: Deriving Wisdom of the Crowd from LLM
     Benchmark Mixtures." arXiv preprint arXiv:2406.06565, June 2024.
  19. Judge LLM
     → Use an LLM to score model generations on specific properties
     (accuracy, relevance, toxicity, etc.). Examples: LLM-as-a-judge,
     reward models, small classifiers
     Advantages: ✅ This is how models are used ✅ Can handle complex tasks
     ✅ Provides direct feedback
     Limitations: ❌ Hidden biases (e.g., length, tone) ❌ Quality
     validation needed ❌ Costly at scale
     Clémentine Fourrier and the Hugging Face Community, "LLM Evaluation
     Guidebook," 2024. EQ-Bench by Samuel J. Paech.
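     A sketch of a single-answer LLM-as-a-judge call; the prompt wording and
     the query_llm helper are illustrative stand-ins, not a specific
     benchmark's rubric or API:

     import re

     JUDGE_PROMPT = """Rate the following answer for accuracy and relevance
     on a scale of 1 to 10. Reply with the number only.

     Question: {question}
     Answer: {answer}
     Rating:"""

     def judge(question: str, answer: str, query_llm) -> int:
         """Score one answer; query_llm is a hypothetical LLM-call helper."""
         reply = query_llm(JUDGE_PROMPT.format(question=question, answer=answer))
         match = re.search(r"\d+", reply)
         if match is None:
             raise ValueError(f"unparseable judge reply: {reply!r}")
         return max(1, min(10, int(match.group())))  # clamp to the 1-10 scale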
  20. Create your own evaluation
     Start early and iterate a lot!
     Combine different types of evals.
     Compare your models with others.
  21. Conclusion
     Dataset: data generation, curation, filtering, and exploration
     Fine-tuning: supervised fine-tuning, preference alignment, model merging
     Evaluation: general and task- or domain-specific benchmarks