
Open SLM Phi-3 TAIT 202407

An introductory session on Microsoft's open SLM family Phi-3 for the ML, tech, and startup community in Tokyo.

Presented at the TAIT (Tokyo AI Talk) July 2024 session.

Xiaoli Shen

August 09, 2024


Transcript

  1. Tiny Mighty SLM: Phi-3
     Xiaoli (Alex) Shen, Senior AI/ML Specialist, AI Global Black Belt, Microsoft. 2024/07/26
  2. Xiaoli (Alex) Shen: AI/ML Specialist | Architect | Software Engineer
     Locations: Changchun, Xiamen, Beijing (P. R. China); Bremen, Frankfurt am Main, Düsseldorf (Germany); Tokyo (Japan)
     linkedin.com/in/xiaolishen/ | [email protected]
     - 2024/03 - present: Sr. AI/ML Specialist, AI Global Black Belt, Microsoft, Tokyo, Japan
     - 2021 - 2024: Solutions Architect (focus area Machine Learning), Amazon Web Services, Tokyo, Japan
     - 2017 - 2021: Tech Lead / Software Architect / Sr. Software Engineer, Fast Retailing, Tokyo, Japan
     - 2011 - 2016: Fullstack Application Developer / Creative Technologist, various companies in Germany
     Hobbies: cello, travel, movies, languages (CN, EN, JP, DE, FR)
  3. GPT-3 (175B): Inference Memory Needs (see the sketch below)
     For 32-bit precision:
     - Model parameters: 175 billion x 4 bytes = 700 GB
     - Intermediate activations: 700-1400 GB
     - Overheads: 10-20 GB
     - Total: 1410-2120 GB
     For 16-bit precision:
     - Model parameters: 175 billion x 2 bytes = 350 GB
     - Intermediate activations: 350-700 GB
     - Overheads: 10-20 GB
     - Total: 710-1070 GB
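The figures on this slide follow from simple per-parameter arithmetic. A minimal Python sketch that reproduces them; the activation and overhead ranges are the slide's own rough estimates, not measured values:

```python
# Back-of-the-envelope sketch of the slide's arithmetic. The activation and
# overhead ranges are the slide's own rough estimates, not measured values.
PARAMS = 175e9  # GPT-3 parameter count

def inference_memory_gb(bytes_per_param, overhead_gb=(10, 20)):
    """Return a (low, high) estimate of total inference memory in GB."""
    weights = PARAMS * bytes_per_param / 1e9          # model parameters
    activations = (weights * 1.0, weights * 2.0)      # 1-2x the weights, per the slide
    return (weights + activations[0] + overhead_gb[0],
            weights + activations[1] + overhead_gb[1])

print("fp32:", inference_memory_gb(4))  # -> (1410.0, 2120.0) GB
print("fp16:", inference_memory_gb(2))  # -> (710.0, 1070.0) GB
```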
  4. Open SLM Phi-3
     Sizes: Phi-3-mini (3.8B), Phi-3-small (7B), Phi-3-medium (14B), Phi-3-vision (4.2B)
     Available on: Azure AI Model Catalog, Hugging Face, Ollama, NVIDIA NIM, ONNX Runtime
     - Groundbreaking quality/cost for scale
     - Runs everywhere: GPUs, CPUs, devices
     - Long-context and image support
  5. LLM Cost vs Quality
     [Chart: model quality (MMLU average, 60-85) versus inference cost (1K tokens/$, retail, log scale); higher is better, further right is cheaper. Models plotted: GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, GPT-4o, Claude-3 Opus, Claude-3 Sonnet, Claude-3 Haiku, Gemini Pro, Mistral Tiny, Mixtral Small, Llama-2 13B, Llama-2 70B, Phi-3 Mini, Phi-3 Small, Phi-3 14B.]
  6. Phi-3 Tech Specs
     - Architecture:
       - Text models: dense decoder-only Transformer
       - Vision: image encoder + Phi-3-mini
       - SFT and DPO fine-tuned
     - Context length:
       - mini & medium: 4K, 128K
       - small: 8K, 128K
       - vision: 128K
     - Cross-platform support: GPU, CPU, mobile
     - Sizes: Phi-3-mini (3.8B), Phi-3-vision (4.2B), Phi-3-small (7B), Phi-3-medium (14B)
     - Available on: Azure AI Model Catalog, Hugging Face, Ollama, NVIDIA NIM, ONNX Runtime
     - Formats (a minimal Hugging Face loading sketch follows this slide):
       - Mini: 4K [HF] [ONNX] [GGUF]; 128K [HF] [ONNX]
       - Small: 8K [HF] [ONNX]; 128K [HF] [ONNX]
       - Medium: 4K [HF] [ONNX]; 128K [HF] [ONNX]
       - Vision: 128K [HF] [ONNX]
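Since Hugging Face is one of the listed distribution channels, here is a minimal sketch of loading the mini checkpoint with the transformers library. The model ID microsoft/Phi-3-mini-4k-instruct, the bfloat16 dtype, and the prompt are assumptions on my part, not taken from the deck; exact generation behavior may vary with your transformers version.

```python
# Minimal sketch: run Phi-3-mini locally via Hugging Face transformers.
# Assumes transformers and torch are installed and a GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~2 bytes/param keeps the 3.8B model compact
    device_map="auto",
    trust_remote_code=True,       # Phi-3 ships custom modeling code on the Hub
)

messages = [{"role": "user", "content": "In two sentences, what is a small language model?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the model's reply.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```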
  7. Phi-3 performance across industry benchmarks
     [Chart: benchmark scores by category (code generation, factual knowledge, language understanding, math, popular aggregate benchmarks, reasoning, grand total) for Phi-3-Mini-4K-Instruct (3.8B), Phi-3-Small-8K-Instruct (7B) and Phi-3-Medium-4K-Instruct (14B) versus Gemma-7b, Mistral-7b, Mixtral-8x7b, Llama-3-8B-Instruct, Claude-3 Sonnet, Gemini 1.0 Pro, Mixtral-8x22B, Llama-3-70B-Instruct and Command R+ 104B.]
  8. Evolution of Phi Models
     Phi-1 (1.3B), 2023/06
     - Specialized in Python coding
     - > 50% on HumanEval and MBPP
     - Paper: Textbooks Are All You Need
     Phi-1.5 (1.3B), 2023/09
     - Added commonsense reasoning in natural language
     - On-par performance on NLP tasks with models 5x larger (e.g. Llama 2-7B, Vicuna-13B)
     - Paper: Textbooks Are All You Need II: phi-1.5 technical report
     Phi-2 (2.7B), 2023/12
     - Augmented data sources
     - Near-SOTA performance among models smaller than 13B (e.g. Llama 2-13B, Mistral-7B)
     - Blog: Phi-2: The surprising power of small language models
     Phi-3 Family, 2024/04 (updated 2024/06)
     - Sizes: 3.8B, 7B, 14B, 4.2B (vision)
     - Context length: 4K, 128K
     - SOTA open SLM with multi-modality
     - Paper: Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  9. Training Phi-3
     [Chart: scaling law close to the "Data Optimal Regime". Log of MMLU error versus log of model size for phi-1.5, phi-2, phi-3-mini and phi-3-small, compared with the Llama-2 family (7B, 13B, 34B, 70B) trained on the same fixed data.]
     Textbooks are (still) all you need: high-quality training data improves SLMs and deviates from standard scaling laws.
     - Phi-1: 7B unique tokens of textbook-quality code-language data
       - 6B deduplicated, GPT-4-filtered code tokens from The Stack and StackOverflow
       - 1B tokens of GPT-3.5-generated Python textbook data
     - Phi-1.5: Phi-1's data + 20B synthetic textbook-like common sense and general knowledge tokens
       - Seeded with 20K carefully selected topics
       - Used web samples in prompts for diversity
     - Phi-2:
       - Synthetic data specifically created to teach common sense reasoning and general knowledge
       - Carefully selected web data, filtered by educational value and content quality
     Data Optimal Regime: focus on the quality of data for a given scale.
  10. Training Phi-3
     Training data
     - Public web data heavily filtered by educational level
     - Synthetic LLM-generated data
     Two-phase pre-training
     - Phase 1: General knowledge & language understanding
       - Data: primarily web-based, highly filtered towards textbook-quality data
       - Goal: teach general knowledge and language skills
     - Phase 2: Logical reasoning & niche skills
       - Data: filtered web data (a subset of Phase 1) and synthetic data
       - Goal: enhance logical reasoning, math, coding and specialized skills
     Two-stage post-training
     - Stage 1: Instruction-following supervised fine-tuning (SFT)
       - Data: curated high-quality data across various domains (math, coding, reasoning, conversation, safety)
       - Goal: improve domain-specific knowledge and the ability to follow user instructions in various use cases
     - Stage 2: Direct Preference Optimization (DPO; the objective is sketched after this slide)
       - Data: preference data in chat format, reasoning data, and Responsible AI (RAI) data
       - Goal: steer the model away from unwanted behavior, enhance robustness and safety, and turn it into an efficient AI assistant
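For reference, stage 2 refers to the standard DPO objective from Rafailov et al. (2023); the deck does not give Phi-3's actual hyperparameters, so β and the preference dataset D below are generic symbols, not Phi-3 specifics:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

Here y_w and y_l are the preferred and rejected responses for prompt x, pi_ref is the frozen stage-1 SFT model, sigma is the logistic function, and beta controls how far the tuned policy may drift from the reference.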
  11. Use Cases for SLMs
     - Smaller, less compute-intensive models that perform well at simple tasks
     - Offline environments, on-device or on-prem, where local inference may be needed
     - Latency-bound scenarios where fast response times are critical
     - Cost-constrained tasks and use cases, particularly simpler ones
     - Resource-constrained environments
     - Select tasks that can see improved performance via fine-tuning (vs. a large model out of the box)
  12. Get started with Phi-3 Models
     Availability by model (Azure AI = Models as a Service; Azure ML = Models as a Platform, requires an Azure subscription):
     - Phi-3-vision-128k-instruct (text+image, 128k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX coming soon; Hugging Face download; Ollama N/A; NVIDIA NIM APIs
     - Phi-3-mini-4k-instruct (text, 4k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA, Web); Hugging Face playground & download; Ollama GGUF; NVIDIA NIM APIs
     - Phi-3-mini-128k-instruct (text, 128k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA); Hugging Face download; Ollama N/A; NVIDIA NIM APIs
     - Phi-3-small-8k-instruct (text, 8k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA); Hugging Face download; Ollama N/A; NVIDIA NIM APIs
     - Phi-3-small-128k-instruct (text, 128k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA); Hugging Face download; Ollama N/A; NVIDIA NIM APIs
     - Phi-3-medium-4k-instruct (text, 4k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA, CPU, DirectML); Hugging Face download; Ollama available; NVIDIA NIM APIs
     - Phi-3-medium-128k-instruct (text, 128k context): Azure AI playground & deployment; Azure ML playground, deployment & finetuning; ONNX (CUDA, CPU, DirectML); Hugging Face download; Ollama N/A; NVIDIA NIM N/A
     Run Phi-3 in your browser:
     - Demo: https://guschmue.github.io/ort-webgpu/chat/index.html
     - Code: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/chat
     Hands-on examples on GitHub: Phi-3 CookBook (a minimal Ollama call is sketched after this slide)
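As a quick local start with the Ollama path listed above, here is a minimal sketch of querying Phi-3 through Ollama's REST API. It assumes Ollama is installed and serving on its default port 11434 and that the model has been pulled with `ollama pull phi3`; the prompt is just an example, not from the deck.

```python
# Minimal sketch: query a locally running Phi-3 through Ollama's REST API.
# Uses only the Python standard library.
import json
import urllib.request

payload = {
    "model": "phi3",
    "prompt": "Name three scenarios where a small language model is preferable to a large one.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read().decode("utf-8"))

print(body["response"])  # the model's completion text
```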
  13. Phi-3 performance across industry benchmarks
     [Chart: benchmark scores by category (code generation, factual knowledge, language understanding, math, popular aggregate benchmarks, reasoning, grand total) for Phi-3-Mini-4K/128K-Instruct, Phi-3-Small-8K/128K-Instruct and Phi-3-Medium-4K/128K-Instruct versus Gemma-7b, Mistral-7b, Mixtral-8x7b, Llama-3-8B-Instruct, Claude-3 Sonnet, Gemini 1.0 Pro, Mixtral-8x22B, Llama-3-70B-Instruct and Command R+ 104B.]
  14. Models <10B Parameters
     [Chart: benchmark scores by category (code generation, factual knowledge, language understanding, math, popular aggregate benchmarks, reasoning) for Gemma-7b, Mistral-7b, Mixtral-8x7b, Llama-3-8B-Instruct, Phi-3-Mini-4K-Instruct and Phi-3-Small-8K-Instruct.]
  15. Phi-3-mini's groundbreaking performance
     Phi-3-mini (3.8B) significantly outperforms language models of the same size and larger; with only 3.8B parameters it performs better than models twice its size.
  16. Phi-3-small's groundbreaking performance
     Phi-3-small (7B) significantly outperforms language models of the same size and larger, beating GPT-3.5 Turbo across a variety of language, reasoning, coding and math benchmarks.
  17. Phi-3-vision's groundbreaking performance
     Phi-3-vision (4.2B) significantly outperforms models of the same size and larger, beating models such as Claude-3 Haiku and Gemini 1.0 Pro V across general visual reasoning, OCR, and table and chart understanding tasks.
  18. Azure AI model breadth: offering the widest collection of frontier and open-source models
     - Azure OpenAI Service: GPT-4-Turbo, GPT-4, GPT-4V, GPT-3.5-Turbo, Text-embedding-ada-002
     - Meta: Llama-2-70b/70b-chat*, Llama-2-13b/13b-chat*, Llama-2-7b/7b-chat*, Llama-3*, CodeLlama
     - Mistral AI: Mistral Large*, Mistral 7b, Mixtral 8x7b (Mixture of Experts)
     - Cohere: Command R*, Command R+*, Embed v3 Multilingual*, Embed v3 English*
     - Small language models (Phi): Phi-1, Phi-1.5, Phi-2, Phi-3
     - Hugging Face: Falcon (TII), Stable Diffusion (Stability AI), Dolly (Databricks), CLIP (OpenAI)
     - NVIDIA: Nemotron-3-8B-4k, Nemotron-3-8B-Chat-SFT/RLHF/SteerLM, Nemotron-3-8B-QA
     - Databricks: dbrx-base, dbrx-instruct
     - G42: Jais*
     - Orca: Orca 1, Orca 2
     * Available via MaaS