Magnum SLM Playbook - A beginner's guide to Small Language Models

Marketing OGZ
September 17, 2025

Transcript

  1. Unlocking AI Value with Small Language Models (SLMs)
    Think of an SLM as a pocket calculator perfectly suited for the job, vs. an LLM as a general-purpose supercomputer. https://huggingface.co/models
  2. Power versus Focus
    LLMs (Generalists):
    • Trained on vast, diverse data.
    • Broad world knowledge, attempt diverse tasks.
    • Aim for broad competence.
    SLMs (Often Specialists):
    • More focused training/fine-tuning possible.
    • Good at specific, well-defined tasks (e.g., log analysis, summarization, data extraction).
    • Can outperform LLMs on narrow domains, more efficiently.
  3. OpenAI's Agent Development Pattern
    OpenAI, "A practical guide to building agents": https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
    • Establish a baseline: Start with robust evaluations to measure performance.
    • Iterate on accuracy: Focus development on meeting specific accuracy targets.
    • Optimize for efficiency: Once accuracy is met, improve cost and latency by swapping in smaller, faster models where appropriate.
  4. NVIDIA Research Multi-Agent System Pattern
    NVIDIA Research advises a "heterogeneous model approach":
    • Cost-Effective Deployment: Use Small Language Models (SLMs) to reduce latency and infrastructure costs.
    • Modular Design: Employ SLMs for routine tasks and reserve large models for complex reasoning.
    • Rapid Specialization: Fine-tune agile SLMs for specific tasks to enable faster iteration.
    Belcak et al., "Small Language Models are the Future of Agentic AI" (NVIDIA Research, 2025). Source: https://research.nvidia.com/labs/lpr/slm-agents/
  5. Compelling Benefits Driving SLM Adoption
    • Efficiency & Speed: Fewer calculations = faster inference, lower energy use.
    • Cost-Effectiveness: Lower API costs, less demanding hardware needs.
    • Task-Specific Performance: Optimized models often achieve superior results on target tasks.
    • Accessibility: Run on standard hardware, lowering barriers to entry.
    • Fine-tuning Feasibility: Practical to adapt SLMs to proprietary data/niche tasks.
    • Deployment Flexibility: Enables diverse patterns (Cloud, Local, Edge).
  6. Choosing Your Deployment Pattern
    Three Primary Plays:
    • Cloud API Access (Easy, Pay-as-you-go)
    • Managed Cloud Platforms (Scalable, Integrated, MLOps)
    • Local Deployment (Max Privacy, Control, Offline)
  7. SLM Key Concepts: Quantization (Full Model vs. Q8)
    Q8 is roughly 4x smaller than the full-precision model, with little performance drop. It is often a sweet spot for inference endpoints.
  8. SLM Key Concepts: Quantization (Full Model vs. Q4)
    Q4 is roughly 8x smaller than the full-precision model and is often preferred for running locally or on edge devices. Fun fact: OpenAI's gpt-oss MoE models are natively Q4 quantized. (A quick size estimate is sketched below.)
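A rough back-of-the-envelope sketch of what these size ratios mean in practice. The bytes-per-weight figures are standard approximations, not numbers from the deck, and real quantized files carry some extra overhead:

```python
# Rough memory-footprint estimates for a dense model at different precisions.
# Assumptions: FP32 = 4 bytes/weight, FP16 = 2, Q8 ~= 1, Q4 ~= 0.5; overhead ignored.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}

def approx_size_gb(n_params_billion: float, precision: str) -> float:
    """Approximate on-disk / in-memory size in GB for the given precision."""
    return n_params_billion * 1e9 * BYTES_PER_WEIGHT[precision] / 1e9

for precision in ("fp32", "fp16", "q8", "q4"):
    print(f"7B model @ {precision}: ~{approx_size_gb(7, precision):.1f} GB")
# ~28 GB at fp32, ~7 GB at q8 (4x smaller), ~3.5 GB at q4 (8x smaller),
# which is why a 7B/8B model at 4/5-bit quant fits comfortably in ~16 GB of (V)RAM.
```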
  9. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  10. Local Deployment
    Most important advice: learn from the local LLM community at https://www.reddit.com/r/LocalLLaMA/
    • Hardware – (V)RAM is King:
      • ~16GB RAM: 7B/8B models (4/5-bit quant).
      • ~32GB RAM: ~13B models (good quant).
      • ~64GB+ RAM: needed for 30B+ models.
    • Hardware – GPU Acceleration: dramatically speeds up inference.
  11. Setting Up a Model with Ollama – the easy way
    • Find the model: Browse the official library.
    • Pull the model: Open your terminal and run ollama pull qwen3:1.7b
    • Run it: ollama run qwen3:1.7b. The model is now ready to use in any application connected to Ollama (one way to call it is sketched below).
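As a minimal sketch of "any application connected to Ollama", the snippet below calls the pulled model through Ollama's local REST API. The model tag matches the slide; the port is Ollama's default, and the prompt is a placeholder:

```python
# Minimal sketch: query a locally running Ollama model over its default REST endpoint.
# Assumes Ollama is running on localhost:11434 and qwen3:1.7b has been pulled.
import json
import urllib.request

payload = {
    "model": "qwen3:1.7b",
    "prompt": "Summarize the benefits of small language models in two sentences.",
    "stream": False,  # return a single JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```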
  12. Setting Up a Model with Ollama – vague naming
    Be mindful of which quantization and parameter count a model tag actually points to; the default tag is not always the variant you expect.
  13. Setting Up a Model with Ollama – the flexible way
    • Download a GGUF file from Hugging Face
    • Create a Modelfile
    • Create the model in Ollama: ollama create my-custom-model -f C:\path\to\Modelfile
    Modelfile:
    # 1. Base Model
    FROM C:\Path\unsloth-Qwen2.5-Coder-7B-Instruct-128K-GGUF\Qwen2.5-Coder-7B-Instruct-Q6_K.gguf
    # 2. Chat Template
    TEMPLATE """<|im_start|>system
    {{ .System }}<|im_end|>
    <|im_start|>user
    {{ .Prompt }}<|im_end|>
    <|im_start|>assistant
    """
    # 3. System Prompt
    SYSTEM """
    You are an …
    """
    # 4. Parameters
    # --- Generation Control ---
    PARAMETER temperature 0.7
    PARAMETER top_p 0.9
  14. Ollama – A practical use case: using Continue.dev
    1. Install Ollama
    2. If on Windows, configure it to use VRAM
    3. Open your IDE
  15. Ollama – A practical use case: using Continue.dev
    Continue.dev is a popular IDE coding assistant that supports running models locally with Ollama.
  16. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  17. Managed Cloud Platforms
    Examples: Azure AI Studio (+Azure OpenAI), Google Vertex AI, AWS SageMaker (+JumpStart).
    Pros: Robust scalability/reliability, seamless integration with other cloud services, MLOps tools, enterprise security/compliance.
    Cons: More complex setup than APIs, potentially higher costs (resource consumption).
    When to use: Enterprise apps needing MLOps, scalability, and security; integrating with existing cloud infra; when a unified development/deployment platform is preferred.
  18. Managed Cloud Platforms – Databricks Serverless Deployment
    Our strategy:
    1. Use the Hugging Face CLI to download the model into a UC Volume
    2. Create a Python wrapper
    3. Declare it in Unity Catalog
    4. Create a serverless model endpoint
  19. Databricks Serverless Deployment – Step 1: Download
    The download settings shown on the slide must be configured exactly as shown, or the files won't be written into the UC Volume. (A Python sketch of the download is shown below.)
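The slide drives the download through the Hugging Face CLI; a roughly equivalent Python sketch using huggingface_hub follows. The repo id, the Volume path, and the symlink flag are assumptions (UC Volumes generally don't support symlinks), not settings taken from the slide:

```python
# Sketch of step 1: download model weights into a Unity Catalog Volume.
# Repo id and volume path are placeholders; adjust to your catalog/schema/volume.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-1.7B",                       # model family used elsewhere in the deck
    local_dir="/Volumes/main/default/models/qwen3",  # hypothetical UC Volume path
    local_dir_use_symlinks=False,                    # write real files; Volumes don't support symlinks
)
```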
  20. Databricks Serverless Deployment – Step 2: Create a Python Wrapper
    This script defines an MLflow PyFunc wrapper for serving the Qwen3 model, handling model loading and text generation. The transformers library does the heavy lifting: loading the Qwen3 model and its tokenizer, preparing input text (tokenization and chat templating), running generation, and decoding the output. (A sketch of such a wrapper follows.)
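The deck shows the wrapper as a screenshot; the sketch below is an assumed reconstruction of the same idea (an mlflow.pyfunc.PythonModel loading Qwen3 with transformers), not the author's actual code. Artifact names and generation parameters are placeholders:

```python
# Sketch of an MLflow PyFunc wrapper around a Qwen3 checkpoint (placeholder names throughout).
import mlflow.pyfunc
from transformers import AutoModelForCausalLM, AutoTokenizer

class Qwen3Wrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # "model_path" is an artifact key registered when logging the model (see step 3).
        path = context.artifacts["model_path"]
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

    def predict(self, context, model_input):
        prompt = model_input["prompt"][0]
        # Chat templating: wrap the raw prompt in the model's chat format.
        text = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
```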
  21. Databricks Serverless Deployment – Step 3: Declare into UC
    Import and use the wrapper, and get your dependencies right. After running, the model will be available in Unity Catalog. (A sketch of the registration call follows.)
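A minimal sketch of what "declare into UC" could look like with MLflow's Unity Catalog registry. The catalog, schema, and dependency pins are assumptions, and the wrapper class is the one sketched in step 2:

```python
# Sketch of step 3: log the PyFunc wrapper and register it in Unity Catalog.
import mlflow

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog rather than the workspace registry

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="qwen3_slm",
        python_model=Qwen3Wrapper(),  # wrapper from step 2
        artifacts={"model_path": "/Volumes/main/default/models/qwen3"},  # weights downloaded in step 1
        pip_requirements=["mlflow", "transformers", "torch", "accelerate"],  # "get your dependencies right"
        registered_model_name="main.default.qwen3_slm",  # hypothetical catalog.schema.model name
    )
```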
  22. Databricks Serverless Deployment – Step 4: Deploy
    Size the endpoint according to the model (approx. model size at FP16):
    • CPU – no GPU, system RAM only: under ~1 billion parameters
    • GPU Small – 1x NVIDIA T4, 16 GB VRAM: ~7 billion parameters
    • GPU Medium – 1x NVIDIA A10G, 24 GB VRAM: ~10-13 billion parameters
    • GPU Medium x4 – 4x NVIDIA A10G, 96 GB VRAM: ~40-45 billion parameters
    • GPU Medium x8 – 8x NVIDIA A10G, 192 GB VRAM: ~70-85 billion parameters
  23. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  24. Cloud API Access
    Concept: Use third-party hosted SLMs via simple HTTP requests. Example: the Hugging Face Inference API (a call is sketched below).
    Pros: Minimal setup, rapid prototyping, huge model variety, automatic scaling, pay-per-use.
    Cons: Costly at high volume, network latency, provider dependence, data privacy concerns (data leaves your environment).
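A minimal sketch of such a hosted call via the huggingface_hub client. The model id, token, and prompt are placeholders, and availability of any particular model on the serverless Inference API is an assumption:

```python
# Sketch: call a hosted SLM through the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen3-1.7B", token="hf_...")  # placeholder model id and token
response = client.chat_completion(
    messages=[{"role": "user", "content": "Extract the error code from: 'job 42 failed with E1107'"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```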
  25. HuggingFace Inference API
    The Hugging Face Playground is great for experimentation: https://huggingface.co/playground. Please read the privacy policy before sending data :) Also check out their Enterprise service.
  26. Cloud API Access – CLI Coding Agent in the IDE
    Good alternatives: Claude Code, Gemini CLI, Qwen3-Coder. Most have a decent "free" tier where you can tap into their SOTA models.
    Why it's great:
    • Opens in the VS Code terminal
    • Mention (with an @) entire files or folders from your workspace
    • Creates or modifies code files for you
  27. Cloud API Access – CLI Coding Agent in the IDE
    Prerequisite: install Node.js.
    Step 1: Open VS Code and install the Gemini CLI with npm: npm install -g @google/gemini-cli
    This command downloads the official CLI package and makes the gemini command available on your system.
    Step 2: Run the Gemini CLI: gemini
    On the first run, the CLI will guide you through a quick setup process.
  28. Navigating the Model Maze
    Where to find models:
    • Hugging Face Hub: largest repository (thousands of open SLMs).
    • Managed platform catalogs: curated models (Azure AI, Vertex AI, SageMaker).
    Key filtering criteria (Hugging Face example), weighed against your required performance:
    • Size: parameter range (<1B, 1-3B, 3-7B, 7-13B).
    • Format: GGUF (for local tools like Ollama), Safetensors (for Transformers/cloud).
    • License: CRITICAL! Filter by permissive (Apache, MIT) vs. restrictive.
    • Task: text-generation, summarization, classification, etc.
    • Instruct version if available!
    (A programmatic filtering sketch follows.)
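As a rough sketch of applying these filters programmatically, the snippet below queries the Hub with huggingface_hub. The exact tag strings (e.g. the GGUF and license tags) are assumptions and may need adjusting to the Hub's current tag names:

```python
# Sketch: search the Hugging Face Hub for small, permissively licensed, GGUF text-generation models.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter=["text-generation", "gguf", "license:apache-2.0"],  # tag names are assumptions
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(m.id)
```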
  29. Takeaways
    • Start with "what does good look like", then find the smallest model that can meet that requirement. It will be less expensive and better for the environment.
    • Leverage credible players from the SLM ecosystem (SmolLM, Mistral, Qwen, Gemma, Phi).
    • Ollama and LM Studio provide great engines for running SLMs on local hardware. Local deployment gives maximum privacy and works offline, but requires investment in equipment and maintenance.
    • Leverage the cloud to access a wide array of models securely, keeping data within your company's tenant. This requires setting up deployment pipelines and supporting any custom endpoints you create.
    • Cloud APIs are available like tap water, but beware of the provider's privacy policy, and of token pricing, which can grow rapidly for long contexts.