Magnum SLM Playbook - A beginner's guide to Small Language Models

Marketing OGZ
September 17, 2025

Transcript

  1. Unlocking AI Value with Small Language Models (SLMs)
    Think of an SLM as a pocket calculator perfectly suited for the job, vs. an LLM as a general-purpose supercomputer. https://huggingface.co/models
  2. Power versus Focus
    LLMs (Generalists):
    • Trained on vast, diverse data.
    • Broad world knowledge, attempt diverse tasks.
    • Aim for broad competence.
    SLMs (Often Specialists):
    • More focused training/fine-tuning possible.
    • Good at specific, well-defined tasks (e.g., log analysis, summarization, data extraction).
    • Can outperform LLMs on narrow domains, more efficiently.
  3. OpenAI's Agent Development Pattern
    OpenAI, "A practical guide to building agents": https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
    • Establish a baseline: Start with robust evaluations to measure performance.
    • Iterate on accuracy: Focus development on meeting specific accuracy targets.
    • Optimize for efficiency: Once accuracy is met, improve cost and latency by swapping in smaller, faster models where appropriate.
  4. NVIDIA Research Multi-Agent System Pattern
    NVIDIA Research advises a "heterogeneous model approach":
    • Cost-Effective Deployment: Use Small Language Models (SLMs) to reduce latency and infrastructure costs.
    • Modular Design: Employ SLMs for routine tasks and reserve large models for complex reasoning.
    • Rapid Specialization: Fine-tune agile SLMs for specific tasks to enable faster iteration.
    Belcak et al., "Small Language Models are the Future of Agentic AI" (NVIDIA Research, 2025). Source: https://research.nvidia.com/labs/lpr/slm-agents/
  5. Compelling Benefits Driving SLM Adoption
    • Efficiency & Speed: Fewer calculations = faster inference, lower energy use.
    • Cost-Effectiveness: Lower API costs, less demanding hardware needs.
    • Task-Specific Performance: Optimized models often achieve superior results on target tasks.
    • Accessibility: Run on standard hardware, lowering barriers to entry.
    • Fine-tuning Feasibility: Practical to adapt SLMs to proprietary data/niche tasks.
    • Deployment Flexibility: Enables diverse patterns (Cloud, Local, Edge).
  6. Choosing Your Deployment Pattern
    Three Primary Plays:
    • Cloud API Access (Easy, Pay-as-you-go)
    • Managed Cloud Platforms (Scalable, Integrated, MLOps)
    • Local Deployment (Max Privacy, Control, Offline)
  7. SLM Key Concepts: Quantization (Full Model vs. Q8)
    Q8 is roughly 4x smaller than the full-precision model, with little performance drop. It is often a sweet spot for inference endpoints.
  8. SLM Key Concepts: Quantization (Full Model vs. Q4)
    Q4 is roughly 8x smaller than the full-precision model and is often preferred for running locally or on edge devices. Fun fact: OpenAI's gpt-oss MoE models are natively Q4 quantized. (A quick size estimate is sketched below.)
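A rough back-of-the-envelope sketch of what these size ratios mean in practice. The bytes-per-weight figures are standard approximations, not numbers from the deck, and real quantized files carry some extra overhead:

```python
# Rough memory-footprint estimates for a dense model at different precisions.
# Assumptions: FP32 = 4 bytes/weight, FP16 = 2, Q8 ~= 1, Q4 ~= 0.5; overhead ignored.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}

def approx_size_gb(n_params_billion: float, precision: str) -> float:
    """Approximate on-disk / in-memory size in GB for the given precision."""
    return n_params_billion * 1e9 * BYTES_PER_WEIGHT[precision] / 1e9

for precision in ("fp32", "fp16", "q8", "q4"):
    print(f"7B model @ {precision}: ~{approx_size_gb(7, precision):.1f} GB")
# ~28 GB at fp32, ~7 GB at q8 (4x smaller), ~3.5 GB at q4 (8x smaller),
# which is why a 7B/8B model at 4/5-bit quant fits comfortably in ~16 GB of (V)RAM.
```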
  9. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  10. Local Deployment
    Most important advice: learn from the local LLM community at https://www.reddit.com/r/LocalLLaMA/
    • Hardware – (V)RAM is King:
      • ~16GB RAM: 7B/8B models (4/5-bit quant).
      • ~32GB RAM: ~13B models (good quant).
      • ~64GB+ RAM: needed for 30B+ models.
    • Hardware – GPU Acceleration: dramatically speeds up inference.
  11. Setting Up a Model with Ollama – the easy way
    • Find the model: Browse the official library.
    • Pull the model: Open your terminal and run ollama pull qwen3:1.7b
    • Run it: ollama run qwen3:1.7b. The model is now ready to use in any application connected to Ollama (one way to call it is sketched below).
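As a minimal sketch of "any application connected to Ollama", the snippet below calls the pulled model through Ollama's local REST API. The model tag matches the slide; the port is Ollama's default, and the prompt is a placeholder:

```python
# Minimal sketch: query a locally running Ollama model over its default REST endpoint.
# Assumes Ollama is running on localhost:11434 and qwen3:1.7b has been pulled.
import json
import urllib.request

payload = {
    "model": "qwen3:1.7b",
    "prompt": "Summarize the benefits of small language models in two sentences.",
    "stream": False,  # return a single JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```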
  12. Setting Up a Model with Ollama – vague naming
    Be mindful of which quantization and parameter count a model tag actually points to; the default tag is not always the variant you expect.
  13. Setting Up a Model with Ollama – the flexible way
    • Download a GGUF file from Hugging Face
    • Create a Modelfile
    • Create the model in Ollama: ollama create my-custom-model -f C:\path\to\Modelfile
    Modelfile:
    # 1. Base Model
    FROM C:\Path\unsloth-Qwen2.5-Coder-7B-Instruct-128K-GGUF\Qwen2.5-Coder-7B-Instruct-Q6_K.gguf
    # 2. Chat Template
    TEMPLATE """<|im_start|>system
    {{ .System }}<|im_end|>
    <|im_start|>user
    {{ .Prompt }}<|im_end|>
    <|im_start|>assistant
    """
    # 3. System Prompt
    SYSTEM """
    You are an …
    """
    # 4. Parameters
    # --- Generation Control ---
    PARAMETER temperature 0.7
    PARAMETER top_p 0.9
  14. Ollama – A practical use case: using Continue.dev
    1. Install Ollama
    2. If on Windows, configure it to use VRAM
    3. Open your IDE
  15. Ollama – A practical use case: using Continue.dev
    Continue.dev is a popular IDE coding assistant that supports running models locally with Ollama.
  16. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  17. Managed Cloud Platforms
    Examples: Azure AI Studio (+Azure OpenAI), Google Vertex AI, AWS SageMaker (+JumpStart).
    Pros: Robust scalability/reliability, seamless integration with other cloud services, MLOps tools, enterprise security/compliance.
    Cons: More complex setup than APIs, potentially higher costs (resource consumption).
    When to use: Enterprise apps needing MLOps, scalability, and security; integrating with existing cloud infra; when a unified development/deployment platform is preferred.
  18. Managed Cloud Platforms – Databricks Serverless Deployment
    Our strategy:
    1. Use the Hugging Face CLI to download the model into a UC Volume
    2. Create a Python wrapper
    3. Declare it in Unity Catalog
    4. Create a serverless model endpoint
  19. Databricks Serverless Deployment – Step 1: Download
    The download settings shown on the slide must be configured exactly as shown, or the files won't be written into the UC Volume. (A Python sketch of the download is shown below.)
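The slide drives the download through the Hugging Face CLI; a roughly equivalent Python sketch using huggingface_hub follows. The repo id, the Volume path, and the symlink flag are assumptions (UC Volumes generally don't support symlinks), not settings taken from the slide:

```python
# Sketch of step 1: download model weights into a Unity Catalog Volume.
# Repo id and volume path are placeholders; adjust to your catalog/schema/volume.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-1.7B",                       # model family used elsewhere in the deck
    local_dir="/Volumes/main/default/models/qwen3",  # hypothetical UC Volume path
    local_dir_use_symlinks=False,                    # write real files; Volumes don't support symlinks
)
```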
  20. Databricks Serverless Deployment – Step 2: Create a Python Wrapper
    This script defines an MLflow PyFunc wrapper for serving the Qwen3 model, handling model loading and text generation. The transformers library does the heavy lifting: loading the Qwen3 model and its tokenizer, preparing input text (tokenization and chat templating), running generation, and decoding the output. (A sketch of such a wrapper follows.)
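The deck shows the wrapper as a screenshot; the sketch below is an assumed reconstruction of the same idea (an mlflow.pyfunc.PythonModel loading Qwen3 with transformers), not the author's actual code. Artifact names and generation parameters are placeholders:

```python
# Sketch of an MLflow PyFunc wrapper around a Qwen3 checkpoint (placeholder names throughout).
import mlflow.pyfunc
from transformers import AutoModelForCausalLM, AutoTokenizer

class Qwen3Wrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # "model_path" is an artifact key registered when logging the model (see step 3).
        path = context.artifacts["model_path"]
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

    def predict(self, context, model_input):
        prompt = model_input["prompt"][0]
        # Chat templating: wrap the raw prompt in the model's chat format.
        text = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
```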
  21. Databricks Serverless Deployment – Step 3: Declare into UC
    Import and use the wrapper, and get your dependencies right. After running, the model will be available in Unity Catalog. (A sketch of the registration call follows.)
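A minimal sketch of what "declare into UC" could look like with MLflow's Unity Catalog registry. The catalog, schema, and dependency pins are assumptions, and the wrapper class is the one sketched in step 2:

```python
# Sketch of step 3: log the PyFunc wrapper and register it in Unity Catalog.
import mlflow

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog rather than the workspace registry

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="qwen3_slm",
        python_model=Qwen3Wrapper(),  # wrapper from step 2
        artifacts={"model_path": "/Volumes/main/default/models/qwen3"},  # weights downloaded in step 1
        pip_requirements=["mlflow", "transformers", "torch", "accelerate"],  # "get your dependencies right"
        registered_model_name="main.default.qwen3_slm",  # hypothetical catalog.schema.model name
    )
```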
  22. Databricks Serverless Deployment – Step 4: Deploy
    Size the endpoint according to the model (approx. model size at FP16):
    • CPU – no GPU, system RAM only: under ~1 billion parameters
    • GPU Small – 1x NVIDIA T4, 16 GB VRAM: ~7 billion parameters
    • GPU Medium – 1x NVIDIA A10G, 24 GB VRAM: ~10-13 billion parameters
    • GPU Medium x4 – 4x NVIDIA A10G, 96 GB VRAM: ~40-45 billion parameters
    • GPU Medium x8 – 8x NVIDIA A10G, 192 GB VRAM: ~70-85 billion parameters
  23. Which Deployment Strategy Fits Your Needs?
    • Cloud API Access: easy to use; pay-per-token; data handling depends on the provider; low control; high scalability; suited to prototyping and web interfaces.
    • Managed Cloud Platforms: medium difficulty; pay-per-compute; data stays in your tenant; moderate control; ops required; suited to enterprise apps and MLOps.
    • Local Deployment: difficult; you pay for hardware and electricity; data stays offline; high control; manual scaling; suited to sensitive data and offline analysis.
  24. Cloud API Access
    Concept: Use third-party hosted SLMs via simple HTTP requests. Example: the Hugging Face Inference API (a call is sketched below).
    Pros: Minimal setup, rapid prototyping, huge model variety, automatic scaling, pay-per-use.
    Cons: Costly at high volume, network latency, provider dependence, data privacy concerns (data leaves your environment).
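A minimal sketch of such a hosted call via the huggingface_hub client. The model id, token, and prompt are placeholders, and availability of any particular model on the serverless Inference API is an assumption:

```python
# Sketch: call a hosted SLM through the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen3-1.7B", token="hf_...")  # placeholder model id and token
response = client.chat_completion(
    messages=[{"role": "user", "content": "Extract the error code from: 'job 42 failed with E1107'"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```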
  25. HuggingFace Inference API
    The Hugging Face Playground is great for experimentation: https://huggingface.co/playground. Please read the privacy policy before sending data :) Also check out their Enterprise service.
  26. Cloud API Access – CLI Coding Agent in the IDE
    Good alternatives: Claude Code, Gemini CLI, Qwen3-Coder. Most have a decent "free" tier where you can tap into their SOTA models.
    Why it's great:
    • Opens in the VS Code terminal
    • Mention (with an @) entire files or folders from your workspace
    • Creates or modifies code files for you
  27. Cloud API Access – CLI Coding Agent in the IDE
    Prerequisite: install Node.js.
    Step 1: Open VS Code and install the Gemini CLI with npm: npm install -g @google/gemini-cli
    This command downloads the official CLI package and makes the gemini command available on your system.
    Step 2: Run the Gemini CLI: gemini
    On the first run, the CLI will guide you through a quick setup process.
  28. Navigating the Model Maze
    Where to find models:
    • Hugging Face Hub: largest repository (thousands of open SLMs).
    • Managed platform catalogs: curated models (Azure AI, Vertex AI, SageMaker).
    Key filtering criteria (Hugging Face example), weighed against your required performance:
    • Size: parameter range (<1B, 1-3B, 3-7B, 7-13B).
    • Format: GGUF (for local tools like Ollama), Safetensors (for Transformers/cloud).
    • License: CRITICAL! Filter by permissive (Apache, MIT) vs. restrictive.
    • Task: text-generation, summarization, classification, etc.
    • Instruct version if available!
    (A programmatic filtering sketch follows.)
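As a rough sketch of applying these filters programmatically, the snippet below queries the Hub with huggingface_hub. The exact tag strings (e.g. the GGUF and license tags) are assumptions and may need adjusting to the Hub's current tag names:

```python
# Sketch: search the Hugging Face Hub for small, permissively licensed, GGUF text-generation models.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter=["text-generation", "gguf", "license:apache-2.0"],  # tag names are assumptions
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(m.id)
```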
  29. Takeaways
    • Start with "what does good look like", then find the smallest model that can meet that requirement. It will be less expensive and better for the environment.
    • Leverage credible players from the SLM ecosystem (SmolLM, Mistral, Qwen, Gemma, Phi).
    • Ollama and LM Studio provide great engines for running SLMs on local hardware. Local deployment gives maximum privacy and works offline, but requires investment in equipment and maintenance.
    • Leverage the cloud to access a wide array of models securely, keeping data within your company's tenant. This requires setting up deployment pipelines and supporting any custom endpoints you create.
    • Cloud APIs are available like tap water, but beware of the provider's privacy policy, and of token pricing, which can grow rapidly for long contexts.