Al Goes Local - Why the Future of Intelligent Software Runs On-Device

Al Goes Local think Why the Future of Intelligent Software
Runs On-Device tecture The Journey The Shift The Recipe The Asset The Boundary The Reach • • • • • Cloud Al lost its Five small models. Models and data Local first. Workstation. Laptop. monopoly. One agent. you own. Cloud optional. Browser. Phone. No magic. 4

Runs On-Device tecture Al-enabled solutions (typically) are 10% Al. 100% software • • eng1neer1ng. 5

Al Goes Local Why the Future of Intelligent Software Runs
On-Device PART 1 THE SHIFT Cloud Al lost its monopoly - data has gravity, compute is following. think tecture

Runs On-Device tecture What I Mean by 11 Al11 Models How many Where it runs WHEN I SAY "Al" TODAY Small. 1-4 B parameters. Specialised - each does one job. I DO NOT MEAN I Cloud-frontier: GPT-5, Claude, Gemini. Five, working together. Router· tool-caller· I One giant model, one big API call. embedder • vision • synthesis. On the laptop on stage. On your phone. In your browser. I On someone else's G PUs. What it does I Routes, retrieves, generates, sees, listens. I Replaces engineering judgment. Every demo today is a proof for the left column. 7

Runs On-Device tecture Six Forces Driving Local Al • Privacy - Your data doesn 1 t leave the room. ~ Regulations - GDPR, HIPAA, DORA, Al Act. A Reliability - No outages. No rate limits. No deprecations. [eo 0 ] Latency - Sub-second on a laptop. Cloud round-trip alone is 1-2 s. [--•] Cost - Tokens are metered infrastructure. Local puts you back in control. ffl Energy - On-device inference is efficient. NPUs sip; data centers gulp. 8

Al Goes Local o ware Runs On-Device e uture of
Intelligent S ft tecture DEMO rolDoi ... ·•: - --- #1 Forces - illustrated Three c/ai o ms. ne query. Watch. 9

Runs On-Device tecture Why Now? HARDWARE Consumer GPUs & NPUs and unified memory ship. $5k beats what $50k bought in 2023. /v/5 /vlax • RTX PRO 6000 • OGX Spark • Strix Halo • Qualcomm X Elite (more coming) MODELS Quality-per-parameter lOx'd. Sub-4B (partially FT-ed) reaches what 70B did 18 months ago. Gemma 4 • Qwen 3.5/3.6 • LF/vl 2.5 • /vlistral Small· GL/vl (and more) Three things flipped at once. ENGINES Highly optimized code. Various hardware backends. Quantization for models & inference. 1/ama.cpp • /v/LX • Transformersjs • LEAP SOK (et. al.) And they'll keep flipping. Bet on the architecture! 10

Runs On-Device tecture The Building Blocks LLM LARGE Generative text - Cloud frontier models GPT-5 •Claude• Gemini VLM Multimodal - text + image input Gemma 3 Vision• Glfvf-V • LFfvf-VL STT ' Speech-to-text transcription Whisper • Parakeet OCR !:,<If> ~ Document image ➔ text Glfvf-OCR • Tesseract SLM SMALL IP ~ Generative text - runs on your hardware Gemma 1-48 • Qwen 48 • LFfvf 1.28 Embedding model L}'~ Text ➔ vectors for semantic similarity text-embedding-3-/arge • EmbeddingGemma TTS ,a' Text-to-speech synthesis Piper • Kokoro All seven run locally today. 11

On-Device PART 2 THE RECIPE Five small models. One agent. No magic. think tecture 12

Runs On-Device tecture This is 1Nextera 1 A fictitious Saas analytics product - running on five small local models. 13

Runs On-Device tecture DEMO #2 The Agent - layer by layer ... Seven capabilities. Five small models. One laptop. 14

Runs On-Device tecture Seven Capabilities 10 10 10 10 10 10 10 </> ~ Router LogReg on EmbeddingGemma • intent in ~10 ms • zero LLM calls Tool calling Qwen 3.5 48 FT· structured JSON: tool + args • WHAT, not HOW Scaffolding SQLite + Python calculator· deterministic execution • 1 ms Multi-step Decompose ➔ execute ➔ concretize ➔ calculate ➔ synthesize RAG EmbeddingGemma vector search· gemma3 48 synthesis Vision GLM-OCR 0.98 reads pages· pypdf for text· smart hybrid Orchestration Six layers, four models, one trace - already running on one laptop 15

# Real components — the deterministic 35% intent_classifier_logreg.LogRegIntentClassifier # embed
+ LogReg, ~10 ms total query_decomposer.MULTI_STEP_PATTERNS # 16 regex — one step or two? query_decomposer.decompose # JSON parse fallback (line 130) intent_classifier.looks_like_injection # 30 regex — prompt-injection guard sqlite3.execute # 1 ms — no model beats stdlib

Runs On-Device tecture The 5-Model Stack ▪ EmbeddingGemma FT - Intent via LogReg + Semantic search - 308M ▪ Gemma3 FT - Query rewrite, tool-result synthesis - 18 ▪ Qwen 3.5 FT - Tool calling, SQL - 48 ▪ Gemma3 - Vision + RAG document synthesis - 48 ▪ GLM-OCR - Document reading - 0.98 Total: -10.2 billion parameters. Less than GPT-3 was in 2020. 17

embeddings.search(query) tools=[sql_query, calc] plan → step → step → answer
sql_builder.py · regex prompts/intent.md scenarios/nextera.json

On-Device PART 3 THE ASSET Models and data you own. think tecture 19

# finetune/train_qwen35_toolcalling.py peft_config = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05,
task_type="CAUSAL_LM", ) # 1,372 examples · 3 epochs · ~6 min on RTX

Base Model Servers Fine-Tuned Servers gemma3-1b-it :9090 qwen3.5-4b :9091 embeddinggemma-308m
:9092 gemma3-4b-vision :9093 gemma3-ft :9094 qwen3.5-4b-ft :9095 embeddinggemma-ft :9096 SmallLanguageModelClient POST /models/swap ~100ms swap time. No restart. Health-check before flip. Vision model shared across modes. GLM-OCR :9098 Upload-time only. Not part of swap. swap_urls() mode: base mode: base mode: base mode: finetuned mode: finetuned mode: finetuned shared (no FT variant) curl -X POST :8000/models/swap -d '{"mode":"ft"}'

Al Goes Local Why the Future of lntelli en re
Runs On-Device DEMO #3 The Lobotom Without fine-tu • Y ntng _it. ;ust doesn't work think tecture 22

Runs On-Device tecture What Fine-Tuning Bought Base Fine-tuned Intent c/assifrcation accuracy 5.6% ➔ 93.3% - 5.6% 93.3% Same model. 1,878 fine-tuning examples. ~2 min on RTX Then we replaced the FT model with LogReg - same data, no LLM call. 97.2%. 23

Runs On-Device tecture Three Levers: Model· Data· Code Domain knowledge lives in three places. Fine-tuning is just the model lever. Pick the right model functiongemma 270M FT ➔ Qwen 3.5 4B FT v8 Tool routing (160q) Fix the data 47 corrupted• 134 relabeled• ~10% of the training set RAG ground-truth, Nextera (BOq) Old New Before After 92.5% 99.4% 65.0% 92.0% The third lever - deterministic code - you saw live walking the agent. Knowing which to pull is the engineering. +6.9 pp 12 x fewer errors +27pp 24

// scenarios/nextera.json (excerpt) { "name": "nextera", "language": "en", "paths": {
"db": "./data/business.db", … }, "models": { "inference_ft": "gemma3-ft-nextera.gguf", … }, "sql": { "allowed_tables": ["products", "customers", "sales", "competit } // scenarios/logistics.json (excerpt) { "name": "logistics", "language": "de", "paths": { "db": "./data/logistics.db", … }, "models": { "inference_ft": "gemma3-ft-logistics.gguf", … }, "sql": { "allowed_tables": ["einheiten", "ausrüstung", "ersatzteile", " }

Runs On-Device tecture Six Industries. One Architecture. I Healthcare - clinical PII, never leaves the device I Finance - structured extraction, audit-grade determinism I Manufacturing - voice on the neld tablet, no signal needed I Public sector - sovereignty by construction I Automotive - in-car assistants without the latency tax I Energy - control rooms, no cloud roundtrip Same pipeline. Same evals. Your domain. The engineering outlives the models. 26

On-Device PART 4 THE BOUNDARY Local first. Cloud optional. think tecture 27

Runs On-Device tecture When Local. When Cloud. Pick local when Data is regulated, private, or owned by your customer Latency matters - sub-second response Volume is sustained - not occasional, exploratory bursts Determinism is required - bit-identical reruns Network can't be guaranteed Cloud is the right answer when A frontier-only capability is the actual unlock Privacy and latency aren't your constraints Volume is too small to justify hardware The task is genuinely one-off or exploratory You need long-context - lM+ tokens Hybrid is the rule, not the exception. 28

Local Path POST /query agent.process(query) Full local pipeline AgentResponse ConfidenceRouter
8-factor heuristic scoring score >= 0.6? yes no Return response confidence: 0.85 Return response + should_escalate: true + confidence: 0.42 Observatory UI shows escalation banner Cloud Escalation (HITL) User clicks "Escalate to GPT-5.4"? no yes Keep local response POST /escalate network online + key configured? no yes Blocked 403 / 503 GPT-5.4 API call Data leaves machine cloud_bytes_sent += payload Return cloud response + model badge + latency_ms + cost

Runs On-Device tecture DEMO #4 Local vs. Cloud Same prompt. Several engines. Read the bytes. 30

On-Device PART 5 THE REACH Workstation. Laptop. Browser. Phone. think tecture 31

Runs On-Device tecture Four Machines, One Codebase e.g. "How many customers do we have?"· Nextera • pSO end-to-end NVIDIA RTX PRO 6000 Apple MacBook Pro MS Max NVIDIA DGX Spark AMD Ryzen Al Max+ 395 (Strix Halo) 96 GB VRAM · Ubuntu - 128 GB unified· macOS 128 GB unified· Ubuntu (arm64) 128 GB unified· Fedora Four machines. Four architectures. One codebase. 465 ms 1,121 ms 2,315 ms 2,400 ms 32

Runs On-Device tecture ••• - DEMO #5 Browser - zero server, zero network No bockend No AP/ key No cloud 33

Runs On-Device tecture DEMO #6 - iPhone - completely offline, in my pocket Airplane mode. Private data. 1.28 model. 34

On-Device CLOSING YOUR MOVE The numbers. The personas. think tecture 35

Runs On-Device tecture The Numbers Revenue analysis • Customer lookups • OCR • Voice • PDFs • Phone offline • Browser zero-server 10.2° total params • less than GPT-3 in 2020 streaming • in a browser tab 4 platforms • Workstation • Laptop • Browser • Phone params • running on a phone ._.1Qms intent classification • feels instant 0 bytes silently sent· across the session 36

llama-server

Al Goes Local o ware Runs On-Device e uture of
Intelligent S ft tecture DEMO roUDoi ... ·•: - --- #7 One more thing ... 'Your' domain in a couple of seconds. 38

Al Goes Local - Why the Future of Intelligent S...

Al Goes Local - Why the Future of Intelligent Software Runs On-Device

More Decks by Christian Weyer

Other Decks in Programming

Featured

Transcript