Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Al Goes Local - Why the Future of Intelligent S...

Sponsored · Your Podcast. Everywhere. Effortlessly. Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.

Al Goes Local - Why the Future of Intelligent Software Runs On-Device

Generative AI has transformed how we think about building software – but the next major shift is already underway: intelligence is moving out of the cloud and onto our own devices. Across industries such as healthcare, manufacturing, automotive, finance, energy and the public sector, organisations are discovering that cloud-dependent AI cannot meet critical requirements around privacy, latency, reliability, regulation or cost. At the same time, the economics and physics of computation are shifting: local inference reduces operational cost, avoids network round-trips, is dramatically more energy-efficient, and aligns with the natural principle of data gravity – processing data where it is created instead of continuously shipping it elsewhere.

After years shaped by cloud-centric AI from OpenAI, Microsoft, Google and Amazon, the industry is now shifting toward on-device intelligence – powered by hardware from Apple, Qualcomm, Intel, AMD and NVIDIA, and by the corresponding local inference runtimes. Meanwhile, modern Small Language Models, Vision-Language Models, multimodal systems and specialised AI agents have become efficient enough to run locally on servers, desktops, laptops, phones, browsers and even edge hardware – enabled by a new hardware renaissance of GPUs, NPUs, unified memory architectures and optimised runtimes. Local AI is steadily becoming the technical baseline for intelligent, domain-specific applications.

This keynote explores why this shift is happening now – and what it means for developers and architects. Christian will show how local AI delivers fast response times, offline resilience and true data sovereignty; how hybrid local–cloud architectures are evolving to combine on-device intelligence with cloud-scale capabilities; and how lightweight fine-tuning and model adaptation techniques enable teams to specialise models for their own domains, workflows and compliance needs – often directly on their own hardware. He also highlights how Local AI brings back model ownership and lifecycle control, allowing teams to treat models as part of their core engineering assets rather than external APIs. The result is AI that finally fits the real-world constraints of vertical industries instead of forcing them to adapt to cloud limitations.

With practical examples, architectural clarity and a forward-looking perspective, Christian presents a grounded vision of the emerging Post-Cloud era of AI – one where intelligence runs where data is created, where systems remain robust even offline, where regulatory demands are met by design, where cost and energy consumption become sustainable, and where developers regain the power to build truly intelligent and sovereign software systems.

Avatar for Christian Weyer

Christian Weyer PRO

May 12, 2026

More Decks by Christian Weyer

Other Decks in Programming

Transcript

  1. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture The Journey The Shift The Recipe The Asset The Boundary The Reach • • • • • Cloud Al lost its Five small models. Models and data Local first. Workstation. Laptop. monopoly. One agent. you own. Cloud optional. Browser. Phone. No magic. 4
  2. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Al-enabled solutions (typically) are 10% Al. 100% software • • eng1neer1ng. 5
  3. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device PART 1 THE SHIFT Cloud Al lost its monopoly - data has gravity, compute is following. think tecture
  4. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture What I Mean by 11 Al11 Models How many Where it runs WHEN I SAY "Al" TODAY Small. 1-4 B parameters. Specialised - each does one job. I DO NOT MEAN I Cloud-frontier: GPT-5, Claude, Gemini. Five, working together. Router· tool-caller· I One giant model, one big API call. embedder • vision • synthesis. On the laptop on stage. On your phone. In your browser. I On someone else's G PUs. What it does I Routes, retrieves, generates, sees, listens. I Replaces engineering judgment. Every demo today is a proof for the left column. 7
  5. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Six Forces Driving Local Al • Privacy - Your data doesn 1 t leave the room. ~ Regulations - GDPR, HIPAA, DORA, Al Act. A Reliability - No outages. No rate limits. No deprecations. [eo 0 ] Latency - Sub-second on a laptop. Cloud round-trip alone is 1-2 s. [--•] Cost - Tokens are metered infrastructure. Local puts you back in control. ffl Energy - On-device inference is efficient. NPUs sip; data centers gulp. 8
  6. Al Goes Local o ware Runs On-Device e uture of

    Intelligent S ft tecture DEMO rolDoi ... ·•: - --- #1 Forces - illustrated Three c/ai o ms. ne query. Watch. 9
  7. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Why Now? HARDWARE Consumer GPUs & NPUs and unified memory ship. $5k beats what $50k bought in 2023. /v/5 /vlax • RTX PRO 6000 • OGX Spark • Strix Halo • Qualcomm X Elite (more coming) MODELS Quality-per-parameter lOx'd. Sub-4B (partially FT-ed) reaches what 70B did 18 months ago. Gemma 4 • Qwen 3.5/3.6 • LF/vl 2.5 • /vlistral Small· GL/vl (and more) Three things flipped at once. ENGINES Highly optimized code. Various hardware backends. Quantization for models & inference. 1/ama.cpp • /v/LX • Transformersjs • LEAP SOK (et. al.) And they'll keep flipping. Bet on the architecture! 10
  8. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture The Building Blocks LLM LARGE Generative text - Cloud frontier models GPT-5 •Claude• Gemini VLM Multimodal - text + image input Gemma 3 Vision• Glfvf-V • LFfvf-VL STT ' Speech-to-text transcription Whisper • Parakeet OCR !:,<If> ~ Document image ➔ text Glfvf-OCR • Tesseract SLM SMALL IP ~ Generative text - runs on your hardware Gemma 1-48 • Qwen 48 • LFfvf 1.28 Embedding model L}'~ Text ➔ vectors for semantic similarity text-embedding-3-/arge • EmbeddingGemma TTS ,a' Text-to-speech synthesis Piper • Kokoro All seven run locally today. 11
  9. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device PART 2 THE RECIPE Five small models. One agent. No magic. think tecture 12
  10. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture This is 1Nextera 1 A fictitious Saas analytics product - running on five small local models. 13
  11. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture DEMO #2 The Agent - layer by layer ... Seven capabilities. Five small models. One laptop. 14
  12. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Seven Capabilities 10 10 10 10 10 10 10 </> ~ Router LogReg on EmbeddingGemma • intent in ~10 ms • zero LLM calls Tool calling Qwen 3.5 48 FT· structured JSON: tool + args • WHAT, not HOW Scaffolding SQLite + Python calculator· deterministic execution • 1 ms Multi-step Decompose ➔ execute ➔ concretize ➔ calculate ➔ synthesize RAG EmbeddingGemma vector search· gemma3 48 synthesis Vision GLM-OCR 0.98 reads pages· pypdf for text· smart hybrid Orchestration Six layers, four models, one trace - already running on one laptop 15
  13. # Real components — the deterministic 35% intent_classifier_logreg.LogRegIntentClassifier # embed

    + LogReg, ~10 ms total query_decomposer.MULTI_STEP_PATTERNS # 16 regex — one step or two? query_decomposer.decompose # JSON parse fallback (line 130) intent_classifier.looks_like_injection # 30 regex — prompt-injection guard sqlite3.execute # 1 ms — no model beats stdlib
  14. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture The 5-Model Stack ▪ EmbeddingGemma FT - Intent via LogReg + Semantic search - 308M ▪ Gemma3 FT - Query rewrite, tool-result synthesis - 18 ▪ Qwen 3.5 FT - Tool calling, SQL - 48 ▪ Gemma3 - Vision + RAG document synthesis - 48 ▪ GLM-OCR - Document reading - 0.98 Total: -10.2 billion parameters. Less than GPT-3 was in 2020. 17
  15. embeddings.search(query) tools=[sql_query, calc] plan → step → step → answer

    sql_builder.py · regex prompts/intent.md scenarios/nextera.json
  16. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device PART 3 THE ASSET Models and data you own. think tecture 19
  17. Base Model Servers Fine-Tuned Servers gemma3-1b-it :9090 qwen3.5-4b :9091 embeddinggemma-308m

    :9092 gemma3-4b-vision :9093 gemma3-ft :9094 qwen3.5-4b-ft :9095 embeddinggemma-ft :9096 SmallLanguageModelClient POST /models/swap ~100ms swap time. No restart. Health-check before flip. Vision model shared across modes. GLM-OCR :9098 Upload-time only. Not part of swap. swap_urls() mode: base mode: base mode: base mode: finetuned mode: finetuned mode: finetuned shared (no FT variant) curl -X POST :8000/models/swap -d '{"mode":"ft"}'
  18. Al Goes Local Why the Future of lntelli en re

    Runs On-Device DEMO #3 The Lobotom Without fine-tu • Y ntng _it. ;ust doesn't work think tecture 22
  19. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture What Fine-Tuning Bought Base Fine-tuned Intent c/assifrcation accuracy 5.6% ➔ 93.3% - 5.6% 93.3% Same model. 1,878 fine-tuning examples. ~2 min on RTX Then we replaced the FT model with LogReg - same data, no LLM call. 97.2%. 23
  20. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Three Levers: Model· Data· Code Domain knowledge lives in three places. Fine-tuning is just the model lever. Pick the right model functiongemma 270M FT ➔ Qwen 3.5 4B FT v8 Tool routing (160q) Fix the data 47 corrupted• 134 relabeled• ~10% of the training set RAG ground-truth, Nextera (BOq) Old New Before After 92.5% 99.4% 65.0% 92.0% The third lever - deterministic code - you saw live walking the agent. Knowing which to pull is the engineering. +6.9 pp 12 x fewer errors +27pp 24
  21. // scenarios/nextera.json (excerpt) { "name": "nextera", "language": "en", "paths": {

    "db": "./data/business.db", … }, "models": { "inference_ft": "gemma3-ft-nextera.gguf", … }, "sql": { "allowed_tables": ["products", "customers", "sales", "competit } // scenarios/logistics.json (excerpt) { "name": "logistics", "language": "de", "paths": { "db": "./data/logistics.db", … }, "models": { "inference_ft": "gemma3-ft-logistics.gguf", … }, "sql": { "allowed_tables": ["einheiten", "ausrüstung", "ersatzteile", " }
  22. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Six Industries. One Architecture. I Healthcare - clinical PII, never leaves the device I Finance - structured extraction, audit-grade determinism I Manufacturing - voice on the neld tablet, no signal needed I Public sector - sovereignty by construction I Automotive - in-car assistants without the latency tax I Energy - control rooms, no cloud roundtrip Same pipeline. Same evals. Your domain. The engineering outlives the models. 26
  23. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device PART 4 THE BOUNDARY Local first. Cloud optional. think tecture 27
  24. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture When Local. When Cloud. Pick local when Data is regulated, private, or owned by your customer Latency matters - sub-second response Volume is sustained - not occasional, exploratory bursts Determinism is required - bit-identical reruns Network can't be guaranteed Cloud is the right answer when A frontier-only capability is the actual unlock Privacy and latency aren't your constraints Volume is too small to justify hardware The task is genuinely one-off or exploratory You need long-context - lM+ tokens Hybrid is the rule, not the exception. 28
  25. Local Path POST /query agent.process(query) Full local pipeline AgentResponse ConfidenceRouter

    8-factor heuristic scoring score >= 0.6? yes no Return response confidence: 0.85 Return response + should_escalate: true + confidence: 0.42 Observatory UI shows escalation banner Cloud Escalation (HITL) User clicks "Escalate to GPT-5.4"? no yes Keep local response POST /escalate network online + key configured? no yes Blocked 403 / 503 GPT-5.4 API call Data leaves machine cloud_bytes_sent += payload Return cloud response + model badge + latency_ms + cost
  26. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture DEMO #4 Local vs. Cloud Same prompt. Several engines. Read the bytes. 30
  27. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device PART 5 THE REACH Workstation. Laptop. Browser. Phone. think tecture 31
  28. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture Four Machines, One Codebase e.g. "How many customers do we have?"· Nextera • pSO end-to-end NVIDIA RTX PRO 6000 Apple MacBook Pro MS Max NVIDIA DGX Spark AMD Ryzen Al Max+ 395 (Strix Halo) 96 GB VRAM · Ubuntu - 128 GB unified· macOS 128 GB unified· Ubuntu (arm64) 128 GB unified· Fedora Four machines. Four architectures. One codebase. 465 ms 1,121 ms 2,315 ms 2,400 ms 32
  29. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture ••• - DEMO #5 Browser - zero server, zero network No bockend No AP/ key No cloud 33
  30. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture DEMO #6 - iPhone - completely offline, in my pocket Airplane mode. Private data. 1.28 model. 34
  31. Al Goes Local Why the Future of Intelligent Software Runs

    On-Device CLOSING YOUR MOVE The numbers. The personas. think tecture 35
  32. Al Goes Local think Why the Future of Intelligent Software

    Runs On-Device tecture The Numbers Revenue analysis • Customer lookups • OCR • Voice • PDFs • Phone offline • Browser zero-server 10.2° total params • less than GPT-3 in 2020 streaming • in a browser tab 4 platforms • Workstation • Laptop • Browser • Phone params • running on a phone ._.1Qms intent classification • feels instant 0 bytes silently sent· across the session 36
  33. Al Goes Local o ware Runs On-Device e uture of

    Intelligent S ft tecture DEMO roUDoi ... ·•: - --- #7 One more thing ... 'Your' domain in a couple of seconds. 38