Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Local Models for Coding
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Eugene Oskin
May 18, 2026
Programming
7
0
Share
Local Models for Coding
Eugene Oskin
May 18, 2026
More Decks by Eugene Oskin
See All by Eugene Oskin
REST API. Django, Ruby on Rails, Play! Framework
evgeneoskin
0
96
Introduction to gRPC
evgeneoskin
0
110
GrailInventory – Advanced Backend Development
evgeneoskin
0
43
Bracing Calculator
evgeneoskin
1
72
emotional intelligence, part 2
evgeneoskin
0
44
Office temperature
evgeneoskin
0
40
Parse platform
evgeneoskin
0
110
Hubot
evgeneoskin
0
56
An introduction to iOS development
evgeneoskin
0
45
Other Decks in Programming
See All in Programming
Are We Really Coding 10× Faster with AI?
kohzas
0
190
AI Agent と正しく分析するための環境作り
yoshyum
2
520
次世代リンターで探る、tsgo 時代における型認識カスタムルールの現実解
ytakahashii
0
110
20260514_its_the_context_window_stupid.pdf
heita
0
1k
Skillは並べた。動かなかった。契約で繋いだ。— 65個のSkillから、自走する開発サイクルへ
junholee
0
620
When benchmarks go bad - what I learned from measuring performance wrong
hollycummins
0
390
サプライチェーン攻撃対策「層を重ねて落ちない壁」を10日間で組み上げた話 #TechLeadConf2026
kashewnuts
1
290
Firefoxにコントリビューションして得られた学び
ken7253
2
160
WebAssembly を読み込むベストプラクティス 2026年春版 / Best Practices for Loading WebAssembly (Spring 2026)
petamoriken
5
1.1k
Migrations : C'est une question d'hygiène !
vinceamstoutz
0
620
ソースコード→AST→オペコード、の旅を覗いてみる
o0h
PRO
1
130
Agentic UI beyond Chats Architecture Patterns & Open Standards @ngMunich 05/2026
manfredsteyer
PRO
0
100
Featured
See All Featured
Collaborative Software Design: How to facilitate domain modelling decisions
baasie
1
220
Organizational Design Perspectives: An Ontology of Organizational Design Elements
kimpetersen
PRO
1
690
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
120
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.8k
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
380
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
1
300
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
10
1.2k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
560
Typedesign – Prime Four
hannesfritz
42
3k
Data-driven link building: lessons from a $708K investment (BrightonSEO talk)
szymonslowik
1
1.1k
Skip the Path - Find Your Career Trail
mkilby
1
120
Primal Persuasion: How to Engage the Brain for Learning That Lasts
tmiket
0
340
Transcript
Local Models for Coding 1
Who am I • Local AI enthusiast running local models
on my MacBook • Building iowise.dev • ex-CTO of Termius SSH Client • 14 years working in software • Building backend, frontend, iOS, Android, and embedded
Running Models 3
OPENAI_API_BASE="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1"
OPENAI_API_KEY=lm - studio codex - - oss - - model qwopus3.6-35b - a3b - v1-mlx ANTHROPIC_BASE_URL="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1" ANTHROPIC_AUTH_TOKEN=lm - studio ANTHROPIC_API_KEY="" claude - - model qwopus3.6-35b - a3b - v1-mlx Connect Your Agent • Add model: qwopus3.6-35b-a3b-v1-mlx • Add OpenAI API key: lm-studio • Add OpenAI URL: https://chivalry-confront-untried.ngrok-free.dev:1234/v1 4
Terms • Model • Dense vs MoE • Agents •
Quantization • KV Cache • Inference engines • Prefill • Distillation • Speculations
-A3B -35B Model Names Qwen3.6 Model Family Total Number of
Parameters -Q4_K_S.gguf 6
Models • Qwen 3.x • Nemotron 3 • Gemma 4
• Fine-tuned
Where Models Live huggingface.co 🤗 8
Selecting a model • Add hardware to 🤗 • Browse
• Give it a go • Find eval • Iterate
How to run a model • LM Studio – https://lmstudio.ai/
• Ollama – https://ollama.com/ • Llama.cpp – the cutting edge of inference
Memory Math Total memory ≈ parameters × bytes/parameter + 2
× num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead
Memory Math Total memory ≈ parameters × bytes/parameter + 2
× num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead Start simple with LM Studio model loading Guardrails
-A3B -35B Qwen3.6 Model Family Total Number of Parameters -Q4_K_S.gguf
Active parameters 13
Dense VS Mixture of experts • Dense – “smarter” and
“slower” • MoE – “faster” but can be “dummier”
Faster Inference 15
-A3B -35B Qwen3.6 Model Family Total Number of Parameters -Q4_K_S.gguf
Active parameters Quantization File format 16
Quantization • F32 – original • F16 – 2x smaller
• Q8_0 – 4x smaller, similar quality • Q6_K • Q5_K_S, Q5_K_M, etc • Q4_K_S, Q4_K_M, etc – 8x smaller, the edge of quality • Q3_K_S, Q3_K_M, Q3_K_L, etc • Q2_K • 1-bit models
Quantization • ⬆ Inference • ⬇ RAM • ⬇ Disk
space • ⬇ KV Cache
Distillation • The original distillation requires full probability distribution of
output tokens • Modern distillation: training on synthetic data of a smart model
KV Cache It fixes a flaw in the LLM computation
model 20
KV Cache It fixes a flaw in the LLM computation
model 21
KV Cache It fixes a flaw in the LLM computation
model 22
KV Cache It fixes a flaw in the LLM computation
model 23
Inference stages 24
Inference stages 25
Speculations – Probabilistic speedup 26
Coding Agents • OpenCode • Pi • Goose • Hermes
Agent • OpenClaw • OpenAI Codex • Claude Code
Do you want to hear more? 28
Links • https://bbycroft.net/llm • https://hfviewer.com/Qwen/Qwen3.6-35B-A3B-FP8 • https://hfviewer.com/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning- BF16 • http://hfviewer.com/nvidia/Nemotron-Cascade-2-30B-A3B
• https://huggingface.co/spaces/mlx-community/mlx-my-repo