Local Models for Coding

Local Models for Coding 1

Who am I • Local AI enthusiast running local models
on my MacBook • Building iowise.dev • ex-CTO of Termius SSH Client • 14 years working in software • Building backend, frontend, iOS, Android, and embedded

Running Models 3

OPENAI_API_BASE="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1"
OPENAI_API_KEY=lm - studio codex - - oss - - model qwopus3.6-35b - a3b - v1-mlx ANTHROPIC_BASE_URL="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1" ANTHROPIC_AUTH_TOKEN=lm - studio ANTHROPIC_API_KEY="" claude - - model qwopus3.6-35b - a3b - v1-mlx Connect Your Agent • Add model: qwopus3.6-35b-a3b-v1-mlx • Add OpenAI API key: lm-studio • Add OpenAI URL: https://chivalry-confront-untried.ngrok-free.dev:1234/v1 4

Terms • Model • Dense vs MoE • Agents •
Quantization • KV Cache • Inference engines • Prefill • Distillation • Speculations

-A3B -35B Model Names Qwen3.6 Model Family Total Number of
Parameters -Q4_K_S.gguf 6

Models • Qwen 3.x • Nemotron 3 • Gemma 4
• Fine-tuned

Where Models Live huggingface.co 🤗 8

Selecting a model • Add hardware to 🤗 • Browse
• Give it a go • Find eval • Iterate

How to run a model • LM Studio – https://lmstudio.ai/
• Ollama – https://ollama.com/ • Llama.cpp – the cutting edge of inference

Memory Math Total memory ≈ parameters × bytes/parameter + 2
× num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead

Memory Math Total memory ≈ parameters × bytes/parameter + 2
× num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead Start simple with LM Studio model loading Guardrails

-A3B -35B Qwen3.6 Model Family Total Number of Parameters -Q4_K_S.gguf
Active parameters 13

Dense VS Mixture of experts • Dense – “smarter” and
“slower” • MoE – “faster” but can be “dummier”

Faster Inference 15

-A3B -35B Qwen3.6 Model Family Total Number of Parameters -Q4_K_S.gguf
Active parameters Quantization File format 16

Quantization • F32 – original • F16 – 2x smaller
• Q8_0 – 4x smaller, similar quality • Q6_K • Q5_K_S, Q5_K_M, etc • Q4_K_S, Q4_K_M, etc – 8x smaller, the edge of quality • Q3_K_S, Q3_K_M, Q3_K_L, etc • Q2_K • 1-bit models

Quantization • ⬆ Inference • ⬇ RAM • ⬇ Disk
space • ⬇ KV Cache

Distillation • The original distillation requires full probability distribution of
output tokens • Modern distillation: training on synthetic data of a smart model

KV Cache It fixes a flaw in the LLM computation
model 20

model 21

model 22

model 23

Inference stages 24

Inference stages 25

Speculations – Probabilistic speedup 26

Coding Agents • OpenCode • Pi • Goose • Hermes
Agent • OpenClaw • OpenAI Codex • Claude Code

Do you want to hear more? 28

Links • https://bbycroft.net/llm • https://hfviewer.com/Qwen/Qwen3.6-35B-A3B-FP8 • https://hfviewer.com/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning- BF16 • http://hfviewer.com/nvidia/Nemotron-Cascade-2-30B-A3B
• https://huggingface.co/spaces/mlx-community/mlx-my-repo

Local Models for Coding

Local Models for Coding

Eugene Oskin

More Decks by Eugene Oskin

Other Decks in Programming

Featured

Transcript