Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Local Models for Coding

Sponsored · SiteGround - Reliable hosting with speed, security, and support you can count on.

Local Models for Coding

Avatar for Eugene Oskin

Eugene Oskin

May 18, 2026

More Decks by Eugene Oskin

Other Decks in Programming

Transcript

  1. Who am I • Local AI enthusiast running local models

    on my MacBook • Building iowise.dev • ex-CTO of Termius SSH Client • 14 years working in software • Building backend, frontend, iOS, Android, and embedded
  2. OPENAI_API_BASE="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1"

    OPENAI_API_KEY=lm - studio codex - - oss - - model qwopus3.6-35b - a3b - v1-mlx ANTHROPIC_BASE_URL="https: / / chivalry - confront - untried.ngrok - free.dev:1234/v1" ANTHROPIC_AUTH_TOKEN=lm - studio ANTHROPIC_API_KEY="" claude - - model qwopus3.6-35b - a3b - v1-mlx Connect Your Agent • Add model: qwopus3.6-35b-a3b-v1-mlx • Add OpenAI API key: lm-studio • Add OpenAI URL: https://chivalry-confront-untried.ngrok-free.dev:1234/v1 4
  3. Terms • Model • Dense vs MoE • Agents •

    Quantization • KV Cache • Inference engines • Prefill • Distillation • Speculations
  4. Selecting a model • Add hardware to 🤗 • Browse

    • Give it a go • Find eval • Iterate
  5. How to run a model • LM Studio – https://lmstudio.ai/

    • Ollama – https://ollama.com/ • Llama.cpp – the cutting edge of inference
  6. Memory Math Total memory ≈ parameters × bytes/parameter + 2

    × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead
  7. Memory Math Total memory ≈ parameters × bytes/parameter + 2

    × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element + runtime overhead Start simple with LM Studio model loading Guardrails
  8. Dense VS Mixture of experts • Dense – “smarter” and

    “slower” • MoE – “faster” but can be “dummier”
  9. -A3B -35B Qwen3.6 Model Family Total Number of Parameters -Q4_K_S.gguf

    Active parameters Quantization File format 16
  10. Quantization • F32 – original • F16 – 2x smaller

    • Q8_0 – 4x smaller, similar quality • Q6_K • Q5_K_S, Q5_K_M, etc • Q4_K_S, Q4_K_M, etc – 8x smaller, the edge of quality • Q3_K_S, Q3_K_M, Q3_K_L, etc • Q2_K • 1-bit models
  11. Distillation • The original distillation requires full probability distribution of

    output tokens • Modern distillation: training on synthetic data of a smart model
  12. Coding Agents • OpenCode • Pi • Goose • Hermes

    Agent • OpenClaw • OpenAI Codex • Claude Code