
JavaZone - Local Development in the AI Era

Kevin Dubois
September 04, 2025


Transcript

  1. Local Development in the AI Era. Kevin Dubois, Sr. Principal Developer Advocate, IBM. @kevindubois
  2. Kevin Dubois ★ Sr. Principal Developer Advocate at IBM ★ Java Champion ★ Technical Lead, CNCF DevEx TAG ★ From Belgium / Lives in Switzerland ★ English, Dutch, French, Italian. youtube.com/@thekevindubois linkedin.com/in/kevindubois github.com/kdubois @kevindubois.com
  3. Wait, so you can run (and use) your own LLMs… completely locally?
  4. Wait, so you can run (and use) your own LLMs… completely locally? And there are plenty of open source tools to do so?!
  5. Wait, so you can run (and use) your own LLMs… completely locally? And there are plenty of open source tools to do so?! But there are over 2 million models; which to pick?
  6. Wait, so you can run (and use) your own LLMs… completely locally? And there are plenty of open source tools to do so?! But there are over 2 million models; which to pick? How can I use them as a personal assistant AND infuse local AI into my codebase?
  7. Today's Schedule ▸ Demo #1: Model serving ▸ Demo #2: Code assistance & agents ▸ Demo #3: Adding AI features to apps ▸ Running your own AI & LLMs ▸ How to choose the right model? ▸ Integrating your data & codebase (Session slides link)
  8. Why run a model locally? For Developers ▸ Convenience & Simplicity: familiarity with the development environment and developers' attachment to their "local developer experience", in particular for testing and debugging ▸ Direct Access to Hardware ▸ Ease of Integration: simplifies integrating the model with existing systems and applications that already run locally. For Organizations ▸ Data Privacy and Security: data is the fuel for AI and a differentiating factor (quality, quantity, qualification); keeping data on-premises ensures sensitive information doesn't leave the local environment, which is crucial for privacy-sensitive applications ▸ Cost Control: despite an initial investment in hardware and setup, running locally can reduce the ongoing cost of cloud computing services and alleviate the vendor lock-in imposed by Amazon, Microsoft, and Google ▸ Regulatory Compliance: some industries have strict regulations about where and how data is processed ▸ Customization & Control: take advantage of total AI customization and control; easily train or fine-tune your own model from the convenience of the developer's local machine
  9. But the stack can be a bit overwhelming! 2024 MAD (Machine learning, Artificial Intelligence & Data) Landscape
  10. Average developer trying to download, run, experiment with & manage models, configure serving runtimes, ensure correct prompt templates, and integrate it all in their code… (Colorized, 2025)
  11. Tool #1: Ollama ▸ Simple CLI: "Docker"-style tool for running LLMs locally, offline, and privately ▸ Extensible: basic model customization (Modelfile) and importing of fine-tuned LLMs ▸ Lightweight: efficient and resource-friendly ▸ Easy API: an API for both inference and Ollama itself (e.g. downloading models); see the sketch below
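
The "Easy API" bullet is worth making concrete: the Ollama daemon listens on localhost:11434 and exposes REST endpoints for inference and model management. A minimal sketch in plain Java, assuming a model has already been pulled (the llama3.2 tag is just an example):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: call Ollama's local REST API (default port 11434).
// Assumes a model has already been pulled, e.g. with `ollama pull llama3.2`.
public class OllamaDemo {
    public static void main(String[] args) throws Exception {
        String body = """
                {"model": "llama3.2", "prompt": "Why run LLMs locally?", "stream": false}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The generated text is in the "response" field of the returned JSON.
        System.out.println(response.body());
    }
}
```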
  12. Tool #2: RamaLama ▸ AI in Containers: run models with Podman/Docker with no config needed ▸ Registry Agnostic: freedom to pull models from Hugging Face, Ollama, or OCI registries ▸ GPU Optimized: auto-detects & accelerates performance ▸ Flexible: supports llama.cpp, vLLM, whisper.cpp & more
  13. Tool #3: Podman AI Lab ▸ For App Builders: choose from various recipes like RAG, Agentic, Summarizers ▸ Curated Models: easily access Apache 2.0 open-source options ▸ Container Native: easy app integration and movement from local to production ▸ Interactive Playgrounds: test & optimize models with your custom prompts and data
  14. Tool #4: LM Studio • User friendly • Easy way to find and serve models • Debug mode: see what's happening in the background • Ability to customize the runtime for best performance • NOT open source
  15. Tool #5: vLLM ▸ Research-Based: UC Berkeley project to improve model speeds and GPU consumption ▸ Standardized: works with Hugging Face & the OpenAI API ▸ Versatile: supports NVIDIA, AMD, Intel, TPUs & more ▸ Scalable: manages multiple requests efficiently, e.g. with Kubernetes as an LLM runtime; see the sketch below
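
Because vLLM speaks the OpenAI API, any OpenAI-compatible client works against it. A hedged sketch in plain Java, assuming the server was started with something like `vllm serve ibm-granite/granite-3.0-8b-instruct` (model name illustrative) on the default port 8000:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: vLLM exposes an OpenAI-compatible server (default port 8000).
// The model name must match the one the server was started with (illustrative here).
public class VllmDemo {
    public static void main(String[] args) throws Exception {
        String body = """
                {"model": "ibm-granite/granite-3.0-8b-instruct",
                 "messages": [{"role": "user", "content": "Summarize vLLM in one line."}]}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8000/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        System.out.println(HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```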
  16. So, which local model should you select? ▸ It depends on the use case you want to tackle & how "open source" they should be ▸ DeepSeek or the new gpt-oss models excel in reasoning tasks and complex problem-solving ▸ Qwen and Granite have strong coding-assistant models ▸ Mixtral and LLaMA are particularly strong in summarization and sentiment analysis
  17. Not all models are the same! Unimodal (Text OR Image): text-to-text, text-to-image, image-to-text, image-to-image, text-to-code ✓ Single data input ✓ Fewer resources ✓ Single modality ✓ Limited depth and accuracy. Multimodal (Text, Image, Audio, Video): any-to-any ✓ Multiple data inputs ✓ More resources ✓ Multiple modalities ✓ Better understanding and accuracy
  18. Also! There's a naming convention (kind of like how our apps are compiled for various architectures!): ibm-granite/granite-3.0-8b-base = family name (ibm-granite) + model architecture and version (granite-3.0) + number of parameters (8b) + model fine-tuned to be a baseline (base). Mixtral-8x7B-Instruct-v0.1 = family name (Mixtral) + architecture type and number of parameters (8x7B) + model fine-tuned for instructive tasks (Instruct) + model version (v0.1)
  19. kevindubois How to deploy a larger model? Let’s say you

    want the best benchmarks with a frontier model
  20. Most models for local usage are quantized! ▸ Quantization: a technique to compress LLMs by reducing numerical precision ▸ Converts high-precision weights (FP32) into lower-bit formats (FP16, INT8, INT4) ▸ Reduces the memory footprint, making models easier to deploy. It's a way to compress models; think of it like a .zip or .tar. A quick back-of-the-envelope example follows.
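
To make the memory savings concrete, here is a rough back-of-the-envelope calculation for the weights alone (ignoring activations and runtime overhead): an 8B-parameter model at FP32 needs about 8 × 10⁹ × 4 bytes ≈ 32 GB; FP16 halves that to ≈ 16 GB; INT8 brings it to ≈ 8 GB; and INT4 to ≈ 4 GB. That last figure is the difference between needing a server-class GPU and fitting comfortably on a well-equipped laptop.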
  21. ▸ The benefit? Run LLMs on "any" device: not just your local machine, but IoT & edge too ▸ Results in faster and lighter models that still maintain reasonable accuracy: testing with Llama 3.1, W4A16-INT resulted in a 2.4x performance speedup and 3.5x model size compression ▸ Works on GPUs & CPUs! Source: https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found
  22. How to use local, disconnected(?) code assistants. Code Assistance: use a local model as a pair programmer to generate and explain your codebase. Fortunately, many tools exist for this too! Tools: Continue, Roo Code, Cline, …
  23. Recap ▸ There are many options for serving and using models locally ▸ Pick the right model for the right use case ▸ Local code assistants work… ish ▸ You might need to ask for hardware upgrades :D ▸ Developing local agentic AI apps with Java is definitely possible (& kind of fun!), as sketched below
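
The recap's last point is easy to substantiate: besides the Quarkus extension linked on the closing slide (docs.quarkiverse.io/quarkus-langchain4j), plain LangChain4j ships an Ollama integration that makes local AI in Java a few lines of code. A minimal sketch, assuming the langchain4j-ollama dependency and a locally pulled model (the model tag is illustrative; newer LangChain4j releases replace generate() with chat()):

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;

// Minimal sketch: point LangChain4j at a local Ollama server.
// Requires the langchain4j-ollama dependency; the model tag is illustrative.
public class LocalAssistant {
    public static void main(String[] args) {
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434") // Ollama's default port
                .modelName("granite3.1-dense:8b")  // any model you've pulled locally
                .temperature(0.2)
                .build();

        System.out.println(model.generate("Explain quantization in one sentence."));
    }
}
```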
  24. Thank you! slides podman-desktop.io docs.quarkiverse.io/quarkus-langchain4j github.com/kdubois/netatmo-java-mcp youtube.com/@thekevindubois linkedin.com/in/kevindubois github.com/kdubois @kevindubois.com @[email protected] Tusen takk! (Norwegian: many thanks!)