AI Community Day Bangkok 2025 - In-Browser ML/LLM Inference Ecosystem

In-Browser ML/LLM Inference Ecosystem AI Community Day Bangkok 2025-11-29

Karn Wong Loves optimization Has too much fun cranking out
benchmarks HashiCorp Ambassador & AWS Community Builder Website: karnwong.me Independent Consultant

Machine Learning is a Subtype of AI Classical Machine Learning
Neural Networks (Deep Learning, Reinforcement Learning, etc.) Large Language Models

LLM = Massive NLP Model

Natural Language Processing Primer Applied linguistics Linguistics: the scientific study
of languages Linguistics + math

Machine Learning Primer Input Model Output

Creating ML Models scikit-learn PyTorch TensorFlow etc.

Serving ML Models Flask Django FastAPI etc.

ML Inference Anatomy InferenceEndpoint App Transform Input Return Output Frontend
Backend Parse Input Inference

Do You Want to Maintain a Separate ML System? But
if you have a dedicated team to maintain the ML system, go ahead

WASM to the Rescue If it can be converted into
WASM, it can run on a web browser

ML Serving Runtimes with WASM Support ONNX LiteRT Candle wllama

🐘 Classical ML Classification example: # input { "feature1": 10.5,
"feature2": 250, "feature3": 0.75, "feature4": 1, "feature5": "category_A" } # output { "0": 0.15, "1": 0.85 }

🐘 Neural Networks Facial recognition (YOLO) example: # input {
"input_tensor": [[[ [0.23, 0.25, 0.26, ...], [0.22, 0.24, 0.26, ...], ...], ...]], "shape": [1, 3, 640, 640 ], "dtype": "float32", "normalized": true } # output { "image": "image_01.jpg", "detections": [ { "class_id": 0, "class_name": "person", "confidence": 0.87, "bbox_xywh": [233.0, 328.5, 242.0, 567.0] } ] }

🐘 LLM It’s complicated… You need the model’s tokenizer and
model config # python from ai_edge_torch.generative.examples.gemma3 import gemma3 // rust use candle_transformers::models::gemma::{Config as Config1, Model as Model1};

ONNX Runtime Web Most models can be converted into ONNX
format Works for classical ML and neural networks For LLM: need to construct input tensors + decode output tensors Not realistic for LLM inference in-browser due to model size 300M LLM model is 1.2 GB https://onnxruntime.ai/docs/tutorials/web/deploy.html

LiteRT Can be converted from PyTorch, TensorFlow, JAX Does not
support classical ML For LLM: see MediaPipe https://ai.google.dev/edge/litert

MediaPipe (via LiteRT) Plug-and-play solutions with default models that allow
for customization Face detection, image classification, etc. Can customize the models - only the input data HuggingFace provides pre-built LLM models Can also convert the models yourself LLM is packaged as a .task file Includes LiteRT files, components, metadata .task format already includes tokenizer + model config https://ai.google.dev/edge/mediapipe/solutions/guide

Candle Pure Rust implementation Can compile into WASM Have to
construct input tokenizer + tensors Needs to use both Rust + JS Focuses on NN + LLM https://github.com/huggingface/candle/tree/main/candle-wasm-examples

wllama WASM binding for llama.cpp WASM binary for wllama runtime
GGUF model file Focuses on LLM https://github.com/ngxson/wllama

Runtimes Comparison *GitHub Stars Language **Runtime is Automatically Provided Abstracted
LLM Input / Output Model Type Include Pre-Built Models ONNX 18.3K JS ❌ ❌ Classical / NN / LLM ❌ LiteRT 21.9K JS ✅ (MediaPipe) ✅ (MediaPipe) NN / LLM ✅ (MediaPipe + HuggingFace) Candle 18.5K Rust + JS ❌ ❌ NN / LLM ✅ (HuggingFace) wllama 993 JS ❌ ✅ LLM ✅ (HuggingFace) *GitHub stars as of 2025-11-17 **Have to explicitly reference WASM binary for model runtime

ONNX Runtime Web? Classicial ML + Custom neural networks model

LiteRT? Neural networks model via PyTorch, TensorFlow, JAX

MediaPipe? Generic use cases for image/text/audio processing Off-the-shelf models LLM

Candle? If you can find JS devs who can write
Rust 👀

wllama? Too early to tell, need to wait and see

Thank you 🙏 Download slides at: karnwong.me

AI Community Day Bangkok 2025 - In-Browser ML/L...

AI Community Day Bangkok 2025 - In-Browser ML/LLM Inference Ecosystem

Karn Wong

More Decks by Karn Wong

Other Decks in Technology

Featured

Transcript

In-Browser ML/LLM Inference Ecosystem AI Community Day Bangkok 2025-11-29

Karn Wong Loves optimization Has too much fun cranking out

Machine Learning is a Subtype of AI Classical Machine Learning

LLM = Massive NLP Model

Natural Language Processing Primer Applied linguistics Linguistics: the scientific study

Machine Learning Primer Input Model Output

Creating ML Models scikit-learn PyTorch TensorFlow etc.

Serving ML Models Flask Django FastAPI etc.

ML Inference Anatomy InferenceEndpoint App Transform Input Return Output Frontend

Do You Want to Maintain a Separate ML System? But

WASM to the Rescue If it can be converted into

ML Serving Runtimes with WASM Support ONNX LiteRT Candle wllama

🐘 Classical ML Classification example: # input { "feature1": 10.5,

🐘 Neural Networks Facial recognition (YOLO) example: # input {

🐘 LLM It’s complicated… You need the model’s tokenizer and

ONNX Runtime Web Most models can be converted into ONNX

LiteRT Can be converted from PyTorch, TensorFlow, JAX Does not

MediaPipe (via LiteRT) Plug-and-play solutions with default models that allow

Candle Pure Rust implementation Can compile into WASM Have to

wllama WASM binding for llama.cpp WASM binary for wllama runtime

Runtimes Comparison *GitHub Stars Language **Runtime is Automatically Provided Abstracted

ONNX Runtime Web? Classicial ML + Custom neural networks model

LiteRT? Neural networks model via PyTorch, TensorFlow, JAX

MediaPipe? Generic use cases for image/text/audio processing Off-the-shelf models LLM

Candle? If you can find JS devs who can write

wllama? Too early to tell, need to wait and see

Thank you 🙏 Download slides at: karnwong.me