Slide 1

Slide 1 text

Running LLM Inference on Android
Wesley Kambale
@weskambale kambale.dev

Slide 2

Slide 2 text

What is LLM Inference?
The LLM Inference API lets you run large language models (LLMs) completely on-device in Android applications. You can use it to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The API provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your Android apps.

Slide 3

Slide 3 text

What is LLM Inference?
The API supports the following variants of Gemma: Gemma-3 1B, Gemma-2 2B, Gemma 2B, and Gemma 7B. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The API also supports the following external models: Phi-2, Falcon-RW-1B, StableLM-3B, and, more recently, DeepSeek-R1. Beyond the supported models, you can use Google's AI Edge Torch to export PyTorch models into multi-signature LiteRT (.tflite) models, which are bundled with tokenizer parameters to create Task Bundles compatible with the LLM Inference API.

Slide 4

Slide 4 text

Code example
Clone the git repository using the following command:

git clone https://github.com/google-ai-edge/mediapipe-samples

Optionally, configure your git instance to use sparse checkout, so you have only the files for the LLM Inference API example app:

cd mediapipe-samples
git sparse-checkout init --cone
git sparse-checkout set examples/llm_inference/android

Slide 5

Slide 5 text

Dependencies
The LLM Inference API uses the com.google.mediapipe:tasks-genai library. Add this dependency to the build.gradle file of your Android app:

dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.22'
}

Slide 6

Slide 6 text

Model
The MediaPipe LLM Inference API requires a trained text-to-text language model that is compatible with this task. After downloading a model, install the required dependencies and push the model to the Android device. If you are using a model other than Gemma, you will have to convert the model to a format compatible with MediaPipe.

Slide 7

Slide 7 text

Push model to the device
Push the content of the output_path folder to the Android device.

$ adb shell rm -r /data/local/tmp/llm/
$ adb shell mkdir -p /data/local/tmp/llm/
$ adb push output_path /data/local/tmp/llm/model_version.task

Slide 8

Slide 8 text

Create the task
The MediaPipe LLM Inference API uses the createFromOptions() function to set up the task. createFromOptions() accepts values for the configuration options.

val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/.../")
    .setMaxTokens(1000)
    .setTopK(40)
    .setTemperature(0.8f)
    .setRandomSeed(101)
    .build()

llmInference = LlmInference.createFromOptions(context, options)

Slide 9

Slide 9 text

Configuration options
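As a quick reference, here is an annotated Kotlin sketch of the options used on the previous slide. The comments summarize what each option controls; the values shown are illustrative, not required defaults, and the model path simply reuses the location from the adb push step.

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// A minimal sketch of the configuration options (values are illustrative).
val options = LlmInference.LlmInferenceOptions.builder()
    // Path to the .task model bundle pushed to the device.
    .setModelPath("/data/local/tmp/llm/model_version.task")
    // Maximum number of tokens (input + output) the task handles.
    .setMaxTokens(1000)
    // Number of candidate tokens the model considers at each generation step.
    .setTopK(40)
    // Amount of randomness during generation; higher values give more creative output.
    .setTemperature(0.8f)
    // Seed used during generation, for reproducible sampling.
    .setRandomSeed(101)
    .build()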

Slide 10

Slide 10 text

Prepare data
The LLM Inference API accepts the following inputs:
- prompt (string): A question or prompt.

val inputPrompt = "Compose an email to MAK students about class plans at noon on Saturday."

Slide 11

Slide 11 text

Run the task
Use the generateResponse() method to generate a text response to the input text provided in the previous section (inputPrompt). This produces a single generated response.

val result = llmInference.generateResponse(inputPrompt)
logger.atInfo().log("result: $result")

Slide 12

Slide 12 text

Run the task
To stream the response, use the generateResponseAsync() method.

val options = LlmInference.LlmInferenceOptions.builder()
    ...
    .setResultListener { partialResult, done ->
        logger.atInfo().log("partial result: $partialResult")
    }
    .build()

llmInference.generateResponseAsync(inputPrompt)

Slide 13

Slide 13 text

Handle and display results
The LLM Inference API returns a LlmInferenceResult, which includes the generated response text. Here's a draft you can use:

Subject: Class on Saturday

Hi MAK Students,

Just a quick reminder about our class this Saturday at noon. Let me know if that still works for you.

Looking forward to it!

Best,
[Your Name]
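One way to surface the generated text in the UI is to accumulate partial results from the streaming listener and render them in a view. The sketch below is only an illustrative pattern: responseView, the surrounding Activity, and the model path are assumptions, not part of the API.

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch (inside an Activity): stream partial results and display them as they arrive.
// responseView is a hypothetical TextView obtained from the layout.
val responseBuilder = StringBuilder()

val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/model_version.task")
    .setResultListener { partialResult, done ->
        responseBuilder.append(partialResult)
        // The listener may fire off the main thread; post UI updates back to it.
        runOnUiThread {
            // Show an ellipsis while streaming; drop it once the response is complete.
            responseView.text =
                if (done) responseBuilder.toString() else "$responseBuilder…"
        }
    }
    .build()

val llmInference = LlmInference.createFromOptions(this, options)
llmInference.generateResponseAsync(inputPrompt)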

Slide 14

Slide 14 text

LoRA model customization
The MediaPipe LLM Inference API can be configured to support Low-Rank Adaptation (LoRA) for large language models. Using fine-tuned LoRA models, developers can customize the behavior of LLMs through a cost-effective training process. LoRA support in the LLM Inference API works for all Gemma variants and Phi-2 models on the GPU backend, with LoRA weights applicable to attention layers only. This initial implementation serves as an experimental API, with plans to support more models and various types of layers in coming updates.

Slide 15

Slide 15 text

Prepare LoRA models

# For Gemma
from peft import LoraConfig

config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# For Phi-2
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)

Slide 16

Slide 16 text

LoRA model customization
After training on the prepared dataset and saving the model, you obtain an adapter_model.safetensors file containing the fine-tuned LoRA model weights. The safetensors file is the LoRA checkpoint used in the model conversion. As the next step, you need to convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python package. The ConversionConfig should specify the base model options as well as the additional LoRA options. Because the API only supports LoRA inference with GPU, the backend must be set to 'gpu'.

Slide 17

Slide 17 text

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    # Other params related to the base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

Slide 18

Slide 18 text

LoRA model inference
The Web, Android, and iOS LLM Inference APIs have been updated to support LoRA model inference. Android supports static LoRA during initialization. To load a LoRA model, specify the LoRA model path as well as the base LLM.

val options = LlmInferenceOptions.builder()
    ...
    .setRandomSeed(101)
    .setLoraPath("")
    .build()

llmInference = LlmInference.createFromOptions(context, options)

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Running the code sample
GitHub: https://bit.ly/llm-on-android-MAK

Slide 21

Slide 21 text

“Machine Learning is the future…”
Robert John, AI and Cloud GDE, Nigeria

Slide 22

Slide 22 text

Mwebale! (Thank you!) Eid Mubarak.
Wesley Kambale
@weskambale kambale.dev