Slide 1

Slide 1 text

@iurysza iurysouza.dev Putting The Genie in the Bottle A Crash Course on running LLMs on Android

Slide 2

Slide 2 text

About me •GDE for Android •Senior Engineer @ Klarna @iurysza iurysouza.dev

Slide 3

Slide 3 text

But why tho? Do LLMs on mobile make any sense?

Slide 4

Slide 4 text

•Weren't they super hardware-intensive? •We already have ChatGPT/Gemini at home, in the cloud. But why tho?

Slide 5

Slide 5 text

•SOTA models get the most attention •Benchmarks are crushed every week What gets buried in the news

Slide 6

Slide 6 text

Gemini 2.5 Pro GPT-5 Claude Opus 4.1 Grok 4 DeepSeek V3 Mistral Small 3.1 Llama 3.1 Gemma 3 SmolLM Phi-3 The Model Iceberg

Slide 7

Slide 7 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models

Slide 8

Slide 8 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models Gemma 3 Model Performance vs. Size

Slide 9

Slide 9 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models Mistral 3.1 Model Performance vs. Latency

Slide 10

Slide 10 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models

Slide 11

Slide 11 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models Model Size vs Eval

Slide 12

Slide 12 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models Gemma 3 27B ranking in the LLM Arena, circa March 2025

Slide 13

Slide 13 text

•There's another category of breakthroughs happening in parallel •Highly efficient and capable Open Source Models •These models are hitting benchmarks that were SOTA six months ago. Top Left Corner Models # of GPUs required vs Elo Score

Slide 14

Slide 14 text

Why Run Locally? Even if they’re getting good, why should you still bother? •Performance & Latency •Privacy & Security •Cost & Accessibility •Offline functionality

Slide 15

Slide 15 text

Two options for running on-device Gen-AI Running on the Edge

Slide 16

Slide 16 text

A system-wide LLM •Built directly into the platform •Gemini Nano • via AI Edge SDK (Oct-24) • via ML Kit GenAI APIs (May-25)

Slide 17

Slide 17 text

A system-wide LLM •Built directly into the platform •Gemini Nano • via AI Edge SDK (Oct-24) • via ML Kit GenAI APIs (May-25)

Slide 18

Slide 18 text

AI Edge SDK •Easiest approach for developers •The model is shared by different apps •Support for LoRA Adapters •Model updates are managed by the platform •Built-in safety features •Available on select high-end devices only

Slide 19

Slide 19 text

AI Edge SDK •Requirements

Slide 20

Slide 20 text

AI Edge SDK •Requirements • A compatible device •Google: Pixel 9 & 10 series •Samsung: Galaxy S25 series •OnePlus: 13 series •Top Models from: • Honor, Realme, Vivo, Xiaomi • iQOO, POCO, OPPO, Motorola

Slide 21

Slide 21 text

AI Edge SDK •Requirements • A compatible device • AICore APK Android AICore App

Slide 22

Slide 22 text

AI Edge SDK •Requirements • A compatible device • AI Core APK • Private Compute Services APK Private Compute Services App

Slide 23

Slide 23 text

AI Edge SDK Using it in your app

implementation("com.google.ai.edge.aicore:aicore:0.0.1-exp01")

Slide 24

Slide 24 text

AI Edge SDK Using it in your app

private val model = GenerativeModel(
    generationConfig = generationConfig {
        context = applicationContext
        maxOutputTokens = 600
        temperature = 0.9f
        topK = 16
        topP = 0.1f
    }
)

val response = model.generateContentStream(prompt)

Slide 25

Slide 25 text

AI Edge SDK Using it in your app

private val model = GenerativeModel(
    generationConfig = generationConfig {
        context = applicationContext
        maxOutputTokens = 600
        temperature = 0.9f
        topK = 16
        topP = 0.1f
    }
)

val response = model.generateContentStream(prompt)

Slide 26

Slide 26 text

AI Edge SDK Using it in your app

private val model = GenerativeModel(
    generationConfig = generationConfig {
        context = applicationContext
        maxOutputTokens = 600
        temperature = 0.9f
        topK = 16
        topP = 0.1f
    }
)

val response = model.generateContentStream(prompt)
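
That last line returns a stream, not a string: in Google's Kotlin AI SDKs, generateContentStream exposes a cold Flow you collect from a coroutine. A minimal sketch of consuming it, assuming the AI Edge SDK follows that shape (the chunk's `text` accessor is an assumption, not shown on the slide):

import kotlinx.coroutines.flow.collect

suspend fun streamToConsole(model: GenerativeModel, prompt: String) {
    // Each collected chunk is a partial response; print tokens as they arrive.
    model.generateContentStream(prompt).collect { chunk ->
        print(chunk.text)
    }
}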

Slide 27

Slide 27 text

AI Edge SDK Using it in your app With it you can: •Check if the device is supported. •Tune safety settings. •Run inference at high performance. •Optionally, provide a LoRA adapter.

Slide 28

Slide 28 text

AI Edge SDK How does it work? Roughly AI Edge SDK Architecture

Slide 29

Slide 29 text

AI Edge SDK How does it work? AICore •OS-level integration •Model deployment management •Hardware abstraction layer AI Edge SDK Architecture

Slide 30

Slide 30 text

AI Edge SDK How does it work? Private Compute Services/Core •PCS + PCC • Privacy through Data Isolation • PCC: Process data safely. • PCS: Model updates. AI Edge SDK Architecture

Slide 31

Slide 31 text

AI Edge SDK How does it work? Private Compute Services/Core •PCS + PCC • Privacy through Data Isolation • PCC: Process data safely. • PCS: Model updates. Now Playing uses PCC

Slide 32

Slide 32 text

AI Edge SDK How does it work? Private Compute Services/Core •PCS + PCC • Privacy through Data Isolation • PCC: Process data safely. • PCS: Model updates. Smart Reply uses PCC

Slide 33

Slide 33 text

A system-wide LLM •Built directly into the platform •Gemini Nano • via AI Edge SDK (Oct-24) • via ML Kit GenAI APIs (May-25)

Slide 34

Slide 34 text

ML Kit GenAI APIs •High-level APIs for: • Summarization • Proofreading • Rewriting • Image Description

Slide 35

Slide 35 text

Adding it to your App ML Kit GenAI APIs

val articleToSummarize = "We are excited to announce a set of on-device ..."

val options = SummarizerOptions.builder(context)
    .setInputType(InputType.ARTICLE)
    .setOutputType(OutputType.ONE_BULLET)
    .setLanguage(Language.ENGLISH)
    .build()

val summarizer = Summarization.getClient(options)

Slide 36

Slide 36 text

Adding it to your App ML Kit GenAI APIs

val articleToSummarize = "We are excited to announce a set of on-device ..."

val options = SummarizerOptions.builder(context)
    .setInputType(InputType.ARTICLE)
    .setOutputType(OutputType.ONE_BULLET)
    .setLanguage(Language.ENGLISH)
    .build()

val summarizer = Summarization.getClient(options)

Slide 37

Slide 37 text

Adding it to your App ML Kit GenAI APIs

val articleToSummarize = "We are excited to announce a set of on-device ..."

val options = SummarizerOptions.builder(context)
    .setInputType(InputType.ARTICLE)
    .setOutputType(OutputType.ONE_BULLET)
    .setLanguage(Language.ENGLISH)
    .build()

val summarizer = Summarization.getClient(options)

Slide 38

Slide 38 text

Adding it to your App ML Kit GenAI APIs

val articleToSummarize = "We are excited to announce a set of on-device ..."

val options = SummarizerOptions.builder(context)
    .setInputType(InputType.ARTICLE)
    .setOutputType(OutputType.ONE_BULLET)
    .setLanguage(Language.ENGLISH)
    .build()

val summarizer = Summarization.getClient(options)
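
Building the client is only half the story; you still run a request against it. A sketch of the follow-up step, assuming the request builder and streaming callback shape of the ML Kit GenAI beta docs (SummarizationRequest and the callback signature are assumptions to verify against the current API):

val request = SummarizationRequest.builder(articleToSummarize).build()

// Stream the summary; the callback receives incremental chunks of text.
summarizer.runInference(request) { newText ->
    print(newText)
}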

Slide 39

Slide 39 text

Two options for running on-device Gen-AI Running on the Edge

Slide 40

Slide 40 text

A system-wide LLM •Built directly into the platform •Gemini Nano • via AI Edge SDK (Oct-24) • via ML Kit GenAI APIs (May-25)

Slide 41

Slide 41 text

•You’re in control • Model management • Setup, initialization, and resource usage •You can run many different models •Bring your own fine-tuned model •Wider device compatibility The DIY alternative Self-managed Models

Slide 42

Slide 42 text

Adding it to your App MediaPipe LLM Inference API What model do we use?

Slide 43

Slide 43 text

Wait, what? 🤨 Adding it to your App MediaPipe LLM Inference API

Slide 44

Slide 44 text

What are these models? What's under the hood kaggle.com/models/google/gemma-3

Slide 45

Slide 45 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood

Slide 46

Slide 46 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood

Slide 47

Slide 47 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
Model family ^

Slide 48

Slide 48 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
~1 billion parameters ^

Slide 49

Slide 49 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
fine-tuned to follow instructions/commands ^

Slide 50

Slide 50 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
weights are quantized to 4-bit integers ^

Slide 51

Slide 51 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
weights are quantized to 4-bit integers ^
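
The int4 label also explains the archive size: a back-of-the-envelope sketch in Kotlin (attributing the leftover ~50 MB to embeddings, quantization scales, and metadata is a rough guess, not something the archive states):

fun main() {
    val params = 1_000_000_000L   // ~1B weights in gemma3-1b
    val bytes = params * 4 / 8    // 4 bits per weight
    println(bytes)                // 500_000_000, close to the 549_971_728 bytes above
}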

Slide 52

Slide 52 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood
MediaPipe bundle format ^

Slide 53

Slide 53 text

❯ unzip -l gemma3-1b-it-int4.task
Archive:  gemma3-1b-it-int4.task
   Length  Date        Time   Name
---------  ----------  -----  ----
549971728  03-07-2025  07:36  TF_LITE_PREFILL_DECODE
  4689074  03-07-2025  07:36  TOKENIZER_MODEL
       90  03-07-2025  07:36  METADATA
---------                     -------
554660892                     3 files

What are these models? What's under the hood

Slide 54

Slide 54 text

TOKENIZER_MODEL •Convert text to and from token IDs •Contains the Vocabulary What are these models? What's under the hood

Slide 55

Slide 55 text

TOKENIZER_MODEL •Convert text to and from token IDs •Contains the Vocabulary

{ "explain": 123, "this": 456, "model": 789 }

What are these models? What's under the hood
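
The vocabulary is just a lookup table in both directions. A toy round trip over the mapping above (real tokenizers such as Gemma's work on subword pieces; this word-level map is invented purely for illustration):

fun main() {
    val vocab = mapOf("explain" to 123, "this" to 456, "model" to 789)
    val inverse = vocab.entries.associate { (word, id) -> id to word }

    // Encode: text -> token IDs
    val ids = "explain this model".split(" ").map { vocab.getValue(it) }
    println(ids)  // [123, 456, 789]

    // Decode: token IDs -> text
    println(ids.joinToString(" ") { inverse.getValue(it) })  // explain this model
}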

Slide 56

Slide 56 text

TF_LITE_PREFILL_DECODE •Model weights & Architecture •TensorFlow Lite •Prefill & Decode operations What are these models? What's under the hood

Slide 57

Slide 57 text

TF_LITE_PREFILL_DECODE •Model weights & Architecture •TensorFlow Lite •Prefill & Decode operations What are these models? What's under the hood
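
Prefill and decode split inference into two phases: prefill runs the whole prompt through the network once to populate cached state, then decode produces one token per step, feeding each prediction back in. A toy sketch of that control flow (the Model interface is invented for illustration; the real graph lives in the TF Lite file):

interface Model {
    fun prefill(promptIds: List<Int>)  // process the full prompt, fill the KV cache
    fun decodeNext(): Int              // predict one token from the cached state
}

fun generate(model: Model, promptIds: List<Int>, maxTokens: Int, eosId: Int): List<Int> {
    model.prefill(promptIds)           // one pass over the whole prompt
    val output = mutableListOf<Int>()
    repeat(maxTokens) {
        val next = model.decodeNext()  // one pass per generated token
        if (next == eosId) return output
        output += next
    }
    return output
}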

Slide 58

Slide 58 text

What are they doing? What's under the hood What do we do with that?

Slide 59

Slide 59 text

We’re running Inference What are they doing? What's under the hood

Slide 60

Slide 60 text

What are they doing? What's under the hood Making next-token predictions Running Inference?

Slide 61

Slide 61 text

https://colab.research.google.com/github/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/Gemma3_1b_fine_tune.ipynb#scrollTo=AM6rDABTXt2F

What are they doing? What's under the hood

Slide 62

Slide 62 text

What are they doing? What's under the hood

Slide 63

Slide 63 text

What are they doing? What's under the hood

Slide 64

Slide 64 text

What are they doing? MediaPipe LLM Inference API

Slide 65

Slide 65 text

How to use this in my app? MediaPipe LLM Inference API MediaPipe LLM Inference Thin Java Layer

Slide 66

Slide 66 text

How to use this in my app? MediaPipe LLM Inference API MediaPipe LLM Inference C++ implementation uses LiteRT APIs

Slide 67

Slide 67 text

How to use this in my app? MediaPipe LLM Inference API MediaPipe LLM Inference C++ implementation uses LiteRT APIs

Slide 68

Slide 68 text

MediaPipe LLM Inference API Where were we again?

Slide 69

Slide 69 text

Adding it to your App MediaPipe LLM Inference API What model do we use?

Slide 70

Slide 70 text

Adding it to your App MediaPipe LLM Inference API

Slide 71

Slide 71 text

Adding it to your App MediaPipe LLM Inference API

Slide 72

Slide 72 text

Adding it to your App MediaPipe LLM Inference API What model do we use?

Slide 73

Slide 73 text

Adding it to your App MediaPipe LLM Inference API

implementation("com.google.mediapipe:tasks-genai:0.10.24")

Slide 74

Slide 74 text

Adding it to your App MediaPipe LLM Inference API

# Push the model to the device
MODEL_FILE="gemma3-1b-it-int4.task"
TARGET_DIR="/storage/emulated/0/Android/data/com.myapp/files/"
adb push "$MODEL_FILE" "$TARGET_DIR"
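
To confirm the model landed where the app expects it (plain adb, nothing MediaPipe-specific):

adb shell ls -l "$TARGET_DIR"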

Slide 75

Slide 75 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
}

Slide 76

Slide 76 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(
        context.getExternalFilesDir(null),
        modelName
    )

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath) // setModelPath expects a String path
            .setMaxTokens(1024)
            .build()
    )
}

Slide 77

Slide 77 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(
        context.getExternalFilesDir(null),
        modelName
    )

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath) // setModelPath expects a String path
            .setMaxTokens(1024)
            .build()
    )
}

Slide 78

Slide 78 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(
        context.getExternalFilesDir(null),
        modelName
    )

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath)
            .setMaxTokens(1024)
            .build()
    )

    // Create session
    val session = LlmInferenceSession.createFromOptions(
        llmInference,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setTopK(50)
            .setTemperature(0.8f)
            .build()
    )
}

Slide 79

Slide 79 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(
        context.getExternalFilesDir(null),
        modelName
    )

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath)
            .setMaxTokens(1024)
            .build()
    )

    // Create session
    val session = LlmInferenceSession.createFromOptions(
        llmInference,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setTopK(50)
            .setTemperature(0.8f)
            .build()
    )
}

Slide 80

Slide 80 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(context.getExternalFilesDir(null), modelName) // as before

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath)
            .setMaxTokens(1024)
            .build()
    )

    // Create session
    val session = LlmInferenceSession.createFromOptions(
        llmInference,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setTopK(50)
            .setTemperature(0.8f)
            .build()
    )

    // Send prompt
    session.addQueryChunk(prompt)
    session.generateResponseAsync { partialResult, done ->
        // Handle output (Log.d needs a tag; "LlmDemo" is added here)
        Log.d("LlmDemo", partialResult)
        if (done) {
            // Cleanup
            session.close()
            llmInference.close()
        }
    }
}

Slide 81

Slide 81 text

Adding it to your App MediaPipe LLM Inference API

fun mediaPipeBasicExample(
    context: Context,
    modelName: String,
    prompt: String
) {
    val modelPath = File(context.getExternalFilesDir(null), modelName) // as before

    // Create LLM engine
    val llmInference = LlmInference.createFromOptions(
        context,
        LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath.absolutePath)
            .setMaxTokens(1024)
            .build()
    )

    // Create session
    val session = LlmInferenceSession.createFromOptions(
        llmInference,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setTopK(50)
            .setTemperature(0.8f)
            .build()
    )

    // Send prompt
    session.addQueryChunk(prompt)
    session.generateResponseAsync { partialResult, done ->
        // Handle output (Log.d needs a tag; "LlmDemo" is added here)
        Log.d("LlmDemo", partialResult)
        if (done) {
            // Cleanup
            session.close()
            llmInference.close()
        }
    }
}
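
If the rest of the app is coroutine-based, that callback maps naturally onto a Flow. A minimal sketch built only on the generateResponseAsync callback shown above (the Flow wrapper is an addition of mine, not a MediaPipe API):

import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

fun LlmInferenceSession.responses(prompt: String): Flow<String> = callbackFlow {
    addQueryChunk(prompt)
    generateResponseAsync { partialResult, done ->
        trySend(partialResult)  // emit each partial chunk downstream
        if (done) close()       // complete the flow when generation finishes
    }
    awaitClose { /* session/engine cleanup stays with the caller */ }
}

Collecting it then drives generation: session.responses(prompt).collect { print(it) }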

Slide 82

Slide 82 text

What to use? Bottom Line It depends 😌

Slide 83

Slide 83 text

What to use? Bottom Line Gemini Nano = macOS, MediaPipe = Arch Linux

Slide 84

Slide 84 text

What to use? Bottom Line Gemini Nano •Via AI Edge SDK & ML Kit GenAI •Limited to one model •Customization via LoRA Adapters •Guardrails •Limited availability •OEM adoption is key for its future

Slide 85

Slide 85 text

What to use? Bottom Line MediaPipe •Completely self-managed •Niche specialized models •Wider availability

Slide 86

Slide 86 text

CAVEATS Size still matters •Don’t expect SOTA performance •Focus on a narrow use case •Use LoRA Adapters or fine-tuned models •Handling resource usage is challenging

Slide 87

Slide 87 text

Press X to Doubt "This is a fad."

Slide 88

Slide 88 text

It's a movie, not a picture "Keep an eye on the trend lines" 3 ideas •Race to the bottom •History •Specialization + ubiquity

Slide 89

Slide 89 text

Race to the bottom Cloud Models are just too expensive

Slide 90

Slide 90 text

It's a movie, not a picture 1970s-1980s mainframe computer room Remember Mainframes?

Slide 91

Slide 91 text

It's a movie, not a picture Computer terminal Remember Mainframes?

Slide 92

Slide 92 text

It's a movie, not a picture Early personal computer Remember Mainframes?

Slide 93

Slide 93 text

SMOL & Agentic AI Specialization & Ubiquity Mini models everywhere

Slide 94

Slide 94 text

"For many agentic workloads, Small Language Models (SLMs) are a superior default to Large Language Models (LLMs) due to cost, latency, controllability, and fine-tuning ease." "10–30× cheaper, lower energy, faster responses, easier local deployment" Specialization & Ubiquity Mini models everywhere

Slide 95

Slide 95 text

Specialization & Ubiquity Mini models everywhere Cost reduction enables different business models •Free apps •Premium app tiers •Single payment apps •No variable costs per user interaction with LLMs

Slide 96

Slide 96 text

Gboard Writing Tools Specialization & Ubiquity Now: LoRA adapters everywhere

Slide 97

Slide 97 text

Specialization & Ubiquity Now: LoRA adapters everywhere Pixel Screenshots App

Slide 98

Slide 98 text

Wrapping up Give it a go! Projects for you to get started:

Slide 99

Slide 99 text

Wrapping up Give it a go! AI Edge Gallery

Slide 100

Slide 100 text

Wrapping up Give it a go! Projects for you to get started:

Slide 101

Slide 101 text

Wrapping up Give it a go! flutter_gemma Sample App

Slide 102

Slide 102 text

@iurysza iurysouza.dev Putting The Genie in the Bottle A Crash Course on running LLMs on Android • Google AI Edge SDK Documentation • MediaPipe LLM Inference API • AI on Android Spotlight Week • Paper: Small Language Models are the Future of Agentic AI • Deep Dive into LLMs like ChatGPT • Google AI Edge

Slide 103

Slide 103 text

Thank you! 🇵🇹 @iurysza iurysouza.dev

Slide 104

Slide 104 text

Questions? @iurysza iurysouza.dev