
RAG with Mobile Application - DroidKaigi 2024

JaiChangPark (박제창)
September 15, 2024

In the era of AI, Retrieval Augmented Generation (RAG) combines the strengths of retrieval-based and generation-based methods to improve information retrieval and content generation. This presentation explores how RAG can be implemented in mobile applications to provide more accurate, context-aware, and user-specific responses, and aims to give Android application developers a comprehensive understanding of RAG and its practical uses. By the end of the session, developers will be equipped with the knowledge and tools necessary to implement RAG in their own projects, enhancing their apps with cutting-edge AI capabilities.


Transcript

  1. 40% Would you like to be more productive? 7 @source:

    https://news.mit.edu/2023/study-finds-chatgpt-boosts-worker-productivity-writing-0714 https://www.science.org/doi/epdf/10.1126/science.adh2586 https://survey.stackoverflow.co/2024/ai#sentiment-and-usage-ai-select
  2. 8

  3. 9 Large Language Model

    Category | NLP (Natural Language Processing) | LLM (Large Language Model)
    Definition | Technology that enables AI to understand and process human language. | A model trained on vast amounts of text data, capable of human-level language proficiency.
    Features | Utilizes various technologies; applied in chatbots, machine translation, sentiment analysis, etc. | Trained on large text datasets; performs various NLP tasks; human-level language abilities.
    Relationship | LLMs play a crucial role in the advancement of NLP. | A subset of NLP technology; a Foundation Model.
  4. 10 The History of NLP

    RNN (1985) → LSTM (1997) → Word2Vec (2013) → Seq2Seq (2014) → Attention (2015) → Transformer (2017) → GPT-1 (2018) / BERT (2018) → GPT-2 (2019) → GPT-3 (2020) → GPT-4, Gemini, LLaMa, Claude, etc.
  5. 11 Transformer - Attention Is All You Need

    @source: Attention Is All You Need, 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7)
  6. 14 Model Training Time (GPU hours), based on NVIDIA H100-80G 1EA @source: Meta AI

    Llama 3 8B: 1.3M | Llama 3.1 8B: 1.46M
    Llama 3 70B: 6.4M | Llama 3.1 70B: 7.0M
    - | Llama 3.1 405B: 30.84M
    Total: 7.7M | Total: 39.3M
  7. 15 Model Training Time (GPU hours) @source: Meta AI (148~167 Years on a Single H100 GPU)

    Llama 3 8B: 1.3M | Llama 3.1 8B: 1.46M
    Llama 3 70B: 6.4M | Llama 3.1 70B: 7.0M
    - | Llama 3.1 405B: 30.84M
    Total: 7.7M | Total: 39.3M
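    For reference, the year figure is simple arithmetic on the 8B rows: 1.3M GPU hours ÷ 8,760 hours per year ≈ 148 years, and 1.46M ÷ 8,760 ≈ 167 years, so even the smallest Llama 3 / 3.1 models would take roughly 148~167 years to train on a single H100.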
  8. 16 We use this cluster design for Llama 3 training.

    Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta (NVIDIA H100 GPUs). H100 (1ea) = about ¥5,000,000; H100 × 24,576 = ¥122,880,000,000; two clusters = ¥245,760,000,000 = about ¥245.8 billion (2458億円). @source: https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
  9. 23 Model | Memory

    Google Pixel 9: 12GB RAM
    Pixel 9 Pro: 16GB RAM
    Samsung Galaxy Z Fold 6: 12GB RAM
    Galaxy Z Fold 5: 12GB RAM
    Galaxy Z Flip 6: 12GB RAM
    Galaxy S24: 8GB RAM
    Galaxy S23: 8GB RAM
    Galaxy A25: 6GB RAM
    Galaxy A15: 6GB RAM
  10. 29 Retrieval-augmented generation (RAG) is a software architecture and technique

    that integrates large language models with external information sources, enhancing the accuracy and reliability of generative AI models by incorporating specific business data like documents, SQL databases, and internal applications. @source: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  11. 35 Large Language Model: Preventing LLM hallucinations

    1. Augment user queries with data retrieved through web searches.
    2. This enables responses to questions about the latest data.
  12. 43 LangChain 🦜🔗 • LangChain is a framework for developing

    applications powered by large language models (LLMs). • Open-source libraries: Build your applications using LangChain's modular building blocks and components. • Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence. • Deployment: Turn any chain into a REST API with LangServe. @source: https://www.langchain.com/
  13. 48 Embedding Model • Deep Learning Model • Sparse Embedding

    a. One-Hot Encoding b. TF-IDF (Term Frequency-Inverse Document Frequency) • Dense Embedding a. Word2Vec b. BERT etc.. c. Text-embedding-3 (OpenAI)
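    As a toy illustration of the sparse vs. dense distinction above (values are illustrative only and not produced by any real model; the dense values echo the example vector on the next slide):

    // Sparse (one-hot): one dimension per vocabulary word, almost all zeros
    val vocab = listOf("what", "is", "flutter", "react", "native")
    val oneHotFlutter = FloatArray(vocab.size) { i -> if (vocab[i] == "flutter") 1f else 0f }
    // => [0.0, 0.0, 1.0, 0.0, 0.0]

    // Dense: a learned, low-dimensional vector where every dimension carries some signal
    // (real models output far more dimensions, e.g. text-embedding-3-large ⇒ 3072)
    val denseFlutter = floatArrayOf(0.88f, 0.76f, 0.34f, 0.23f)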
  14. 49 Vector (Query) Embedding 0.88 0.76 0.34 0.23 0.64 0.01

    0.44 0.66 What’s Flutter? n-dimension ex) text-embedding-3-large ⇒ 3072 dim What’s React Native? 0.15 0.22 0.89 0.12 0.34 0.09 0.55 0.77 Dense Vector Vector Store (DB) What’s Flutter What’s React Native? Dense Embedding
  15. 50 Vector Store

    1. Embedding Storage: The Vector Store stores pre-built document embeddings (embedding vectors). Embeddings transform text into a high-dimensional vector space, where documents with similar meanings are located close to each other in the vector space.
    2. Similarity Search: When a user submits a query, it is also converted into a vector. The Vector Store then finds the embeddings most similar to the query vector, identifying relevant documents. This process involves mathematical calculations such as cosine similarity, Euclidean distance, and other similarity metrics.
    3. Information Provision (Retriever): RAG (Retrieval-Augmented Generation) uses the retrieved similar documents to generate responses. The Vector Store provides highly relevant documents, allowing the model to generate more accurate and contextually relevant responses.
  16. 51 A (1, 4) B (4, 2) Vector Store Retriever

    → Similarity Search → Cosine Similarity
  17. 52 A (1, 4) B (4, 2) e.g. ) 2-dimension

    vector Vector Store Retriever → Similarity Search → Cosine Similarity
  18. 53 A (1, 4) B (4, 2) e.g. ) 2-dimension

    vector Cosine Similarity A - C ⇒ P0 Cosine Similarity A - B ⇒ P1 … Cosine Similarity A - N ⇒ Pn C (-3, -2) ex) user query Vector Store Retriever → Similarity Search → Cosine Similarity Top-K (n) n-chunks
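    A minimal sketch of the cosine-similarity comparison drawn on the slide, using the 2-dimension example vectors A(1, 4), B(4, 2) and C(-3, -2) (toy values; in the real pipeline these are the n-dimensional embeddings from the previous slides):

    import kotlin.math.sqrt

    fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var normA = 0f; var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (sqrt(normA) * sqrt(normB))
    }

    val a = floatArrayOf(1f, 4f)    // stored chunk A
    val b = floatArrayOf(4f, 2f)    // stored chunk B
    val c = floatArrayOf(-3f, -2f)  // user query C
    println(cosineSimilarity(a, b)) // ≈ 0.65
    println(cosineSimilarity(a, c)) // ≈ -0.74, i.e. A and the query point in opposite directions
    // The retriever scores every stored chunk against the query embedding this way and keeps the Top-K.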
  19. 55

  20. 56

  21. 57 LLM ( 1 - Basic Usage : OpenAI )

    @Serializable data class ChatMessage( val role: String, val content: String ) @Serializable data class ChatRequest( val model: String, val messages: List<ChatMessage> ) @Serializable data class ChatResponse( val id: String, val choices: List<Choice>, val usage: Usage ) @Serializable data class Choice( val index: Int, val message: ChatMessage, val finish_reason: String? ) @Serializable data class Usage( val prompt_tokens: Int, val completion_tokens: Int, val total_tokens: Int ) Based on OpenAI API Docs
  22. 58 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }
  23. 59 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }
  24. 60 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } } To use the API, you need to obtain an API key.
  25. 61 LLM ( 1 - Basic Usage : OpenAI )

    class ChatViewModel : ViewModel() { private val _messages = MutableStateFlow<List<ChatMessage>>(emptyList()) val messages: StateFlow<List<ChatMessage>> get() = _messages private val apiKey = BuildConfig.openApiKey fun sendMessage(content: String) { /// code } }
  26. 62 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }
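    The slide elides how updatedMessages is assembled behind "/// code"; one plausible completion, assuming the system message should only be sent on the first turn (a sketch, not the speaker's exact code):

    // Hypothetical completion of the elided step
    val updatedMessages =
        if (_messages.value.isEmpty()) listOf(systemMessage, newMessage)
        else _messages.value + newMessage
    _messages.value = _messages.value + newMessage // show the user's message in the UI right away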
  27. 63 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } } You can replace this with your own fine-tuned model, e.g. OpenAI o1-mini or o1-preview.
  28. 64 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }
  29. 65 LLM ( 1 - Basic Usage : OpenAI )

    @Composable fun OpenAIChatScreen(viewModel: ChatViewModel = viewModel()) { val messages by viewModel.messages.collectAsState() var inputMessage by remember { mutableStateOf("") } val keyboardController = LocalSoftwareKeyboardController.current /// code .. }
  30. 66 LLM ( 1 - Basic Usage : OpenAI )

    Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { TextField( value = inputMessage, label = { Text("Prompt") }, onValueChange = { inputMessage = it }, modifier = Modifier .weight(1f) .background(Color.White), keyboardActions = KeyboardActions( onDone = { keyboardController?.hide() }) ) // Button.. remember { mutableStateOf("") }
  31. 67 LLM ( 1 - Basic Usage : OpenAI )

    Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { /// TextField ... Button(onClick = { if (inputMessage.isNotBlank()) { viewModel.sendMessage(inputMessage) inputMessage = "" keyboardController?.hide() } }) { Text("Send") } }
  32. 68 LLM ( 1 - Basic Usage : OpenAI )

    LazyColumn(modifier = Modifier.fillMaxSize()) { items(messages) { message -> if (message.role != "system") Row(modifier = Modifier.padding(bottom = 16.dp)) { CircleAvatar(message.role) Surface( modifier = Modifier .padding(horizontal = 8.dp), shape = RoundedCornerShape( bottomStart = 16.dp, topEnd = 16.dp ), color = Color.LightGray.copy(alpha = .2f), ) { MarkdownText( modifier = Modifier.padding(8.dp), markdown = "${message.content}" ) } } } }
  33. 69 LLM ( 1 - Basic Usage : OpenAI )

    @Composable fun CircleAvatar( role: String, modifier: Modifier = Modifier, size: Dp = 40.dp ) { val color = when (role) { "user" -> Color.Blue "assistant" -> Color.Green else -> Color.Gray } Canvas(modifier = modifier.size(size)) { val diameter = size.toPx() drawCircle( color = color, radius = diameter / 2 ) } }
  34. 70 LLM ( 1 - Basic Usage : Gemini )

    dependencies { // add the dependency for the Google AI client SDK for Android implementation("com.google.ai.client.generativeai:generativeai:0.9.0") }
  35. 71 LLM ( 1 - Basic Usage : Gemini )

    class GeminiViewModel : ViewModel() { private val _uiState: MutableStateFlow<UiState> = MutableStateFlow(UiState.Initial) val uiState: StateFlow<UiState> = _uiState.asStateFlow() private val generativeModel = GenerativeModel( modelName = "gemini-1.5-flash", apiKey = BuildConfig.apiKey ) } Replace it with the model you want to use.
  36. 72 LLM ( 1 - Basic Usage : Gemini )

    viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } }
  37. 73 LLM ( 1 - Basic Usage : Gemini )

    viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } } ( multimodal )
  38. 76

  39. 79 Improvement (Optimization) Basic RAG Pipeline → Advanced RAG Pipeline

    • Splitter, Chunk & Overlap Size • Multi Query Retriever • Ensemble Retriever ( Dense & Sparse ) • Long-Context Reorder & Re-Ranking • Context Compressor • Prompt Optimization @source https://arxiv.org/pdf/2307.03172
  40. 80 @source: Star Wars: Episode II - Attack of the

    Clones - recreated by author using imgflip
  41. 81 And more… • Infrastructure ◦ MLOps ◦ LLMOps ◦

    RAGOps • Monitoring ◦ Cost ◦ Usage (Log) • Agent ◦ But..
  42. 86 Data Acquisition and Processing Strategy 1. Call the API

    to obtain session timetable information for DroidKaigi 2024. 2. Join speaker and room information to create a single dataset. 3. Filter only the 'en' data from the title and i18n field. 4. For sessions with n speakers, consolidate them into a single cell. 5. Create a dataset that describes each session's information in one paragraph. 6. Since the data is insufficient, let's try to obtain data from before 2024. 7. Convert text file to PDF file.
  43. 87 Datasets import requests import pandas as pd from pprint

    import pprint # API endpoint URL url = "https://xxxx-xxx.droidkaigi.jp/events/droidkaigi2024/timetable" # API call response = requests.get(url) data = '' # Check response status code if response.status_code == 200: # Convert JSON response to Python dictionary data = response.json() pprint(data) df = pd.DataFrame(data["sessions"]) ( 1 - dataset.py )
  44. 89 Datasets df_merged = pd.merge(df_sessions, df_room, left_on='roomId', right_on='id', how='left') merged_sessions_with_speakers

    = speakers_exploded.merge( df_speaker, left_on='speakers', right_on='id', suffixes=('', '_speaker')) final_merged = merged_sessions_with_speakers.merge( df_room, left_on='roomId', right_on='id', suffixes=('', '_room')) ( 2 - dataset.py ) Merged on each id
  45. 95 RAG Pipeline using LangChain4j dependencies { implementation ("dev.langchain4j:langchain4j:0.33.0") implementation

    ("dev.langchain4j:langchain4j-open-ai:0.33.0") implementation ("dev.langchain4j:langchain4j-ollama:0.33.0") implementation ("dev.langchain4j:langchain4j-chroma:0.33.0") }
  46. 96 ( Document Splitter ) coroutineScope.launch(Dispatchers.IO) { val splitter =

    DocumentSplitters.recursive( 128, 16, OpenAiTokenizer("gpt-4o") ) textSegments = splitter.split(document) } The chunk size needs to be calibrated according to the project or app. (+overlap) RAG Pipeline using LangChain4j
  47. 97 ( Embedding - Ollama ) coroutineScope.launch(Dispatchers.IO) { val embeddingModel:

    EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build() val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } RAG Pipeline using LangChain4j
  48. coroutineScope.launch(Dispatchers.IO) { val embeddingModel: EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build()

    val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } 98 Download & Setup Your model ( Embedding - Ollama ) RAG Pipeline using LangChain4j
  49. 99 ( Vector Store - ChromaDB ) val embeddingStore =

    ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true) .build() embeddingStore?.addAll(it?.content(), textSegments) RAG Pipeline using LangChain4j
  50. 100 val embeddingStore = ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true)

    .build() embeddingStore?.addAll(it?.content(), textSegments) ( Vector Store - ChromaDB ) RAG Pipeline using LangChain4j
  51. 101 ( Retriever ) val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val

    maxResults = 5 val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) RAG Pipeline using LangChain4j
  52. 102 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5

    val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j
  53. 103 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5

    val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j
  54. 104 ( Prompt ) var prompt = """ You are

    an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: $question Context: $context Answer: """.trimIndent() val context: String = relevantEmbeddings?.matches() ?.joinToString(separator = "\\n\\n") { match -> match.embedded().text() } ?: "??" Prompt Engineering RAG Pipeline using LangChain4j
  55. 105 class OpenAIViewModel : ViewModel() { private val openAiModel =

    OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  56. 106 class OpenAIViewModel : ViewModel() { private val openAiModel =

    OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } You can replace this with your own fine-tuned model. ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  57. 107 HumanMessage AIMessage SystemMessage class OpenAIViewModel : ViewModel() { private

    val openAiModel = OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  58. 108 class OpenAIViewModel : ViewModel() { /// Code fun sendMessage(prompt:

    String) { val userMsg = /// code... viewModelScope.launch(Dispatchers.IO) { val aiResponse: AiMessage = if (conversation.isEmpty()) { openAiModel.generate(userMsg).content() } else { val previousMessages = conversation.flatMap { listOf(it.first, it.second) } openAiModel.generate(*previousMessages.toTypedArray(), userMsg).content() } conversation.add(userMsg to aiResponse) } } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  59. 109 val model: StreamingChatLanguageModel = OpenAiStreamingChatModel .builder() .temperature(.1) .apiKey(BuildConfig.openApiKey) .build()

    model.generate(prompt, object : StreamingResponseHandler<AiMessage?> { override fun onNext(token: String) { println("onNext: $token") responseText += token } override fun onComplete(response: Response<AiMessage?>) { println("onComplete: $response") } override fun onError(error: Throwable) { error.printStackTrace() } }) ( Generation : LLM - OpenAI → Streaming) OllamaStreamingChatModel Update UI State RAG Pipeline using LangChain4j
  60. 111 Result: RAG Pipeline using LangChain4j

    Feature | On-Device | Remote
    Text Splitter | ✅ | -
    Sentence Embedding | - | ✅
    Vector Store (DB) | - | ✅
    LLM | - | ✅
  61. 112 Pros vs Cons - Langchain4j

    Pros
    1. It can be implemented quickly without implementing a service client.
    2. It is easy to implement the RAG pipeline.
    Cons
    1. Personally, I found the learning curve to be higher than that of the Langchain Python library.
    2. There is a need for improvements in the official documentation.
    3. Not all existing Langchain features are compatible.
  62. 113 Limitation - LangChain4j 1. There is currently an error

    occurring when running the embedding model on mobile devices (Android). a. https://github.com/langchain4j/langchain4j/issues/776 b. https://github.com/langchain4j/langchain4j/issues/1093 c. https://github.com/langchain4j/langchain4j/issues/1202 2. On-device LLM cannot be run (Android) 3. There are limitations in implementing detailed features.
  63. 116 Implement Mission

    How to run an embedding model on a mobile device?
    How to store embedded vector data in a vector database?
    How to run an LLM model on a mobile device?
  64. How can I store vector data in a vector store

    and implement a retriever? 118
  65. 120 On-Device RAG ( Vector Database ) @Entity data class

    Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )
  66. 121 On-Device RAG ( Vector Database ) fun getSimilarChunks(queryEmbedding: FloatArray,

    n: Int = 5): List<Pair<Float, Chunk>> { return chunksBox .query(Chunk_.chunkEmbedding.nearestNeighbors(queryEmbedding, 10)) .build() .findWithScores() .map { Pair(it.score.toFloat(), it.get()) } .take(n) }
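    The chunksBox used above is not shown on the slide; a minimal ObjectBox setup might look like the following (MyObjectBox is generated by the ObjectBox Gradle plugin from the Chunk entity; the function names here are assumptions, not the speaker's code):

    import io.objectbox.Box
    import io.objectbox.BoxStore

    lateinit var store: BoxStore
    lateinit var chunksBox: Box<Chunk>

    fun initObjectBox(context: android.content.Context) {
        // Build the BoxStore once (e.g. in Application.onCreate) and reuse it
        store = MyObjectBox.builder()
            .androidContext(context.applicationContext)
            .build()
        chunksBox = store.boxFor(Chunk::class.java)
    }

    // Index a chunk together with its embedding so nearestNeighbors() can find it later
    fun addChunk(chunk: Chunk): Long = chunksBox.put(chunk)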
  67. 123

  68. 124 MediaPipe (access date: 2024.08)

    Type | Model | Note
    Text embedding | Universal Sentence Encoder | 100 dim
    Generative AI | Gemma 1.1 2B | CPU, GPU
    Generative AI | Gemma 1.1 7B | -
    Generative AI | Falcon 1B | CPU, GPU
    Generative AI | StableLM 3B | CPU, GPU
    Generative AI | Phi-2 | CPU, GPU
  69. 125 MediaPipe - Limitations • Text Embedding Dimension ◦ Universal

    Sentence Encoder : Dimension → 100 • The models supported for conversion are currently limited. • Current Supported Model (2024.09) {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}. These are the reasons for switching to ONNX / ONNX Runtime.
  70. 130 Local Embedding (on-device) suspend fun encode( sentence: String ):

    FloatArray = withContext(Dispatchers.IO) { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() }
  71. 131 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO)

    { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Model Tokenizer Local Embedding (on-device)
  72. 132 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO)

    { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Local Embedding (on-device)
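    The tokenizer and ortSession used by encode() are created elsewhere; a minimal ONNX Runtime session setup might look like this (the asset file name is an assumption; the deck later lists all-minilm-l6-v2 / bge-small-en-v1.5 as the on-device embedding models):

    import ai.onnxruntime.OrtEnvironment
    import ai.onnxruntime.OrtSession

    fun createEmbeddingSession(context: android.content.Context): OrtSession {
        val env = OrtEnvironment.getEnvironment()
        // Hypothetical asset name for the exported embedding model
        val modelBytes = context.assets.open("all-minilm-l6-v2.onnx").readBytes()
        return env.createSession(modelBytes, OrtSession.SessionOptions())
    }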
  73. 135 Local(Offline) LLM : on-device (xxx.cpp) #include <iostream> #include <fstream>

    #include <jni.h> #include <android/asset_manager_jni.h> #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h"
  74. 136 (xxx.cpp) #include <iostream> #include <fstream> #include <jni.h> #include <android/asset_manager_jni.h>

    #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" MNN based Tokenizer (mnn-llm) Local(Offline) LLM : on-device
  75. 137 (xxx.cpp) #include <iostream> #include <fstream> #include <jni.h> #include <android/asset_manager_jni.h>

    #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" onnxruntime headers Local(Offline) LLM : on-device
  76. 138 (JNI ) class LocalExternAPI { // Declare native methods

    external fun preProcess(): Boolean external fun loadModels( assetManager: AssetManager, useGPU: Boolean, fp16: Boolean, useNNAPI: Boolean, useXNNPACK: Boolean, useQNN: Boolean, useDSPNPU: Boolean ): Boolean external fun runLLM( query: String, addPrompt: Boolean, clear: Boolean ): String companion object { init { System.loadLibrary("myapplication") } } } extern "C" JNIEXPORT jboolean JNICALL Java_com_example_myapplication_LocalExte rnAPI_loadModels( JNIEnv *env, jobject clazz, jobject asset_manager, jboolean use_gpu, jboolean use_fp16, jboolean use_nnapi, jboolean use_xnnpack, jboolean use_qnn, jboolean use_dsp_npu) Local(Offline) LLM : on-device
  77. 139 Local(Offline) LLM fun startLLM(input: String, prompt: String) { getSimilarChunks(input,

    n = 3).forEach { retrievedContextList.add( RetrievedContext( it.second.docFileName, it.second.chunkData, it.first ) ) } val sortedList = retrievedContextList.sortedByDescending { it.score } sortedList.forEach { jointContext += " " + it.context } /// code.. } Retriever From Local Database with Similarity Search Sorted By Score For improve generation ( RAG pipeline : Retriever )
  78. 140 Local(Offline) LLM fun startLLM(input: String, prompt: String) { val

    inputPrompt = prompt.replace("\$CONTEXT", jointContext).replace("\$QUESTION", input) You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: {question} Context: {context} Answer: Prompt Engineering Write your own prompt. ( RAG pipeline : prompt setup )
  79. 141 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value

    = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } ( RAG pipeline : Generation )
  80. 142 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value

    = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } Count tokens/s Update UI State ( RAG pipeline : Generation )
  81. 143 Final Result: Migrate all features from remote to on-device.

    Feature | On-Device | Remote | Note
    Text Splitter | ✅ | - | Custom Splitter
    Sentence Embedding | ✅ | - | all-minilm-l6-v2, bge-small-en-v1.5
    Vector Store (DB) | ✅ | - | ObjectBox
    LLM | ✅ | - | Gemma2, Qwen2-1.5B, Phi3.5-mini
  82. • The LLM model size is large (about 1GB~3GB). •

    User experience ◦ initial loading speed and download time • It is affected by hardware performance ◦ tokens/sec (Recommend over 10 tokens/s) ◦ Test device (Galaxy A15: avg 4.5 tokens/s) • Context window size 144 Limitations : On-device RAG
  83. 1. An LLM (Large Language Model) is an artificial intelligence

    model trained on large datasets to perform natural language processing tasks. 2. The basic RAG (Retrieval-Augmented Generation) pipeline. 3. The basic usage of LLMs in mobile apps and how to build a RAG pipeline using Langchain4j. 4. Implementing on-device RAG (Android). 148 Summary
  84. 155

  85. 163 1. Knowledge Expansion: Large language models rely on pre-trained

    data. RAG can retrieve relevant information from external databases or documents, allowing the model to provide information it might not inherently know. 2. Improved Accuracy: RAG generates responses by leveraging external knowledge, providing more accurate and up-to-date information than a simple generative model. This is particularly useful when dealing with the latest information or detailed content in specific fields. 3. Efficiency: RAG retrieves and utilizes necessary information in real-time, eliminating the need for large-scale models with massive parameters. This saves memory and computational resources while maintaining high performance. 4. Versatility: RAG can be applied across various domains. It is particularly effective in tasks that require specialized knowledge, such as finance, customer support, legal advice, and technical document generation. 5. Enhanced Understanding: By utilizing retrieved documents to generate responses, RAG helps the model better understand the context and provide more relevant and accurate answers.
  86. 165 LLM ( 1 - Basic Usage : ollama )

    https://github.com/ollama/ollama/blob/main/docs/api.md data class GenerateRequest( val model: String, val prompt: String, val stream: Boolean ) data class GenerateResponse( val model: String, val created_at: String, val response: String, val done: Boolean, val context: List<Int>, val total_duration: Long, val load_duration: Long, val prompt_eval_count: Int, val prompt_eval_duration: Long, val eval_count: Int, val eval_duration: Long )
  87. 166 LLM ( 1 - Basic Usage : ollama )

    https://github.com/ollama/ollama/blob/main/docs/api.md // Define Retrofit API interface interface OllamaApiService { @POST("api/generate") suspend fun generate(@Body request: GenerateRequest): GenerateResponse }
  88. 167 LLM ( 1 - Basic Usage : ollama )

    fun createOllamaApiService(): OllamaApiService { val okHttpClient = OkHttpClient().newBuilder() .connectTimeout(30, TimeUnit.SECONDS) .readTimeout(30, TimeUnit.SECONDS) .writeTimeout(30, TimeUnit.SECONDS) .build() val retrofit = Retrofit.Builder() .baseUrl("http://10.0.2.2:11434/") .client(okHttpClient) .addConverterFactory(GsonConverterFactory.create()) .build() return retrofit.create(OllamaApiService::class.java) } Change to your ollama port Default 11434
  89. 168 LLM ( 1 - Basic Usage : ollama )

    class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } } Replace your model name
  90. 169 LLM ( 1 - Basic Usage : ollama )

    class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } }
  91. 170 Datasets ( 4 ) df_aggregated = final_merged.groupby( ['title', 'i18nDesc',

    'startsAt', 'endsAt', 'roomName', 'lengthInMinutes', 'language', targetColumnName], as_index=False).agg( {'fullName': lambda x: combine_speakers(list(x))})
  92. 172 Datasets final_merged['title'] = final_merged['title'].apply( lambda x: x.get('en') if isinstance(x,

    dict) else x) final_merged['i18nDesc'] = final_merged['i18nDesc'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) final_merged['i18nTargetAudience'] = final_merged['i18nTargetAudience'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) ( 3 - dataset.py ) Extract English data
  93. 174 Data Acquisition and Processing Strategy 6. Since the data

    is insufficient, let's try to obtain data from before 2024. DroidKaigi 2021 ~ 2024 The API does not provide data from before 2021. 7. Among the fields of the 2024 data, the key for targetAudience is 'i18nTargetAudience,' while the key for data from before 2024 is 'targetAudience.' 8. Convert text file to PDF file.
  94. 175 Datasets ( 4 - dataset.py ) for index, row

    in df_aggregated.iterrows(): i18nDesc = row['i18nDesc'].replace("\r\n\r\n", "").replace("\r\n", "") i18nDesc = i18nDesc.replace("\n\n"," ") i18nDesc = i18nDesc.replace("\n"," ") i18nDesc = i18nDesc.strip() description = ( f"At droidkaigi2024, {row['fullName']} will be presenting the session titled '{row['title']}', " f"which is {i18nDesc}. The session will start at {row['startsAt']} and " f"end at {row['endsAt']}, taking place in {row['roomName']}. " f"It will last for {row['lengthInMinutes']} minutes and will be conducted in {row['language']}. " f"This session is aimed at {row[targetColumnName].replace("\r\n\r\n", "").replace("\r\n", "").replace("\n\n"," ").replace("\n"," ")}.\n\n\n" ) session_details.append(description)
  95. 177 On-Device RAG ( Vector Database ) @Entity data class

    Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )
  96. 178 Datasets ( 6, 7 ) year = ['2021', '2022',

    '2023', '2024'] for i in year: url = f"https://xxxx-xxx.droidkaigi.jp/events/droidkaigi{i}/timetable" # API Call response = requests.get(url) # code... targetColumnName = "targetAudience" if i == '2024': targetColumnName = "i18nTargetAudience" if i == '2024': final_merged[targetColumnName] = final_merged[targetColumnName].apply( lambda x: x.get('en') if isinstance(x, dict) else x)
  97. 179 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)
  98. 180 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) eg. /data/local/tmp/llm/model.bin
  99. 181 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) set Options
  100. 182 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)
  101. 183 MediaPipe (Generative AI Task - LLM) inferenceModel.partialResults .collectIndexed {

    index, (partialResult, done) -> currentMessageId?.let { if (index == 0) { /// append } else { /// append } if (done) { currentMessageId = null // Re-enable text input setInputEnabled(true) } } }