Slide 1

Slide 1 text

RAG (Retrieval-Augmented Generation) with Mobile Application Dreamwalker (@jaichangpark) 1

Slide 2

Slide 2 text

Dreamwalker Park Jai-Chang Editable Location @jaichangpark @JAICHANGPARK Dreamus Company GDG Golang Korea / Flutter Seoul 2

Slide 3

Slide 3 text

3 A RAG App Specifically for DroidKaigi TL;DR

Slide 4

Slide 4 text

4 100% Native using Jetpack Compose TL;DR

Slide 5

Slide 5 text

5 TL;DR 100% Finally, Migrate RAG Pipeline to Mobile Device with On-Device LLM (Offline)

Slide 6

Slide 6 text

AI 6 Artificial Intelligence

Slide 7

Slide 7 text

40% Would you like to be more productive? 7 @source: https://news.mit.edu/2023/study-finds-chatgpt-boosts-worker-productivity-writing-0714 https://www.science.org/doi/epdf/10.1126/science.adh2586 https://survey.stackoverflow.co/2024/ai#sentiment-and-usage-ai-select

Slide 8

Slide 8 text

8

Slide 9

Slide 9 text

Large Language Model 9
NLP (Natural Language Processing)
- Definition: Technology that enables AI to understand and process human language.
- Features: Utilizes various technologies; applied in chatbots, machine translation, sentiment analysis, etc.
- Relationship: LLMs play a crucial role in the advancement of NLP.
LLM (Large Language Model)
- Definition: A model trained on vast amounts of text data, capable of human-level language proficiency.
- Features: Trained on large text datasets; performs various NLP tasks; human-level language abilities.
- Relationship: A subset of NLP technology.
Foundation Model

Slide 10

Slide 10 text

The History of NLP: RNN (1985) → LSTM (1997) → Word2Vec (2013) → Seq2Seq (2014) → Attention (2015) → Transformer (2017) → GPT-1 (2018) / BERT (2018) → GPT-2 (2019) → GPT-3 (2020) → GPT-4, Gemini, LLaMA, Claude, etc. 10

Slide 11

Slide 11 text

Transformer - Attention Is All You Need 11 @source: Attention Is All You Need, 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7)

Slide 12

Slide 12 text

12 @source: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Slide 13

Slide 13 text

Hey! Can I … Train LLMs like LLaMA, GPT, or Gemini myself? 13

Slide 14

Slide 14 text

14 Training time based on one Nvidia H100-80G GPU (@source: Meta AI)
Llama 3 8B: 1.3M GPU hours / Llama 3.1 8B: 1.46M GPU hours
Llama 3 70B: 6.4M GPU hours / Llama 3.1 70B: 7.0M GPU hours
(no Llama 3 equivalent) / Llama 3.1 405B: 30.84M GPU hours
Llama 3 total: 7.7M GPU hours / Llama 3.1 total: 39.3M GPU hours

Slide 15

Slide 15 text

15 Training time (@source: Meta AI) — 148~167 years on a single H100 GPU
Llama 3 8B: 1.3M GPU hours / Llama 3.1 8B: 1.46M GPU hours
Llama 3 70B: 6.4M GPU hours / Llama 3.1 70B: 7.0M GPU hours
(no Llama 3 equivalent) / Llama 3.1 405B: 30.84M GPU hours
Llama 3 total: 7.7M GPU hours / Llama 3.1 total: 39.3M GPU hours
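A quick back-of-the-envelope check on that single-GPU figure (assuming the GPU runs 24 hours a day, 365 days a year):

1,300,000 GPU-hours ÷ (24 × 365 h/year) ≈ 148 years (Llama 3 8B)
1,460,000 GPU-hours ÷ 8,760 h/year ≈ 167 years (Llama 3.1 8B)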

Slide 16

Slide 16 text

16 We use this cluster design for Llama 3 training. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. (NVIDIA H100 GPUs) H100 (1ea) ≈ ¥5,000,000; H100 × 24,576 = ¥122,880,000,000; two clusters = ¥245,760,000,000 ≈ 2458億円 @source: https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/

Slide 17

Slide 17 text

Fine-Tuning 17

Slide 18

Slide 18 text

18 Nvidia GPU Apple Silicon (Unified Memory)

Slide 19

Slide 19 text

19 https://arxiv.org/pdf/2106.09685

Slide 20

Slide 20 text

20 OpenAI Fine-tuning for DroidKaigi

Slide 21

Slide 21 text

Quantization 21

Slide 22

Slide 22 text

22 Origin 4 bit color @Source: https://en.wikipedia.org/wiki/Color_quantization

Slide 23

Slide 23 text

23 Model / Memory
Google Pixel 9: 12GB RAM
Pixel 9 Pro: 16GB RAM
Samsung Galaxy Z Fold 6: 12GB RAM
Galaxy Z Fold 5: 12GB RAM
Galaxy Z Flip 6: 12GB RAM
Galaxy S24: 8GB RAM
Galaxy S23: 8GB RAM
Galaxy A25: 6GB RAM
Galaxy A15: 6GB RAM

Slide 24

Slide 24 text

24 Raspberry Pi 5: 2GB, 4GB, and 8GB (LPDDR4X-4267 SDRAM); Google Coral Micro: 64MB @source: https://www.raspberrypi.com/products/raspberry-pi-5/ https://coral.ai/products/dev-board-micro/

Slide 25

Slide 25 text

25 Meta-Llama-3.1-8B: weights + bias ≈ about 16GB @source: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

Slide 26

Slide 26 text

26 Quantization: FP32 → FP16 / BF16 (Brain Float 16) → INT8 → INT4 (*detail equation)
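The "*detail equation" here is usually the affine quantization formula x_q = round(x / scale) + zeroPoint, with x ≈ (x_q - zeroPoint) * scale on the way back. A minimal Kotlin sketch of INT8 quantization under that assumption (per-tensor min/max calibration; this is an illustration, not code from the talk):

// Affine INT8 quantization sketch: x_q = round(x / scale) + zeroPoint
// Assumes maxV > minV (otherwise scale would be zero).
fun quantizeInt8(x: FloatArray): Triple<ByteArray, Float, Int> {
    val minV = x.minOrNull() ?: 0f
    val maxV = x.maxOrNull() ?: 0f
    val scale = (maxV - minV) / 255f                 // spread the range over 256 INT8 levels
    val zeroPoint = (-128 - (minV / scale)).toInt()  // offset so that minV maps to -128
    val q = ByteArray(x.size) { i ->
        (Math.round(x[i] / scale) + zeroPoint).coerceIn(-128, 127).toByte()
    }
    return Triple(q, scale, zeroPoint)
}

// Dequantization: x ≈ (x_q - zeroPoint) * scale
fun dequantizeInt8(q: ByteArray, scale: Float, zeroPoint: Int): FloatArray =
    FloatArray(q.size) { i -> (q[i] - zeroPoint) * scale }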

Slide 27

Slide 27 text

What’s RAG ? Retrieval Augmented Generation 27

Slide 28

Slide 28 text

28 @source: https://arxiv.org/abs/2005.11401

Slide 29

Slide 29 text

29 Retrieval-augmented generation (RAG) is a software architecture and technique that integrates large language models with external information sources, enhancing the accuracy and reliability of generative AI models by incorporating specific business data like documents, SQL databases, and internal applications. @source: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Slide 30

Slide 30 text

Why? 30 Motivation

Slide 31

Slide 31 text

Large Language Model 31 Limitations of LLMs in Learning from the Latest Data

Slide 32

Slide 32 text

Large Language Model 32 Limitations of LLMs in Learning from the Latest Data

Slide 33

Slide 33 text

Large Language Model 33 Example of LLM Hallucination

Slide 34

Slide 34 text

Large Language Model 34 Example of LLM Hallucination

Slide 35

Slide 35 text

Large Language Model 35 Preventing LLM hallucinations 1. Enhancing data through web searches for utilization in user queries. 2. This enables responses to questions regarding the latest data.

Slide 36

Slide 36 text

36 Security Private Data Privacy

Slide 37

Slide 37 text

37 Private Data https://notebooklm.google/ NotebookLM

Slide 38

Slide 38 text

38 https://notebooklm.google/ NotebookLM

Slide 39

Slide 39 text

39 Development Flow(Process) of RAG App

Slide 40

Slide 40 text

40 1 Hardware Selection

Slide 41

Slide 41 text

41 2 RAG Framework

Slide 42

Slide 42 text

42 2. RAG Framework

Slide 43

Slide 43 text

43 LangChain 🦜🔗 ● LangChain is a framework for developing applications powered by large language models (LLMs). ● Open-source libraries: Build your applications using LangChain's modular building blocks and components. ● Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence. ● Deployment: Turn any chain into a REST API with LangServe. @source: https://www.langchain.com/

Slide 44

Slide 44 text

44 2. RAG Framework langchain.dart langchain4j https://pub.dev/packages/langchain https://docs.langchain4j.dev/

Slide 45

Slide 45 text

45 3 Implement RAG

Slide 46

Slide 46 text

46 Implement RAG : Basic RAG Pipeline

Slide 47

Slide 47 text

47 Basic RAG Pipeline Retrieval, Generation

Slide 48

Slide 48 text

48 Embedding Model ● Deep Learning Model ● Sparse Embedding a. One-Hot Encoding b. TF-IDF (Term Frequency-Inverse Document Frequency) ● Dense Embedding a. Word2Vec b. BERT etc.. c. Text-embedding-3 (OpenAI)
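As a toy contrast between the two families (a hypothetical 5-word vocabulary; real dense vectors come from a trained model, not hand-written values):

val vocab = listOf("flutter", "react", "native", "android", "kotlin")

// Sparse embedding: a one-hot vector as long as the vocabulary, almost all zeros
fun oneHot(word: String): FloatArray =
    FloatArray(vocab.size) { i -> if (vocab[i] == word) 1f else 0f }

// Dense embedding: a short, learned vector where every dimension carries signal
// (illustrative values; a model like text-embedding-3-large returns 3072 dims)
val denseFlutter = floatArrayOf(0.88f, 0.76f, 0.34f, 0.23f)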

Slide 49

Slide 49 text

49 Dense Embedding: a query such as "What’s Flutter?" or "What’s React Native?" is converted into an n-dimensional dense vector (e.g., text-embedding-3-large ⇒ 3072 dimensions; shown here as [0.88, 0.76, 0.34, 0.23, …] and [0.15, 0.22, 0.89, 0.12, …]) and stored in the Vector Store (DB).

Slide 50

Slide 50 text

50 Vector Store 1. Embedding Storage The Vector Store stores pre-built document embeddings (Embedding Vectors). Embeddings transform text into a high-dimensional vector space, where documents with similar meanings are located close to each other in the vector space. 2. Similarity Search When a user submits a query, it is also converted into a vector. The Vector Store then finds the embeddings most similar to the query vector, identifying relevant documents. This process involves mathematical calculations such as cosine similarity, Euclidean distance, and other similarity metrics. 3. Information Provision (Retriever) RAG (Retrieval-Augmented Generation) uses the retrieved similar documents to generate responses. The Vector Store provides highly relevant documents, allowing the model to generate more accurate and contextually relevant responses.

Slide 51

Slide 51 text

51 A (1, 4) B (4, 2) Vector Store Retriever → Similarity Search → Cosine Similarity

Slide 52

Slide 52 text

52 A (1, 4) B (4, 2) e.g. ) 2-dimension vector Vector Store Retriever → Similarity Search → Cosine Similarity

Slide 53

Slide 53 text

53 Vector Store Retriever → Similarity Search → Cosine Similarity. e.g.) 2-dimension vectors: A (1, 4), B (4, 2), C (-3, -2) (ex: user query). Cosine Similarity A - C ⇒ P0, Cosine Similarity A - B ⇒ P1, …, Cosine Similarity A - N ⇒ Pn. The Top-K (n) highest-scoring entries are returned as n chunks.
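A minimal Kotlin sketch of the retrieval step sketched above: compute the cosine similarity between the query vector and every stored vector, then keep the Top-K. The function and variable names are illustrative, not from the app's code.

import kotlin.math.sqrt

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// e.g. query C(-3, -2) scored against stored chunks A(1, 4) and B(4, 2)
fun topK(query: FloatArray, store: Map<String, FloatArray>, k: Int): List<Pair<String, Float>> =
    store.map { (id, vec) -> id to cosineSimilarity(query, vec) }
        .sortedByDescending { it.second }
        .take(k)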

Slide 54

Slide 54 text

54 Vector Store

Slide 55

Slide 55 text

55

Slide 56

Slide 56 text

56

Slide 57

Slide 57 text

57 LLM ( 1 - Basic Usage : OpenAI ) @Serializable data class ChatMessage( val role: String, val content: String ) @Serializable data class ChatRequest( val model: String, val messages: List<ChatMessage> ) @Serializable data class ChatResponse( val id: String, val choices: List<Choice>, val usage: Usage ) @Serializable data class Choice( val index: Int, val message: ChatMessage, val finish_reason: String? ) @Serializable data class Usage( val prompt_tokens: Int, val completion_tokens: Int, val total_tokens: Int ) Based on OpenAI API Docs

Slide 58

Slide 58 text

58 LLM ( 1 - Basic Usage : OpenAI ) object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }

Slide 59

Slide 59 text

59 LLM ( 1 - Basic Usage : OpenAI ) object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }

Slide 60

Slide 60 text

60 LLM ( 1 - Basic Usage : OpenAI ) object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } } To use the API, you need to obtain an API key.

Slide 61

Slide 61 text

61 LLM ( 1 - Basic Usage : OpenAI ) class ChatViewModel : ViewModel() { private val _messages = MutableStateFlow<List<ChatMessage>>(emptyList()) val messages: StateFlow<List<ChatMessage>> get() = _messages private val apiKey = BuildConfig.openApiKey fun sendMessage(content: String) { /// code } }

Slide 62

Slide 62 text

62 LLM ( 1 - Basic Usage : OpenAI ) fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }

Slide 63

Slide 63 text

63 LLM ( 1 - Basic Usage : OpenAI ) fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } } You can replace this with your own fine-tuned model, or with OpenAI o1-mini or o1-preview.

Slide 64

Slide 64 text

64 LLM ( 1 - Basic Usage : OpenAI ) fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }

Slide 65

Slide 65 text

65 LLM ( 1 - Basic Usage : OpenAI ) @Composable fun OpenAIChatScreen(viewModel: ChatViewModel = viewModel()) { val messages by viewModel.messages.collectAsState() var inputMessage by remember { mutableStateOf("") } val keyboardController = LocalSoftwareKeyboardController.current /// code .. }

Slide 66

Slide 66 text

66 LLM ( 1 - Basic Usage : OpenAI ) Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { TextField( value = inputMessage, label = { Text("Prompt") }, onValueChange = { inputMessage = it }, modifier = Modifier .weight(1f) .background(Color.White), keyboardActions = KeyboardActions( onDone = { keyboardController?.hide() }) ) // Button.. remember { mutableStateOf("") }

Slide 67

Slide 67 text

67 LLM ( 1 - Basic Usage : OpenAI ) Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { /// TextField ... Button(onClick = { if (inputMessage.isNotBlank()) { viewModel.sendMessage(inputMessage) inputMessage = "" keyboardController?.hide() } }) { Text("Send") } }

Slide 68

Slide 68 text

68 LLM ( 1 - Basic Usage : OpenAI ) LazyColumn(modifier = Modifier.fillMaxSize()) { items(messages) { message -> if (message.role != "system") Row(modifier = Modifier.padding(bottom = 16.dp)) { CircleAvatar(message.role) Surface( modifier = Modifier .padding(horizontal = 8.dp), shape = RoundedCornerShape( bottomStart = 16.dp, topEnd = 16.dp ), color = Color.LightGray.copy(alpha = .2f), ) { MarkdownText( modifier = Modifier.padding(8.dp), markdown = "${message.content}" ) } } } }

Slide 69

Slide 69 text

69 LLM ( 1 - Basic Usage : OpenAI ) @Composable fun CircleAvatar( role: String, modifier: Modifier = Modifier, size: Dp = 40.dp ) { val color = when (role) { "user" -> Color.Blue "assistant" -> Color.Green else -> Color.Gray } Canvas(modifier = modifier.size(size)) { val diameter = size.toPx() drawCircle( color = color, radius = diameter / 2 ) } }

Slide 70

Slide 70 text

70 LLM ( 1 - Basic Usage : Gemini ) dependencies { // add the dependency for the Google AI client SDK for Android implementation("com.google.ai.client.generativeai:generativeai:0.9.0") }

Slide 71

Slide 71 text

71 LLM ( 1 - Basic Usage : Gemini ) class GeminiViewModel : ViewModel() { private val _uiState: MutableStateFlow<UiState> = MutableStateFlow(UiState.Initial) val uiState: StateFlow<UiState> = _uiState.asStateFlow() private val generativeModel = GenerativeModel( modelName = "gemini-1.5-flash", apiKey = BuildConfig.apiKey ) } Replace it with the model you want to use.

Slide 72

Slide 72 text

72 LLM ( 1 - Basic Usage : Gemini ) viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } }

Slide 73

Slide 73 text

73 LLM ( 1 - Basic Usage : Gemini ) viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } } ( multimodal )

Slide 74

Slide 74 text

74 Basic RAG Pipeline Retrieval, Generation

Slide 75

Slide 75 text

75 4 Assess & Evaluate

Slide 76

Slide 76 text

76

Slide 77

Slide 77 text

77 RAG Evaluation

Slide 78

Slide 78 text

78 Assess (Evaluate)

Slide 79

Slide 79 text

79 Improvement (Optimization) Basic RAG Pipeline → Advanced RAG Pipeline ● Splitter, Chunk & Overlap Size ● Multi Query Retriever ● Ensemble Retriever ( Dense & Sparse ) ● Long-Context Reorder & Re-Ranking (see the sketch below) ● Context Compressor ● Prompt Optimization @source https://arxiv.org/pdf/2307.03172
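For instance, the "Lost in the Middle" paper cited above motivates the Long-Context Reorder step: models attend best to the beginning and end of the prompt, so retrieved chunks can be rearranged so the strongest hits sit at the edges. A minimal sketch of that idea (assumes the list arrives sorted best-first; this is not the LangChain implementation):

// Best chunks go to the front and the back; weaker ones end up in the middle of the context.
fun <T> longContextReorder(sortedByRelevance: List<T>): List<T> {
    val front = ArrayDeque<T>()
    val back = ArrayDeque<T>()
    sortedByRelevance.forEachIndexed { i, chunk ->
        if (i % 2 == 0) front.addLast(chunk) else back.addFirst(chunk)
    }
    return front + back
}
// e.g. [c1, c2, c3, c4, c5] (best-first) → [c1, c3, c5, c4, c2]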

Slide 80

Slide 80 text

80 @source: Star Wars: Episode II - Attack of the Clones - recreated by author using imgflip

Slide 81

Slide 81 text

81 And more… ● Infrastructure ○ MLOps ○ LLMOps ○ RAGOps ● Monitoring ○ Cost ○ Usage (Log) ● Agent ○ But..

Slide 82

Slide 82 text

Implement RAG Mobile App For DroidKaigi 82 EXPERIMENT

Slide 83

Slide 83 text

But, How? 83

Slide 84

Slide 84 text

First, How can I collect DroidKaigi sessions data? 84

Slide 85

Slide 85 text

OMG Thanks for conference-app 85

Slide 86

Slide 86 text

86 Data Acquisition and Processing Strategy 1. Call the API to obtain session timetable information for DroidKaigi 2024. 2. Join speaker and room information to create a single dataset. 3. Filter only the 'en' data from the title and i18n field. 4. For sessions with n speakers, consolidate them into a single cell. 5. Create a dataset that describes each session's information in one paragraph. 6. Since the data is insufficient, let's try to obtain data from before 2024. 7. Convert text file to PDF file.

Slide 87

Slide 87 text

87 Datasets import requests import pandas as pd from pprint import pprint # API endpoint URL url = "https://xxxx-xxx.droidkaigi.jp/events/droidkaigi2024/timetable" # API call response = requests.get(url) data = '' # Check response status code if response.status_code == 200: # Convert JSON response to Python dictionary data = response.json() pprint(data) df = pd.DataFrame(data["sessions"]) ( 1 - dataset.py )

Slide 88

Slide 88 text

88 Datasets ( 2 ) ( speakers ) ( rooms )

Slide 89

Slide 89 text

89 Datasets df_merged = pd.merge(df_sessions, df_room, left_on='roomId', right_on='id', how='left') merged_sessions_with_speakers = speakers_exploded.merge( df_speaker, left_on='speakers', right_on='id', suffixes=('', '_speaker')) final_merged = merged_sessions_with_speakers.merge( df_room, left_on='roomId', right_on='id', suffixes=('', '_room')) ( 2 - dataset.py ) Merged each ids

Slide 90

Slide 90 text

90 Datasets ( pdf ) Split each Paragraph Split Chunk Size ← overlap

Slide 91

Slide 91 text

91 Architecture

Slide 92

Slide 92 text

92 Feature | On-Device | Remote
Text Splitter | - | ✅
Sentence Embedding | - | ✅
Vector Store(DB) | - | ✅
LLM | - | ✅

Slide 93

Slide 93 text

93 LangChain4j

Slide 94

Slide 94 text

94 LangChain4j

Slide 95

Slide 95 text

95 RAG Pipeline using LangChain4j dependencies { implementation ("dev.langchain4j:langchain4j:0.33.0") implementation ("dev.langchain4j:langchain4j-open-ai:0.33.0") implementation ("dev.langchain4j:langchain4j-ollama:0.33.0") implementation ("dev.langchain4j:langchain4j-chroma:0.33.0") }

Slide 96

Slide 96 text

96 ( Document Splitter ) coroutineScope.launch(Dispatchers.IO) { val splitter = DocumentSplitters.recursive( 128, 16, OpenAiTokenizer("gpt-4o") ) textSegments = splitter.split(document) } The chunk size needs to be calibrated according to the project or app. (+overlap) RAG Pipeline using LangChain4j

Slide 97

Slide 97 text

97 ( Embedding - Ollama ) coroutineScope.launch(Dispatchers.IO) { val embeddingModel: EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build() val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } RAG Pipeline using LangChain4j

Slide 98

Slide 98 text

coroutineScope.launch(Dispatchers.IO) { val embeddingModel: EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build() val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } 98 Download & Setup Your model ( Embedding - Ollama ) RAG Pipeline using LangChain4j

Slide 99

Slide 99 text

99 ( Vector Store - ChromaDB ) val embeddingStore = ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true) .build() embeddingStore?.addAll(it?.content(), textSegments) RAG Pipeline using LangChain4j

Slide 100

Slide 100 text

100 val embeddingStore = ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true) .build() embeddingStore?.addAll(it?.content(), textSegments) ( Vector Store - ChromaDB ) RAG Pipeline using LangChain4j

Slide 101

Slide 101 text

101 ( Retriever ) val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5 val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) RAG Pipeline using LangChain4j

Slide 102

Slide 102 text

102 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5 val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j

Slide 103

Slide 103 text

103 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5 val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j

Slide 104

Slide 104 text

104 ( Prompt ) var prompt = """ You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: $question Context: $context Answer: """.trimIndent() val context: String = relevantEmbeddings?.matches() ?.joinToString(separator = "\\n\\n") { match -> match.embedded().text() } ?: "??" Prompt Engineering RAG Pipeline using LangChain4j

Slide 105

Slide 105 text

105 class OpenAIViewModel : ViewModel() { private val openAiModel = OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<ChatMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j

Slide 106

Slide 106 text

106 class OpenAIViewModel : ViewModel() { private val openAiModel = OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<ChatMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } You can replace this with your own fine-tuned model. ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j

Slide 107

Slide 107 text

107 HumanMessage AIMessage SystemMessage class OpenAIViewModel : ViewModel() { private val openAiModel = OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<ChatMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j

Slide 108

Slide 108 text

108 class OpenAIViewModel : ViewModel() { /// Code fun sendMessage(prompt: String) { val userMsg = /// code... viewModelScope.launch(Dispatchers.IO) { val aiResponse: AiMessage = if (conversation.isEmpty()) { openAiModel.generate(userMsg).content() } else { val previousMessages = conversation.flatMap { listOf(it.first, it.second) } openAiModel.generate(*previousMessages.toTypedArray(), userMsg).content() } conversation.add(userMsg to aiResponse) } } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j

Slide 109

Slide 109 text

109 val model: StreamingChatLanguageModel = OpenAiStreamingChatModel .builder() .temperature(.1) .apiKey(BuildConfig.openApiKey) .build() model.generate(prompt, object : StreamingResponseHandler<AiMessage> { override fun onNext(token: String) { println("onNext: $token") responseText += token } override fun onComplete(response: Response<AiMessage>) { println("onComplete: $response") } override fun onError(error: Throwable) { error.printStackTrace() } }) ( Generation : LLM - OpenAI → Streaming) OllamaStreamingChatModel Update UI State RAG Pipeline using LangChain4j
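As the OllamaStreamingChatModel annotation hints, the same StreamingResponseHandler can be driven by a local Ollama server instead of OpenAI. A sketch assuming langchain4j-ollama and the emulator loopback address used earlier (the model name is just an example):

val localModel: StreamingChatLanguageModel = OllamaStreamingChatModel.builder()
    .baseUrl("http://10.0.2.2:11434")   // Ollama on the host machine, reached from the Android emulator
    .modelName("llama3")                // any model pulled locally, e.g. with `ollama pull llama3`
    .temperature(0.1)
    .build()
// localModel.generate(prompt, handler) reuses the streaming handler shown above unchanged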

Slide 110

Slide 110 text

110 The code was reduced by 48% (the UI code was excluded).

Slide 111

Slide 111 text

111 Result: RAG Pipeline using LangChain4j
Feature | On-Device | Remote
Text Splitter | ✅ | -
Sentence Embedding | - | ✅
Vector Store(DB) | - | ✅
LLM | - | ✅

Slide 112

Slide 112 text

112 Pros vs Cons - Langchain4j
Pros
1. It can be implemented quickly without implementing a service client.
2. It is easy to implement the RAG pipeline.
Cons
1. Personally, I found the learning curve to be higher than that of the Langchain Python library.
2. There is a need for improvements in the official documentation.
3. Not all existing Langchain features are compatible.

Slide 113

Slide 113 text

113 Limitation - LangChain4j 1. There is currently an error occurring when running the embedding model on mobile devices (Android). a. https://github.com/langchain4j/langchain4j/issues/776 b. https://github.com/langchain4j/langchain4j/issues/1093 c. https://github.com/langchain4j/langchain4j/issues/1202 2. On-device LLM cannot be run (Android) 3. There are limitations in implementing detailed features.

Slide 114

Slide 114 text

114 On-Device RAG Migration Strategy

Slide 115

Slide 115 text

115 Security Privacy

Slide 116

Slide 116 text

116 Implement Mission: How to run an embedding model on a mobile device? How to store embedded vector data in a vector database? How to run an LLM on a mobile device?

Slide 117

Slide 117 text

117 On-Device RAG Architecture

Slide 118

Slide 118 text

How can I store vector data in a vector store and implement a retriever? 118

Slide 119

Slide 119 text

119 ObjectBox https://docs.objectbox.io/on-device-vector-search

Slide 120

Slide 120 text

120 On-Device RAG ( Vector Database ) @Entity data class Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )
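Before the similarity query on the next slide can return anything, the embedded chunks have to be written into an ObjectBox Box. A minimal setup sketch (MyObjectBox is generated by the ObjectBox Gradle plugin; the variable names and file name are illustrative):

val store: BoxStore = MyObjectBox.builder()
    .androidContext(context)
    .build()
val chunksBox: Box<Chunk> = store.boxFor(Chunk::class.java)

// Persist one chunk: the 384-dim vector comes from the on-device sentence-embedding model
chunksBox.put(
    Chunk(
        docId = 1L,
        chunkEmbedding = embedding,        // FloatArray of size 384
        docFileName = "droidkaigi_sessions.pdf",
        chunkData = chunkText
    )
)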

Slide 121

Slide 121 text

121 On-Device RAG ( Vector Database ) fun getSimilarChunks(queryEmbedding: FloatArray, n: Int = 5): List> { return chunksBox .query(Chunk_.chunkEmbedding.nearestNeighbors(queryEmbedding, 10)) .build() .findWithScores() .map { Pair(it.score.toFloat(), it.get()) } .subList(0, n) }

Slide 122

Slide 122 text

How to run Embedding & LLM model on a mobile device? 122

Slide 123

Slide 123 text

123

Slide 124

Slide 124 text

124 Mediapipe (access date: 2024.08)
Type | Model | Note
Text Embedding | Universal Sentence Encoder | 100 dim
Generative AI | Gemma 1.1 2B | CPU, GPU
Generative AI | Gemma 1.1 7B | -
Generative AI | Falcon 1B | CPU, GPU
Generative AI | StableLM 3B | CPU, GPU
Generative AI | Phi-2 | CPU, GPU

Slide 125

Slide 125 text

125 MediaPipe - Limitations ● Text Embedding Dimension ○ Universal Sentence Encoder : Dimension → 100 ● The models supported for conversion are currently limited. ● Current Supported Models (2024.09): {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"} These are the reasons for switching to ONNX / ONNX Runtime.

Slide 126

Slide 126 text

126 On-Device Sentence Embedding Diagram

Slide 127

Slide 127 text

127 Source: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 sentence-transformers/all-MiniLM-L6-v2 384 dimensional

Slide 128

Slide 128 text

128 sentence-transformers/all-MiniLM-L6-v2 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main

Slide 129

Slide 129 text

129 BAAI/bge-small-en-v1.5 @source: https://huggingface.co/BAAI/bge-small-en-v1.5

Slide 130

Slide 130 text

130 Local Embedding (on-device) suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO) { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String, OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() }

Slide 131

Slide 131 text

131 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO) { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String, OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Model Tokenizer Local Embedding (on-device)

Slide 132

Slide 132 text

132 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO) { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String, OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Local Embedding (on-device)
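Depending on how the ONNX model was exported, the session output may be per-token embeddings rather than a single sentence vector; sentence-transformers models such as all-MiniLM-L6-v2 expect mean pooling over tokens followed by L2 normalization. A hedged sketch of that post-processing step (shapes and names are assumptions, not taken from the talk's code):

import kotlin.math.sqrt

// tokenEmbeddings: one 384-dim vector per token; attentionMask: 1f for real tokens, 0f for padding
fun meanPoolAndNormalize(tokenEmbeddings: Array<FloatArray>, attentionMask: FloatArray): FloatArray {
    val dim = tokenEmbeddings[0].size
    val pooled = FloatArray(dim)
    var count = 0f
    for (t in tokenEmbeddings.indices) {
        if (attentionMask[t] == 0f) continue
        count += 1f
        for (d in 0 until dim) pooled[d] += tokenEmbeddings[t][d]
    }
    for (d in 0 until dim) pooled[d] /= count          // mean pooling over real tokens
    val norm = sqrt(pooled.fold(0f) { acc, v -> acc + v * v })
    for (d in 0 until dim) pooled[d] /= norm           // L2 normalization
    return pooled
}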

Slide 133

Slide 133 text

133 Local(Offline) LLM : on-device

Slide 134

Slide 134 text

134 https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime-android ONNX Runtime core: sessions, C/C++ 1. Convert the aar → zip 2. Copy the jni libs into your project 3. Copy the headers Local(Offline) LLM : on-device

Slide 135

Slide 135 text

135 Local(Offline) LLM : on-device (xxx.cpp) #include #include #include #include #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h"

Slide 136

Slide 136 text

136 (xxx.cpp) #include #include #include #include #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" MNN based Tokenizer (mnn-llm) Local(Offline) LLM : on-device

Slide 137

Slide 137 text

137 (xxx.cpp) #include #include #include #include #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" onnruntime headers Local(Offline) LLM : on-device

Slide 138

Slide 138 text

138 (JNI ) class LocalExternAPI { // Declare native methods external fun preProcess(): Boolean external fun loadModels( assetManager: AssetManager, useGPU: Boolean, fp16: Boolean, useNNAPI: Boolean, useXNNPACK: Boolean, useQNN: Boolean, useDSPNPU: Boolean ): Boolean external fun runLLM( query: String, addPrompt: Boolean, clear: Boolean ): String companion object { init { System.loadLibrary("myapplication") } } } extern "C" JNIEXPORT jboolean JNICALL Java_com_example_myapplication_LocalExternAPI_loadModels( JNIEnv *env, jobject clazz, jobject asset_manager, jboolean use_gpu, jboolean use_fp16, jboolean use_nnapi, jboolean use_xnnpack, jboolean use_qnn, jboolean use_dsp_npu) Local(Offline) LLM : on-device

Slide 139

Slide 139 text

139 Local(Offline) LLM fun startLLM(input: String, prompt: String) { getSimilarChunks(input, n = 3).forEach { retrievedContextList.add( RetrievedContext( it.second.docFileName, it.second.chunkData, it.first ) ) } val sortedList = retrievedContextList.sortedByDescending { it.score } sortedList.forEach { jointContext += " " + it.context } /// code.. } Retriever From Local Database with Similarity Search, Sorted By Score to improve generation ( RAG pipeline : Retriever )

Slide 140

Slide 140 text

140 Local(Offline) LLM fun startLLM(input: String, prompt: String) { val inputPrompt = prompt.replace("\$CONTEXT", jointContext).replace("\$QUESTION", input) You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: {question} Context: {context} Answer: Prompt Engineering Write your own prompt. ( RAG pipeline : prompt setup )

Slide 141

Slide 141 text

141 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } ( RAG pipeline : Generation )

Slide 142

Slide 142 text

142 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } Count tokens/s Update UI State ( RAG pipeline : Generation )

Slide 143

Slide 143 text

143 Final Result: Migrate all features from remote to on-device.
Feature | On-Device | Remote | Note
Text Splitter | ✅ | - | Custom Splitter
Sentence Embedding | ✅ | - | all-minilm-l6-v2, bge-small-en-v1.5
Vector Store(DB) | ✅ | - | ObjectBox
LLM | ✅ | - | Gemma2, Qwen2-1.5B, Phi3.5-mini

Slide 144

Slide 144 text

144 Limitations : On-device RAG ● The LLM model size is large (about 1GB~3GB). ● User experience ○ initial loading speed and download time ● It is affected by hardware performance ○ tokens/sec (recommend over 10 tokens/s) ○ Test device (Galaxy A15: avg 4.5 tokens/s) ● Context window size

Slide 145

Slide 145 text

RAG Long Context PEFT 145 * PEFT(Parameter-Efficient Fine-Tuning)

Slide 146

Slide 146 text

146 RAG or Long-Context LLMs? @source: https://www.arxiv.org/pdf/2407.16833

Slide 147

Slide 147 text

147 Multilingual

Slide 148

Slide 148 text

1. An LLM (Large Language Model) is an artificial intelligence model trained on large datasets to perform natural language processing tasks. 2. The basic RAG (Retrieval-Augmented Generation) pipeline. 3. The basic usage of LLMs in mobile apps and how to build a RAG pipeline using Langchain4j. 4. Implementing on-device RAG (Android). 148 Summary

Slide 149

Slide 149 text

No Limit! RAG 149

Slide 150

Slide 150 text

Thank You Park JaiChang Dreamwalker @jaichangpark

Slide 151

Slide 151 text

Appendix 151

Slide 152

Slide 152 text

AI 152 Artificial Intelligence See, Hear, and Speak 10101010101000101

Slide 153

Slide 153 text

153 @source: https://www.kaggle.com/datasets/samuelcortinhas/muffin-vs-chihuahua-image-classification

Slide 154

Slide 154 text

154 @source: https://mhanational.org/human-brain-101 https://en.wikipedia.org/wiki/File:Complete_neuron_cell_diagram_en.svg

Slide 155

Slide 155 text

155

Slide 156

Slide 156 text

AI 156 Artificial Intelligence See, Hear, and Speak 10101010101000101 Image & Video DroidKaigi Text Sound Multimodal

Slide 157

Slide 157 text

AI 157 Artificial Intelligence AGI Artificial General Intelligence

Slide 158

Slide 158 text

LLM 158 Foundation Model Large Language Model

Slide 159

Slide 159 text

NLP (Natural language processing) 159

Slide 160

Slide 160 text

160 @source https://patents.google.com/patent/US10452978B2/en Transformer - Attention Is All You Need

Slide 161

Slide 161 text

161 https://arxiv.org/pdf/2310.11453 https://arxiv.org/pdf/2402.17764

Slide 162

Slide 162 text

Large Language Model 162 Example of LLM Hallucination

Slide 163

Slide 163 text

163
1. Knowledge Expansion: Large language models rely on pre-trained data. RAG can retrieve relevant information from external databases or documents, allowing the model to provide information it might not inherently know.
2. Improved Accuracy: RAG generates responses by leveraging external knowledge, providing more accurate and up-to-date information than a simple generative model. This is particularly useful when dealing with the latest information or detailed content in specific fields.
3. Efficiency: RAG retrieves and utilizes necessary information in real-time, eliminating the need for large-scale models with massive parameters. This saves memory and computational resources while maintaining high performance.
4. Versatility: RAG can be applied across various domains. It is particularly effective in tasks that require specialized knowledge, such as finance, customer support, legal advice, and technical document generation.
5. Enhanced Understanding: By utilizing retrieved documents to generate responses, RAG helps the model better understand the context and provide more relevant and accurate answers.

Slide 164

Slide 164 text

164 https://superlinked.com/vector-db-comparison Vector DB Comparison

Slide 165

Slide 165 text

165 LLM ( 1 - Basic Usage : ollama ) https://github.com/ollama/ollama/blob/main/docs/api.md data class GenerateRequest( val model: String, val prompt: String, val stream: Boolean ) data class GenerateResponse( val model: String, val created_at: String, val response: String, val done: Boolean, val context: List<Int>, val total_duration: Long, val load_duration: Long, val prompt_eval_count: Int, val prompt_eval_duration: Long, val eval_count: Int, val eval_duration: Long )

Slide 166

Slide 166 text

166 LLM ( 1 - Basic Usage : ollama ) https://github.com/ollama/ollama/blob/main/docs/api.md // Define Retrofit API interface interface OllamaApiService { @POST("api/generate") suspend fun generate(@Body request: GenerateRequest): GenerateResponse }

Slide 167

Slide 167 text

167 LLM ( 1 - Basic Usage : ollama ) fun createOllamaApiService(): OllamaApiService { val okHttpClient = OkHttpClient().newBuilder() .connectTimeout(30, TimeUnit.SECONDS) .readTimeout(30, TimeUnit.SECONDS) .writeTimeout(30, TimeUnit.SECONDS) .build() val retrofit = Retrofit.Builder() .baseUrl("http://10.0.2.2:11434/") .client(okHttpClient) .addConverterFactory(GsonConverterFactory.create()) .build() return retrofit.create(OllamaApiService::class.java) } Change to your ollama port Default 11434

Slide 168

Slide 168 text

168 LLM ( 1 - Basic Usage : ollama ) class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } } Replace your model name

Slide 169

Slide 169 text

169 LLM ( 1 - Basic Usage : ollama ) class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } }

Slide 170

Slide 170 text

170 Datasets ( 4 ) df_aggregated = final_merged.groupby( ['title', 'i18nDesc', 'startsAt', 'endsAt', 'roomName', 'lengthInMinutes', 'language', targetColumnName], as_index=False).agg( {'fullName': lambda x: combine_speakers(list(x))})

Slide 171

Slide 171 text

171 LLM - Generation

Slide 172

Slide 172 text

172 Datasets final_merged['title'] = final_merged['title'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) final_merged['i18nDesc'] = final_merged['i18nDesc'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) final_merged['i18nTargetAudience'] = final_merged['i18nTargetAudience'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) ( 3 - dataset.py ) Extract english data

Slide 173

Slide 173 text

173 Datasets ( txt ) Split each Paragraph Split Chunk Size ← overlap

Slide 174

Slide 174 text

174 Data Acquisition and Processing Strategy 6. Since the data is insufficient, let's try to obtain data from before 2024. DroidKaigi 2021 ~ 2024 The API does not provide data from before 2021. 7. Among the fields of the 2024 data, the key for targetAudience is 'i18nTargetAudience,' while the key for data from before 2024 is 'targetAudience.' 8. Convert text file to PDF file.

Slide 175

Slide 175 text

175 Datasets ( 4 - dataset.py ) for index, row in df_aggregated.iterrows(): i18nDesc = row['i18nDesc'].replace("\r\n\r\n", "").replace("\r\n", "") i18nDesc = i18nDesc.replace("\n\n", " ") i18nDesc = i18nDesc.replace("\n", " ") i18nDesc = i18nDesc.strip() target = row[targetColumnName].replace("\r\n\r\n", "").replace("\r\n", "").replace("\n\n", " ").replace("\n", " ") description = ( f"At droidkaigi2024, {row['fullName']} will be presenting the session titled '{row['title']}', " f"which is {i18nDesc}. The session will start at {row['startsAt']} and " f"end at {row['endsAt']}, taking place in {row['roomName']}. " f"It will last for {row['lengthInMinutes']} minutes and will be conducted in {row['language']}. " f"This session is aimed at {target}.\n\n\n" ) session_details.append(description)

Slide 176

Slide 176 text

176 Architecture

Slide 177

Slide 177 text

177 On-Device RAG ( Vector Database ) @Entity data class Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )

Slide 178

Slide 178 text

178 Datasets ( 6, 7 ) year = ['2021', '2022', '2023', '2024'] for i in year: url = f"https://xxxx-xxx.droidkaigi.jp/events/droidkaigi{i}/timetable" # API Call response = requests.get(url) # code... targetColumnName = "targetAudience" if i == '2024': targetColumnName = "i18nTargetAudience" if i == '2024': final_merged[targetColumnName] = final_merged[targetColumnName].apply( lambda x: x.get('en') if isinstance(x, dict) else x)

Slide 179

Slide 179 text

179 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)

Slide 180

Slide 180 text

180 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) eg. /data/local/tmp/llm/model.bin

Slide 181

Slide 181 text

181 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) set Options

Slide 182

Slide 182 text

182 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)

Slide 183

Slide 183 text

183 MediaPipe (Generative AI Task - LLM) inferenceModel.partialResults .collectIndexed { index, (partialResult, done) -> currentMessageId?.let { if (index == 0) { /// append } else { /// append } if (done) { currentMessageId = null // Re-enable text input setInputEnabled(true) } } }

Slide 184

Slide 184 text

Discussion & Summary 184

Slide 185

Slide 185 text

👍 Recency 🤔 Relevancy

Slide 186

Slide 186 text

Agent 🧙 186