
RAG with Mobile Application - DroidKaigi 2024

JaiChangPark (박제창)
September 15, 2024

In the era of AI, Retrieval Augmented Generation (RAG) combines the strengths of retrieval-based and generation-based methods to improve information retrieval and content generation. This presentation explores how RAG can be implemented in mobile applications to provide more accurate, context-aware, and user-specific responses, and aims to give Android application developers a comprehensive understanding of RAG and its practical uses. By the end of the session, developers will be equipped with the knowledge and tools necessary to implement RAG in their own projects, enhancing their apps with cutting-edge AI capabilities.


Transcript

  1. 40% Would you like to be more productive? 7 @source:

    https://news.mit.edu/2023/study-finds-chatgpt-boosts-worker-productivity-writing-0714 https://www.science.org/doi/epdf/10.1126/science.adh2586 https://survey.stackoverflow.co/2024/ai#sentiment-and-usage-ai-select
  2. 8

  3. 9 Large Language Model

    Category | NLP (Natural Language Processing) | LLM (Large Language Model)
    Definition | Technology that enables AI to understand and process human language. | A model trained on vast amounts of text data, capable of human-level language proficiency.
    Features | Utilizes various technologies; applied in chatbots, machine translation, sentiment analysis, etc. | Trained on large text datasets; performs various NLP tasks; human-level language abilities.
    Relationship | LLMs play a crucial role in the advancement of NLP. | A subset of NLP technology; a Foundation Model.
  4. 10 The History of NLP

    RNN (1985) → LSTM (1997) → Word2Vec (2013) → Seq2Seq (2014) → Attention (2015) → Transformer (2017) → GPT-1 (2018) / BERT (2018) → GPT-2 (2019) → GPT-3 (2020) → GPT-4, Gemini, LLaMa, Claude, etc.
  5. 11 Transformer - Attention Is All You Need

    @source: Attention Is All You Need, 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7)
  6. 14 Model Training Time (GPU hours), based on NVIDIA H100-80G 1EA @source: Meta AI

    Llama 3 8B: 1.3M | Llama 3.1 8B: 1.46M
    Llama 3 70B: 6.4M | Llama 3.1 70B: 7.0M
    - | Llama 3.1 405B: 30.84M
    Total: 7.7M | Total: 39.3M
  7. 15 Model Training Time (GPU hours) @source: Meta AI (148~167 Years on a Single H100 GPU)

    Llama 3 8B: 1.3M | Llama 3.1 8B: 1.46M
    Llama 3 70B: 6.4M | Llama 3.1 70B: 7.0M
    - | Llama 3.1 405B: 30.84M
    Total: 7.7M | Total: 39.3M
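    For reference, the year figure is simple arithmetic on the 8B rows: 1.3M GPU hours ÷ 8,760 hours per year ≈ 148 years, and 1.46M ÷ 8,760 ≈ 167 years, so even the smallest Llama 3 / 3.1 models would take roughly 148~167 years to train on a single H100.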
  8. 16 We use this cluster design for Llama 3 training.

    Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta (NVIDIA H100 GPUs). H100 (1ea) = about ¥5,000,000; H100 × 24,576 = ¥122,880,000,000; two clusters = ¥245,760,000,000 = about ¥245.8 billion (2458億円). @source: https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
  9. 23 Model | Memory

    Google Pixel 9: 12GB RAM
    Pixel 9 Pro: 16GB RAM
    Samsung Galaxy Z Fold 6: 12GB RAM
    Galaxy Z Fold 5: 12GB RAM
    Galaxy Z Flip 6: 12GB RAM
    Galaxy S24: 8GB RAM
    Galaxy S23: 8GB RAM
    Galaxy A25: 6GB RAM
    Galaxy A15: 6GB RAM
  10. 29 Retrieval-augmented generation (RAG) is a software architecture and technique

    that integrates large language models with external information sources, enhancing the accuracy and reliability of generative AI models by incorporating specific business data like documents, SQL databases, and internal applications. @source: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  11. 35 Large Language Model: Preventing LLM hallucinations

    1. Augment user queries with data retrieved through web searches.
    2. This enables responses to questions about the latest data.
  12. 43 LangChain 🦜🔗 • LangChain is a framework for developing

    applications powered by large language models (LLMs). • Open-source libraries: Build your applications using LangChain's modular building blocks and components. • Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence. • Deployment: Turn any chain into a REST API with LangServe. @source: https://www.langchain.com/
  13. 48 Embedding Model • Deep Learning Model • Sparse Embedding

    a. One-Hot Encoding b. TF-IDF (Term Frequency-Inverse Document Frequency) • Dense Embedding a. Word2Vec b. BERT etc.. c. Text-embedding-3 (OpenAI)
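    As a toy illustration of the sparse vs. dense distinction above (values are illustrative only and not produced by any real model; the dense values echo the example vector on the next slide):

    // Sparse (one-hot): one dimension per vocabulary word, almost all zeros
    val vocab = listOf("what", "is", "flutter", "react", "native")
    val oneHotFlutter = FloatArray(vocab.size) { i -> if (vocab[i] == "flutter") 1f else 0f }
    // => [0.0, 0.0, 1.0, 0.0, 0.0]

    // Dense: a learned, low-dimensional vector where every dimension carries some signal
    // (real models output far more dimensions, e.g. text-embedding-3-large ⇒ 3072)
    val denseFlutter = floatArrayOf(0.88f, 0.76f, 0.34f, 0.23f)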
  14. 49 Vector (Query) Embedding 0.88 0.76 0.34 0.23 0.64 0.01

    0.44 0.66 What’s Flutter? n-dimension ex) text-embedding-3-large ⇒ 3072 dim What’s React Native? 0.15 0.22 0.89 0.12 0.34 0.09 0.55 0.77 Dense Vector Vector Store (DB) What’s Flutter What’s React Native? Dense Embedding
  15. 50 Vector Store

    1. Embedding Storage: The Vector Store stores pre-built document embeddings (embedding vectors). Embeddings transform text into a high-dimensional vector space, where documents with similar meanings are located close to each other in the vector space.
    2. Similarity Search: When a user submits a query, it is also converted into a vector. The Vector Store then finds the embeddings most similar to the query vector, identifying relevant documents. This process involves mathematical calculations such as cosine similarity, Euclidean distance, and other similarity metrics.
    3. Information Provision (Retriever): RAG (Retrieval-Augmented Generation) uses the retrieved similar documents to generate responses. The Vector Store provides highly relevant documents, allowing the model to generate more accurate and contextually relevant responses.
  16. 51 A (1, 4) B (4, 2) Vector Store Retriever

    → Similarity Search → Cosine Similarity
  17. 52 A (1, 4) B (4, 2) e.g. ) 2-dimension

    vector Vector Store Retriever → Similarity Search → Cosine Similarity
  18. 53 A (1, 4) B (4, 2) e.g. ) 2-dimension

    vector Cosine Similarity A - C ⇒ P0 Cosine Similarity A - B ⇒ P1 … Cosine Similarity A - N ⇒ Pn C (-3, -2) ex) user query Vector Store Retriever → Similarity Search → Cosine Similarity Top-K (n) n-chunks
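    A minimal sketch of the cosine-similarity comparison drawn on the slide, using the 2-dimension example vectors A(1, 4), B(4, 2) and C(-3, -2) (toy values; in the real pipeline these are the n-dimensional embeddings from the previous slides):

    import kotlin.math.sqrt

    fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var normA = 0f; var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (sqrt(normA) * sqrt(normB))
    }

    val a = floatArrayOf(1f, 4f)    // stored chunk A
    val b = floatArrayOf(4f, 2f)    // stored chunk B
    val c = floatArrayOf(-3f, -2f)  // user query C
    println(cosineSimilarity(a, b)) // ≈ 0.65
    println(cosineSimilarity(a, c)) // ≈ -0.74, i.e. A and the query point in opposite directions
    // The retriever scores every stored chunk against the query embedding this way and keeps the Top-K.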
  19. 55

  20. 56

  21. 57 LLM ( 1 - Basic Usage : OpenAI )

    @Serializable data class ChatMessage( val role: String, val content: String ) @Serializable data class ChatRequest( val model: String, val messages: List<ChatMessage> ) @Serializable data class ChatResponse( val id: String, val choices: List<Choice>, val usage: Usage ) @Serializable data class Choice( val index: Int, val message: ChatMessage, val finish_reason: String? ) @Serializable data class Usage( val prompt_tokens: Int, val completion_tokens: Int, val total_tokens: Int ) Based on OpenAI API Docs
  22. 58 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }
  23. 59 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } }
  24. 60 LLM ( 1 - Basic Usage : OpenAI )

    object KtorClient { private val client = HttpClient(Android) { install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true isLenient = true prettyPrint = true }) } } suspend fun getChatResponse(apiKey: String, request: ChatRequest): ChatResponse { return client.post("https://api.openai.com/v1/chat/completions") { header(HttpHeaders.ContentType, ContentType.Application.Json) header(HttpHeaders.Authorization, "Bearer $apiKey") setBody(request) }.body() } } To use the API, you need to obtain an API key.
  25. 61 LLM ( 1 - Basic Usage : OpenAI )

    class ChatViewModel : ViewModel() { private val _messages = MutableStateFlow<List<ChatMessage>>(emptyList()) val messages: StateFlow<List<ChatMessage>> get() = _messages private val apiKey = BuildConfig.openApiKey fun sendMessage(content: String) { /// code } }
  26. 62 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }
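    The slide elides how updatedMessages is assembled behind "/// code"; one plausible completion, assuming the system message should only be sent on the first turn (a sketch, not the speaker's exact code):

    // Hypothetical completion of the elided step
    val updatedMessages =
        if (_messages.value.isEmpty()) listOf(systemMessage, newMessage)
        else _messages.value + newMessage
    _messages.value = _messages.value + newMessage // show the user's message in the UI right away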
  27. 63 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } } You can replace this with your own fine-tuned model, e.g. OpenAI o1-mini or o1-preview.
  28. 64 LLM ( 1 - Basic Usage : OpenAI )

    fun sendMessage(content: String) { val systemMessage = ChatMessage(role = "system", content = "You are a helpful assistant.") val newMessage = ChatMessage(role = "user", content = content) /// code val request = ChatRequest( model = "gpt-4o-mini", messages = updatedMessages ) viewModelScope.launch { try { val response = getChatResponse(apiKey, request) val assistantMessage = response.choices.firstOrNull()?.message if (assistantMessage != null) { _messages.value += assistantMessage } } catch (e: Exception) { print(e) } } }
  29. 65 LLM ( 1 - Basic Usage : OpenAI )

    @Composable fun OpenAIChatScreen(viewModel: ChatViewModel = viewModel()) { val messages by viewModel.messages.collectAsState() var inputMessage by remember { mutableStateOf("") } val keyboardController = LocalSoftwareKeyboardController.current /// code .. }
  30. 66 LLM ( 1 - Basic Usage : OpenAI )

    Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { TextField( value = inputMessage, label = { Text("Prompt") }, onValueChange = { inputMessage = it }, modifier = Modifier .weight(1f) .background(Color.White), keyboardActions = KeyboardActions( onDone = { keyboardController?.hide() }) ) // Button.. remember { mutableStateOf("") }
  31. 67 LLM ( 1 - Basic Usage : OpenAI )

    Row( modifier = Modifier .fillMaxWidth() .padding(bottom = 16.dp, top = 24.dp), horizontalArrangement = Arrangement.spacedBy(8.dp) ) { /// TextField ... Button(onClick = { if (inputMessage.isNotBlank()) { viewModel.sendMessage(inputMessage) inputMessage = "" keyboardController?.hide() } }) { Text("Send") } }
  32. 68 LLM ( 1 - Basic Usage : OpenAI )

    LazyColumn(modifier = Modifier.fillMaxSize()) { items(messages) { message -> if (message.role != "system") Row(modifier = Modifier.padding(bottom = 16.dp)) { CircleAvatar(message.role) Surface( modifier = Modifier .padding(horizontal = 8.dp), shape = RoundedCornerShape( bottomStart = 16.dp, topEnd = 16.dp ), color = Color.LightGray.copy(alpha = .2f), ) { MarkdownText( modifier = Modifier.padding(8.dp), markdown = "${message.content}" ) } } } }
  33. 69 LLM ( 1 - Basic Usage : OpenAI )

    @Composable fun CircleAvatar( role: String, modifier: Modifier = Modifier, size: Dp = 40.dp ) { val color = when (role) { "user" -> Color.Blue "assistant" -> Color.Green else -> Color.Gray } Canvas(modifier = modifier.size(size)) { val diameter = size.toPx() drawCircle( color = color, radius = diameter / 2 ) } }
  34. 70 LLM ( 1 - Basic Usage : Gemini )

    dependencies { // add the dependency for the Google AI client SDK for Android implementation("com.google.ai.client.generativeai:generativeai:0.9.0") }
  35. 71 LLM ( 1 - Basic Usage : Gemini )

    class GeminiViewModel : ViewModel() { private val _uiState: MutableStateFlow<UiState> = MutableStateFlow(UiState.Initial) val uiState: StateFlow<UiState> = _uiState.asStateFlow() private val generativeModel = GenerativeModel( modelName = "gemini-1.5-flash", apiKey = BuildConfig.apiKey ) } Replace it with the model you want to use.
  36. 72 LLM ( 1 - Basic Usage : Gemini )

    viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } }
  37. 73 LLM ( 1 - Basic Usage : Gemini )

    viewModelScope.launch(Dispatchers.IO) { try { val response = generativeModel.generateContent( content { if (bitmap != null) { image(bitmap) } text(prompt) } ) response.text?.let { outputContent -> _uiState.value = UiState.Success(outputContent) } } catch (e: Exception) { _uiState.value = UiState.Error(e.localizedMessage ?: "") } } ( multimodal )
  38. 76

  39. 79 Improvement (Optimization) Basic RAG Pipeline → Advanced RAG Pipeline

    • Splitter, Chunk & Overlap Size • Multi Query Retriever • Ensemble Retriever ( Dense & Sparse ) • Long-Context Reorder & Re-Ranking • Context Compressor • Prompt Optimization @source https://arxiv.org/pdf/2307.03172
  40. 80 @source: Star Wars: Episode II - Attack of the

    Clones - recreated by author using imgflip
  41. 81 And more… • Infrastructure ◦ MLOps ◦ LLMOps ◦

    RAGOps • Monitoring ◦ Cost ◦ Usage (Log) • Agent ◦ But..
  42. 86 Data Acquisition and Processing Strategy 1. Call the API

    to obtain session timetable information for DroidKaigi 2024. 2. Join speaker and room information to create a single dataset. 3. Filter only the 'en' data from the title and i18n field. 4. For sessions with n speakers, consolidate them into a single cell. 5. Create a dataset that describes each session's information in one paragraph. 6. Since the data is insufficient, let's try to obtain data from before 2024. 7. Convert text file to PDF file.
  43. 87 Datasets import requests import pandas as pd from pprint

    import pprint # API endpoint URL url = "https://xxxx-xxx.droidkaigi.jp/events/droidkaigi2024/timetable" # API call response = requests.get(url) data = '' # Check response status code if response.status_code == 200: # Convert JSON response to Python dictionary data = response.json() pprint(data) df = pd.DataFrame(data["sessions"]) ( 1 - dataset.py )
  44. 89 Datasets df_merged = pd.merge(df_sessions, df_room, left_on='roomId', right_on='id', how='left') merged_sessions_with_speakers

    = speakers_exploded.merge( df_speaker, left_on='speakers', right_on='id', suffixes=('', '_speaker')) final_merged = merged_sessions_with_speakers.merge( df_room, left_on='roomId', right_on='id', suffixes=('', '_room')) ( 2 - dataset.py ) Merged on each id
  45. 95 RAG Pipeline using LangChain4j dependencies { implementation ("dev.langchain4j:langchain4j:0.33.0") implementation

    ("dev.langchain4j:langchain4j-open-ai:0.33.0") implementation ("dev.langchain4j:langchain4j-ollama:0.33.0") implementation ("dev.langchain4j:langchain4j-chroma:0.33.0") }
  46. 96 ( Document Splitter ) coroutineScope.launch(Dispatchers.IO) { val splitter =

    DocumentSplitters.recursive( 128, 16, OpenAiTokenizer("gpt-4o") ) textSegments = splitter.split(document) } The chunk size needs to be calibrated according to the project or app. (+overlap) RAG Pipeline using LangChain4j
  47. 97 ( Embedding - Ollama ) coroutineScope.launch(Dispatchers.IO) { val embeddingModel:

    EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build() val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } RAG Pipeline using LangChain4j
  48. coroutineScope.launch(Dispatchers.IO) { val embeddingModel: EmbeddingModel = OllamaEmbeddingModel.builder() .baseUrl("http://10.0.2.2:11434") .modelName("mxbai-embed-large") .build()

    val responseData: Response<List<Embedding>> = withContext(Dispatchers.IO) { embeddingModel.embedAll(textSegments) } } 98 Download & Setup Your model ( Embedding - Ollama ) RAG Pipeline using LangChain4j
  49. 99 ( Vector Store - ChromaDB ) val embeddingStore =

    ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true) .build() embeddingStore?.addAll(it?.content(), textSegments) RAG Pipeline using LangChain4j
  50. 100 val embeddingStore = ChromaEmbeddingStore .builder() .baseUrl("http://10.0.2.2:8000") .collectionName("droidkaigi") .logRequests(true) .logResponses(true)

    .build() embeddingStore?.addAll(it?.content(), textSegments) ( Vector Store - ChromaDB ) RAG Pipeline using LangChain4j
  51. 101 ( Retriever ) val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val

    maxResults = 5 val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) RAG Pipeline using LangChain4j
  52. 102 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5

    val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j
  53. 103 val questionEmbedding: Embedding = embeddingModel.embed(userMessage).content() val maxResults = 5

    val minScore = 0.6 val embeddingSearchRequest = EmbeddingSearchRequest.builder() .queryEmbedding(questionEmbedding) .maxResults(maxResults) .minScore(minScore) .build() val relevantEmbeddings = embeddingStore?.search(embeddingSearchRequest) ( Retriever ) RAG Pipeline using LangChain4j
  54. 104 ( Prompt ) var prompt = """ You are

    an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: $question Context: $context Answer: """.trimIndent() val context: String = relevantEmbeddings?.matches() ?.joinToString(separator = "\\n\\n") { match -> match.embedded().text() } ?: "??" Prompt Engineering RAG Pipeline using LangChain4j
  55. 105 class OpenAIViewModel : ViewModel() { private val openAiModel =

    OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  56. 106 class OpenAIViewModel : ViewModel() { private val openAiModel =

    OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } You can replace this with your own fine-tuned model. ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  57. 107 HumanMessage AIMessage SystemMessage class OpenAIViewModel : ViewModel() { private

    val openAiModel = OpenAiChatModel.builder() .apiKey(BuildConfig.openApiKey) .modelName(OpenAiChatModelName.GPT_4_O_MINI) .build() var conversation = mutableStateListOf<Pair<UserMessage, AiMessage>>() fun sendMessage(prompt: String) { /// code } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  58. 108 class OpenAIViewModel : ViewModel() { /// Code fun sendMessage(prompt:

    String) { val userMsg = /// code... viewModelScope.launch(Dispatchers.IO) { val aiResponse: AiMessage = if (conversation.isEmpty()) { openAiModel.generate(userMsg).content() } else { val previousMessages = conversation.flatMap { listOf(it.first, it.second) } openAiModel.generate(*previousMessages.toTypedArray(), userMsg).content() } conversation.add(userMsg to aiResponse) } } } ( Generation : LLM - OpenAI ) RAG Pipeline using LangChain4j
  59. 109 val model: StreamingChatLanguageModel = OpenAiStreamingChatModel .builder() .temperature(.1) .apiKey(BuildConfig.openApiKey) .build()

    model.generate(prompt, object : StreamingResponseHandler<AiMessage?> { override fun onNext(token: String) { println("onNext: $token") responseText += token } override fun onComplete(response: Response<AiMessage?>) { println("onComplete: $response") } override fun onError(error: Throwable) { error.printStackTrace() } }) ( Generation : LLM - OpenAI → Streaming) OllamaStreamingChatModel Update UI State RAG Pipeline using LangChain4j
  60. 111 Result: RAG Pipeline using LangChain4j

    Feature | On-Device | Remote
    Text Splitter | ✅ | -
    Sentence Embedding | - | ✅
    Vector Store (DB) | - | ✅
    LLM | - | ✅
  61. 112 Pros vs Cons - Langchain4j

    Pros
    1. It can be implemented quickly without implementing a service client.
    2. It is easy to implement the RAG pipeline.
    Cons
    1. Personally, I found the learning curve to be higher than that of the Langchain Python library.
    2. There is a need for improvements in the official documentation.
    3. Not all existing Langchain features are compatible.
  62. 113 Limitation - LangChain4j 1. There is currently an error

    occurring when running the embedding model on mobile devices (Android). a. https://github.com/langchain4j/langchain4j/issues/776 b. https://github.com/langchain4j/langchain4j/issues/1093 c. https://github.com/langchain4j/langchain4j/issues/1202 2. On-device LLM cannot be run (Android) 3. There are limitations in implementing detailed features.
  63. 116 Implement Mission

    How to run an embedding model on a mobile device?
    How to store embedded vector data in a vector database?
    How to run an LLM model on a mobile device?
  64. How can I store vector data in a vector store

    and implement a retriever? 118
  65. 120 On-Device RAG ( Vector Database ) @Entity data class

    Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )
  66. 121 On-Device RAG ( Vector Database ) fun getSimilarChunks(queryEmbedding: FloatArray,

    n: Int = 5): List<Pair<Float, Chunk>> { return chunksBox .query(Chunk_.chunkEmbedding.nearestNeighbors(queryEmbedding, 10)) .build() .findWithScores() .map { Pair(it.score.toFloat(), it.get()) } .take(n) }
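    The chunksBox used above is not shown on the slide; a minimal ObjectBox setup might look like the following (MyObjectBox is generated by the ObjectBox Gradle plugin from the Chunk entity; the function names here are assumptions, not the speaker's code):

    import io.objectbox.Box
    import io.objectbox.BoxStore

    lateinit var store: BoxStore
    lateinit var chunksBox: Box<Chunk>

    fun initObjectBox(context: android.content.Context) {
        // Build the BoxStore once (e.g. in Application.onCreate) and reuse it
        store = MyObjectBox.builder()
            .androidContext(context.applicationContext)
            .build()
        chunksBox = store.boxFor(Chunk::class.java)
    }

    // Index a chunk together with its embedding so nearestNeighbors() can find it later
    fun addChunk(chunk: Chunk): Long = chunksBox.put(chunk)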
  67. 123

  68. 124 MediaPipe (access date: 2024.08)

    Type | Model | Note
    Text embedding | Universal Sentence Encoder | 100 dim
    Generative AI | Gemma 1.1 2B | CPU, GPU
    Generative AI | Gemma 1.1 7B | -
    Generative AI | Falcon 1B | CPU, GPU
    Generative AI | StableLM 3B | CPU, GPU
    Generative AI | Phi-2 | CPU, GPU
  69. 125 MediaPipe - Limitations • Text Embedding Dimension ◦ Universal

    Sentence Encoder : Dimension → 100 • The models supported for conversion are currently limited. • Current Supported Model (2024.09) {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}. These are the reasons for switching to ONNX / ONNX Runtime.
  70. 130 Local Embedding (on-device) suspend fun encode( sentence: String ):

    FloatArray = withContext(Dispatchers.IO) { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() }
  71. 131 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO)

    { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Model Tokenizer Local Embedding (on-device)
  72. 132 suspend fun encode( sentence: String ): FloatArray = withContext(Dispatchers.IO)

    { val result = tokenizer.tokenize(sentence) val inputTensorMap = mutableMapOf<String,OnnxTensor>() /// codes(setup inputTensorMap)... val outputs = ortSession.run(inputTensorMap) val embeddingTensor = outputs.get(outputTensorName).get() as OnnxTensor return@withContext embeddingTensor.floatBuffer.array() } Local Embedding (on-device)
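    The tokenizer and ortSession used by encode() are created elsewhere; a minimal ONNX Runtime session setup might look like this (the asset file name is an assumption; the deck later lists all-minilm-l6-v2 / bge-small-en-v1.5 as the on-device embedding models):

    import ai.onnxruntime.OrtEnvironment
    import ai.onnxruntime.OrtSession

    fun createEmbeddingSession(context: android.content.Context): OrtSession {
        val env = OrtEnvironment.getEnvironment()
        // Hypothetical asset name for the exported embedding model
        val modelBytes = context.assets.open("all-minilm-l6-v2.onnx").readBytes()
        return env.createSession(modelBytes, OrtSession.SessionOptions())
    }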
  73. 135 Local(Offline) LLM : on-device (xxx.cpp) #include <iostream> #include <fstream>

    #include <jni.h> #include <android/asset_manager_jni.h> #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h"
  74. 136 (xxx.cpp) #include <iostream> #include <fstream> #include <jni.h> #include <android/asset_manager_jni.h>

    #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" MNN based Tokenizer (mnn-llm) Local(Offline) LLM : on-device
  75. 137 (xxx.cpp) #include <iostream> #include <fstream> #include <jni.h> #include <android/asset_manager_jni.h>

    #include "tokenizer.hpp" #include "onnxruntime_cxx_api.h" #include "onnxruntime_float16.h" #include "nnapi_provider_factory.h" onnxruntime headers Local(Offline) LLM : on-device
  76. 138 (JNI ) class LocalExternAPI { // Declare native methods

    external fun preProcess(): Boolean external fun loadModels( assetManager: AssetManager, useGPU: Boolean, fp16: Boolean, useNNAPI: Boolean, useXNNPACK: Boolean, useQNN: Boolean, useDSPNPU: Boolean ): Boolean external fun runLLM( query: String, addPrompt: Boolean, clear: Boolean ): String companion object { init { System.loadLibrary("myapplication") } } } extern "C" JNIEXPORT jboolean JNICALL Java_com_example_myapplication_LocalExte rnAPI_loadModels( JNIEnv *env, jobject clazz, jobject asset_manager, jboolean use_gpu, jboolean use_fp16, jboolean use_nnapi, jboolean use_xnnpack, jboolean use_qnn, jboolean use_dsp_npu) Local(Offline) LLM : on-device
  77. 139 Local(Offline) LLM fun startLLM(input: String, prompt: String) { getSimilarChunks(input,

    n = 3).forEach { retrievedContextList.add( RetrievedContext( it.second.docFileName, it.second.chunkData, it.first ) ) } val sortedList = retrievedContextList.sortedByDescending { it.score } sortedList.forEach { jointContext += " " + it.context } /// code.. } Retriever From Local Database with Similarity Search Sorted By Score For improve generation ( RAG pipeline : Retriever )
  78. 140 Local(Offline) LLM fun startLLM(input: String, prompt: String) { val

    inputPrompt = prompt.replace("\$CONTEXT", jointContext).replace("\$QUESTION", input) You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Question: {question} Context: {context} Answer: Prompt Engineering Write your own prompt. ( RAG pipeline : prompt setup )
  79. 141 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value

    = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } ( RAG pipeline : Generation )
  80. 142 Local(Offline) LLM viewModelScope.launch(Dispatchers.IO) { var chatting = true llmResp.value

    = localExternAPI.runLLM(inputPrompt, true, clear_flag) val startTime = System.currentTimeMillis() while (chatting) { when (llmResp.value) { "END" -> { val tokensPerSecond = (1000f * responseCount / (System.currentTimeMillis() - startTime)) chatting = false } else -> { /// add message } } llmResp.value = localExternAPI.runLLM(inputPrompt, false, clear_flag) } } Count tokens/s Update UI State ( RAG pipeline : Generation )
  81. 143 Final Result: Migrate all features from remote to on-device.

    Feature | On-Device | Remote | Note
    Text Splitter | ✅ | - | Custom Splitter
    Sentence Embedding | ✅ | - | all-minilm-l6-v2, bge-small-en-v1.5
    Vector Store (DB) | ✅ | - | ObjectBox
    LLM | ✅ | - | Gemma2, Qwen2-1.5B, Phi3.5-mini
  82. • The LLM model size is large (about 1GB~3GB). •

    User experience ◦ initial loading speed and download time • It is affected by hardware performance ◦ tokens/sec (Recommend over 10 tokens/s) ◦ Test device (Galaxy A15: avg 4.5 tokens/s) • Context window size 144 Limitations : On-device RAG
  83. 1. An LLM (Large Language Model) is an artificial intelligence

    model trained on large datasets to perform natural language processing tasks. 2. The basic RAG (Retrieval-Augmented Generation) pipeline. 3. The basic usage of LLMs in mobile apps and how to build a RAG pipeline using Langchain4j. 4. Implementing on-device RAG (Android). 148 Summary
  84. 155

  85. 163 1. Knowledge Expansion: Large language models rely on pre-trained

    data. RAG can retrieve relevant information from external databases or documents, allowing the model to provide information it might not inherently know. 2. Improved Accuracy: RAG generates responses by leveraging external knowledge, providing more accurate and up-to-date information than a simple generative model. This is particularly useful when dealing with the latest information or detailed content in specific fields. 3. Efficiency: RAG retrieves and utilizes necessary information in real-time, eliminating the need for large-scale models with massive parameters. This saves memory and computational resources while maintaining high performance. 4. Versatility: RAG can be applied across various domains. It is particularly effective in tasks that require specialized knowledge, such as finance, customer support, legal advice, and technical document generation. 5. Enhanced Understanding: By utilizing retrieved documents to generate responses, RAG helps the model better understand the context and provide more relevant and accurate answers.
  86. 165 LLM ( 1 - Basic Usage : ollama )

    https://github.com/ollama/ollama/blob/main/docs/api.md data class GenerateRequest( val model: String, val prompt: String, val stream: Boolean ) data class GenerateResponse( val model: String, val created_at: String, val response: String, val done: Boolean, val context: List<Int>, val total_duration: Long, val load_duration: Long, val prompt_eval_count: Int, val prompt_eval_duration: Long, val eval_count: Int, val eval_duration: Long )
  87. 166 LLM ( 1 - Basic Usage : ollama )

    https://github.com/ollama/ollama/blob/main/docs/api.md // Define Retrofit API interface interface OllamaApiService { @POST("api/generate") suspend fun generate(@Body request: GenerateRequest): GenerateResponse }
  88. 167 LLM ( 1 - Basic Usage : ollama )

    fun createOllamaApiService(): OllamaApiService { val okHttpClient = OkHttpClient().newBuilder() .connectTimeout(30, TimeUnit.SECONDS) .readTimeout(30, TimeUnit.SECONDS) .writeTimeout(30, TimeUnit.SECONDS) .build() val retrofit = Retrofit.Builder() .baseUrl("http://10.0.2.2:11434/") .client(okHttpClient) .addConverterFactory(GsonConverterFactory.create()) .build() return retrofit.create(OllamaApiService::class.java) } Change to your ollama port Default 11434
  89. 168 LLM ( 1 - Basic Usage : ollama )

    class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } } Replace your model name
  90. 169 LLM ( 1 - Basic Usage : ollama )

    class OllamaViewModel : ViewModel() { private val apiService = createOllamaApiService() var apiResponse by mutableStateOf<GenerateResponse?>(null) var errorMessage by mutableStateOf<String?>(null) fun fetchApiResponse(prompt: String) { viewModelScope.launch { try { val request = GenerateRequest( model = "llama3", prompt = prompt, stream = false ) val response = withContext(Dispatchers.IO) { apiService.generate(request) } apiResponse = response } catch (e: Exception) { errorMessage = "Error: ${e.message}" } } }
  91. 170 Datasets ( 4 ) df_aggregated = final_merged.groupby( ['title', 'i18nDesc',

    'startsAt', 'endsAt', 'roomName', 'lengthInMinutes', 'language', targetColumnName], as_index=False).agg( {'fullName': lambda x: combine_speakers(list(x))})
  92. 172 Datasets final_merged['title'] = final_merged['title'].apply( lambda x: x.get('en') if isinstance(x,

    dict) else x) final_merged['i18nDesc'] = final_merged['i18nDesc'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) final_merged['i18nTargetAudience'] = final_merged['i18nTargetAudience'].apply( lambda x: x.get('en') if isinstance(x, dict) else x) ( 3 - dataset.py ) Extract English data
  93. 174 Data Acquisition and Processing Strategy 6. Since the data

    is insufficient, let's try to obtain data from before 2024. DroidKaigi 2021 ~ 2024 The API does not provide data from before 2021. 7. Among the fields of the 2024 data, the key for targetAudience is 'i18nTargetAudience,' while the key for data from before 2024 is 'targetAudience.' 8. Convert text file to PDF file.
  94. 175 Datasets ( 4 - dataset.py ) for index, row

    in df_aggregated.iterrows(): i18nDesc = row['i18nDesc'].replace("\r\n\r\n", "").replace("\r\n", "") i18nDesc = i18nDesc.replace("\n\n"," ") i18nDesc = i18nDesc.replace("\n"," ") i18nDesc = i18nDesc.strip() description = ( f"At droidkaigi2024, {row['fullName']} will be presenting the session titled '{row['title']}', " f"which is {i18nDesc}. The session will start at {row['startsAt']} and " f"end at {row['endsAt']}, taking place in {row['roomName']}. " f"It will last for {row['lengthInMinutes']} minutes and will be conducted in {row['language']}. " f"This session is aimed at {row[targetColumnName].replace("\r\n\r\n", "").replace("\r\n", "").replace("\n\n"," ").replace("\n"," ")}.\n\n\n" ) session_details.append(description)
  95. 177 On-Device RAG ( Vector Database ) @Entity data class

    Chunk( @Id var chunkId: Long = 0, @Index var docId: Long = 0, @HnswIndex(dimensions = 384) var chunkEmbedding: FloatArray = floatArrayOf(), var docFileName: String = "", var chunkData: String = "", var metadata: String = "", )
  96. 178 Datasets ( 6, 7 ) year = ['2021', '2022',

    '2023', '2024'] for i in year: url = f"https://xxxx-xxx.droidkaigi.jp/events/droidkaigi{i}/timetable" # API Call response = requests.get(url) # code... targetColumnName = "targetAudience" if i == '2024': targetColumnName = "i18nTargetAudience" if i == '2024': final_merged[targetColumnName] = final_merged[targetColumnName].apply( lambda x: x.get('en') if isinstance(x, dict) else x)
  97. 179 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)
  98. 180 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) eg. /data/local/tmp/llm/model.bin
  99. 181 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt) set Options
  100. 182 MediaPipe (Generative AI Task - LLM) val partialResults: SharedFlow<Pair<String,

    Boolean>> = _partialResults.asSharedFlow() val options = LlmInference.LlmInferenceOptions.builder() .setModelPath(MODEL_PATH) .setMaxTokens(1024) .setResultListener { partialResult, done -> _partialResults.tryEmit(partialResult to done) } .build() llmInference = LlmInference.createFromOptions(context, options) llmInference.generateResponseAsync(prompt)
  101. 183 MediaPipe (Generative AI Task - LLM) inferenceModel.partialResults .collectIndexed {

    index, (partialResult, done) -> currentMessageId?.let { if (index == 0) { /// append } else { /// append } if (done) { currentMessageId = null // Re-enable text input setInputEnabled(true) } } }