
Introduction to Gemma and RAG Architecture

Aye Hninn Khine

July 28, 2024

Transcript

  1. Introduction to Gemma and RAG Architecture. Aye Hninn Khine, Machine Learning Researcher, ML GDE. Google I/O Extended Yangon 2024, GDG Yangon.
  2. Intro to Gemma: the official Train the Trainer deck! Gus Martins (he/him), AI Developer Relations Engineer. Ju-yeong Ji (he/him), AI Developer Relations Engineer.
  3. What is an LLM? [Diagram: next-token prediction. Given the prompt "It's raining cats and …", the model assigns a probability to each candidate next token (e.g. Dogs, Rain, Drops, Fish, Wind), with values such as 0.9, 0.03, 0.02, 0.01, 0.0.]
  4. User: Explain this joke: "I used to be addicted to soap, but I'm clean now."
      Model: The joke is a pun on the word "clean". It has two meanings: to be free of dirt or other unwanted substances, and to be free of addiction. The joke is funny because it is unexpected. The listener expects the speaker to be talking about the first meaning of "clean", but they are actually talking about the second meaning. This creates a surprise and a laugh.
  5. User: Give me five neat ideas for science fair projects.
      Model:
      1. **Build a volcano.** This is a classic science fair project that is both fun and educational. You can learn about the science of volcanoes, earthquakes, and plate tectonics.
      2. **Make a solar oven.** This is a great project to learn about renewable energy and how to harness the power of the sun. You can use your solar oven to cook food, dry fruit, or even melt chocolate.
      3. **Create a 3D printer.** This is a more challenging project, but it is also very rewarding. You can learn about 3D printing technology and design your own custom objects.
      4. **Build a robot.** This is a great project for those who are interested in engineering and robotics. You can learn about basic electronics, mechanics, and programming.
      5. **Design a water filtration system.** This is a project that can help you learn about the importance of clean water and how to filter it. You can use your water filtration system to clean water from a stream or river.
  6. Why are large language models different? Use cases: search, conversation, content generation.
      • LLMs are characterized by emergent abilities, or the ability to perform tasks that were not present in smaller models.
      • LLMs' contextual understanding of human language changes how we interact with data and intelligent systems.
      • LLMs can find patterns and connections in massive, disparate data corpora.
  7. Google is the industry pioneer in AI. Responsible AI: 3,000 researchers, 7,000 publications. Built and tested for safety, privacy in design, upholds high scientific standards, accountable to people, socially beneficial, avoids creating unfair bias.
      Timeline:
      • 2015: Google DeepMind AlphaGo defeats Go champion
      • 2016: Google's DeepMind helps detect eye disease
      • 2017: Google invents the Transformer, kickstarting the LLM revolution
      • 2018: Google's groundbreaking large language model, BERT
      • 2019: Text-to-Text Transfer Transformer, a 10B-parameter LLM, open sourced
      • 2020: Google LaMDA model, trained to converse
      • 2022: AlphaFold predicts structures of all known proteins
      • 2023: A conversational AI service powered by PaLM 2
      • 2024: Family of multimodal LLMs & products
  8. Background: a history of disruptive innovations in AI grounded in open development (2012-2024).
      DistBelief: first distributed system for training neural nets · First DL model in production (deep neural nets for acoustic modeling) · "Cats paper": first network trained in a distributed way · Word2Vec: proof of viability of word embeddings to understand content · Seq2Seq: foundational framework leading to Transformers and BERT · GoogLeNet ("Going Deeper with Convolutions") · Brain's first live DL experiments in Ads · Distillation: compression technique reducing the size of complex ML models · RankBrain first launches · TensorFlow released · TPUs announced · JAX · AutoML: success in learning to learn · Transformers: rethinking seq-to-seq · BERT in Search · AlphaFold · LaMDA 2, PaLM, AI Test Kitchen · Imagen, Parti, Imagen-Video, Phenaki · Bard, Gemini, Search Labs, and more.
  9. Gemma open models: a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
  10. Key Gemma goals:
      • Democratizing innovation: build together, grow together
      • Accelerating model development: already 500+ variants
      • Fostering trust and responsible AI: safety by design + tools
      • Enabling continuous learning: events, community contributions
      • Shaping the AI future: partner with NVIDIA, Hugging Face, and you!
  11. How do I get access to Gemma? Gemma models are available across Kaggle, Vertex AI, and Hugging Face; choose the model hub that makes the most sense for you. Just request access on the model hub of your choice, agree to the Gemma Terms of Use, and you'll be ready to get started!
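
      As a hedged illustration (not on the slide): if you pick the Hugging Face hub, loading Gemma with the transformers library looks roughly like the sketch below, assuming you have accepted the Gemma license on the Hub and authenticated with `huggingface-cli login`.

      from transformers import AutoTokenizer, AutoModelForCausalLM

      # Load the tokenizer and base 2B model from the Hugging Face Hub.
      tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
      model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

      # Generate a short continuation for a prompt.
      inputs = tokenizer("It was a dark and stormy night.", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=32)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))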
  12. How do I get ready to use Gemma? First, install KerasNLP with Keras 3 into your environment, then import the keras_nlp module. Then, load Gemma from a preset configuration! Presets are available for the 2B and 7B parameter versions of Gemma.

      !pip install --upgrade keras-nlp
      !pip install --upgrade keras

      import keras_nlp

      # https://keras.io/api/keras_nlp/models/gemma/gemma_causal_lm/
      g_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
  13. Using Gemma. The generate() function is how we interact with the model. For instance, if we wanted to ask Gemma what comes after "it was a dark and stormy night"...

      txt = g_lm.generate("It was a dark and stormy night.", max_length=64)
      print(txt)

      Gemma will give us: "It was a dark and stormy night. The rain was pouring down, and the wind was howling. But that didn't stop a group of friends from going out for a night of fun. They were all dressed up in their best clothes, and they were ready to have a good time."
  14. Choosing the right Gemma. Gemma offers 2B and 7B parameter sizes (larger = more powerful), with the option of instruction-tuned models for specific tasks. Your use cases and target hardware will help shape your choice of Gemma model. For more information, see https://ai.google.dev/gemma
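
      A small sketch of that choice (illustrative only, not from the deck): for a chat-style task on hardware with enough accelerator memory, you might pick the larger instruction-tuned preset, and prompt it with the turn markers the instruction-tuned Gemma variants expect.

      import keras_nlp

      # Load the 7B instruction-tuned preset instead of the base 2B one.
      chat_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_7b_en")

      # Instruction-tuned Gemma uses explicit user/model turn markers.
      prompt = "<start_of_turn>user\nSummarize RAG in one sentence.<end_of_turn>\n<start_of_turn>model\n"
      print(chat_lm.generate(prompt, max_length=128))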
  15. CodeGemma: an open model that can perform a variety of coding tasks.
      • Intelligent code completion and generation
      • Enhanced accuracy
      • Multi-language proficiency: Python, JavaScript, Java, Kotlin, C++, C#, Rust, Go, and other languages
      • Streamlined workflows
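
      As a hedged sketch of code completion (not on the slide): CodeGemma supports fill-in-the-middle prompts built from its special tokens, assuming your KerasNLP version ships a CodeGemma preset such as "code_gemma_2b_en".

      import keras_nlp

      # Load a CodeGemma preset (preset name is an assumption; check your KerasNLP version).
      code_lm = keras_nlp.models.GemmaCausalLM.from_preset("code_gemma_2b_en")

      # Fill-in-the-middle: the model generates the code between the prefix and the suffix.
      fim_prompt = (
          "<|fim_prefix|>def fibonacci(n):\n    <|fim_suffix|>\n    return a<|fim_middle|>"
      )
      print(code_lm.generate(fim_prompt, max_length=128))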
  16. RecurrentGemma: an open model based on Griffin. [Chart: throughput when sampling from a 2k-token prompt]
      • Reduced memory usage
      • Higher throughput
      • High performance
  17. Kaggle: Your Gateway to AI Innovation (https://www.kaggle.com/models/google/gemma)
      • Always have the latest Gemma resources and information.
      • Easily launch into working with Gemma in notebooks.
      • Explore how the community uses Gemma and connect with fellow data enthusiasts.
      • Compete for cash prizes, and dive into past competitions for cutting-edge techniques.
      • Be part of shaping the future of AI alongside Google and the vibrant Kaggle community.
  18. Why Keras?
      • Simple: simplifies development, empowering focused problem-solving.
      • Flexible: offers scalable complexity for evolving needs.
      • Powerful: industry-strength performance and scalability.
  19. Fine-tune Gemma models in Keras using LoRA

      # Enable LoRA for the model and set the LoRA rank to 4.
      gemma_lm.backbone.enable_lora(rank=4)

      # Use AdamW (a common optimizer for transformer models).
      optimizer = keras.optimizers.AdamW(
          learning_rate=5e-5,
          weight_decay=0.01,
      )
      # Exclude layernorm and bias terms from decay.
      optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

      gemma_lm.compile(
          loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          optimizer=optimizer,
          weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
      )
      gemma_lm.fit(data, epochs=1, batch_size=1)
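
      The `data` object passed to fit() above is not shown on the slide. As an illustration only, it could be a small list of prompt/response strings built with a simple template (the template and examples below are assumptions, not part of the deck).

      # Hypothetical instruction/response formatting for the fine-tuning data.
      template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
      data = [
          template.format(
              instruction="What should I do on a trip to Europe?",
              response="Plan your itinerary around a few cities you really want to see.",
          ),
          template.format(
              instruction="Explain photosynthesis in a way a child could understand.",
              response="Plants use sunlight, water, and air to make their own food and release oxygen.",
          ),
      ]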
  20. Save / load pre-trained LoRA

      # Save LoRA weights after gemma_lm.fit(); the filename must end in ".lora.h5"
      gemma_lm.backbone.save_lora_weights("my_model.lora.h5")

      # Later, load the base model again
      gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")

      # Use the same LoRA rank that you trained with
      gemma_lm.backbone.enable_lora(rank=4)

      # Load the pre-trained LoRA weights
      gemma_lm.backbone.load_lora_weights("my_model.lora.h5")

      gemma_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=3, temperature=0.7))
  21. Alternatively: distributed tuning. Training a single model across multiple devices (GPUs or TPUs) for faster results. Why?
      • Large models: complex models with many parameters require significant time to train on a single device.
      • Massive datasets: distributed training accelerates the process of working through extremely large datasets.
      • Time constraints: distributing training can drastically reduce the time needed to reach optimal model performance.
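
      A small sketch that is not on the slide, assuming Keras 3 on a host with 8 accelerators: list the local devices that the DeviceMesh on the next slide refers to as `devices`.

      import keras

      # Enumerate the accelerators visible to Keras on this host.
      devices = keras.distribution.list_devices()
      print(devices)  # e.g. ['tpu:0', 'tpu:1', ..., 'tpu:7']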
  22. Distributed tuning: defining the structure of our distributed hardware

      # Define a 2x4 device mesh with data and model parallel axes
      device_mesh = keras.distribution.DeviceMesh(
          shape=(2, 4),
          axis_names=["data", "model"],
          devices=devices,
      )
  23. Distributed tuning (cont'd): specifying how the weights and tensors should be distributed

      model_dim = "model"
      layout_map = keras.distribution.LayoutMap(device_mesh)

      # Weights that match 'token_embedding/embeddings' will be sharded across the 8 TPUs
      layout_map["token_embedding/embeddings"] = (model_dim, None)

      # Regex to match against the query, key and value matrices in the decoder
      # attention layers
      layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
          model_dim, None, None)
      layout_map["decoder_block.*attention_output.*kernel"] = (
          model_dim, None, None)
      layout_map["decoder_block.*ffw_gating.*kernel"] = (None, model_dim)
      layout_map["decoder_block.*ffw_linear.*kernel"] = (model_dim, None)
  24. Distributed tuning (cont'd): setting up the distribution

      model_parallel = keras.distribution.ModelParallel(
          device_mesh, layout_map, batch_dim_name="batch")
      keras.distribution.set_distribution(model_parallel)

      After this, the generate() and fit() methods will work just like non-distributed code.
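
      A short sketch of what that means in practice (assuming the presets, imports, and `data` from the earlier slides): once the distribution is set, the model loaded afterwards is sharded across the mesh automatically, and the calling code is unchanged.

      import keras_nlp

      # With the ModelParallel distribution active, loading and using the model
      # looks exactly like the single-device code from earlier slides.
      gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
      print(gemma_lm.generate("Best comics villains", max_length=64))
      gemma_lm.fit(data, epochs=1, batch_size=1)  # `data` as prepared earlier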
  25. TensorFlow Lite: a framework for deploying ML on web, mobile, and embedded systems. [Diagram: on the host (conversion), a TensorFlow graph built with the TensorFlow/Keras APIs goes through the TensorFlow Lite Converter's optimization and conversion passes to produce a TensorFlow Lite model (flatbuffer); on the device (inference), that model runs on the TensorFlow Lite runtime via the C/C++, Java, Python, Swift, or Obj-C APIs, using CPU kernels or hardware accelerator delegates such as XNNPACK, GPU, DSP, TPU, and Core ML.]
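
      A minimal sketch of the host-side "Converter" step in that diagram, using a tiny stand-in Keras model (converting Gemma itself for on-device use goes through the MediaPipe / AI Edge tooling rather than this two-step converter; this only illustrates the general TFLite flow).

      import tensorflow as tf

      # A tiny placeholder model, standing in for whatever Keras model you want to deploy.
      model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

      # Host (conversion): turn the Keras model into a TFLite flatbuffer.
      converter = tf.lite.TFLiteConverter.from_keras_model(model)
      tflite_model = converter.convert()

      # The .tflite file is what ships to the device for inference.
      with open("model.tflite", "wb") as f:
          f.write(tflite_model)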
  26. What can you do with Gemma on-device? Without cloud bills, without privacy concerns, without connectivity.
      • Text generation (smart replies, email compose, etc.)
      • Text rewrite
      • Summarization
      • Document / corpus Q&A
      • Text classification (toxicity, sentiment, etc.)
  27. Code samples
      • Android: https://bit.ly/odml-android
      • iOS: https://bit.ly/odml-ios
      • Web: https://bit.ly/odml-web

      Clone the git repository:
      git clone https://github.com/googlesamples/mediapipe
  28. Gemma 2B running on Android & iOS. (Video sped up for demo purposes. May display inaccurate or offensive information that doesn't represent Google's views.)
  29. What is gemma.cpp? gemma.cpp is a lightweight, standalone C++ inference engine for Google's Gemma series of foundation models. It prioritizes simplicity and portability and, being built with C++, can be easily integrated into other native projects.
  30. Who is this project for?
      • AI researchers: because gemma.cpp is a C++ implementation instead of a black box, researchers can easily experiment with direct code modifications.
      • C++ developers with LLM needs: developers building or maintaining existing native apps that want to add LLM capabilities.
      • Anyone who thinks it's cool! (Maybe that's you.)
  31. How to use gemma.cpp? System requirements:
      • CMake
      • Clang C++ compiler, supporting at least C++17
  32. How to use gemma.cpp? Clone the git repo (or download it):

      $ git clone https://github.com/google/gemma.cpp.git
  33. How to use gemma.cpp? Extract the files:

      $ tar -xf archive.tar.gz

      This can take a few minutes and should produce files containing model weights such as "2b-it-sfp.sbs" and a tokenizer file, "tokenizer.model". You may want to move these files to a convenient directory (e.g. the gemma.cpp/build/ directory).
  34. How to use gemma.cpp? Build. Generate the build files using cmake:

      $ cd gemma.cpp/build && cmake ..

      Then run make to build the gemma executable:

      $ make gemma

      If this is successful, you should now have a gemma executable in the gemma.cpp/build/ directory.
  35. How to use gemma.cpp? Run. gemma has the following required arguments:
      • --model: model type (example values: 2b-it | 2b-pt | 7b-it | 7b-pt)
      • --compressed_weights: compressed weights file (example value: 2b-it-cpp-sfp.sbs)
      • --tokenizer: tokenizer file (example value: tokenizer.spm)
  36. Best of predictive and generative AI in one platform.
      • Discover: foundation models, task-specific models, open models, data & algorithms
      • Prototype: Colab & Workbench notebooks, AI Studio, developer SDKs
      • Develop:
        ◦ Customize: training on GPUs & TPUs; tuning with SFT, RLHF, & distillation
        ◦ Augment: grounding & extensions; embeddings & vector search
        ◦ Evaluate: experiments & TensorBoard; AutoMetrics & AutoSxS
        ◦ Orchestrate: pipelines & metadata
        ◦ Govern: Feature Store & Model Registry
      • Deploy: predict with endpoints on GPUs & TPUs; monitoring & safety
  37. Vertex AI is built for developers: interfaces for all developers (Firebase, Colab, Vertex AI, Flutter). Making it easy for any developer to also be an AI developer:
      • Extensive quick start library with code samples
      • Free developer labs and training resources at https://cloudskillsboost.google/
      • Robust integrations with popular third-party developer tools
      • Packages and extensions to natively support Google Cloud foundation models
  38. Migrating from local Gemma to Vertex AI
      • Assumptions
        ◦ You have a Google Cloud Project (GCP) with Vertex AI enabled.
        ◦ You have a fine-tuned Gemma model locally.
      • Steps (see the sketch below)
        1. Export the Gemma model
        2. Prepare the model for Vertex AI
        3. Upload the model to Google Cloud Storage (GCS)
        4. Create a Vertex AI model resource
        5. Import the model to Vertex AI
        6. Deploy as an endpoint (optional)
        7. Test and integrate
      https://cloud.google.com/vertex-ai/docs/model-registry/import-model
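
      A hedged sketch of steps 4-6 with the Vertex AI Python SDK, assuming the fine-tuned artifacts have already been exported and copied to a GCS bucket and that a suitable serving container image is available; the project, bucket, image, and machine type below are placeholders, not values from the deck.

      from google.cloud import aiplatform

      # Point the SDK at your project and region (placeholders).
      aiplatform.init(project="my-gcp-project", location="us-central1")

      # Create a Vertex AI Model resource from the exported artifacts in GCS.
      model = aiplatform.Model.upload(
          display_name="gemma-finetuned",
          artifact_uri="gs://my-bucket/gemma-finetuned/",
          serving_container_image_uri="us-docker.pkg.dev/my-project/serving/gemma-server:latest",
      )

      # Optional: deploy the model as an online prediction endpoint.
      endpoint = model.deploy(machine_type="g2-standard-12")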
  39. Kubernetes as the foundation for open LLMs
      • Flexibility: choice of frameworks and ecosystem tools that are container-portable
      • Performance: orchestrate AI models at massive scale across specialized compute
      • Efficiency: optimize valuable compute resources while reducing operational complexity
  40. Harness the full potential of AI/ML with GKE: train, fine-tune, and serve AI models with portability, scalability, and easy administration.
      • GKE is a robust, production-ready AI and ML platform with all the benefits of managed Kubernetes.
      • Efficient resource management, consistent ops environments, and autoscaling.
      • Easy orchestration of Google Cloud AI accelerators, including GPUs and TPUs, for faster training and inference.
  41. Streaming ML - Gemma with Dataflow ML
      • Low-code pipelines
      • Single line of code for multiple sources and destinations
      • Pre- and post-processing steps for real-world use cases
      • Simple to complex ensembles with multiple models / frameworks

      model_handler = TFModelHandlerNumpy(model_path, inference_fn=gemma_inference_function)

      with beam.Pipeline() as p:
          _ = (
              p
              | beam.io.ReadFromPubSub(topic=topic)  # Read from the source
              | beam.Map(lambda x: x.decode("utf-8"))  # Decode the source data
              | RunInference(model_handler, inference_args={'max_length': 32})  # Send the prompt to Gemma
              | beam.ParDo(FormatOutput())  # Format the output
              | beam.Map(print)  # Print the formatted output
          )
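
      The pipeline above assumes two helpers that are not shown on the slide: gemma_inference_function (model-specific, omitted here) and FormatOutput. A plausible sketch of the latter, formatting the RunInference results (each element is a Beam PredictionResult with .example and .inference fields):

      import apache_beam as beam

      class FormatOutput(beam.DoFn):
          """Turn a PredictionResult into a printable prompt/output string."""
          def process(self, element):
              yield f"Prompt: {element.example}\nOutput: {element.inference}"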
  42. Useful resources
      • Serve Gemma using GPUs: on GKE with vLLM; on GKE with Hugging Face TGI; on GKE with Triton and TensorRT-LLM
      • Serve Gemma using TPUs: on GKE with Saxml
      • Gemma deployment to GKE using TGI on GPU
      • Use Gemma with Dataflow
  43. Google's AI Principles
      • Be built and tested for safety
      • Be accountable to people
      • Incorporate privacy design principles
      • Uphold high standards of scientific excellence
      • Be made available for uses that accord with these principles
      • Avoid creating or reinforcing unfair bias
      • Be socially beneficial
  44. Safety by design: Gemma's approach to responsible AI. Transparent and robust evaluations. Responsible Generative AI Toolkit.
  45. Safety by design
      • Filtered pre-training data: filter out certain personal information and other sensitive data.
      • Safety tuning: supervised fine-tuning & reinforcement learning with human feedback for safety.
  46. Transparent & robust evaluations
      • Human SxS: human raters compared Gemma and Mistral over 400 prompts testing for basic safety protocols.
      • Academic benchmarks: Gemma results on 9 authoritative academic safety and fairness benchmarks.
      • Internal safety evaluations: advanced red teaming; manual testing for advanced capabilities (chemical, biological risks); tested for safety, privacy, societal risks, data memorization, and dangerous capabilities.
  47. Responsible Generative AI Toolkit
      • Model debugging: the first LLM prompt-debugger, based on saliency methods.
      • RAI guidance: guidance on developing responsible models.
      • Safety classifiers: a hate speech classifier, plus a methodology to build any classifier with limited data points.
  48. Build your safety classifier
      1. Collect 80-500 training data examples
      2. Parameter-efficient tune using LoRA
      3. Get model scores or predictions and evaluate
      Example: a custom hate speech classifier built with 200 data points reaches SOTA on the ETHOS leaderboard (F1: 0.8).
  49. Safety as a shared responsibility
      • Gemma is an open model, accessible for further development.
      • Those adapting Gemma must uphold safety and quality standards.
      • Responsibility extends to creators and end-users.
  50. Typical usage of LLMs. LLMs are phenomenal for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data. Typical LLM uses: text generation, summarization, Q&A.
  51. But… the Grounding Problem (aka hallucinations). LLMs can only understand the information:
      • that they were trained on, or
      • that they are explicitly given in the prompt.
      They will often assume that the premise of a prompt is true, and they don't have the capability to ask for more information when they need outside input.
      Prompt: What is langchain?
      Response: LangChain is a blockchain-based platform that allows users to create, share, and learn languages. It is a decentralized platform that is not controlled by any single entity. This means that users have complete control over their data and content. LangChain is also a secure platform, as all data is encrypted and stored on the blockchain. This makes it very difficult for hackers to access or steal data.
  52. Some naive solutions
      • (Full) fine-tuning: run fine-tuning to teach the LLM by baking knowledge into the weights of the network. Drawbacks: data preparation effort; expensive to do; hard to do online learning / updating with new data; still may not work.
      • Make humans check: put a human in between the output of the LLM and the user. Drawbacks: high latency and expense; humans can be untrustworthy or inattentive.
      • Prompt engineering: induce the LLM by adding any known information to the prompt. Drawbacks: based on the LLM's existing trained knowledge; token limit for LLMs; error-prone (retrieving the right context); trade-off between performance, latency, and cost.
  53. Retrieval Augmented Generation (RAG): 'grounding' on user data.
      The problem:
      • LLMs do not know your business's proprietary or domain-specific data
      • LLMs do not have real-time information
      • LLMs find it challenging to provide accurate citations from their parametric knowledge
      The solution: feed the LLM *relevant* context in real time, by using an information retrieval system.
      [Diagram: Retrieval Augmented Generator. The question goes to an information retrieval (IR) system whose retriever (BM25, encoder, vector search) queries external knowledge; the ranked results supply related context that is added to the input prompt of the generator (LLM: PaLM), which produces the answer. From "Generative AI applications with Vertex AI PaLM 2 Models and LangChain".]
  54. Common use cases / applications
      • Question & answering: semantic search and/or summarization over unstructured documents or structured data sources. Can involve breaking down a complex question, or combining heterogeneous data sources or multiple documents.
      • Chatbots: instead of a single question and answer, a chatbot can handle multiple back-and-forth queries and answers, getting clarification or answering follow-up questions.
      • Agents: an "agent" is an automated reasoning and decision engine that takes a user input/query and makes internal decisions for executing that query to return results. Involves breaking down a complex question, choosing external tools, planning tasks, and caching completed tasks.
  55. When do you fine-tune vs RAG? (RAG vs fine-tuning: which is the best tool to boost your LLM application?)
      • External knowledge required? Fine-tuning: ❌  RAG: ✅
      • Model adaptation required? Fine-tuning: ✅  RAG: ❌
      • Minimize hallucinations? Fine-tuning: ❌  RAG: ✅
      • Is training data available? Fine-tuning: ✅  RAG: ❌
      • How dynamic is the data? Fine-tuning: ❌  RAG: ✅
      • Interpretability required? Fine-tuning: ❌  RAG: ✅
  56. Meet AI's multitool: vector embeddings. "An embedding is a relatively low-dimensional vector into which you can translate high-dimensional vectors. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space." (From the Google Machine Learning Crash Course.) DL models map data (~10^4-10^6 dims) down to embeddings (~10^2-10^4 dims).
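
      A hedged sketch of that idea in practice (not on the slide): embed two sentences with the Vertex AI Text Embeddings API (the model name is illustrative; check the current model list) and measure how close they sit in the embedding space with cosine similarity.

      import numpy as np
      from vertexai.language_models import TextEmbeddingModel

      # Embed two semantically related sentences.
      model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
      e1, e2 = model.get_embeddings(["How do I bake bread?", "Recipe for a simple loaf"])

      # Cosine similarity: closer to 1.0 means more semantically similar.
      v1, v2 = np.array(e1.values), np.array(e2.values)
      similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
      print(similarity)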
  57. RAG workflow for building a QA system (a minimal sketch follows below)
      Data ingestion / parsing:
      • Split up document(s) into even chunks; each chunk is a piece of raw text.
      • Generate an embedding for each chunk.
      • Store each chunk in a vector database.
      Querying:
      • Retrieval: generate an embedding for the query and find the top-k most similar chunks in the vector database.
      • Synthesis: plug the retrieved chunks into the LLM response synthesis.
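
      A minimal sketch of that retrieval + synthesis loop. The `embed` and `generate` callables are placeholders for your embedding model and LLM (for example, the Vertex AI Text Embeddings API and Gemma from earlier slides), and a plain Python list stands in for the vector database.

      import numpy as np

      def build_index(chunks, embed):
          # Ingestion: embed each chunk and keep (chunk, vector) pairs.
          return [(chunk, np.asarray(embed(chunk))) for chunk in chunks]

      def rag_answer(query, index, embed, generate, k=3):
          # Querying: embed the query and rank chunks by cosine similarity.
          q = np.asarray(embed(query))
          scored = sorted(
              index,
              key=lambda item: float(q @ item[1] / (np.linalg.norm(q) * np.linalg.norm(item[1]))),
              reverse=True,
          )
          # Synthesis: plug the top-k chunks into the LLM prompt as context.
          context = "\n".join(chunk for chunk, _ in scored[:k])
          return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")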
  58. Retrieval Augmented Generation: example architecture powered by Vertex AI Text Embeddings and Vector Search. The updated Vertex AI Feature Store now natively supports embeddings storage and retrieval.
      [Diagram: (1) Index the relevant content: the document corpus or knowledge base (long docs split into smaller chunks) is embedded with the Vertex AI Text Embeddings API and stored as an embeddings + content index in Vertex AI Vector Search or your favourite low-latency key-value store. (2) Fetch relevant info and augment the prompt: the user prompt is embedded with the Text Embeddings API, the top-k semantic matches are fetched and appended to the prompt as context, and the augmented prompt is sent to the Vertex AI PaLM 2 LLM to produce the response.]
  59. Improving performance: better retrieval == better results.
      [Diagram: the RAG pipeline stages to tune: loading → chunking → embedding → storage (vector store) → retrieval (ranked docs) → answer generation (prompt + LLM), with evaluation across the query → <answer> flow.]