GenAI for the rest of us

Aletheia
January 16, 2024

An introductory talk designed to unveil the mysteries and potentials of generative artificial intelligence. This session will provide a concise overview of how generative AI models like GPT and DALL-E work, their applications in various fields, and the ethical considerations surrounding their use. Attendees will gain insights into the transformative impact of these technologies and explore how they're shaping the future of creativity, automation, and human-AI interaction.

Transcript

  1. Generative AI from Zero to Hero in less than four hours.
    GenAI for the rest of us

  2. Who am I?
    Luca Bianchi, PhD
    Chief Technology Officer @ Neosperience and Neosperience Health,
    proud AWS Serverless Hero, passionate about software architectures,
    serverless, and machine learning.
    Serverless Italy and [Gen]AI Milano Meetup co-founder.
    ServerlessDays Milano co-organizer.
    github.com/aletheia
    https://it.linkedin.com/in/lucabianchipavia
    https://speakerdeck.com/aletheia
    bianchiluca.com
    @bianchiluca

  3. Who am I?
    Janos Tolgyesi
    Machine Learning Team Leader @ Neosperience, proud AWS
    Community Builder, passionate about machine learning.
    [Gen]AI Milano Meetup co-founder.
    github.com/mrtj
    linkedin.com/in/janostolgyesi
    @jtolgyesi

  4. Summary
    1. The AI Landscape
    2. Transformers
    3. Foundation Models
    4. Advancements on LLMs
    5. Improving LLM behaviour
    6. Project Lifecycle
    7. Tools
    8. The road ahead

  5. 01.
    The AI Landscape

  6. The landscape of AI is vast and continually evolving, with various subfields offering
    specialized applications. Awareness of this landscape is essential for professionals
    across different disciplines who wish to harness AI's capabilities effectively.
    Machine Learning: The backbone of modern AI, machine learning algorithms
    allow computers to learn from data. Techniques range from supervised to
    unsupervised learning, with deep learning becoming increasingly popular.
    Natural Language Processing (NLP): A subfield focused on the
    interaction between computers and human languages. Examples include machine
    translation, chatbots, and sentiment analysis.
    Computer Vision: Enables machines to interpret visual information from the
    world. Key applications include facial recognition, medical image analysis, and
    autonomous vehicles.
    Reinforcement Learning: A subfield of machine learning where agents learn
    to make decisions by interacting with an environment to achieve a goal. Used in
    areas like game theory, robotics, and recommendation systems.
    The AI Landscape
    01. AI LANDSCAPE

  7. AI is more than LLMs
    There are five categories of
    artificial intelligence ("tribes"),
    each modeling a different aspect
    of human rational reasoning.
    01. AI LANDSCAPE

  8. 01. AI LANDSCAPE

  9. Neural Networks (the short version)
    01. AI LANDSCAPE

  10. Understanding the semantic relationships between terms such as
    synonyms, antonyms, and hyponyms is crucial for precise communication.
    Language abstraction involves using general or specific terms to convey
    complex ideas, sometimes creatively through metaphors. Idioms, like "break a leg,"
    carry culturally specific meanings that aren't literal.
    Paradoxes are statements that seemingly contradict themselves but might still
    be true, challenging our understanding of logic.
    Lastly, meta-language describes another language, offering a secondary layer
    of understanding, like code comments in programming.
    In specialized fields like marketing, medicine, and software development, these
    linguistic elements are vital. Marketers need to understand cultural idioms,
    medical professionals require semantic precision for accurate diagnosis and
    treatment, and developers use abstraction and meta-language for code
    efficiency and collaboration.
    Understanding these nuances aids in tailored communication and problem-solving
    across these disciplines.
    Language is difficult (for machines)
    01. AI LANDSCAPE

  11. The key algorithm was, for more than
    a decade, the Recurrent Neural Network
    (RNN), which is capable of generating text
    based on a sequence of characters.
    This presents a lot of challenges:
    I took my money to the bank (river bank?)
    The teacher taught the student with the book (who used the book?)
    The movie was great, but the theatre was awful (what is the sentiment?)
    Language generation (before 2017)
    01. AI LANDSCAPE

  12. 02.
    Transformers

  13. What is self-attention?
    02. TRANSFORMERS
    Self-attention is a mechanism that allows the model to learn the
    strong semantic correlations between words in a given sentence.
    Attention is “focusing on the most important parts of the input data.”
    Technically speaking, attention measures the similarity between
    two vectors and returns the weighted similarity scores.
    A standard attention function takes three main inputs: query, key,
    and value vectors.
    A Holistic Guide to the Transformer Neural Network Architecture
    https://deeprevision.github.io/posts/001-transformer/
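
    As an illustration (not from the deck), here is a minimal sketch of scaled dot-product attention in Python with NumPy; the shapes and values are toy assumptions:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V for a single attention head
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # similarity between queries and keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V  # weighted sum of the value vectors

    # Toy example: 3 tokens with 4-dimensional representations
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)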

  14. A breakthrough paper from Google, presenting the Transformers architecture:
    ● Replaces Recurrence: Traditional sequence-to-sequence models like RNNs and LSTMs rely on
    recurrent mechanisms that process each token in sequence. The attention mechanism replaces this
    by calculating the relationships between all words in parallel, thereby eliminating the need for
    recurrent layers.
    ● Parallelization: Because attention computes relationships simultaneously, the model can process
    multiple parts of the input at the same time. This allows for faster computation and significantly
    reduces training time.
    ● Computes Relationships: Attention weighs the importance of different parts of the input when
    producing each element in the output. This is especially effective in capturing long-range
    dependencies within sequences that RNNs and LSTMs often struggle with.
    Attention is all you need (2017)
    02. TRANSFORMERS

  15. It transforms the input sequence into a compressed representation.
    Architecture: Originally, the transformer architecture featured six encoder
    blocks, though this number can vary depending on the size of the
    architecture.
    Encoder Block Structure: Each encoder block consists of three main layers:
    ● Multi-Head Attention (MHA): Enables the model to simultaneously focus on
    different parts of the input sequence.
    ● Layer Normalization: Standardizes the outputs of each sub-layer before
    they are passed to the next.
    ● MLPs (Multi-Layer Perceptrons): As feedforward layers, they process the
    sequence further.
    Sub-Layers and Additional Components: MHA and MLPs are considered sub-
    layers. They are interconnected with layer normalization, dropout, and residual
    connections, which are crucial for the flow and efficiency of the architecture.
    Importance of Encoder Layer Count: The number of encoder layers (initially
    six) correlates with the model's size and ability to capture the global context
    of input sequences. More layers generally lead to better task generalization
    due to a more comprehensive context understanding.
    Transformers (Encoder)
    02. TRANSFORMERS

  16. The decoder closely resembles the encoder but includes an
    additional multi-head attention layer that operates over the
    encoder's output.
    This extra layer is essential for integrating the encoder's output with
    the target sequence.
    The decoder's primary function is to merge the encoder's output with
    the target sequence to make predictions or determine the next
    token.
    Masked Attention in Decoder: To ensure the prediction process's
    integrity, the decoder's attention mechanism is masked. This masking
    prevents the current token being processed from attending to
    subsequent tokens in the target sequence. Without it, the decoder
    would have easy access to future sequence information, potentially
    leading to overfitting and poor generalization outside the training
    data.
    Similar to the encoder, the decoder is repeated multiple times for
    effectiveness. The original transformer model had six decoder
    blocks, mirroring the number of encoder blocks.
    Transformers (Decoder)
    A Holistic Guide to the Transformer Neural Network Architecture
    https://deeprevision.github.io/posts/001-transformer/
    02. TRANSFORMERS

  17. ● Starting with images in pixel space and considering noising and de-noising processes.
    ● A neural network can be trained to gradually de-noise data starting from pure noise, by teaching the model
    to predict the noise that was added.
    ● It is called the noise predictor in Stable Diffusion.
    ● Noise is then subtracted from the original image.
    ● This cycle of noise addition, prediction, and subtraction is repeated multiple times to refine the image quality and detail.
    Diffusion models
    02. TRANSFORMERS
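
    To make the noise-prediction cycle concrete, here is a minimal sketch of one DDPM-style training step in PyTorch; it assumes model(x_t, t) returns a noise estimate and alphas_cumprod is a precomputed noise-schedule tensor:

    import torch
    import torch.nn.functional as F

    def ddpm_training_step(model, x0, alphas_cumprod, optimizer):
        # Sample a random timestep and Gaussian noise for each image (B, C, H, W)
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
        noise = torch.randn_like(x0)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        # Forward (noising) process: mix image and noise according to the schedule
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        # The noise predictor learns to recover the noise that was added
        loss = F.mse_loss(model(x_t, t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()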

  18. ● Diffusion models working in pixel space are very slow
    because the image space is enormous.
    ● A more efficient approach is to compress the image with
    a Variational Auto Encoder (VAE), thus reducing
    dimensions.
    ● Details are preserved due to redundancy in real-life
    images.
    ● VAE projects the image into the latent space.
    ● Reverse diffusion is applied to the latent space, thus
    resulting in latent diffusion models.
    ● Conditioning the latent space with a tensor coming from
    a text prompt produces the described output from
    denoising.
    ● Stable Diffusion is a latent diffusion model.
    Latent Diffusion models
    The Annotated Diffusion Model
    https://huggingface.co/blog/annotated-diffusion
    02. TRANSFORMERS

  19. Large Language Models present a promising technology with extensive capabilities to enhance our
    interaction with data and natural language across various sectors. However, their deployment must be
    considered carefully due to challenges like hallucination and the need for content moderation.
    ● Trained on Massive Text Datasets: Models like GPT-3, BERT, and Transformers are trained on extensive
    collections of text data, including books, news articles, and online conversations.
    ● Automatic Learning of Word & Phrase Relationships: These models utilize embeddings to
    autonomously understand the relationships between words and phrases, enabling them to generate
    text.
    ● Diverse Applications: Large Language Models have a wide range of uses, from content creation and
    virtual assistance to machine translation and Natural Language Processing.
    ● Challenges: These models can sometimes produce hallucinated information that isn't accurate, and
    their outputs often require moderation to ensure reliability and appropriateness.
    Large Language Models (LLM)
    02. TRANSFORMERS

  20. LLM Landscape
    02. TRANSFORMERS

  21. 03.
    Foundation Models

  22. ● Based on massive datasets, foundation models (FMs) are large deep-learning neural networks and a
    starting point to develop ML models that power new applications more quickly and cost-effectively.
    ● Trained on a broad spectrum of generalized and unlabeled data capable of performing various general
    tasks such as understanding language, generating text and images, and conversing in natural language.
    ● Focus on adaptability. Perform a wide range of disparate tasks with high accuracy based on input
    prompts.
    ● Can be used as base models for developing more specialized downstream applications.
    ● The computational power required for foundation models has doubled every 3.4 months since 2012.
    Foundation Models (FM)
    03. FOUNDATION MODELS

  23. ● Generative Pre-trained Transformer 3 (GPT-3) is
    a large language model released by OpenAI in 2020.
    ● Decoder-only transformer model of deep neural
    network
    ● Uses a 2048-tokens-long context
    ● Involves 175 billion parameters
    ● Requires 800GB of storage space
    ● Strong "zero-shot" and "few-shot" learning abilities
    on many tasks
    GPT-3
    03. FOUNDATION MODELS

  24. ● A fine-tuned version of GPT-3
    ● Provides better human-like conversation
    capabilities
    ● Uses a 2K, 4K, and 16K tokens-long context
    ● Powered the first release of ChatGPT
    ● Generates answers in 10-15 seconds
    GPT-3.5-turbo
    03. FOUNDATION MODELS

  25. ● A new model with improved reasoning capabilities
    ● Multi-modal capabilities
    ● Uses a 32K tokens-long context
    ● Can solve difficult problems with greater accuracy,
    thanks to its broader general knowledge and
    problem-solving abilities
    ● Improved alignment and safety
    ● Supports function calling
    ● It’s the basis of OpenAI GPTs
    GPT-4
    03. FOUNDATION MODELS

  26. ● Released in 2018, Bidirectional Encoder
    Representations from Transformers (BERT)
    was one of the first foundation models.
    ● BERT is a bidirectional model that analyzes
    the context of a complete sequence and
    then makes a prediction.
    ● It was trained on a plain text corpus and
    Wikipedia using 3.3 billion tokens (words)
    and 340 million parameters.
    ● BERT can answer questions, predict
    sentences, and translate texts.
    BERT
    03. FOUNDATION MODELS

  27. ● Open source LLM from Meta
    ● Models ranging from 7B to 70B parameters
    ● Fine-tuned models for chat
    ● Fine-tuned models for code generation
    ● Free for research and commercial use
    LLaMA 2
    03. FOUNDATION MODELS

  28. ● Falcon 180B is an open-source super-powerful
    language model with 180 billion parameters, trained
    on 3.5 trillion tokens.
    ● It topped the Hugging Face Leaderboard for pre-trained
    open large language models.
    ● License for both research and commercial use
    ● Performs exceptionally well in various tasks like
    reasoning, coding, proficiency, and knowledge tests
    ● Improved accuracy over Meta's LLaMA 2.
    ● Ranks just behind OpenAI's GPT-4 and on par with
    Google's PaLM 2 Large
    Falcon 180B
    03. FOUNDATION MODELS

  29. ● Specifically trained to reduce model hallucinations.
    ● Supports 200K context window
    ● Can orchestrate across developer-defined
    functions or APIs
    ● Can search over web sources
    ● Can retrieve information from private knowledge
    bases
    ● The most advanced alternative to OpenAI's models
    Claude 2.1
    03. FOUNDATION MODELS

  30. ● Natively multi-modal model
    ● Outperforms human experts on MMLU (Massive
    Multitask Language Understanding)
    ● Generates code based on different inputs.
    ● Visual reasoning capabilities.
    ● The Ultra version has not yet been released
    ● The Pro version offers capabilities similar to GPT-3
    Gemini (Ultra)
    03. FOUNDATION MODELS

  31. ● Advanced Image Synthesis: Generates detailed,
    creative images from textual descriptions using
    deep learning techniques.
    ● Versatile Styles and Techniques: Can create images
    in various styles, including artistic and realistic
    interpretations.
    ● Enhanced Text-to-Image Capabilities: Excels in
    translating complex, abstract text descriptions into
    coherent and relevant visuals.
    ● Improved Quality and Resolution: Offers higher image
    resolution and a better understanding of text inputs
    than previous versions.
    ● Wide Range of Applications: Useful in graphic design,
    advertising, entertainment, education, and more.
    DALL-E 3
    03. FOUNDATION MODELS

  32. ● SOTA open architecture for image generation with
    3.5B parameter base model stage and 6.6B
    parameter ensemble pipeline.
    ● Native 1024x1024 image generation with cinematic
    photorealism and fine detail.
    ● Fine-tuned to create complex compositions with
    basic natural language prompting.
    SDXL Turbo
    03. FOUNDATION MODELS

  33. ● Uses advanced and proprietary AI algorithms to
    create images from textual descriptions.
    ● Produces images that have a unique, sometimes
    surreal, artistic quality.
    ● Can interpret many prompts, from straightforward
    descriptions to more abstract concepts.
    ● User feedback and interactions can influence its
    evolution.
    ● Useful for artists, designers, and creatives for
    inspiration, mock-up generation, and exploring
    visual concepts.
    Midjourney
    03. FOUNDATION MODELS

  34. 04.
    Advancements on
    LLMs

  35. CTRL (Conditional Transformer Language Model) is a breakthrough in natural language processing,
    boasting 1.6 billion parameters.
    It enhances human-AI interaction by allowing controlled generation of content and style using over 50
    control codes.
    This model is unique in its ability to trace back the influence of its training data on generated text, making
    it a versatile tool for a range of NLP applications.
    ● Control Codes: Allow explicit influence over style, genre, entities, etc.
    ● Predictable Variation: Enables variation in generated text based on control codes.
    ● Source Attribution: Identifies data sources influencing text generation.
    Salesforce CTRL Model
    04. ADVANCEMENTS ON LLMS
    Introduction to Salesforce CTRL Model
    https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/

  36. Mixtral 8x7B is a high-quality sparse
    mixture-of-experts model (SMoE) with open
    weights. It handles a context of 32k tokens.
    ● Sparse MoE layers are used instead of
    dense feed-forward network (FFN)
    layers. MoE layers have a certain number
    of “experts” (e.g. 8), where each expert
    is an FFN, or even a MoE itself
    ● A gate network or router determines
    which tokens are sent to which expert
    Mixtral of Experts (MoE)
    04. ADVANCEMENTS ON LLMS
    Mixture of Experts Explained
    https://huggingface.co/blog/moe#what-is-a-mixture-of-experts-moe
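
    A minimal sketch of the routing idea in PyTorch (a toy layer to illustrate gating, not Mixtral's actual implementation): a gate network scores the experts and each token is processed only by its top-2 experts:

    import torch
    import torch.nn as nn

    class ToyMoELayer(nn.Module):
        def __init__(self, d_model, n_experts=8):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.router = nn.Linear(d_model, n_experts)  # the gate network

        def forward(self, x):  # x: (tokens, d_model)
            gates = self.router(x).softmax(dim=-1)
            weights, idx = gates.topk(2, dim=-1)  # top-2 experts per token
            weights = weights / weights.sum(-1, keepdim=True)
            out = torch.zeros_like(x)
            for slot in range(2):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e  # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
            return out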

  37. 05.
    Improving LLM
    behaviour

  38. Building a Large Language Model is a difficult task for a number of reasons:
    Training:
    ● High demand for computing power: massive GPU clusters are often needed to crunch data for days to train a
    single LLM, so training an LLM from scratch can require billions of dollars of investment.
    ● Training requires deep knowledge of both data science and MLOps to properly optimize the network architecture
    and the underlying infrastructure.
    Inference:
    ● Model context length is bounded to a very small size (8K, 32K, 100K), which means the model's contextual
    knowledge is limited and must be optimized.
    ● Model knowledge is frozen at training time: no new elements, tailored knowledge, or real-time data can
    be used by the model.
    ● Model hallucination can produce biased answers or discuss topics not aligned with customer guardrails.
    ● Performance is difficult to evaluate.
    Plain LLM — challenges
    05. IMPROVING LLM BEHAVIOUR

  39. Model inference can be deeply improved using innovative techniques such as:
    ● Parameter Efficient Fine-Tuning (PEFT) to fine-tune models in an efficient fashion with a reduced
    number of dataset items.
    ● Model Quantization and Low-Rank Adaptation (QLoRA) to efficiently reduce the model size and the
    computing memory needed for inference.
    ● Embeddings to encode vast knowledge and empower retrieval capabilities.
    ● Retrieval Augmented Generation (RAG) to complement the model with external knowledge.
    ● Guidelines to set boundaries on model behaviour.
    ● Reinforcement Learning with Human Feedback (RLHF) to align the model with a specific tone of voice
    and company values, and to avoid misbehaviour.
    Improving LLMs — techniques
    05. IMPROVING LLM BEHAVIOUR

  40. 05.1.
    Prompt Engineering

  41. ● Prompt Engineering is the art and science of crafting inputs (prompts) to elicit desired responses from
    AI models, particularly in language processing and generative tasks.
    ● Optimizes the interaction between humans and AI, enhancing the quality, relevance, and accuracy of
    the AI's output.
    ● Key components:
    ● Precision: Carefully choosing words and phrases to guide the AI towards the intended interpretation.
    ● Context: Providing sufficient background information for the AI to understand the query.
    ● Clarity: Avoiding ambiguity to minimize misinterpretation.
    What is Prompt Engineering?
    05. IMPROVING LLM BEHAVIOUR
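
    To make these components concrete, here is a small hypothetical few-shot prompt (written as a Python string); the wording and examples are illustrative assumptions:

    # The first line provides context, the instructions add precision and
    # clarity, and the labeled examples guide the expected output format.
    prompt = """You are a customer-support assistant for an e-commerce store.
    Classify each message as POSITIVE, NEGATIVE, or NEUTRAL. Answer with exactly one word.

    Message: "My order arrived two days early, great service!"
    Sentiment: POSITIVE

    Message: "Still waiting for a refund after three weeks."
    Sentiment: NEGATIVE

    Message: "Do you ship to Italy?"
    Sentiment:"""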

  42. Principles
    05. IMPROVING LLM BEHAVIOUR

  43. Principles
    05. IMPROVING LLM BEHAVIOUR

  44. 05.2.
    Parameter Efficient
    Fine Tuning (PEFT)

  45. Parameter-efficient fine-tuning (PEFT) methods enable efficient adaptation of pre-trained language models
    (PLMs) to various downstream applications without fine-tuning all the model's parameters.
    Fine-tuning large-scale PLMs is often prohibitively costly. PEFT methods fine-tune only a small number of
    (extra) model parameters, thereby significantly decreasing the computational and storage costs.
    Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.
    ● Needs to train only 15%-20% of the original LLM weights, making the training process less expensive
    ● Updates only a small subset of parameters. This helps prevent catastrophic forgetting.
    ● Works well in sparse data environments
    ● Easy portability and deployment
    ● Economically efficient approach
    What is PEFT?
    05. IMPROVING LLM BEHAVIOUR

  46. Performing full fine-tuning can lead to catastrophic
    forgetting because it changes all parameters of the
    model.
    Since PEFT only updates a small subset of
    parameters, it's more robust against this
    catastrophic forgetting effect.
    Training LLMs is computationally intensive.
    Full fine-tuning requires memory to store the model,
    plus various other quantities needed during the
    training process: optimizer states, gradients, forward
    activations, and temporary memory.
    Full fine-tuning an LLM
    05. IMPROVING LLM BEHAVIOUR

  47. Several approaches to PEFT can balance different kinds of trade-offs on
    costs, memory, training speed, and performance:
    ● Selective methods identify which parameters you want to update:
    train only certain components of the model or specific layers, even
    individual parameter types
    ● Reparameterization methods work with low-rank representations
    of the original weights, like LoRA (Low-Rank Adaptation)
    ● Additive methods carry out fine-tuning by keeping all of the original
    LLM weights frozen and introducing new trainable components
    ● Adapter methods add new trainable layers to the model's
    architecture, typically inside the encoder or decoder components
    after the attention or feed-forward layers.
    ● Soft prompt methods keep the model architecture fixed and frozen
    and focus on manipulating the input to achieve better performance.
    This can be done by adding trainable parameters to the prompt
    embeddings, keeping the input fixed, and retraining the embedding
    weights
    PEFT Trade-offs
    05. IMPROVING LLM BEHAVIOUR

  48. The goal is to find an efficient way to update the weights of
    the model without having to train every single parameter
    again. The transformer's basic structure is built with
    encoder-decoder blocks, self-attention, and feedforward
    networks; LoRA applies a slight modification to the
    self-attention network.
    ● Performs parameter-efficient fine-tuning
    with a single GPU, avoiding the need for a distributed
    cluster of GPUs.
    ● Fine-tunes a different set of weights for each task; they can
    be switched out at inference time by updating the weights.
    ● LoRA is broadly used in practice because of its
    performance, comparable to full fine-tuning for many
    tasks and data sets.
    LoRA
    05. IMPROVING LLM BEHAVIOUR
    Parameter Efficient Fine Tuning
    https://medium.com/@kanikaadik07/peft-parameter-efficient-fine-tuning-55e32c60c799
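
    A minimal sketch with the Hugging Face peft library; the model name and target modules are example assumptions:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    config = LoraConfig(
        r=8,                  # rank of the low-rank update matrices
        lora_alpha=32,        # scaling factor for the update
        target_modules=["q_proj", "v_proj"],  # attach LoRA to attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable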

  49. 05.3.
    Retrieval Augmented
    Generation (RAG)

  50. RAG is ingeniously designed to amplify the capabilities of generative models, such as those used for text
    generation, by uniquely integrating them with a retrieval mechanism.
    ● Retrieval Mechanism: The retrieval part of RAG involves searching a large database or corpus of
    documents to find relevant information.
    ● Generative Model: Relevant information is fed into an LLM to generate a response or continuation of
    the text.
    ● Advantages: The primary advantage of RAG is that it allows the generative model to produce highly
    relevant responses informed by up-to-date or specialized information.
    ● Applications: RAG can be used in various applications such as chatbots, question-answering systems,
    content creation tools, and more.
    ● Continuous Learning: The retrieval database can be periodically updated, allowing the system to stay
    current with new information.
    What is RAG?
    05. IMPROVING LLM BEHAVIOUR

  51. What is RAG?
    05. IMPROVING LLM BEHAVIOUR

  52. ● Acquire source documentation
    ○ web crawling,
    ○ data lake extraction,
    ○ connecting to proprietary databases, etc.
    ● Convert documentation from source formats (pdf, doc, html, etc.) to plain text / lightweight structured
    format (simple html, markdown)
    ● Create text chunks
    LOAD: Document preprocessing
    05. IMPROVING LLM BEHAVIOUR

  53. ● Divide text into syntactically correlated parts like phrases, paragraphs, and sections
    ● Make multiple relevant parts of the source document fit in the context of the LLM
    ● Ideally one chunk also discusses a single topic
    ● Divide the source text by phrases (punctuation marks), paragraphs (newline characters), or, if it is
    markdown, by different levels of headers.
    ● Engineering decisions:
    ○ What is the maximum size of a chunk, as a function of the context length len(context) of the
    LLM, the number of chunks (k) we want to stuff in the context, and the length of the instructive
    part of the prompt (len(prompt))?
    ○ Whether to use overlap between chunks, and if yes, how much?
    ● max(len(chunk)) * k + len(prompt) < len(context)
    SPLIT: Text splitting
    05. IMPROVING LLM BEHAVIOUR
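
    A minimal sketch of fixed-size chunking with overlap, honoring the budget constraint above; the parameter values are illustrative:

    def split_text(text, max_chunk=1000, overlap=100):
        # Split text into chunks of at most max_chunk characters,
        # sharing `overlap` characters between consecutive chunks.
        assert max_chunk > overlap
        chunks, start = [], 0
        while start < len(text):
            end = min(start + max_chunk, len(text))
            chunks.append(text[start:end])
            if end == len(text):
                break
            start = end - overlap  # step back to create the overlap
        return chunks

    # The budget constraint: k chunks plus the instructive prompt must fit the context.
    k, len_prompt, len_context = 4, 500, 8000
    max_chunk = (len_context - len_prompt) // k  # largest chunk size that satisfies it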

  54. ● Text embeddings are a representation of text in the form of numeric vectors
    ● They capture the semantic meanings of words, phrases, sentences, or chunks of documents
    ● Properties:
    ○ Map documents of different lengths into a fixed-length, smaller dimension vector-space (typical
    number of dimensions: 512, 768, 1536, etc.)
    ○ Semantic similarity: similar words, phrases or concepts are mapped to vectors that are “close” to
    each other.
    ○ Modern embedding models are neural network based and represent the model’s “understanding”
    of the text
    ○ Vector algebra works: emb(“king”) - emb(“man”) + emb(“woman”) ≅ emb(“queen”)
    ● Examples: Ada-002 (OpenAI), Titan Embeddings (Amazon), Gecko (Vertex AI / Google)
    ● Use cases: semantic search, classification, clustering, outlier detection.
    EMBED: Embeddings
    05. IMPROVING LLM BEHAVIOUR
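
    A minimal sketch with the sentence-transformers library; the model name is just one example of an embedding model:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors
    docs = ["The cat sat on the mat.",
            "A kitten rested on the rug.",
            "Quarterly revenue grew by 12%."]
    emb = model.encode(docs)  # shape: (3, 384)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb[0], emb[1]))  # high: semantically similar sentences
    print(cosine(emb[0], emb[2]))  # low: unrelated topics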

  55. ● A non-parametric supervised learning method that identifies the k “nearest” data points in a dataset to a
    given input
    ○ “nearest” is interpreted by an appropriate metrics that defines a distance function between data
    points
    ○ k is a user-defined parameter
    ● Example metrics: Euclidean distance or cosine similarity for real vectors, Manhattan distance for
    coordinates, Hamming distance for binary data, Jaccard distance for sets, etc.
    ● Algorithms with their computational complexity as a function of dimension (d) and dataset size (N):
    ○ brute force, exact k-NN: O(d * N)
    ○ Hierarchical Navigable Small Worlds (HNSW): O(log N)
    ○ Inverted File System (IVF): depending on the dataset and parameters, often < O(N)
    ● Other factors to consider: does the algorithm support pre-filtering?
    k-nearest neighbors (k-NN)
    05. IMPROVING LLM BEHAVIOUR
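
    A minimal sketch of brute-force exact k-NN with cosine similarity, the O(d * N) baseline mentioned above:

    import numpy as np

    def knn(query, vectors, k=3):
        # Return the indices of the k vectors most similar to the query.
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        sims = v @ q  # cosine similarity against every data point
        return np.argsort(-sims)[:k]  # indices of the k highest scores

    dataset = np.random.default_rng(0).normal(size=(10_000, 384))
    print(knn(dataset[42], dataset, k=3))  # the first hit is point 42 itself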

  56. ● A vector store is a specialized type of database designed to handle vector data efficiently.
    ○ Store high-dimensional vector data
    ○ Efficient similarity search (knn) implementations
    ○ Might scale up to billions of data points
    ○ Indexing for fast retrieval (for approximate k-nn algorithms)
    ● Might support storing structured metadata along the vectors
    ● Might support pre-filtering or post-filtering of the data points based on predicates on metadata.
    ● Examples: a lot of open source options (OpenSearch, pgvector, Chroma, Milvus, LlamaIndex, Apache Cassandra,
    etc.) and commercial ones (Pinecone, Couchbase, MongoDB Atlas, etc.)
    ● Factors to consider when choosing a vector store: total cost of ownership, scalability, features
    (updates, filtering, etc.)
    STORE: Vector Store
    05. IMPROVING LLM BEHAVIOUR
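
    A minimal sketch with the open source Chroma vector store; the collection name and documents are illustrative, and the default embedding function is assumed:

    import chromadb

    client = chromadb.Client()  # in-memory instance
    docs = client.create_collection("knowledge-base")
    docs.add(
        ids=["doc1", "doc2"],
        documents=["LLMs are trained on massive text datasets.",
                   "VAEs compress images into a latent space."],
        metadatas=[{"source": "slides"}, {"source": "slides"}],
    )
    hits = docs.query(query_texts=["How are language models trained?"], n_results=1)
    print(hits["documents"])  # the nearest chunk by embedding similarity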

  57. Hands-on
    RAG: question answering over
    custom knowledge base
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  58. 05.4.
    Multimodal Search

    View full-size slide

  59. Hands-on
    CLIP: multi-modal embedding
    models, image embeddings, vector
    store, query by text and image
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  60. Hands-on
    Stable Diffusion
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  61. 06.
    Project Lifecycle

  62. “The failure rate on AI projects
    has been between 83% and 92%”
    — Fortune.com
    https://fortune.com/2022/07/26/a-i-success-business-sense-aible-sengupta

  63. “Successful AI initiatives
    require a good understanding of
    the AI project lifecycle”
    — Forbes.com
    https://www.forbes.com/sites/cognitiveworld/2022/08/14/the-one-practice-that-is-separating-the-ai-successes-from-the-failures/?sh=6df5b30e17cb

  64. Generative AI Projects Lifecycle
    The AI Value Flywheel
    Managing the project lifecycle
    06. PROJECT LIFECYCLE

  65. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    Shape the use case
    defining the task the
    project is expected to
    resolve.
    Select the interaction
    interface to be exposed
    to users.
    Define KPIs and
    constraints for the
    solution to be
    acceptable.
    Define overall project
    running budget.
    06. PROJECT LIFECYCLE

  66. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    Select the optimal
    Foundation Model (FM) to
    be used, based on
    available data, supported
    languages and regulatory
    constraints.
    06. PROJECT LIFECYCLE

  67. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    Adopt techniques to
    make models adapt to
    solve specific tasks.
    Evaluate model fine-
    tuning opportunity to
    increase model
    specificity to languages
    and tasks.
    Evaluate model alignment
    to further customize
    tone of voice, enforce
    guardrails, and prevent
    hallucinations.
    06. PROJECT LIFECYCLE

  68. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    APPLICATION
    INTEGRATION
    Integrate models with
    external data sources to
    provide up-to-date or
    real-time responses,
    overcome context
    constraints, and call APIs.
    Implement reasoning and
    acting accordingly to
    improve autonomous
    interactions.
    06. PROJECT LIFECYCLE

  69. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    APPLICATION
    INTEGRATION
    DEPLOY
    Define deployment
    targets and hardware
    constraints.
    Perform model
    optimization to balance
    precision and required
    computing power.
    Exploit SaaS/Cloud/on-
    premise alternatives to
    address company
    constraints and budget.
    06. PROJECT LIFECYCLE

  70. Define project scope to properly identify the right model, based on the task to
    accomplish:
    ● essay writing
    ● summarization
    ● translation
    ● information retrieval
    ● reasoning / agents
    ● entity / sentiment recognition
    A project can leverage one or more tasks, each with corresponding models involved. It's
    quite common that more than one model is adapted to accomplish a set of tasks.
    Project Scope Definition — Task
    06. PROJECT LIFECYCLE

  71. Interaction is a pivotal feature of LLM applications. Defining the proper user interface
    could result in an excellent customer engagement opportunity.
    ● chatbot / conversational
    ● form with response
    ● API one-shot
    ● API with context memory
    For each of these options, a number of sub-cases, such as the kind of information to
    be presented, support for rich text (e.g. Markdown), and the proper format, have to be
    defined as well.
    Project Scope Definition — Interface
    06. PROJECT LIFECYCLE

  72. Depending on the kind of task selected, the project could have:
    ● No additional data to the model knowledge
    ● Some examples to tune prompts (tens of items)
    ● A dataset to fine-tune the model (thousands of samples)
    ● A wide dataset to align the model (hundreds of thousands of samples)
    ● Documents or Knowledge Base to be searched into
    ● Rules / constraints
    ● APIs providing data
    ● Labelled domain entities
    ● Languages to be supported
    Project Scope Definition — Data
    06. PROJECT LIFECYCLE

  73. Model Selection
    Based on the kind of information handled by the model, the available dataset, and the
    regulatory constraints of the company, it is possible to select a proper model along these
    dimensions:
    ● model generation: medium-size models (BERT, RoBERTa, GPT) or LLMs (GPT-4, etc.)
    ● decoder-only, encoder-only, or encoder/decoder models
    ● open source (LLaMA 2, Falcon, Bloom, MPT) or commercial models (GPT-3.5-turbo, GPT-4,
    Coral)
    Commercial models are provided with API-only access by players such as OpenAI, Cohere,
    Google, and Amazon.
    Open source models are trained and then released on hubs such as
    Hugging Face, to be downloaded or launched into the cloud.
    Usually the budget required to train a new LLM from scratch makes it unfeasible for most
    companies, which prefer to leverage adaptation and alignment techniques.
    06. PROJECT LIFECYCLE

  74. Adaptation & Alignment
    Optimize machine learning models for task-specific performance by employing specialized
    techniques.
    Assess the potential for fine-tuning models to enhance their adaptability to specific
    languages and application domains.
    Conduct a rigorous evaluation of model alignment to customize tone of voice, implement
    safety measures, and mitigate the risk of generating false or misleading information.
    These techniques can be used independently or jointly to obtain the best result. Despite
    being extremely useful, they require different amounts of data and should be selected
    wisely according to the available human effort.
    06. PROJECT LIFECYCLE

  75. Adaptation & Alignment — Techniques
    The main techniques available for model adaptation and alignment are:
    ● Prompt Engineering: the main technique to customize model behavior. It consists
    of crafting the message provided to an LLM using a few samples and/or guided
    instructions. Requires just a few examples and strong model knowledge.
    ● Parameter Efficient Fine-Tuning (PEFT): a new set of adapted parameters is trained to
    further specialize the model in understanding a language subset or specific topics
    (such as domain-specific wording). These layers can be archived and attached to the
    model when needed. It requires a few hundred samples and computing power to train the
    model properly.
    ● Reinforcement Learning with Human Feedback (RLHF): the most powerful
    technique, consisting of training a reinforcement learning model to
    assign a score to the model's generated sentences based on their alignment with specific
    guidelines (e.g. adapting tone of voice, avoiding profanity, increasing friendliness). It
    requires thousands of samples, often tens of thousands, human-crafted to capture
    feedback.
    06. PROJECT LIFECYCLE

  76. Application Integration
    Large Language Models' knowledge is frozen at training time. Moreover,
    their understanding is bounded by either the training dataset or the context length (often less
    than 64K).
    To overcome these limits and improve model responsiveness, a number of techniques have been
    developed to enable Retrieval Augmented Generation (RAG):
    ● Data Efficiency: Allows the model to selectively focus on relevant pieces of the knowledge
    base, thus making the best use of available data.
    ● Scalability: RAG enables the model to leverage external databases, which is essential for
    scaling up without retraining the entire model.
    ● Improved Accuracy: Combining the advantages of retrieval-based and generative models, RAG
    enhances question-answering performance.
    ● Context Relevance: Better at providing contextually relevant answers compared to traditional
    LLMs, as it pulls in documents that relate to the question being asked.
    One of the fundamental key points of this technique is the knowledge encoding process
    obtained through embeddings.
    06. PROJECT LIFECYCLE

  77. Deploy
    Deployment is one of the most important aspects of LLM management, directly tied to running costs
    and performance.
    In many cases the size of the model (possibly including its PEFT layers) is too big to be
    handled on a single GPU, thus requiring strong investment or preventing some use cases such as
    embedded deployments.
    A few techniques such as quantization and Low-Rank Adaptation (LoRA) offer a good tradeoff
    between model precision and size.
    Deployment considerations also involve the evaluation of the release strategy:
    ● SaaS services such as OpenAI are a cost-effective solution, but come with some regulatory and
    performance constraints.
    ● Dedicated cloud deployments such as Google Vertex AI, Amazon Bedrock, or Microsoft Azure OpenAI
    offer a GDPR-compliant and data-safe environment while preserving a good cost balance.
    ● On-premise deployment offers data locality, with a provisioning cost that needs to be carefully
    considered.
    06. PROJECT LIFECYCLE

  78. A set of questions to shape an LLM application project:
    ○ Which task should the application accomplish?
    ○ What is the size of the available dataset?
    ○ Is the model handling GDPR-sensitive data or data subject to other privacy constraints?
    ○ How is live data available (API, export, database)?
    ○ Which languages have to be supported?
    ○ Do we need a chatbot or any other type of UI?
    ○ SaaS, cloud, or on-prem deployment?
    Project Checklist
    06. PROJECT LIFECYCLE

  79. Incremental Projects Lifecycle (IPL)
    Sometimes requirements are unclear and the project scope cannot be defined before a working prototype is built. In such cases,
    an incremental approach is preferable because it offers the customer an understanding of the direction the solution is heading,
    while keeping the budget in check. The Generative AI Project Lifecycle can be grouped into three phases, aimed at showcasing
    feasibility and matching business requirements.
    A Proof-of-Concept (PoC) is the initial phase, where requirements and project scope need to be properly defined. In this phase, model
    capabilities are also matched against customer requirements, and a baseline showcasing the expected result is demonstrated.
    Sometimes, due to the uncertainty of the environment and the continuous development of the technology, the PoC phase is replaced by a
    Research and Development (R&D) phase, which allows for better management of uncertainty within a constrained effort.
    In the Minimum Viable Product (MVP) phase the model performance is tailored to production requirements and the main features of the
    solution are developed.
    The release phase accounts for all the integration features, GUIs, and deployments needed to support scalability and reliability.
    06. PROJECT LIFECYCLE

  80. IPL — Phases
    Research and Development (R&D)
    Description: Starts with project kick-off and covers all the solution
    design process, requirements mapping, model evaluation, selection,
    and initial prompt engineering. Usually an alternative to the PoC phase.
    Outcome: R&D report; specific tests / PoCs.
    Target users: internal users; stakeholders.
    Proof-of-Concept (PoC)
    Description: Starts with project kick-off and covers all the solution
    design process, requirements mapping, model evaluation, selection,
    and initial prompt engineering.
    Outcome: solution project; critical path definition; budget estimation;
    working prototype in a sandbox environment or on exported data.
    Target users: internal users; stakeholders; project team.
    Minimum Viable Product (MVP)
    Description: Starts when the PoC is approved. Has the goal of fine-tuning
    models and prompts, with eventual model alignment using RLHF. Iterates
    multiple times through evaluation and engineering sub-phases. Then
    integrations with the systems providing data are built.
    Outcome: viable product implementing the requested features, working on
    customer data; critical path implementation; integrations with customer
    systems; production ready.
    Target users: stakeholders; end users.
    Release
    Description: Aims to scale the MVP towards the customer base, accounting
    for reliability and high availability.
    Outcome: released full-feature solution; customer training (optional).
    Target users: end users; general audience.
    06. PROJECT LIFECYCLE

  81. ● A managed LLM platform that offers a variety of models,
    selectable with just an API parameter
    ● Models range from GPT-3.5-turbo and GPT-4 to
    CLIP, Ada, and DALL-E
    ● SDK to invoke APIs
    ● Pay-as-you-go pricing model
    OpenAI API
    07. TOOLS
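
    A minimal sketch with the OpenAI Python SDK (v1); it assumes the OPENAI_API_KEY environment variable is set:

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Explain RAG in one sentence."},
        ],
    )
    print(response.choices[0].message.content)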

  82. ● A managed LLM platform that offers a variety of models,
    selectable with just an API parameter
    ● Models range from Anthropic Claude to Meta LLaMA 2
    to Amazon's proprietary Titan models
    ● Amazon SDK to invoke APIs
    ● Pay-as-you-go pricing model
    Amazon Bedrock
    07. TOOLS
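
    A minimal sketch with boto3 and the Bedrock runtime; the region and model ID are example assumptions:

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "prompt": "\n\nHuman: Explain RAG in one sentence.\n\nAssistant:",
        "max_tokens_to_sample": 200,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    print(json.loads(response["body"].read())["completion"])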

  83. ● The managed LLM platform offers a variety of
    models, selectable with just an API parameter,
    directly in the Google Cloud platform
    ● Supports Google proprietary models such as PaLM,
    Codey, Imagen, and MedLM
    ● Google SDK to invoke APIs
    ● Pay-as-you-go pricing model
    Google Vertex AI
    07. TOOLS

  84. 07.2.
    Run your own LLM

  85. SageMaker JumpStart provides one-click, end-to-end
    solutions for many common machine learning use cases, such
    as demand forecasting, credit rate prediction, fraud
    detection, and computer vision.
    ● Manage model lifecycle: deploy, fine-tune, and evaluate
    pre-trained models from popular model hubs through the
    JumpStart landing page in the updated Studio experience.
    ● Run inference: access pretrained models, solution
    templates, and examples through the JumpStart landing
    page in Amazon SageMaker Studio Classic.
    Amazon SageMaker JumpStart
    07. TOOLS

  86. A hub platform that allows developers to upload, share, and deploy
    models with ease.
    It saves developers the time and computational
    resources required to train models from scratch.
    ● Portability: Hugging Face supports various
    deployment strategies and providers, from Amazon
    to its own hosted model versions to on-prem or other
    cloud providers.
    ● Datasets: supports storing and retrieving freely
    available datasets to train or fine-tune models.
    ● Models: supports many open source LLMs.
    Hugging Face
    07. TOOLS

  87. 07.3.
    LangChain

  88. An open-source framework for developing
    applications powered by language models:
    ● Context-aware: connect a language model to
    context sources (prompt instructions, few shot
    examples, content to ground its response in, etc.)
    ● Reason: rely on a language model to reason (about
    how to answer based on provided context, what
    actions to take, etc.)
    Supports model I/O, data retrieval, and agents, which
    abstract away the underlying modules, knowledge-base
    access, and the interfaces that use LLMs for reasoning
    and actions.
    LangChain v0.1.0
    LangChain Python
    https://github.com/langchain-ai/langchain
    LangChain JS
    https://github.com/langchain-ai/langchainjs
    07. TOOLS

  89. Inputs and outputs can be parsed, and an LLM can
    be instantiated to run locally (suitable for small
    models) or invoked remotely through APIs.
    LangChain offers utilities to build a chat template
    and a string output parser, and then chain the
    parts together.
    With LangChain, model-specific characteristics
    and prompt templates are abstracted away from
    the project and encapsulated into reusable
    modules.
    LangChain (example)
    07. TOOLS
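
    A minimal sketch of such a chain with LangChain v0.1; it assumes the langchain-openai package is installed and an OpenAI API key is configured:

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a concise technical assistant."),
        ("user", "Explain {topic} in one sentence."),
    ])
    model = ChatOpenAI(model="gpt-3.5-turbo")
    chain = prompt | model | StrOutputParser()  # chain the parts together

    print(chain.invoke({"topic": "retrieval augmented generation"}))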

  90. 08.
    The road ahead

  91. LLMs are having a broad impact across many domains:
    ● LLMs excel at understanding and generating human-like text, transforming how we interact with
    information.
    ● Improved conversational AI, making interactions with chatbots and virtual assistants more natural and
    efficient.
    ● Aiding in writing, coding, and artistic endeavors by suggesting ideas and content.
    ● Automating routine writing tasks, enabling focus on more complex creative or analytical work.
    ● Facilitating language translation and content accessibility for diverse populations.
    The Impact of LLMs
    08. THE ROAD AHEAD

  92. LLMs' unprecedented capabilities are constrained by costs, environmental impact, sustainability, and
    ownership. These concerns are shifting interest toward smaller models:
    ● Small models can be equally or more effective, especially for specific tasks or domains
    ● Smaller models reduce computational costs and latency, offering a more economical AI solution
    without compromising performance.
    ● Small models excel in targeted tasks, providing depth in specific domains rather than generalizing
    across multiple areas.
    ● Small models encourage curated datasets, enhancing training effectiveness and data security.
    ● Combining small models, each with specific strengths, leads to powerful orchestrated solutions akin
    to a team of specialists.
    Issues with LLMs
    08. THE ROAD AHEAD
    The Ever-Growing Power of Small Models
    https://blog.salesforceairesearch.com/the-ever-growing-power-of-small-models/

  93. Large Action Models represent a significant shift in AI, promising to automate processes and augment
    human abilities, potentially transforming personal assistance and organizational efficiency.
    ● Agents capable of performing tasks autonomously, moving beyond mere response generation to active
    task execution.
    ● Act as advanced personal assistants, automating tasks across both professional and personal domains.
    ● Designed to adapt to changing circumstances and update their actions accordingly.
    ● Utilize human feedback and data analysis to refine behaviors and decision-making.
    ● Use cases:
    ● Marketing Automation: Streamlining marketing campaigns by integrating data, tools, and domain-
    specific agents.
    ● Organizational Transformation: Enhancing business operations, customer interactions, and decision-
    making processes.
    Large Action Models (LAM)
    08. THE ROAD AHEAD
    The Ever-Growing Power of Small Models
    https://blog.salesforceairesearch.com/the-ever-growing-power-of-small-models/

  94. Amazon PartyRock
    https://partyrock.aws/
    D&D Adventure Generator
    https://partyrock.aws/u/aletheia/PAzHuQ1EN/DandD-Adventure-Generator

  95. Thank You.
    25125 BRESCIA, VIA ORZINUOVI, 20
    20137 MILANO, VIA PRIVATA DECEMVIRI, 20
    WWW.NEOSPERIENCE.COM
    Download the slides
    https://bit.ly/41ZQa5L
    Provide your feedback
    https://bit.ly/3vvjUeE
