GenAI for the rest of us

Aletheia
January 16, 2024

An introductory talk designed to unveil the mysteries and potentials of generative artificial intelligence. This session will provide a concise overview of how generative AI models like GPT and DALL-E work, their applications in various fields, and the ethical considerations surrounding their use. Attendees will gain insights into the transformative impact of these technologies and explore how they're shaping the future of creativity, automation, and human-AI interaction.

Transcript

  1. Generative AI from Zero to Hero in less than four hours.
    GenAI for the rest of us

  2. Who am I?
    Luca Bianchi, PhD
    Chief Technology Officer @ Neosperience and Neosperience Health,
    proud AWS Serverless Hero, passionate about software architectures,
    serverless, and machine learning.
    Serverless Italy and [Gen]AI Milano Meetup co-founder.
    ServerlessDays Milano co-organizer.
    github.com/aletheia
    https://it.linkedin.com/in/lucabianchipavia
    https://speakerdeck.com/aletheia
    bianchiluca.com
    @bianchiluca

  3. Who am I?
    Janos Tolgyesi
    Machine Learning Team Leader @ Neosperience, proud AWS
    Community Builder, passionate about machine learning.
    [Gen]AI Milano Meetup co-founder.
    github.com/mrtj
    linkedin.com/in/janostolgyesi
    @jtolgyesi

  4. Summary
    1. The AI Landscape
    2. Transformers
    3. Foundation Models
    4. Advancements on LLMs
    5. Improving LLM behaviour
    6. Project Lifecycle
    7. Tools
    8. The road ahead

  5. 01.
    The AI Landscape

  6. The landscape of AI is vast and continually evolving, with various subfields offering
    specialized applications. Awareness of this landscape is essential for professionals
    across different disciplines who wish to harness AI's capabilities effectively.
    Machine Learning: The backbone of modern AI, machine learning algorithms
    allow computers to learn from data. Techniques range from supervised to
    unsupervised learning, with deep learning becoming increasingly popular.
    Natural Language Processing (NLP): A subfield focused on the
    interaction between computers and human languages. Examples include machine
    translation, chatbots, and sentiment analysis.
    Computer Vision: Enables machines to interpret visual information from the
    world. Key applications include facial recognition, medical image analysis, and
    autonomous vehicles.
    Reinforcement Learning: A subfield of machine learning where agents learn
    to make decisions by interacting with an environment to achieve a goal. Used in
    areas like game theory, robotics, and recommendation systems.
    The AI Landscape
    01. AI LANDSCAPE

  7. AI is more than LLMs
    There are five categories of
    artificial intelligence ("tribes"),
    each modeling a different aspect
    of human rational reasoning.
    01. AI LANDSCAPE

  8. 01. AI LANDSCAPE

  9. Neural Networks (the short version)
    01. AI LANDSCAPE

  10. Understanding the semantic relationships between terms such as
    synonyms, antonyms, and hyponyms is crucial for precise communication.
    Language abstraction involves using general or specific terms to convey
    complex ideas, sometimes creatively through metaphors. Idioms, like "break a leg,"
    carry culturally specific meanings that aren't literal.
    Paradoxes are statements that seemingly contradict themselves but might still
    be true, challenging our understanding of logic.
    Lastly, meta-language describes another language, offering a secondary layer
    of understanding, like code comments in programming.
    In specialized fields like marketing, medicine, and software development, these
    linguistic elements are vital. Marketers need to understand cultural idioms,
    medical professionals require semantic precision for accurate diagnosis and
    treatment, and developers use abstraction and meta-language for code
    efficiency and collaboration.
    Understanding these nuances aids in tailored communication and problem-solving
    across these disciplines.
    Language is difficult (for machines)
    01. AI LANDSCAPE

  11. The key algorithm was, for more than
    a decade, the Recurrent Neural Network
    (RNN), which is capable of generating text
    based on a sequence of characters.
    This presents a lot of challenges:
    I took my money to the bank (river bank?)
    The teacher taught the student with the book (who used the book?)
    The movie was great, but the theatre was awful (what is the sentiment?)
    Language generation (before 2017)
    01. AI LANDSCAPE

  12. 02.
    Transformers

  13. What is self-attention?
    02. TRANSFORMERS
    Self-attention is a mechanism that allows the model to learn the
    strong semantic correlations between words in a given sentence.
    Attention is “focusing on the most important parts of the input data.”
    Technically speaking, attention measures the similarity between
    two vectors and returns the weighted similarity scores.
    A standard attention function takes three main inputs: query, key,
    and value vectors.
    A Holistic Guide to the Transformer Neural Network Architecture
    https://deeprevision.github.io/posts/001-transformer/
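
    As an illustration (not from the deck), here is a minimal sketch of scaled dot-product attention in Python with NumPy; the shapes and values are toy assumptions:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V for a single attention head
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # similarity between queries and keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V  # weighted sum of the value vectors

    # Toy example: 3 tokens with 4-dimensional representations
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)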

  14. A breakthrough paper from Google, presenting the Transformers architecture:
    ● Replaces Recurrence: Traditional sequence-to-sequence models like RNNs and LSTMs rely on
    recurrent mechanisms that process each token in sequence. The attention mechanism replaces this
    by calculating the relationships between all words in parallel, thereby eliminating the need for
    recurrent layers.
    ● Parallelization: Because attention computes relationships simultaneously, the model can process
    multiple parts of the input at the same time. This allows for faster computation and significantly
    reduces training time.
    ● Computes Relationships: Attention weighs the importance of different parts of the input when
    producing each element in the output. This is especially effective in capturing long-range
    dependencies within sequences that RNNs and LSTMs often struggle with.
    Attention is all you need (2017)
    02. TRANSFORMERS

  15. It transforms the input sequence into a compressed representation.
    Architecture: Originally, the transformer architecture featured six encoder
    blocks, though this number can vary depending on the size of the
    architecture.
    Encoder Block Structure: Each encoder block consists of three main layers:
    ● Multi-Head Attention (MHA): Enables the model to simultaneously focus on
    different parts of the input sequence.
    ● Layer Normalization: Standardizes the outputs of each sub-layer before
    they are passed to the next.
    ● MLPs (Multi-Layer Perceptrons): As feedforward layers, they process the
    sequence further.
    Sub-Layers and Additional Components: MHA and MLPs are considered sub-
    layers. They are interconnected with layer normalization, dropout, and residual
    connections, which are crucial for the flow and efficiency of the architecture.
    Importance of Encoder Layer Count: The number of encoder layers (initially
    six) correlates with the model's size and ability to capture the global context
    of input sequences. More layers generally lead to better task generalization
    due to a more comprehensive context understanding.
    Transformers (Encoder)
    02. TRANSFORMERS

  16. The decoder closely resembles the encoder but includes an
    additional multi-head attention layer that operates over the
    encoder's output.
    This extra layer is essential for integrating the encoder's output with
    the target sequence.
    The decoder's primary function is to merge the encoder's output with
    the target sequence to make predictions or determine the next
    token.
    Masked Attention in Decoder: To ensure the prediction process's
    integrity, the decoder's attention mechanism is masked. This masking
    prevents the current token being processed from attending to
    subsequent tokens in the target sequence. Without it, the decoder
    would have easy access to future sequence information, potentially
    leading to overfitting and poor generalization outside the training
    data.
    Similar to the encoder, the decoder is repeated multiple times for
    effectiveness. The original transformer model had six decoder
    blocks, mirroring the number of encoder blocks.
    Transformers (Decoder)
    A Holistic Guide to the Transformer Neural Network Architecture
    https://deeprevision.github.io/posts/001-transformer/
    02. TRANSFORMERS

  17. ● Starting with images in pixel space and considering noising and de-noising processes.
    ● A neural network can be trained to gradually de-noise data starting from pure noise, by teaching the model
    to predict the noise that was added.
    ● It is called the noise predictor in Stable Diffusion.
    ● Noise is then subtracted from the original image.
    ● This cycle of noise addition, prediction, and subtraction is repeated multiple times to refine the image quality and detail.
    Diffusion models
    02. TRANSFORMERS
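
    To make the noise-prediction cycle concrete, here is a minimal sketch of one DDPM-style training step in PyTorch; it assumes model(x_t, t) returns a noise estimate and alphas_cumprod is a precomputed noise-schedule tensor:

    import torch
    import torch.nn.functional as F

    def ddpm_training_step(model, x0, alphas_cumprod, optimizer):
        # Sample a random timestep and Gaussian noise for each image (B, C, H, W)
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
        noise = torch.randn_like(x0)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        # Forward (noising) process: mix image and noise according to the schedule
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        # The noise predictor learns to recover the noise that was added
        loss = F.mse_loss(model(x_t, t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()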

  18. ● Diffusion models working in pixel space are very slow
    because the image space is enormous.
    ● A more efficient approach is to compress the image with
    a Variational Auto Encoder (VAE), thus reducing
    dimensions.
    ● Details are preserved due to redundancy in real-life
    images.
    ● VAE projects the image into the latent space.
    ● Reverse diffusion is applied to the latent space, thus
    resulting in latent diffusion models.
    ● Conditioning the latent space with a tensor coming from
    a text prompt produces the described output from
    denoising.
    ● Stable Diffusion is a latent diffusion model.
    Latent Diffusion models
    The Annotated Diffusion Model
    https://huggingface.co/blog/annotated-diffusion
    02. TRANSFORMERS

  19. Large Language Models present a promising technology with extensive capabilities to enhance our
    interaction with data and natural language across various sectors. However, their deployment must be
    considered carefully due to challenges like hallucination and the need for content moderation.
    ● Trained on Massive Text Datasets: Models like GPT-3, BERT, and Transformers are trained on extensive
    collections of text data, including books, news articles, and online conversations.
    ● Automatic Learning of Word & Phrase Relationships: These models utilize embeddings to
    autonomously understand the relationships between words and phrases, enabling them to generate
    text.
    ● Diverse Applications: Large Language Models have a wide range of uses, from content creation and
    virtual assistance to machine translation and Natural Language Processing.
    ● Challenges: These models can sometimes produce hallucinated information that isn't accurate, and
    their outputs often require moderation to ensure reliability and appropriateness.
    Large Language Models (LLM)
    02. TRANSFORMERS

  20. LLM Landscape
    02. TRANSFORMERS

  21. 03.
    Foundation Models

  22. ● Based on massive datasets, foundation models (FMs) are large deep-learning neural networks and a
    starting point to develop ML models that power new applications more quickly and cost-effectively.
    ● Trained on a broad spectrum of generalized and unlabeled data capable of performing various general
    tasks such as understanding language, generating text and images, and conversing in natural language.
    ● Focus on adaptability. Perform a wide range of disparate tasks with high accuracy based on input
    prompts.
    ● Can be used as base models for developing more specialized downstream applications.
    ● The computational power required for foundation models has doubled every 3.4 months since 2012.
    Foundation Models (FM)
    03. FOUNDATION MODELS

  23. ● Generative Pre-trained Transformer 3 (GPT-3) is
    a large language model released by OpenAI in 2020.
    ● Decoder-only transformer model of deep neural
    network
    ● Uses a 2048-tokens-long context
    ● Involves 175 billion parameters
    ● Requires 800GB of storage space
    ● Strong "zero-shot" and "few-shot" learning abilities
    on many tasks
    GPT-3
    03. FOUNDATION MODELS

  24. ● A fine-tuned version of GPT-3
    ● Provides better human-like conversation
    capabilities
    ● Uses a 2K, 4K, and 16K tokens-long context
    ● Powered the first release of ChatGPT
    ● Generates answers in 10-15 seconds
    GPT-3.5-turbo
    03. FOUNDATION MODELS

  25. ● A new model with improved reasoning capabilities
    ● Multi-modal capabilities
    ● Uses a 32K tokens-long context
    ● Can solve difficult problems with greater accuracy,
    thanks to its broader general knowledge and
    problem-solving abilities
    ● Improved alignment and safety
    ● Supports function calling
    ● It’s the basis of OpenAI GPTs
    GPT-4
    03. FOUNDATION MODELS

  26. ● Released in 2018, Bidirectional Encoder
    Representations from Transformers (BERT)
    was one of the first foundation models.
    ● BERT is a bidirectional model that analyzes
    the context of a complete sequence and
    then makes a prediction.
    ● It was trained on a plain text corpus and
    Wikipedia using 3.3 billion tokens (words)
    and 340 million parameters.
    ● BERT can answer questions, predict
    sentences, and translate texts.
    BERT
    03. FOUNDATION MODELS

  27. ● Open source LLM from Meta
    ● Models ranging from 7B to 70B parameters
    ● Fine-tuned models for chat
    ● Fine-tuned models for code generation
    ● Free for research and commercial use
    LLaMA 2
    03. FOUNDATION MODELS

  28. ● Falcon 180B is an open-source super-powerful
    language model with 180 billion parameters, trained
    on 3.5 trillion tokens.
    ● It topped the Hugging Face Leaderboard for pre-trained
    open large language models.
    ● License for both research and commercial use
    ● Performs exceptionally well in various tasks like
    reasoning, coding, proficiency, and knowledge tests
    ● Improved accuracy over Meta's LLaMA 2.
    ● Ranks just behind OpenAI's GPT-4 and on par with
    Google's PaLM 2 Large
    Falcon 180B
    03. FOUNDATION MODELS

  29. ● Specifically trained to reduce model hallucinations.
    ● Supports 200K context window
    ● Can orchestrate across developer-defined
    functions or APIs
    ● Can search over web sources
    ● Can retrieve information from private knowledge
    bases
    ● The most advanced alternative to OpenAI's models
    Claude 2.1
    03. FOUNDATION MODELS

  30. ● Natively multi-modal model
    ● Outperforms human experts on MMLU (Massive
    Multitask Language Understanding)
    ● Generates code based on different inputs.
    ● Visual reasoning capabilities.
    ● The Ultra version has not yet been released
    ● The Pro version offers capabilities similar to GPT-3
    Gemini (Ultra)
    03. FOUNDATION MODELS

  31. ● Advanced Image Synthesis: Generates detailed,
    creative images from textual descriptions using
    deep learning techniques.
    ● Versatile Styles and Techniques: Can create images
    in various styles, including artistic and realistic
    interpretations.
    ● Enhanced Text-to-Image Capabilities: Excels in
    translating complex, abstract text descriptions into
    coherent and relevant visuals.
    ● Improved Quality and Resolution: Offers higher image
    resolution and a better understanding of text inputs
    than previous versions.
    ● Wide Range of Applications: Useful in graphic design,
    advertising, entertainment, education, and more.
    DALL-E 3
    03. FOUNDATION MODELS

  32. ● SOTA open architecture for image generation with
    3.5B parameter base model stage and 6.6B
    parameter ensemble pipeline.
    ● Native 1024x1024 image generation with cinematic
    photorealism and fine detail.
    ● Fine-tuned to create complex compositions with
    basic natural language prompting.
    SDXL Turbo
    03. FOUNDATION MODELS

  33. ● Uses advanced and proprietary AI algorithms to
    create images from textual descriptions.
    ● Produces images that have a unique, sometimes
    surreal, artistic quality.
    ● Can interpret many prompts, from straightforward
    descriptions to more abstract concepts.
    ● User feedback and interactions can influence its
    evolution.
    ● Useful for artists, designers, and creatives for
    inspiration, mock-up generation, and exploring
    visual concepts.
    Midjourney
    03. FOUNDATION MODELS

  34. 04.
    Advancements on
    LLMs

  35. CTRL (Conditional Transformer Language Model) is a breakthrough in natural language processing,
    boasting 1.6 billion parameters.
    It enhances human-AI interaction by allowing controlled generation of content and style using over 50
    control codes.
    This model is unique in its ability to trace back the influence of its training data on generated text, making
    it a versatile tool for a range of NLP applications.
    ● Control Codes: Allow explicit influence over style, genre, entities, etc.
    ● Predictable Variation: Enables variation in generated text based on control codes.
    ● Source Attribution: Identifies data sources influencing text generation.
    Salesforce CTRL Model
    04. ADVANCEMENTS ON LLMS
    Introduction to Salesforce CTRL Model
    https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/

  36. Mixtral 8x7B is a high-quality sparse
    mixture-of-experts model (SMoE) with open
    weights. It handles a context of 32k tokens.
    ● Sparse MoE layers are used instead of
    dense feed-forward network (FFN)
    layers. MoE layers have a certain number
    of “experts” (e.g. 8), where each expert
    is an FFN, or even a MoE itself
    ● A gate network or router determines
    which tokens are sent to which expert
    Mixtral of Experts (MoE)
    04. ADVANCEMENTS ON LLMS
    Mixture of Experts Explained
    https://huggingface.co/blog/moe#what-is-a-mixture-of-experts-moe
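
    A minimal sketch of the routing idea in PyTorch (a toy layer to illustrate gating, not Mixtral's actual implementation): a gate network scores the experts and each token is processed only by its top-2 experts:

    import torch
    import torch.nn as nn

    class ToyMoELayer(nn.Module):
        def __init__(self, d_model, n_experts=8):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.router = nn.Linear(d_model, n_experts)  # the gate network

        def forward(self, x):  # x: (tokens, d_model)
            gates = self.router(x).softmax(dim=-1)
            weights, idx = gates.topk(2, dim=-1)  # top-2 experts per token
            weights = weights / weights.sum(-1, keepdim=True)
            out = torch.zeros_like(x)
            for slot in range(2):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e  # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
            return out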

  37. 05.
    Improving LLM
    behaviour

  38. Building a Large Language Model is a difficult task for a number of reasons:
    Training:
    ● High demand for computing power: massive GPU clusters are often needed to crunch data for days to train a
    single LLM, so training an LLM from scratch can require billions of dollars of investment.
    ● Training requires deep knowledge of both data science and MLOps to properly optimize the network architecture
    and the underlying infrastructure.
    Inference:
    ● Model context length is bounded to a very small size (8K, 32K, 100K), which means the model's contextual
    knowledge is limited and must be optimized.
    ● Model knowledge is frozen at training time: no new elements, tailored knowledge, or real-time data can
    be used by the model.
    ● Model hallucination can produce biased answers or discuss topics not aligned with customer guardrails.
    ● Performance is difficult to evaluate.
    Plain LLM — challenges
    05. IMPROVING LLM BEHAVIOUR

  39. Model inference can be deeply improved using innovative techniques such as:
    ● Parameter Efficient Fine-Tuning (PEFT) to fine-tune models in an efficient fashion with a reduced
    number of dataset items.
    ● Model Quantization and Low-Rank Adaptation (QLoRA) to efficiently reduce the model size and the
    computing memory needed for inference.
    ● Embeddings to encode vast knowledge and empower retrieval capabilities.
    ● Retrieval Augmented Generation (RAG) to complement the model with external knowledge.
    ● Guidelines to set boundaries on model behaviour.
    ● Reinforcement Learning with Human Feedback (RLHF) to align the model with a specific tone of voice
    and company values, and to avoid misbehaviour.
    Improving LLMs — techniques
    05. IMPROVING LLM BEHAVIOUR

  40. 05.1.
    Prompt Engineering

  41. ● Prompt Engineering is the art and science of crafting inputs (prompts) to elicit desired responses from
    AI models, particularly in language processing and generative tasks.
    ● Optimizes the interaction between humans and AI, enhancing the quality, relevance, and accuracy of
    the AI's output.
    ● Key components:
    ● Precision: Carefully choosing words and phrases to guide the AI towards the intended interpretation.
    ● Context: Providing sufficient background information for the AI to understand the query.
    ● Clarity: Avoiding ambiguity to minimize misinterpretation.
    What is Prompt Engineering?
    05. IMPROVING LLM BEHAVIOUR
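
    To make these components concrete, here is a small hypothetical few-shot prompt (written as a Python string); the wording and examples are illustrative assumptions:

    # The first line provides context, the instructions add precision and
    # clarity, and the labeled examples guide the expected output format.
    prompt = """You are a customer-support assistant for an e-commerce store.
    Classify each message as POSITIVE, NEGATIVE, or NEUTRAL. Answer with exactly one word.

    Message: "My order arrived two days early, great service!"
    Sentiment: POSITIVE

    Message: "Still waiting for a refund after three weeks."
    Sentiment: NEGATIVE

    Message: "Do you ship to Italy?"
    Sentiment:"""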

  42. Principles
    05. IMPROVING LLM BEHAVIOUR

  43. Principles
    05. IMPROVING LLM BEHAVIOUR

  44. 05.2.
    Parameter Efficient
    Fine Tuning (PEFT)

  45. Parameter-efficient fine-tuning (PEFT) methods enable efficient adaptation of pre-trained language models
    (PLMs) to various downstream applications without fine-tuning all the model's parameters.
    Fine-tuning large-scale PLMs is often prohibitively costly. PEFT methods fine-tune only a small number of
    (extra) model parameters, thereby significantly decreasing the computational and storage costs.
    Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.
    ● Needs to train only 15%-20% of the original LLM weights, making the training process less expensive
    ● Updates only a small subset of parameters. This helps prevent catastrophic forgetting.
    ● Works well in sparse data environments
    ● Easy portability and deployment
    ● Economically efficient approach
    What is PEFT?
    05. IMPROVING LLM BEHAVIOUR

  46. Performing full fine-tuning can lead to catastrophic
    forgetting because it changes all parameters of the
    model.
    Since PEFT only updates a small subset of
    parameters, it's more robust against this
    catastrophic forgetting effect.
    Training LLMs is computationally intensive.
    Full fine-tuning requires memory to store the model,
    plus various other quantities needed during the
    training process: optimizer states, gradients, forward
    activations, and temporary memory.
    Full fine-tuning an LLM
    05. IMPROVING LLM BEHAVIOUR

  47. Several approaches to PEFT can balance different kinds of trade-offs on
    costs, memory, training speed, and performance:
    ● Selective methods identify which parameters you want to update:
    train only certain components of the model or specific layers, even
    individual parameter types
    ● Reparameterization methods work with low-rank representations
    of the original weights, like LoRA (Low-Rank Adaptation)
    ● Additive methods carry out fine-tuning by keeping all of the original
    LLM weights frozen and introducing new trainable components
    ● Adapter methods add new trainable layers to the model's
    architecture, typically inside the encoder or decoder components
    after the attention or feed-forward layers.
    ● Soft prompt methods keep the model architecture fixed and frozen
    and focus on manipulating the input to achieve better performance.
    This can be done by adding trainable parameters to the prompt
    embeddings, keeping the input fixed, and retraining the embedding
    weights
    PEFT Trade-offs
    05. IMPROVING LLM BEHAVIOUR

  48. The goal is to find an efficient way to update the weights of
    the model without having to train every single parameter
    again. The transformer's basic structure is built with
    encoder-decoder blocks, self-attention, and feedforward
    networks; LoRA applies a slight modification to the
    self-attention network.
    ● Performs parameter-efficient fine-tuning
    with a single GPU, avoiding the need for a distributed
    cluster of GPUs.
    ● Fine-tunes a different set of weights for each task; they can
    be switched out at inference time by updating the weights.
    ● LoRA is broadly used in practice because of its
    performance, comparable to full fine-tuning for many
    tasks and data sets.
    LoRA
    05. IMPROVING LLM BEHAVIOUR
    Parameter Efficient Fine Tuning
    https://medium.com/@kanikaadik07/peft-parameter-efficient-fine-tuning-55e32c60c799
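
    A minimal sketch with the Hugging Face peft library; the model name and target modules are example assumptions:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    config = LoraConfig(
        r=8,                  # rank of the low-rank update matrices
        lora_alpha=32,        # scaling factor for the update
        target_modules=["q_proj", "v_proj"],  # attach LoRA to attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable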

  49. 05.3.
    Retrieval Augmented
    Generation (RAG)

  50. RAG is ingeniously designed to amplify the capabilities of generative models, such as those used for text
    generation, by uniquely integrating them with a retrieval mechanism.
    ● Retrieval Mechanism: The retrieval part of RAG involves searching a large database or corpus of
    documents to find relevant information.
    ● Generative Model: Relevant information is fed into an LLM to generate a response or continuation of
    the text.
    ● Advantages: The primary advantage of RAG is that it allows the generative model to produce highly
    relevant responses informed by up-to-date or specialized information.
    ● Applications: RAG can be used in various applications such as chatbots, question-answering systems,
    content creation tools, and more.
    ● Continuous Learning: The retrieval database can be periodically updated, allowing the system to stay
    current with new information.
    What is RAG?
    05. IMPROVING LLM BEHAVIOUR

  51. What is RAG?
    05. IMPROVING LLM BEHAVIOUR

  52. ● Acquire source documentation
    ○ web crawling,
    ○ data lake extraction,
    ○ connecting to proprietary databases, etc.
    ● Convert documentation from source formats (pdf, doc, html, etc.) to plain text / lightweight structured
    format (simple html, markdown)
    ● Create text chunks
    LOAD: Document preprocessing
    05. IMPROVING LLM BEHAVIOUR

  53. ● Divide text into syntactically correlated parts like phrases, paragraphs, and sections
    ● Make multiple relevant parts of the source document fit in the context of the LLM
    ● Ideally one chunk also discusses a single topic
    ● Divide the source text by phrases (punctuation marks), paragraphs (newline characters), or, if it is
    markdown, by different levels of headers.
    ● Engineering decisions:
    ○ What is the maximum size of a chunk, as a function of the context length len(context) of the
    LLM, the number of chunks (k) we want to stuff in the context, and the length of the instructive
    part of the prompt (len(prompt))?
    ○ Whether to use overlap between chunks, and if yes, how much?
    ● max(len(chunk)) * k + len(prompt) < len(context)
    SPLIT: Text splitting
    05. IMPROVING LLM BEHAVIOUR
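
    A minimal sketch of fixed-size chunking with overlap, honoring the budget constraint above; the parameter values are illustrative:

    def split_text(text, max_chunk=1000, overlap=100):
        # Split text into chunks of at most max_chunk characters,
        # sharing `overlap` characters between consecutive chunks.
        assert max_chunk > overlap
        chunks, start = [], 0
        while start < len(text):
            end = min(start + max_chunk, len(text))
            chunks.append(text[start:end])
            if end == len(text):
                break
            start = end - overlap  # step back to create the overlap
        return chunks

    # The budget constraint: k chunks plus the instructive prompt must fit the context.
    k, len_prompt, len_context = 4, 500, 8000
    max_chunk = (len_context - len_prompt) // k  # largest chunk size that satisfies it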

  54. ● Text embeddings are a representation of text in the form of numeric vectors
    ● They capture the semantic meanings of words, phrases, sentences, or chunks of documents
    ● Properties:
    ○ Map documents of different lengths into a fixed-length, smaller dimension vector-space (typical
    number of dimensions: 512, 768, 1536, etc.)
    ○ Semantic similarity: similar words, phrases or concepts are mapped to vectors that are “close” to
    each other.
    ○ Modern embedding models are neural network based and represent the model’s “understanding”
    of the text
    ○ Vector algebra works: emb(“king”) - emb(“man”) + emb(“woman”) ≅ emb(“queen”)
    ● Examples: Ada-002 (OpenAI), Titan Embeddings (Amazon), Gecko (Vertex AI / Google)
    ● Use cases: semantic search, classification, clustering, outlier detection.
    EMBED: Embeddings
    05. IMPROVING LLM BEHAVIOUR
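
    A minimal sketch with the sentence-transformers library; the model name is just one example of an embedding model:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors
    docs = ["The cat sat on the mat.",
            "A kitten rested on the rug.",
            "Quarterly revenue grew by 12%."]
    emb = model.encode(docs)  # shape: (3, 384)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb[0], emb[1]))  # high: semantically similar sentences
    print(cosine(emb[0], emb[2]))  # low: unrelated topics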

  55. ● A non-parametric supervised learning method that identifies the k “nearest” data points in a dataset to a
    given input
    ○ “nearest” is interpreted by an appropriate metrics that defines a distance function between data
    points
    ○ k is a user-defined parameter
    ● Example metrics: Euclidean distance or cosine similarity for real vectors, Manhattan distance for
    coordinates, Hamming distance for binary data, Jaccard distance for sets, etc.
    ● Algorithms with their computational complexity as a function of dimension (d) and dataset size (N):
    ○ brute force, exact k-NN: O(d * N)
    ○ Hierarchical Navigable Small Worlds (HNSW): O(log N)
    ○ Inverted File System (IVF): depending on the dataset and parameters, often < O(N)
    ● Other factors to consider: does the algorithm support pre-filtering?
    k-nearest neighbors (k-NN)
    05. IMPROVING LLM BEHAVIOUR
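
    A minimal sketch of brute-force exact k-NN with cosine similarity, the O(d * N) baseline mentioned above:

    import numpy as np

    def knn(query, vectors, k=3):
        # Return the indices of the k vectors most similar to the query.
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        sims = v @ q  # cosine similarity against every data point
        return np.argsort(-sims)[:k]  # indices of the k highest scores

    dataset = np.random.default_rng(0).normal(size=(10_000, 384))
    print(knn(dataset[42], dataset, k=3))  # the first hit is point 42 itself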

  56. ● A vector store is a specialized type of database designed to handle vector data efficiently.
    ○ Store high-dimensional vector data
    ○ Efficient similarity search (knn) implementations
    ○ Might scale up to billions of data points
    ○ Indexing for fast retrieval (for approximate k-nn algorithms)
    ● Might support storing structured metadata along the vectors
    ● Might support pre-filtering or post-filtering of the data points based on predicates on metadata.
    ● Examples: a lot of open source options (OpenSearch, pgvector, Chroma, Milvus, LlamaIndex, Apache Cassandra,
    etc.) and commercial ones (Pinecone, Couchbase, MongoDB Atlas, etc.)
    ● Factors to consider when choosing a vector store: total cost of ownership, scalability, features
    (updates, filtering, etc.)
    STORE: Vector Store
    05. IMPROVING LLM BEHAVIOUR
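
    A minimal sketch with the open source Chroma vector store; the collection name and documents are illustrative, and the default embedding function is assumed:

    import chromadb

    client = chromadb.Client()  # in-memory instance
    docs = client.create_collection("knowledge-base")
    docs.add(
        ids=["doc1", "doc2"],
        documents=["LLMs are trained on massive text datasets.",
                   "VAEs compress images into a latent space."],
        metadatas=[{"source": "slides"}, {"source": "slides"}],
    )
    hits = docs.query(query_texts=["How are language models trained?"], n_results=1)
    print(hits["documents"])  # the nearest chunk by embedding similarity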

  57. Hands-on
    RAG: question answering over
    custom knowledge base
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  58. 05.4.
    Multimodal Search

    View full-size slide

  59. Hands-on
    CLIP: multi-modal embedding
    models, image embeddings, vector
    store, query by text and image
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  60. Hands-on
    Stable Diffusion
    Session Jupyter Notebooks
    https://github.com/mrtj/genai-rest-of-us

  61. 06.
    Project Lifecycle

  62. “The failure rate on AI projects
    has been between 83% and 92%”
    — Fortune.com
    https://fortune.com/2022/07/26/a-i-success-business-sense-aible-sengupta

  63. “Successful AI initiatives
    require a good understanding of
    the AI project lifecycle”
    — Forbes.com
    https://www.forbes.com/sites/cognitiveworld/2022/08/14/the-one-practice-that-is-separating-the-ai-successes-from-the-failures/?sh=6df5b30e17cb

  64. Generative AI Projects Lifecycle
    The AI Value Flywheel
    Managing the project lifecycle
    06. PROJECT LIFECYCLE

  65. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    Shape the use case
    defining the task the
    project is expected to
    resolve.
    Select the interaction
    interface to be exposed
    to users.
    Define KPIs and
    constraints for the
    solution to be
    acceptable.
    Define overall project
    running budget.
    06. PROJECT LIFECYCLE

  66. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    Select the optimal
    Foundation Model (FM) to
    be used, based on
    available data, supported
    languages and regulatory
    constraints.
    06. PROJECT LIFECYCLE

  67. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    Adopt techniques to
    make models adapt to
    solve specific tasks.
    Evaluate model fine-
    tuning opportunity to
    increase model
    specificity to languages
    and tasks.
    Evaluate model alignment
    to further customize
    tone of voice, enforce
    guardrails, and prevent
    hallucinations.
    06. PROJECT LIFECYCLE

  68. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    APPLICATION
    INTEGRATION
    Integrate models with
    external data sources to
    provide up-to-date or
    real-time responses,
    overcome context
    constraints, and call APIs.
    Implement reasoning and
    acting accordingly to
    improve autonomous
    interactions.
    06. PROJECT LIFECYCLE

  69. Generative AI Projects Lifecycle
    PROJECT SCOPE
    DEFINITION
    MODEL
    SELECTION
    ADAPTATION &
    ALIGNMENT
    APPLICATION
    INTEGRATION
    DEPLOY
    Define deployment
    targets and hardware
    constraints.
    Perform model
    optimization to balance
    precision and required
    computing power.
    Exploit SaaS/Cloud/on-
    premise alternatives to
    address company
    constraints and budget.
    06. PROJECT LIFECYCLE

  70. Define project scope to properly identify the right model, based on the task to
    accomplish:
    ● essay writing
    ● summarization
    ● translation
    ● information retrieval
    ● reasoning / agents
    ● entity / sentiment recognition
    A project can leverage one or more tasks, each with corresponding models involved. It's
    quite common that more than one model is adapted to accomplish a set of tasks.
    Project Scope Definition — Task
    06. PROJECT LIFECYCLE

  71. Interaction is a pivotal feature of LLM applications. Defining the proper user interface
    could result in an excellent customer engagement opportunity.
    ● chatbot / conversational
    ● form with response
    ● API one-shot
    ● API with context memory
    For each of these options, a number of sub-cases, such as the kind of information to
    be presented, support for rich text (e.g. Markdown), and the proper format, have to be
    defined as well.
    Project Scope Definition — Interface
    06. PROJECT LIFECYCLE

  72. Depending on the kind of task selected, the project could have:
    ● No additional data to the model knowledge
    ● Some examples to tune prompts (tens of items)
    ● A dataset to fine-tune the model (thousands of samples)
    ● A wide dataset to align the model (hundreds of thousands of samples)
    ● Documents or Knowledge Base to be searched into
    ● Rules / constraints
    ● APIs providing data
    ● Labelled domain entities
    ● Languages to be supported
    Project Scope Definition — Data
    06. PROJECT LIFECYCLE

  73. Model Selection
    Based on the kind of information handled by the model, the available dataset, and the
    regulatory constraints of the company, it is possible to select a proper model along these
    dimensions:
    ● model generation: medium-size models (BERT, RoBERTa, GPT) or LLMs (GPT-4, etc.)
    ● decoder-only, encoder-only, or encoder/decoder models
    ● open source (LLaMA 2, Falcon, Bloom, MPT) or commercial models (GPT-3.5-turbo, GPT-4,
    Coral)
    Commercial models are provided with API-only access by players such as OpenAI, Cohere,
    Google, and Amazon.
    Open source models are trained and then released on hubs such as
    Hugging Face, to be downloaded or launched into the cloud.
    Usually the budget required to train a new LLM from scratch makes it unfeasible for most
    companies, which prefer to leverage adaptation and alignment techniques.
    06. PROJECT LIFECYCLE

  74. Adaptation & Alignment
    Optimize machine learning models for task-specific performance by employing specialized
    techniques.
    Assess the potential for fine-tuning models to enhance their adaptability to specific
    languages and application domains.
    Conduct a rigorous evaluation of model alignment to customize tone of voice, implement
    safety measures, and mitigate the risk of generating false or misleading information.
    These techniques can be used independently or jointly to obtain the best result. Despite
    being extremely useful, they require different amounts of data and should be selected
    wisely according to the available human effort.
    06. PROJECT LIFECYCLE

  75. Adaptation & Alignment — Techniques
    The main techniques available for model adaptation and alignment are:
    ● Prompt Engineering: the main technique to customize model behavior. It consists
    of crafting the message provided to an LLM using a few samples and/or guided
    instructions. Requires just a few examples and strong model knowledge.
    ● Parameter Efficient Fine-Tuning (PEFT): a new set of adapted parameters is trained to
    further specialize the model in understanding a language subset or specific topics
    (such as domain-specific wording). These layers can be archived and attached to the
    model when needed. It requires a few hundred samples and computing power to train the
    model properly.
    ● Reinforcement Learning with Human Feedback (RLHF): the most powerful
    technique, consisting of training a reinforcement learning model to
    assign a score to the model's generated sentences based on their alignment with specific
    guidelines (e.g. adapting tone of voice, avoiding profanity, increasing friendliness). It
    requires thousands of samples, often tens of thousands, human-crafted to capture
    feedback.
    06. PROJECT LIFECYCLE

  76. Application Integration
    Large Language Models' knowledge is frozen at training time. Moreover,
    their understanding is bounded by either the training dataset or the context length (often less
    than 64K).
    To overcome these limits and improve model responsiveness, a number of techniques have been
    developed to enable Retrieval Augmented Generation (RAG):
    ● Data Efficiency: Allows the model to selectively focus on relevant pieces of the knowledge
    base, thus making the best use of available data.
    ● Scalability: RAG enables the model to leverage external databases, which is essential for
    scaling up without retraining the entire model.
    ● Improved Accuracy: Combining the advantages of retrieval-based and generative models, RAG
    enhances question-answering performance.
    ● Context Relevance: Better at providing contextually relevant answers compared to traditional
    LLMs, as it pulls in documents that relate to the question being asked.
    One of the fundamental key points of this technique is the knowledge encoding process
    obtained through embeddings.
    06. PROJECT LIFECYCLE

  77. Deploy
    Deployment is one of the most important aspects of LLM management, directly tied to running costs
    and performance.
    In many cases the size of the model (possibly including its PEFT layers) is too big to be
    handled on a single GPU, thus requiring strong investment or preventing some use cases such as
    embedded deployments.
    A few techniques such as quantization and Low-Rank Adaptation (LoRA) offer a good tradeoff
    between model precision and size.
    Deployment considerations also involve the evaluation of the release strategy:
    ● SaaS services such as OpenAI are a cost-effective solution, but come with some regulatory and
    performance constraints.
    ● Dedicated cloud deployments such as Google Vertex AI, Amazon Bedrock, or Microsoft Azure OpenAI
    offer a GDPR-compliant and data-safe environment while preserving a good cost balance.
    ● On-premise deployment offers data locality, with a provisioning cost that needs to be carefully
    considered.
    06. PROJECT LIFECYCLE

  78. A set of questions to shape an LLM application project:
    ○ Which task should the application accomplish?
    ○ What is the size of the available dataset?
    ○ Is the model handling GDPR-sensitive data or data subject to other privacy constraints?
    ○ How is live data available (API, export, database)?
    ○ Which languages have to be supported?
    ○ Do we need a chatbot or any other type of UI?
    ○ SaaS, cloud, or on-prem deployment?
    Project Checklist
    06. PROJECT LIFECYCLE

  79. Incremental Projects Lifecycle (IPL)
    Sometimes requirements are unclear and the project scope cannot be defined before a working prototype is built. In such cases,
    an incremental approach is preferable because it offers the customer an understanding of the direction the solution is heading,
    while keeping the budget in check. The Generative AI Project Lifecycle can be grouped into three phases, aimed at showcasing
    feasibility and matching business requirements.
    A Proof-of-Concept (PoC) is the initial phase, where requirements and project scope need to be properly defined. In this phase, model
    capabilities are also matched against customer requirements, and a baseline showcasing the expected result is demonstrated.
    Sometimes, due to the uncertainty of the environment and the continuous development of the technology, the PoC phase is replaced by a
    Research and Development (R&D) phase, which allows for better management of uncertainty within a constrained effort.
    In the Minimum Viable Product (MVP) phase the model performance is tailored to production requirements and the main features of the
    solution are developed.
    The release phase accounts for all the integration features, GUIs, and deployments needed to support scalability and reliability.
    06. PROJECT LIFECYCLE

  80. IPL — Phases
    Research and Development (R&D)
    Description: Starts with project kick-off and covers all the solution
    design process, requirements mapping, model evaluation, selection,
    and initial prompt engineering. Usually an alternative to the PoC phase.
    Outcome: R&D report; specific tests / PoCs.
    Target users: internal users; stakeholders.
    Proof-of-Concept (PoC)
    Description: Starts with project kick-off and covers all the solution
    design process, requirements mapping, model evaluation, selection,
    and initial prompt engineering.
    Outcome: solution project; critical path definition; budget estimation;
    working prototype in a sandbox environment or on exported data.
    Target users: internal users; stakeholders; project team.
    Minimum Viable Product (MVP)
    Description: Starts when the PoC is approved. Has the goal of fine-tuning
    models and prompts, with eventual model alignment using RLHF. Iterates
    multiple times through evaluation and engineering sub-phases. Then
    integrations with the systems providing data are built.
    Outcome: viable product implementing the requested features, working on
    customer data; critical path implementation; integrations with customer
    systems; production ready.
    Target users: stakeholders; end users.
    Release
    Description: Aims to scale the MVP towards the customer base, accounting
    for reliability and high availability.
    Outcome: released full-feature solution; customer training (optional).
    Target users: end users; general audience.
    06. PROJECT LIFECYCLE

  81. ● A managed LLM platform that offers a variety of models,
    selectable with just an API parameter
    ● Models range from GPT-3.5-turbo and GPT-4 to
    CLIP, Ada, and DALL-E
    ● SDK to invoke APIs
    ● Pay-as-you-go pricing model
    OpenAI API
    07. TOOLS
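
    A minimal sketch with the OpenAI Python SDK (v1); it assumes the OPENAI_API_KEY environment variable is set:

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Explain RAG in one sentence."},
        ],
    )
    print(response.choices[0].message.content)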

  82. ● A managed LLM platform that offers a variety of models,
    selectable with just an API parameter
    ● Models range from Anthropic Claude to Meta LLaMA 2
    to Amazon's proprietary Titan models
    ● Amazon SDK to invoke APIs
    ● Pay-as-you-go pricing model
    Amazon Bedrock
    07. TOOLS
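
    A minimal sketch with boto3 and the Bedrock runtime; the region and model ID are example assumptions:

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "prompt": "\n\nHuman: Explain RAG in one sentence.\n\nAssistant:",
        "max_tokens_to_sample": 200,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    print(json.loads(response["body"].read())["completion"])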

  83. ● The managed LLM platform offers a variety of
    models, selectable with just an API parameter,
    directly in the Google Cloud platform
    ● Supports Google proprietary models such as PaLM,
    Codey, Imagen, and MedLM
    ● Google SDK to invoke APIs
    ● Pay-as-you-go pricing model
    Google Vertex AI
    07. TOOLS

  84. 07.2.
    Run your own LLM

  85. SageMaker JumpStart provides one-click, end-to-end
    solutions for many common machine learning use cases, such
    as demand forecasting, credit rate prediction, fraud
    detection, and computer vision.
    ● Manage model lifecycle: deploy, fine-tune, and evaluate
    pre-trained models from popular model hubs through the
    JumpStart landing page in the updated Studio experience.
    ● Run inference: access pretrained models, solution
    templates, and examples through the JumpStart landing
    page in Amazon SageMaker Studio Classic.
    Amazon SageMaker JumpStart
    07. TOOLS

  86. A hub platform that allows developers to upload, share, and deploy
    models with ease.
    It saves developers the time and computational
    resources required to train models from scratch.
    ● Portability: Hugging Face supports various
    deployment strategies and providers, from Amazon
    to its own hosted model versions to on-prem or other
    cloud providers.
    ● Datasets: supports storing and retrieving freely
    available datasets to train or fine-tune models.
    ● Models: supports many open source LLMs.
    Hugging Face
    07. TOOLS

  87. 07.3.
    LangChain

  88. An open-source framework for developing
    applications powered by language models:
    ● Context-aware: connect a language model to
    context sources (prompt instructions, few shot
    examples, content to ground its response in, etc.)
    ● Reason: rely on a language model to reason (about
    how to answer based on provided context, what
    actions to take, etc.)
    Supports model I/O, data retrieval, and agents, which
    abstract away the underlying modules, knowledge-base
    access, and the interfaces that use LLMs for reasoning
    and actions.
    LangChain v0.1.0
    LangChain Python
    https://github.com/langchain-ai/langchain
    LangChain JS
    https://github.com/langchain-ai/langchainjs
    07. TOOLS

  89. Inputs and outputs can be parsed, and an LLM can
    be instantiated to run locally (suitable for small
    models) or invoked remotely through APIs.
    LangChain offers utilities to build a chat template
    and a string output parser, and then chain the
    parts together.
    With LangChain, model-specific characteristics
    and prompt templates are abstracted away from
    the project and encapsulated into reusable
    modules.
    LangChain (example)
    07. TOOLS
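
    A minimal sketch of such a chain with LangChain v0.1; it assumes the langchain-openai package is installed and an OpenAI API key is configured:

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a concise technical assistant."),
        ("user", "Explain {topic} in one sentence."),
    ])
    model = ChatOpenAI(model="gpt-3.5-turbo")
    chain = prompt | model | StrOutputParser()  # chain the parts together

    print(chain.invoke({"topic": "retrieval augmented generation"}))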

  90. 08.
    The road ahead

  91. LLMs are having a broad impact across many domains:
    ● LLMs excel at understanding and generating human-like text, transforming how we interact with
    information.
    ● Improved conversational AI, making interactions with chatbots and virtual assistants more natural and
    efficient.
    ● Aiding in writing, coding, and artistic endeavors by suggesting ideas and content.
    ● Automating routine writing tasks, enabling focus on more complex creative or analytical work.
    ● Facilitating language translation and content accessibility for diverse populations.
    The Impact of LLMs
    08. THE ROAD AHEAD

  92. LLMs' unprecedented capabilities are constrained by costs, environmental impact, sustainability, and
    ownership. These concerns are shifting interest toward smaller models:
    ● Small models can be equally or more effective, especially for specific tasks or domains
    ● Smaller models reduce computational costs and latency, offering a more economical AI solution
    without compromising performance.
    ● Small models excel in targeted tasks, providing depth in specific domains rather than generalizing
    across multiple areas.
    ● Small models encourage curated datasets, enhancing training effectiveness and data security.
    ● Combining small models, each with specific strengths, leads to powerful orchestrated solutions akin
    to a team of specialists.
    Issues with LLMs
    08. THE ROAD AHEAD
    The Ever-Growing Power of Small Models
    https://blog.salesforceairesearch.com/the-ever-growing-power-of-small-models/

  93. Large Action Models represent a significant shift in AI, promising to automate processes and augment
    human abilities, potentially transforming personal assistance and organizational efficiency.
    ● Agents capable of performing tasks autonomously, moving beyond mere response generation to active
    task execution.
    ● Act as advanced personal assistants, automating tasks across both professional and personal domains.
    ● Designed to adapt to changing circumstances and update their actions accordingly.
    ● Utilize human feedback and data analysis to refine behaviors and decision-making.
    ● Use cases:
    ● Marketing Automation: Streamlining marketing campaigns by integrating data, tools, and domain-
    specific agents.
    ● Organizational Transformation: Enhancing business operations, customer interactions, and decision-
    making processes.
    Large Action Models (LAM)
    08. THE ROAD AHEAD
    The Ever-Growing Power of Small Models
    https://blog.salesforceairesearch.com/the-ever-growing-power-of-small-models/

  94. Amazon PartyRock
    https://partyrock.aws/
    D&D Adventure Generator
    https://partyrock.aws/u/aletheia/PAzHuQ1EN/DandD-Adventure-Generator

  95. Thank You.
    25125 BRESCIA, VIA ORZINUOVI, 20
    20137 MILANO, VIA PRIVATA DECEMVIRI, 20
    WWW.NEOSPERIENCE.COM
    Download the slides
    https://bit.ly/41ZQa5L
    Provide your feedback
    https://bit.ly/3vvjUeE
