
SingaDEV - Zuo Yue - LLMs demystified

Michael Isvy
February 28, 2025

This talk was given in Singapore on February 27th, 2025.

Zuo Yue is a mathematician and machine learning engineer. In this talk she shows LLMs from the inside and dissects the Transformer decoder architecture: from tokenization and embedding to attention, feed-forward layers, and sampling strategies.


Transcript

  1. This presentation aims to:
     • Give some intuition on how transformers work
     • Set up a framework to keep up with fast-evolving technologies
     It is not about:
     • Prompt engineering
     • Inference optimization
     • Fine-tuning, etc.
     Although you will get a glimpse of some of them.
  2. Audience survey:
     • I have machine learning knowledge
     • I have knowledge of neural networks
     • I have used ChatGPT (or an equivalent)
  3. Part 1. Recall: neural networks
     • Neural networks
     • Different neural network architectures
     • Learning
     • Brain and neural network analogy
  4. What is a neural network?
     • One neuron: a linear function z = ax + b followed by a non-linear activation function, y = σ(z) = σ(ax + b), where a, b, x, y ∈ R.
     • Activation function a: σ (sigmoid), ReLU, tanh. Quiz: why do we need a non-linear activation function?
     • One layer (n_1 neurons feeding n_2 neurons): X_2 = a(W X_1 + b), where W is the weight matrix, of dimension n_2 × n_1.
     • In general, X_{i+1} = a_i(W_i X_i + b_i), where X_i ∈ R^{n_i}, W_i ∈ R^{n_{i+1} × n_i}, b_i ∈ R^{n_{i+1}}.
     • Computation / inference: matrix multiplication, activation; matrix multiplication, activation; …
     • Quiz: how many parameters are there in this m-layer neural network?
       o Weights: Σ_{i=1}^{m−1} n_{i+1} · n_i
       o Biases: Σ_{i=1}^{m−1} n_{i+1}
     (See the sketch after this slide.)
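To make the layer formula concrete, here is a minimal NumPy sketch (not from the talk, with hypothetical layer sizes) of the forward pass X_{i+1} = a(W_i X_i + b_i) and of the parameter count from the quiz:

```python
import numpy as np

# Hypothetical layer sizes n_1, ..., n_m; weights and biases are random stand-ins.
layer_sizes = [3, 5, 4, 1]
rng = np.random.default_rng(0)

# One weight matrix W_i of shape (n_{i+1}, n_i) and one bias b_i of shape (n_{i+1},) per layer.
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    # X_{i+1} = a(W_i X_i + b_i): matrix multiplication, then activation, repeated.
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# Parameter count, matching the slide's formulas.
n_weights = sum(n_out * n_in for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
n_biases = sum(layer_sizes[1:])
print(forward(np.ones(3)), n_weights + n_biases)
```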
  5. Different popular neural network architectures (input data / use case examples):
     • Standard NN: tabular data, etc.
     • Convolutional NN (CNN): images
     • Recurrent NN (RNN): sequence data, text
  6. What is learning?
     • What to learn: statistical patterns
     • Importance of good training data:
       o Bias
       o Representativity
       o etc.
     • Given the training data and the architecture, the learned result is the model weights.
     • More weights, more capable.
  7. Why is it called a neural network? Brain vs. neural network analogy:
     • Structure: neurons, connections
     • Learning process: connection strength between neurons vs. learned weights
     • Information flow: action potential (neurotransmission) vs. activation function
     • …
     Fun facts: in the human brain1, 86 billion neurons and 100 trillion connections.
     [Figure: neuron connections, neurotransmission]
     1. Source: research article, Caruso 2023
  8. Part 2. Transformer architecture
     • Transformer architecture
     • Tokenization, embedding
     • Self-attention, multi-head attention
     • Next-token prediction
  9. Transformer architecture1: let's build Legos (layer / step: objective)
     • Tokenization: convert text into digital numbers (tokens)
     • Embedding (+ positional encoding): convert tokens into vectors
     • Encoder blocks × N: produce one contextual-understanding vector per input token
       o Multi-head attention: enrich each input token with contextual meaning
       o Feed-forward (FF): extract features
     • Decoder blocks × N: accept input vectors and generate a vector containing the next-token information
     • Output: convert the vector into token probabilities
     (A toy end-to-end sketch follows this slide.)
     Note that there are other popular architectures, e.g.:
     • Mamba: inference with O(input length) computation cost
     • Mixture-of-Experts (MoE): a sparse model that reduces inference computation relative to model size (replacing the dense FF layer)
     1. Vaswani et al. 2017, "Attention Is All You Need"
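As a rough, end-to-end illustration of the "Lego" pipeline above, here is a toy NumPy sketch with a hypothetical 5-token vocabulary and random weights; the blocks are simple stand-ins for the real multi-head attention and feed-forward layers, shown only to make the data flow visible:

```python
import numpy as np

# Toy decoder-only pipeline: tokenization -> embedding -> N blocks -> output softmax.
rng = np.random.default_rng(0)
vocab = ["what", "is", "molecular", "data", "?"]
d_model, n_blocks = 8, 2

E = rng.normal(size=(len(vocab), d_model))       # embedding matrix (token -> vector)
W_out = rng.normal(size=(d_model, len(vocab)))   # output projection (vector -> logits)

token_ids = [vocab.index(w) for w in ["what", "is", "molecular", "data", "?"]]
x = E[token_ids]                                 # tokenization + embedding: (seq, d_model)

for _ in range(n_blocks):                        # decoder blocks x N
    W_block = rng.normal(size=(d_model, d_model))
    x = np.tanh(x @ W_block)                     # stand-in for attention + feed-forward

logits = x[-1] @ W_out                           # last position predicts the next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # output softmax: next-token probabilities
print(dict(zip(vocab, probs.round(3))))
```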
  10. Many transformer architecture variations; the 3 main transformer architectures (model type: example, use case examples):
      • Encoder only: BERT, classification
      • Encoder-decoder: BART, translation
      • Decoder only (autoregressive, causal): GPT, Llama, etc.; text generation (the focus from now on)
      [Figure: vanilla transformer, Llama and Mistral architecture comparison1; legend: encoder, decoder, embedding, output softmax]
      1. Source: Building Llama 3 from scratch
  11. Working with text: tokenization
      A token is a word / subword / punctuation mark. For common English text:
      • 1 token ~ 4 characters of text
      • 1 token ~ ¾ of a word (100 tokens ~ 75 words)
      Note that:
      • The same word with different semantic meanings ("bank", "long") maps to the same token
      • Each model has its own tokenization method
      (A tokenizer sketch follows this slide.)
      [Figure: GPT-4o tokenizer1 examples, words and their token IDs]
      1. OpenAI tokenizer, https://platform.openai.com/tokenizer
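A small sketch of tokenization using the tiktoken library (assumptions: tiktoken is installed, and "o200k_base" is taken to be the GPT-4o encoding; adjust the encoding name if needed):

```python
import tiktoken  # assumes the tiktoken package is installed

# Assumption: "o200k_base" is the encoding used by GPT-4o.
enc = tiktoken.get_encoding("o200k_base")

ids = enc.encode("Bill's bank is on the east bank")
print(ids)                             # token IDs; both occurrences of " bank" share an ID
print([enc.decode([i]) for i in ids])  # the tokens: words, subwords, punctuation marks
```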
  12. Semantic meaning: embedding. Add GPS: positional encoding.
      Embedding
      • Motivating question: how do we build a representation system so that similar words have similar vectors? And how do we measure this similarity?
      • Token embedding uses a matrix multiplication (thus linear) to convert a token into a vector.
      • Embeddings capture the semantic meaning of words and are the foundation of NLP. Similarity is usually measured with the cosine between vectors (see the sketch after this slide).
      • Until now, the two occurrences of "bank" in "Bill's bank is on the east bank" are still represented the same way!
      Positional encoding
      • Motivating question: as the tokens are processed in parallel, how do we provide information about the order of the tokens?
      [Figure: embeddings1 projected2 to 2D; embedding matrix3; what you see vs. what the model sees]
      1. Embedding with GloVe; 2. projected with PCA; 3. source: The Illustrated GPT-2 (Visualizing Transformer Language Models)
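A minimal sketch of measuring embedding similarity as the cosine between vectors, using tiny hand-picked 3-dimensional toy vectors rather than real GloVe embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(angle) = (u · v) / (|u| |v|): close to 1 for vectors pointing the same way.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy, hand-picked vectors; real embeddings have hundreds of dimensions.
king = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.15])
banana = np.array([0.10, 0.05, 0.90])

print(cosine_similarity(king, queen))   # high: related words
print(cosine_similarity(king, banana))  # low: unrelated words
```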
  13. Contextual understanding: self-attention
      Each token has a triple vector representation (Q, K, V):
      Q = X W_Q,  K = X W_K,  V = X W_V
      where W_Q, W_K, W_V are trainable weight matrices and X is the "input".
      Analogy for the attention formula:
      • Query (Q): summary
      • Key (K): file index
      • softmax(Q · K): percentages
      • Value (V): file content
      • Output: sum(percentage × value)
      The contextual meaning of a token is calculated as follows:
      • The current position queries all previous tokens' keys (prompt in the context window + generated tokens) and computes the contextual meaning (this flexible embedding tells "bank" and "bank" apart).
      • In this example, Q_9·K_1, …, Q_9·K_9 are calculated in parallel.
      • The (K, V) of previous tokens are stored so that X W_K and X W_V don't need to be recalculated (a.k.a. the KV cache).
      • Each decoder block has its own W_Q, W_K, W_V.
      (See the sketch after this slide.)
      1. Source: The Illustrated GPT-2 (Visualizing Transformer Language Models)
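A minimal NumPy sketch of single-head causal self-attention, following the slide's Q = X W_Q, K = X W_K, V = X W_V, with hypothetical dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 16

X = rng.normal(size=(seq_len, d_model))              # one "input" vector per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # Q = X W_Q, K = X W_K, V = X W_V

scores = Q @ K.T / np.sqrt(d_head)                   # Q·K, scaled
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                               # each position only sees previous tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # softmax: the "%" in the analogy

output = weights @ V                                 # sum(% x value): one contextual vector per token
print(output.shape)                                  # (seq_len, d_head)
```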
  14. "Multi-angle" contextual understanding: multi-head self-attention
      • Each attention head i has its own W_Q^i, W_K^i, W_V^i and extracts different information (though it is not easy to interpret).
      • Compared to single-head self-attention of the same total dimension, it:
        o attends to "different information" from previous tokens
        o computes faster, thanks to the smaller per-head vector length / matrix size and the parallel calculation
      • Transformers and CNNs work in a similar fashion:
        o Multiple attention heads / filters: each extracts "little" information, but at "large" scale in parallel
        o Repeated blocks repeatedly refine the information
      We take the 2nd "bank"'s multi-head attention calculation in one decoder block as an example:
      • Step 1: calculate the Q, K, V vectors for each attention head, by multiplying the input vector with a different weight matrix per head
      • Step 2: calculate the output vector for each attention head, as on the previous slide
      • Step 3: concatenate the output vectors of the different attention heads
      (A sketch follows this slide.)
      [Figure: multi-head attention illustration1, for illustration purposes only; Q_1: who, Q_2: what, Q_3: where; width represents the Q·K value]
      1. Inspired by Andrew Ng's deep learning course
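A minimal multi-head variant of the previous sketch (hypothetical sizes: 4 heads, each with its own smaller W_Q, W_K, W_V; outputs concatenated as in step 3):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 8, 16, 4
d_head = d_model // n_heads                        # each head works in a smaller dimension

X = rng.normal(size=(seq_len, d_model))

def causal_attention(Q, K, V):
    # Same scaled, masked attention as the single-head sketch.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

head_outputs = []
for _ in range(n_heads):
    # Step 1: each head has its own (smaller) W_Q, W_K, W_V.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    # Step 2: run attention for this head.
    head_outputs.append(causal_attention(X @ W_Q, X @ W_K, X @ W_V))

# Step 3: concatenate the per-head outputs back to d_model.
output = np.concatenate(head_outputs, axis=-1)
print(output.shape)                                # (seq_len, d_model)
```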
  15. Output generation: next-token prediction
      Inference objective: since the objective of learning is to reproduce the learned statistical patterns at inference time, the objective of inference can be formulated as: given the previous tokens (ŷ_1, ŷ_2, …, ŷ_{n−1}), what is the probability of the next token, P(y_n | ŷ_1, ŷ_2, …, ŷ_{n−1})?
      Example: with the prompt "What is molecular data?" and the output generated so far, "Molecular data refers to", we want to predict the distribution P(next token | prompt, generated output), e.g. token probabilities "a": 0.1%, …, "biological": 15%, …, "information": 40%, …, "zulu": 0.1%.
      [Figure: prompt -> Embedding -> Decoder #1 -> Decoder #2 -> … -> Decoder #N -> Output softmax -> generated output]
      Several generation configuration parameters (see the sketch after this slide):
      • Temperature:
        o Scales the logits before sending them to the softmax, to control generation creativity.
        o A smaller T leads to a more strongly peaked probability distribution (less creative).
      • Sampling strategy:
        o Top-k: sample only among the k tokens with the highest probability.
        o Top-p: sample among the smallest set of most-probable tokens whose cumulative probability reaches p.
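A minimal sketch of temperature, top-k, and top-p applied to next-token logits, using a hypothetical 5-token vocabulary (real models have vocabularies of roughly 100k tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "biological", "information", "molecular", "zulu"]
logits = np.array([1.0, 4.5, 5.5, 3.0, 1.0])       # made-up next-token logits

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    z = logits / temperature                       # smaller T -> more peaked distribution
    probs = np.exp(z - z.max())
    probs /= probs.sum()                           # softmax

    order = np.argsort(probs)[::-1]                # tokens sorted by probability, descending
    if top_k is not None:
        order = order[:top_k]                      # keep only the k most probable tokens
    if top_p is not None:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]                     # smallest set reaching cumulative prob p

    kept = probs[order] / probs[order].sum()       # renormalize over the kept tokens
    return order[rng.choice(len(order), p=kept)]

print(vocab[sample(logits, temperature=0.7, top_k=3, top_p=0.9)])
```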
  16. Q&A

  17. Recommended resources
      • Deep learning courses by Andrew Ng
      • Generative AI with LLMs by AWS
      • How Transformer LLMs Work on deeplearning.ai
      • The Illustrated GPT-2 by Jay Alammar
      • The Illustrated Transformer by Jay Alammar
      • Transformer Inference Arithmetic by kipp.ly