Slide 1

Slide 1 text

Things you never dared to ask about LLMs
Guillaume Laforge, Developer Advocate
Two R’s or not two R’s, that is the question
@[email protected]

Slide 2

Slide 2 text

1. I don’t have a PhD in Machine Learning.
2. This talk may contain hallucinations!
3. Please correct me if I’m wrong 🙏 I’m eager to learn.
🚧 Be warned 🚧

Slide 3

Slide 3 text

How many R’s in “strawberry”? Two or Three?

Slide 4

Slide 4 text

How many R’s in “strawberry”? LLMs process language in chunks, not individual letters

Slide 5

Slide 5 text

LLMs process language in chunks… or actually, tokens. GPT-4 is rumored to have been trained on ~2 trillion tokens. Roughly: 4 tokens ≈ 3 words. Lots & lots of tokens… A human reading 8 hours a day would need 44,000 years to read 2 trillion tokens.

Slide 6

Slide 6 text

LLMs reason on tokens, not letters x.com/karpathy/status/1816637781659254908
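To see what that means in practice, here is a minimal sketch using OpenAI's tiktoken library with its cl100k_base encoding (my choice for illustration; the slides don't name a specific tokenizer):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models
enc = tiktoken.get_encoding("cl100k_base")

text = "How many R's in strawberry?"
token_ids = enc.encode(text)

# The model receives a sequence of integer token IDs, not letters
print(token_ids)
# Decoding each ID separately shows the chunks the model actually works with
print([enc.decode([tid]) for tid in token_ids])
```

Depending on the tokenizer, "strawberry" usually ends up as one or a few multi-letter chunks, so the model never sees the individual R's it is asked to count.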

Slide 7

Slide 7 text

What you see… What LLM sees…

Slide 8

Slide 8 text

Why tokens and not letters? Or even words? For efficiency! Better for memory, keeps attention high. Tokens are the sweet spot between letters and words. reddit.com/r/learnmachinelearning/comments/1d0sopa/why_not_use_words_instead_of_tokens

Slide 9

Slide 9 text

How does tokenization work? Most common algorithms:
● BPE (Byte-Pair Encoding), used by GPTs
● WordPiece, used by BERT
● Unigram, often used in SentencePiece
● SentencePiece, used by Gemini & Gemma
Some require pre-tokenization, or don’t offer reversible tokenization.
huggingface.co/learn/nlp-course/chapter6/4#algorithm-overview
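As a rough illustration (not from the slides), the same sentence can be pushed through tokenizers built on different algorithms, for example with Hugging Face's transformers library; gpt2 and bert-base-uncased are just convenient public checkpoints:

```python
# pip install transformers
from transformers import AutoTokenizer

sentence = "Tokenization splits text into subword units."

# gpt2 uses BPE, bert-base-uncased uses WordPiece
for name in ["gpt2", "bert-base-uncased"]:
    tokens = AutoTokenizer.from_pretrained(name).tokenize(sentence)
    print(f"{name}: {tokens}")
```

The two outputs differ in how they mark word boundaries: the GPT-2 BPE tokenizer prefixes tokens that start with a space with 'Ġ', while WordPiece marks word-continuation pieces with '##'.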

Slide 10

Slide 10 text

Zooming on Byte-Pair Encoding u-n-r-e-l-a-t-e-d

Slide 11

Slide 11 text

Zooming on Byte-Pair Encoding Merge rules r + e → re u-n-r-e-l-a-t-e-d

Slide 12

Slide 12 text

Zooming on Byte-Pair Encoding Merge rules r + e → re u-n-re-l-a-t-e-d

Slide 13

Slide 13 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at u-n-re-l-a-t-e-d

Slide 14

Slide 14 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at u-n-re-l-at-e-d

Slide 15

Slide 15 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u-n-re-l-at-e-d

Slide 16

Slide 16 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u-n-re-l-at-ed

Slide 17

Slide 17 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un u-n-re-l-at-ed

Slide 18

Slide 18 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un un-re-l-at-ed

Slide 19

Slide 19 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated un-re-l-at-ed

Slide 20

Slide 20 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated un-re-l-ated

Slide 21

Slide 21 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated re + l → rel un-re-l-ated

Slide 22

Slide 22 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated re + l → rel un-rel-ated

Slide 23

Slide 23 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated re + l → rel rel + ated → related un-rel-ated

Slide 24

Slide 24 text

Zooming on Byte-Pair Encoding Merge rules r + e → re a + t → at e + d → ed u + n → un at + ed → ated re + l → rel rel + ated → related un-related
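The animation above can be replayed with a few lines of plain Python: this sketch only applies the merge rules from the slides, in order, to the character sequence (the BPE training step that learns those rules from corpus frequencies is not shown):

```python
# Merge rules from the slides, in the order they are applied
merges = [("r", "e"), ("a", "t"), ("e", "d"), ("u", "n"),
          ("at", "ed"), ("re", "l"), ("rel", "ated")]

def bpe_tokenize(word, merges):
    """Apply BPE merge rules to a word, starting from individual characters."""
    pieces = list(word)                        # u-n-r-e-l-a-t-e-d
    for left, right in merges:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == left and pieces[i + 1] == right:
                pieces[i:i + 2] = [left + right]   # merge the adjacent pair
            else:
                i += 1
        print("-".join(pieces))                # print each intermediate state
    return pieces

bpe_tokenize("unrelated", merges)              # ends with ['un', 'related']
```

The printed intermediate states match the slide sequence, ending with the two tokens un and related.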

Slide 25

Slide 25 text

Weird ▁ character in front of tokens? ‘▁’ (U+2581, Lower One Eighth Block) is not your regular ‘_’. SentencePiece uses this character to denote white space: Sentence, Piece, ▁uses, ▁this, ▁character, ▁to, ▁denote, ▁white, ▁space, .
github.com/google/sentencepiece/tree/master#whitespace-is-treated-as-a-basic-symbol
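A quick way to see this behaviour is to tokenize a sentence with any SentencePiece-based tokenizer; t5-small is used here purely as a convenient public example (the slides mention Gemini & Gemma):

```python
# pip install transformers sentencepiece
from transformers import AutoTokenizer

# t5-small ships a SentencePiece tokenizer: tokens that start a new word carry the ▁ prefix
tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("SentencePiece uses this character to denote white space."))
```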

Slide 26

Slide 26 text

Stochastic parrots: a very talkative and imaginative parrot! Speaking endlessly… Blah, blah, blah…

Slide 27

Slide 27 text

Foundation or instruction-tuned models?

Slide 28

Slide 28 text

How do LLMs know when to stop generating tokens? Because we told them so? LLM text generation stops when it has reached:
● the max output tokens count
● its <|endoftext|> token, or the special end-of-turn tokens from its instruction-tuning set
www.louisbouchard.ai/how-llms-know-when-to-stop/
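As a toy sketch of those two stopping conditions (the "model" below is a stub with made-up probabilities, not a real LLM), a generation loop keeps sampling until one of them triggers:

```python
import random

EOS_TOKEN_ID = 0        # stands in for <|endoftext|> / an end-of-turn token
MAX_OUTPUT_TOKENS = 32  # the max output tokens count

def toy_next_token_distribution(token_ids):
    """Stub model: the end-of-text token gets more likely as the output grows."""
    p_eos = min(1.0, len(token_ids) / 20)
    p_other = (1.0 - p_eos) / 5
    return {EOS_TOKEN_ID: p_eos, 1: p_other, 2: p_other, 3: p_other, 4: p_other, 5: p_other}

def generate(prompt_ids):
    output = []
    while len(output) < MAX_OUTPUT_TOKENS:                      # stop #1: token budget
        dist = toy_next_token_distribution(prompt_ids + output)
        next_id = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_id == EOS_TOKEN_ID:                             # stop #2: end-of-text token
            break
        output.append(next_id)
    return output

print(generate([42, 7, 13]))
```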

Slide 29

Slide 29 text

How do LLMs choose the next token? I’m just playing dice!

Slide 30

Slide 30 text

Choosing the next token: The cat is a… One candidate has a 56% chance of being picked, another a 2% chance of being picked.

Slide 31

Slide 31 text

Flying through hyperspace… Err… Hyperparameters To max output tokens, and beyond!

Slide 32

Slide 32 text

Top-K — Just pick among the k most likely tokens. The cat is a… Pick the top 2

Slide 33

Slide 33 text

Top-P — Cumulative probability, also called “nucleus sampling”. The cat is a… Pick the smallest set of top tokens whose cumulative probability exceeds 0.9: 0.56 + 0.28 + 0.11 = 0.95 > 0.9
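Using the slide's numbers for "The cat is a…" (the candidate words below are made up for illustration), top-K and top-P simply shrink the candidate list before sampling:

```python
import random

# Hypothetical next-token distribution; probabilities follow the slide (0.56 + 0.28 + 0.11 = 0.95)
candidates = {"feline": 0.56, "pet": 0.28, "mammal": 0.11, "dog": 0.03, "car": 0.02}

def top_k(dist, k):
    """Keep only the k most likely tokens."""
    return dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p(dist, p):
    """Nucleus sampling: keep the smallest set of top tokens whose cumulative probability exceeds p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative > p:
            break
    return kept

print(top_k(candidates, 2))    # {'feline': 0.56, 'pet': 0.28}
print(top_p(candidates, 0.9))  # three tokens kept: 0.56 + 0.28 + 0.11 = 0.95 > 0.9

# One token is then sampled among the (re-weighted) survivors
survivors = top_p(candidates, 0.9)
print(random.choices(list(survivors), weights=list(survivors.values()))[0])
```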

Slide 34

Slide 34 text

What is temperature exactly? It’s hot, that’s all I know!

Slide 35

Slide 35 text

Temperature — Flattening / sharpening the curve Temp = 0.2

Slide 36

Slide 36 text

Temperature — Flattening / sharpening the curve Temp = 0.6

Slide 37

Slide 37 text

Temperature — Flattening / sharpening the curve Temp = 1.0

Slide 38

Slide 38 text

Temperature — Flattening / sharpening the curve Temp = 1.4

Slide 39

Slide 39 text

Temperature — Flattening / sharpening the curve Temp = 1.8
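What temperature actually does: the raw scores (logits) are divided by the temperature before the softmax, so low values sharpen the curve and high values flatten it. A small sketch with made-up logits, using the same temperature values as the slides:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, scaling by the temperature first."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.3, 2.4, 1.2, 0.8]   # made-up scores for 5 candidate tokens

for temp in [0.2, 0.6, 1.0, 1.4, 1.8]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: " + " ".join(f"{p:.2f}" for p in probs))
# At T=0.2 almost all the probability mass goes to the top token (sharp curve);
# at T=1.8 the probabilities get much closer to each other (flat curve).
```

Temperature 0 is usually special-cased as greedy decoding (always take the most likely token), since dividing the logits by zero isn't defined.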

Slide 40

Slide 40 text

Some more LLM oddities

Slide 41

Slide 41 text

Deterministic… to a certain extent. If I set temperature to 0 and top-K to 1, do I get deterministic output? What if there’s a seed parameter? Well, no, because of:
● Floating-point numbers’ (non-)associativity: (a + b) + c ≠ a + (b + c) due to rounding errors
● Parallel execution order: GPUs reorder reductions (like sums) across threads and cores
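The floating-point part is easy to check in any language; this is standard IEEE 754 behaviour, not anything specific to LLMs:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left, right, left == right)
# 0.6000000000000001 0.6 False
# The grouping of the additions changes the result, which is why reordering
# parallel GPU reductions can make the logits differ from run to run.
```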

Slide 42

Slide 42 text

Hallucinations or confabulation? “Hallucination” is anthropomorphism; “confabulation” means filling the gaps in one’s memory. High token probability, or more deterministic output, doesn’t mean being correct.
news.ycombinator.com/item?id=33841672
www.linkedin.com/pulse/hallucinating-confabulating-peter-mcelwaine-johnn/

Slide 43

Slide 43 text

You can convince an LLM it’s wrong! Even when it was actually right in the first place! Damn, I was right this time!

Slide 44

Slide 44 text

Context matters, LLMs are easily influenced… Mbappé, of course!

Slide 45

Slide 45 text

The reversal curse arxiv.org/pdf/2309.12288 the-decoder.com/language-models-know-tom-cruises-mother-but-not-her-son/

Slide 46

Slide 46 text

Do you speak Base64? Base64 is just like any other (human) language! RW5qb3kgZG90QUkh
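For the curious, the slide's string (re-joined across the slide's line break) is regular Base64 and decodes with the standard library; the point of the slide is that an LLM can often do this "translation" itself, simply because Base64-encoded text was part of its training data:

```python
import base64

encoded = "RW5qb3kgZG90QUkh"                       # the slide's string, line break removed
print(base64.b64decode(encoded).decode("utf-8"))   # -> Enjoy dotAI!
```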

Slide 47

Slide 47 text

Thanks for your attention @[email protected] It’s all you need!

Slide 48

Slide 48 text

Illustrations courtesy of Imagen 3