Decoder-only architecture or How the f*ck does ChatGPT work?
As part of my learning journey in generative AI—specifically the attention mechanism—I gave this presentation at work.
It was mainly a way for me to reinforce my own understanding.
Embeddings: an embedding model maps each token to a vector with many dimensions, which lets us measure how close one token is to another. Picture just 3 dimensions for intuition: King and Queen end up near each other, and so do Cat and Dog. Fun fact: text-embedding-3-small returns 1536-dimensional vectors, while text-embedding-3-large returns 3072-dimensional ones.
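A minimal sketch of the idea, using made-up 3-dimensional vectors (real models use hundreds or thousands of dimensions) and cosine similarity as the closeness measure:

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only;
# text-embedding-3-small would give you 1536 dimensions per token.
emb = {
    "King":  np.array([0.9, 0.8, 0.1]),
    "Queen": np.array([0.8, 0.9, 0.1]),
    "Cat":   np.array([0.1, 0.2, 0.9]),
    "Dog":   np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the
    # same way, close to 0.0 means they are unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["King"], emb["Queen"]))  # high: related tokens
print(cosine(emb["King"], emb["Cat"]))    # lower: unrelated tokens
```

With real embeddings the dimensions are not hand-picked; the model learns them during training so that related tokens land near each other.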
QKV in multi-head attention: each token's Query is compared with every token's Key to decide how much each token vector attends to the others. For the sentence "IOKI loves AI" the attention weights could look like this (one row per attending token):

IOKI  → IOKI (0.20), loves (0.30), AI (0.50)
loves → IOKI (0.46), loves (0.10), AI (0.53)
AI    → IOKI (0.76), loves (0.14), AI (0.10)
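The weight table above can be sketched as scaled dot-product attention for a single head. The projection matrices and token vectors below are random placeholders, so the numbers will not match the example, and a real decoder would also apply a causal mask so tokens cannot attend to later positions:

```python
import numpy as np

def attention_weights(X, Wq, Wk):
    # Project the token vectors into Queries and Keys.
    Q, K = X @ Wq, X @ Wk
    # Scaled dot products: one score per (query token, key token) pair.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Softmax per row: how much each token attends to every token.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 tokens ("IOKI", "loves", "AI"), dim 4
Wq = rng.normal(size=(4, 4))  # learned in a real model, random here
Wk = rng.normal(size=(4, 4))
W = attention_weights(X, Wq, Wk)
print(W)  # 3x3 matrix; each row sums to 1
```

In multi-head attention this computation runs several times in parallel with different Wq/Wk (and Wv) matrices, and the heads' outputs are concatenated.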
In each head the attention output then passes through a feed-forward network (FFNN), an MLP with 3 layers. The refined vector for the next position, e.g. Refined_New_AI = [0.33, 0.11, 0.97], is projected into a distribution P[..., N], where P.size is the number of tokens in the dictionary and P.values are the probabilities of each token. Sampling then picks the next token from P, steered by temperature and top_p.
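The sampling step can be sketched as follows. This is a simplified illustration, not how any particular model implements it: the 4-token "dictionary" and the logits are made up, real vocabularies have tens of thousands of entries:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=0.9, rng=None):
    # Pick the next token id from raw scores (logits) over the dictionary.
    rng = rng or np.random.default_rng()
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()  # P: one probability per token in the dictionary
    # top_p (nucleus sampling): keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    q /= q.sum()
    return int(rng.choice(len(p), p=q))

logits = [2.0, 1.0, 0.1, -1.0]  # pretend dictionary of 4 tokens
print(sample_next(logits, temperature=0.7, top_p=0.9))
```

Low temperature plus a small top_p makes the model pick the obvious next token almost every time; raising either setting lets lower-probability tokens through, which reads as more "creative" output.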