Decoder-only architecture or How the f*ck does ChatGPT work?
As part of my learning journey in generative AI—specifically the attention mechanism—I gave this presentation at work.
It was mainly a way for me to reinforce my own understanding.
Embeddings: an embedding model maps each token to a vector with many dimensions, which lets us measure how close one token is to another. Picture just 3 dimensions for intuition: King and Queen end up near each other, and so do Cat and Dog. Fun fact: text-embedding-3-small returns 1536-dimensional vectors, while text-embedding-3-large returns 3072-dimensional ones.
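A minimal sketch of the idea, using made-up 3-dimensional vectors (real models use hundreds or thousands of dimensions) and cosine similarity as the closeness measure:

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only;
# text-embedding-3-small would give you 1536 dimensions per token.
emb = {
    "King":  np.array([0.9, 0.8, 0.1]),
    "Queen": np.array([0.8, 0.9, 0.1]),
    "Cat":   np.array([0.1, 0.2, 0.9]),
    "Dog":   np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the
    # same way, close to 0.0 means they are unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["King"], emb["Queen"]))  # high: related tokens
print(cosine(emb["King"], emb["Cat"]))    # lower: unrelated tokens
```

With real embeddings the dimensions are not hand-picked; the model learns them during training so that related tokens land near each other.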
QKV in multi-head attention: each token's Query is compared with every token's Key to decide how much each token vector attends to the others. For the sentence "IOKI loves AI" the attention weights could look like this (one row per attending token):

IOKI  → IOKI (0.20), loves (0.30), AI (0.50)
loves → IOKI (0.46), loves (0.10), AI (0.53)
AI    → IOKI (0.76), loves (0.14), AI (0.10)
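The weight table above can be sketched as scaled dot-product attention for a single head. The projection matrices and token vectors below are random placeholders, so the numbers will not match the example, and a real decoder would also apply a causal mask so tokens cannot attend to later positions:

```python
import numpy as np

def attention_weights(X, Wq, Wk):
    # Project the token vectors into Queries and Keys.
    Q, K = X @ Wq, X @ Wk
    # Scaled dot products: one score per (query token, key token) pair.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Softmax per row: how much each token attends to every token.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 tokens ("IOKI", "loves", "AI"), dim 4
Wq = rng.normal(size=(4, 4))  # learned in a real model, random here
Wk = rng.normal(size=(4, 4))
W = attention_weights(X, Wq, Wk)
print(W)  # 3x3 matrix; each row sums to 1
```

In multi-head attention this computation runs several times in parallel with different Wq/Wk (and Wv) matrices, and the heads' outputs are concatenated.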
In each head the attention output then passes through a feed-forward network (FFNN), an MLP with 3 layers. The refined vector for the next position, e.g. Refined_New_AI = [0.33, 0.11, 0.97], is projected into a distribution P[..., N], where P.size is the number of tokens in the dictionary and P.values are the probabilities of each token. Sampling then picks the next token from P, steered by temperature and top_p.
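The sampling step can be sketched as follows. This is a simplified illustration, not how any particular model implements it: the 4-token "dictionary" and the logits are made up, real vocabularies have tens of thousands of entries:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=0.9, rng=None):
    # Pick the next token id from raw scores (logits) over the dictionary.
    rng = rng or np.random.default_rng()
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()  # P: one probability per token in the dictionary
    # top_p (nucleus sampling): keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    q /= q.sum()
    return int(rng.choice(len(p), p=q))

logits = [2.0, 1.0, 0.1, -1.0]  # pretend dictionary of 4 tokens
print(sample_next(logits, temperature=0.7, top_p=0.9))
```

Low temperature plus a small top_p makes the model pick the obvious next token almost every time; raising either setting lets lower-probability tokens through, which reads as more "creative" output.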