Slide 9
Tokens and Tokenization
~50K vocab size
60 chars → 13 tokens:
[464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]
(Other examples: 76 chars → 17 tokens; 55 chars → 24 tokens)
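A minimal sketch of the chars → token IDs step, assuming a GPT-2-style BPE tokenizer accessed via the tiktoken library (the sample sentence is an illustrative stand-in, not necessarily the one behind the IDs above):

```python
# Sketch: encode text into BPE token IDs using the GPT-2 vocabulary (~50K entries).
import tiktoken

enc = tiktoken.get_encoding("gpt2")                    # GPT-2 BPE, n_vocab = 50,257
text = "The quick brown fox jumps over the lazy dog."  # illustrative sentence
ids = enc.encode(text)

print(f"{len(text)} chars -> {len(ids)} tokens")
print(ids)                                             # one integer ID per token
assert enc.decode(ids) == text                         # decoding round-trips the text
```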
N-dimensional embedding vector per token: a continuous-space representation we can use as model input
[0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]
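A sketch of the token ID → embedding vector lookup; the vocabulary size, embedding dimension, and the use of an untrained PyTorch embedding table are assumptions for shape illustration only:

```python
# Sketch: map each token ID to an N-dimensional embedding vector via a lookup table.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_257, 768              # assumed GPT-2-like sizes
embedding = nn.Embedding(vocab_size, embed_dim)  # untrained lookup table (random weights)

token_ids = torch.tensor([464, 5044, 1422, 470, 3272, 262, 4675, 780,
                          340, 373, 1165, 10032, 13])
vectors = embedding(token_ids)                   # shape: (13, 768), one vector per token
print(vectors.shape)
```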
Embeddings for similar concepts will be close to each other in N-dimensional space
(e.g., the vectors for “dog” and “hound” will have a cosine similarity closer to 1 than those for “dog” and “chair”)
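A sketch of the cosine-similarity comparison; the three 3-dimensional vectors below are hand-picked placeholders (real trained embeddings would be needed to actually reproduce the dog/hound vs. dog/chair ordering):

```python
# Sketch: compare embeddings by cosine similarity (closer to 1 = more similar direction).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder "embeddings", chosen so dog and hound point in similar directions.
dog   = np.array([0.9, 0.1, 0.3])
hound = np.array([0.8, 0.2, 0.4])
chair = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(dog, hound))   # near 1
print(cosine_similarity(dog, chair))   # much lower
```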
Less common words tend to split into multiple tokens:
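A sketch of sub-word splitting, again assuming a GPT-2-style BPE tokenizer via tiktoken; the word choices are illustrative:

```python
# Sketch: a common word vs. a rare word under BPE; rarer words tend to split
# into several sub-word pieces, each with its own token ID.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["dog", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) {pieces}")
```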
There’s a bias towards English in the BPE corpus:
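A sketch of that bias, comparing token counts for an English sentence and an illustrative German translation under the same assumed GPT-2-style tokenizer; the non-English text typically costs noticeably more tokens per character:

```python
# Sketch: same meaning, different languages -- compare characters per token.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
samples = {
    "English": "The weather is very nice today.",
    "German":  "Das Wetter ist heute sehr schön.",  # translation of the English sentence
}
for lang, text in samples.items():
    ids = enc.encode(text)
    print(f"{lang}: {len(text)} chars -> {len(ids)} tokens")
```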
[Diagram: “dog”, “hound”, and “chair” plotted as points in embedding space]