Slide 18
Overview: Tokenization
1: original "This is a great example"
2: word segmentation [This, is, a, great, example]
3: subwording [This, is, a, gre, ##at, ex, ##ample]
4: encoding [42, 24, 6, 8, 4, 16, 110]
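Tokenization splits a sentence into tokens. Each token is either a whole word or a part of a word; the "##" prefix marks a piece that continues the preceding one.
Encoding assigns a unique number (ID) to each distinct token in the vocabulary.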
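
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any particular library: it assumes a greedy longest-match subword split (in the style of WordPiece) and a made-up toy vocabulary whose IDs are simply the numbers shown on the slide. The names VOCAB, UNK, segment_words, split_subwords, tokenize, and encode are hypothetical.

```python
# Toy vocabulary: token -> ID. The IDs are copied from the slide; the
# vocabulary itself and the "[UNK]" entry are assumptions for illustration.
UNK = "[UNK]"
VOCAB = {UNK: 0, "This": 42, "is": 24, "a": 6,
         "gre": 8, "##at": 4, "ex": 16, "##ample": 110}


def segment_words(text):
    """Step 2: word segmentation (here simply whitespace splitting)."""
    return text.split()


def split_subwords(word, vocab):
    """Step 3: greedy longest-match split of one word into subword pieces.

    Pieces that do not start the word carry the "##" prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [UNK]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    """Steps 2 and 3 combined: sentence -> list of tokens."""
    tokens = []
    for word in segment_words(text):
        tokens.extend(split_subwords(word, vocab))
    return tokens


def encode(tokens, vocab=VOCAB):
    """Step 4: look up the ID of each token."""
    return [vocab[token] for token in tokens]


tokens = tokenize("This is a great example")
print(tokens)          # ['This', 'is', 'a', 'gre', '##at', 'ex', '##ample']
print(encode(tokens))  # [42, 24, 6, 8, 4, 16, 110]
```

Real tokenizers learn the vocabulary from a corpus and usually normalize the text (for example lowercasing and splitting off punctuation) before this step; the sketch skips both to stay close to the slide.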