Slide 24
Byte Pair Encoding (BPE)
⚫ Training algorithm
1. Create an initial small subword vocabulary consisting of all characters in the training corpus.
2. Merge the pair of adjacent subwords that occurs most frequently in the corpus.
– The tokenizer records this pair as a merge rule.
3. Repeat step 2 until the vocabulary reaches the desired size.
(A runnable sketch of these steps follows the example below.)
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
["b", "g", "h", "n", "p", "s", "u"]
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
["b", "g", "h", "n", "p", "s", "u“, “ug”]
Corpus:
Corpus (subword-based):
Vocabulary:
Corpus (subword-based):
Vocabulary:
Indicates that
“hugs” occurs five times, and so on.
Merge “u” and “g” (frequency: 20)
New subword
Merge “u” and “n” (frequency: 16)
…
The example is from Hugging Face.
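As a rough illustration of the training loop, here is a minimal Python sketch run on the corpus above. The function train_bpe and its signature are invented for this illustration (it is not a Hugging Face API), and it ignores practical details such as tie-breaking and pre-tokenization.

from collections import Counter

def train_bpe(word_counts, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Split every word into characters; the initial vocabulary is the character set.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    vocab = sorted({ch for word in corpus for ch in word})
    merges = []

    for _ in range(num_merges):
        # Count how often each adjacent subword pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for word, count in corpus.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break

        # Merge the most frequent pair everywhere it occurs (the learned merge rule).
        (a, b), freq = pair_counts.most_common(1)[0]
        merges.append((a, b, freq))
        vocab.append(a + b)

        new_corpus = {}
        for word, count in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    merged.append(a + b)  # replace the pair with the new subword
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] = count
        corpus = new_corpus

    return vocab, merges

word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges = train_bpe(word_counts, num_merges=2)
print(merges)  # [('u', 'g', 20), ('u', 'n', 16)]
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un']

Running the sketch reproduces the two merges on the slide: "u" + "g" with frequency 20, then "u" + "n" with frequency 16.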