small subword vocabulary of all characters in the training corpus.
2. Merge the two consecutive subwords that occur most frequently in the corpus.
   – The tokenizer learns the merge rule.
3. Repeat step 2 until the vocabulary reaches the desired size.

Corpus: ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
  – Indicates that "hugs" occurs five times, and so on.
Vocabulary: ["b", "g", "h", "n", "p", "s", "u"]
Corpus (subword-based): ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

Merge "u" and "g" (frequency: 20) – new subword "ug"

Corpus (subword-based): ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]

Merge "u" and "n" (frequency: 16)
…

Example is from Hugging Face.
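To make the procedure concrete, below is a minimal BPE-training sketch in Python on the toy corpus above. The helper names (`pair_frequencies`, `merge_pair`) and the target vocabulary size of 10 are illustrative assumptions, not part of the slide; a production tokenizer (e.g. Hugging Face's tokenizers library) implements the same idea with more bookkeeping.

```python
from collections import Counter

# Toy corpus from the slide: word -> frequency
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Step 1: split every word into characters; the initial vocabulary is the character set.
splits = {word: list(word) for word in corpus}
vocab = sorted({ch for word in corpus for ch in word})

def pair_frequencies(splits, corpus):
    """Count how often each pair of consecutive subwords occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, splits):
    """Apply a learned merge rule: replace every occurrence of the pair with the merged subword."""
    a, b = pair
    merged = a + b
    for word, symbols in splits.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        splits[word] = new_symbols

# Steps 2-3: repeatedly merge the most frequent pair until the vocabulary is large enough.
target_vocab_size = 10  # assumed value for illustration
merge_rules = []
while len(vocab) < target_vocab_size:
    pairs = pair_frequencies(splits, corpus)
    best = max(pairs, key=pairs.get)   # round 1: ("u", "g") with frequency 20
    merge_rules.append(best)
    merge_pair(best, splits)
    vocab.append(best[0] + best[1])

print(merge_rules)  # [('u', 'g'), ('u', 'n'), ...]
print(vocab)        # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', ...]
```

Running this reproduces the slide's first two merges: "u" + "g" (frequency 10 + 5 + 5 = 20) is learned first, then "u" + "n" (frequency 12 + 4 = 16).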