
Paper Introduction: Fast WordPiece Tokenization

tatHi
December 11, 2021

Transcript

  1. Fast WordPiece Tokenization
     Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou. EMNLP 2021 (Main Conference).
     Presenter: Tatsuya Hiraoka (D3)
  2. Overview
     • Target: the tokenization algorithm used in WordPiece (the tokenizer for BERT), i.e., longest-match-first (MaxMatch).
     • Problem: much faster tokenization is required in NLP, but the conventional implementation takes 𝑂(𝑛²) time, where 𝑛 is the word length.
     • Solution: propose a fast algorithm for longest-match-first (MaxMatch) that achieves 𝑂(𝑛) by precomputation on the vocabulary trie.
  3. Motivation: We Always Do Tokenization
     [Figure: billions of queries flow through the tokenizer into BERT.]
     Tokenization runs even at inference time, e.g., in Google Web Search, so a much faster tokenizer is required.
  4. WordPiece Tokenization
     Input sentence "… the superman is …" is first split by word tokenization; "superman" becomes "super ##man".
  5. WordPiece Tokenization
     Subwords are chosen by a left-to-right longest-match-first algorithm over the vocabulary.
  6. WordPiece Tokenization
     The special prefix "##" indicates a token that starts in the middle of a word.
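To make the longest-match-first rule concrete, here is a minimal sketch of naive WordPiece tokenization in Python; the function name, the plain-set vocabulary, and the [UNK] handling are illustrative assumptions, not the paper's implementation.

    def wordpiece_tokenize(word, vocab, unk="[UNK]"):
        """Longest-match-first (MaxMatch) over a vocabulary set."""
        tokens, start = [], 0
        while start < len(word):
            end, match = len(word), None
            # Try the longest remaining substring first and shrink on failure.
            while start < end:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # word-internal pieces get the prefix
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                return [unk]  # no vocabulary entry fits: the word is unknown
            tokens.append(match)
            start = end
        return tokens

    print(wordpiece_tokenize("superman", {"super", "##man"}))  # ['super', '##man']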
  7. WordPiece Tokenization (Building Trie)
     Example of trie construction for the vocabulary {a, abcdx, ##b, ##c, ##cdy, ##dz}.
  8. WordPiece Tokenization (Building Trie)
     [Trie figure: states 0–13 connected by character edges; the legend distinguishes accepting states, non-accepting states, and trie edges.]
  9. WordPiece Tokenization (Building Trie)
     [Same trie; state 0 is marked as the initial state.]
  10. WordPiece Tokenization (Building Trie)
     [Same trie; the accepting states are those for "a", "abcdx", "##b", "##c", "##cdy", and "##dz".]
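A minimal sketch of this trie construction, assuming a dict-of-children node type (the class and field names are illustrative):

    class TrieNode:
        def __init__(self):
            self.children = {}  # character -> TrieNode (the trie edges)
            self.token = None   # vocabulary entry accepted here, if any

    def build_trie(vocab):
        root = TrieNode()  # the initial state
        for token in vocab:
            node = root
            for ch in token:  # "##b" walks the edges '#', '#', 'b'
                node = node.children.setdefault(ch, TrieNode())
            node.token = token  # mark an accepting state
        return root

    trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
    assert trie.children["a"].token == "a"  # accepts "a"
    assert trie.children["#"].children["#"].children["b"].token == "##b"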
  11. WordPiece Tokenization (MaxMatching)
     Input word: "abcdz". [Same trie figure as above.] Tokenized sequence: (empty).
  12. WordPiece Tokenization (MaxMatching)
     STEP 〇: start at the initial state.
  13. WordPiece Tokenization (MaxMatching)
     STEP ①: read "a" (an accepting state).
  14. WordPiece Tokenization (MaxMatching)
     STEP ②: read "b".
  15. WordPiece Tokenization (MaxMatching)
     STEP ③: read "c".
  16. WordPiece Tokenization (MaxMatching)
     STEP ④: read "d".
  17. WordPiece Tokenization (MaxMatching)
     STEP ⑤: no transition from "d" to "z". Yield the latest accepted token and transit to state 2. Tokenized sequence: a
  18. WordPiece Tokenization (MaxMatching)
     STEP ⑥: read "b" again.
  19. WordPiece Tokenization (MaxMatching)
     STEP ⑦: no transition from "b" to "c". Yield the latest accepted token and transit to state 2. Tokenized sequence: a ##b
  20. WordPiece Tokenization (MaxMatching)
     STEP ⑧: read "c" again.
  21. WordPiece Tokenization (MaxMatching)
     STEP ⑨: read "d" again.
  22. WordPiece Tokenization (MaxMatching)
     STEP ⑩: no transition from "d" to "z". Yield the latest accepted token and transit to state 2. Tokenized sequence: a ##b ##c
  23. WordPiece Tokenization (MaxMatching)
     STEP ⑪: read "d" again.
  24. WordPiece Tokenization (MaxMatching)
     STEP ⑫: read "z".
  25. WordPiece Tokenization (MaxMatching)
     STEP ⑬: no further transition at the end of the word. Yield the latest accepted token. Tokenized sequence: a ##b ##c ##dz
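The walkthrough above corresponds to the following sketch of trie-based MaxMatch, reusing build_trie from the earlier sketch: from each start position, follow edges as far as possible, remember the last accepting state, yield its token, and restart from the first unconsumed character.

    def maxmatch_tokenize(word, trie, unk="[UNK]"):
        # Assumes the "##" prefix path exists in the trie, as in this vocabulary.
        internal = trie.children["#"].children["#"]  # state after reading "##"
        tokens, start = [], 0
        while start < len(word):
            node = trie if start == 0 else internal
            last_token, last_end = None, start
            for i in range(start, len(word)):
                node = node.children.get(word[i])
                if node is None:
                    break  # no transition: fall back to the last accept
                if node.token is not None:
                    last_token, last_end = node.token, i + 1
            if last_token is None:
                return [unk]  # nothing matched at this position
            tokens.append(last_token)
            start = last_end  # restart here: later characters are read again
        return tokens

    trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
    print(maxmatch_tokenize("abcdz", trie))  # ['a', '##b', '##c', '##dz']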
  26. Problem
     • The algorithm reads the same characters multiple times, causing 𝑂(𝑛²) computation in the worst case.
     • This wastes time on long words, such as "interesting" (11 characters).
     • Reading steps needed to yield each token of "abcdz": 〇 ① ② ③ ④ ⑤ → a; ⑥ ⑦ → ##b; ⑧ ⑨ ⑩ → ##c; ⑪ ⑫ ⑬ → ##dz.
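To see the re-reading on this example, a quick instrumented variant of the loop above (illustrative, reusing trie and build_trie from the earlier sketches) counts trie transition attempts: 12 for the 5-character "abcdz", close to the 〇–⑬ step count (the exact number depends on how the end-of-word check is tallied), and quadratic in the worst case.

    def count_reads(word, trie):
        """Count trie transition attempts made by naive MaxMatch."""
        internal = trie.children["#"].children["#"]
        reads, start = 0, 0
        while start < len(word):
            node = trie if start == 0 else internal
            last_token, last_end = None, start
            for i in range(start, len(word)):
                reads += 1  # one attempt per character looked at
                node = node.children.get(word[i])
                if node is None:
                    break
                if node.token is not None:
                    last_token, last_end = node.token, i + 1
            if last_token is None:
                break  # unknown word: stop counting
            start = last_end
        return reads

    print(count_reads("abcdz", trie))  # 12 attempts for 5 characters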
  31. Proposed: LinMaxMatch
     STEP ③ (having read "abc"): we already know the first tokens should be [a, ##b] and the next state should be 9 if the search stops here → we don't need to read "bc" again.
  32. Proposed: LinMaxMatch
     ❶ Yield "a" and move to state 2 if the transition at state 5 fails.
  33. Proposed: LinMaxMatch
     ❷ Read "b".
  34. Proposed: LinMaxMatch
     ❸ Read "c" and yield "##b".
  35. Proposed: LinMaxMatch
     ❹ Move from state 2 to state 4 by reading "c".
  36. Proposed: LinMaxMatch
     Add this information to the trie as a precomputed failure transition: pops [a, ##b], next state 9.
  37. Proposed: LinMaxMatch
     • Fill in this information for all states as precomputation, i.e., cache the 𝑂(𝑛²) computation in advance.
     • [Trie figure, annotated per state with its (pops, next state) pair: [a] → 2, [a] → 8, [a, ##b] → 9, [a, ##b] → 10, [abcdx] → 2, [##b] → 2, [##c] → 2, [##c] → 12, [##cdy] → 2, [##dz] → 2, and [ ] → ∅ for states with no failure transition.]
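A minimal sketch of this precomputation, assuming the TrieNode/build_trie sketch above: a breadth-first pass fills each node's failure pops (tokens to emit) and failure link (state to jump to). On this vocabulary it reproduces the table on the slide; for example, the node for "abc" ends up with pops [a, ##b] and a link to the "##c" node, matching the "[a, ##b] Next: 9" entry.

    from collections import deque

    def precompute_failures(trie):
        # Assumes the "##" prefix path exists; after a yield, matching restarts there.
        suffix_root = trie.children["#"].children["#"]
        trie.fail, trie.pops = None, []
        queue = deque([trie])
        while queue:  # BFS: parents are processed before their children
            u = queue.popleft()
            for c, v in u.children.items():
                if v.token is not None:
                    # Accepting state: emit its token, restart at the "##" state.
                    v.pops, v.fail = [v.token], suffix_root
                else:
                    # Follow the parent's failure chain, collecting its pops,
                    # until some state can read character c (or we run out).
                    z, pops = u.fail, list(u.pops)
                    while z is not None and c not in z.children:
                        pops += z.pops
                        z = z.fail
                    v.fail = z.children[c] if z is not None else None
                    v.pops = pops if v.fail is not None else []
                queue.append(v)

    precompute_failures(trie)
    abc = trie.children["a"].children["b"].children["c"]
    print(abc.pops, abc.fail.token)  # ['a', '##b'] ##c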
  38. Example: Tokenization (LinMaxMatch)
     Input word: "abcdz". [Same trie, annotated with the precomputed pops and failure links.] Tokenized sequence: (empty). STEP 〇: start at the initial state.
  39. Example: Tokenization (LinMaxMatch)
     STEP ①: read "a".
  40. Example: Tokenization (LinMaxMatch)
     STEP ②: read "b".
  41. Example: Tokenization (LinMaxMatch)
     STEP ③: read "c".
  42. Example: Tokenization (LinMaxMatch)
     STEP ④: read "d".
  43. Example: Tokenization (LinMaxMatch)
     STEP ⑤: no transition from "d" to "z". Yield [a, ##b] and transit to state 10. Tokenized sequence: a ##b
  44. Example: Tokenization (LinMaxMatch)
     STEP ⑤′: still no transition to "z". Yield [##c] and transit to state 12. Tokenized sequence: a ##b ##c
  45. Example: Tokenization (LinMaxMatch)
     STEP ⑤″: follow the failure transition. Tokenized sequence: a ##b ##c
  46. Example: Tokenization (LinMaxMatch)
     STEP ⑤″: we can now pass through from state 6 to state 13 just by reading "z".
  47. Example: Tokenization (LinMaxMatch)
     STEP ⑥: no further transition at the end of the word. Yield [##dz]. Tokenized sequence: a ##b ##c ##dz
  48. Example: Tokenization
     • The proposed method reads the input only once.
     • Steps ⑤′ and ⑤″ cost nothing extra because they just follow the precomputed transitions.
     • This reduces the computation cost to linear, 𝑂(𝑛).
     • This paper was originally titled "Linear WordPiece Tokenization".
     • Reading steps needed to yield each token of "abcdz": 〇 ① ② ③ ④ ⑤ → a; ⑤ → ##b; ⑤′ → ##c; ⑤″ ⑥ → ##dz.
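Putting the pieces together, a minimal sketch of the linear-time matching loop under the same assumptions: follow trie edges; on a missing edge, emit the precomputed pops and jump along the failure link, so no character is ever re-read. (The paper appends an end-of-input boundary symbol; the final flush loop here is a simplification of that.)

    def lin_maxmatch_tokenize(word, trie, unk="[UNK]"):
        # Assumes precompute_failures() has been run on the trie.
        suffix_root = trie.children["#"].children["#"]
        tokens, node, i = [], trie, 0
        while i < len(word):
            while word[i] not in node.children:
                if node.fail is None:
                    return [unk]  # dead end: the word cannot be tokenized
                tokens += node.pops  # tokens decided during precomputation
                node = node.fail     # jump; word[i] is not read again
            node = node.children[word[i]]
            i += 1
        # End of input: flush the remaining precomputed pops.
        while node is not trie and node is not suffix_root:
            if node.fail is None:
                return [unk]
            tokens += node.pops
            node = node.fail
        return tokens

    print(lin_maxmatch_tokenize("abcdz", trie))  # ['a', '##b', '##c', '##dz']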
  54. Experiments: Total Tokenization Speed
     • Vocabulary: BERT-base (Multilingual Cased model)
     • Data: 1,000 sentences covering 82 languages (Wikipedia)
     • Systems: WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++. Note that all methods output the same tokenization.
  55. Experiments: Total Tokenization Speed
     [Charts: tokenization speed per single word and per sentence for the three systems.]
  56. Experiments: Total Tokenization Speed
     The proposed method is the fastest!
  57. Experiments: Speed against Word Length
     [Chart: tokenization time vs. word length for Rust WordPiece, C++ WordPiece, and the proposed WordPiece; longer words are slower.]
     The proposed method is much faster than the original implementation, especially for longer words!
  58. Overview and Comments
     • Target: the tokenization algorithm used in WordPiece (the tokenizer for BERT), i.e., longest-match-first (MaxMatch).
     • Problem: much faster tokenization is required in NLP, but the conventional implementation takes 𝑂(𝑛²) time, where 𝑛 is the word length.
     • Solution: propose a fast algorithm for longest-match-first (MaxMatch) that achieves 𝑂(𝑛) by precomputation on the vocabulary trie.
     • Comment: I'm impressed that *ACL venues accept such an algorithmic paper on tokenization nowadays.