Paper Introduction: Fast WordPiece Tokenization

tatHi
December 11, 2021

Transcript

  1. Fast WordPiece Tokenization. Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou. EMNLP 2021 (Main Conference). Presenter: Tatsuya Hiraoka (D3).
  2. Overview • Target: the tokenization algorithm used in WordPiece (the tokenizer for BERT), i.e., longest-match-first (MaxMatch). • Problem: much faster tokenization is required in NLP, but the conventional implementation takes O(n²) time, where n is the word length. • Solution: propose a fast algorithm for longest-match-first (MaxMatch), realizing O(n) by adding precomputation to the vocabulary trie.
  3. Motivation: We Always Do Tokenization [Figure: billions of queries flow through the tokenizer into BERT, even at inference time, e.g., Google Web Search.] A much faster tokenizer is required.
  4. WordPiece Tokenization Input sentence: … the superman is …
  5. WordPiece Tokenization Input sentence: … the superman is … Word tokenization: super ##man.
  6. WordPiece Tokenization Input sentence: … the superman is … Word tokenization: super ##man, by a left-to-right longest-match-first algorithm over the vocabulary.
  7. WordPiece Tokenization Input sentence: … the superman is … Word tokenization: super ##man, by a left-to-right longest-match-first algorithm over the vocabulary. ## is a special prefix indicating that a token starts in the middle of a word.
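To make this procedure concrete, here is a minimal Python sketch of left-to-right longest-match-first (MaxMatch), assuming the toy vocabulary used on the following slides; the function name and the [UNK] fallback are illustrative, not the paper's code.

```python
def max_match(word, vocab):
    """Left-to-right longest-match-first (MaxMatch), as described above."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # mid-word pieces carry the ## prefix
            if cand in vocab:
                piece = cand  # longest match first: stop at the first hit
                break
            end -= 1  # shrink the candidate span by one character
        if piece is None:
            return ["[UNK]"]  # nothing in the vocabulary matches
        tokens.append(piece)
        start = end
    return tokens

vocab = {"a", "abcdx", "##b", "##c", "##cdy", "##dz"}
print(max_match("abcdz", vocab))  # ['a', '##b', '##c', '##dz']
```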
  8. WordPiece Tokenization (Building Trie) • Example of trie construction for the vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz.
  9. WordPiece Tokenization (Building Trie) • Example of trie construction for the vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz. [Trie figure: states 0–13 connected by character edges a, b, c, d, x, #, y, z; legend: accepting state, non-accepting state, trie edge.]
  10. WordPiece Tokenization (Building Trie) [Same trie figure; state 0 is marked as the initial state.]
  11. WordPiece Tokenization (Building Trie) [Same trie figure; the accepting states are labeled: accept “a”, “abcdx”, “##b”, “##c”, “##cdy”, “##dz”.]
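A minimal sketch of this trie in Python, built from nested dicts; the reserved "$" key marking accepting states is an illustrative convention (it assumes no vocabulary entry uses "$" as a character), not the paper's data structure.

```python
def build_trie(vocab):
    """Build the slide's trie: one node per prefix, "$" marks acceptance."""
    root = {}
    for token in vocab:
        node = root
        for ch in token:  # "##..." entries descend through two '#' edges
            node = node.setdefault(ch, {})
        node["$"] = token  # an accepting state stores the token it accepts
    return root

trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
node = trie
for ch in "abcdx":  # follow edges a -> b -> c -> d -> x
    node = node[ch]
print(node["$"])  # abcdx: this state accepts "abcdx"
```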
  12. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty).
  13. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP 〇.
  14. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ①.
  15. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ②.
  16. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ③.
  17. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ④.
  18. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. STEP ⑤: no transition from d to z; yield the latest accepted token (a) and transit to state 2. Tokenized sequence: a.
  19. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: a. STEP ⑥.
  20. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. STEP ⑦: no transition from b to c; yield the latest accepted token (##b) and transit to state 2. Tokenized sequence: a ##b.
  21. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: a ##b. STEP ⑧.
  22. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: a ##b. STEP ⑨.
  23. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. STEP ⑩: no transition from d to z; yield the latest accepted token (##c) and transit to state 2. Tokenized sequence: a ##b ##c.
  24. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: a ##b ##c. STEP ⑪.
  25. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. Tokenized sequence: a ##b ##c. STEP ⑫.
  26. WordPiece Tokenization (MaxMatching) [Same trie figure.] Input word: abcdz. STEP ⑬: no transition from d to _ (end of word); yield the latest accepted token (##dz). Tokenized sequence: a ##b ##c ##dz.
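The walkthrough above can be written as a trie walk that, on a failed transition, yields the latest accepted token and restarts right after it, re-reading characters. A sketch, reusing build_trie from the earlier snippet; this re-reading is exactly the problem the next slide names.

```python
def trie_max_match(word, trie):
    """MaxMatch over the trie, mirroring steps 〇–⑬ above."""
    tokens, start = [], 0
    while start < len(word):
        # Mid-word pieces start from the "##" node (state 2 on the slides).
        node = trie["#"]["#"] if start > 0 else trie
        last_token, last_end, i = None, start, start
        while i < len(word) and word[i] in node:
            node = node[word[i]]
            i += 1
            if "$" in node:  # remember the latest accepting state
                last_token, last_end = node["$"], i
        if last_token is None:
            return ["[UNK]"]  # no prefix matched: unknown word
        tokens.append(last_token)
        start = last_end  # restart: characters after last_end are read again
    return tokens

trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
print(trie_max_match("abcdz", trie))  # ['a', '##b', '##c', '##dz']
```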
  27. Problem • The same characters are read multiple times, causing O(n²) computation. • This wastes time on long words, such as “interesting” (11 chars). [Figure: reading steps over the input abcdz; steps 〇–⑤ yield a, ⑥–⑦ yield ##b, ⑧–⑩ yield ##c, ⑪–⑬ yield ##dz.]
  32. ▷ Go on to Proposed Method…

  33. Proposed: LinMaxMatch [Same trie figure.] Input word: abcdz. STEP ③. We already know that the first tokens should be [a, ##b] and that the next state should be 9 if the search stops here. → We don't need to read “bc” again.
  34. Proposed: LinMaxMatch [Same figure and step as slide 33.] Annotation (1): yield a and move to state 2 if the transition at state 5 fails.
  35. Proposed: LinMaxMatch [Same figure and step as slide 33.] Annotation (2): read b.
  36. Proposed: LinMaxMatch [Same figure and step as slide 33.] Annotation (3): read c and yield ##b.
  37. Proposed: LinMaxMatch [Same figure and step as slide 33.] Annotation (4): move from 2 to 4 by reading c.
  38. Proposed: LinMaxMatch [Same figure; the state reached at STEP ③ is annotated with “[a, ##b] Next: 9”.] Add this information to the trie as precomputation for the failure transition.
  39. Proposed: LinMaxMatch [Same figure; every state is annotated with its failure pops and failure link: “[ ] Next: ∅” (four states with nothing to pop), “[a] Next: 2”, “[a] Next: 8”, “[a, ##b] Next: 9”, “[a, ##b] Next: 10”, “[abcdx] Next: 2”, “[##b] Next: 2”, “[##c] Next: 2”, “[##c] Next: 12”, “[##cdy] Next: 2”, “[##dz] Next: 2”.] • Fill in this information for all states as precomputation. • I.e., cache the O(n²) computation in advance.
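A sketch of this precomputation in Python over the dict trie from the earlier snippets; it follows the idea of failure pops and failure links but is not the paper's Algorithm 2 verbatim. Each node gets reserved keys "POPS" (tokens to yield when stuck) and "FAIL" (state to jump to); in this sketch the "##" subtree is processed first, because failure links of ordinary nodes point into it.

```python
from collections import deque

def precompute_failures(trie):
    """Annotate each node with failure pops ("POPS") and a failure link
    ("FAIL"). Multi-character keys cannot collide with 1-char trie edges."""
    suffix_root = trie["#"]["#"]  # the "##" state (state 2 on the slides)
    for node in (trie, trie["#"], suffix_root):
        node["POPS"], node["FAIL"] = [], None  # "[ ] Next: ∅" states

    def annotate(u, c, v):  # v = child of u via character c
        if "$" in v:  # accepting state: pop its own token, restart at "##"
            v["POPS"], v["FAIL"] = [v["$"]], suffix_root
            return
        # Chase u's failure chain until some state has an edge for c,
        # accumulating the tokens popped along the way.
        pops, z = list(u["POPS"]), u["FAIL"]
        while z is not None and c not in z:
            pops, z = pops + z["POPS"], z["FAIL"]
        v["POPS"] = pops if z is not None else []
        v["FAIL"] = z[c] if z is not None else None

    # BFS the "##" subtree first, then the rest of the trie.
    for start in (suffix_root, trie):
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for c, v in u.items():
                if len(c) != 1 or c == "$":
                    continue  # skip "$", "POPS", "FAIL" annotations
                if "FAIL" not in v:
                    annotate(u, c, v)
                    queue.append(v)
    return trie
```

On the toy vocabulary this reproduces the slide's table: e.g., the state for “abcd” gets pops [a, ##b] with a link to the “##cd” state, accepting states point back to the “##” state, and the state for “##d” gets [ ] with no link.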
  40. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP 〇.
  41. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ①.
  42. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ②.
  43. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ③.
  44. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: (empty). STEP ④.
  45. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. STEP ⑤: no transition from d to z; yield [a, ##b] and transit to state 10. Tokenized sequence: a ##b.
  46. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. STEP ⑤′: no transition from d to z; yield [##c] and transit to state 12. Tokenized sequence: a ##b ##c.
  47. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. Tokenized sequence: a ##b ##c. STEP ⑤″.
  48. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. STEP ⑤″: we can pass through from 6 to 13 just by reading z. Tokenized sequence: a ##b ##c.
  49. Example: Tokenization (LinMaxMatch) [Same annotated trie figure.] Input word: abcdz. STEP ⑥: no transition from z to _ (end of word); yield [##dz]. Tokenized sequence: a ##b ##c ##dz.
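Putting the pieces together, a sketch of the linear-time matching loop over the annotated trie (again illustrative, not the paper's code): each character is consumed exactly once, and a failed transition yields the precomputed pops and jumps through the failure link instead of re-reading.

```python
def lin_max_match(word, trie):
    """LinMaxMatch-style loop over a trie prepared by precompute_failures."""
    suffix_root = trie["#"]["#"]
    tokens, node, i = [], trie, 0
    while i < len(word):
        if word[i] in node:
            node, i = node[word[i]], i + 1  # ordinary trie edge
        elif node["FAIL"] is None:
            return ["[UNK]"]  # stuck with nothing to pop: unknown word
        else:
            tokens += node["POPS"]  # yield tokens decided in advance
            node = node["FAIL"]
    while node["FAIL"] is not None:  # end of word: flush pending tokens
        tokens += node["POPS"]
        node = node["FAIL"]
    if node is not suffix_root and node is not trie:
        return ["[UNK]"]  # a leftover suffix never reached an accepting state
    return tokens

trie = precompute_failures(build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"]))
print(lin_max_match("abcdz", trie))  # ['a', '##b', '##c', '##dz']
```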
  50. Example: Tokenization • The proposed method reads the input only once. • Steps ⑤′ and ⑤″ cost nothing extra because they just follow the precomputed transitions. • This reduces the computation cost to linear, O(n). • This paper was originally titled “Linear WordPiece Tokenization”. [Figure: reading steps over the input abcdz; steps 〇–⑤ yield a, ⑤′ yields ##b, ⑤″ yields ##c, and ⑥ yields ##dz.]
  56. Experiments: Total Tokenization Speed • Vocabulary: BERT-base (Multilingual Cased model). • Data: 1,000 sentences covering 82 languages (Wikipedia). [Chart: WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++.] Note that all methods output the same tokenization.
  57. Experiments: Total Tokenization Speed • Vocabulary: BERT-base (Multilingual Cased model). • Data: 1,000 sentences covering 82 languages (Wikipedia). [Charts: tokenization speed per single word and per sentence, for WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++.] Note that all methods output the same tokenization.
  58. Experiments: Total Tokenization Speed • Vocabulary: BERT-base (Multilingual Cased model). • Data: 1,000 sentences covering 82 languages (Wikipedia). [Charts: tokenization speed per single word and per sentence.] Note that all methods output the same tokenization. The proposed method is the fastest!
  59. Experiments: Speeds against Word Length [Chart: tokenization time vs. word length (longer words are slower) for Rust WordPiece, C++ WordPiece, and the proposed WordPiece.] Much faster than the original implementation, especially for longer words!
  60. Overview and Comments • Target: the tokenization algorithm used in WordPiece (the tokenizer for BERT), i.e., longest-match-first (MaxMatch). • Problem: much faster tokenization is required in NLP, but the conventional implementation takes O(n²) time, where n is the word length. • Solution: propose a fast algorithm for longest-match-first (MaxMatch), realizing O(n) by adding precomputation to the vocabulary trie. • Comment: I'm impressed that xACL venues accept such an algorithmic paper on tokenization nowadays.