for BERT)
• Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • Conventional implementations require 𝑂(𝑛²) time, where 𝑛 is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch) tokenization
  • Achieve an 𝑂(𝑛) algorithm by adding precomputation to the trie of the vocabulary
2021/11/12 Paper Reading (Hiraoka)
is … super ##man
[Figure: Input Sentence → Word Tokenization. Each word is split by the left-to-right longest-match-first algorithm over the vocabulary. “##” is a special prefix indicating that the token starts in the middle of a word.]
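The left-to-right longest-match-first rule can be sketched as below. This is a minimal illustration, not the implementation discussed in the paper: the function name `max_match`, the `[UNK]` fallback, and the toy vocabulary (borrowed from the trie example in this deck) are assumptions made for the example.

```python
def max_match(word, vocab, unk="[UNK]"):
    """Left-to-right longest-match-first tokenization of a single word.

    `unk` is a hypothetical fallback for words that cannot be tokenized.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # token starts in the middle of the word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing in the vocabulary matches at this position
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (assumed for illustration).
vocab = {"a", "abcdx", "##b", "##c", "##cdy", "##dz"}
result = max_match("abcdz", vocab)
print(result)  # ['a', '##b', '##c', '##dz']
```

Every failed candidate substring is built and looked up again from scratch, which is where the repeated work in the conventional approach comes from.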
the vocabulary
Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
[Figure: trie built from the vocabulary, with states 0–13. Legend: accepting state, non-accepting state, trie edge. The trie has one initial state, and its accepting states accept “a”, “abcdx”, “##b”, “##c”, “##cdy”, and “##dz”.]
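A trie like the one in the figure can be built in a few lines. This is a sketch under assumptions: the names `build_trie`, `edges`, and `accepts` are hypothetical, and states are numbered in insertion order, so the numbers do not necessarily match the state labels on the slide.

```python
def build_trie(vocab):
    """Build a character trie.

    edges[state] maps a character to the next state;
    accepts[state] is the token accepted at that state, or None.
    """
    edges = [{}]      # state 0 is the initial state
    accepts = [None]
    for token in vocab:
        state = 0
        for ch in token:
            if ch not in edges[state]:
                edges.append({})
                accepts.append(None)
                edges[state][ch] = len(edges) - 1
            state = edges[state][ch]
        accepts[state] = token  # mark the end of the token as accepting
    return edges, accepts


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]
edges, accepts = build_trie(vocab)
accepted = sorted(t for t in accepts if t is not None)
print(len(edges))  # 14 states, as in the figure (0-13)
print(accepted)
```

Storing the accepted token on the state makes it cheap to remember the latest accepted token while walking the trie.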
Input word: abcdz (vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz)
[Figure: the vocabulary trie (states 0–13, accepting and non-accepting states), traversed step by step.]
• Steps 〇–⑤: No transition from d to z. Yield the latest accepted token, a. Transit to state 2.
• Steps ⑥–⑦: No transition from b to c. Yield the latest accepted token, ##b. Transit to state 2.
• Steps ⑧–⑩: No transition from d to z. Yield the latest accepted token, ##c. Transit to state 2.
• Steps ⑪–⑬: No transition from z to _ (end of the word). Yield the latest accepted token, ##dz.
Tokenized sequence: a ##b ##c ##dz
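The walkthrough above can be reproduced with a small trie traversal that restarts from the “##” state after each yielded token and counts how many characters it reads. This is a sketch under assumptions: all function and variable names are hypothetical, state numbers follow insertion order rather than the slide, and the vocabulary is assumed to contain at least one “##” token.

```python
def build_trie(vocab):
    """edges[state]: char -> next state; accepts[state]: token or None."""
    edges, accepts = [{}], [None]
    for token in vocab:
        state = 0
        for ch in token:
            if ch not in edges[state]:
                edges.append({})
                accepts.append(None)
                edges[state][ch] = len(edges) - 1
            state = edges[state][ch]
        accepts[state] = token
    return edges, accepts


def naive_tokenize(word, edges, accepts):
    """Longest-match-first on the trie, re-reading characters after each yield.

    Returns (tokens, reads); tokens is None if the word cannot be tokenized.
    """
    suffix_root = edges[edges[0]["#"]]["#"]  # state reached by reading "##"
    tokens, reads, start = [], 0, 0
    while start < len(word):
        state = 0 if start == 0 else suffix_root
        last_token, last_end = None, start
        pos = start
        while pos < len(word):
            reads += 1                      # one character read (may fail)
            nxt = edges[state].get(word[pos])
            if nxt is None:
                break                       # no transition: stop this scan
            state, pos = nxt, pos + 1
            if accepts[state] is not None:  # remember the latest accepted token
                last_token, last_end = accepts[state], pos
        if last_token is None:
            return None, reads
        tokens.append(last_token)
        start = last_end                    # go back and re-read from here
    return tokens, reads


edges, accepts = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
tokens, reads = naive_tokenize("abcdz", edges, accepts)
print(tokens, reads)  # ['a', '##b', '##c', '##dz'] 12
```

For the 5-character word abcdz the scan performs 12 character reads, because b, c, d, and z are each read more than once.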
𝑂(𝑛²) computation.
• Wastes a lot of time when reading a long word
  • such as “interesting” (11 characters)
Input word: abcdz
Reading steps needed to yield each token:
• 〇 ① ② ③ ④ ⑤ → a
• ⑥ ⑦ → ##b
• ⑧ ⑨ ⑩ → ##c
• ⑪ ⑫ ⑬ → ##dz
Tokenized sequence: a ##b ##c ##dz
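A rough worst-case count shows where the quadratic cost comes from. The bound below is an illustrative assumption, not a figure from the slides: it supposes that every scan starting at position i reads all remaining characters of an n-character word but yields a token covering only one character.

```latex
\[
\sum_{i=1}^{n} (n - i + 1) \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad
n = 11 \;\Rightarrow\; \frac{11 \cdot 12}{2} = 66 .
\]
```

Under that assumption, an 11-character word such as “interesting” could need up to 66 character reads instead of 11.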
Input word: abcdz, at step ③ (state 5)
[Figure: the vocabulary trie (states 0–13) for the vocabulary a, abcdx, ##b, ##c, ##cdy, ##dz.]
• We already know that the first tokenization should be [a, ##b] and that the next state should be 9 if the search stops here.
  → We don't need to read “bc” again.
• How this is derived on the trie:
  1. Yield a and move to state 2 if the transition at state 5 fails.
  2. Read b.
  3. Read c and yield ##b.
  4. Move from state 2 to state 9 by reading c.
• Add this information ([a, ##b], Next: 9) to the trie as a precomputed failure transition.
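The precomputation can be sketched as a breadth-first pass over the trie that stores, for every state, the tokens to yield on failure and the state to continue from. This is a simplified sketch, not the paper's exact procedure: the names `fail_pops` and `fail_link` are hypothetical, state numbers follow insertion order rather than the slide, and the vocabulary is assumed to contain “##” tokens and no other tokens starting with “#”. The queue is seeded with both the “##” state and the initial state so that every failure target is filled in before it is needed.

```python
from collections import deque


def build_trie(vocab):
    """edges[state]: char -> next state; accepts[state]: token or None."""
    edges, accepts = [{}], [None]
    for token in vocab:
        state = 0
        for ch in token:
            if ch not in edges[state]:
                edges.append({})
                accepts.append(None)
                edges[state][ch] = len(edges) - 1
            state = edges[state][ch]
        accepts[state] = token
    return edges, accepts


def precompute_failures(edges, accepts):
    """For each state: tokens to yield on failure and the state to jump to."""
    suffix_root = edges[edges[0]["#"]]["#"]   # state reached by reading "##"
    fail_pops = [[] for _ in edges]           # [] and None mean "no failure info"
    fail_link = [None] * len(edges)
    queue = deque([suffix_root, 0])
    while queue:
        u = queue.popleft()
        for ch, v in edges[u].items():
            if v == suffix_root:
                continue                      # already seeded in the queue
            if accepts[v] is not None:
                # An accepting state yields its own token and restarts at "##".
                fail_pops[v], fail_link[v] = [accepts[v]], suffix_root
            else:
                # Reuse the parent's failure info, then keep following links
                # until some state can read `ch`.
                pops, z = list(fail_pops[u]), fail_link[u]
                while z is not None and ch not in edges[z]:
                    pops.extend(fail_pops[z])
                    z = fail_link[z]
                if z is not None:
                    fail_pops[v], fail_link[v] = pops, edges[z][ch]
            queue.append(v)
    return fail_pops, fail_link, suffix_root


def state_of(prefix, edges):
    """Helper: follow `prefix` from the initial state."""
    state = 0
    for ch in prefix:
        state = edges[state][ch]
    return state


edges, accepts = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
fail_pops, fail_link, suffix_root = precompute_failures(edges, accepts)
abc = state_of("abc", edges)
print(fail_pops[abc], fail_link[abc] == state_of("##c", edges))
# ['a', '##b'] True
```

For the state reached by “abc”, the stored information is [a, ##b] with the “##c” state as the next state, which corresponds to “[a, ##b], Next: 9” on the slide.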
Precomputed failure information for each state of the vocabulary trie (tokens to yield, next state):
State 0: [ ], Next: ∅
State 1: [ ], Next: ∅
State 2: [ ], Next: ∅
State 3: [a], Next: 2
State 4: [a], Next: 8
State 5: [a, ##b], Next: 9
State 6: [a, ##b], Next: 10
State 7: [abcdx], Next: 2
State 8: [##b], Next: 2
State 9: [##c], Next: 2
State 10: [##c], Next: 12
State 11: [##cdy], Next: 2
State 12: [ ], Next: ∅
State 13: [##dz], Next: 2
Input word: abcdz
• Step ⑤: No transition from d to z. Yield [a, ##b]. Transit to state 10.
• Step ⑤ʼ: No transition from d to z. Yield [##c]. Transit to state 12.
• Step ⑤”: We can pass through from state 6 to state 13 by reading only z.
• Step ⑥: No transition from z to _ (end of the word). Yield [##dz].
Tokenized sequence: a ##b ##c ##dz
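Tokenizing with the precomputed failure information can be sketched as follows. This is a simplified illustration rather than the paper's exact algorithm: all names are hypothetical, special cases such as an empty word are simply reported as untokenizable, and the vocabulary is assumed to contain “##” tokens and no other tokens starting with “#”.

```python
from collections import deque


def build_trie(vocab):
    """edges[state]: char -> next state; accepts[state]: token or None."""
    edges, accepts = [{}], [None]
    for token in vocab:
        state = 0
        for ch in token:
            if ch not in edges[state]:
                edges.append({})
                accepts.append(None)
                edges[state][ch] = len(edges) - 1
            state = edges[state][ch]
        accepts[state] = token
    return edges, accepts


def precompute_failures(edges, accepts):
    """Breadth-first precomputation of failure pops and failure links."""
    suffix_root = edges[edges[0]["#"]]["#"]
    fail_pops = [[] for _ in edges]
    fail_link = [None] * len(edges)
    queue = deque([suffix_root, 0])
    while queue:
        u = queue.popleft()
        for ch, v in edges[u].items():
            if v == suffix_root:
                continue
            if accepts[v] is not None:
                fail_pops[v], fail_link[v] = [accepts[v]], suffix_root
            else:
                pops, z = list(fail_pops[u]), fail_link[u]
                while z is not None and ch not in edges[z]:
                    pops.extend(fail_pops[z])
                    z = fail_link[z]
                if z is not None:
                    fail_pops[v], fail_link[v] = pops, edges[z][ch]
            queue.append(v)
    return fail_pops, fail_link, suffix_root


def fast_tokenize(word, edges, fail_pops, fail_link, suffix_root):
    """Read each character once; on failure, follow precomputed transitions.

    Returns None if the word cannot be tokenized with the vocabulary.
    """
    tokens, state = [], 0
    for ch in word:
        while ch not in edges[state]:
            if fail_link[state] is None:
                return None
            tokens.extend(fail_pops[state])   # yield precomputed tokens
            state = fail_link[state]          # jump without re-reading
        state = edges[state][ch]
    # End of the word: flush the remaining tokens.
    while state != suffix_root:
        if fail_link[state] is None:
            return None
        tokens.extend(fail_pops[state])
        state = fail_link[state]
    return tokens


edges, accepts = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
fail_pops, fail_link, suffix_root = precompute_failures(edges, accepts)
result = fast_tokenize("abcdz", edges, fail_pops, fail_link, suffix_root)
print(result)  # ['a', '##b', '##c', '##dz']
```

The input pointer never moves backwards. A failure only appends already-known tokens and changes the current state, which is why the failure steps add no extra character reads.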
one time
• ⑤ʼ and ⑤” have no cost because they just follow the precomputed transitions.
• This reduces the computation cost to linear, 𝑂(𝑛).
• This paper was originally titled “Linear WordPiece Tokenization”.
Input word: abcdz
Reading steps needed to yield each token:
• 〇 ① ② ③ ④ ⑤ → a, ##b
• ⑤ʼ → ##c
• ⑤” ⑥ → ##dz
Tokenized sequence: a ##b ##c ##dz
• Data: 1,000 sentences covering 82 languages (Wikipedia)
• Compared methods: WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++
  • Note that all methods output the same tokenization.
[Figures: tokenization speed for a single word; tokenization speed for each sentence.]
• The proposed method is the fastest!
WordPiece (tokenizer for BERT)
• Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • Conventional implementations require 𝑂(𝑛²) time, where 𝑛 is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch) tokenization
  • Achieve an 𝑂(𝑛) algorithm by adding precomputation to the trie of the vocabulary
• Comment
  • I'm impressed that xACL accepts such an algorithmic paper on tokenization nowadays.