Paper Introduction: Fast WordPiece Tokenization

Fast WordPiece Tokenization
Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
EMNLP 2021 (Main Conference)
Presenter: Tatsuya Hiraoka (D3)
2021/11/12 Paper Reading (Hiraoka)

Overview
• Target:
  • Tokenization algorithm used in WordPiece (the tokenizer for BERT)
  • Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • The conventional implementation requires 𝑂(𝑛²) time, where 𝑛 is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch)
  • Realize an 𝑂(𝑛) algorithm by adding precomputed information to the trie of the vocabulary

Motivation: We Always Do Tokenization
• Billions of queries → Tokenizer → BERT
• Tokenization runs even at inference time, e.g., in Google Web Search.
• A much faster tokenizer is required.

WordPiece Tokenization
• Input sentence: … the superman is …
• Word tokenization: superman → super ##man
• Tokens are selected by a left-to-right longest-match-first algorithm over the vocabulary.
• ## is a special prefix indicating that the token starts in the middle of a word.
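The left-to-right longest-match-first rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's or BERT's actual code: the function name `wordpiece_tokenize`, the toy vocabulary, and the `[UNK]` fallback for words that cannot be covered are all assumptions made for the example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Left-to-right longest-match-first tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # tokens inside a word carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # assumed fallback: nothing in the vocabulary fits here
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only to reproduce the slide's example.
toy_vocab = {"the", "is", "super", "##man"}
print(wordpiece_tokenize("superman", toy_vocab))  # ['super', '##man']
```

Each restart re-examines substrings of the remaining characters, which is the source of the quadratic cost discussed later in the talk.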

WordPiece Tokenization (Building Trie)
• Example of trie construction for the vocabulary
  • Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
• [Figure: trie with states 0 to 13. Legend: accepting state, non-accepting state, trie edge]
  • Edges: 0 -a→ 3 -b→ 4 -c→ 5 -d→ 6 -x→ 7
  • Edges: 0 -#→ 1 -#→ 2, then 2 -b→ 8, 2 -c→ 9 -d→ 10 -y→ 11, 2 -d→ 12 -z→ 13
• Initial state: 0
• Accepting states: 3 accepts “a”, 7 accepts “abcdx”, 8 accepts “##b”, 9 accepts “##c”, 11 accepts “##cdy”, 13 accepts “##dz”
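As a rough sketch of the construction, the trie for this vocabulary can be built as below. The `Trie` class and its attribute names are hypothetical. Creating the "##" path before inserting any token is an assumption made so that the state numbers come out as in the figure (state 2 is the state reached by "##").

```python
class Trie:
    """Character trie whose states are numbered in creation order."""

    def __init__(self, suffix_prefix="##"):
        self.children = [{}]   # children[state][char] -> next state
        self.token = [None]    # token accepted at each state, or None
        # Create the path for the suffix prefix first so that, for "##",
        # it occupies states 1 and 2.
        state = 0
        for ch in suffix_prefix:
            state = self._child(state, ch)
        self.suffix_root = state

    def _child(self, state, ch):
        """Return the child of `state` on `ch`, creating it if needed."""
        if ch not in self.children[state]:
            self.children[state][ch] = len(self.children)
            self.children.append({})
            self.token.append(None)
        return self.children[state][ch]

    def add(self, token):
        """Insert a vocabulary token and mark its final state as accepting."""
        state = 0
        for ch in token:
            state = self._child(state, ch)
        self.token[state] = token


trie = Trie()
for tok in ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]:
    trie.add(tok)

accepting = {s: t for s, t in enumerate(trie.token) if t is not None}
print(accepting)
# {3: 'a', 7: 'abcdx', 8: '##b', 9: '##c', 11: '##cdy', 13: '##dz'}
```

With this insertion order the trie has 14 states, and the accepting states match the ones highlighted on the slide.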

WordPiece Tokenization (MaxMatching)
• Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
• Input word: abcdz
• Step 〇: start at the initial state.
• Steps ① to ④: read a, b, c, d along the trie.
• Step ⑤: no transition from d to z. Yield the latest accepted token, a. Transit to state 2.
• Step ⑥: read b again, now from state 2.
• Step ⑦: no transition from b to c. Yield the latest accepted token, ##b. Transit to state 2.
• Steps ⑧ to ⑨: read c, d again.
• Step ⑩: no transition from d to z. Yield the latest accepted token, ##c. Transit to state 2.
• Steps ⑪ to ⑫: read d, z.
• Step ⑬: no transition from z to _ (end of input). Yield the latest accepted token, ##dz.
• Tokenized sequence: a ##b ##c ##dz
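The walkthrough above can be replayed with a small sketch. The trie is written out by hand under an assumed reading of the figure (state 0 is the root, state 2 is reached by "##"). The names `EDGES`, `ACCEPT` and `max_match` are illustrative, and the `[UNK]` fallback is an assumption. The function also counts how many times a character of the word is looked at. This count differs from the slide's step numbering, which also includes the start step and the end-of-input step.

```python
# Trie from the slides, written out by hand (assumed state numbering).
EDGES = {
    0: {"#": 1, "a": 3}, 1: {"#": 2}, 2: {"b": 8, "c": 9, "d": 12},
    3: {"b": 4}, 4: {"c": 5}, 5: {"d": 6}, 6: {"x": 7},
    9: {"d": 10}, 10: {"y": 11}, 12: {"z": 13},
}
ACCEPT = {3: "a", 7: "abcdx", 8: "##b", 9: "##c", 11: "##cdy", 13: "##dz"}
SUFFIX_ROOT = 2  # state reached after reading "##"


def max_match(word):
    """Return (tokens, number of character reads) for one word."""
    tokens, reads, start = [], 0, 0
    while start < len(word):
        # The first token starts at the root, later ones at the "##" state.
        state = 0 if start == 0 else SUFFIX_ROOT
        best_token, best_end = None, start
        pos = start
        while pos < len(word):
            reads += 1  # one look at word[pos]
            nxt = EDGES.get(state, {}).get(word[pos])
            if nxt is None:
                break  # no transition: fall back to the latest accepted token
            state, pos = nxt, pos + 1
            if state in ACCEPT:
                best_token, best_end = ACCEPT[state], pos
        if best_token is None:
            return ["[UNK]"], reads  # assumed fallback for uncoverable words
        tokens.append(best_token)
        start = best_end  # characters after the match are read again
    return tokens, reads


print(max_match("abcdz"))  # (['a', '##b', '##c', '##dz'], 12)
```

The 5-character word needs 12 character reads here because b, c, d and z are each visited more than once.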

Problem
• The same characters are read multiple times.
• This causes 𝑂(𝑛²) computation.
• Much time is wasted when reading a long word, such as “interesting” (11 chars).
• Reading steps needed to yield each token (input word: abcdz, tokenized sequence: a ##b ##c ##dz):
  • 〇 ① ② ③ ④ ⑤ → a
  • ⑥ ⑦ → ##b
  • ⑧ ⑨ ⑩ → ##c
  • ⑪ ⑫ ⑬ → ##dz
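To see where the quadratic bound comes from, consider an assumed worst case in which every restart reads all remaining characters but the yielded token covers only one character. The count below is an upper-bound illustration, not a measurement for any particular word or vocabulary.

```latex
% Worst case: the i-th restart reads the n - i + 1 remaining characters
% and yields a token that covers a single character.
\[
  \sum_{i=1}^{n} (n - i + 1) \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
  \qquad \text{e.g. } n = 11:\ \frac{11 \cdot 12}{2} = 66 \text{ character reads.}
\]
```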

▷ Go on to the Proposed Method…

Proposed: LinMaxMatch
• Consider step ③ for the input word abcdz: the search has read a, b, c and is at state 5.
• We already know that the tokenization so far should be [a, ##b] and that the next state should be 9 if the search stops here. → We don't need to read “bc” again.
• What the original algorithm would do if the transition at state 5 fails:
  1. Yield a and move to state 2.
  2. Read b.
  3. Read c and yield ##b.
  4. Move from state 2 to state 9 by reading c.
• Add this information ([a, ##b], Next: 9) to the trie as a failure transition, as precomputation.

• Fill in this information for all states as precomputation.
• I.e., cache the 𝑂(𝑛²) computation in advance.

  State  Tokens to yield  Next
  0      [ ]              ∅
  1      [ ]              ∅
  2      [ ]              ∅
  3      [a]              2
  4      [a]              8
  5      [a, ##b]         9
  6      [a, ##b]         10
  7      [abcdx]          2
  8      [##b]            2
  9      [##c]            2
  10     [##c]            12
  11     [##cdy]          2
  12     [ ]              ∅
  13     [##dz]           2
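One way to fill in this per-state information is a breadth-first pass over the trie, in the spirit of Aho-Corasick failure links. The sketch below is a simplified reading of the idea, not the authors' implementation. The trie is hand-written under an assumed state numbering (0 is the root, 2 is reached by "##"), and the names `EDGES`, `ACCEPT`, `fail`, `pops` and `precompute` are illustrative.

```python
from collections import deque

# Trie from the slides, written out by hand (assumed state numbering).
EDGES = {
    0: {"#": 1, "a": 3}, 1: {"#": 2}, 2: {"b": 8, "c": 9, "d": 12},
    3: {"b": 4}, 4: {"c": 5}, 5: {"d": 6}, 6: {"x": 7},
    9: {"d": 10}, 10: {"y": 11}, 12: {"z": 13},
}
ACCEPT = {3: "a", 7: "abcdx", 8: "##b", 9: "##c", 11: "##cdy", 13: "##dz"}
ROOT, SUFFIX_ROOT, NUM_STATES = 0, 2, 14


def precompute():
    """Return (fail, pops): next state and tokens to yield when a state fails."""
    fail = {s: None for s in range(NUM_STATES)}  # None plays the role of ∅
    pops = {s: [] for s in range(NUM_STATES)}
    queue = deque([ROOT, SUFFIX_ROOT])
    while queue:
        u = queue.popleft()
        for ch, v in EDGES.get(u, {}).items():
            if v == SUFFIX_ROOT:
                continue  # already queued as a start state
            if v in ACCEPT:
                # An accepted token is yielded whole; continue from "##".
                fail[v], pops[v] = SUFFIX_ROOT, [ACCEPT[v]]
            else:
                # Replay the parent's failure chain until `ch` can be read.
                z, extra = fail[u], []
                while z is not None and ch not in EDGES.get(z, {}):
                    extra += pops[z]
                    z = fail[z]
                if z is not None:
                    fail[v] = EDGES[z][ch]
                    pops[v] = pops[u] + extra
            queue.append(v)
    return fail, pops


fail, pops = precompute()
for state in range(NUM_STATES):
    print(state, pops[state], fail[state])
```

For state 5 this prints `['a', '##b']` and `9`, the entry derived by hand above. The breadth-first order ensures that a parent's entry is complete before its children are processed.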

Example: Tokenization (LinMaxMatch)
• Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
• Input word: abcdz
• Step 〇: start at the initial state.
• Steps ① to ④: read a, b, c, d and reach state 6.
• Step ⑤: no transition from d to z. Yield [a, ##b]. Transit to state 10.
• Step ⑤ʼ: still no transition from d to z at state 10. Yield [##c]. Transit to state 12.
• Step ⑤”: state 12 has a transition on z, so move to state 13. We can pass from state 6 to state 13 by reading z only once.
• Step ⑥: no transition from z to _ (end of input). Yield [##dz].
• Tokenized sequence: a ##b ##c ##dz
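The same walkthrough can be replayed with the precomputed table. In the sketch below, the trie edges and the per-state (tokens, next state) entries are copied by hand from the slides under an assumed state numbering. The names `EDGES`, `FAILURE` and `lin_max_match`, the appended space used as an end-of-word sentinel, and the `[UNK]` fallback are assumptions for the example. The special case of a word equal to "##" itself is left out.

```python
# Trie edges from the slides, written out by hand (assumed state numbering).
EDGES = {
    0: {"#": 1, "a": 3}, 1: {"#": 2}, 2: {"b": 8, "c": 9, "d": 12},
    3: {"b": 4}, 4: {"c": 5}, 5: {"d": 6}, 6: {"x": 7},
    9: {"d": 10}, 10: {"y": 11}, 12: {"z": 13},
}
# Precomputed entries: state -> (tokens to yield on failure, next state).
# States without an entry (0, 1, 2, 12) have "Next: ∅".
FAILURE = {
    3: (["a"], 2), 4: (["a"], 8), 5: (["a", "##b"], 9), 6: (["a", "##b"], 10),
    7: (["abcdx"], 2), 8: (["##b"], 2), 9: (["##c"], 2), 10: (["##c"], 12),
    11: (["##cdy"], 2), 13: (["##dz"], 2),
}
ROOT, SUFFIX_ROOT = 0, 2


def lin_max_match(word, unk="[UNK]"):
    """Tokenize one word while reading each character only once."""
    tokens, state, i = [], ROOT, 0
    text = word + " "  # sentinel with no trie edge, flushes the final tokens
    while i < len(text):
        # Follow precomputed failure transitions until text[i] can be read.
        while text[i] not in EDGES.get(state, {}):
            if state not in FAILURE:
                # Dead end: success only if the whole word was consumed
                # and we stopped at a start state.
                ok = i == len(word) and state in (ROOT, SUFFIX_ROOT)
                return tokens if ok else [unk]
            popped, state = FAILURE[state]
            tokens.extend(popped)
        state = EDGES[state][text[i]]
        i += 1
    return [unk]  # not reached: the sentinel always ends in a dead end


print(lin_max_match("abcdz"))  # ['a', '##b', '##c', '##dz']
```

The index `i` only moves forward, so no character of the word is read twice. Steps ⑤, ⑤ʼ and ⑤” correspond to the inner loop following the table entries for states 6 and 10 before z is finally consumed at state 12.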

Example: Tokenization
• The proposed method reads the input only once.
• Steps ⑤ʼ and ⑤” cost no additional reads because they just follow the precomputed transitions.
• This reduces the computation cost to linear, 𝑂(𝑛).
• This paper was originally titled “Linear WordPiece Tokenization”.
• Reading steps needed to yield each token (input word: abcdz, tokenized sequence: a ##b ##c ##dz):
  • 〇 ① ② ③ ④ ⑤ → a
  • ⑤ → ##b
  • ⑤ʼ → ##c
  • ⑤” ⑥ → ##dz

Experiments: Total Tokenization Speed
• Vocabulary: BERT-base (Multilingual Cased model)
• Data: 1000 sentences covering 82 languages (Wikipedia)
• Compared systems: WordPiece in Rust, WordPiece in C++, proposed WordPiece in C++
• Note that all methods output the same tokenization.
• [Figures: tokenization speed for a single word; tokenization speed for each sentence]
• The proposed method is the fastest!

Experiments: Speeds against Word Length
• [Figure: tokenization time against word length for Rust WordPiece, C++ WordPiece, and the proposed WordPiece. Longer words are slower to tokenize.]
• The proposed method is much faster than the original implementations, especially for longer words!

Overview and Comments
• Target:
  • Tokenization algorithm used in WordPiece (the tokenizer for BERT)
  • Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • The conventional implementation requires 𝑂(𝑛²) time, where 𝑛 is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch)
  • Realize an 𝑂(𝑛) algorithm by adding precomputed information to the trie of the vocabulary
• Comment:
  • I'm impressed that xACL accepts such an algorithmic paper on tokenization nowadays.
