Slide 1

Fast WordPiece Tokenization
Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
EMNLP 2021 (Main Conference)
Presenter: Tatsuya Hiraoka (D3)
2021/11/12 Paper Reading (Hiraoka)

Slide 2

Overview
• Target:
  • The tokenization algorithm used in WordPiece (the tokenizer for BERT)
  • Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • The conventional implementation requires O(n²) time, where n is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch)
  • Realize an O(n) algorithm by adding precomputation to the vocabulary trie

Slide 3

Motivation: We Always Do Tokenization
[Figure: billions of queries pass through the Tokenizer into BERT.]
• Tokenization runs even at inference time, e.g., in Google Web Search.
• A much faster tokenizer is required.

Slide 4

WordPiece Tokenization
• Input sentence: "… the superman is …"

Slide 5

WordPiece Tokenization
• Input sentence: "… the superman is …"
• Word tokenization: superman → super ##man

Slide 6

WordPiece Tokenization
• Input sentence: "… the superman is …"
• Word tokenization: superman → super ##man
• Left-to-right, longest-match-first matching against the vocabulary

Slide 7

WordPiece Tokenization
• Input sentence: "… the superman is …"
• Word tokenization: superman → super ##man
• Left-to-right, longest-match-first matching against the vocabulary
• The special prefix ## indicates that a token starts in the middle of a word

Slide 8

WordPiece Tokenization (Building Trie)
• Example of trie construction for the vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz

Slide 9

WordPiece Tokenization (Building Trie)
• Example of trie construction for the vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
[Figure: the trie, states 0-13, with an a-b-c-d-x chain for a/abcdx and #-# branches into b, c-d-y, and d-z for the ##-tokens. Legend: accepting state, non-accepting state, trie edge.]
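
As a concrete reference for the figure, here is a minimal sketch in Python of building such a trie. This is my own illustration, not the paper's C++ implementation, and the names (TrieNode, build_trie) are mine.

    class TrieNode:
        """One trie state; accepting states carry the vocabulary token."""
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.token = None    # set only on accepting states

    def build_trie(vocab):
        """Insert every vocabulary token, treating "##" as ordinary characters."""
        root = TrieNode()
        for token in vocab:
            node = root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.token = token   # mark the accepting state
        return root

    trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])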

Slide 10

WordPiece Tokenization (Building Trie)
• Same trie as above; state 0 is the initial state.

Slide 11

WordPiece Tokenization (Building Trie)
• The accepting states accept "a", "abcdx", "##b", "##c", "##cdy", and "##dz", respectively.

Slide 12

WordPiece Tokenization (MaxMatching)
• Input word: abcdz; the tokenized sequence starts empty.
• (Same vocabulary trie as above.)

Slide 13

WordPiece Tokenization (MaxMatching)
• STEP 〇: start at the initial state; nothing has been read yet.

Slide 14

WordPiece Tokenization (MaxMatching)
• STEP ①: read a.

Slide 15

WordPiece Tokenization (MaxMatching)
• STEP ②: read b.

Slide 16

WordPiece Tokenization (MaxMatching)
• STEP ③: read c.

Slide 17

WordPiece Tokenization (MaxMatching)
• STEP ④: read d.

Slide 18

WordPiece Tokenization (MaxMatching)
• STEP ⑤: no transition from d on z. Yield the latest accepted token: a. Transit to state 2 and restart reading from b.
Tokenized sequence: a

Slide 19

WordPiece Tokenization (MaxMatching)
• STEP ⑥: read b again.
Tokenized sequence: a

Slide 20

WordPiece Tokenization (MaxMatching)
• STEP ⑦: no transition from b on c. Yield the latest accepted token: ##b. Transit to state 2 and restart reading from c.
Tokenized sequence: a ##b

Slide 21

WordPiece Tokenization (MaxMatching)
• STEP ⑧: read c again.
Tokenized sequence: a ##b

Slide 22

WordPiece Tokenization (MaxMatching)
• STEP ⑨: read d again.
Tokenized sequence: a ##b

Slide 23

WordPiece Tokenization (MaxMatching)
• STEP ⑩: no transition from d on z. Yield the latest accepted token: ##c. Transit to state 2 and restart reading from d.
Tokenized sequence: a ##b ##c

Slide 24

WordPiece Tokenization (MaxMatching)
• STEP ⑪: read d again.
Tokenized sequence: a ##b ##c

Slide 25

WordPiece Tokenization (MaxMatching)
• STEP ⑫: read z.
Tokenized sequence: a ##b ##c

Slide 26

WordPiece Tokenization (MaxMatching)
• STEP ⑬: end of the word after z. Yield the latest accepted token: ##dz.
Final tokenized sequence: a ##b ##c ##dz
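
The loop just traced can be written down directly. Below is a minimal sketch of the naive MaxMatch scan, assuming the TrieNode/build_trie sketch from the earlier slide; the reads counter makes the re-reading of b, c, and d visible, and unknown-word handling is simplified.

    def max_match(word, root):
        """Naive longest-match-first: restart from the "##" state after
        every yielded token, re-reading characters along the way."""
        tokens, pos, reads = [], 0, 0
        suffix_root = root.children['#'].children['#']  # assumes ##-tokens exist
        while pos < len(word):
            node = root if pos == 0 else suffix_root
            longest, end = None, pos
            for i in range(pos, len(word)):
                reads += 1                        # one character probe
                if word[i] not in node.children:
                    break                         # failure: fall back to longest match
                node = node.children[word[i]]
                if node.token is not None:
                    longest, end = node.token, i + 1
            if longest is None:
                return ["[UNK]"], reads           # whole word becomes unknown
            tokens.append(longest)
            pos = end                             # next scan re-reads from here
        return tokens, reads

    trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
    print(max_match("abcdz", trie))  # (['a', '##b', '##c', '##dz'], 12): steps 1-12 on the slides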

Slide 27

Problem
• The same characters are read multiple times.
• This causes O(n²) computation, where n is the word length.
• It wastes time on long words, such as "interesting" (11 characters).
Reading steps to yield each token (input abcdz, output a ##b ##c ##dz):
〇①②③④⑤ → a, ⑥⑦ → ##b, ⑧⑨⑩ → ##c, ⑪⑫⑬ → ##dz
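
To make the quadratic behavior concrete, here is a hypothetical worst case, reusing build_trie and max_match from the sketches above. The long vocabulary entries are chosen only for illustration: they keep every scan deep in the trie while only a single character is ever accepted, so each restart re-reads almost the whole word.

    # Hypothetical vocabulary, constructed only to trigger the worst case.
    m = 200
    vocab = ["a", "##a", "a" * m + "b", "##" + "a" * m + "b"]
    trie = build_trie(vocab)
    tokens, reads = max_match("a" * m, trie)
    print(len(tokens), reads)  # 200 tokens, 20100 reads: m*(m+1)/2, about m^2/2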

Slide 32

▷ Go on to the proposed method…

Slide 33

Proposed: LinMaxMatch
• STEP ③: we already know that the first tokens must be [a, ##b] and that the next state should be 9 if the search stops here.
• → We don't need to read "bc" again.

Slide 34

Proposed: LinMaxMatch
• (1) Yield a and move to state 2 if the transition at state 5 fails.

Slide 35

Proposed: LinMaxMatch
• (2) Then read b.

Slide 36

Proposed: LinMaxMatch
• (3) Then read c and yield ##b.

Slide 37

Proposed: LinMaxMatch
• (4) Move from state 2 to state 4 by reading c.

Slide 38

Proposed: LinMaxMatch
• Store this information on the trie state: pops [a, ##b], Next: 9.
• Add such failure-transition information to the trie as precomputation.

Slide 39

Proposed: LinMaxMatch
• Fill in this information (tokens to pop and the next state) for all states as precomputation.
• i.e., cache the O(n²) computation in advance.
[Figure: every trie state annotated with its precomputed data, e.g. [a] Next: 2; [a] Next: 8; [a, ##b] Next: 9; [a, ##b] Next: 10; [abcdx] Next: 2; [##b] Next: 2; [##c] Next: 2; [##c] Next: 12; [##cdy] Next: 2; [##dz] Next: 2; and [ ] Next: ∅ for the remaining states.]
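
A sketch of this precomputation, in the spirit of the paper's Aho-Corasick-style construction; this is my own Python reconstruction (names like fail_pops/fail_link are mine), assuming the TrieNode/build_trie sketch from earlier. Running it on the example vocabulary reproduces the annotations in the figure.

    from collections import deque

    def precompute(root):
        """Attach (fail_pops, fail_link) to every node of a trie built by
        build_trie. Assumes the vocabulary contains "##"-prefixed tokens."""
        suffix_root = root.children['#'].children['#']   # the "##" state
        for n in (root, root.children['#'], suffix_root):
            n.fail_pops, n.fail_link = [], None          # the "Next: emptyset" states
        # Joint BFS from both roots, so that states matching shorter
        # suffixes are always finished before the states that depend on them.
        queue = deque((v, start, c)
                      for start in (root, suffix_root)
                      for c, v in start.children.items()
                      if not (start is root and c == '#'))
        while queue:
            v, u, c = queue.popleft()
            if v.token is not None:
                # Accepting state: pop the whole token, restart at "##".
                v.fail_pops, v.fail_link = [v.token], suffix_root
            else:
                # Walk the parent's failure chain until c can be matched.
                pops, f = list(u.fail_pops), u.fail_link
                while f is not None and c not in f.children:
                    pops, f = pops + f.fail_pops, f.fail_link
                v.fail_link = f.children[c] if f is not None else None
                v.fail_pops = pops if v.fail_link is not None else []
            queue.extend((w, v, c2) for c2, w in v.children.items())
        return suffix_root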

Slide 40

Example: Tokenization (LinMaxMatch)
• STEP 〇: start with input word abcdz; the tokenized sequence is empty.
• (Same trie, now annotated with the precomputed failure data.)

Slide 41

Example: Tokenization (LinMaxMatch)
• STEP ①: read a.

Slide 42

Example: Tokenization (LinMaxMatch)
• STEP ②: read b.

Slide 43

Example: Tokenization (LinMaxMatch)
• STEP ③: read c.

Slide 44

Example: Tokenization (LinMaxMatch)
• STEP ④: read d.

Slide 45

Example: Tokenization (LinMaxMatch)
• STEP ⑤: no transition from d on z. Yield the precomputed pops [a, ##b]. Transit to state 10.
Tokenized sequence: a ##b

Slide 46

Example: Tokenization (LinMaxMatch)
• STEP ⑤′: still no transition on z. Yield [##c]. Transit to state 12.
Tokenized sequence: a ##b ##c

Slide 47

Example: Tokenization (LinMaxMatch)
• STEP ⑤″
Tokenized sequence: a ##b ##c

Slide 48

Example: Tokenization (LinMaxMatch)
• STEP ⑤″: we can now pass through to state 13 just by reading z.
Tokenized sequence: a ##b ##c

Slide 49

Example: Tokenization (LinMaxMatch)
• STEP ⑥: end of the word (no transition from z). Yield [##dz].
Final tokenized sequence: a ##b ##c ##dz
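
Putting it together, here is a sketch of the linear-time scan (again my own reconstruction, relying on the build_trie and precompute sketches above): each character is inspected once, and a failed transition is handled by emitting the precomputed pops and following the failure link, exactly as in steps ⑤, ⑤′, ⑤″, and ⑥.

    def lin_max_match(word, root, suffix_root):
        """One left-to-right pass; amortized O(n)."""
        if not word:
            return []
        tokens, node = [], root
        for ch in word:
            while ch not in node.children:       # failed transition
                if node.fail_link is None:
                    return ["[UNK]"]             # no way to recover: unknown word
                tokens += node.fail_pops         # e.g. [a, ##b] at step 5
                node = node.fail_link
            node = node.children[ch]             # each character is read exactly once
        while node is not suffix_root:           # end of word: flush remaining pops
            if node.fail_link is None:
                return ["[UNK]"]
            tokens += node.fail_pops
            node = node.fail_link
        return tokens

    trie = build_trie(["a", "abcdx", "##b", "##c", "##cdy", "##dz"])
    suffix_root = precompute(trie)
    print(lin_max_match("abcdz", trie, suffix_root))  # ['a', '##b', '##c', '##dz']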

Slide 50

Example: Tokenization
• The proposed method reads the input only once.
• ⑤′ and ⑤″ cost nothing extra: they just follow the precomputed transitions.
• This reduces the computation cost to linear, O(n).
• This paper was originally titled "Linear-Time WordPiece Tokenization".
Reading steps to yield each token: 〇①②③④⑤ → a, ⑤ → ##b, ⑤′ → ##c, ⑤″ ⑥ → ##dz

Slide 56

Experiments: Total Tokenization Speed
• Vocabulary: BERT-base (Multilingual, Cased model)
• Data: 1,000 sentences covering 82 languages (Wikipedia)
• Systems: WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++
• Note that all methods output the same tokenization.

Slide 57

Experiments: Total Tokenization Speed
• Same setup as above.
[Figure: tokenization speed for a single word and for each sentence, for WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++.]

Slide 58

Experiments: Total Tokenization Speed
• Same plots as above.
• The proposed method is the fastest!

Slide 59

Experiments: Speed against Word Length
[Figure: tokenization time against word length (the longer the word, the slower) for Rust WordPiece, C++ WordPiece, and the proposed WordPiece.]
• Much faster than the original implementations, especially for longer words!

Slide 60

Overview and Comments
• Target:
  • The tokenization algorithm used in WordPiece (the tokenizer for BERT)
  • Longest-match-first (MaxMatch)
• Problem:
  • Much faster tokenization is required in NLP
  • The conventional implementation requires O(n²) time, where n is the word length
• Solution:
  • Propose a fast algorithm for longest-match-first (MaxMatch)
  • Realize an O(n) algorithm by adding precomputation to the vocabulary trie
• Comment:
  • I'm impressed that xACL accepts such an algorithmic paper on tokenization nowadays.