Paper Introduction: Fast WordPiece Tokenization

tatHi
December 11, 2021

Transcript

  1. Fast WordPiece Tokenization
    Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
    EMNLP 2021 (Main Conference)
    Presenter: Tatsuya Hiraoka (D3)
    2021/11/12 Paper Reading (Hiraoka) 1


  2. Overview
    • Target:
        • The tokenization algorithm used in WordPiece (the tokenizer for BERT)
        • Longest-match-first (MaxMatch)
    • Problem:
        • Much faster tokenization is required in NLP
        • The conventional implementation requires 𝑂(𝑛²) time, where 𝑛 is the word length
    • Solution:
        • Propose a fast algorithm for longest-match-first (MaxMatch) tokenization
        • Achieve an 𝑂(𝑛) algorithm by adding precomputed information to the vocabulary trie

  3. Motivation: We Always Do Tokenization
    [Figure: billions of queries pass through the tokenizer before reaching BERT]
    • Tokenization runs even at inference time, e.g., in Google Web Search
    • A much faster tokenizer is required.

  4-7. WordPiece Tokenization
    • Input sentence: the superman is
    • Word tokenization: superman → super ##man
    • Tokens are chosen from the vocabulary by a left-to-right, longest-match-first algorithm
    • ## is a special prefix indicating that the token starts in the middle of a word
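
A minimal sketch of the left-to-right, longest-match-first rule described on these slides. Only `super` and `##man` appear on the slide; the rest of the vocabulary, the function name `max_match`, and the `[UNK]` fallback for untokenizable words are assumptions made for illustration.

```python
def max_match(word, vocab, prefix="##"):
    """Naive longest-match-first tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # token starts mid-word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # assumed fallback when nothing matches
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary; only "super" and "##man" come from the slide.
vocab = {"the", "is", "super", "##man"}
print(max_match("superman", vocab))  # ['super', '##man']
```

Each candidate lookup here re-reads part of the word, which is the repeated work that the rest of the deck is about.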

  8-11. WordPiece Tokenization (Building Trie)
    • Example of trie construction for the vocabulary
    • Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz
    [Figure: trie over the vocabulary with states 0-13; state 0 is the initial state]
    • Trie edges:
        • 0 -a→ 3 -b→ 4 -c→ 5 -d→ 6 -x→ 7
        • 0 -#→ 1 -#→ 2
        • 2 -b→ 8
        • 2 -c→ 9 -d→ 10 -y→ 11
        • 2 -d→ 12 -z→ 13
    • Accepting states: 3 (accept “a”), 7 (accept “abcdx”), 8 (accept “##b”), 9 (accept “##c”), 11 (accept “##cdy”), 13 (accept “##dz”)
    • Non-accepting states: 0, 1, 2, 4, 5, 6, 10, 12
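
The trie construction can be sketched roughly as follows. The names `build_trie`, `children`, `accepts`, and `suffix_root` are hypothetical. Inserting the `##` prefix first and then the tokens in the order listed on the slide is an assumption chosen so that the state numbers come out as 0-13 like in the figure.

```python
def build_trie(vocab, prefix="##"):
    """Build a character trie; states are integers, state 0 is the initial state."""
    children = [{}]  # children[state] maps a character to the next state
    accepts = {}     # accepting state -> token it accepts

    def walk(text):
        state = 0
        for ch in text:
            if ch not in children[state]:
                children.append({})
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        return state

    suffix_root = walk(prefix)  # creates states 1 ("#") and 2 ("##")
    for token in vocab:
        accepts[walk(token)] = token
    return children, accepts, suffix_root


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]
children, accepts, suffix_root = build_trie(vocab)
print(len(children))   # 14 states
print(accepts)         # {3: 'a', 7: 'abcdx', 8: '##b', 9: '##c', 11: '##cdy', 13: '##dz'}
print(children[2])     # {'b': 8, 'c': 9, 'd': 12}
```

State 2 (the end of the `##` path) plays the role of a second root for tokens that start in the middle of a word.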

  12-26. WordPiece Tokenization (MaxMatching)
    • Input word: abcdz (same vocabulary and trie as on the trie-building slides)
    • Steps 0-5: starting from the initial state, read a, b, c, d. There is no transition from d to z,
      so yield the latest accepted token (a) and transit to state 2.
    • Steps 6-7: read b. There is no transition from b to c,
      so yield the latest accepted token (##b) and transit to state 2.
    • Steps 8-10: read c, d. There is no transition from d to z,
      so yield the latest accepted token (##c) and transit to state 2.
    • Steps 11-13: read d, z. There is no transition from z to _ (end of input),
      so yield the latest accepted token (##dz).
    • Tokenized sequence: a ##b ##c ##dz
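
The walk-through on these slides can be sketched as a trie-based MaxMatch that restarts from state 2 after every yielded token. This is an illustrative sketch, not the reference implementation: `build_trie`, `max_match_trie`, and the `[UNK]` fallback are assumed names and conventions.

```python
def build_trie(vocab, prefix="##"):
    children, accepts = [{}], {}

    def walk(text):
        state = 0
        for ch in text:
            if ch not in children[state]:
                children.append({})
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        return state

    suffix_root = walk(prefix)
    for token in vocab:
        accepts[walk(token)] = token
    return children, accepts, suffix_root


def max_match_trie(word, vocab, prefix="##"):
    children, accepts, suffix_root = build_trie(vocab, prefix)
    tokens, start = [], 0
    while start < len(word):
        # First token starts at the initial state, later ones at the "##" state.
        state = 0 if start == 0 else suffix_root
        last_token, last_end = None, start
        for i in range(start, len(word)):
            state = children[state].get(word[i])
            if state is None:          # no transition: stop scanning
                break
            if state in accepts:       # remember the latest accepted token
                last_token, last_end = accepts[state], i + 1
        if last_token is None:
            return ["[UNK]"]           # assumed fallback
        tokens.append(last_token)
        start = last_end               # characters after last_end are re-read
    return tokens


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]
print(max_match_trie("abcdz", vocab))  # ['a', '##b', '##c', '##dz']
```

Note how `start = last_end` throws away everything read past the last accepted token, so "bcd" is scanned again on later passes.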

  27-31. Problem
    • The same characters are read multiple times.
        • This causes 𝑂(𝑛²) computation.
        • A lot of time is wasted on long words,
          such as “interesting” (11 chars)
    • Reading steps needed to yield each token (input word: abcdz):
        • 〇 ① ② ③ ④ ⑤ → a
        • ⑥ ⑦ → ##b
        • ⑧ ⑨ ⑩ → ##c
        • ⑪ ⑫ ⑬ → ##dz
    • Tokenized sequence: a ##b ##c ##dz
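
To make the quadratic behaviour concrete, the sketch below counts how many characters a restart-based MaxMatch reads. The worst-case vocabulary is a made-up example (not from the slides): long tokens ending in `b` keep the trie path alive, so every pass scans to the end of the word but only accepts one character. The names `build_trie` and `count_reads` are hypothetical.

```python
def build_trie(vocab):
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node["<accept>"] = True  # multi-character key, cannot collide with an edge
    return root


def count_reads(word, vocab, prefix="##"):
    """Count character reads made by restart-based MaxMatch on one word."""
    root = build_trie(vocab)
    suffix_root = root
    for ch in prefix:
        suffix_root = suffix_root[ch]  # assumes at least one ##-token exists
    reads, start = 0, 0
    while start < len(word):
        node = root if start == 0 else suffix_root
        last_end = None
        for i in range(start, len(word)):
            reads += 1
            node = node.get(word[i])
            if node is None:
                break
            if "<accept>" in node:
                last_end = i + 1
        if last_end is None:
            return None  # word cannot be tokenized
        start = last_end
    return reads


for n in (10, 20, 40):
    # Hypothetical worst-case vocabulary for a word of n repeated characters.
    vocab = ["a", "##a", "a" * n + "b", "##" + "a" * n + "b"]
    print(n, count_reads("a" * n, vocab))  # 10 55, 20 210, 40 820
```

The counts follow n(n+1)/2, so doubling the word length roughly quadruples the work.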

  32. ▷ Go on to the proposed method…

  33-38. Proposed: LinMaxMatch
    • Input word: abcdz (same vocabulary and trie as on the trie-building slides)
    • If the search stops at state 5, we already know that the first tokenization should be [a, ##b]
      and that the next state should be 9.
      → We donʼt need to read “bc” again.
    • What the original MaxMatch would do if the transition at state 5 fails:
        ① Yield a and move to state 2.
        ② Read b.
        ③ Read c and yield ##b.
        ④ Move from state 2 to state 9 by reading c.
    • Add this information ([a, ##b], Next: 9) to the trie as a failure transition, as precomputation.

  39. Proposed: LinMaxMatch
    • Vocabulary: a, abcdx, ##b, ##c, ##cdy, ##dz (same trie as on the trie-building slides)
    • Precomputed information for each state:
        State   Tokens to yield on failure   Next
        0       [ ]                          ∅
        1       [ ]                          ∅
        2       [ ]                          ∅
        3       [a]                          2
        4       [a]                          8
        5       [a, ##b]                     9
        6       [a, ##b]                     10
        7       [abcdx]                      2
        8       [##b]                        2
        9       [##c]                        2
        10      [##c]                        12
        11      [##cdy]                      2
        12      [ ]                          ∅
        13      [##dz]                       2
    • Fill in this information for all states as precomputation.
    • I.e., cache the 𝑂(𝑛²) computation in advance.
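
One way to fill in this per-state information is a breadth-first pass over the trie, in the spirit of Aho-Corasick failure links. The sketch below is an assumption-laden illustration rather than the authors' code: the names `precompute`, `fail_link` ("Next"), and `fail_pops` (tokens to yield) are hypothetical, and the state numbering relies on inserting tokens in the order shown on the slide.

```python
from collections import deque


def build_trie(vocab, prefix="##"):
    children, accepts = [{}], {}

    def walk(text):
        state = 0
        for ch in text:
            if ch not in children[state]:
                children.append({})
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        return state

    suffix_root = walk(prefix)
    for token in vocab:
        accepts[walk(token)] = token
    return children, accepts, suffix_root


def precompute(vocab, prefix="##"):
    children, accepts, suffix_root = build_trie(vocab, prefix)
    fail_link = [None] * len(children)           # "Next" (None means ∅)
    fail_pops = [[] for _ in children]           # tokens to yield on failure
    queue = deque([0, suffix_root])
    while queue:
        u = queue.popleft()
        for ch, v in children[u].items():
            if v == suffix_root:
                continue                         # already queued as a root
            if v in accepts:
                # An accepting state yields its own token and restarts at "##".
                fail_link[v], fail_pops[v] = suffix_root, [accepts[v]]
            else:
                # Follow the parent's failure chain until ch can be read.
                z, extra = fail_link[u], []
                while z is not None and ch not in children[z]:
                    extra += fail_pops[z]
                    z = fail_link[z]
                if z is not None:
                    fail_link[v] = children[z][ch]
                    fail_pops[v] = fail_pops[u] + extra
            queue.append(v)
    return children, fail_link, fail_pops


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]
children, fail_link, fail_pops = precompute(vocab)
print(fail_pops[5], fail_link[5])  # ['a', '##b'] 9
print(fail_pops[6], fail_link[6])  # ['a', '##b'] 10
```

On this vocabulary the result matches the table on the slide, for example state 10 gets `['##c']` with next state 12.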

  40-49. Example: Tokenization (LinMaxMatch)
    • Input word: abcdz (same trie, with the precomputed information from the previous slide)
    • Steps 0-4: starting from the initial state, read a, b, c, d (reaching state 6).
    • Step 5: no transition from d to z. Yield [a, ##b] and transit to state 10.
    • Step 5′: still no transition from d to z. Yield [##c] and transit to state 12.
    • From state 12, read z and reach state 13: we can pass from 6 to 13 only by reading z.
    • Step 6: no transition from z to _ (end of input). Yield [##dz].
    • Tokenized sequence: a ##b ##c ##dz
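
The matching loop on these slides can be sketched as below. The three tables are hard-coded from the precomputed information shown on the slides (the assignment of entries to state numbers is a reconstruction). The names `match_loop`, `lin_max_match`, the trailing-space sentinel, and the `[UNK]` fallback are assumptions, and the corner case of a word equal to the bare `##` prefix is not handled.

```python
SUFFIX_ROOT = 2
CHILDREN = [
    {"a": 3, "#": 1}, {"#": 2}, {"b": 8, "c": 9, "d": 12}, {"b": 4},
    {"c": 5}, {"d": 6}, {"x": 7}, {}, {}, {"d": 10}, {"y": 11}, {},
    {"z": 13}, {},
]
FAIL_LINK = [None, None, None, 2, 8, 9, 10, 2, 2, 2, 12, 2, None, 2]
FAIL_POPS = [
    [], [], [], ["a"], ["a"], ["a", "##b"], ["a", "##b"], ["abcdx"],
    ["##b"], ["##c"], ["##c"], ["##cdy"], [], ["##dz"],
]


def match_loop(text):
    tokens, state, i = [], 0, 0
    while i < len(text):
        # On a missing transition, yield the precomputed tokens and jump.
        while text[i] not in CHILDREN[state]:
            if FAIL_LINK[state] is None:
                return tokens, state, i
            tokens.extend(FAIL_POPS[state])
            state = FAIL_LINK[state]
        state = CHILDREN[state][text[i]]
        i += 1  # each input character is consumed exactly once
    return tokens, state, i


def lin_max_match(word):
    # A trailing space (assumed absent from the vocabulary) flushes the last token.
    tokens, state, i = match_loop(word + " ")
    if i < len(word) or state not in (0, SUFFIX_ROOT):
        return ["[UNK]"]  # assumed fallback
    return tokens


print(lin_max_match("abcdz"))  # ['a', '##b', '##c', '##dz']
```

The input index `i` only ever moves forward; all the backtracking of the original algorithm is replaced by table lookups.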

  50-55. Example: Tokenization
    • The proposed method reads the input only once.
        • ⑤′ and ⑤″ cost nothing extra because they just follow the precomputed transitions.
        • This reduces the computation cost to linear, 𝑂(𝑛).
    • This paper was originally titled “Linear WordPiece Tokenization”.
    • Reading steps needed to yield each token (input word: abcdz):
        • 〇 ① ② ③ ④ ⑤ → a
        • ⑤ → ##b
        • ⑤′ → ##c
        • ⑤″ ⑥ → ##dz
    • Tokenized sequence: a ##b ##c ##dz

  56-58. Experiments: Total Tokenization Speed
    • Vocabulary: BERT-base (Multilingual Cased model)
    • Data: 1,000 sentences covering 82 languages (Wikipedia)
    • Compared systems: WordPiece in Rust, WordPiece in C++, and the proposed WordPiece in C++
    • Note that all methods output the same tokenization.
    [Table: tokenization speed for a single word and for each sentence, per system]
    • The proposed method is the fastest!

  59. Experiments: Speeds against Word Length
    [Figure: tokenization time against word length for Rust WordPiece, C++ WordPiece, and the proposed WordPiece; longer words are slower]
    • Much faster than the original implementations, especially for longer words!

  60. Overview and Comments
    • Target:
        • The tokenization algorithm used in WordPiece (the tokenizer for BERT)
        • Longest-match-first (MaxMatch)
    • Problem:
        • Much faster tokenization is required in NLP
        • The conventional implementation requires 𝑂(𝑛²) time, where 𝑛 is the word length
    • Solution:
        • Propose a fast algorithm for longest-match-first (MaxMatch) tokenization
        • Achieve an 𝑂(𝑛) algorithm by adding precomputed information to the vocabulary trie
    • Comment:
        • Iʼm impressed that *ACL venues accept such an algorithmic paper on tokenization nowadays.