BERT

tomohideshibata
October 20, 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language

Transcript

  1. BERT: Pre-training of Deep
    Bidirectional Transformers for
    Language Understanding
    Tomohide Shibata
    18/10/18
    Bidirectional Encoder Representations
    from Transformers


  2. Related Papers
    • Deep Contextualized Word Representations (ELMo)
    [Washington Univ. & AI2, 2018.2]
    • Improving Language Understanding by Generative
    Pre-Training (GPT) [OpenAI, 2018.6]
    • BERT: Pre-training of Deep Bidirectional
    Transformers for Language Understanding
    [GoogleAI, 2018.10]
    2


  3. [Slide 3 shows paper excerpts: a description of a deep highway-BiLSTM model
    for semantic role labeling, the GPT paper's figure on fine-tuning with
    task-specific input transformations, and the BERT paper's Figure 1.]
    Figure 1 (BERT paper): Differences in pre-training model architectures. BERT
    uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer.
    ELMo uses the concatenation of independently trained left-to-right and
    right-to-left LSTMs to generate features for downstream tasks. Among the three,
    only BERT representations are jointly conditioned on both left and right
    context in all layers.
    Slide annotations: shallow concatenation of left-to-right and right-to-left
    (ELMo, feature-based); left-to-right language model (GPT, fine-tuning);
    bidirectional conditioning, integrated architecture (BERT); task-specific
    architecture.


  4. Model Architecture
    4
    The Annotated Transformer:
    http://nlp.seas.harvard.edu/2018/04/03/attention.html
    Transformer [Vaswani+ 2017]
    [Excerpt from the BERT paper: Figure 1 (architecture comparison) and the model
    description. BERTBASE was chosen to have an identical model size as OpenAI GPT
    for comparison purposes. Critically, however, the BERT Transformer uses
    bidirectional self-attention, while the GPT Transformer uses constrained
    self-attention where every token can only attend to context to its left.]
    • L: # of layers
    • H: hidden size
    • A: # of self-attention heads
    • BERTBASE: L=12, H=768, A=12 (same size as GPT)
    • BERTLARGE: L=24, H=1024, A=16
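    A minimal sketch of the two configurations listed above (the config class and
    field names are illustrative; only the L/H/A values come from the paper):

        # Illustrative config container; L/H/A values are from the slide.
        from dataclasses import dataclass

        @dataclass
        class BertConfig:
            num_layers: int   # L: number of Transformer blocks
            hidden_size: int  # H: hidden size
            num_heads: int    # A: number of self-attention heads

        BERT_BASE  = BertConfig(num_layers=12, hidden_size=768,  num_heads=12)  # same size as GPT
        BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)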


  5. Input Representation
    5
    Input:               [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
    Token Embeddings:    E_[CLS] E_my E_dog E_is E_cute E_[SEP] E_he E_likes E_play E_##ing E_[SEP]
    Segment Embeddings:  E_A     E_A  E_A   E_A  E_A    E_A     E_B  E_B     E_B    E_B     E_B
    Position Embeddings: E_0     E_1  E_2   E_3  E_4    E_5     E_6  E_7     E_8    E_9     E_10
    Figure 2: BERT input representation. The input embeddings are the sum of the
    token embeddings, the segment embeddings and the position embeddings.
    • The first token of every sequence is always the special classification
    embedding ([CLS]). The final hidden state (i.e., output of the Transformer)
    for this token is used as the aggregate sequence representation for
    classification tasks.
    WordPiece
    for classification / for sent. pairs
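    A small sketch of how the three embeddings are summed into the input
    representation (toy sizes and made-up token ids; not the reference
    implementation):

        import numpy as np

        # Toy sizes; BERT-Base uses H=768, a ~30k WordPiece vocab, 512 positions.
        H, VOCAB, MAX_LEN = 8, 50, 16
        rng = np.random.default_rng(0)
        token_emb    = rng.normal(size=(VOCAB, H))    # one row per WordPiece id
        segment_emb  = rng.normal(size=(2, H))        # sentence A = 0, sentence B = 1
        position_emb = rng.normal(size=(MAX_LEN, H))  # learned absolute positions

        def input_representation(token_ids, segment_ids):
            """Input embedding = token + segment + position, summed element-wise."""
            positions = np.arange(len(token_ids))
            return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

        # [CLS] my dog is cute [SEP] he likes play ##ing [SEP]   (ids are made up)
        ids      = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5]
        segments = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
        print(input_representation(ids, segments).shape)   # (11, 8)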


  6. Pre-training Tasks:
    1. Masked LM (1/2)
    • Standard Language Model (LM) is left-to-right
    or right-to-left
    → “deeply bidirectional” is better
    • If deeply bidirectional conditioning is adopted
    in a standard LM, the "see itself" problem arises
    6
    the man went to …
    man went to …
    cheating!


  7. Pre-training Tasks:
    1. Masked LM (2/2)
    • Solution: Masked LM
    = Cloze task or CBOW in word2vec
    • Mask 15% of tokens
    • Predict the masked token given deep
    bidirectional representations
    7
    the man [MASK1] to [MASK2] store
    went a
    "Bidirectional Transformer" is a little confusing: it means a Transformer
    that sees both left- and right-side context.
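    A rough sketch of the masking step (the helper below just masks 15% of
    positions; the paper's 80/10/10 replacement details are noted only in the
    comment):

        import random

        def mask_tokens(tokens, mask_rate=0.15, seed=0):
            """Replace ~15% of tokens with [MASK]; masked positions become the
            prediction targets. (The paper additionally keeps 10% of selected
            tokens unchanged and swaps 10% for random tokens.)"""
            rng = random.Random(seed)
            masked, targets = list(tokens), {}
            n = max(1, round(len(tokens) * mask_rate))
            for i in rng.sample(range(len(tokens)), n):
                targets[i] = tokens[i]
                masked[i] = "[MASK]"
            return masked, targets

        print(mask_tokens("the man went to a store".split()))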


  8. Pre-training Tasks:
    2. Next Sentence Prediction
    • Understanding the relation between
    sentences is important in QA and Inference
    → Next sentence prediction task
    8
    [CLS] the man went to the store [SEP] he bought
    a gallon of milk
    Label: IsNext
    [CLS] the man [MASK] to the store [SEP] penguin
    [MASK] are flight ##less birds [SEP]
    Label: NotNext
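    A toy sketch of how such sentence pairs could be generated (the 50/50 sampling
    of actual vs. random next sentences follows the paper; everything else here is
    illustrative):

        import random

        def make_nsp_example(sent_a, next_sent, corpus_sentences, seed=0):
            """Pair sent_a with its true successor (IsNext) half of the time,
            otherwise with a random corpus sentence (NotNext)."""
            rng = random.Random(seed)
            if rng.random() < 0.5:
                sent_b, label = next_sent, "IsNext"
            else:
                sent_b, label = rng.choice(corpus_sentences), "NotNext"
            return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

        print(make_nsp_example("the man went to the store",
                               "he bought a gallon of milk",
                               ["penguins are flight ##less birds"]))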


  9. Pre-Training Procedure
    • Corpus: BookCorpus (800M words) and
    English Wikipedia (2,500M words)
    • Batch size: 256 sequences × 512 tokens
    • Training:
    – BERTBASE: 4 TPUs in Pod configuration (16 TPU chips) → 4 days
    – BERTLARGE: 16 TPUs in Pod configuration (64 TPU chips) → 4 days
    – Time estimate for GPUs: 40–70 days with 8 GPUs
    http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/
    9


  10. Fine-Tuning:
    One additional Output Layer
    10
    [Excerpts of the BERT paper's task-specific fine-tuning figures: sentence-pair
    classification ([CLS] Sentence 1 [SEP] Sentence 2 → class label from C),
    single-sentence classification (→ class label from C), question answering
    ([CLS] Question [SEP] Paragraph → start/end span), and single-sentence tagging
    (→ per-token labels such as B-PER, O). In each case a single output layer is
    added on top of the pre-trained model.]
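    A hedged sketch of the "one additional output layer" idea for classification
    (PyTorch; the bert_encoder module and its output shape are assumptions, not
    the authors' code):

        import torch
        import torch.nn as nn

        class BertForClassification(nn.Module):
            """Pre-trained encoder plus one new linear layer on the [CLS] state."""
            def __init__(self, bert_encoder, hidden_size=768, num_labels=2):
                super().__init__()
                self.bert = bert_encoder  # assumed to return (batch, seq_len, hidden)
                self.classifier = nn.Linear(hidden_size, num_labels)  # only new params

            def forward(self, token_ids, segment_ids):
                hidden = self.bert(token_ids, segment_ids)
                cls_state = hidden[:, 0]  # C: final hidden state of [CLS]
                return torch.log_softmax(self.classifier(cls_state), dim=-1)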


  11. GLUE Results
    (General Language Understanding Evaluation,
    [Wang+ 18])
    11
    System            MNLI-(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Average
    (train size)      392k         363k  108k  67k    8.5k  5.7k   3.5k  2.5k  -
    Pre-OpenAI SOTA   80.6/80.1    66.1  82.3  93.2   35.0  81.0   86.0  61.7  74.0
    BiLSTM+ELMo+Attn  76.4/76.1    64.8  79.9  90.4   36.0  73.3   84.9  56.8  71.0
    OpenAI GPT        82.1/81.4    70.3  88.1  91.3   45.4  80.0   82.3  56.0  75.2
    BERTBASE          84.6/83.4    71.2  90.1  93.5   52.1  85.8   88.9  66.4  79.6
    BERTLARGE         86.7/85.9    72.1  91.1  94.9   60.5  86.5   89.3  70.1  81.9
    Table 1: GLUE Test results, scored by the GLUE evaluation server. The number
    below each task denotes the number of training examples. The "Average" column
    is slightly different than the official GLUE score, since we exclude the
    problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12,
    H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are
    single-model, single task. All results obtained from
    https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.
    RTE (Recognizing Textual Entailment) is a binary entailment task similar to
    MNLI, but with much less training data (Bentivogli et al., 2009).


  12. Question Answering Task: SQuAD
    12
    [Excerpt from the BERT paper: given a question and a paragraph containing the
    answer, the task is to predict the answer span in the paragraph. For example:
    Question: Where do water droplets collide with ice crystals to form precipitation?
    Paragraph: ... Precipitation forms as smaller droplets coalesce via collision
    with other rain drops or ice crystals within a cloud. ...
    Answer: within a cloud
    This type of span prediction task is quite different from the sequence
    classification tasks of GLUE, but we are able to adapt BERT to run on SQuAD.]
    System                       Dev EM  Dev F1  Test EM  Test F1
    Leaderboard (Oct 8th, 2018)
      Human                        -       -      82.3     91.2
      #1 Ensemble - nlnet          -       -      86.0     91.7
      #2 Ensemble - QANet          -       -      84.5     90.5
      #1 Single - nlnet            -       -      83.5     90.1
      #2 Single - QANet            -       -      82.5     89.3
    Published
      BiDAF+ELMo (Single)          -     85.8      -        -
      R.M. Reader (Single)       78.9    86.3     79.5     86.6
      R.M. Reader (Ensemble)     81.2    87.9     82.3     88.5
    Ours
      BERTBASE (Single)          80.8    88.5      -        -
      BERTLARGE (Single)         84.1    90.9      -        -
      BERTLARGE (Ensemble)       85.8    91.8      -        -
      BERTLARGE (Sgl.+TriviaQA)  84.2    91.1     85.1     91.8
      BERTLARGE (Ens.+TriviaQA)  86.2    92.2     87.4     93.2
    Table 2: SQuAD results. The BERT ensemble is 7x systems which use different
    pre-training checkpoints and fine-tuning seeds.
    [Excerpt of the BERT question-answering figure: [CLS] Question tokens [SEP]
    Paragraph tokens [SEP] → BERT → start/end span.]
    Then, the probability of word i being the start of the answer span is computed
    as a dot product between T_i and S followed by a softmax over all the words in
    the paragraph:
        P_i = exp(S · T_i) / Σ_j exp(S · T_j)
    The same formula is used for the end of the answer span, and the maximum
    scoring span is used as the prediction. The training objective is the
    log-likelihood of the correct start and end positions. We train for 3 epochs
    with a learning rate of 5e-5 and a batch size of 32. At inference time, since
    the end prediction is not conditioned on the start, we add the constraint that
    the end must come after the start, but no other heuristics are used.
    Our best performing system outperforms the top leaderboard system by +1.5 F1
    in ensembling and +1.3 F1 as a single system.
    Slide annotations: S = start vector, T_i = token representation → start prob.
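    The start/end scoring above, written out as a small sketch (S and E are the
    learned start/end vectors, T the paragraph token representations; the shapes
    are illustrative):

        import torch

        def span_probabilities(T, S, E):
            """P_i = exp(S·T_i) / sum_j exp(S·T_j), likewise for the end vector E.
            T: (seq_len, hidden) token representations; S, E: (hidden,) vectors."""
            start_probs = torch.softmax(T @ S, dim=0)
            end_probs = torch.softmax(T @ E, dim=0)
            return start_probs, end_probs

        T = torch.randn(320, 768)              # paragraph token representations
        S, E = torch.randn(768), torch.randn(768)
        start_probs, end_probs = span_probabilities(T, S, E)
        # At inference, the maximum-scoring (start, end) pair with end >= start is chosen.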


  13. Token Tagging Task:
    Named Entity Recognition
    13
    System                          Dev F1  Test F1
    ELMo+BiLSTM+CRF                 95.7    92.2
    CVT+Multi (Clark et al., 2018)   -      92.6
    BERTBASE                        96.4    92.4
    BERTLARGE                       96.6    92.8
    Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were
    selected using the Dev set, and the reported Dev and Test scores are averaged
    over 5 random restarts using those hyperparameters.
    We use the representation of the first sub-token as input to the classifier.
    For example:
    Jim   Hen   ##son  was  a  puppet  ##eer
    I-PER I-PER X      O    O  O       X
    where no prediction is made for X. Since the WordPiece tokenization boundaries
    are a known part of the input, this is done for both training and test. A
    visual representation is also given in Figure 3 (d). A cased WordPiece model
    is used for NER, whereas an uncased model is used for all other tasks.
    Results are presented in Table 3. BERTLARGE outperforms the existing SOTA,
    Cross-View Training with multi-task learning (Clark et al., 2018).
    [Excerpt of the BERT single-sentence tagging figure: [CLS] Tok 1 ... Tok N →
    BERT → per-token labels (B-PER, O, ...); no prediction is made for the X
    positions of continuation WordPieces.]
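    A small sketch of the label alignment described above: the first WordPiece of
    each word receives the word's tag, continuation pieces ("##...") receive X and
    get no prediction (tokenization here is hard-coded for illustration):

        def align_labels(wordpieces, word_labels):
            """Assign each word's tag to its first WordPiece; mark continuations with X."""
            aligned, word_idx = [], -1
            for piece in wordpieces:
                if piece.startswith("##"):
                    aligned.append("X")  # no prediction is made for X
                else:
                    word_idx += 1
                    aligned.append(word_labels[word_idx])
            return aligned

        pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
        print(align_labels(pieces, ["I-PER", "I-PER", "O", "O", "O"]))
        # ['I-PER', 'I-PER', 'X', 'O', 'O', 'O', 'X']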


  14. Ablation Studies
    14
    Dev Set
    Tasks           MNLI-m  QNLI   MRPC   SST-2  SQuAD
                    (Acc)   (Acc)  (Acc)  (Acc)  (F1)
    BERTBASE        84.4    88.4   86.7   92.7   88.5
    No NSP          83.9    84.9   86.5   92.6   87.9
    LTR & No NSP    82.1    84.3   77.5   92.1   77.8
      + BiLSTM      82.1    84.1   75.7   91.6   84.9
    Table 5: Ablation over the pre-training tasks using the BERTBASE architecture.
    "No NSP" is trained without the next sentence prediction task. "LTR & No NSP"
    is trained as a left-to-right LM without the next sentence prediction, like
    OpenAI GPT. "+ BiLSTM" adds a randomly initialized BiLSTM on top of the
    "LTR & No NSP" model during fine-tuning.
    It is also perhaps surprising that we are able to achieve such significant
    improvements on top of models which are already quite large relative to the
    existing literature. For example, the largest Transformer explored in Vaswani
    et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and
    the largest Transformer we have found in the literature is (L=64, H=512, A=2)
    with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains
    110M parameters and BERTLARGE contains 340M parameters.
    Hyperparams              Dev Set Accuracy
    #L  #H    #A  LM (ppl)   MNLI-m  MRPC  SST-2
     3  768   12  5.84       77.9    79.8  88.4
     6  768    3  5.24       80.6    82.2  90.7
     6  768   12  4.68       81.9    84.8  91.3
    12  768   12  3.99       84.4    86.7  92.9
    12  1024  16  3.54       85.7    86.9  93.3
    24  1024  16  3.23       86.6    87.8  93.7


  15. Feature-Based Approach with BERT
    • Evaluate how well BERT performs in the
    feature-based approach
    – By generating ELMo-like representations
    15
    The contextual embeddings are used as input to a randomly initialized
    two-layer 768-dimensional BiLSTM before the classification layer.
    Results are shown in Table 7. The best performing method is to concatenate the
    token representations from the top four hidden layers of the pre-trained
    Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This
    demonstrates that BERT is effective for both the fine-tuning and
    feature-based approaches.
    Layers Dev F1
    Finetune All 96.4
    First Layer (Embeddings) 91.0
    Second-to-Last Hidden 95.6
    Last Hidden 94.9
    Sum Last Four Hidden 95.9
    Concat Last Four Hidden 96.1
    Sum All 12 Layers 95.5
    Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER.
    CoNLL-2003 NER
    only 0.3 F1
    → BERT is also effective for
    the feature-based approach
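    A sketch of the best-performing variant in Table 7, concatenating the token
    representations from the top four hidden layers (the per-layer hidden states
    are assumed to be available as a list of tensors):

        import torch

        def concat_last_four(all_hidden_states):
            """all_hidden_states: one (batch, seq_len, hidden) tensor per layer.
            Returns (batch, seq_len, 4*hidden) ELMo-like features for a downstream
            BiLSTM + classifier."""
            return torch.cat(all_hidden_states[-4:], dim=-1)

        layers = [torch.randn(1, 16, 768) for _ in range(12)]  # e.g., BERT-Base depth
        print(concat_last_four(layers).shape)  # torch.Size([1, 16, 3072])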


  16. Misc.
    • Reddit:
    https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/
    • Code & pre-trained model:
    – Will be released before the end of October 2018
    – BERT-pytorch (WIP):
    https://github.com/codertimo/BERT-pytorch
    16
