
Word Embeddings

word embeddings, distributed representation, distributional hypothesis, pointwise mutual information, singular value decomposition, word2vec, word analogy, GloVe, fastText

Naoaki Okazaki

August 07, 2020

  1. Word Embeddings
    Naoaki Okazaki
    School of Computing,
    Tokyo Institute of Technology
    [email protected]
    PowerPoint template designed by https://ppt.design4u.jp/template/


  2. Deep Neural Networks (DNNs) and Natural Language Processing (NLP)
    [Figure: the phrase "very good movie" is mapped to word embeddings (representing a word as a vector), combined by semantic composition (computing the vector of phrases from constituent words), and fed to an encoder-decoder model (generating a sequence of words from the composed vector), e.g., producing the Japanese translation "とても よい 映画".]
    - DNNs made breakthroughs in speech processing and computer vision
      - Reduced the error rate of image recognition by more than 10% (ILSVRC 2012)
    - At first, DNNs had limited impact on NLP
      - Natural languages have symbols that represent semantic information
    - Recently, DNNs have been applied successfully to various tasks
      - DNNs achieve state-of-the-art performance on most NLP tasks
      - DNNs learn vector representations of text and generate text (e.g., a sequence of words) from the representations

  3. Word embedding
    [Figure: each of the words "very", "good", "movie" is mapped to a vector in ℝ^d.]
    - Represents a word with a vector of real numbers
    - Embeds a word into a neural network
    - Expresses semantic and syntactic aspects of a word

  4. Distributed representation (Hinton+ 1986)
    - Local representation
      - Assigns a unit (neuron, dimension, symbol) to every concept
    - Distributed representation
      - Each concept is represented by multiple units (micro-features)
      - Each unit participates in representing multiple concepts
    [Figure: in a local representation, each concept gets its own unit, e.g., #249, #809, #18329.]

  5. Lexical dictionary

  6. Lexical dictionary
    http://wordnetweb.princeton.edu/perl/webwn?s=bass

  7. Limitations of dictionaries: named entities
    http://wordnetweb.princeton.edu/perl/webwn?s=apple
    No sense of "Apple" as a company

  8. Limitations of dictionaries: neologisms
    http://wordnetweb.princeton.edu/perl/webwn?s=tweet
    No sense of "tweet" as posting a short message

  9. Limitations of dictionaries: compositionality
    http://wordnetweb.princeton.edu/perl/webwn?s=apple+tree
    Neither "apple tea" nor "apple production" is in the dictionary

  10. Distributional Hypothesis and Word-Context Matrix

  11. Distributional hypothesis (Harris 1954; Firth 1957)
    Concordance lines for "beer":
    … packed with people drinking beer or wine. Many restaurants …
    into alcoholic drinks such as beer or hard liquor and derive …
    … in miles per hour, pints of beer, and inches for clothes. M…
    …ns and for pints for draught beer, cider, and milk sales. The
    carbonated beverages such as beer and soft drinks in non-ref…
    …g of a few young people to a beer blast or fancy formal part…
    …c and alcoholic drinks, like beer and mead, contributed to a…
    People are depicted drinking beer, listening to music, flirt…
    … and for the pint of draught beer sold in pubs (see Metricat…
    Concordance lines for "wine":
    … ith people drinking beer or wine. Many restaurants can be f…
    …gan to drink regularly, host wine parties and consume prepar…
    principal grapes for the red wines are the grenache, mourved…
    … four or more glasses of red wine per week had a 50 percent …
    …e would drink two bottles of wine in an evening. According t…
    …. Teran is the principal red wine grape in these regions. In…
    …a beneficial compound in red wine that other types of alcohol
    … Colorino and even the white wine grapes like Trebbiano and …
    In Shakesperean theatre, red wine was used in a glass contai…
    "You shall know a word by the company it keeps."
    Z Harris. 1954. Distributional structure. Word, 10(2-3):146-162.
    J Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32.

  12. Word-context matrix
    Context: words appearing within ±h word offsets of the target word

              have   new  drink  bottle  ride  speed  read
      beer      36    14     72      57     3      0     1
      wine     108    14     92      86     0      1     2
      car      578   284      3       2    37     44     3
      train    291    94      3       0    72     43     2
      book     841   201      0       0     2      1   338

    Rows correspond to words, columns to context words.
    Each cell gives the frequency of co-occurrence of the word with the context word (for example, "train" co-occurred with "drink" three times).
    The row vector of "beer" represents the meaning of the word "beer".
    (A code sketch for building such a matrix follows below.)
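
A minimal sketch of how such a word-context matrix can be counted from a tokenized corpus with a ±window offset, in plain Python with no external dependencies. The function and variable names (count_cooccurrences, window, cooc) are illustrative, not from the slides.

```python
from collections import defaultdict

def count_cooccurrences(sentences, window=2):
    """Count word-context co-occurrences within +/- `window` word offsets."""
    cooc = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[word][tokens[j]] += 1
    return cooc

# Toy usage: each row (cooc[word]) is the context vector of that word.
corpus = [["people", "drink", "beer", "or", "wine"],
          ["read", "a", "book", "on", "the", "train"]]
matrix = count_cooccurrences(corpus, window=2)
print(dict(matrix["beer"]))   # {'people': 1, 'drink': 1, 'or': 1, 'wine': 1}
```

In practice the counts are stored in a sparse matrix, because most word-context pairs never co-occur.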

  13. Measuring the similarity of two vectors with cos θ
    Given two vectors u and v whose angle is θ,
        u · v = ‖u‖ ‖v‖ cos θ
    Therefore,
        cos θ = (u · v) / (‖u‖ ‖v‖)
    The value of cos θ is:
    - θ → 0 (same direction): cos θ → +1
    - θ → π/2 (orthogonal): cos θ → 0
    - θ → π (opposite direction): cos θ → −1
    In this way, cos θ can measure the similarity of two vectors within the range [−1, +1].
    [Figure: plot of cos θ and the two vectors u and v.]

  14. Let's compute cosine similarity
    (Using the same word-context matrix as before:)

              have   new  drink  bottle  ride  speed  read
      beer      36    14     72      57     3      0     1
      wine     108    14     92      86     0      1     2
      car      578   284      3       2    37     44     3
      train    291    94      3       0    72     43     2
      book     841   201      0       0     2      1   338

    cos θ = (u · v) / (‖u‖ ‖v‖)

    Cosine similarity between "beer" and "wine":
        cos θ = (36×108 + 14×14 + 72×92 + 57×86 + 3×0 + 0×1 + 1×2)
                / (√(36² + 14² + 72² + 57² + 3² + 0² + 1²) √(108² + 14² + 92² + 86² + 0² + 1² + 2²))
              = 0.941

    Cosine similarity between "beer" and "train":
        cos θ = (36×291 + 14×94 + 72×3 + 57×0 + 3×72 + 0×43 + 1×2)
                / (√(36² + 14² + 72² + 57² + 3² + 0² + 1²) √(291² + 94² + 3² + 0² + 72² + 43² + 2²))
              = 0.387
    (The NumPy sketch below reproduces these values.)
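
A small NumPy check of the two computations above, using the row vectors from the toy word-context matrix; this is only a sketch to reproduce the slide's arithmetic.

```python
import numpy as np

def cosine(u, v):
    """cos(theta) = (u . v) / (|u| |v|)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Row vectors from the toy matrix (columns: have, new, drink, bottle, ride, speed, read)
beer  = np.array([ 36, 14, 72, 57,  3,  0, 1], dtype=float)
wine  = np.array([108, 14, 92, 86,  0,  1, 2], dtype=float)
train = np.array([291, 94,  3,  0, 72, 43, 2], dtype=float)

print(round(cosine(beer, wine), 3))   # 0.941
print(round(cosine(beer, train), 3))  # 0.387
```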

  15. Positive Pointwise Mutual Information (PPMI) (Bullinaria+ 2007)
    PPMI_ij = max(0, log [P(i, j) / (P(i) P(j))])
            = max(0, log #(i, j) + log #(∗, ∗) − log #(i, ∗) − log #(∗, j))
    where
        P(i, j) = #(i, j) / #(∗, ∗),  P(i) = #(i, ∗) / #(∗, ∗),  P(j) = #(∗, j) / #(∗, ∗)
        #(i, ∗) = Σ_j #(i, j),  #(∗, j) = Σ_i #(i, j),  #(∗, ∗) = Σ_{i,j} #(i, j)
    PPMI discounts frequent words and frequent context words.

    PPMI-transformed word-context matrix:

              have   new  drink  bottle  ride  speed  read
      beer       0     0   2.04    1.97     0      0     0
      wine       0     0   1.78    1.87     0      0     0
      car     0.09  0.49      0       0  0.13   0.55     0
      train   0.03  0.02      0       0  1.43   1.16     0
      book    0.09     0      0       0     0      0  0.85

    cos(beer, wine) = 0.99 > 0.941 (with raw counts)
    cos(beer, train) = 0.00 < 0.387 (with raw counts)
    J Bullinaria and J Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
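
A sketch of the PPMI transformation with NumPy, following the formula on the slide (natural logarithm, which matches the value 2.04 for beer/drink); the helper name ppmi is illustrative.

```python
import numpy as np

def ppmi(C):
    """Positive PMI: max(0, log(#(i,j) * #(*,*) / (#(i,*) * #(*,j))))."""
    total = C.sum()                       # #(*,*)
    row = C.sum(axis=1, keepdims=True)    # #(i,*)
    col = C.sum(axis=0, keepdims=True)    # #(*,j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # treat zero co-occurrences as 0
    return np.maximum(pmi, 0.0)

# Toy matrix (rows: beer, wine, car, train, book; cols: have, new, drink, bottle, ride, speed, read)
C = np.array([[ 36,  14, 72, 57,  3,  0,   1],
              [108,  14, 92, 86,  0,  1,   2],
              [578, 284,  3,  2, 37, 44,   3],
              [291,  94,  3,  0, 72, 43,   2],
              [841, 201,  0,  0,  2,  1, 338]], dtype=float)

P = ppmi(C)
print(P.round(2)[0])  # beer row: [0. 0. 2.04 1.97 0. 0. 0.]
```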

  16. Latent Semantic Analysis (LSA) (Deerwester+ 1990)
    - Apply Singular Value Decomposition (SVD) to the matrix X:
        X = U Σ Vᵀ
      U: unitary matrix, Σ: diagonal matrix of singular values, Vᵀ: unitary matrix
    - Truncate Σ to the largest k singular values:
        X_k = U_k Σ_k V_kᵀ   (k-rank approximation)
      (X_k is a minimizer of ‖X − X_k‖ among rank-k matrices)
    - Use U_k Σ_k as k-dimensional word vectors:
        X_k X_kᵀ = U_k Σ_k V_kᵀ (U_k Σ_k V_kᵀ)ᵀ = (U_k Σ_k)(U_k Σ_k)ᵀ
      so the inner products of the rows of U_k Σ_k are equal to those of X_k
    S Deerwester, S Dumais, G Furnas, T Landauer, R Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

  17. Low-rank approximation by SVD (k = 3)
    [Figure: the SVD of the original matrix is truncated to the three largest singular values; only the first three columns of U and the first three rows of Vᵀ are used (3-rank approximation). Rows: beer, wine, car, train, book.]
    With the 3-rank approximation: cos(beer, wine) = 0.96, cos(beer, train) = 0.37
    Truncated SVD (Halko+ 2011) finds the top-k singular values of a matrix efficiently (for example, sklearn.decomposition.TruncatedSVD).
    N Halko, P G Martinsson, and J A Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217-288.
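
A sketch of the low-rank approximation with sklearn.decomposition.TruncatedSVD, as mentioned on the slide. Whether the slide applied SVD to the raw counts or to the PPMI matrix is not stated here, so the raw count matrix is assumed and the exact similarity values may differ from 0.96/0.37.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy count matrix (rows: beer, wine, car, train, book)
C = np.array([[ 36,  14, 72, 57,  3,  0,   1],
              [108,  14, 92, 86,  0,  1,   2],
              [578, 284,  3,  2, 37, 44,   3],
              [291,  94,  3,  0, 72, 43,   2],
              [841, 201,  0,  0,  2,  1, 338]], dtype=float)

svd = TruncatedSVD(n_components=3, random_state=0)
W = svd.fit_transform(C)          # rows of U_k * Sigma_k: 3-dimensional word vectors

sim = cosine_similarity(W)
print(round(sim[0, 1], 2))        # cos(beer, wine) in the 3-rank space
print(round(sim[0, 3], 2))        # cos(beer, train) in the 3-rank space
```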

  18. word2vec

  19. Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)
    [Figure: in the corpus fragment "… pubs offer draught beer, cider, and wine …", the word vector of "beer" predicts its surrounding context words (positive examples), while k words sampled from elsewhere in the vocabulary (e.g., "show", "take", "season") serve as negative examples.]
    - Word vector v_w ∈ ℝ^d; context vector ṽ_c ∈ ℝ^d
    - Each word vector predicts the 2h context words around it in the corpus (positive examples)
    - Sample k words as negative examples from the unigram distribution, and update the vectors such that word vectors do not predict the negative words
    T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

  20. SGD algorithm for updating vectors
    - Initialization:
      - t ← 0
      - Word vectors v_w: initialize with random values in [0, 1]
      - Context vectors ṽ_c: initialize with zeros
    - Repeat from the head to the tail of the training corpus (for each target word w):
      - t ← t + 1
      - Learning rate: α_t = α_0 (1 − t / (T + 1)), where T is the number of tokens in the corpus
      - For each context word c connected with the target word w, compute the gradient factor g:
          g = 1 − σ(v_w · ṽ_c)   for a positive example (pushes the inner product toward +∞)
          g = −σ(v_w · ṽ_c)      for a negative example (pushes the inner product toward −∞)
        and update:
          v_w ← v_w + α_t g ṽ_c
          ṽ_c ← ṽ_c + α_t g v_w
    (A NumPy sketch of one update step follows below.)
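
A minimal NumPy sketch of one SGD step of the update rule above, for a single target word with one positive context word and k negative samples. The dimensionality, learning rate, and uniform negative sampling (instead of the unigram distribution) are simplifying assumptions, and the corpus loop and learning-rate decay are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, k = 10_000, 100, 0.025, 5

W  = rng.uniform(0.0, 1.0, size=(V, dim))   # word vectors v_w, random init in [0, 1]
Wc = np.zeros((V, dim))                     # context vectors, zero init

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, lr=lr):
    """One SGD step: pull (w, c_pos) together and push (w, c_neg) apart."""
    grad_w = np.zeros(dim)
    for c, label in [(c_pos, 1.0)] + [(c, 0.0) for c in c_negs]:
        g = label - sigmoid(W[w] @ Wc[c])   # 1 - sigma(.) for a positive, -sigma(.) for a negative
        grad_w += g * Wc[c]
        Wc[c] += lr * g * W[w]              # update the context vector
    W[w] += lr * grad_w                     # update the word vector

# Toy usage: target word 42, positive context 7, k negatives (sampled uniformly here for brevity)
sgns_step(42, 7, rng.integers(0, V, size=k))
```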

  21. Demo with word vectors
    - English: GoogleNews-vectors-negative300.bin.gz
      - Trained on the Google News dataset (100B words)
      - https://code.google.com/archive/p/word2vec/
    - Japanese: (trained by me)
      - Trained on Japanese Wikipedia articles (400M words)
    - Use gensim for manipulating them in Python (see the sketch below)
      https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_ja.ipynb
      https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_en.ipynb
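
A sketch of loading and querying the English vectors with gensim, assuming gensim ≥ 4 and that GoogleNews-vectors-negative300.bin.gz has been downloaded locally.

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (binary word2vec format, ~3.4 GB)
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

print(wv["beer"].shape)                  # (300,) -- the embedding of "beer"
print(wv.similarity("beer", "wine"))     # cosine similarity of two words
print(wv.most_similar("beer", topn=5))   # nearest neighbours in the vector space
```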

  22. Plotted word vectors

  23. Measure the similarity of word vectors

  24. Finding similar words

  25. Word analogy

  26. Evaluation on the word analogy task
    (Mikolov+ 2013)
    Example of a semantic analogy: Athens : Greece = Tokyo : Japan
    Example of a syntactic analogy: cool : cooler = deep : deeper
    T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

  27. Word vectors exhibit additive composition
    Famous example: king − man + woman ≈ queen
    (Mikolov+ 2013)
    T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.
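
Expressed with gensim, the same additive composition can be queried via most_similar, which ranks words by cosine similarity to the composed vector; this assumes the wv object loaded in the earlier sketch.

```python
# Word analogy with additive composition: king - man + woman ~= queen
# (uses the `wv` KeyedVectors object loaded in the previous sketch)
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # expected to rank "queen" at or near the top
```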

  28. The objective function of SGNS
    - The objective function (MLE):
        L = − Σ_{w ∈ D} Σ_{c ∈ C(w)} log P(c | w)
      D: corpus (a sequence of words); C(w): the set of words appearing within the offset ±h from the word w
    - P(c | w) is modeled by softmax:
        P(c | w) = exp(v_w · ṽ_c) / Σ_{c′ ∈ V} exp(v_w · ṽ_{c′})
      This is too heavy to compute, as the denominator requires the sum over exponentials of inner products between the word w and all words c′ ∈ V.
    - Approximate log P(c | w) with logistic regressions:
        log P(c | w) ≈ log σ(v_w · ṽ_c) + Σ_{i=1}^{k} E_{c_N ∼ P_n} [ log σ(−v_w · ṽ_{c_N}) ]
      where each negative word c_N is sampled from the unigram distribution P_n (k times).

  29. SGNS is equivalent to shifted PMI (Levy+ 2014)
    - SGNS implicitly models a co-occurrence matrix whose entries are PMI shifted down by log k:
        M_{w,c} = PMI(w, c) − log k ≈ v_w · ṽ_c
    - This is similar to training word vectors by building a co-occurrence matrix using PMI
    - The previous approach (PMI) could therefore also realize additive composition
    O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.

  30. Derivation of shifted PMI (Levy+ 2014)
    The objective function of SGNS (#(w, c): co-occurrence frequency of w with c; #(w): frequency of w):
        L = − Σ_{w ∈ V} Σ_{c ∈ V} #(w, c) log σ(v_w · ṽ_c) − Σ_{w ∈ V} #(w) · k · E_{c_N ∼ P_n}[ log σ(−v_w · ṽ_{c_N}) ]
    Compute the expectation explicitly (P_n(c_N) = #(c_N) / |D|):
        E_{c_N ∼ P_n}[ log σ(−v_w · ṽ_{c_N}) ]
          = Σ_{c_N ∈ V} (#(c_N) / |D|) log σ(−v_w · ṽ_{c_N})
          = (#(c) / |D|) log σ(−v_w · ṽ_c) + Σ_{c_N ∈ V∖{c}} (#(c_N) / |D|) log σ(−v_w · ṽ_{c_N})
    Extract the portion of the objective function related to w and c (we can ignore the rest):
        L(w, c) = −#(w, c) log σ(v_w · ṽ_c) − k · #(w) · (#(c) / |D|) · log σ(−v_w · ṽ_c)
    Let x = v_w · ṽ_c. Compute the gradient of L(w, c) with respect to x by using d/dx log σ(x) = σ(−x) = 1 − σ(x):
        ∂L(w, c)/∂x = −#(w, c) (1 − σ(x)) + k #(w) (#(c) / |D|) σ(x)
                    = #(w, c) (σ(x) − 1) + k #(w) (#(c) / |D|) σ(x)
    Find the point where the gradient is zero:
        (1 + k #(w) #(c) / (#(w, c) |D|)) σ(x) = 1
        ⇔ (1 + k #(w) #(c) / (#(w, c) |D|)) · 1 / (1 + e^{−x}) = 1
        ⇔ e^{−x} = k #(w) #(c) / (#(w, c) |D|)
    Therefore (with |D| = #(∗, ∗)):
        x = v_w · ṽ_c = log( #(w, c) |D| / (k #(w) #(c)) )
                      = log( #(w, c) |D| / (#(w) #(c)) ) − log k
                      = PMI(w, c) − log k
    O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.
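
As a consequence of this equivalence, word vectors can also be obtained without SGD by building the shifted positive PMI matrix max(0, PMI(w, c) − log k) explicitly and factorizing it with truncated SVD, in the spirit of Levy and Goldberg. A sketch on the toy matrix; the helper name shifted_ppmi and the choice k = 1 are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def shifted_ppmi(C, k=5):
    """Shifted positive PMI: max(0, PMI(w, c) - log k)."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi - np.log(k), 0.0)

# Toy count matrix (rows: beer, wine, car, train, book)
C = np.array([[ 36,  14, 72, 57,  3,  0,   1],
              [108,  14, 92, 86,  0,  1,   2],
              [578, 284,  3,  2, 37, 44,   3],
              [291,  94,  3,  0, 72, 43,   2],
              [841, 201,  0,  0,  2,  1, 338]], dtype=float)

M = shifted_ppmi(C, k=1)                      # k = 1 reduces to plain PPMI
W = TruncatedSVD(n_components=3, random_state=0).fit_transform(M)
print(W.shape)                                # (5, 3): one 3-d vector per word
```

Levy+ (2015) additionally consider weighting the singular values when forming the word vectors (the eig trick on slide 39).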

  31. GloVe

  32. GloVe (Pennington+ 2014)
    Minimize
        J = Σ_{i,j=1}^{|V|} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²
    with the weighting function
        f(x) = (x / x_max)^α  if x < x_max,  1 otherwise
    where
        X_ij: co-occurrence frequency between words i and j
        |V|: total number of words (vocabulary size)
        w_i: vector of word i (vector #1);  w̃_j: vector of word j (vector #2)
        b_i: bias for word i;  b̃_j: bias for word j
    Similarly to SGNS, each word has two vectors assigned. This study uses (w_i + w̃_i) after training the vectors (this treatment improves the performance).
    The paper sets x_max = 100 and α = 0.75, and minimizes J with AdaGrad.
    J Pennington, R Socher, and C Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pp. 1532–1543.
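
A sketch of the weighting function f and of a single term of the GloVe objective, with x_max = 100 and α = 0.75 as on the slide; the vectors and biases here are random placeholders, whereas the real implementation learns them with AdaGrad over all nonzero X_ij.

```python
import numpy as np

X_MAX, ALPHA = 100.0, 0.75

def f(x):
    """GloVe weighting: (x / x_max)^alpha if x < x_max, else 1."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """One term of the objective: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    diff = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)
    return f(x_ij) * diff ** 2

# Toy usage with random 50-d vectors for a pair that co-occurred 72 times
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(glove_term(w_i, w_j, 0.0, 0.0, 72.0))
```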

  33. Rationale of w_i · w̃_j + b_i + b̃_j = log X_ij (1/4)
    - Consider representing a relation of words i and j on some aspect by using a context word k
      - E.g., the relation between ice and steam on thermodynamics
    - The ratio P_ik / P_jk may be more useful than P_ik = P(k | i) for capturing the characteristics of words i and j
      - E.g., k = solid and k = gas are more useful than k = water and k = fashion
    (Pennington+ 2014)
    J Pennington, R Socher, and C Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pp. 1532–1543.

  34. Rationale of w_i · w̃_j + b_i + b̃_j = log X_ij (2/4)
    - Let w_i, w_j, w̃_k be the vectors of words i, j, k (note that w̃_k is different from w_k)
    - In order to represent P_ik / P_jk with word vectors:
        F(w_i − w_j, w̃_k) = P_ik / P_jk
      The contrast between the characteristics of words i and j is represented by vector subtraction; we will decide the form of F later.
    - The simplest way to cast the type of the left-hand side (vectors) into that of the right-hand side (scalar) is the inner product:
        F((w_i − w_j) · w̃_k) = P_ik / P_jk

  35. Rationale of w_i · w̃_j + b_i + b̃_j = log X_ij (3/4)
    - Use exp: ℝ → ℝ⁺ as F:
        F((w_i − w_j) · w̃_k) = exp((w_i − w_j) · w̃_k) = exp(w_i · w̃_k) / exp(w_j · w̃_k) = P_ik / P_jk
    - Therefore,
        exp(w_i · w̃_k) = P_ik = X_ik / X_i
    - Take the logarithm of both sides:
        w_i · w̃_k = log X_ik − log X_i

  36. Rationale of w_i · w̃_j + b_i + b̃_j = log X_ij (4/4)
    - Words and contexts should be interchangeable
      - Consider w ↔ w̃ and X ↔ Xᵀ at the same time
    - However, words and contexts are not interchangeable in
        w_i · w̃_k = log X_ik − log X_i
      because we have no constant for k
    - Represent log X_i as a bias term b_i, and introduce a new bias term b̃_k for w̃_k:
        w_i · w̃_k = log X_ik − b_i − b̃_k
        ⇔ w_i · w̃_k + b_i + b̃_k = log X_ik

  37. Rationale of f(X_ij)
    - We cannot compute log X_ij when X_ij = 0
      - Most elements in X are 0 (X is a sparse matrix)
      - We ignore unobserved co-occurrences
    - We should not give too much weight to rare co-occurrences
      - They are hard to reproduce with vectors
      - Hence the weight (X_ij / x_max)^α when X_ij < x_max
    - We should not give too much weight to frequent co-occurrences either
      - Treat frequent co-occurrences with the same importance
      - Clip the weight to 1 when X_ij ≥ x_max

  38. Advanced topics

  39. Tricks used in implementations (Levy+ 2015)

          Description                Values                 PPMI   SVD    SGNS  GloVe
    win   Window size (h)            h ∈ {2, 5, 10}         ✓      ✓      ✓     ✓
    dyn   Weighted context           with (/h), none        ✓      ✓      ✓     ✓ *1
    sub   Subsampling                with, none             ✓      ✓      ✓     ✓
    del   Rare word removal          with, none             ✓      ✓      ✓     ✓
    neg   Negative samples           k ∈ {1, 5, 15}         ✓ *2   ✓ *2   ✓
    cds   Distribution correction    α ∈ {1, 0.75}          ✓ *3   ✓ *3   ✓
    w+c   Vector summation           w, (w + w̃)                    ✓      ✓     ✓
    eig   Weighted SVs               p ∈ {0, 0.5, 1.0}             ✓
    nrm   Normalization *4           both, col, row, none   ✓      ✓      ✓     ✓

    Preprocessing: win, dyn, sub, del;  association measure: neg, cds;  postprocessing: w+c, eig, nrm
    *1: The same weighting method implemented in word2vec
    *2: These are set by shifted PPMI
    *3: These are implemented by modifying the denominator of PMIs
    *4: Normalization for each word vector was the best
    O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.

  40. Tips for training word embeddings (Levy+ 2015)
    - Use context distribution smoothing (cds = 0.75); see the sketch below
    - Use SVD with symmetric variants (eig = 0 or 0.5)
    - No effect with neg > 1 in shifted PPMI
    - SGNS is a robust baseline
      - It does not underperform in any scenario
      - It is the fastest to train and has the cheapest memory consumption
    - Larger numbers of negative samples are better in SGNS
    - Worth trying w+c in SGNS and GloVe
      - May result in substantial gains (but sometimes in losses)
    O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.
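
A sketch of the first tip, context distribution smoothing: the context marginal #(c) is raised to the power α = 0.75 before computing PMI, which weakens the PMI bias toward rare contexts (following the description in Levy+ 2015; the helper name ppmi_cds is illustrative).

```python
import numpy as np

def ppmi_cds(C, alpha=0.75):
    """PPMI with context distribution smoothing: P(c) becomes #(c)^alpha / sum_c #(c)^alpha."""
    p_wc = C / C.sum()                              # P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / C.sum()    # P(w)
    col = C.sum(axis=0, keepdims=True) ** alpha
    p_c = col / col.sum()                           # smoothed P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

# Usage: P = ppmi_cds(count_matrix)   # count_matrix: |V| x |C| array of co-occurrence counts
```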

  41. Different evaluations favor different embeddings (Schnabel+ 2015)
    Task: a human worker chooses the most similar word among the candidates computed by different word embeddings
    GloVe was poor at adverbs for some reason; CBOW suffers when the candidates come from a larger neighborhood (50 nearest neighbors)
    T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.

  42. Different tasks favor different embeddings (Schnabel+ 2015)
    - There are no almighty word embeddings for all tasks
    - In order to improve the performance on a task, we should fine-tune word embeddings on the target task
    (Schnabel+ 2015)
    T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.

  43. fastText (Bojanowski+ 2017)
    - SGNS and GloVe are unaware of the internal letters of words
    - fastText extends SGNS to consider letter n-grams (subword units); see the sketch below
      - The use of subword units is also effective in machine translation
    [Figure: the word "offer" is decomposed into character 3-grams <of, off, ffe, fer, er>; its word vector is the sum of the word's own vector and its subword vectors, and this sum predicts the context vectors (e.g., "pubs", "draught") exactly as in SGNS. The update procedure is the same.]
    P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
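
A sketch of the subword decomposition: the word is wrapped in the boundary markers < and >, its character n-grams are extracted, and the word vector is the sum of the word's own vector and its subword vectors. The lookup tables below are illustrative placeholders; the real fastText hashes n-grams into a fixed number of buckets.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' (plus the full '<word>' itself)."""
    marked = f"<{word}>"
    grams = {marked[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)
    return sorted(grams)

print(char_ngrams("offer", 3, 3))   # ['<of', '<offer>', 'er>', 'fer', 'ffe', 'off']

# The vector of a word is the sum of its word vector and its subword vectors:
dim = 100
word_vec = {w: np.zeros(dim) for w in ["offer"]}                      # illustrative tables
subword_vec = {g: np.random.rand(dim) for g in char_ngrams("offer")}
v_offer = word_vec["offer"] + sum(subword_vec[g] for g in char_ngrams("offer"))
```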

  44. Comparison between SGNS and fastText
    (Bojanowski+ 2017)
    - fastText (sisg) favors syntactic analogies more than semantic analogies
    - fastText (sisg) outperforms the other methods except on WS353 in English
    P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.

  45. Demo with fastText
    ※ "Trumponomics" is a coined word

  46. Summary
    - Word embeddings capture syntactic and semantic information to some extent
    - The underlying idea is the distributional hypothesis
      - You shall know a word by the company it keeps
      - You shall know a word by predicting its company
    - There are no almighty word embeddings for all downstream tasks
    - Next question: can we represent a phrase/sentence with a vector?