An overview of word embeddings

Olivier Grisel
November 19, 2014

Transcript

  1. Outline
     • Neural Word Models
     • word2vec & GloVe
     • Computing word analogies with word vectors
     • Applications, implementations & pre-trained models
     • Extensions: RNNs for Machine Translation
  2. Neural Language Models
     • Each word is represented by a fixed-dimensional vector
     • Goal is to predict the target word given a context of ~5-10 words from a random sentence in Wikipedia
     • Use NN-style training to optimize the vector coefficients, typically with a log-likelihood objective (bi-linear, deep or recurrent architectures)
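The prediction set-up can be sketched in a few lines of numpy. This is a toy illustration only, not the training code of any of the cited models: the vocabulary, dimensions and random initialisation below are made up, and real systems learn both weight matrices by gradient descent on the log-likelihood.

    import numpy as np

    # Toy vocabulary and embedding matrix: one fixed-dimensional vector per word.
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    dim = 50
    rng = np.random.default_rng(0)
    embeddings = rng.normal(scale=0.1, size=(len(vocab), dim))      # input word vectors
    output_weights = rng.normal(scale=0.1, size=(len(vocab), dim))  # output (context) vectors

    def predict_target(context_words):
        """Score every vocabulary word as the target given a window of context words."""
        # Average the context vectors (a simple bi-linear, CBOW-style model).
        h = embeddings[[vocab[w] for w in context_words]].mean(axis=0)
        scores = output_weights @ h
        # Softmax gives the probabilities that the log-likelihood objective maximises.
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

    print(predict_target(["the", "cat", "on", "the"]))  # probabilities for the middle word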
  3. Trend in 2013 / 2014
     • Simple linear models (word2vec) benefit from larger training data (1B+ words) and dimensions (typically 50-300)
     • Some models (GloVe) are closer to matrix factorization than to neural networks
     • Can successfully uncover semantic and syntactic word relationships from unlabeled corpora (Wikipedia, Google News, Common Crawl)
  4. Analogies
     • [king] - [male] + [female] ~= [queen]
     • [Berlin] - [Germany] + [France] ~= [Paris]
     • [eating] - [eat] + [fly] ~= [flying]
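Assuming a set of trained word vectors is already available (the `vectors` dict below is hypothetical), these analogies reduce to vector arithmetic followed by a nearest-neighbour search under cosine similarity. A minimal sketch:

    import numpy as np

    def most_similar(query, vectors, exclude=()):
        """Return the word whose vector has the highest cosine similarity to `query`."""
        query = query / np.linalg.norm(query)
        best_word, best_sim = None, -1.0
        for word, vec in vectors.items():
            if word in exclude:
                continue
            sim = (vec @ query) / np.linalg.norm(vec)
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # [king] - [male] + [female] ~= [queen], with the query words excluded from the answer:
    # most_similar(vectors["king"] - vectors["male"] + vectors["female"],
    #              vectors, exclude={"king", "male", "female"})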
  5. Triplet queries
     • b* is to b what a* is to a
     • queen is to king what female is to male
  6. Dealing with multi-word expressions (phrases)
     http://code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt
     Do several passes to extract 3-gram and 4-gram phrases, then treat the phrases as new "words" to embed along with the unigrams. Score candidate bi-grams from their counts and apply a threshold.
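A rough sketch of such count-based phrase scoring, loosely following the discounted ratio described in the word2vec paper (score(a, b) = (count(ab) - delta) / (count(a) * count(b))). The delta and threshold values below are arbitrary; real implementations differ in constants and normalisation.

    from collections import Counter

    def score_bigrams(tokens, delta=5, threshold=1e-4):
        """Keep adjacent word pairs whose count-based score exceeds a threshold."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        phrases = {}
        for (a, b), n_ab in bigrams.items():
            # Discounting by delta filters out very rare pairs.
            score = (n_ab - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                phrases[(a, b)] = score
        return phrases

    # Re-running the pass with accepted phrases joined into single tokens
    # ("new_york") yields 3-gram and 4-gram phrases such as "new_york_times".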
  7. word2vec command line
     ./word2vec -train $CORPUS -size 300 -window 10 -hs 0 \
       -negative 15 -threads 20 -min-count 100 \
       -output word-vecs -dumpcv context-vecs
     source: GloVe model eval by Yoav Goldberg
  8. GloVe command line
     ./vocab_count -min-count 100 -verbose 2 < $CORPUS > $VOCAB_FILE
     ./cooccur -memory 40 -vocab-file $VOCAB_FILE -verbose 2 \
       -window-size 10 < $CORPUS > $COOCCURRENCE_FILE
     ./shuffle -memory 40 -verbose 2 \
       < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
     ./glove -save-file $SAVE_FILE -threads 8 \
       -input-file $COOCCURRENCE_SHUF_FILE \
       -x-max 100 -iter 15 -vector-size 300 -binary 2 \
       -vocab-file $VOCAB_FILE -verbose 2 -model 0
     source: GloVe model eval by Yoav Goldberg
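Once the pipeline has run, the resulting vectors can be loaded into Python. A minimal sketch, assuming the text output written alongside the binary file (one word followed by its coefficients per line; the actual file name depends on $SAVE_FILE):

    import numpy as np

    def load_glove_text(path):
        """Load GloVe vectors from text output: one word followed by its coefficients per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    # vectors = load_glove_text("vectors.txt")  # file written by the ./glove step above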
  9. Implementations in Python
     • gensim has most of word2vec (and GloVe is planned)
     • gensim also has:
       • a Wikipedia corpus loader (markup cleaning)
       • similarity queries and evaluation tools
     • glove-python (work in progress, very active)
     • Both use Cython and multi-threading
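As a rough sketch of the gensim word2vec API (parameter names vary between gensim versions, e.g. size vs vector_size; the one-sentence corpus below is only a placeholder for a real iterable of tokenised sentences):

    from gensim.models import Word2Vec

    # Placeholder corpus; in practice this would be an iterable of tokenised sentences,
    # e.g. streamed from the gensim Wikipedia corpus loader.
    sentences = [["the", "cat", "sat", "on", "the", "mat"]]

    # Hyper-parameters roughly mirror the word2vec command line above.
    model = Word2Vec(sentences, vector_size=300, window=10,
                     min_count=1, negative=15, workers=4)

    # Triplet query: queen is to king what female is to male
    # model.wv.most_similar(positive=["king", "female"], negative=["male"])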
  10. RNN for MT
      source: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
  11. Neural MT vs Phrase-based SMT
      BLEU scores of NMT & Phrase-SMT models on English / French (Oct. 2014)
  12. References
      • Word embeddings (see references to the main papers on each project page)
        First gen: http://metaoptimize.com/projects/wordreprs/
        Word2Vec: https://code.google.com/p/word2vec/
        GloVe: http://nlp.stanford.edu/projects/glove/
        Word2Vec & GloVe both provide pre-trained embeddings on English datasets.
      • Relation to sparse and explicit representations
        Linguistic Regularities in Sparse and Explicit Word Representations by Omer Levy and Yoav Goldberg
  13. References
      • Neural Machine Translation
        Google Brain: http://arxiv.org/abs/1409.3215
        U. of Montreal: http://arxiv.org/abs/1406.1078
        https://github.com/lisa-groundhog/GroundHog
  14. Explicit Sparse Vector Representations
      • Extract contexts with offsets:
        "The cat sat on the mat."
        c(sat) = {the_m2, cat_m1, on_p1, the_p2}
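A minimal sketch of how such positional context features can be extracted, assuming the _m / _p suffixes denote minus / plus offsets as in the slide's example (window size and tokenisation are arbitrary choices here):

    from collections import defaultdict

    def extract_contexts(tokens, window=2):
        """Collect positional context features per word, e.g. c(sat) = {the_m2, cat_m1, on_p1, the_p2}."""
        contexts = defaultdict(set)
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or not (0 <= j < len(tokens)):
                    continue  # skip the word itself and positions outside the sentence
                suffix = ("m" if offset < 0 else "p") + str(abs(offset))
                contexts[word].add(f"{tokens[j]}_{suffix}")
        return contexts

    print(extract_contexts("the cat sat on the mat".split())["sat"])
    # -> {'the_m2', 'cat_m1', 'on_p1', 'the_p2'}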