Understanding Natural Language with Word Vectors @ PyCon UK 2017

Slides for my talk on word embeddings presented at PyCon UK 2017:
http://2017.pyconuk.org/sessions/talks/understanding-natural-language-with-word-vectors/

Abstract:
This talk is an introduction to word vectors, a.k.a. word embeddings, a family of Natural Language Processing (NLP) algorithms where words are mapped to vectors.

An important property of these vectors is that they capture semantic relationships, for example: UK - London + Paris = ???
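The analogy above can be sketched with plain vector arithmetic. The four-dimensional vectors below are hand-picked, hypothetical values chosen only so the example works; real embeddings are learned from text and typically have hundreds of dimensions. The helper names `cosine` and `analogy` are also illustrative, not part of any library.

```python
import math

# Hypothetical toy vectors: [british, french, italian, is_capital].
# These values are invented for illustration, not learned from data.
toy_vectors = {
    "UK":     [1.0, 0.0, 0.0, 0.0],
    "London": [1.0, 0.0, 0.0, 1.0],
    "France": [0.0, 1.0, 0.0, 0.0],
    "Paris":  [0.0, 1.0, 0.0, 1.0],
    "Italy":  [0.0, 0.0, 1.0, 0.0],
    "Rome":   [0.0, 0.0, 1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# UK - London + Paris
answer = analogy(toy_vectors, positive=["UK", "Paris"], negative=["London"])
print(answer)
```

With these toy values the nearest remaining word is "France". With learned vectors the result depends on the training corpus and is not guaranteed.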

These techniques have been driving important improvements in many NLP applications over the past few years, so the interest around word embeddings is spreading. In this talk, we'll discuss the basic linguistic intuitions behind word embeddings, we'll compare some of the most popular word embedding approaches, from word2vec to fastText, and we'll showcase their use with Python libraries.

The aim of the talk is to be approachable for beginners, so the theory is kept to a minimum.

By attending this talk, you'll be able to learn:
- the core features of word embeddings
- how to choose between different word embedding algorithms
- how to implement word embedding techniques in Python

Marco Bonzanini

October 27, 2017

Transcript

  1. I enjoyed eating some pizza at the restaurant

    I enjoyed eating some Welsh cake at the restaurant
  4. I enjoyed eating some pizza at the restaurant

    Objective Function: maximise the likelihood of the context given the focus word
    P(eating | pizza)
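To make the objective on this slide concrete, the sketch below lists the (focus word, context word) pairs that a model of this kind would be trained on. The window size of 2 and the function name `skipgram_pairs` are assumptions for illustration; the slides do not state a window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (focus, context) pairs for every token in the sentence.

    The context of a focus word is every other token that lies at most
    `window` positions to its left or right.
    """
    pairs = []
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

tokens = "I enjoyed eating some pizza at the restaurant".split()
pairs = skipgram_pairs(tokens)

# Context words observed around the focus word "pizza".
pizza_context = [context for focus, context in pairs if focus == "pizza"]
print(pizza_context)
```

Each pair (focus, context) is one training example: the model is adjusted so that P(context | focus), for instance P(eating | pizza), becomes more likely.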
  5. GloVe (2014)

    • Global co-occurrence matrix
    • Much bigger memory footprint
    • Downstream tasks: similar performance
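As a rough illustration of what a global co-occurrence matrix contains, the sketch below counts how often two words appear within a fixed window of each other across a whole corpus. It is a simplification under stated assumptions: the window size of 2 and the name `cooccurrence_counts` are hypothetical, and the raw counts here ignore the distance weighting and the factorisation step that GloVe applies on top of such counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbour) co-occurrences over all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "I enjoyed eating some pizza at the restaurant",
    "I enjoyed eating some Welsh cake at the restaurant",
]
counts = cooccurrence_counts(corpus)

# "eating" and "some" are neighbours in both sentences.
print(counts[("eating", "some")])
# "pizza" and "cake" never share a sentence, so the count is zero.
print(counts[("pizza", "cake")])
```

The table has one potential entry per pair of vocabulary words, which is one way to see why the memory footprint grows quickly with vocabulary size.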
  6. doc2vec (2014)

    • From words to documents
    • (or sentences, paragraphs, classes, …)
    • P(context | word, label)
  7. fastText (2016-17)

    • word2vec + morphology (sub-words)
    • Pre-trained vectors on ~300 languages
    • morphologically rich languages
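The sub-word idea can be sketched as splitting each word into character n-grams. The function below follows the scheme described in the fastText paper listed in the readings (boundary markers `<` and `>`, n-grams of 3 to 6 characters by default); the function name `char_ngrams` is hypothetical, and this is a sketch of the idea rather than the library's own implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, including boundary markers."""
    # "<" and ">" mark the start and end of the word, so prefixes and
    # suffixes get n-grams that differ from the same letters mid-word.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)
```

Because a word's vector is built from the vectors of its n-grams, a word never seen in training can still get a representation from the pieces it shares with known words, which helps with morphologically rich languages.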
  10. But we’ve been doing this for X years

    • Approaches based on co-occurrences are not new
    • … but usually outperformed by word embeddings
    • … and don’t scale as well as word embeddings
  13. Garbage in, garbage out

    • Pre-trained vectors are useful … until they’re not
    • The business domain is important
    • > 100K words? Maybe train your own model
    • > 1M words? Yep, train your own model
  14. Summary

    • Word Embeddings are magic!
    • Big victory of unsupervised learning
    • Gensim makes your life easy
  15. Credits & Readings

    Credits
    • Lev Konstantinovskiy (@teagermylk)

    Readings
    • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
    • “GloVe: Global Vectors for Word Representation” by Pennington et al.
    • “Distributed Representations of Sentences and Documents” (doc2vec) by Le and Mikolov
    • “Enriching Word Vectors with Subword Information” (fastText) by Bojanowski et al.
  16. Credits & Readings

    Even More Readings
    • “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al.
    • “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al.
    • “Equality of Opportunity in Machine Learning” - Google Research Blog https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html

    Pics Credits
    • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
    • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
    • Welsh cake: https://commons.wikimedia.org/wiki/File:Closeup_of_Welsh_cakes,_February_2009.jpg
    • Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg