Introduction to word embeddings for understanding natural language

Presentation on Natural Language Processing, streamed live to the "Tarallucci, Vino e Machine Learning" meetup.

Event description:
https://www.meetup.com/Tarallucci-Vino-Machine-Learning/events/251298019/

Marco Bonzanini

June 07, 2018

Transcript

  1. One-hot Encoding
     Rome   = [1, 0, 0, 0, 0, 0, …, 0]
     Paris  = [0, 1, 0, 0, 0, 0, …, 0]
     Italy  = [0, 0, 1, 0, 0, 0, …, 0]
     France = [0, 0, 0, 1, 0, 0, …, 0]
  2. One-hot Encoding (same vectors as slide 1, annotated: one row per word, e.g. Rome or Paris, each of length V)
  3. One-hot Encoding (same vectors as slide 1, annotated: V = vocabulary size (huge))
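A minimal sketch of one-hot encoding in plain Python, using just the four words from the slide as the vocabulary (in practice the vocabulary covers the whole corpus, so V is huge):

    # One-hot encoding: each word becomes a length-V vector with a single 1
    # at that word's index (toy vocabulary for illustration).
    vocabulary = ["Rome", "Paris", "Italy", "France"]
    word_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        vec = [0] * len(vocabulary)   # V zeros
        vec[word_index[word]] = 1     # a single 1 at the word's position
        return vec

    print(one_hot("Rome"))   # [1, 0, 0, 0]
    print(one_hot("Paris"))  # [0, 1, 0, 0]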
  4. Bag-of-words
     doc_1 = [32, 14, 1, 0, …, 6]
     doc_2 = [ 2, 12, 0, 28, …, 12]
     …
     doc_N = [13, 0, 6, 2, …, 0]
  5. Bag-of-words (same vectors as slide 4, annotated: one column per word, e.g. Rome or Paris, so each document vector also has length V)
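A quick sketch of how such document vectors are built, again with a toy corpus (real doc_1 … doc_N would be full documents):

    from collections import Counter

    # Bag-of-words: one count per vocabulary word, per document (toy corpus).
    docs = ["rome paris rome", "paris paris italy"]
    vocabulary = sorted({word for doc in docs for word in doc.split()})

    def bag_of_words(doc):
        counts = Counter(doc.split())
        return [counts[word] for word in vocabulary]

    for doc in docs:
        print(bag_of_words(doc))
    # [0, 1, 2] and [1, 2, 0] over the vocabulary ['italy', 'paris', 'rome']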
  6. Word Embeddings
     Rome   = [0.91, 0.83, 0.17, …, 0.41]
     Paris  = [0.92, 0.82, 0.17, …, 0.98]
     Italy  = [0.32, 0.77, 0.67, …, 0.42]
     France = [0.33, 0.78, 0.66, …, 0.97]
  7. Word Embeddings (same vectors as slide 6, annotated: n. dimensions << vocabulary size)
  8. Word Embeddings (same vectors as slide 6)
  9. Word Embeddings (same vectors as slide 6)
  10. Word Embeddings (same vectors as slide 6)
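The useful property is that similar words get nearby vectors. A small check with cosine similarity, using the illustrative values from the slide (real embeddings typically have 100-300 dimensions):

    from math import sqrt

    # Truncated vectors copied from the slide, for illustration only.
    rome  = [0.91, 0.83, 0.17, 0.41]
    paris = [0.92, 0.82, 0.17, 0.98]
    italy = [0.32, 0.77, 0.67, 0.42]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
        return dot / norms

    print(cosine(rome, paris))  # ~0.94: Rome and Paris are close
    print(cosine(rome, italy))  # ~0.81: a bit further apart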
  11. I enjoyed eating some pizza at the restaurant
      I enjoyed eating some broccoli at the restaurant
  12. I enjoyed eating some pizza at the restaurant
      I enjoyed eating some broccoli at the restaurant
  13. I enjoyed eating some pizza at the restaurant
      I enjoyed eating some broccoli at the restaurant
      Same Context
  14. I enjoyed eating some pizza at the restaurant
      I enjoyed eating some broccoli at the restaurant
      = ?
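The point of the example is the distributional hypothesis: "pizza" and "broccoli" occur in exactly the same contexts, so they should end up with similar vectors. A tiny sketch that makes the shared context explicit:

    # Words appearing in the same contexts should get similar vectors.
    s1 = "i enjoyed eating some pizza at the restaurant".split()
    s2 = "i enjoyed eating some broccoli at the restaurant".split()

    def context(tokens, target, window=2):
        i = tokens.index(target)
        return set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])

    print(context(s1, "pizza"))                             # contains 'eating', 'some', 'at', 'the'
    print(context(s1, "pizza") == context(s2, "broccoli"))  # True: identical contexts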
  15. Vector Calculation
      Goal: learn vec(word)
      1. Choose objective function
      2. Init: random vectors
      3. Run stochastic gradient descent
  16. Vector Calculation (same steps as slide 15)
  17. Vector Calculation (same steps as slide 15)
  18. Intermezzo (Gradient Descent)
      • Optimisation algorithm
      • Purpose: find the min (or max) for F
      • Batch-oriented (use all data points)
  19. Intermezzo (Gradient Descent)
      • Optimisation algorithm
      • Purpose: find the min (or max) for F
      • Batch-oriented (use all data points)
      • Stochastic GD: update after each sample
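A small sketch of the batch vs. stochastic distinction, fitting the mean of a few numbers by minimising a sum of squared differences (the objective, data and learning rate are purely illustrative):

    # Minimise F(m) = sum_i (m - x_i)^2 over a toy data set.
    data = [2.0, 4.0, 6.0, 8.0]
    lr = 0.05

    m_batch = 0.0
    for _ in range(100):                 # batch GD: one update per pass, using all points
        grad = sum(2 * (m_batch - x) for x in data)
        m_batch -= lr * grad

    m_sgd = 0.0
    for _ in range(100):                 # stochastic GD: one update after each sample
        for x in data:
            m_sgd -= lr * 2 * (m_sgd - x)

    print(m_batch, m_sgd)  # m_batch converges to the mean 5.0; m_sgd hovers near it, noisier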
  20. Objective Function
      I enjoyed eating some pizza at the restaurant
      maximise the likelihood of a word given its context
  21. Objective Function
      I enjoyed eating some pizza at the restaurant
      maximise the likelihood of a word given its context
      e.g. P(pizza | eating)
  22. Objective Function
      I enjoyed eating some pizza at the restaurant
      maximise the likelihood of the context given its focus word
  23. Objective Function
      I enjoyed eating some pizza at the restaurant
      maximise the likelihood of the context given its focus word
      e.g. P(eating | pizza)
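Written out, "maximise the likelihood of the context given its focus word" is the standard skip-gram objective from word2vec, where T is the number of words in the corpus and c the context window size:

    \max \; \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \neq 0} \log P\left( w_{t+j} \mid w_t \right)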
  24. Example
      I enjoyed eating some pizza at the restaurant
      Move to next focus word and repeat
  25. P( vout | vin )
      P( vec(eating) | vec(pizza) )
      P( eating | pizza )
      Input word: pizza   Output word: eating
  26. P( vout | vin )
      P( vec(eating) | vec(pizza) )
      P( eating | pizza )
      Input word: pizza   Output word: eating
      ???
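The "???" is usually answered with a softmax over the whole vocabulary: each word gets an input vector v and an output vector v', and the probability of the output word given the input word is computed from their dot product. This is the standard word2vec formulation; in practice gensim approximates it with negative sampling or hierarchical softmax, since the denominator sums over all V words:

    P(w_{out} \mid w_{in}) \;=\;
      \frac{\exp\left( {v'_{w_{out}}}^{\top} v_{w_{in}} \right)}
           {\sum_{w=1}^{V} \exp\left( {v'_{w}}^{\top} v_{w_{in}} \right)}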
  27. Case Study 1: Skills and CVs

      from gensim.models import Word2Vec

      fname = 'candidates.jsonl'
      corpus = ResumesCorpus(fname)
      model = Word2Vec(corpus)

  28. Case Study 1: Skills and CVs (same code as slide 27)
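ResumesCorpus is not shown on the slides; Word2Vec only needs a restartable iterable that yields one list of tokens per document (it iterates once to build the vocabulary and again for each training epoch). A minimal sketch, assuming one JSON object per line with a "text" field (the field name is an assumption):

    import json

    class ResumesCorpus:
        """Stream one tokenised resume per line of a .jsonl file (sketch)."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    doc = json.loads(line)
                    yield doc["text"].lower().split()  # naive whitespace tokenisation

Once trained, the model supports the use cases on the next slide, e.g. model.wv.most_similar('python') to expand a skill query with related skills ('python' being a hypothetical skill present in the corpus).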
  29. Case Study 1: Skills and CVs
      Useful for:
      • Data exploration
      • Query expansion/suggestion
      • Recommendations
  30. Case Study 2: Beer!
      Data set of ~2.9M beer reviews
      89 different beer styles
      635k unique tokens
      185M total tokens
      https://snap.stanford.edu/data/web-RateBeer.html
  31. Case Study 2: Beer!

      from gensim.models import Doc2Vec

      fname = 'ratebeer_data.csv'
      corpus = RateBeerCorpus(fname)
      model = Doc2Vec(corpus)

  32. Case Study 2: Beer! (same code as slide 31)
      3.5h on my laptop … remember to pickle
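Doc2Vec expects an iterable of TaggedDocument objects, and the tags become queryable document vectors (here one vector per beer style, which is what the next slides query). A minimal sketch of what RateBeerCorpus might look like, assuming "review" and "style" columns in the CSV (the column names are assumptions), plus saving the model so the 3.5 hours of training are not lost:

    import csv
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class RateBeerCorpus:
        """Yield one TaggedDocument per review, tagged with its beer style (sketch)."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):
                    tokens = row["review"].lower().split()   # naive tokenisation
                    yield TaggedDocument(words=tokens, tags=[row["style"]])

    model = Doc2Vec(RateBeerCorpus("ratebeer_data.csv"))
    model.save("ratebeer.doc2vec")   # "remember to pickle": reload later with Doc2Vec.load()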
  33. Case Study 2: Beer!

      model.docvecs.most_similar('Stout')
      [('Sweet Stout', 0.9877),
       ('Porter', 0.9620),
       ('Foreign Stout', 0.9595),
       ('Dry Stout', 0.9561),
       ('Imperial/Strong Porter', 0.9028),
       ...]

  34. Case Study 2: Beer!

      model.most_similar([model.docvecs['Wheat Ale']])
      [('lemon', 0.6103), ('lemony', 0.5909), ('wheaty', 0.5873),
       ('germ', 0.5684), ('lemongrass', 0.5653), ('wheat', 0.5649),
       ('lime', 0.55636), ('verbena', 0.5491), ('coriander', 0.5341),
       ('zesty', 0.5182)]

  35. Case Study 2: Beer!
      Useful for:
      • Understanding the language of beer enthusiasts
      • Planning your next pint
      • Classification
  36. GloVe (2014)
      • Global co-occurrence matrix
      • Much bigger memory footprint
      • Downstream tasks: similar performance
  37. doc2vec (2014)
      • From words to documents (or sentences, paragraphs, classes, …)
      • P(context | word, label)
  38. fastText (2016-17)
      • word2vec + morphology (sub-words)
      • Pre-trained vectors for ~300 languages
      • Handles morphologically rich languages
  39. But we’ve been doing this for X years
      • Approaches based on co-occurrences are not new
  40. But we’ve been doing this for X years
      • Approaches based on co-occurrences are not new
      • … but usually outperformed by word embeddings
  41. But we’ve been doing this for X years
      • Approaches based on co-occurrences are not new
      • … but usually outperformed by word embeddings
      • … and don’t scale as well as word embeddings
  42. Garbage in, garbage out
      • Pre-trained vectors are useful … until they’re not
      • The business domain is important
  43. Garbage in, garbage out
      • Pre-trained vectors are useful … until they’re not
      • The business domain is important
      • > 100K words? Maybe train your own model
  44. Garbage in, garbage out
      • Pre-trained vectors are useful … until they’re not
      • The business domain is important
      • > 100K words? Maybe train your own model
      • > 1M words? Yep, train your own model
  45. Summary
      • Word Embeddings are magic!
      • Big victory of unsupervised learning
      • Gensim makes your life easy
  46. Credits & Readings
      Credits
      • Lev Konstantinovskiy (@teagermylk)
      Readings
      • Deep Learning for NLP (R. Socher): http://cs224d.stanford.edu/
      • “GloVe: Global Vectors for Word Representation” by Pennington et al.
      • “Distributed Representations of Sentences and Documents” (doc2vec) by Le and Mikolov
      • “Enriching Word Vectors with Subword Information” (fastText) by Bojanowski et al.
  47. Credits & Readings
      Even More Readings
      • “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al.
      • “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al.
      • “Equality of Opportunity in Machine Learning” - Google Research Blog
        https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
      Pics Credits
      • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
      • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
      • Broccoli: https://commons.wikimedia.org/wiki/File:Broccoli_and_cross_section_edit.jpg
      • Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg