Word Embeddings - the Good, the Bad, and the Ugly

Word embeddings are the new magic tool for natural language processing. Without cumbersome preprocessing and feature design, they are able to capture the semantics of language and texts simply by being fed lots of data. So they say.

We applied word embeddings - and, for that matter, also sentence embeddings - to various problem domains such as chatbots, car reviews, news, and language learning, all on German domain-specific corpora. We will share our experiences and learnings: how much feature design was necessary, which alternative approaches are available, and for which applications we were able to make use of word embeddings (recommendations, topic detection, error correction).

MunichDataGeeks

October 07, 2017

Transcript

  1. Word embeddings The Good, the Bad, and the Ugly OR

    Word embeddings applied to car review data
  2. The old way to represent words as vectors A BMW

    X3 is an alternative to the Audi Q5 [Figure: one-hot vectors - each of the ten words in the example vocabulary ("A", "BMW", "X3", "Is", "An", "Alternative", "To", "The", "Audi", "Q5") is represented by a vector with a single 1 at its own position and 0 everywhere else] Vocabulary, vector representing one word. Problem: each word has the same distance to any other word
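
As a minimal illustration of this one-hot scheme (a toy sketch, not code from the talk), the ten words of the example sentence can be encoded like this, and every pair of distinct words ends up at exactly the same distance:

```python
import numpy as np

# Toy sketch of one-hot encoding; vocabulary and helper names are illustrative.
sentence = "A BMW X3 is an alternative to the Audi Q5"
vocab = sentence.split()                      # 10 distinct tokens
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Every pair of distinct words has the same Euclidean distance (sqrt(2)),
# which is exactly the problem the slide points out.
print(np.linalg.norm(one_hot("BMW") - one_hot("Audi")))  # 1.414...
print(np.linalg.norm(one_hot("BMW") - one_hot("X3")))    # 1.414...
```
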
  3. In search of a better representation that maintains semantics Hypothesis: words

    that appear frequently in the same context share some meaning A BMW X3 is an alternative to the Audi Q5 • Calculate the probability of each context word appearing in the context of the given focus word (skip-gram) • Or calculate the probability of the focus word based on its context words (CBOW) These probabilities are the foundation of a compressed representation
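
For reference, the skip-gram variant models this probability with a softmax over the vocabulary. This is the standard word2vec formulation (Mikolov et al.), not shown on the slide, where v_w is the focus word's input vector and v'_c a context word's output vector:

```latex
% Probability of a context word c given the focus word w,
% with input vector v_w, output vector v'_c, and vocabulary V.
\[
  P(c \mid w) \;=\;
  \frac{\exp\!\left( {v'_c}^{\top} v_w \right)}
       {\sum_{c' \in V} \exp\!\left( {v'_{c'}}^{\top} v_w \right)}
\]
```
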
  4. Word vectors are a compressed and semantic representation • Predicting

    the probabilities defines the cost function… • …on which a neural network (with one hidden layer) is trained • A word vector is the corresponding row of the hidden-layer weight matrix (vocab_size x nr_dims) [Figure: Input - focus or context words in one-hot encoding (1 x V); Hidden layer - weight matrix (V x n); Output - the probability of each word to appear in the context of the focus word (1 x V)]
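
The "lookup" character of the hidden layer can be made concrete with a tiny numpy sketch (my own illustration; the names V, n, and W_hidden are not from the slides): multiplying a one-hot input by the V x n weight matrix simply selects one row, and that row is the word vector.

```python
import numpy as np

V, n = 10, 4                         # vocabulary size, embedding dimension
W_hidden = np.random.rand(V, n)      # stand-in for the trained hidden-layer weights

x = np.zeros(V)
x[3] = 1.0                           # one-hot vector for the word with index 3

# shapes: (V,) @ (V, n) -> (n,), i.e. the slide's (1 x V) · (V x n) = (1 x n)
word_vector = x @ W_hidden
assert np.allclose(word_vector, W_hidden[3])   # the lookup is just a row selection
```
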
  5. Word vectors capture natural language semantics My experiment tests

    whether this works for car data in German
  6. I Love Cars - Car Data: Brands, Emotions, Image, Business. Hypothesis:

    word embeddings can capture these complex relations [Figure: facets of car data - Make, Model, Model line / Equipment, Functionality, Motorization, Body Types, Generations / Facelifts]
  7. The Data: German Car News and Reviews News and reviews

    from different sites about cars were used:
    Dataset A: # words: 4,489,924 • # unique words: 323,399 • # unique words > 3: 82,501 • # sentences: 251,729 • lexical diversity: 13.8
    Dataset B: # words: 2,129,582 • # unique words: 150,944 • # unique words > 3: 42,190 • # sentences: 136,724 • lexical diversity: 14.1
    Dataset C: # words: 2,332,604 • # unique words: 174,113 • # unique words > 3: 43,916 • # sentences: 311,863 • lexical diversity: 13.39
    Dataset D (all combined): # words: 6,741,644 • # unique words: 392,747 • # unique words > 3: 102,493 • # sentences: 397,261 • lexical diversity: 17.2
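
A small sketch of how such corpus statistics could be computed (my own code; in particular, "lexical diversity" is assumed here to be total words divided by unique words, and "# unique words > 3" to mean words occurring more than three times - both assumptions match the reported ratios but are not spelled out on the slide):

```python
from collections import Counter

def corpus_stats(sentences):
    """sentences: list of lists of tokens."""
    tokens = [t for s in sentences for t in s]
    counts = Counter(tokens)
    return {
        "words": len(tokens),
        "unique_words": len(counts),
        "unique_words_gt_3": sum(1 for c in counts.values() if c > 3),  # assumed: frequency > 3
        "sentences": len(sentences),
        "lexical_diversity": len(tokens) / len(counts),                 # assumed definition
    }

# Example on a tiny toy corpus
print(corpus_stats([["a", "bmw", "x3"], ["an", "audi", "q5"], ["a", "bmw"]]))
```
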
  8. How the models were built • For each dataset we

    generated the word vectors • We used the word2vec implementation of gensim (see the sketch below) • In order to assess the quality of the results we computed a 2-dim PCA for each set of vectors • Only the vectors of the 57 most frequent makes were projected
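
A minimal sketch of this pipeline, assuming the gensim 4.x Word2Vec API and scikit-learn's PCA; the parameter values and the toy corpus are illustrative, not the ones used in the talk:

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Toy corpus; in the talk, the tokenized German car news and reviews were used.
sentences = [["der", "bmw", "x3", "ist", "eine", "alternative", "zum", "audi", "q5"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

makes = ["bmw", "audi"]                  # in the talk: the 57 most frequent makes
vectors = [model.wv[m] for m in makes]

pca = PCA(n_components=2)
projected = pca.fit_transform(vectors)   # 2-dim coordinates for plotting
```
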
  9. Compare Similarities [Figure: 2-dim PCA projections of the make vectors,

    one panel each for Dataset A, Dataset B, Dataset C, Dataset D, and the Benchmark] Dataset D best matches the benchmark
  10. Use sum of squared error as additional metric • Dataset

    B: 4.74 • Dataset C: 3.96 • Dataset A: 3.27 • Dataset D: 1.84
  11. Use sum of squared error as additional metric • Dataset

    B: 4.74 (150,944 unique words) • Dataset C: 3.96 (174,113) • Dataset A: 3.27 (323,399) • Dataset D: 1.84 (392,747) Obviously: the more data, the better the results!
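
The slides do not spell out how the sum of squared error is computed; one plausible reading, assumed in this sketch, is the sum of squared differences between each make pair's model similarity and a benchmark similarity:

```python
# Hedged sketch: the aggregation against the benchmark is my assumption.
def sum_squared_error(model_sims, benchmark_sims):
    """Both arguments: dict mapping a (make_a, make_b) pair to a similarity score."""
    return sum((model_sims[pair] - benchmark_sims[pair]) ** 2
               for pair in benchmark_sims)

# Example with two toy pairs
print(sum_squared_error({("bmw", "audi"): 0.40, ("skoda", "seat"): 0.78},
                        {("bmw", "audi"): 0.90, ("skoda", "seat"): 0.80}))
```
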
  12. Remove stopwords and evaluate quality again [Figure: PCA plot of Dataset D

    w/o stopwords vs. the Benchmark] • Dataset A: 3.27 • Dataset A´: 2.28 (w/o stopwords) • Dataset D: 1.84 • Dataset D´: 1.66 (w/o stopwords) Removing stopwords definitely improved quality!
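
A sketch of the stopword-removal step, assuming NLTK's German stopword list (the talk does not say which list was actually used):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                    # one-time download
german_stopwords = set(stopwords.words("german"))

def remove_stopwords(tokens):
    """Drop German stopwords, keeping content words such as makes and models."""
    return [t for t in tokens if t.lower() not in german_stopwords]

print(remove_stopwords(["der", "BMW", "X3", "ist", "eine", "Alternative", "zum", "Audi", "Q5"]))
# -> ['BMW', 'X3', 'Alternative', 'Audi', 'Q5']
```
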
  13. The size of the word vectors does influence the quality [Figure: SSE for

    vector sizes 50, 100, 200, 300, and 1000; one panel each for Datasets A, B, C, and D´] Optimal size is between 100 and 200, depending on task and dataset.
  14. Middle class alternative - Skoda. Similar to Skoda: SEAT (0.777), Kia (0.638),

    Dacia (0.622), Hyundai (0.621), Mitsubishi (0.621)
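
Lists like this come from a nearest-neighbour query on the word vectors; a self-contained sketch with gensim's most_similar() on a toy model (the scores on these slides come from the models trained on the car corpora, not from this toy example):

```python
from gensim.models import Word2Vec

# Tiny toy corpus, just to make the query runnable.
sentences = [
    ["skoda", "octavia", "kombi", "familienauto"],
    ["seat", "leon", "kombi", "familienauto"],
    ["kia", "ceed", "kombi", "familienauto"],
]
model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, seed=1)

# Nearest neighbours of a make by cosine similarity of the word vectors
for make, score in model.wv.most_similar("skoda", topn=3):
    print(f"{make}: {score:.3f}")
```
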
  15. The Cheap Global Players - Mitsubishi. Similar to Mitsubishi: Citroen (0.849*),

    Dacia (0.804), Suzuki (0.795), Honda (0.787), Toyota (0.777) *) Did you know about the cooperation between Mitsubishi and Citroen?
  16. Super Cars - Lamborghini. Similar to Lamborghini: Ferrari (0.849), Corvette (0.753),

    Maserati (0.749), Lotus (0.698), Bugatti (0.623)
  17. German Premium Brands - BMW. Similar to BMW: Jaguar (0.400), Infiniti (0.339),

    Cadillac (0.0309), Lexus (0.0278), Ford (0.0260)
  18. Models cluster by make [Figure: PCA plot of model vectors, with clusters for Audi, BMW, Ford, Opel, Toyota, Skoda]

    Of course, the make is very present in the context of a model
  19. Bad example #1 - Audi Q5. Similar to Audi Q5: BMW X3 (0.877), BMW

    X1 (0.873), Audi A6 (0.857). No BMW X5? Audi Q7, Q3? Porsche Macan?
  20. Bad Example #2 - Opel Adam. Similar to Opel Adam: Porsche Macan (0.831), Jaguar

    XE (0.773), Toyota Verso (0.770). What???
  21. Insights so far - Make • Approach does not work

    for the German premium brands (Audi, BMW, Mercedes-Benz, Porsche) - do they appear in the German press as absolutely unique and not comparable? • In general, the "meaning" of brands is very ambiguous • More data necessary • Improve data with preprocessing (VW vs Volkswagen, Mercedes vs Mercedes-Benz, …; see the sketch below) • Test GloVe as an alternative to word2vec
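
An illustrative sketch of the brand-name normalization idea from the preprocessing bullet: mapping name variants onto one canonical token before training (the alias list is my own example, not from the talk):

```python
# Hypothetical alias map; real preprocessing would cover many more variants.
BRAND_ALIASES = {
    "vw": "volkswagen",
    "mercedes": "mercedes-benz",
}

def normalize_tokens(tokens):
    """Lowercase tokens and collapse brand-name variants onto one canonical form."""
    return [BRAND_ALIASES.get(t.lower(), t.lower()) for t in tokens]

print(normalize_tokens(["Der", "VW", "Golf", "und", "der", "Mercedes"]))
# -> ['der', 'volkswagen', 'golf', 'und', 'der', 'mercedes-benz']
```
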
  22. Insights so far - Model • Here, too, most problems occur

    with the German premium brands • In general these first steps look promising • Valuable as inspiration for lesser-known makes and models • Extensive preprocessing necessary • Blend results with recommendations from other data types
  23. Contact Details: Fabian Dill, Managing Director, DieProduktMacher • Email: [email protected]

    • Phone: 089 / 189 46 54 0 • Mobile: 0151 / 226 76 116