
Word Embeddings - the Good, the Bad, and the Ugly

Word embeddings are the new magic tool for natural language processing. Without cumbersome preprocessing and feature design they are able to capture the semantics of language and texts, simply by being fed with lots of data. So they say.

We applied word embeddings - and for that matter also sentence embeddings - to various problem domains, such as chatbots, car reviews, news, and language learning, all on German domain-specific corpora. We will share our experiences and learnings: how much feature design was necessary, which alternative approaches are available, and for which applications we were able to make use of word embeddings (recommendations, topic detection, error correction).


MunichDataGeeks

October 07, 2017

Transcript

  1. Word Embeddings for Cars Datageeks Data Day, 2017-10-07 DieProduktMacher

  2. Word embeddings The Good, the Bad, and the Ugly OR

    Word embeddings applied to car review data
  3. The old way to represent words as vectors: "A BMW X3 is an alternative to the Audi Q5" [slide: each word of the sentence shown as a one-hot vector over the vocabulary – together a 10×10 identity matrix; labels: Vocabulary, Vector representing one word] Problem: each word has the same distance to any other word
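The equal-distance problem can be shown in a few lines; the toy vocabulary below mirrors the slide's example sentence and is purely illustrative:

```python
import math

# One-hot encoding: each word is a vector with a single 1 at its vocabulary index.
vocab = ["a", "bmw", "x3", "is", "an", "alternative", "to", "the", "audi", "q5"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# Every pair of distinct one-hot vectors has the same Euclidean distance, sqrt(2),
# so the representation carries no notion of word similarity at all.
d = math.dist(one_hot("bmw"), one_hot("audi"))
```

Because "bmw" is exactly as far from "audi" as from "alternative", nothing about meaning survives in this encoding.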
  4. In search of a better representation maintaining semantics Hypothesis: words that appear frequently in the same context share some meaning "A BMW X3 is an alternative to the Audi Q5" • Calculate the probability of each context word appearing around the given focus word (skip-gram) • Or calculate the probability of the focus word based on its context words (CBOW) These probabilities are the foundation of a compressed representation
  5. Word vectors are a compressed and semantic representation • Predicting the probabilities defines a cost function… • …on which a neural network (with one hidden layer) is trained • A word vector is the corresponding row of the hidden-layer weight matrix (vocab_size x nr_dims) Input Hidden layer Output Focus or context words in one-hot encoding (1 x V) (V x n) The probabilities of each word to appear in the context of the focus word (1 x V)
  6. Word vectors capture natural language semantics My experiment is about

    to test whether this works for car data in German
  7. The Data

  8. I Love Cars Car Data Brands, Emotions, Image, Business Hypothesis:

    word embeddings can capture these complex relations Make Model Model line / Equipment Functionality, Motorization Body Types, Generations / Facelifts
  9. The Data: German Car News and Reviews News and reviews from different sites about cars were used: Dataset A: • # words: 4,489,924 • # unique words: 323,399 • # unique words > 3: 82,501 • # sentences: 251,729 • Lexical diversity: 13.8 Dataset B: • # words: 2,129,582 • # unique words: 150,944 • # unique words > 3: 42,190 • # sentences: 136,724 • Lexical diversity: 14.1 Dataset C: • # words: 2,332,604 • # unique words: 174,113 • # unique words > 3: 43,916 • # sentences: 311,863 • Lexical diversity: 13.39 Dataset D (all combined): • # words: 6,741,644 • # unique words: 392,747 • # unique words > 3: 102,493 • # sentences: 397,261 • Lexical diversity: 17.2
  10. Experiments

  11. Hypothesis: word embeddings can capture make similarity

  12. How the models were built • For each dataset we generated the word vectors • We used the word2vec implementation from gensim • In order to assess the quality of the results we built a 2-dim PCA for each set of vectors • These vectors contained only the most frequent makes (57)
  13. PCA Visualization for dataset A Some interesting clusters – but outliers everywhere BMW VW
  14. PCA Visualization for dataset B BMW Works quite well for some makes – but most are cluttered
  15. PCA Visualization for dataset C Makes distributed randomly – micro

    relations seem OK
  16. PCA Visualization for dataset D BMW Citroën? Despite huge outliers, a good overall impression
  17. Which one is the best?

  18. How to evaluate the quality of word embeddings? Manually create

    a benchmark similarity matrix
  19. Compare Similarities Dataset A Dataset B Dataset C Dataset D

    Benchmark Dataset D best matches the benchmark
  20. Use sum of squared error as additional metric • Dataset

    B: 4.74 • Dataset C: 3.96 • Dataset A: 3.27 • Dataset D: 1.84
  21. Use sum of squared error as additional metric • Dataset B: 4.74 (150,944) • Dataset C: 3.96 (174,113) • Dataset A: 3.27 (323,399) • Dataset D: 1.84 (392,747) Obviously: the more data – the better the results!
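The SSE metric itself is straightforward; the matrices below are tiny illustrative stand-ins for the real 57-make similarity matrices from the deck:

```python
# Sum of squared errors between a model's make-similarity matrix and the
# hand-built benchmark matrix (lower = closer to the benchmark).
import numpy as np

# Hypothetical 3-make benchmark and model similarity matrices, for illustration.
benchmark = np.array([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])

model_sim = np.array([[1.0, 0.6, 0.3],
                      [0.6, 1.0, 0.1],
                      [0.3, 0.1, 1.0]])

sse = np.sum((model_sim - benchmark) ** 2)
```

Computing this one number per dataset is what allows the ranking B (4.74) > C (3.96) > A (3.27) > D (1.84) above.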
  22. Next question: to preprocess or not to preprocess?

  23. Remove stopwords and evaluate quality again Dataset D w/o stopwords Benchmark • Dataset A: 3.27 • Dataset A´: 2.28 • Dataset D: 1.84 • Dataset D´: 1.66 Removing stopwords definitely improved quality!
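The stopword filtering step can be sketched as follows; the German stopword set here is a small illustrative subset (in practice a full list such as NLTK's `stopwords.words("german")` would be used):

```python
# Filter German stopwords out of tokenized sentences before training word2vec.
# This stopword set is a tiny illustrative subset, not a complete list.
stopwords = {"der", "die", "das", "ist", "eine", "zum", "und", "mit", "dem"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stopwords]

sentence = ["der", "bmw", "x3", "ist", "eine", "alternative", "zum", "audi", "q5"]
filtered = remove_stopwords(sentence)
```

Training on the filtered corpus is what produces the primed datasets (A´, D´) with the lower SSE scores above.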
  24. What about hyperparameters?

  25. The size of the word vectors does influence the quality [slide: SSE plotted against vector sizes 50, 100, 200, 300 and 1000, one panel per dataset A, B, C and D'] Optimal size is between 100-200 depending on task and dataset.
  26. Similar Makes based on word embeddings

  27. Middle class alternative – Skoda Similar to Skoda: SEAT (0.777), Kia (0.638), Dacia (0.622), Hyundai (0.621), Mitsubishi (0.621)
  28. The French Cluster Similar to Peugeot: Renault (0.682), Citroen (0.657), Daihatsu (0.605), Hyundai (0.589), Kia (0.577)
  29. The Cheap Global Players – Mitsubishi Similar to Mitsubishi: Citroen (0.849*), Dacia (0.804), Suzuki (0.795), Honda (0.787), Toyota (0.777) *) Did you know about the cooperation between Mitsubishi and Citroen?
  30. Super Cars – Lamborghini Similar to Lamborghini: Ferrari (0.849), Corvette (0.753), Maserati (0.749), Lotus (0.698), Bugatti (0.623)
  31. German Premium Brands – BMW Similar to BMW: Jaguar (0.400), Infiniti (0.339), Cadillac (0.0309), Lexus (0.0278), Ford (0.0260)
  32. Disturbing! – Smart Similar to smart: Buick (0.902), GMC (0.892), Daewoo (0.880), Isuzu (0.822), Ssangyong (0.765)
  33. Now a look at the next layer – the model

  34. Models cluster by make Audi BMW Ford Opel Toyota Skoda

    Of course the make is very present in the context of a model
  35. Results with preprocessed data are promising Could this be the

    basis for a service?
  36. Similar Cars based on word embeddings Only a small subset

  37. Compact Car – Ford Fiesta Renault Clio (0.857) Skoda Fabia

    (0.832) Ford Focus (0.831)
  38. SUV – Nissan Qashqai Kia Sportage (0.873) Ford Kuga (0.856) Renault Kangoo (0.855)
  39. Station Wagon – Toyota Avensis Opel Insignia (0.868) Hyundai i30

    (0.846) Hyundai Tucson (0.834)
  40. Luxury Car – Audi A8 Audi A6 (0.842) Audi Q7 (0.835) Jaguar XF (0.833)
  41. Bad example #1 – Audi Q5 BMW X3 (0.877) BMW

    X1 (0.873) Audi A6 (0.857) No BMW X5? Audi Q7, Q3? Porsche Macan?
  42. Bad Example #2 – Opel Adam Porsche Macan (0.831) Jaguar

    XE (0.773) Toyota Verso (0.770) What???
  43. Summary

  44. Insights so far - Make • Approach does not work for German premium brands (Audi, BMW, Mercedes-Benz, Porsche) → do they appear in the German press as absolutely unique and not comparable? • In general, the "meaning" of brands is very ambiguous • More data necessary • Improve data with preprocessing (VW vs Volkswagen, Mercedes vs Mercedes-Benz, …) • Test GloVe as an alternative to word2vec
  45. Insights so far - Model • Here too, most problems occur with the German premium brands • In general these first steps look promising • Valuable as inspiration for the lesser-known makes and models • Extensive preprocessing necessary • Blend results with recommendations from other data types
  46. Fabian Dill Managing Director Email: Fabian.Dill@DieProduktMacher.com Phone: 089 / 189 46 54 0 Mobile: 0151 / 226 76 116 Contact Details: DieProduktMacher
  47. Resources http://ruder.io/word-embeddings-1/ https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/ Gensim: https://radimrehurek.com/gensim/index.html