Word Embeddings - the Good, the Bad, and the Ugly

Word embeddings are the new magic tool for natural language processing. Without cumbersome preprocessing and feature design, they are able to capture the semantics of language and texts simply by being fed lots of data. So they say.

We applied word embeddings - and, for that matter, also sentence embeddings - to various problem domains such as chatbots, car reviews, news, and language learning, all on German domain-specific corpora. We will share our experiences and learnings: how much feature design was necessary, which alternative approaches are available, and for which applications we were able to make use of word embeddings (recommendations, topic detection, error correction).

MunichDataGeeks

October 07, 2017

Transcript

  1. Word embeddings The Good, the Bad, and the Ugly OR

    Word embeddings applied to car review data
  2. The old way to represent words as vectors A BMW

    X3 is an alternative to the Audi Q5 [Figure: one-hot vectors - each of the ten words in the example vocabulary ("A", "BMW", "X3", "Is", "An", "Alternative", "To", "The", "Audi", "Q5") is represented by a vector with a single 1 at its own position and 0 everywhere else] Vocabulary, vector representing one word. Problem: each word has the same distance to any other word
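
As a minimal illustration of this one-hot scheme (a toy sketch, not code from the talk), the ten words of the example sentence can be encoded like this, and every pair of distinct words ends up at exactly the same distance:

```python
import numpy as np

# Toy sketch of one-hot encoding; vocabulary and helper names are illustrative.
sentence = "A BMW X3 is an alternative to the Audi Q5"
vocab = sentence.split()                      # 10 distinct tokens
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Every pair of distinct words has the same Euclidean distance (sqrt(2)),
# which is exactly the problem the slide points out.
print(np.linalg.norm(one_hot("BMW") - one_hot("Audi")))  # 1.414...
print(np.linalg.norm(one_hot("BMW") - one_hot("X3")))    # 1.414...
```
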
  3. In search of a better representation that maintains semantics Hypothesis: words

    that appear frequently in the same context share some meaning A BMW X3 is an alternative to the Audi Q5 • Calculate the probability of each context word appearing in the context of the given focus word (skip-gram) • Or calculate the probability of the focus word based on its context words (CBOW) These probabilities are the foundation of a compressed representation
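
For reference, the skip-gram variant models this probability with a softmax over the vocabulary. This is the standard word2vec formulation (Mikolov et al.), not shown on the slide, where v_w is the focus word's input vector and v'_c a context word's output vector:

```latex
% Probability of a context word c given the focus word w,
% with input vector v_w, output vector v'_c, and vocabulary V.
\[
  P(c \mid w) \;=\;
  \frac{\exp\!\left( {v'_c}^{\top} v_w \right)}
       {\sum_{c' \in V} \exp\!\left( {v'_{c'}}^{\top} v_w \right)}
\]
```
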
  4. Word vectors are a compressed and semantic representation • Predicting

    the probabilities defines the cost function… • …on which a neural network (with one hidden layer) is trained • A word vector is the corresponding row of the hidden-layer weight matrix (vocab_size x nr_dims) [Figure: Input - focus or context words in one-hot encoding (1 x V); Hidden layer - weight matrix (V x n); Output - the probability of each word to appear in the context of the focus word (1 x V)]
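
The "lookup" character of the hidden layer can be made concrete with a tiny numpy sketch (my own illustration; the names V, n, and W_hidden are not from the slides): multiplying a one-hot input by the V x n weight matrix simply selects one row, and that row is the word vector.

```python
import numpy as np

V, n = 10, 4                         # vocabulary size, embedding dimension
W_hidden = np.random.rand(V, n)      # stand-in for the trained hidden-layer weights

x = np.zeros(V)
x[3] = 1.0                           # one-hot vector for the word with index 3

# shapes: (V,) @ (V, n) -> (n,), i.e. the slide's (1 x V) · (V x n) = (1 x n)
word_vector = x @ W_hidden
assert np.allclose(word_vector, W_hidden[3])   # the lookup is just a row selection
```
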
  5. Word vectors capture natural language semantics My experiment tests

    whether this works for car data in German
  6. I Love Cars - Car Data: Brands, Emotions, Image, Business. Hypothesis:

    word embeddings can capture these complex relations [Figure: facets of car data - Make, Model, Model line / Equipment, Functionality, Motorization, Body Types, Generations / Facelifts]
  7. The Data: German Car News and Reviews News and reviews

    from different sites about cars were used:
    Dataset A: # words: 4,489,924 • # unique words: 323,399 • # unique words > 3: 82,501 • # sentences: 251,729 • lexical diversity: 13.8
    Dataset B: # words: 2,129,582 • # unique words: 150,944 • # unique words > 3: 42,190 • # sentences: 136,724 • lexical diversity: 14.1
    Dataset C: # words: 2,332,604 • # unique words: 174,113 • # unique words > 3: 43,916 • # sentences: 311,863 • lexical diversity: 13.39
    Dataset D (all combined): # words: 6,741,644 • # unique words: 392,747 • # unique words > 3: 102,493 • # sentences: 397,261 • lexical diversity: 17.2
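
A small sketch of how such corpus statistics could be computed (my own code; in particular, "lexical diversity" is assumed here to be total words divided by unique words, and "# unique words > 3" to mean words occurring more than three times - both assumptions match the reported ratios but are not spelled out on the slide):

```python
from collections import Counter

def corpus_stats(sentences):
    """sentences: list of lists of tokens."""
    tokens = [t for s in sentences for t in s]
    counts = Counter(tokens)
    return {
        "words": len(tokens),
        "unique_words": len(counts),
        "unique_words_gt_3": sum(1 for c in counts.values() if c > 3),  # assumed: frequency > 3
        "sentences": len(sentences),
        "lexical_diversity": len(tokens) / len(counts),                 # assumed definition
    }

# Example on a tiny toy corpus
print(corpus_stats([["a", "bmw", "x3"], ["an", "audi", "q5"], ["a", "bmw"]]))
```
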
  8. How the models were built • For each dataset we

    generated the word vectors • We used the word2vec implementation of gensim (see the sketch below) • In order to assess the quality of the results we computed a 2-dim PCA for each set of vectors • Only the vectors of the 57 most frequent makes were projected
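
A minimal sketch of this pipeline, assuming the gensim 4.x Word2Vec API and scikit-learn's PCA; the parameter values and the toy corpus are illustrative, not the ones used in the talk:

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Toy corpus; in the talk, the tokenized German car news and reviews were used.
sentences = [["der", "bmw", "x3", "ist", "eine", "alternative", "zum", "audi", "q5"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

makes = ["bmw", "audi"]                  # in the talk: the 57 most frequent makes
vectors = [model.wv[m] for m in makes]

pca = PCA(n_components=2)
projected = pca.fit_transform(vectors)   # 2-dim coordinates for plotting
```
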
  9. Compare Similarities [Figure: 2-dim PCA projections of the make vectors,

    one panel each for Dataset A, Dataset B, Dataset C, Dataset D, and the Benchmark] Dataset D best matches the benchmark
  10. Use sum of squared error as additional metric • Dataset

    B: 4.74 • Dataset C: 3.96 • Dataset A: 3.27 • Dataset D: 1.84
  11. Use sum of squared error as additional metric • Dataset

    B: 4.74 (150,944 unique words) • Dataset C: 3.96 (174,113) • Dataset A: 3.27 (323,399) • Dataset D: 1.84 (392,747) Obviously: the more data, the better the results!
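
The slides do not spell out how the sum of squared error is computed; one plausible reading, assumed in this sketch, is the sum of squared differences between each make pair's model similarity and a benchmark similarity:

```python
# Hedged sketch: the aggregation against the benchmark is my assumption.
def sum_squared_error(model_sims, benchmark_sims):
    """Both arguments: dict mapping a (make_a, make_b) pair to a similarity score."""
    return sum((model_sims[pair] - benchmark_sims[pair]) ** 2
               for pair in benchmark_sims)

# Example with two toy pairs
print(sum_squared_error({("bmw", "audi"): 0.40, ("skoda", "seat"): 0.78},
                        {("bmw", "audi"): 0.90, ("skoda", "seat"): 0.80}))
```
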
  12. Remove stopwords and evaluate quality again [Figure: PCA plot of Dataset D

    w/o stopwords vs. the Benchmark] • Dataset A: 3.27 • Dataset A´: 2.28 (w/o stopwords) • Dataset D: 1.84 • Dataset D´: 1.66 (w/o stopwords) Removing stopwords definitely improved quality!
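
A sketch of the stopword-removal step, assuming NLTK's German stopword list (the talk does not say which list was actually used):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                    # one-time download
german_stopwords = set(stopwords.words("german"))

def remove_stopwords(tokens):
    """Drop German stopwords, keeping content words such as makes and models."""
    return [t for t in tokens if t.lower() not in german_stopwords]

print(remove_stopwords(["der", "BMW", "X3", "ist", "eine", "Alternative", "zum", "Audi", "Q5"]))
# -> ['BMW', 'X3', 'Alternative', 'Audi', 'Q5']
```
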
  13. The size of the word vectors does influence the quality [Figure: SSE for

    vector sizes 50, 100, 200, 300, and 1000; one panel each for Datasets A, B, C, and D´] Optimal size is between 100 and 200, depending on task and dataset.
  14. Middle class alternative - Skoda. Similar to Skoda: SEAT (0.777), Kia (0.638),

    Dacia (0.622), Hyundai (0.621), Mitsubishi (0.621)
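
Lists like this come from a nearest-neighbour query on the word vectors; a self-contained sketch with gensim's most_similar() on a toy model (the scores on these slides come from the models trained on the car corpora, not from this toy example):

```python
from gensim.models import Word2Vec

# Tiny toy corpus, just to make the query runnable.
sentences = [
    ["skoda", "octavia", "kombi", "familienauto"],
    ["seat", "leon", "kombi", "familienauto"],
    ["kia", "ceed", "kombi", "familienauto"],
]
model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, seed=1)

# Nearest neighbours of a make by cosine similarity of the word vectors
for make, score in model.wv.most_similar("skoda", topn=3):
    print(f"{make}: {score:.3f}")
```
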
  15. The Cheap Global Players - Mitsubishi. Similar to Mitsubishi: Citroen (0.849*),

    Dacia (0.804), Suzuki (0.795), Honda (0.787), Toyota (0.777) *) Did you know about the cooperation between Mitsubishi and Citroen?
  16. Super Cars - Lamborghini. Similar to Lamborghini: Ferrari (0.849), Corvette (0.753),

    Maserati (0.749), Lotus (0.698), Bugatti (0.623)
  17. German Premium Brands - BMW. Similar to BMW: Jaguar (0.400), Infiniti (0.339),

    Cadillac (0.0309), Lexus (0.0278), Ford (0.0260)
  18. Models cluster by make [Figure: PCA plot of model vectors, with clusters for Audi, BMW, Ford, Opel, Toyota, Skoda]

    Of course, the make is very present in the context of a model
  19. Bad example #1 - Audi Q5. Similar to Audi Q5: BMW X3 (0.877), BMW

    X1 (0.873), Audi A6 (0.857). No BMW X5? Audi Q7, Q3? Porsche Macan?
  20. Bad Example #2 - Opel Adam. Similar to Opel Adam: Porsche Macan (0.831), Jaguar

    XE (0.773), Toyota Verso (0.770). What???
  21. Insights so far - Make • Approach does not work

    for the German premium brands (Audi, BMW, Mercedes-Benz, Porsche) - do they appear in the German press as absolutely unique and not comparable? • In general, the "meaning" of brands is very ambiguous • More data necessary • Improve data with preprocessing (VW vs Volkswagen, Mercedes vs Mercedes-Benz, …; see the sketch below) • Test GloVe as an alternative to word2vec
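
An illustrative sketch of the brand-name normalization idea from the preprocessing bullet: mapping name variants onto one canonical token before training (the alias list is my own example, not from the talk):

```python
# Hypothetical alias map; real preprocessing would cover many more variants.
BRAND_ALIASES = {
    "vw": "volkswagen",
    "mercedes": "mercedes-benz",
}

def normalize_tokens(tokens):
    """Lowercase tokens and collapse brand-name variants onto one canonical form."""
    return [BRAND_ALIASES.get(t.lower(), t.lower()) for t in tokens]

print(normalize_tokens(["Der", "VW", "Golf", "und", "der", "Mercedes"]))
# -> ['der', 'volkswagen', 'golf', 'und', 'der', 'mercedes-benz']
```
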
  22. Insights so far - Model • Here, too, most problems occur

    with the German premium brands • In general these first steps look promising • Valuable as inspiration for lesser-known makes and models • Extensive preprocessing necessary • Blend results with recommendations from other data types
  23. Contact Details: Fabian Dill, Managing Director, DieProduktMacher • Email: [email protected]

    • Phone: 089 / 189 46 54 0 • Mobile: 0151 / 226 76 116