Slide 1

Word Embeddings for Cars
Datageeks Data Day, 2017-10-07
DieProduktMacher

Slide 2

Word embeddings: The Good, the Bad, and the Ugly
or: Word embeddings applied to car review data

Slide 3

The old way to represent words as vectors

Example: "A BMW X3 is an alternative to the Audi Q5"
Vocabulary: A, BMW, X3, Is, An, Alternative, To, The, Audi, Q5
Each word is represented by a one-hot vector over the vocabulary (e.g. BMW = 0 1 0 0 0 0 0 0 0 0); the ten vectors together form the 10 x 10 identity matrix.

Problem: each word has the same distance to any other word
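The one-hot scheme above can be sketched in a few lines (a minimal illustration; the vocabulary here is just the example sentence):

```python
import math

# One-hot encoding over the example sentence's vocabulary.
sentence = "A BMW X3 is an alternative to the Audi Q5".split()
vocab = {word: idx for idx, word in enumerate(sentence)}

def one_hot(word):
    """Return the one-hot vector for a vocabulary word."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Every pair of distinct words is exactly sqrt(2) apart -- the
# representation carries no notion of semantic similarity.
```

Because every pair of distinct one-hot vectors has distance sqrt(2), "BMW" is no closer to "Audi" than to "the".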

Slide 4

In search of a better representation that maintains semantics

Hypothesis: words that appear frequently in the same context share some meaning.
"A BMW X3 is an alternative to the Audi Q5"
• Calculate the probability of the focus word given its context words (CBOW)
• Or calculate the probability of each context word given the focus word (skip-gram)
These probabilities are the foundation of a compressed representation.

Slide 5

Word vectors are a compressed and semantic representation
• Predicting these probabilities defines a cost function…
• …on which a neural network with one hidden layer is trained
• A word vector is the corresponding row of the hidden-layer weight matrix (vocab_size x nr_dims)

Input: focus or context words in one-hot encoding (1 x V)
Hidden layer: weight matrix (V x n)
Output: the probability of each word to appear in the context of the focus word (1 x V)
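The lookup described above can be sketched with NumPy (shapes follow the slide; the weights here are random stand-ins for a trained network):

```python
import numpy as np

V, n = 10, 4                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((V, n))    # input-to-hidden weight matrix (V x n)

def word_vector(word_index):
    """The word vector is simply row `word_index` of the weight matrix."""
    return W[word_index]

# Multiplying a one-hot vector by W selects that same row -- the
# "embedding lookup" is just matrix-row indexing.
one_hot = np.zeros(V)
one_hot[3] = 1.0
assert np.allclose(one_hot @ W, word_vector(3))
```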

Slide 6

Word vectors capture natural-language semantics
My experiment tests whether this also works for German car data.

Slide 7

The Data

Slide 8

I Love Cars

Car data: brands, emotions, image, business
Hypothesis: word embeddings can capture these complex relations
Make → Model → Model line / equipment
Functionality, motorization, body types, generations / facelifts

Slide 9

The Data: German Car News and Reviews

News and reviews about cars from different sites were used.

Dataset A:
• # words: 4,489,924
• # unique words: 323,399
• # unique words > 3: 82,501
• # sentences: 251,729
• Lexical diversity: 13.8

Dataset B:
• # words: 2,129,582
• # unique words: 150,944
• # unique words > 3: 42,190
• # sentences: 136,724
• Lexical diversity: 14.1

Dataset C:
• # words: 2,332,604
• # unique words: 174,113
• # unique words > 3: 43,916
• # sentences: 311,863
• Lexical diversity: 13.39

Dataset D (all combined):
• # words: 6,741,644
• # unique words: 392,747
• # unique words > 3: 102,493
• # sentences: 397,261
• Lexical diversity: 17.2
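A sketch of how such per-dataset statistics could be computed (names are illustrative; I read "# unique words > 3" as words occurring more than three times, which matches the usual min-count preprocessing for word2vec):

```python
from collections import Counter

def corpus_stats(sentences):
    """Compute the slide's statistics for a list of whitespace-tokenized sentences."""
    words = [w for sent in sentences for w in sent.split()]
    counts = Counter(words)
    return {
        "words": len(words),
        "unique_words": len(counts),
        # assumed reading of "# unique words > 3": frequency above 3
        "unique_words_gt3": sum(1 for c in counts.values() if c > 3),
        "sentences": len(sentences),
        # lexical diversity = total words / unique words
        "lexical_diversity": len(words) / len(counts),
    }
```

For dataset A this ratio gives 4,489,924 / 323,399 ≈ 13.9, matching the slide's lexical diversity.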

Slide 10

Experiments

Slide 11

Hypothesis: word embeddings can capture make similarity

Slide 12

How the models were built
• For each dataset, we generated word vectors
• We used the word2vec implementation from gensim
• To assess the quality of the results, we built a 2-dimensional PCA projection for each set of vectors
• These projections contained only the 57 most frequent makes

Slide 13

PCA visualization for dataset A
Some interesting clusters – but outliers everywhere
[Scatter plot; BMW and VW highlighted]

Slide 14

PCA visualization for dataset B
Works quite well for some makes – but most clutter together
[Scatter plot; BMW highlighted]

Slide 15

PCA visualization for dataset C
Makes distributed randomly – micro-relations seem OK

Slide 16

PCA visualization for dataset D
Despite huge outliers, a good overall impression
[Scatter plot; BMW highlighted; Citroën an outlier?]

Slide 17

Which one is the best?

Slide 18

How to evaluate the quality of word embeddings?
Manually create a benchmark similarity matrix
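One way to realize this: fill a symmetric matrix with hand-assigned similarity scores for the makes, and compare it to the cosine-similarity matrix computed from the word vectors. A sketch (all names and values illustrative, not the talk's benchmark):

```python
import numpy as np

makes = ["bmw", "audi", "dacia"]
# Hand-filled benchmark: symmetric, with 1.0 on the diagonal.
benchmark = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities of the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T
```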

Slide 19

Compare similarities
[Similarity matrices shown for Dataset A, Dataset B, Dataset C, Dataset D, and the Benchmark]
Dataset D best matches the benchmark

Slide 20

Use the sum of squared errors as an additional metric
• Dataset B: 4.74
• Dataset C: 3.96
• Dataset A: 3.27
• Dataset D: 1.84

Slide 21

Use the sum of squared errors as an additional metric
• Dataset B: 4.74 (150,944 unique words)
• Dataset C: 3.96 (174,113 unique words)
• Dataset A: 3.27 (323,399 unique words)
• Dataset D: 1.84 (392,747 unique words)
Obviously: the more data, the better the results!
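The metric itself is straightforward: the element-wise squared differences between a model's similarity matrix and the benchmark, summed. A minimal sketch:

```python
import numpy as np

def sse(model_sim, benchmark_sim):
    """Sum of squared errors between two similarity matrices."""
    return float(np.sum((model_sim - benchmark_sim) ** 2))
```

Lower is better, so dataset D (1.84) matches the benchmark best.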

Slide 22

Next question: to preprocess or not to preprocess?

Slide 23

Remove stopwords and evaluate quality again
[Similarity matrices: Dataset D without stopwords vs. the Benchmark]
• Dataset A: 3.27
• Dataset A′ (without stopwords): 2.28
• Dataset D: 1.84
• Dataset D′ (without stopwords): 1.66
Removing stopwords definitely improved quality!
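The preprocessing step itself is a simple filter; a sketch with a tiny hand-coded German stopword list (illustrative only; a real pipeline would use a full list, e.g. NLTK's German stopwords):

```python
# Illustrative subset of German stopwords, not a complete list.
GERMAN_STOPWORDS = {"der", "die", "das", "ist", "ein", "eine", "zu", "und"}

def remove_stopwords(tokens):
    """Drop stopwords (case-insensitively) before training word2vec."""
    return [t for t in tokens if t.lower() not in GERMAN_STOPWORDS]
```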

Slide 24

What about hyperparameters?

Slide 25

The size of the word vectors does influence quality
[Charts: SSE vs. vector size (50, 100, 200, 300, 1000) for datasets A, B, C, and D′]
The optimal size is between 100 and 200, depending on task and dataset.

Slide 26

Similar Makes based on word embeddings
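The "Similar to …" lists on the following slides are nearest-neighbor queries in vector space; with gensim this is `model.wv.most_similar(...)`. Below, a dependency-light sketch of the same top-k cosine lookup (names and vectors illustrative):

```python
import numpy as np

def most_similar(query, vectors, topn=5):
    """Top-n entries by cosine similarity to `query`; `vectors` maps word -> np.ndarray."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = [
        (name, float(q @ (v / np.linalg.norm(v))))
        for name, v in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:topn]
```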

Slide 27

Middle-class alternative – Skoda
Similar to Skoda: SEAT (0.777), Kia (0.638), Dacia (0.622), Hyundai (0.621), Mitsubishi (0.621)

Slide 28

The French Cluster
Similar to Peugeot: Renault (0.682), Citroen (0.657), Daihatsu (0.605), Hyundai (0.589), Kia (0.577)

Slide 29

The Cheap Global Players – Mitsubishi
Similar to Mitsubishi: Citroen (0.849*), Dacia (0.804), Suzuki (0.795), Honda (0.787), Toyota (0.777)
*) Did you know about the cooperation between Mitsubishi and Citroen?

Slide 30

Super Cars – Lamborghini
Similar to Lamborghini: Ferrari (0.849), Corvette (0.753), Maserati (0.749), Lotus (0.698), Bugatti (0.623)

Slide 31

German Premium Brands – BMW
Similar to BMW: Jaguar (0.400), Infiniti (0.339), Cadillac (0.0309), Lexus (0.0278), Ford (0.0260)

Slide 32

Disturbing! – Smart
Similar to Smart: Buick (0.902), GMC (0.892), Daewoo (0.880), Isuzu (0.822), Ssangyong (0.765)

Slide 33

Now a look at the next layer – the model

Slide 34

Models cluster by make
[Scatter plot with clusters for Audi, BMW, Ford, Opel, Toyota, Skoda]
Of course, the make is very present in the context of a model.

Slide 35

Results with preprocessed data are promising
Could this be the basis for a service?

Slide 36

Similar Cars based on word embeddings
(only a small subset shown)

Slide 37

Compact Car – Ford Fiesta
Similar to Ford Fiesta: Renault Clio (0.857), Skoda Fabia (0.832), Ford Focus (0.831)

Slide 38

SUV – Nissan Qashqai
Similar to Nissan Qashqai: Kia Sportage (0.873), Ford Kuga (0.856), Renault Kangoo (0.855)

Slide 39

Station Wagon – Toyota Avensis
Similar to Toyota Avensis: Opel Insignia (0.868), Hyundai i30 (0.846), Hyundai Tucson (0.834)

Slide 40

Luxury Car – Audi A8
Similar to Audi A8: Audi A6 (0.842), Audi Q7 (0.835), Jaguar XF (0.833)

Slide 41

Bad Example #1 – Audi Q5
Similar to Audi Q5: BMW X3 (0.877), BMW X1 (0.873), Audi A6 (0.857)
No BMW X5? No Audi Q7 or Q3? No Porsche Macan?

Slide 42

Bad Example #2 – Opel Adam
Similar to Opel Adam: Porsche Macan (0.831), Jaguar XE (0.773), Toyota Verso (0.770)
What???

Slide 43

Summary

Slide 44

Insights so far – Make
• The approach does not work for German premium brands (Audi, BMW, Mercedes-Benz, Porsche) → do they appear in the German press as absolutely unique and not comparable?
• In general, the "meaning" of brands is very ambiguous
• More data is necessary
• Improve the data with preprocessing (VW vs. Volkswagen, Mercedes vs. Mercedes-Benz, …)
• Test GloVe as an alternative to word2vec

Slide 45

Insights so far – Model
• Here, too, most of the problems occur with the German premium brands
• In general, these first steps look promising
• Valuable as inspiration for lesser-known makes and models
• Extensive preprocessing is necessary
• Blend the results with recommendations from other data types

Slide 46

Contact Details:
Fabian Dill, Managing Director, DieProduktMacher
Email: [email protected]
Phone: 089 / 189 46 54 0
Mobile: 0151 / 226 76 116

Slide 47

Resources
http://ruder.io/word-embeddings-1/
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
Gensim: https://radimrehurek.com/gensim/index.html