Slide 1

Slide 1 text

Using Deep Learning to rank and tag millions of hotel images
Christopher Lennan (Senior Data Scientist) @chris_lennan
Tanuj Jain (Data Scientist) @tjainn
#idealoTech

Slide 2

Slide 2 text

Agenda
1. idealo.de
2. Business Motivation
3. Models and Training
4. Image Tagging
5. Image Aesthetics
6. Summary

Slide 3

Slide 3 text

Some Key Facts
● More than 18 years of experience
● 700 “idealos” from 40 nations
● Active in 6 different countries (DE, AT, ES, IT, FR, UK)
● 16 million users/month
● 50,000 shops
● Over 330 million offers for 2 million products
● TÜV-certified comparison portal
● Germany's 4th largest eCommerce website

Slide 4

Slide 4 text

Motivation

Slide 5

Slide 5 text

idealo hotel price comparison (hotel.idealo.de)
● 2,306,658 accommodations
● 308,519,299 images
● ~133 images per accommodation

Slide 6

Slide 6 text

Importance of Photography for Hotels
● “... after price, photography is the most important factor for travelers and prospects scanning OTA sites ...”
● “... Photography plays a role of 60% in the decision to book with a particular hotel ...”
● “... study published today by TripAdvisor, it would seem like photos have the greatest impact driving engagement from travelers researching on hotel and B&B pages ...”

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

(Hotel image gallery, positions 1–13)

Slide 11

Slide 11 text

Current image placement: Position 19 / Image Aesthetics: Position 1

Slide 12

Slide 12 text

Current image placement: Position 17 / Image Aesthetics: Position 3

Slide 13

Slide 13 text

Beautiful images should appear earlier in the gallery

Slide 14

Slide 14 text

(Hotel image gallery, positions 1–13)

Slide 15

Slide 15 text

(Hotel image gallery, positions 1–13)

Slide 16

Slide 16 text

Ensure different areas get depicted

Slide 17

Slide 17 text

(Image gallery positions 1–8, tagged by area: Bedroom, Bathroom, Restaurant, Facade, Fitness Studio, Kitchen)

Slide 18

Slide 18 text

Understanding Image Content: a two-part problem
1. Tag the image with the hotel property area
2. Predict its aesthetic quality

Slide 19

Slide 19 text

Models & Training

Slide 20

Slide 20 text

Transfer Learning
● Use a pre-trained CNN that was trained on millions of images (e.g. MobileNet or VGG16)
● Replace the top layers so that the output fits the classification task
● Train existing and new layer weights
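A minimal sketch of this setup, assuming Keras (TensorFlow) with a MobileNet backbone and the eight hotel-area classes used later in the talk; the input size, dropout rate and head layout are illustrative, not the exact production configuration:

```python
# Transfer-learning sketch: pre-trained MobileNet base + new classification head.
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model

NUM_CLASSES = 8  # e.g. the eight hotel property areas

# Pre-trained convolutional base (ImageNet weights), without its original top layers.
base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# New task-specific top: pool the conv features, regularize, classify.
x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)
outputs = Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
```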

Slide 21

Slide 21 text

Transfer Learning: CNN architecture (VGG16)

Slide 22

Slide 22 text

Training regime
1. First train only the newly added dense layers with a high learning rate
2. Then train all layers with a low learning rate
Goal: do not perturb the pre-trained convolutional weights too much
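A sketch of the two-phase schedule, continuing the Keras example above; `base` and `model` come from that sketch, while the learning rates, epoch counts and the `train_generator`/`val_generator` data generators are illustrative assumptions:

```python
# Phase 1: freeze the pre-trained convolutional base, train only the new dense head
# with a comparatively high learning rate.
from tensorflow.keras.optimizers import Adam

for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_generator, validation_data=val_generator, epochs=5)

# Phase 2: unfreeze all layers and continue with a much lower learning rate,
# so the pre-trained convolutional weights are only nudged, not overwritten.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_generator, validation_data=val_generator, epochs=20)
```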

Slide 23

Slide 23 text

Training regime

Slide 24

Slide 24 text

Loss functions: Cross-entropy loss (CEL)
● CEL is generally used for “one-class” ground-truth classifications (e.g. image tagging)
● CEL ignores inter-class relationships between score buckets
Source: https://ssq.github.io/2017/02/06/Udacity%20MLND%20Notebook/
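A small NumPy illustration (not from the talk) of the second point: with a one-hot target, only the probability assigned to the true bucket enters the loss, so putting mass into a neighbouring bucket or a distant one is penalised identically:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    # Standard categorical cross-entropy for a single sample.
    return -np.sum(y_true_onehot * np.log(y_pred_probs + eps))

y_true = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])       # true score bucket: 5
near   = np.array([0, 0, 0, 0.8, 0.2, 0, 0, 0, 0, 0])   # mass right next to the truth
far    = np.array([0, 0, 0, 0, 0.2, 0, 0, 0, 0.8, 0])   # mass far from the truth

print(cross_entropy(y_true, near), cross_entropy(y_true, far))  # identical losses
```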

Slide 25

Slide 25 text

Loss functions: Earth Mover's Distance (EMD)
● For ordered classes, classification settings can outperform regressions
● Training on datasets with intrinsic ordering can benefit from an EMD loss objective
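For contrast, a sketch of the (squared) Earth Mover's Distance used as the NIMA training objective: it compares cumulative distributions, so mass placed near the true bucket costs less than mass placed far away. NumPy is used here for clarity; the actual training loss would be the same formula on Keras tensors:

```python
import numpy as np

def emd_loss(y_true, y_pred, r=2):
    # Normalized Earth Mover's Distance between two discrete distributions
    # over ordered buckets, computed via their CDFs (r=2 as in the NIMA paper).
    cdf_diff = np.cumsum(y_true) - np.cumsum(y_pred)
    return np.mean(np.abs(cdf_diff) ** r) ** (1.0 / r)

y_true = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
near   = np.array([0, 0, 0, 0.8, 0.2, 0, 0, 0, 0, 0])
far    = np.array([0, 0, 0, 0, 0.2, 0, 0, 0, 0.8, 0])

print(emd_loss(y_true, near) < emd_loss(y_true, far))  # True: nearby mass is cheaper
```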

Slide 26

Slide 26 text

GPU training workflow (local + AWS)
● Setup: build a Docker image from the Dockerfile locally and push it to ECR; datasets and existing models are stored in S3
● Train: use Docker Machine to launch an EC2 GPU instance (custom AMI), pull the image, copy the existing model, and launch the training container with nvidia-docker via the train script; training outputs are stored in S3
● Evaluate: use Docker Machine to launch an EC2 GPU instance, pull the image, copy the existing model, and launch the evaluation container with nvidia-docker; evaluation runs via the evaluation script and a Jupyter notebook over SSH

Slide 27

Slide 27 text

Image Tagging

Slide 28

Slide 28 text

Tagging Problem
● Given an image, tag it as belonging to a single class
● Multiclass classification model with classes:
○ Bedroom
○ Bathroom
○ Foyer
○ Restaurant
○ Swimming Pool
○ Kitchen
○ View of Exterior (Facade)
○ Reception

Slide 29

Slide 29 text

Multiple Datasets
We will go over them one by one and look at:
● Dataset properties
● Results
● Issues

Slide 30

Slide 30 text

Wellness Dataset
● idealo in-house pre-labelled images
● Mostly pictures of 2- or 3-star properties

Slide 31

Slide 31 text

Wellness Dataset
● Balanced: equal sample count in all categories for all sets

Slide 32

Slide 32 text

Wellness Dataset: Metrics
Top-1 accuracy: 86%

Slide 33

Slide 33 text

Wellness Dataset: Wrong Predictions
True class of these images: BATHROOM; predicted as: RECEPTION
Rectangular structure = Reception with high probability → BIAS!

Slide 34

Slide 34 text

Wellness Dataset: Wrong Predictions
True class of these images: BATHROOM
Wrong true labels → NOISE in the dataset!

Slide 35

Slide 35 text

Correcting Bias
● Augmentation operations, the same for every class:
○ Random cropping
○ Rotation
○ Horizontal flipping
● Data enrichment:
○ External data from Google Images
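A sketch of class-agnostic augmentation with Keras' ImageDataGenerator; the parameter values and the `data/train` directory layout (one sub-folder per class) are illustrative assumptions, with shifts and zoom standing in for random cropping:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,        # random rotation
    width_shift_range=0.1,    # random shifts + zoom approximate random cropping
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,     # horizontal flipping
)

train_generator = train_datagen.flow_from_directory(
    "data/train",             # hypothetical path: one sub-folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```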

Slide 36

Slide 36 text

Augmented Wellness + Google Dataset: Metrics
Top-1 accuracy: 88%

Slide 37

Slide 37 text

Gotta Clean!

Slide 38

Slide 38 text

Cleaning the Dataset
● Hand-cleaned each category:
○ Deleted pictures that do not belong in their category
○ Removed duplicates (duplicates can distort the metrics)
○ Added more images from external sources for classes with few images left after cleaning
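One possible way to implement the duplicate-removal step, sketched here with exact file hashes (near-duplicates would need perceptual hashing; the directory path is hypothetical):

```python
import hashlib
from pathlib import Path

def remove_exact_duplicates(folder):
    # Hash each image file and delete every file whose content was already seen.
    seen = set()
    for path in sorted(Path(folder).rglob("*.jpg")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()      # drop the duplicate image
        else:
            seen.add(digest)

remove_exact_duplicates("data/train/bathroom")  # hypothetical class folder
```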

Slide 39

Slide 39 text

Cleaned Data: Metrics
Top-1 accuracy: 91%

Slide 40

Slide 40 text

Cleaned Dataset: Results
● Bathroom vs. Reception confusion has almost vanished!
● View of Exterior vs. Pool confusion has been reduced
● Foyer performance:
○ Most Foyer misclassifications get assigned to Reception
○ This is a problem for humans as well!
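The confusions discussed here can be read off a confusion matrix; a sketch with scikit-learn, assuming the `model` from the earlier transfer-learning sketch and a hypothetical `data/test` directory with one sub-folder per class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from tensorflow.keras.preprocessing.image import ImageDataGenerator

test_generator = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/test", target_size=(224, 224), class_mode="categorical", shuffle=False
)
y_pred = np.argmax(model.predict(test_generator), axis=1)
y_true = test_generator.classes                  # aligned because shuffle=False
print(confusion_matrix(y_true, y_pred))          # rows: true class, columns: predicted
```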

Slide 41

Slide 41 text

Foyer or Reception?

Slide 42

Slide 42 text

Learnings so far
● The model can only be as good as the data (cleaning)
● Foyer is a hard category to predict

Slide 43

Slide 43 text

Understanding Model Decisions

Slide 44

Slide 44 text

Understanding Decisions: Class Activation Maps (CAM)
● Use the penultimate Global Average Pooling (GAP) layer to get the class activation map
● Highlights the discriminative regions that lead to a classification
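A sketch of the CAM computation for a network that ends in GAP followed by a softmax dense layer, as described above; the layer lookup and shapes are assumptions about such an architecture, not the exact code from the talk:

```python
import numpy as np
from tensorflow.keras.models import Model

def class_activation_map(model, image_batch, last_conv_layer_name):
    # Feature maps of the last convolutional layer for the (single) input image.
    conv_model = Model(model.input, model.get_layer(last_conv_layer_name).output)
    feature_maps = conv_model.predict(image_batch)[0]        # shape (h, w, channels)

    # Dense weights connecting the GAP features to the predicted class.
    class_weights = model.layers[-1].get_weights()[0]         # shape (channels, classes)
    predicted_class = int(np.argmax(model.predict(image_batch)[0]))

    # Weighted sum of feature maps = class activation map.
    cam = feature_maps @ class_weights[:, predicted_class]    # shape (h, w)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)                           # normalise to [0, 1]
```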

Slide 45

Slide 45 text

Insights With CAM: Swimming Pool misclassified as Bathroom

Slide 46

Slide 46 text

Insights With CAM: Swimming Pool misclassified as Bathroom

Slide 47

Slide 47 text

Insights With CAM: Swimming Pool misclassified as Bathroom

Slide 48

Slide 48 text

Insights With CAM: Swimming Pool misclassified as Bathroom
The model uses the rails to misidentify the Pool as a Bathroom.

Slide 49

Slide 49 text

Insights With CAM: Bathroom correct classification

Slide 50

Slide 50 text

Insights With CAM: Bathroom correct classification

Slide 51

Slide 51 text

Insights With CAM: Bathroom correct classification

Slide 52

Slide 52 text

Insights With CAM: Bathroom correct classification
The model uses the faucets to correctly identify the Bathroom.

Slide 53

Slide 53 text

Learnings so far
● Attribution techniques like CAM lend interpretability
● CAM can drive data collection in specific directions

Slide 54

Slide 54 text

Tagging: Next Steps
1. Add still more data
a. Explore manual tagging options for training (e.g. Amazon Mechanical Turk)
2. Add more classes
a. Fitness Studio
b. Conference Room
c. Other

Slide 55

Slide 55 text

Image Aesthetics

Slide 56

Slide 56 text

Ground Truth Labels
For the NIMA model we need a “true” probability distribution over all classes for each image:
● AVA dataset: we have rating frequencies over all classes for each image → normalize the frequencies to get the “true” probability distribution
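A sketch of that normalisation: turn AVA-style vote counts for score buckets 1–10 into the probability distribution NIMA trains against (the vote counts below are made up):

```python
import numpy as np

def counts_to_distribution(rating_counts):
    # Normalise per-bucket vote frequencies into a probability distribution.
    counts = np.asarray(rating_counts, dtype=float)
    return counts / counts.sum()

# Hypothetical image: number of voters that gave scores 1..10.
print(counts_to_distribution([0, 1, 3, 10, 25, 41, 32, 18, 5, 2]))
```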

Slide 57

Slide 57 text

Iterations
We have gone through two iterations of the aesthetic model:
● First iteration: train on the AVA dataset
● Second iteration: fine-tune the first-iteration model on in-house labelled data

Slide 58

Slide 58 text

Results - first iteration (Aesthetic model, MobileNet)
● Linear correlation coefficient (LCC): 0.5987
● Spearman's rank correlation coefficient (SRCC): 0.6072
● Earth Mover's Distance: 0.2018
● Accuracy (threshold at 5): 0.74
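A sketch of how these metrics can be computed from predicted and ground-truth score distributions, assuming the usual NIMA convention of comparing mean scores over the 1–10 buckets; SciPy provides the correlation coefficients, and treating scores above 5 as “good” for the binary accuracy is an assumption:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mean_score(distribution):
    # Expected score of a 10-bucket distribution (buckets are scores 1..10).
    return float(np.sum(np.asarray(distribution) * np.arange(1, 11)))

def evaluate(true_dists, pred_dists):
    y_true = np.array([mean_score(d) for d in true_dists])
    y_pred = np.array([mean_score(d) for d in pred_dists])
    lcc = pearsonr(y_true, y_pred)[0]                 # linear correlation coefficient
    srcc = spearmanr(y_true, y_pred)[0]               # Spearman's rank correlation
    accuracy = np.mean((y_true > 5) == (y_pred > 5))  # binary accuracy at threshold 5
    return lcc, srcc, accuracy
```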

Slide 59

Slide 59 text

Examples - first iteration (Aesthetic model)

Slide 60

Slide 60 text

Examples - first iteration (Aesthetic model)

Slide 61

Slide 61 text

Examples - first iteration (Aesthetic model)

Slide 62

Slide 62 text

Examples - first iteration (Aesthetic model)

Slide 63

Slide 63 text

Examples - first iteration (Aesthetic model)

Slide 64

Slide 64 text

Results - second iteration
● We built a simple labeling application: http://image-aesthetic-labelling-app-nima.apps.eu.idealo.com/
● ~12 people from idealo Reise and Data Science labelled 1,000 hotel images for aesthetics
● We fine-tuned the aesthetic model with 800 training images
● Built an aesthetic test dataset with 200 images

Slide 65

Slide 65 text

Results - second iteration (Aesthetic model, MobileNet)
● Linear correlation coefficient (LCC): 0.7986
● Spearman's rank correlation coefficient (SRCC): 0.7743
● Earth Mover's Distance: 0.1236
● Accuracy (threshold at 5): 0.85

Slide 66

Slide 66 text

Examples - second iteration (Aesthetic model)

Slide 67

Slide 67 text

Examples - second iteration (Aesthetic model)

Slide 68

Slide 68 text

Examples - second iteration (Aesthetic model)

Slide 69

Slide 69 text

Examples - second iteration (Aesthetic model)

Slide 70

Slide 70 text

Examples - second iteration (Aesthetic model)

Slide 71

Slide 71 text

Production (Aesthetic model)

Slide 72

Slide 72 text

Production (Aesthetic model)
● To date we have scored ~280 million images
● Distribution of scores (sample of 1 million scores):

Slide 73

Slide 73 text

Production - Low Scores (Aesthetic model)

Slide 74

Slide 74 text

Production - Medium Scores (Aesthetic model)

Slide 75

Slide 75 text

Production - High Scores (Aesthetic model)

Slide 76

Slide 76 text

Understanding Model Decisions

Slide 77

Slide 77 text

Convolutional Filter Visualisations: Layer 23 (MobileNet original vs. MobileNet Aesthetic)

Slide 78

Slide 78 text

Convolutional Filter Visualisations: Layer 51 (MobileNet original vs. MobileNet Aesthetic)

Slide 79

Slide 79 text

Convolutional Filter Visualisations: Layer 79 (MobileNet original vs. MobileNet Aesthetic)

Slide 80

Slide 80 text

Aesthetic Learnings
● Hotel-specific labeled data is key: the aesthetic model improved markedly from 800 additional training samples
● NIMA requires only a few samples to achieve good results (EMD loss)
● Labeled hotel images are also important for the test set (model evaluation)
● Training on GPU significantly reduced training time (~30-fold)

Slide 81

Slide 81 text

Aesthetics: Next Steps
● Continue labeling images for the aesthetic classifier
● Introduce new desirable biases in labeling (e.g. low technical quality == low aesthetics)
● Improve prediction speed of the models (e.g. lighter CNN architectures)

Slide 82

Slide 82 text

Summary

Slide 83

Slide 83 text

Summary
● Transfer learning allowed us to train image tagging and aesthetic classifiers with a few thousand domain-specific samples
● We showed the importance of noise-free data for quality predictions
● Attribution & visualization techniques help us understand model decisions and improve them

Slide 84

Slide 84 text

Check us out! #idealoTech
https://github.com/idealo
https://medium.com/idealo-tech-blog

Slide 85

Slide 85 text

We’re hiring!
Data Engineers and DevOps Engineers across different teams
Check out our job postings: jobs.idealo.de

Slide 86

Slide 86 text

Tanuj Jain | [email protected] | @tjainn
Christopher Lennan | [email protected] | @chris_lennan

Slide 87

Slide 87 text

THE END