Dat Tran - Head of Data Science
Deploying email classifier to production for an
enhanced ecommerce customer experience
Tanuj Jain (Data Scientist)
David Vinco (Junior Software Developer)
06/09/2018 - Bedcon
Slide 2
Agenda
1. Motivation
2. Machine Learning Model
3. Deployment
4. Learnings
5. Next steps
Slide 3
Motivation
Slide 4
idealo internet GmbH
● Two main businesses:
○ Price comparison
○ Ecommerce - idealo Direktkauf
Slide 5
Ecommerce Customer Aftersales
● Receive updates about placed orders:
○ Notification of successful order placement
○ Access to invoice
○ Shipping updates
○ Cancellation notification
Slide 6
Current Process
Stuff you want to buy →
Slide 7
Current Process
Slide 8
Current Process
Every interaction between customer and shop occurs via email
through idealo!
Order confirmation through email
SHOP
Slide 9
BUT ….
Slide 10
Better Solution
Provide all updates through idealo user account
idealo account
Order confirmation
Invoice access
Cancellation Updates
…..
Slide 11
How to reach the better solution?
● Understanding email content
○ Is it an invoice, a cancellation, or anything else?
● Extracting relevant information from emails
○ Invoice: extract the PDF
○ Cancellation: update the order status
○ Shipping: extract the tracking number
● Updating the user account
○ Display the extracted information on the user front end
Slide 12
Understanding Email Data
Slide 13
Challenges
● Hundreds of shops
● Millions of emails
● Several email types
Machine Learning Magic!
Slide 15
Machine Learning 101
Step 1
Training
Where we ‘teach’ our model
through data what is classified
as an Apple or Banana or
Mango.
Supervised Learning
Characterize each fruit by features. Example:
1. Fruit colour
2. Length of the fruit
3. Width of the fruit
4. ….
Slide 16
Machine Learning 101
Step 2
Testing
Where we ‘quiz’ our model till it
passes with a satisfactory score.
Supervised Learning
Slide 17
Machine Learning Model
Multiclass classification with binary classifiers: ‘One vs Rest’ strategy
3 classifiers:
• Apple vs all other fruits
• Banana vs all other fruits
• Mango vs all other fruits
Training step:
• Teach all 3 classifiers
Test step:
• Show test sample to each classifier
• Winning class: The one with highest probability
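The test step above can be sketched in a few lines. This is only an illustration: the three binary classifiers are assumed to be already trained, and the weights, the two features (length and width) and the function names are invented for the example rather than taken from the talk.

```python
import math

# Hypothetical, hand-picked parameters for three 'class vs rest' classifiers.
# Features: [length_cm, width_cm]; each entry is (weights, bias).
CLASSIFIERS = {
    "apple": ([-0.6, 0.9], -1.0),
    "banana": ([0.8, -1.2], -2.0),
    "mango": ([0.1, 0.2], -3.0),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features):
    """Show the sample to every binary classifier and collect its probability."""
    return {
        label: sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for label, (weights, bias) in CLASSIFIERS.items()
    }


def predict(features):
    """Winning class: the classifier that reports the highest probability."""
    probabilities = predict_proba(features)
    return max(probabilities, key=probabilities.get)


print(predict([18.0, 3.5]))  # a long, narrow fruit
print(predict([7.0, 7.5]))   # a short, round fruit
```

Each classifier answers independently, so the three probabilities need not sum to 1; only their ranking matters for the decision.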
Slide 18
DATA
Slide 19
Labels & Distribution
● Predict only 3 labels:
○ Rechnung (Invoice)
○ Storno (Cancellation)
○ Other
● Work with only a subset of emails:
○ Only a few months
● High imbalance in label distribution
First iteration
Label                   Count          Share
Rechnung (invoice)      ~100K          7.95%
Storno (cancellation)   ~4K            0.25%
Other                   ~1.2 million   91.8%
Data Cleaning Steps
● Lowercase
● Remove:
○ Punctuation
○ Email Ids
○ Weblinks
○ Footers
○ Stop words
● Merge words like ‘A R T I K E L Ü B E R S I C H T’
● Stemming and lemmatization: reduce words to their roots
○ E.g. ‘Nachfolgenden’ becomes ‘nachfolg’
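A rough sketch of most of the cleaning steps listed above, using only the standard library. The regular expressions, the tiny stop-word list and the function name are assumptions made for illustration; footer removal and stemming are left out because the talk does not say how they were done.

```python
import re

# Tiny illustrative stop-word list; a real German list would be much longer.
STOP_WORDS = {"sehr", "die", "sie", "unter", "oder", "per", "an"}


def clean_body(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # web links
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    # Merge letter-spaced words such as 'a r t i k e l' into 'artikel'.
    text = re.sub(
        r"\b(?:\w\s){2,}\w\b",
        lambda match: re.sub(r"\s", "", match.group(0)),
        text,
    )
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


sample = (
    "Sehr geehrter Kunde, die Rechnung finden Sie unter "
    "https://shop.example/r?id=1 oder per Mail an info@shop.example! "
    "A R T I K E L Ü B E R S I C H T"
)
print(clean_body(sample))
```

Links and addresses are removed before punctuation so that their dots and slashes do not leave stray fragments behind.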
Slide 27
[Figures: original body length distribution (before cleaning) and final body length distribution (after cleaning)]
Slide 28
Most Common Words: Rechnung (Invoice)
Slide 29
Most Common Words: Storno (Cancellation)
Slide 30
Most Common Words: Others
Slide 31
pdfAttached Distribution
Slide 32
Observations
● Each class has its own high-frequency words
● The amount of length reduction after cleaning varies per class
● The pdfAttached distribution differs per class
→ Good candidates for deriving features
Slide 33
Features
● Text features
○ Term frequency-inverse document frequency (tf-idf) on the email body
○ Tf-idf gives more weight to the most distinguishing words
● Body length after cleaning
● Difference between body length before and after cleaning
● pdfAttached - binary feature
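To make the feature list concrete, here is a small sketch of how such features could be computed. It uses the textbook weighting tf * log(N / df), which may differ from the variant used in the project. Measuring body length in characters, the function names and the toy documents are likewise assumptions.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def extra_features(raw_body, cleaned_body, pdf_attached):
    """The non-text features from the list above (lengths in characters)."""
    return {
        "body_length": len(cleaned_body),
        "length_difference": len(raw_body) - len(cleaned_body),
        "pdf_attached": int(pdf_attached),
    }


docs = [
    "bestellung rechnung anhang",
    "bestellung storniert erstattung",
    "bestellung versand unterwegs",
]
weights = tfidf(docs)
print(weights[0])
print(extra_features("Hello, World!", "hello world", True))
```

In this toy corpus 'bestellung' occurs in every document and therefore gets weight zero, while 'rechnung' occurs in only one and gets a positive weight. That is the sense in which tf-idf favours distinguishing words.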
Slide 34
Machine Learning Model
Logistic regression
• Linear classifier
• ‘One vs Rest’ strategy
Slide 35
Machine Learning Model
Imbalance strategy
Punish misclassifications of the minority class (Storno) more severely
during the training step
○ Modification of the cost function
○ Pushes the classifier to work better on the minority class
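One common way to modify the cost function like this is to scale each sample's log loss by a class weight. The sketch below assumes inverse-frequency weights, n / (k * count); the talk does not state which weighting scheme was actually used. The counts are the approximate ones from the first iteration, and the function names are illustrative.

```python
import math

# Approximate label counts from the first iteration.
COUNTS = {"rechnung": 100_000, "storno": 4_000, "other": 1_200_000}


def balanced_weights(counts):
    """Weight each class inversely to its frequency: n / (k * count)."""
    total, k = sum(counts.values()), len(counts)
    return {label: total / (k * count) for label, count in counts.items()}


def weighted_log_loss(y_true, p_pred, pos_weight=1.0, neg_weight=1.0):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    loss = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            loss -= pos_weight * math.log(p)
        else:
            loss -= neg_weight * math.log(1.0 - p)
    return loss / len(y_true)


weights = balanced_weights(COUNTS)
# 'Storno vs rest': one missed cancellation (p = 0.1) among correct negatives.
y_true, p_pred = [1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1]
plain = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, pos_weight=weights["storno"])
print(plain, weighted)
```

With the weight applied, the single missed Storno email dominates the loss, so the optimiser has a much stronger incentive to get the minority class right.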
Slide 36
Metrics
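• Precision = Proportion of samples predicted as a class that truly belong to it
• Recall = Proportion of samples of a class that are predicted as that class
• Accuracy = Proportion of all correctly classified samples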
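The following sketch shows why precision and recall are reported next to accuracy when labels are imbalanced. The toy labels and the "always predict Other" model are invented for the example.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision and recall for one class, treated as 'label vs rest'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy labels: 8 'other', 1 'rechnung', 1 'storno'.
y_true = ["other"] * 8 + ["rechnung", "storno"]
# A lazy model that always answers 'other'.
y_pred = ["other"] * 10

print(accuracy(y_true, y_pred))
print(per_class_metrics(y_true, y_pred, "storno"))
```

The lazy model reaches a respectable accuracy while never finding a single Storno email, which only the per-class recall reveals.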
Slide 37
Results: Loss function
Metric              Value
Accuracy            0.99
Average Precision   0.81
Average Recall      0.95
Slide 38
Deployment
Slide 39
idealo Cloud
Multiple cloud systems
1. SiWo
a. VM-based
b. No containerised deployments
c. Legacy system
2. Openshift
3. AWS
Slide 40
idealo Cloud
idealo cloud strategy
Moving away from SiWo cloud
→ Architectural components present in multiple clouds
→ Sensitive data movements between clouds to be encrypted
Slide 41
Model Deployment: Openshift
Flask wrapper deployed on Openshift
• Container application Platform as a Service (PaaS)
• Openshift = Kubernetes on steroids
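A Flask wrapper around a classifier could look roughly like the sketch below. The route name, the JSON field `body` and the keyword-based `classify` stand-in are hypothetical; the real service would load the trained model instead, and its actual interface is not described in the slides.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(body):
    """Stand-in for the trained model: a simple keyword rule (hypothetical)."""
    text = body.lower()
    if "storno" in text:
        return "Storno"
    if "rechnung" in text:
        return "Rechnung"
    return "Other"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload such as {"body": "<email text>"}.
    payload = request.get_json(silent=True) or {}
    body = payload.get("body")
    if not isinstance(body, str):
        return jsonify(error="JSON field 'body' is required"), 400
    return jsonify(label=classify(body))

# In a container the app would typically be started by a WSGI server and
# listen on all interfaces, so that the platform can route traffic to it.
```

Keeping the model behind a small HTTP endpoint lets the Java services call it without sharing any Python code.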
Slide 61
Learnings
• SiWo microservices = Spring Boot (Java)
Openshift model = Flask (Python)
→ Cross-platform encryption had to be implemented
• Reduced email download time from IMAP servers during model
development through parallelization (more info in the idealo tech
blog post)
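The slides do not say how the download was parallelized. One plausible approach, sketched here under that assumption, is to split the message ids into batches and fetch each batch in a worker thread with its own IMAP connection. The batch size, the worker count and all function names are illustrative.

```python
import imaplib
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(host, user, password, message_ids):
    """Download one batch of raw emails over its own IMAP connection."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)
        bodies = []
        for message_id in message_ids:  # ids as strings, e.g. "1", "2"
            _status, data = conn.fetch(message_id, "(RFC822)")
            bodies.append(data[0][1])
        return bodies


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def download_parallel(fetch, message_ids, batch_size=100, workers=8):
    """Run fetch(batch) for every batch in a thread pool, keeping input order."""
    batches = chunk(message_ids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, batches))
    return [body for batch in results for body in batch]
```

A caller would bind the connection details first, for example `download_parallel(lambda ids: fetch_batch(host, user, password, ids), all_ids)`. Threads fit here because the work is dominated by waiting on the network rather than by computation.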
Slide 62
Next steps
• Testing model on new data
• Making the product live (please don’t ask me when)
• Retraining pipeline