Slide 1

Slide 1 text

Dat Tran - Head of Data Science
Deploying email classifier to production for an enhanced ecommerce customer experience
Tanuj Jain (Data Scientist)
David Vinco (Junior Software Developer)
06/09/2018 - Bedcon

Slide 2

Slide 2 text

Agenda
1. Motivation
2. Machine Learning Model
3. Deployment
4. Learnings
5. Next steps

Slide 3

Slide 3 text

Motivation

Slide 4

Slide 4 text

idealo internet GmbH
● Two main businesses:
  ○ Price comparison
  ○ Ecommerce - idealo Direktkauf

Slide 5

Slide 5 text

Ecommerce Customer Aftersales
● Receive updates about placed orders:
  ○ Notification of successful order placement
  ○ Access to invoice
  ○ Shipping updates
  ○ Cancellation notification

Slide 6

Slide 6 text

Current Process
Stuff you want to buy →

Slide 7

Slide 7 text

Current Process

Slide 8

Slide 8 text

Current Process
Every interaction between customer and shop occurs via email through idealo!
Order confirmation through email
SHOP

Slide 9

Slide 9 text

BUT ….

Slide 10

Slide 10 text

Better Solution
Provide all updates through the idealo user account
idealo account:
● Order confirmation
● Invoice access
● Cancellation
● Updates
● …

Slide 11

Slide 11 text

How to reach the better solution?
1. Understanding email content
   Is it an invoice? OR Is it a cancellation? OR Is it anything else?
2. Extracting relevant information from emails
   Invoice: Extract pdf
   Cancellation: Update status
   Shipping: Extract tracking number
3. Updating user account
   Displaying extracted information on the user front end.
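The flow above can be sketched as a dispatch on the predicted email type. This is only an illustration: the handler names, the label strings and the returned action strings are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict

# Hypothetical handlers: each returns a description of the follow-up action.
def handle_invoice(email: dict) -> str:
    return "extract pdf"

def handle_cancellation(email: dict) -> str:
    return "update status"

def handle_shipping(email: dict) -> str:
    return "extract tracking number"

def handle_other(email: dict) -> str:
    return "no action"

# Map each predicted label to the step that extracts its information.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "invoice": handle_invoice,
    "cancellation": handle_cancellation,
    "shipping": handle_shipping,
}

def route(email: dict, predicted_label: str) -> str:
    """Pick the follow-up action for an email from its predicted label."""
    handler = HANDLERS.get(predicted_label, handle_other)
    return handler(email)
```

Anything the classifier does not recognise falls through to `handle_other`, so unknown email types never trigger an account update.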

Slide 12

Slide 12 text

Understanding Email Data

Slide 13

Slide 13 text

Challenges
● Hundreds of shops
● Millions of emails
● Several email types

Slide 14

Slide 14 text

Challenges
● Hundreds of shops
● Millions of emails
● Several email types
Machine Learning Magic!

Slide 15

Slide 15 text

Machine Learning 101: Supervised Learning
Step 1: Training
Where we ‘teach’ our model through data what is classified as an Apple or Banana or Mango.
Characterize each fruit by features. Example:
1. Fruit colour
2. Length of the fruit
3. Width of the fruit
4. ….

Slide 16

Slide 16 text

Machine Learning 101: Supervised Learning
Step 2: Testing
Where we ‘quiz’ our model till it passes with a satisfactory score.

Slide 17

Slide 17 text

Machine Learning Model
Multiclass classification with binary classifiers: ‘One vs Rest’ strategy
3 classifiers:
• Apple vs all other fruits
• Banana vs all other fruits
• Mango vs all other fruits
Training step:
• Teach all 3 classifiers
Test step:
• Show test sample to each classifier
• Winning class: the one with the highest probability
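A minimal sketch of the ‘One vs Rest’ idea in plain Python, assuming a logistic-regression binary classifier trained by gradient descent. The toy data is an assumption: each fruit is described only by a one-hot colour vector, and the colours assigned to the fruits are illustrative.

```python
import math
from typing import Dict, List, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of one binary classifier

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(model: Model, x: List[float]) -> float:
    """Probability that x belongs to the classifier's own class."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_binary(X: List[List[float]], y: List[int],
                 epochs: int = 500, lr: float = 0.5) -> Model:
    """Fit one logistic-regression classifier with batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = score((weights, bias), xi) - yi
            grad_w = [g + err * v for g, v in zip(grad_w, xi)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def train_one_vs_rest(X: List[List[float]], labels: List[str]) -> Dict[str, Model]:
    """Training step: one 'this class vs all others' classifier per class."""
    return {c: train_binary(X, [1 if label == c else 0 for label in labels])
            for c in sorted(set(labels))}

def predict(models: Dict[str, Model], x: List[float]) -> str:
    """Test step: show x to every classifier; the highest probability wins."""
    return max(models, key=lambda c: score(models[c], x))

# Toy data (assumption): one-hot colour features [red, yellow, green].
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = ["apple", "apple", "banana", "banana", "mango", "mango"]
models = train_one_vs_rest(X, labels)
```

With these models, `predict(models, [0.7, 0.3, 0.0])` (a mostly red sample) is assigned to "apple", because the apple classifier gives it the highest probability of the three.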

Slide 18

Slide 18 text

DATA

Slide 19

Slide 19 text

Labels & Distribution
● Predict only 3 labels:
  ○ Rechnung (Invoice)
  ○ Storno (Cancellation)
  ○ Other
● Work with only a subset of emails:
  ○ Only a few months
● High imbalance in label distribution

First iteration
Label                  Count         Share
Rechnung (invoice)     ~100K         7.95%
Storno (cancellation)  ~4K           0.25%
Other                  ~1.2 Million  91.8%

Slide 20

Slide 20 text

Email Example

Slide 21

Slide 21 text

Mail Data After Parsing

Slide 22

Slide 22 text

Data Cleaning

Slide 23

Slide 23 text

Dirty Data!
Greetings, stop words, email ids, numbers

Slide 24

Slide 24 text

Dirty Data!
Weblinks, tenses, split words

Slide 25

Slide 25 text

Dirty Data!
Footers

Slide 26

Slide 26 text

Data Cleaning Steps
● Lowercase
● Remove:
  ○ Punctuation
  ○ Email ids
  ○ Weblinks
  ○ Footers
  ○ Stop words
● Merge words like ‘A R T I K E L Ü B E R S I C H T’
● Stemming and lemmatization: gets word roots
  ○ E.g. ‘Nachfolgenden’ becomes ‘nachfolg’
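Some of the cleaning steps above can be sketched with regular expressions. This is a rough illustration, not the pipeline used in the talk: the regex patterns and the tiny stop-word list are assumptions, and footer removal and stemming are left out because they depend on shop-specific rules and an external stemmer.

```python
import re

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "sie", "ihre", "wir", "ist"}

EMAIL_RE = re.compile(r"\S+@\S+")                 # anything containing an @
URL_RE = re.compile(r"(?:https?://|www\.)\S+")    # http(s) links and www. links
PUNCT_RE = re.compile(r"[^\w\s]")                 # every non-word, non-space char
SPACED_RE = re.compile(r"\b(?:\w ){2,}\w\b")      # runs like 'A R T I K E L'

def merge_spaced_letters(text: str) -> str:
    """Join runs of single letters separated by spaces into one word."""
    return SPACED_RE.sub(lambda m: m.group(0).replace(" ", ""), text)

def clean_body(text: str) -> str:
    """Merge split words, lowercase, strip ids, links, punctuation, stop words."""
    text = merge_spaced_letters(text)
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

Email ids and weblinks are removed before punctuation, since stripping the `@`, `:` and `/` characters first would leave fragments that the patterns can no longer recognise.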

Slide 27

Slide 27 text

Original body length distribution
Final body length distribution

Slide 28

Slide 28 text

Most Common Words: Rechnung (Invoice)

Slide 29

Slide 29 text

Most Common Words: Storno (Cancellation)

Slide 30

Slide 30 text

Most Common Words: Others

Slide 31

Slide 31 text

pdfAttached Distribution

Slide 32

Slide 32 text

Observations
● Each class has its own high-frequency words
● The amount of length reduction after cleaning varies per class
● The pdfAttached distribution differs per class
→ Good candidates for deriving features

Slide 33

Slide 33 text

Features
● Text features
  ○ Term frequency inverse document frequency (tfidf) on body
  ○ Tfidf gives more weight to the most distinguishing words
● Body length after cleaning
● Difference between body length before and after cleaning
● pdfAttached - binary feature
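The features above can be sketched in plain Python. The tf-idf variant here is the textbook form (term frequency times log of N over document frequency), which is an assumption; library implementations usually add smoothing and normalisation. The function and key names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def extra_features(raw_body: str, cleaned_body: str, pdf_attached: bool) -> Dict[str, int]:
    """Non-text features: cleaned length, length removed by cleaning, pdf flag."""
    return {
        "body_length": len(cleaned_body),
        "length_diff": len(raw_body) - len(cleaned_body),
        "pdf_attached": int(pdf_attached),
    }
```

A word that occurs in every email gets an idf of log(1) = 0 and therefore no weight, while a word that occurs in only a few emails keeps a positive weight. This is what makes tf-idf favour the distinguishing words.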

Slide 34

Slide 34 text

Machine Learning Model
Logistic regression
• Linear classifier
• ‘One vs Rest’ strategy

Slide 35

Slide 35 text

Machine Learning Model
Imbalance strategy
• Punishing misclassifications more severely on the minority class (Storno) during the training step
  ○ Modification in cost function
  ○ Pushes classifier to work better on minority class
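One common way to modify the cost function is to weight each sample's log loss by its class. The sketch below assumes inverse-frequency weights, n / (k * count); the talk does not state which weights were actually used, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

def weighted_log_loss(y_true: List[int], p_pred: List[float],
                      weight_pos: float, weight_neg: float) -> float:
    """Binary cross-entropy where errors on the positive class cost weight_pos."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -weight_pos * math.log(p)
        else:
            total += -weight_neg * math.log(1.0 - p)
    return total / len(y_true)
```

With 8 "other" emails and 2 "storno" emails, the weights come out as 0.625 and 2.5, so a misclassified Storno email costs four times as much as a misclassified Other email.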

Slide 36

Slide 36 text

Metrics
• Precision
• Recall
• Accuracy = Proportion of all correctly classified samples
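The slide defines only accuracy. The sketch below uses the standard per-class definitions for the other two: precision = TP / (TP + FP) and recall = TP / (TP + FN). Returning 0.0 when a denominator is zero is a convention chosen here, and the toy labels are illustrative.

```python
from typing import List, Tuple

def precision_recall(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float]:
    """Per-class precision and recall from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Proportion of all correctly classified samples."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy example: a model that always answers "other" on imbalanced data.
y_true = ["other"] * 8 + ["storno"] * 2
y_pred = ["other"] * 10
```

On this toy data the always-"other" model reaches an accuracy of 0.8 while its recall on "storno" is 0.0, which is why accuracy alone says little when the labels are highly imbalanced.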

Slide 37

Slide 37 text

Results: Loss function

Metric             Value
Accuracy           0.99
Average Precision  0.81
Average Recall     0.95

Slide 38

Slide 38 text

Deployment

Slide 39

Slide 39 text

idealo Cloud
Multiple cloud systems
1. SiWo
   a. VM-based
   b. No containerised deployments
   c. Legacy system
2. Openshift
3. AWS

Slide 40

Slide 40 text

idealo Cloud
idealo cloud strategy: moving away from SiWo cloud
→ Architectural components present in multiple clouds
→ Sensitive data movements between clouds to be encrypted

Slide 41

Slide 41 text

Model Deployment: Openshift
Flask wrapper deployed on Openshift
• Container application Platform as a Service (PaaS)
Kubernetes
Openshift = Kubernetes on steroids
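A Flask wrapper around a classifier could look roughly like the sketch below. The `/predict` route, the JSON field names and the keyword-based `predict_label` placeholder are all assumptions for illustration; in a real service the placeholder would be replaced by the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_label(body: str) -> str:
    """Placeholder for the trained model (assumption): a simple keyword rule."""
    text = body.lower()
    if "rechnung" in text:
        return "Rechnung"
    if "storno" in text:
        return "Storno"
    return "Other"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload such as {"body": "<email text>"}.
    payload = request.get_json(silent=True) or {}
    body = payload.get("body", "")
    return jsonify({"label": predict_label(body)})
```

In a container, an app like this is typically started by a WSGI server that imports `app`, which is why the sketch does not call `app.run()` itself.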

Slide 61

Slide 61 text

Learnings
• SiWo microservices = Springboot (Java), Openshift model = Flask (Python)
  → Implementation of cross-platform encryption
• Reduction of email download time from IMAP servers for model development through parallelization (more info in the idealo tech blog post)
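The parallel download can be sketched with a thread pool. The talk does not say how the parallelization was done, so the use of threads, the function names and the one-connection-per-message approach are assumptions; `make_imap_fetch` is a hypothetical helper built on the standard `imaplib` module.

```python
import imaplib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

def download_all(message_ids: Iterable[str],
                 fetch: Callable[[str], bytes],
                 max_workers: int = 8) -> Dict[str, bytes]:
    """Fetch many messages concurrently; `fetch` maps one id to raw bytes."""
    ids = list(message_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = pool.map(fetch, ids)  # results come back in input order
        return dict(zip(ids, bodies))

def make_imap_fetch(host: str, user: str, password: str) -> Callable[[str], bytes]:
    """Build a fetch function that opens its own IMAP connection per call."""
    def fetch(message_id: str) -> bytes:
        conn = imaplib.IMAP4_SSL(host)
        try:
            conn.login(user, password)
            conn.select("INBOX", readonly=True)
            _, data = conn.fetch(message_id, "(RFC822)")
            return data[0][1]  # raw message bytes
        finally:
            conn.logout()
    return fetch
```

Downloading is network bound, so threads spend most of their time waiting on the server and can overlap those waits. Opening one connection per message keeps the sketch simple; reusing one connection per worker would reduce login overhead.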

Slide 62

Slide 62 text

Next steps
• Testing model on new data
• Making the product live (please don’t ask me when)
• Retraining pipeline

Slide 63

Slide 63 text

Questions?