Deploying email classifier to production

Tanuj
September 06, 2018

The experience of e-commerce customers improves greatly when they have quick access to information about their purchases, such as invoices and shipping updates. For idealo (https://www.idealo.de/), the source of this information is a large volume of merchant emails. To extract the relevant information, incoming emails must first be classified automatically into their respective categories. To that end, an email classification model was developed with machine learning and eventually integrated into production. Several technologies were used along the way.

Transcript

  1. Deploying email classifier to production for an enhanced ecommerce customer experience. Dat Tran (Head of Data Science), Tanuj Jain (Data Scientist), David Vinco (Junior Software Developer). 06/09/2018, Bedcon
  2. Agenda: 1. Motivation 2. Machine Learning Model 3. Deployment 4. Learnings 5. Next steps
  3. Motivation

  4. idealo internet GmbH • Two main businesses: ◦ Price comparison ◦ Ecommerce - idealo Direktkauf
  5. Ecommerce Customer Aftersales • Receive updates about placed orders: ◦ Notification of successful order placement ◦ Access to invoice ◦ Shipping updates ◦ Cancellation notification
  6. Current Process: Stuff you want to buy →

  7. Current Process

  8. Current Process: Every interaction between customer and shop occurs via email through idealo! Order confirmation through email. SHOP
  9. BUT ….

  10. Better Solution: Provide all updates through the idealo user account. Idealo account: Order confirmation, Invoice access, Cancellation, Updates, …..
  11. How to reach the better solution? Extracting relevant information from emails. Understanding email content. Updating user account. Is it an invoice? Or is it a cancellation? Or is it anything else? Invoice: extract pdf. Cancellation: update status. Shipping: extract tracking number. Displaying extracted information on the user front end.
  12. Understanding Email Data

  13. Challenges • Hundreds of shops • Millions of emails • Several email types
  14. Challenges • Hundreds of shops • Millions of emails • Several email types. Machine Learning Magic!
  15. Machine Learning 101, Step 1: Training (Supervised Learning). Where we ‘teach’ our model through data what is classified as an Apple or Banana or Mango. Characterize each fruit by features. Example: 1. Fruit colour 2. Length of the fruit 3. Width of the fruit 4. ….
  16. Machine Learning 101, Step 2: Testing (Supervised Learning). Where we ‘quiz’ our model till it passes with a satisfactory score.
  17. Machine Learning Model: Multiclass classification with binary classifiers (‘One vs Rest’ strategy). 3 classifiers, one per fruit, each trained as that fruit vs all other fruits. Training step: • Teach all 3 classifiers. Test step: • Show test sample to each classifier • Winning class: the one with the highest probability
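The ‘One vs Rest’ idea on slide 17 can be sketched in a few lines of plain Python. This is only an illustration, not the code from the talk: the two feature values per fruit are made-up, standardised numbers, and the helper names (`train_binary`, `train_one_vs_rest`, `predict`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(X, y, epochs=500, lr=0.5):
    """Fit one logistic-regression classifier with plain gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_rest(X, labels):
    """Training step: one binary classifier per class (class vs all others)."""
    models = {}
    for c in sorted(set(labels)):
        y = [1.0 if label == c else 0.0 for label in labels]
        models[c] = train_binary(X, y)
    return models

def predict(models, x):
    """Test step: show the sample to every classifier, highest probability wins."""
    scores = {
        c: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for c, (w, b) in models.items()
    }
    return max(scores, key=scores.get), scores

# Made-up, standardised (length, width) values; not data from the talk.
X = [[-1.0, -1.0], [-1.2, -0.8],   # apple
     [1.0, -1.0], [1.2, -0.9],     # banana
     [0.0, 1.0], [0.1, 1.2]]       # mango
labels = ["apple", "apple", "banana", "banana", "mango", "mango"]

models = train_one_vs_rest(X, labels)
winner, scores = predict(models, [0.9, -1.1])
print(winner, scores)
```

For the email problem the three classes would be Rechnung, Storno and Other instead of fruits; the decision rule stays the same.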
  18. DATA

  19. Labels & Distribution
      • Predict only 3 labels: ◦ Rechnung (Invoice) ◦ Storno (Cancellation) ◦ Other
      • Work with only a subset of emails: ◦ Only a few months
      • High imbalance in label distribution
      First iteration:
      Label                  Count         Share
      Rechnung (invoice)     ~100K         7.95%
      Storno (cancellation)  ~4K           0.25%
      Other                  ~1.2 Million  91.8%
  20. Email Example

  21. Mail Data After Parsing

  22. Data Cleaning

  23. Dirty Data! Greetings, stop words, email ids, numbers

  24. Dirty Data! Weblinks, tenses, split words

  25. Dirty Data! Footers

  26. Data Cleaning Steps
      • Lowercase
      • Remove: ◦ Punctuation ◦ Email ids ◦ Weblinks ◦ Footers ◦ Stop words
      • Merge words like ‘A R T I K E L Ü B E R S I C H T’
      • Stemming and lemmatization: gets word roots ◦ E.g. ‘Nachfolgenden’ becomes ‘nachfolg’
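A rough sketch of several of the cleaning steps from slide 26, using only Python's `re` module. The regular expressions, the tiny stop-word set and the function name `clean_body` are assumptions for illustration; the talk does not show its actual rules. Footer removal and stemming are left out here, since they depend on the mail layout and on a stemming library (for example a Snowball stemmer for German).

```python
import re

# Tiny illustrative stop-word set (assumption); a real German list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "sie", "ihre", "für", "mit"}

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(https?://|www\.)\S+")
# Three or more single characters separated by single spaces, e.g. "a r t i k e l"
SPACED_RE = re.compile(r"\b(?:\w ){2,}\w\b")
PUNCT_RE = re.compile(r"[^\w\s]")

def clean_body(text):
    """Lowercase, drop email ids and weblinks, merge spaced-out words,
    strip punctuation and remove stop words."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = SPACED_RE.sub(lambda m: m.group(0).replace(" ", ""), text)
    text = PUNCT_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

raw = ("Sehr geehrte Kundin, Ihre Rechnung: https://shop.example/r/1 "
       "A R T I K E L Ü B E R S I C H T info@shop.example")
print(clean_body(raw))
```

The order matters: email ids and weblinks are removed before punctuation, because stripping punctuation first would break them into fragments that no longer match.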
  27. Original bodylength distribution; final bodylength distribution

  28. Most Common Words: Rechnung (Invoice)

  29. Most Common Words: Storno (Cancellation)

  30. Most Common Words: Others

  31. pdfAttached Distribution

  32. Observations • Each class has its own high-frequency words • Varying amounts of length reduction after cleaning per class • pdfAttached distribution differs per class → Good candidates for deriving features
  33. Features
      • Text features ◦ Term frequency-inverse document frequency (tfidf) on body ◦ Tfidf gives more weight to the most distinguishing words
      • Body length after cleaning
      • Difference between body length before and after cleaning
      • pdfAttached: binary feature
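To make the features on slide 33 concrete, here is a minimal sketch of the textbook tf-idf weighting (term frequency times log of inverse document frequency) together with the three non-text features. It is an assumption-laden illustration: library implementations usually apply smoothing and normalisation, body length is measured in characters here (the slides do not say), and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def extra_features(raw_body, cleaned_body, pdf_attached):
    """Non-text features; length in characters is an assumption."""
    return {
        "clean_length": len(cleaned_body),
        "length_diff": len(raw_body) - len(cleaned_body),
        "pdfAttached": int(pdf_attached),
    }

docs = [
    "bestellung rechnung anbei rechnung",
    "bestellung storno bestaetigt",
    "bestellung versandt",
]
vectors = tfidf(docs)
print(vectors[0])
print(extra_features("Ihre Rechnung, anbei!", "rechnung anbei", True))
```

A word that occurs in every email ("bestellung" above) gets weight zero, while a word concentrated in one class of emails ("rechnung") gets a high weight, which is the sense in which tfidf favours the most distinguishing words.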
  34. Machine Learning Model: Logistic regression • Linear classifier • ‘One vs Rest’ strategy
  35. Machine Learning Model: Imbalance strategy. Punishing misclassifications more severely on the minority class (Storno) during the training step ◦ Modification in cost function ◦ Pushes classifier to work better on minority class
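One common way to "punish" minority-class errors more severely is to multiply each sample's log loss by a class weight that is inversely proportional to the class frequency. The sketch below assumes this scheme (weight = n_samples / (n_classes * class_count)); the talk does not state which weights were actually used, and the counts are the approximate figures from slide 19.

```python
import math

def balanced_class_weights(counts):
    """Weight each class inversely to its frequency (assumed scheme)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

def weighted_log_loss(y_true, p_pred, w_pos, w_neg, eps=1e-12):
    """Binary log loss where positive (minority) samples carry weight w_pos.

    y_true: list of 0/1 labels, p_pred: predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Approximate label counts from the slides.
counts = {"Rechnung": 100_000, "Storno": 4_000, "Other": 1_200_000}
weights = balanced_class_weights(counts)
print(weights)

# The same confident mistake costs far more on a Storno sample than on a non-Storno one.
rest_weight = 1.0  # hypothetical weight for "all other" samples in the Storno-vs-rest classifier
missed_storno = weighted_log_loss([1], [0.1], weights["Storno"], rest_weight)
false_alarm = weighted_log_loss([0], [0.9], weights["Storno"], rest_weight)
print(missed_storno, false_alarm)
```

With weights like these, the optimiser can no longer reach a low loss by labelling almost everything as Other, which is what pushes the classifier to work better on the minority class.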
  36. Metrics • Precision • Recall • Accuracy = Proportion of all correctly classified samples
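As a small worked example of the metrics on slide 36, the sketch below computes accuracy and per-class precision and recall from two label lists. The labels are invented toy data, not results from the talk, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Proportion of all correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treated as class-vs-rest."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy, imbalanced example: 8 Other, 1 Rechnung, 1 Storno.
y_true = ["Other"] * 8 + ["Rechnung", "Storno"]
y_pred = ["Other"] * 7 + ["Storno", "Rechnung", "Storno"]

print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, "Storno"))
```

With imbalanced labels, accuracy alone says little: a single false alarm leaves accuracy high while halving the precision for Storno, which is why precision and recall are reported alongside it.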
  37. Results: Loss function
      Metric             Value
      Accuracy           0.99
      Average Precision  0.81
      Average Recall     0.95
  38. Deployment

  39. idealo Cloud: Multiple cloud systems
      1. SiWo: a. VM-based b. No containerised deployments c. Legacy system
      2. Openshift
      3. AWS
  40. idealo Cloud: idealo cloud strategy. Moving away from SiWo cloud → Architectural components present in multiple clouds → Sensitive data movements between clouds to be encrypted
  41. Model Deployment: Openshift. Flask wrapper deployed on Openshift • Container application Platform as a Service (PaaS). Kubernetes. Openshift = Kubernetes on steroids

  61. Learnings • SiWo microservices = Springboot (Java), Openshift model = Flask (Python) → Implementation of cross-platform encryption • Reduction of email download time from IMAP servers for model development through parallelization (more info in the idealo tech blog post)
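The download speed-up mentioned on slide 61 can be illustrated with a thread pool, since fetching mail is dominated by waiting on the network. This is a sketch under assumptions: the talk does not say whether threads or processes were used, and `fetch_email` is a hypothetical stand-in that sleeps instead of talking to a real IMAP server (a real version might open one `imaplib.IMAP4_SSL` connection per worker).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_email(message_id):
    """Hypothetical stand-in for a blocking IMAP fetch; sleeps to mimic latency."""
    time.sleep(0.05)
    return f"body of message {message_id}"

def download_all(message_ids, max_workers=8):
    """Fetch many messages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_email, message_ids))

bodies = download_all(range(16))
print(len(bodies), bodies[0])
```

With 8 workers the 16 simulated fetches overlap instead of running one after another, so the total wall-clock time is roughly that of two sequential fetches rather than sixteen.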
  62. Next steps • Testing model on new data • Making the product live (please don’t ask me when) • Retraining pipeline
  63. Questions?