Deploying email classifier to production

Tanuj
September 06, 2018


The user experience of e-commerce customers improves greatly when they have quick access to information about their purchases, such as invoices and shipping updates, among others. For idealo (https://www.idealo.de/), the source of this information is a large volume of merchant emails. To extract the relevant information, incoming emails must be classified automatically into their respective categories. To this end, an email classification model was developed with the help of machine learning and eventually integrated into production. Several technologies were used in this pursuit.


Transcript

  1. Deploying email classifier to production for an enhanced ecommerce customer experience

    Tanuj Jain (Data Scientist), David Vinco (Junior Software Developer)
    Dat Tran - Head of Data Science
    06/09/2018 - Bedcon
  2. Ecommerce Customer Aftersales

    • Receive updates about placed orders:
      ◦ Notification of successful order placement
      ◦ Access to invoice
      ◦ Shipping updates
      ◦ Cancellation notification
  3. Current Process

    Every interaction between customer and shop occurs via email through idealo!
    Diagram labels: SHOP; Order confirmation through email
  4. Better Solution

    Provide all updates through idealo user account
    Idealo account: Order confirmation, Invoice access, Cancellation, Updates, …..
  5. How to reach the better solution?

    Extracting relevant information from emails:
    • Understanding email content
      ◦ Is it an invoice? OR is it a cancellation? OR is it anything else?
    • Updating user account
      ◦ Invoice: Extract pdf
      ◦ Cancellation: Update status
      ◦ Shipping: Extract tracking number
    Displaying extracted information on the user front end.
  6. Challenges

    • Hundreds of shops
    • Millions of emails
    • Several email types
    Machine Learning Magic!
  7. Machine Learning 101 (Supervised Learning)

    Step 1: Training
    Where we ‘teach’ our model through data what is classified as an Apple or Banana or Mango.
    Characterize each fruit by features. Example:
    1. Fruit colour
    2. Length of the fruit
    3. Width of the fruit
    4. ….
  8. Machine Learning 101 (Supervised Learning)

    Step 2: Testing
    Where we ‘quiz’ our model till it passes with a satisfactory score.
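The ‘quiz’ step can be sketched as holding back part of the labelled data and scoring the model only on that part. This is a minimal illustration, not the authors' evaluation code: the helper names `train_test_split` and `accuracy`, the 25% test share and the use of plain accuracy as the score are all assumptions.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle (features, label) pairs and hold back a share for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Share of held-back samples for which the model predicts the true label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy data (assumption): one numeric feature, two labels.
samples = [((i,), "apple" if i < 10 else "banana") for i in range(20)]
train_set, test_set = train_test_split(samples)

# A hand-written rule stands in for a trained model here.
score = accuracy(lambda features: "apple" if features[0] < 10 else "banana", test_set)
```

With heavily imbalanced labels, accuracy alone can look good while the rare class is missed, so a per-class score is usually worth checking as well.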
  9. Machine Learning Model

    Multiclass classification with binary classifiers: ‘One vs Rest’ strategy
    3 classifiers:
    • Apple vs all other fruits
    • Banana vs all other fruits
    • Mango vs all other fruits
    Training step:
    • Teach all 3 classifiers
    Test step:
    • Show test sample to each classifier
    • Winning class: the one with the highest probability
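The ‘One vs Rest’ strategy can be sketched in plain Python as follows. The slides do not name the underlying binary classifier, so the logistic regression trained by gradient descent, the toy fruit features and all function names here are assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, epochs=2000, lr=1.0):
    """Fit one logistic-regression classifier (weights, bias) by gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_rest(X, labels):
    """Training step: one binary classifier per class (that class = 1, rest = 0)."""
    return {
        cls: train_binary(X, [1 if label == cls else 0 for label in labels])
        for cls in sorted(set(labels))
    }

def predict(models, x):
    """Test step: score x with every classifier; the highest probability wins."""
    probs = {
        cls: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for cls, (w, b) in models.items()
    }
    return max(probs, key=probs.get), probs

# Toy features (assumption): (length, width) of each fruit, scaled to [0, 1].
X = [(0.30, 0.30), (0.35, 0.32), (0.90, 0.15), (0.85, 0.12), (0.55, 0.40), (0.60, 0.42)]
labels = ["apple", "apple", "banana", "banana", "mango", "mango"]

models = train_one_vs_rest(X, labels)
winner, probabilities = predict(models, (0.88, 0.13))
```

The per-class probabilities come from independent classifiers, so they do not need to sum to 1; only their ranking is used to pick the winner.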
  10. Labels & Distribution (first iteration)

    • Predict only 3 labels:
      ◦ Rechnung (Invoice)
      ◦ Storno (Cancellation)
      ◦ Other
    • Work with only a subset of emails:
      ◦ Only a few months
    • High imbalance in label distribution

    Label                    Count          Share
    Rechnung (invoice)       ~100K          7.95%
    Storno (cancellation)    ~4K            0.25%
    Other                    ~1.2 Million   91.8%
  11. Data Cleaning Steps

    • Lowercase
    • Remove:
      ◦ Punctuation
      ◦ Email Ids
      ◦ Weblinks
      ◦ Footers
      ◦ Stop words
    • Merge words like ‘A R T I K E L Ü B E R S I C H T’
    • Stemming and Lemmatization: gets word roots
      ◦ E.g. ‘Nachfolgenden’ becomes ‘nachfolg’
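Most of the cleaning steps above can be sketched with regular expressions from the standard library. This is a rough illustration under stated assumptions: the tiny `STOP_WORDS` set, the regex patterns and the function names are made up for the example, and footer removal as well as stemming and lemmatization are left out because they need shop-specific rules or a language library.

```python
import re

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "oder", "ist", "im", "an"}

def merge_spaced_letters(text):
    """Join runs of single letters such as 'A R T I K E L' into 'ARTIKEL'."""
    return re.sub(
        r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b",
        lambda match: match.group(0).replace(" ", ""),
        text,
    )

def clean_email_body(text):
    text = merge_spaced_letters(text)
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)                 # email ids
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # weblinks
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)

raw = (
    "Ihre Rechnung ist im Anhang! Fragen an service@shop.example "
    "oder https://shop.example/hilfe. A R T I K E L Ü B E R S I C H T"
)
cleaned = clean_email_body(raw)
```

Email ids and weblinks are removed before punctuation so that the `@`, `:` and `/` characters are still available to recognise them.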
  12. Observations

    • Each class has its own high frequency words
    • Varying amounts of length reduction after cleaning per class
    • pdfAttached distribution different per class
    → Good candidates for deriving features
  13. Features

    • Text features
      ◦ Term frequency inverse document frequency (tfidf) on body
      ◦ Tfidf gives more weight to the most distinguishing words
    • Body length after cleaning
    • Difference between body length before and after cleaning
    • pdfAttached - binary feature
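The features listed above can be sketched in a few lines. The tf-idf variant used here is the textbook form, tf × log(N / df); libraries often apply smoothed variants, and the slides do not say which one was used. The function names and the dictionary keys other than `pdfAttached` are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def extra_features(raw_body, cleaned_body, has_pdf):
    """Non-text features: cleaned length, length reduction, pdf flag."""
    return {
        "body_length_cleaned": len(cleaned_body),
        "length_reduction": len(raw_body) - len(cleaned_body),
        "pdfAttached": int(has_pdf),
    }

docs = ["rechnung anhang bestellung", "storno bestellung", "versand bestellung"]
weights = tfidf(docs)
features = extra_features("Hello, World!", "hello world", True)
```

A word such as ‘bestellung’ that occurs in every document gets weight 0, while a word found in only one class of emails keeps a positive weight, which is the sense in which tf-idf favours distinguishing words.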
  14. Machine Learning Model

    Imbalance strategy
    Punishing misclassifications more severely on minority class (Storno) during training step
      ◦ Modification in cost function
      ◦ Pushes classifier to work better on minority class
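One common way to modify a cost function like this is to scale each sample's loss by a weight that grows as its class gets rarer. The sketch below shows that idea for a binary log loss; the inverse-frequency weighting formula and the function names are assumptions, since the slides do not specify the exact modification.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum

# Toy data (assumption): 1 = minority class (e.g. Storno), 0 = everything else.
y_true = [1] + [0] * 9
weights = class_weights(y_true)

# Same kind of mistake, once on the minority sample and once on a majority sample.
miss_minority = weighted_log_loss(y_true, [0.1] + [0.1] * 9, weights)
miss_majority = weighted_log_loss(y_true, [0.9, 0.9] + [0.1] * 8, weights)
```

With these weights, missing the single minority sample costs noticeably more than missing one majority sample, which is what pushes the classifier to work better on the minority class.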
  15. idealo Cloud

    Multiple cloud systems
    1. SiWo
       a. VM-based
       b. No containerised deployments
       c. Legacy system
    2. Openshift
    3. AWS
  16. idealo Cloud

    idealo cloud strategy: moving away from SiWo cloud
    → Architectural components present in multiple clouds
    → Sensitive data movements between clouds to be encrypted
  17. Model Deployment: Openshift

    Flask wrapper deployed on Openshift
    • Container application Platform as a Service (PaaS)
    Kubernetes
    Openshift = Kubernetes on steroids
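A Flask wrapper around a classifier can be as small as a single prediction endpoint. The sketch below is only an illustration of that shape: the `/predict` route, the JSON field names and the keyword-based `classify` stand-in are hypothetical, and the real service would load the trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(body):
    """Hypothetical stand-in for the trained model: keyword rules only."""
    text = body.lower()
    if "storno" in text:
        return "Storno"
    if "rechnung" in text:
        return "Rechnung"
    return "Other"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload such as {"body": "<email text>"}.
    payload = request.get_json(silent=True) or {}
    body = payload.get("body")
    if not isinstance(body, str):
        return jsonify(error="JSON field 'body' (string) is required"), 400
    return jsonify(label=classify(body))
```

For a local try-out, the app can be started with the `flask run` command; inside a container it would normally sit behind a production WSGI server.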

  37. Learnings

    • SiWo microservices = Springboot (Java), Openshift model = Flask (Python)
      → Implementation of cross-platform encryption
    • Reduction of email download time from IMAP servers for model development through parallelization (more info at the idealo tech blogpost)
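The parallel download can be sketched with a thread pool in which every worker fetches one batch of messages over its own IMAP connection. The slides do not describe how the parallelization was done, so the thread-based approach, the batch size, the worker count and all function names are assumptions; `fetch_batch` is shown for shape only and needs real server credentials to run.

```python
import imaplib
from concurrent.futures import ThreadPoolExecutor

def fetch_batch(host, user, password, uids):
    """Download one batch of messages over its own IMAP connection."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)
        messages = []
        for uid in uids:
            status, data = conn.uid("fetch", uid, "(RFC822)")
            if status == "OK" and data and data[0]:
                messages.append(data[0][1])  # raw message bytes
        return messages
    finally:
        conn.logout()

def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def download_parallel(uids, fetch, batch_size=100, workers=8):
    """Fetch batches of UIDs concurrently; results keep the batch order."""
    batches = chunked(uids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch, batches)
        return [message for batch in results for message in batch]

# Offline demonstration with a fake fetch function instead of a real server.
fake_uids = [str(i) for i in range(10)]
downloaded = download_parallel(
    fake_uids, lambda batch: [uid.encode() for uid in batch], batch_size=3, workers=4
)
```

Against a real server, `fetch` would be something like `lambda batch: fetch_batch(host, user, password, batch)`. Threads fit here because the work is dominated by waiting on the network rather than by computation.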
  38. Next steps

    • Testing model on new data
    • Making the product live (please don’t ask me when)
    • Retraining pipeline