Slide 1

Slide 1 text

Monzo’s Customer Service Augmentation
Tech Sessions
Nigel Ng, Data Scientist

Slide 2

Slide 2 text

At Monzo, we want to provide world class customer service

Slide 3

Slide 3 text

But customer service enquiries scale linearly with the number of users

Slide 4

Slide 4 text

How can we meet our goal of response times < 10 minutes?

Slide 5

Slide 5 text

Reduce the number of inbound questions
Increase the productivity of customer support agents

Slide 6

Slide 6 text

Here’s how we used ML to tackle those problems

Slide 7

Slide 7 text

Reducing queries: Enable natural language search for help content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Increasing productivity: Suggest saved responses to a support agent

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

We have ~800 of these saved responses. Approximately 70-80% of queries that come in can be handled by a saved response.

Slide 13

Slide 13 text

We achieved those two goals with a single NLP model

Slide 14

Slide 14 text

Formulate this as an information retrieval problem:

Slide 15

Slide 15 text

Given an incoming question, find the most similar answer from an answer pool

Slide 16

Slide 16 text

We define ‘most similar’ as the pairing (q, a) that yields the highest cosine similarity
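
A minimal sketch of that retrieval step in PyTorch (the shapes and the pre-encoded vectors here are illustrative; the encoder that produces them comes later in the talk):

import torch
import torch.nn.functional as F

# Illustrative: one encoded question against a pool of ~800 encoded
# saved responses (random vectors stand in for real encodings).
question_vec = torch.randn(1, 128)
answer_pool = torch.randn(800, 128)

# Cosine similarity between the question and every answer in the pool.
sims = F.cosine_similarity(question_vec, answer_pool, dim=1)  # shape (800,)

# 'Most similar' = the (q, a) pairing with the highest cosine similarity.
best_answer = sims.argmax().item()
top5 = sims.topk(k=5).indices.tolist()  # e.g. to suggest 5 saved responses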

Slide 17

Slide 17 text

To find the cosine similarity of texts, we need to represent them as vectors.
[Diagram: an encoder maps 'I forgot my PIN' and 'I can reset it for you!' to vectors like [0.2, -3.0, …, 1.8, 0.3] and [0.3, -2.2, …, 0.9, 0.4]]

Slide 18

Slide 18 text

Word2vec works well on our data
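
The talk doesn't say which word2vec implementation was used; a minimal sketch with gensim, under that assumption (corpus and hyperparameters are illustrative):

from gensim.models import Word2Vec

# Hypothetical corpus: tokenised customer-support messages.
sentences = [
    ["i", "forgot", "my", "pin"],
    ["how", "do", "i", "change", "my", "pin"],
    # ... many more conversations
]

# Train word vectors (gensim 4.x API; sizes are illustrative).
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, workers=4)
pin_vec = model.wv["pin"]  # 128-dim vector for a single word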

Slide 19

Slide 19 text

So now we need to go from word vectors to paragraph vectors.
[Diagram: the encoder combines the word vectors for 'I', 'lost', 'my', 'pin' (e.g. [0.2, 0.4, …, -0.2], [-0.1, 0.1, …, 1.1], …) into a single paragraph vector]

Slide 20

Slide 20 text

Our journey to build such an encoder:
- Just take the average/max of the word vectors (sketched below)
- Paragraph vectors (Le & Mikolov, 2014) - I had such high hopes
- Paragraph vectors in combination with textual search (BM25, tf-idf)
- Supervised pre-training and removing the last classification layer (Deng et al., 2009) - doesn't generalize well to less common phrases
- Vanilla word-level RNN - looks good but routinely fails on longer sentences
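
The first approach on the list is simple enough to sketch (mean/max pooling over word2vec vectors; shapes illustrative):

import torch

# word_vecs: one row per word of "I lost my pin", each row the word2vec
# embedding looked up from the trained model.
word_vecs = torch.randn(4, 128)

avg_vec = word_vecs.mean(dim=0)        # average pooling -> paragraph vector
max_vec = word_vecs.max(dim=0).values  # max pooling     -> paragraph vector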

Slide 21

Slide 21 text

Our journey to build such an encoder (continued):
- Hierarchical Attention Networks (Yang et al., 2016) - nice! We used this in production for a while.
- Transformer (Google, 2017) - what we're currently using.

Slide 22

Slide 22 text

Training the Transformer Model

Slide 23

Slide 23 text

Before the Transformer, our models were written in Keras + TF

Slide 24

Slide 24 text

For the Transformer model, we used PyTorch

Slide 25

Slide 25 text

We modified the architecture, keeping only the encoder half
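
A minimal sketch of what an encoder-only model can look like, using today's built-in nn.TransformerEncoder for brevity; all sizes are illustrative, and this is not Monzo's exact architecture:

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encoder half of the Transformer, pooled to one paragraph vector."""
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # (positional encodings omitted for brevity)

    def forward(self, tokens):  # tokens: (seq_len, batch) of token ids
        x = self.embed(tokens)  # (seq_len, batch, d_model)
        x = self.encoder(x)     # contextualised word vectors
        return x.mean(dim=0)    # mean-pool into a paragraph vector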

Slide 26

Slide 26 text

~400k parameters (small model)

Slide 27

Slide 27 text

Customer service conversations are unstructured text data

Slide 28

Slide 28 text

We train with triplets (IBM, Feng et al., 2015):
Q: How do I change my PIN?
A+: You can change your card PIN at any large bank (HSBC, Barclays, etc.) ATM in the UK by selecting PIN services
A-: You'll be pleased to know that we never charge you any fees for withdrawing money from an ATM

Slide 29

Slide 29 text

… using a ranking loss / hinge loss objective function:

loss = max(0, m - cos(Q, A+) + cos(Q, A-))

m is a margin; we set it to 0.2 per the IBM paper (Feng et al., 2015). If cos(Q, A+) is large, the loss is 0, which is what we want.
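
In PyTorch the whole objective is a few lines (a sketch; the function name is mine):

import torch
import torch.nn.functional as F

def triplet_ranking_loss(q, a_pos, a_neg, m=0.2):
    """Hinge loss from Feng et al. (2015):
    loss = max(0, m - cos(Q, A+) + cos(Q, A-))."""
    pos = F.cosine_similarity(q, a_pos, dim=1)
    neg = F.cosine_similarity(q, a_neg, dim=1)
    # Loss hits 0 once the positive beats the negative by margin m.
    return torch.clamp(m - pos + neg, min=0).mean()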

Slide 30

Slide 30 text

Train on GCE instance with 1xP100 GPU for 2-3 days

Slide 31

Slide 31 text

A few ‘tricks’ while training:
- Replace too-hard or too-easy samples with semi-hard examples every few epochs of training (Google FaceNet, 2015)
- Use the same weights for both questions and answers
- Learning rate annealing (we found simple reduce-when-plateau to work better than fancy methods like SGDR; see the sketch after this list)
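
A sketch of how the second and third tricks look in PyTorch (SentenceEncoder is the earlier illustrative sketch, not Monzo's code):

import torch

# Trick 2: one encoder instance encodes both questions and answers,
# so the two sides share weights by construction.
encoder = SentenceEncoder(vocab_size=30_000)
optimizer = torch.optim.Adam(encoder.parameters())

# Trick 3: simple reduce-when-plateau annealing of the learning rate.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

# Inside the training loop (pseudocode):
#   loss = triplet_ranking_loss(encoder(q), encoder(a_pos), encoder(a_neg))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   scheduler.step(val_loss)  # anneal when validation loss plateaus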

Slide 32

Slide 32 text

Serving the model

Slide 33

Slide 33 text

For customer support, the model runs on a GCE instance every minute (CPU only) and pushes suggestions to Intercom via API

Slide 34

Slide 34 text

For help search, the model is served as a Flask microservice in a Docker container on Kubernetes
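
A minimal sketch of such a microservice (the endpoint name, the encode_text helper, answer_pool and help_articles are all hypothetical):

import torch
import torch.nn.functional as F
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the PyTorch checkpoint once at startup (path is illustrative).
encoder = torch.load("encoder.pt", map_location=lambda storage, loc: storage)
encoder.eval()

@app.route("/search", methods=["POST"])
def search():
    query = request.get_json()["query"]
    with torch.no_grad():
        q_vec = encode_text(encoder, query)  # hypothetical tokenise+encode helper
    sims = F.cosine_similarity(q_vec, answer_pool, dim=1)
    top5 = sims.topk(k=5).indices.tolist()
    return jsonify({"results": [help_articles[i] for i in top5]})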

Slide 35

Slide 35 text

In both cases, we’re using the PyTorch model directly. No exporting to Caffe2.

Slide 36

Slide 36 text

It works well for our volumes (3 pods on k8s).
[Chart: response latency in ms]

Slide 37

Slide 37 text

Serving the raw TF protobuf on the same Docker specs crashes the microservice

Slide 38

Slide 38 text

Unit tests to check prediction quality before pushing a new model live
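
Something in the spirit of the sketch below (the names are illustrative, not Monzo's actual test suite):

def test_pin_query_surfaces_pin_response():
    # A canonical query must rank the expected saved response in the
    # top 5 before a new model is allowed to go live.
    top5 = model.suggest("How do I change my PIN?", k=5)  # hypothetical API
    assert any("change your card PIN" in response for response in top5)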

Slide 39

Slide 39 text

Results

Slide 40

Slide 40 text

Our saved response usage went up...

Slide 41

Slide 41 text

… which gave us actionable clustering of support queries. The issues surfaced were all fixed in the latest version of the app, reducing queries by 10% (~800 queries a week).

Slide 42

Slide 42 text

We also use saved response predictions as a classifier

Slide 43

Slide 43 text

Users find what they need XX % of the time (recall @ 5)
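
Recall@5 here is just the fraction of queries whose correct article appears in the top 5 results; a sketch:

def recall_at_k(ranked_results, correct_answers, k=5):
    # Fraction of queries whose correct answer appears in the top-k results.
    hits = sum(correct in ranked[:k]
               for ranked, correct in zip(ranked_results, correct_answers))
    return hits / len(correct_answers)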

Slide 44

Slide 44 text

Search also allows further insights into users’ problems

Slide 45

Slide 45 text

Other learnings

Slide 46

Slide 46 text

Do not train ranking problems as a classification problem.
[Example: the same Q/A+/A- triplet ('How do I change my PIN?' with the PIN-services answer and the ATM-fees answer) recast as two (question, answer) pairs labelled 1 and 0]

Slide 47

Slide 47 text

Transformer does 20% better than HAN with the same number of parameters
http://smerity.com/articles/2017/mixture_of_softmaxes.html

Slide 48

Slide 48 text

Keras is a great tool for rapid prototyping, but more complex models are impossible to implement in it

Slide 49

Slide 49 text

Despite using a dynamic graph, PyTorch performs better than TF in both training and serving

Slide 50

Slide 50 text

… and the size of the binary is smaller (70 MB vs 110 MB)

Slide 51

Slide 51 text

The only scenario in which TF is faster than PyTorch is CPU training

Slide 52

Slide 52 text

Use the right level of abstraction:
Classification → Ranking → Text generation
(moving right means lower abstraction / a more complex model)

Slide 53

Slide 53 text

Fine-tuning on short-term anomalies doesn’t work - catastrophic forgetting. Will try to implement Elastic Weight Consolidation (EWC) if I have time.

Slide 54

Slide 54 text

Use PyTorch! It’s much more fun to code in.

Loading a model (actual code in repo):

TF:

with tf.gfile.FastGFile(graph_path, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='import')

PyTorch:

torch.load(checkpoint_files[0], map_location=lambda storage, loc: storage)

Slide 55

Slide 55 text

Key takeaways

Slide 56

Slide 56 text

Build multi-purpose models to avoid technical debt
Monitor metrics and actual usage habits
Give PyTorch a go!

Slide 57

Slide 57 text

Thank you

Slide 58

Slide 58 text

No content