Monzo’s Customer Service
Augmentation
Tech Sessions
Nigel Ng
Data Scientist
Slide 2
Slide 2 text
At Monzo, we want to provide
world class customer service
Slide 3
Slide 3 text
But customer service enquiries
scale linearly with the number of users
Slide 4
Slide 4 text
How can we meet our goal of
response times under 10 minutes?
Slide 5
Slide 5 text
Reduce the number of inbound questions
Increase the productivity of customer support agents
Slide 6
Slide 6 text
Here’s how we used ML to tackle those
problems
Slide 7
Slide 7 text
Reducing queries:
Enable natural language search for
help content
Slide 8
Slide 8 text
No content
Slide 9
Slide 9 text
No content
Slide 10
Slide 10 text
Increasing productivity:
Suggest saved responses to a support
agent
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
We have ~800 of these saved responses.
Approximately 70-80% of queries that come
in can be handled by a saved response.
Slide 13
Slide 13 text
We achieved those two goals with a
single NLP model
Slide 14
Slide 14 text
Formulate this as an information
retrieval problem:
Slide 15
Slide 15 text
Given an incoming question, find the
most similar answer from an answer
pool
Slide 16
Slide 16 text
We define ‘most similar’ as the
pairing (q, a) that yields the
highest cosine similarity
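As a toy sketch of this definition (made-up 2-d vectors; real encoder vectors are much higher-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = u.v / (|u| |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_answer(q_vec, answer_vecs):
    # index of the answer whose vector is most similar to the question
    sims = [cosine_similarity(q_vec, a) for a in answer_vecs]
    return int(np.argmax(sims))
```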
Slide 17
Slide 17 text
To find cosine similarity of texts,
we need to represent them as
vectors
Encoder
I forgot my PIN
I can reset it for you!
[0.2, -3.0, …, 1.8, 0.3]
[0.3, -2.2, …, 0.9, 0.4]
Slide 18
Slide 18 text
Word2vec works well on our data
Slide 19
Slide 19 text
So now we need to go from word vectors to paragraph vectors

Encoder:
I [0.2, 0.4, …, -0.2]
lost [-0.1, 0.1, …, 1.1]
my [0.5, 1.2, …, -2.8]
pin [-1.4, -0.1, …, -0.5]
[-1.1, -0.4, …, 0.8]
[0.9, -3.0, …, 1.8, 0.3]
Slide 20
Slide 20 text
Our journey to build such an encoder
Just take average/max of the word vectors
Paragraph vectors (Le & Mikolov, 2014) - I had such high hopes
Paragraph vectors in combination with textual search (BM25, tf-idf)
Supervised pre-training and removing the last classification layer (Deng et al., 2009) -
doesn’t generalize well to less common phrases
Vanilla word-level RNN - looks good but routinely fails on longer sentences
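The first option on that list, averaging the word vectors, is a one-liner. A toy sketch with made-up 3-d vectors (real word2vec embeddings are typically 100-300 dimensions):

```python
import numpy as np

# made-up 3-d word vectors standing in for real word2vec embeddings
word_vecs = {
    "i":    np.array([0.2, 0.4, -0.2]),
    "lost": np.array([-0.1, 0.1, 1.1]),
    "my":   np.array([0.5, 1.2, -2.8]),
    "pin":  np.array([-1.4, -0.1, -0.5]),
}

def sentence_vector(tokens):
    # mean-pool the word vectors into one fixed-size sentence vector
    return np.mean([word_vecs[t] for t in tokens], axis=0)
```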
Slide 21
Slide 21 text
Our journey to build such an encoder
Hierarchical Attention Networks (Yang et al., 2016) - nice! We used this in production
for a while.
Transformer (Vaswani et al., 2017) - what we're currently using.
Slide 22
Slide 22 text
Training the Transformer Model
Slide 23
Slide 23 text
Before using transformer, our
models were written in Keras + TF
Slide 24
Slide 24 text
For the transformer model, we
used Pytorch
Slide 25
Slide 25 text
We modify the architecture by
only taking the encoder
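A rough sketch of what an encoder-only model looks like; the dimensions, pooling, and layer counts here are illustrative assumptions, not Monzo's actual configuration (modern Pytorch ships these pieces as nn.TransformerEncoder):

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    # Encoder-only Transformer: embed tokens, run self-attention layers,
    # then mean-pool over the sequence to get one vector per sentence.
    def __init__(self, vocab_size=10000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):                 # tokens: (seq_len, batch)
        h = self.encoder(self.embed(tokens))   # (seq_len, batch, d_model)
        return h.mean(dim=0)                   # (batch, d_model)
```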
Slide 26
Slide 26 text
~400k parameters (small model)
Slide 27
Slide 27 text
Customer service conversations
are unstructured text data
Slide 28
Slide 28 text
We train with triplets...
Q:  How do I change my PIN?
A+: You can change your card PIN at any large bank (HSBC, Barclays, etc.) ATM in the UK by selecting PIN services
A-: You'll be pleased to know that we never charge you any fees for withdrawing money from an ATM
(IBM, Feng et al., 2015)
Slide 29
Slide 29 text
… using a ranking loss / hinge loss
objective function
The loss is max(0, m - cos(Q, A+) + cos(Q, A-)), where m is a margin; we set it to 0.2, following the IBM paper (Feng et al., 2015)
When cos(Q, A+) exceeds cos(Q, A-) by at least m, the loss is 0, which is what we want
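This objective is a few lines in Pytorch; a sketch with my own variable names, using the 0.2 margin from the slide:

```python
import torch
import torch.nn.functional as F

def ranking_loss(q, a_pos, a_neg, margin=0.2):
    # hinge loss on cosine similarities: zero once the positive answer
    # beats the negative by at least the margin (Feng et al., 2015)
    pos = F.cosine_similarity(q, a_pos)
    neg = F.cosine_similarity(q, a_neg)
    return torch.clamp(margin - pos + neg, min=0).mean()
```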
Slide 30
Slide 30 text
Train on GCE instance with 1xP100
GPU for 2-3 days
Slide 31
Slide 31 text
A few ‘tricks’ while training:
Replace too-hard or too-easy samples with semi-hard examples
every few epochs of training (FaceNet, Schroff et al., 2015)
Use the same weights for both questions and answers
Learning rate annealing (we found simple reduce-on-plateau to
work better than fancy methods like SGDR)
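FaceNet's semi-hard criterion can be written as a small predicate; a sketch, assuming the similarities are cosine scores and the margin is the same 0.2 used in the loss:

```python
def is_semi_hard(pos_sim, neg_sim, margin=0.2):
    # A negative is "semi-hard" if it is less similar to the question than
    # the positive (not too hard), but still within the margin so the hinge
    # loss is non-zero (not too easy). (Schroff et al., 2015)
    return neg_sim < pos_sim < neg_sim + margin
```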
Slide 32
Slide 32 text
Serving the model
Slide 33
Slide 33 text
For customer support, the model runs
on a GCE instance every minute
(CPU only) and pushes suggestions
to Intercom via its API
Slide 34
Slide 34 text
For help search, serve as a Flask
microservice in a Docker
container on Kubernetes
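A minimal shape for such a service; the endpoint name and the word-overlap scoring below are hypothetical stand-ins (the real service scores cosine similarity with the encoder over pre-encoded answers):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# toy answer pool; in production each answer is encoded once up front
ANSWERS = [
    "You can reset your PIN at any large bank ATM in the UK.",
    "We never charge fees for withdrawing money from an ATM.",
]

def rank_answers(question):
    # placeholder word-overlap score in place of encoder cosine similarity
    words = question.lower().split()
    return sorted(ANSWERS, key=lambda a: -sum(w in a.lower() for w in words))

@app.route("/search")
def search():
    q = request.args.get("q", "")
    return jsonify(results=rank_answers(q)[:5])
```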
Slide 35
Slide 35 text
In both cases, we’re using the
Pytorch model directly. No
exporting to Caffe2
Slide 36
Slide 36 text
It works well for our volumes
[latency chart, ms; 3 pods on k8s]
Slide 37
Slide 37 text
Serving the raw TF protobuf with
the same Docker resource limits
crashed the microservice
Slide 38
Slide 38 text
Unit tests to check prediction quality
before pushing a new model live
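A sketch of what such a gate might look like; the canonical cases and accuracy threshold here are made up:

```python
def passes_quality_gate(predict, cases, min_accuracy=0.9):
    # refuse to deploy a model that regresses on canonical queries
    correct = sum(predict(question) == expected for question, expected in cases)
    return correct / len(cases) >= min_accuracy
```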
Slide 39
Slide 39 text
Results
Slide 40
Slide 40 text
Our saved response usage went
up...
Slide 41
Slide 41 text
… which gave us actionable
clustering of support queries
All fixed in latest version of
the app
Reduced queries by 10%
(~800 queries a week)
Slide 42
Slide 42 text
We also use saved-response
predictions as a classifier
Slide 43
Slide 43 text
Users find what they need XX % of
the time (recall @ 5)
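Recall @ 5 here means the answer the user needed appears among the top five results. A sketch of the metric:

```python
def recall_at_k(ranked_results, relevant, k=5):
    # fraction of queries whose relevant answer appears in the top k results
    hits = sum(rel in ranked[:k] for ranked, rel in zip(ranked_results, relevant))
    return hits / len(relevant)
```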
Slide 44
Slide 44 text
Search also gives us further insight
into users' problems
Slide 45
Slide 45 text
Other learnings
Slide 46
Slide 46 text
Do not train ranking problems as classification problems
(Q: How do I change my PIN?, A: You can change your card PIN at any large bank (HSBC, Barclays, etc.) ATM in the UK by selecting PIN services) → label 1
(Q: How do I change my PIN?, A: You'll be pleased to know that we never charge you any fees for withdrawing money from an ATM) → label 0
Slide 47
Slide 47 text
Transformer does 20% better than
HAN with same number of
parameters
http://smerity.com/articles/2017/mixture_of_softmaxes.html
Slide 48
Slide 48 text
Keras is a great tool for rapid
prototyping, but we found it
impractical for more complex models
Slide 49
Slide 49 text
Despite using dynamic graphs,
Pytorch performed better for us than
TF in both training and serving
Slide 50
Slide 50 text
… and the size of the binary is
smaller (70 MB vs 110 MB)
Slide 51
Slide 51 text
The only scenario in which TF is
faster than Pytorch is CPU training
Slide 52
Slide 52 text
Use the right level of abstraction:
Classification → Ranking → Text generation
(left to right: lower abstraction, more complex models)
Slide 53
Slide 53 text
Fine-tuning on short-term
anomalies doesn't work -
catastrophic forgetting
Will try to implement Elastic
Weight Consolidation (EWC) if I
have time
Slide 54
Slide 54 text
Use Pytorch! It’s much more fun
to code in
Loading a model (actual code in repo)

TF:
with tf.gfile.FastGFile(graph_path, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='import')

Pytorch:
torch.load(checkpoint_files[0], map_location=lambda storage, loc: storage)
Slide 55
Slide 55 text
Key takeaways
Slide 56
Slide 56 text
Build multi-purpose models to
avoid technical debt
Monitor metrics and actual usage
habits
Give Pytorch a go!