Slide 1

Semantic Concept Embedding for a Natural Language FAQ System
Bernardt Duvenhage, NLP & Machine Learning Lead, PhD
[email protected], www.feersum.io

Slide 2

Python Coding - Disclaimer
‣ I started coding Python when I joined the Feersum Engine team in Feb 2017.
‣ Before that I coded mainly in C++.
‣ In hours coded, I'm only about 1/18th of a Python developer.
‣ And yes, I do miss my compiler…

Slide 3

Python Coding - Tip
‣ I quickly got to know this guy…
‣ Started using:
  ‣ Type hinting.
  ‣ Static code analysis: pylint, mypy & flake8.
‣ And now… less chance of…

Slide 4

Background - Feersum Engine

Slide 5

Background - Feersum Engine
Enables the rapid creation and operation of rich, mixed structured and NLU-oriented digital conversations.

Slide 6

Background - Feersum NLU
‣ The NLU service for Feersum Engine:
  ‣ Intent classification and information extraction.
  ‣ Natural language form filling.
  ‣ Natural language FAQs.
  ‣ Sentiment detection.
  ‣ Language identification.
  ‣ Text & document classification.
Feersum NLU is everything a growing chatbot needs…

Slide 7

Background - Feersum NLU
‣ Use open-source building blocks when possible:
  ‣ NLTK, scikit-learn, PyTorch …
‣ Develop our own algorithms and implementations when needed:
  ‣ Multi-language semantic matching for FAQs, intent detection, etc.
  ‣ Text language identification.
‣ Collaborate with the CSIR, universities and students.

Slide 8

Feersum NLU Architecture
‣ NLP & ML layer
‣ Single-user environment
‣ Multi-user environment
‣ RESTful API

Slide 9

Scope of this Presentation
‣ 1. Overview of semantic concept embedding.
‣ 2. How semantic concept embedding is used for natural language FAQs.
‣ 3. Results.
[Abstract: The Turbo Encabulator https://www.youtube.com/watch?v=Ac7G7xOG2Ag]

Slide 10

1.1 Semantic Concept Embedding
‣ "Where do I go to write a communiqué" == "How do I send a message"?
‣ [0.10, 0.40, 0.52, 0.21] == [0.09, 0.38, 0.50, 0.20]?
‣ A concept embedding is a vector space for sentences.
‣ sentence vector = superposition of word vectors:
  ‣ Some geometric or component-wise combination of word vectors.
‣ Similarity measures include cosine, L1 and L2.
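As a minimal sketch of the idea: the two sentences above are compared via the cosine similarity of their vectors. The 4-dimensional values are the toy numbers from the slide; a real embedding would use vectors built from pretrained word vectors.

```python
import numpy as np

# Toy 4-dimensional sentence vectors from the slide; real vectors come
# from combining pretrained word embeddings such as GloVe or word2vec.
v1 = np.array([0.10, 0.40, 0.52, 0.21])  # "Where do I go to write a communiqué"
v2 = np.array([0.09, 0.38, 0.50, 0.20])  # "How do I send a message"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction (same 'meaning')."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(v1, v2))  # very close to 1.0 -> semantically similar
```

The same function works for L1 or L2 distance by swapping in `np.linalg.norm(a - b, ord=1)` or `ord=2`.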

Slide 11

1.2 Word Vectors
‣ Distributional hypothesis: a word may be known by the company it keeps.
‣ The objective function minimises cosine distance between words used in similar contexts.
‣ Example embeddings:
  ‣ Stanford's GloVe embedding.
  ‣ Google's word2vec and Facebook's fastText embeddings.
‣ Typically low-dimensional (50–300) vectors.
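A tiny illustration of "known by the company it keeps": with toy, hand-made word vectors (hypothetical values, not trained), the nearest neighbour under cosine similarity recovers the semantically related word.

```python
import numpy as np

# Hypothetical word vectors; a trained embedding would assign similar
# vectors to words that appear in similar contexts.
vocab = {
    "message":    np.array([0.8, 0.1, 0.3]),
    "communique": np.array([0.7, 0.2, 0.3]),
    "banana":     np.array([0.1, 0.9, 0.0]),
}

def nearest(word: str) -> str:
    """Return the other vocab word closest to `word` by cosine similarity."""
    target = vocab[word]
    others = [w for w in vocab if w != word]
    return max(
        others,
        key=lambda w: np.dot(vocab[w], target)
        / (np.linalg.norm(vocab[w]) * np.linalg.norm(target)),
    )

print(nearest("message"))  # "communique" — the word used in similar contexts
```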

Slide 12

1.2 Word Vectors [https://www.tensorflow.org/tutorials/word2vec]

Slide 13

1.3 The Manifold Hypothesis
‣ Real-world high-dimensional data lie on low-dimensional embedded manifolds.
‣ Reason why deep learning works? Probably.
‣ [http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]
‣ [NAACL 2013, Deep Learning for NLP]

Slide 14

1.3 The Manifold Hypothesis [http://colah.github.io/posts/2015-01-Visualizing-Representations]

Slide 15

Shared Natural Language and Image Embeddings [Socher et al. 2013. Zero-Shot Learning Through Cross-Modal Transfer]

Slide 16

1.3 'The only new principle involved…'
‣ Instead of using the mean of the word vectors to create a sentence vector, one rather retains the dominant semantic concepts over all of the words.
‣ Related to max pooling as used in CNNs for text and image processing.
‣ This research is ongoing… how to disentangle manifolds.
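The contrast between mean pooling and keeping the dominant concepts can be sketched with component-wise max pooling over hypothetical word vectors (illustrative values only — the deck's actual combination rule may differ):

```python
import numpy as np

# Hypothetical word vectors for a short sentence (one row per word).
word_vectors = np.array([
    [0.1, 0.9, 0.2],   # word 1
    [0.8, 0.1, 0.3],   # word 2
    [0.2, 0.3, 0.7],   # word 3
])

# Mean pooling blends all concepts together, diluting strong ones...
mean_sentence = word_vectors.mean(axis=0)

# ...while component-wise max pooling keeps the dominant concept per
# dimension, analogous to max pooling in CNNs.
max_sentence = word_vectors.max(axis=0)

print(max_sentence)  # [0.8 0.9 0.7] — the strongest value in each column
```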

Slide 17

2. A Natural Language FAQ
‣ Nearest-neighbour search to find the closest question to a user's question.
‣ Tried an SVM, but KNN kept outperforming it, and there are ways of speeding up KNN inference if required.
‣ The search is preceded by language identification so that one can run the correct FAQ for the user's language.
‣ We are looking at ways to align and combine the various language models, but we're not quite there yet.
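A minimal sketch of the nearest-neighbour lookup using scikit-learn (one of the building blocks mentioned earlier). The sentence vectors, answers and cosine metric here are illustrative assumptions, not Feersum NLU's actual pipeline: in practice the vectors would come from the concept embedding above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical pre-computed sentence vectors for the FAQ's example questions.
faq_vectors = np.array([
    [0.9, 0.1, 0.2],   # "How do I submit a claim?"
    [0.1, 0.8, 0.3],   # "How much does it cost to get a quote?"
])
faq_answers = ["Please go to the chatbot ...", "It is totally free ..."]

# Cosine metric matches the similarity measure used for the embedding.
knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(faq_vectors)

# Embed the user's question (hypothetical vector) and look up the answer.
user_vector = np.array([[0.2, 0.7, 0.4]])  # "What will the price be?"
_, idx = knn.kneighbors(user_vector)
print(faq_answers[idx[0][0]])  # "It is totally free ..."
```

For larger FAQs, approximate nearest-neighbour structures (e.g. ball trees, or libraries built for ANN search) are one way to speed up the KNN inference mentioned above.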

Slide 18

2. A Natural Language FAQ
‣ ("How do I submit a claim?", "Please go to the chatbot …", "eng")
‣ ("Hoe moet ek my eis insit?", "Please go to the chatbot …", "afr")
‣ ("How much does it cost to get a quote?", "It is totally free …", "eng")
‣ ("Is 'n kwotasie verniet?", "It is totally free …", "afr")
‣ What will the price be? => How much does it cost to get a quote?
‣ Wat is die prys? ("What is the price?") => Is 'n kwotasie verniet?
‣ Jupyter demo…

Slide 19

3. Results
‣ Top-3 of 5000, given 3 example questions per FAQ: 92% accuracy.
‣ Top-3 of 5000, given 5 example questions per FAQ:
  ‣ 95% accuracy.
  ‣ 93% with vector re-weighting and PCA [Arora et al., Apr 2017].
‣ SemEval 2014 semantic similarity benchmark:
  ‣ 0.66–0.72 when combined with NN regression.
  ‣ 0.73 with vector re-weighting and PCA [Arora et al., Apr 2017].
[Arora et al., Apr 2017, A Simple but Tough-to-Beat Baseline for Sentence Embeddings.]

Slide 20

The End - Thank you for listening.
‣ Please get in contact. We're hiring.
‣ Want to experience our Feersum NLU playground service?
  ‣ Email [email protected] to get an API key.
  ‣ Visit www.feersum.io for more details.
Bernardt Duvenhage, NLP & Machine Learning Lead, PhD
[email protected], www.feersum.io

Slide 21

The End - Thank you for listening. Questions? Bernardt Duvenhage NLP & Machine Learning Lead, PhD [email protected], www.feersum.io