
Pycon ZA
October 06, 2017

Semantic Concept Embedding for a natural language FAQ system by Bernardt Duvenhage

For a number of months now, work has been proceeding to bring perfection to the crudely conceived idea of a super-positioning of word vectors that would not only capture the tenor of a sentence in a vector of similar dimension, but would also draw on the high-dimensional manifold hypothesis to optimally retain the various semantic concepts. Such a super-positioning of word vectors is called the semantic concept embedding.

Now basically the only new principle involved is that instead of using the mean of the word vectors of a sentence, one rather retains the dominant semantic concepts over all of the words, a modification informed by the aforementioned manifold hypothesis.

The original implementation retains the absolute maximum value over each of the dimensions of a word embedding such as the GloVe embedding developed at Stanford University. These semantic concept vectors then allow effective matching of users' questions to an online FAQ, which in turn allows a natural-language adaptation of said system that easily achieves an F-score of 0.922 on the Quora dataset, given only three examples of how any particular question may be asked.
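In plain terms, the pooling step described above can be sketched as follows. This is a minimal numpy sketch; the function name and shapes are illustrative, not the talk's actual code:

    import numpy as np

    def concept_embed(word_vectors):
        """Keep, per dimension, the signed value with the largest absolute
        magnitude across all word vectors, instead of taking the mean."""
        stacked = np.stack(word_vectors)           # shape: (num_words, dim)
        rows = np.argmax(np.abs(stacked), axis=0)  # dominant word per dimension
        return stacked[rows, np.arange(stacked.shape[1])]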

The semantic concept embedding has now reached a high level of development. First, an analysis of the word embedding is applied to find the prepotent semantic concepts. The associated direction vectors are then used to transform the embeddings in just the right way to optimally disentangle the principal manifolds and further increase the performance of the natural language FAQ system.
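The abstract leaves the analysis unspecified. One plausible reading, hedged here as an assumption rather than the talk's actual method, is a PCA over the full embedding matrix, whose principal directions serve as the concept axes:

    import numpy as np
    from sklearn.decomposition import PCA

    def rotate_to_concept_axes(embedding_matrix):
        """Rotate word vectors into the basis of the embedding's dominant
        directions, so per-dimension pooling lines up with the strongest
        semantic concepts. PCA is an assumption, not the confirmed method."""
        return PCA().fit_transform(embedding_matrix)  # rows: rotated word vectors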

This talk will give an overview of:

• The problem of semantic sentence embedding.
• How NLTK, numpy, and Python machine learning frameworks are used to solve the problem.
• How semantic concept embedding is used for natural language FAQ systems in chatbots, etc.


Transcript

1. Semantic Concept Embedding for a Natural Language FAQ System
Bernardt Duvenhage, NLP & Machine Learning Lead, PhD
[email protected], www.feersum.io
2. Python Coding - Disclaimer
‣ I started coding Python when I joined the Feersum Engine team in Feb 2017.
‣ Coded mainly in C++ before that.
‣ In hours coded I’m only about 1/18th a Python developer.
‣ And yes, I do miss my compiler…
3. Python Coding - Tip
‣ I quickly got to know this guy…
‣ Started using:
  ‣ Type hinting (example below).
  ‣ Static code analysis: pylint, mypy & flake8.
‣ And now… Less chance of…
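A small, hypothetical example of the kind of annotation that lets the tools named above catch mistakes before runtime:

    from typing import List

    def tokenise(sentence: str) -> List[str]:
        """Type hints let mypy flag a call like tokenise(42) statically."""
        return sentence.lower().split()

    # Run the static checks from the shell, e.g.:
    #   pylint my_module.py && mypy my_module.py && flake8 my_module.py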
4. Background - Feersum Engine
Enables the rapid creation and operation of rich, mixed structured and NLU-orientated digital conversations.
5. Background - Feersum NLU
‣ The NLU service for Feersum Engine:
  ‣ Intent classification and information extraction.
  ‣ Natural language form filling.
  ‣ Natural language FAQs.
  ‣ Sentiment detection.
  ‣ Language identification.
  ‣ Text & document classification.
‣ Feersum NLU is everything a growing chatbot needs …
6. Background - Feersum NLU
‣ Use open source building blocks when possible: NLTK, scikit-learn, PyTorch …
‣ Develop own algorithms and implementations when needed:
  ‣ Multi-language semantic matching for FAQs, intent detection, etc.
  ‣ Text language identification.
‣ Collaborate with the CSIR, universities and students.
7. Feersum NLU Arch.
‣ NLP & ML Layer
‣ Single-user environment
‣ Multi-user environment
‣ RESTful API
8. Scope of this Presentation
‣ 1. Overview of semantic concept embedding.
‣ 2. How semantic concept embedding is used for natural language FAQs.
‣ 3. Results.
[Abstract: The Turbo Encabulator, https://www.youtube.com/watch?v=Ac7G7xOG2Ag]
9. 1.1 Semantic Concept Embedding
‣ "Where do I go to write a communiqué" == "How do I send a message" ?
‣ [0.10, 0.40, 0.52, 0.21] == [0.09, 0.38, 0.50, 0.20] ?
‣ A concept embedding is a vector space for sentences.
‣ sentence vector = superimposition of word vectors:
  ‣ some geometric or component-wise combination of word vectors.
‣ The similarity measures include cosine, L1 and L2.
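The three similarity measures from the slide, applied to its own toy vectors in plain numpy:

    import numpy as np

    q1 = np.array([0.10, 0.40, 0.52, 0.21])  # "Where do I go to write a communiqué"
    q2 = np.array([0.09, 0.38, 0.50, 0.20])  # "How do I send a message"

    cosine = np.dot(q1, q2) / (np.linalg.norm(q1) * np.linalg.norm(q2))
    l1 = np.sum(np.abs(q1 - q2))   # Manhattan distance
    l2 = np.linalg.norm(q1 - q2)   # Euclidean distance
    print(cosine, l1, l2)          # cosine is close to 1.0 for near-paraphrases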
10. 1.2 Word Vectors
‣ Distributional hypothesis - a word may be known by the company it keeps.
‣ Objective function minimises cosine distance between words used in similar contexts.
‣ Example embeddings (loader sketched below):
  ‣ Stanford’s GloVe embedding.
  ‣ Google and Facebook’s Word2Vec embeddings.
‣ Typically low (50 - 300) dimensional vectors.
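A typical loader for the standard GloVe text format, one word per line followed by its vector components; the file name in the usage comment is illustrative:

    import numpy as np

    def load_glove(path):
        """Parse a GloVe text file into a word -> vector dictionary."""
        vectors = {}
        with open(path, encoding='utf-8') as handle:
            for line in handle:
                parts = line.rstrip().split(' ')
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    # glove = load_glove('glove.6B.50d.txt')  # e.g. the 50-dimensional vectors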
11. 1.3 The Manifold Hypothesis
‣ Real-world high-dimensional data lie on low-dimensional embedded manifolds.
‣ Reason why deep learning works? Probably.
‣ [http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]
‣ [NAACL 2013, Deep Learning for NLP]
12. Shared Natural Language and Image Embeddings
[Socher et al., 2013. Zero-Shot Learning Through Cross-Modal Transfer]
13. 1.3 'The only new principle involved…'
‣ Instead of using the mean of the word vectors to create a sentence vector, one rather retains the dominant semantic concepts over all of the words.
‣ Related to the max pooling used in CNNs for text and image processing.
‣ This research is ongoing… how to disentangle manifolds.
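A toy contrast, with illustrative numbers only, of why the max retains concepts that the mean dilutes:

    import numpy as np

    words = np.array([[0.9, 0.0],    # word strongly expressing concept 1
                      [0.0, 0.8],    # word strongly expressing concept 2
                      [0.1, 0.1]])   # filler word

    mean_pool = words.mean(axis=0)   # [0.33, 0.30]: both concepts diluted
    rows = np.argmax(np.abs(words), axis=0)
    max_pool = words[rows, np.arange(words.shape[1])]  # [0.9, 0.8]: concepts kept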
14. 2. A Natural Language FAQ
‣ Nearest neighbour search to find the example question closest to the user’s question (sketched below).
‣ Tried an SVM, but the KNN kept outperforming it, and there are ways of speeding up KNN inference if required.
‣ The search is preceded by language identification, so that the FAQ for the user’s language can be run.
‣ We are looking at ways to align and combine the various language models, but we’re not quite there yet.
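A minimal scikit-learn sketch of the nearest-neighbour lookup; the random vectors stand in for real concept embeddings of the example questions:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    faq_vectors = np.random.rand(9, 50)  # 3 FAQs x 3 example questions, dim 50
    answers = ['answer_a'] * 3 + ['answer_b'] * 3 + ['answer_c'] * 3

    index = NearestNeighbors(n_neighbors=1, metric='cosine').fit(faq_vectors)
    _, nearest = index.kneighbors(np.random.rand(1, 50))  # embed the user's question here
    print(answers[nearest[0][0]])  # answer paired with the closest example question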
15. 2. A Natural Language FAQ
‣ ("How do I submit a claim?", "Please go to the chatbot …", "eng")
‣ ("Hoe moet ek my eis insit?" [Afrikaans: "How must I submit my claim?"], "Please go to the chatbot …", "afr")
‣ ("How much does it cost to get a quote?", "It is totally free …", "eng")
‣ ("Is 'n kwotasie verniet?" [Afrikaans: "Is a quote free?"], "It is totally free …", "afr")
‣ What will the price be? => How much does it cost to get a quote?
‣ Wat is die prys? [Afrikaans: "What is the price?"] => Is 'n kwotasie verniet?
‣ Jupyter demo …
16. 3. Results
‣ Top 3/5000, given 3 example questions per FAQ: 92% accuracy.
‣ Top 3/5000, given 5 example questions per FAQ:
  ‣ 95% accuracy.
  ‣ 93% with vector re-weighting and PCA [Arora et al., Apr 2017].
‣ SemEval 2014 semantic similarity benchmark:
  ‣ 0.66 - 0.72 when combined with NN regression.
  ‣ 0.73 with vector re-weighting and PCA [Arora et al., Apr 2017].
[Arora et al., Apr 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings.]
17. The End - Thank you for listening.
‣ Please get in contact. We’re hiring.
‣ Want to experience our Feersum NLU playground service?
  ‣ Email [email protected] to get an API key.
  ‣ Visit www.feersum.io for more details.
Bernardt Duvenhage, NLP & Machine Learning Lead, PhD
[email protected], www.feersum.io