Building Chatbot using Elasticsearch

ifengc
March 19, 2018

Transcript

  1. Outline
     • What I know about chatbots
     • Introduction to Elasticsearch
     • Building a chatbot using Elasticsearch
  2. What I know about chatbots: Task-Oriented
     • Approach: lots of rules, information retrieval, and NLP components.
     • Example requests:
       ◦ "Are there any action movies to see this weekend?"
       ◦ "I'd like to reserve a table for dinner."
       ◦ "I forgot my password."
  3. What I know about chatbots: Marketing
     • How AsiaYo built a chatbot with over ten thousand users in one week
     • How AsiaYo destroyed the user experience with a chatbot in one day
     Reference: https://www.inside.com.tw/2017/10/31/asiayo-chatbot
  4. What I know about chatbots: Chit-Chat
     • Approach: deep learning model vs. information retrieval…
     Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
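  5. Full text search: Inverted Index
     Documents:
       Doc1: Elasticsearch也能做聊天機器人 ("Elasticsearch can build a chatbot too")
       Doc2: 鄉民教我做的聊天機器人 ("The chatbot that netizens taught me to build")
     After preprocessing:
       Doc1: elasticsearch 也 能 做 聊天 機器人
       Doc2: 鄉民 教 我 做 的 聊天 機器人
     Inverted index:
       Term           | Freq. | Doc List
       elasticsearch  | 1     | 1
       也             | 1     | 1
       能             | 1     | 1
       做             | 2     | 1, 2
       聊天           | 2     | 1, 2
       機器人         | 2     | 1, 2
       鄉民           | 1     | 2
       教             | 1     | 2
       我             | 1     | 2
       的             | 1     | 2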
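The inverted index on this slide can be reproduced with a few lines of Python. This is only a minimal sketch of the data structure, not how Lucene stores it; the function name `build_inverted_index` is made up for illustration, and the token lists are copied from the preprocessed documents on the slide.

```python
from collections import defaultdict

# Tokenized documents copied from the slide; keys 1 and 2 stand for Doc1 and Doc2.
docs = {
    1: ["elasticsearch", "也", "能", "做", "聊天", "機器人"],
    2: ["鄉民", "教", "我", "做", "的", "聊天", "機器人"],
}


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
for term, doc_list in index.items():
    # Columns: term, number of documents containing it, document list.
    print(term, len(doc_list), doc_list)
```

Looking up a term is then a single dictionary access, which is why full text search over an inverted index does not need to scan every document.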
  6. Elasticsearch vs. RDBMS
     • RDBMS
       ◦ Specific data type and index for full text search (MySQL, PostgreSQL)
       ◦ SQL join
       ◦ ACID
     • MongoDB full text search
     • Elasticsearch
       ◦ RESTful HTTP API
       ◦ JSON document (nested structure)
       ◦ Designed for full text search
       ◦ Various text processing and ranking plugins
  7. Elasticsearch
     • The number of primary shards in an index is fixed when the index is created, but the number of replica shards can be changed at any time.
     • Segments are immutable, so there is no need for locking. When a document is deleted or updated, the old version of the document is only marked as deleted.
     Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/inside-a-shard.html
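The "mark as deleted" behavior can be illustrated with a toy model. This is a heavily simplified sketch, not Elasticsearch or Lucene code: `Segment` and `ToyShard` are hypothetical classes, and writing one segment per document is an assumption made only to keep the example short.

```python
class Segment:
    """Toy immutable segment: its documents are fixed at construction time."""

    def __init__(self, docs):
        self._docs = dict(docs)  # doc_id -> source; never modified afterwards

    def get(self, doc_id):
        return self._docs.get(doc_id)


class ToyShard:
    """Toy shard: a list of segments plus a set of deletion marks."""

    def __init__(self):
        self.segments = []
        self.deleted = set()  # (segment_number, doc_id) pairs marked as deleted

    def delete(self, doc_id):
        # Segments are never rewritten; live copies are only marked as deleted.
        for number, segment in enumerate(self.segments):
            if segment.get(doc_id) is not None:
                self.deleted.add((number, doc_id))

    def index(self, doc_id, source):
        # An update marks the old version as deleted and writes a new segment.
        self.delete(doc_id)
        self.segments.append(Segment({doc_id: source}))

    def get(self, doc_id):
        for number, segment in enumerate(self.segments):
            if (number, doc_id) not in self.deleted:
                source = segment.get(doc_id)
                if source is not None:
                    return source
        return None


shard = ToyShard()
shard.index(1, {"title": "v1"})
shard.index(1, {"title": "v2"})  # update: the old version is only marked as deleted
print(shard.get(1), len(shard.segments), shard.deleted)
```

The old version still sits in the first segment after the update; only the deletion mark hides it from reads.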
  8. Elasticsearch Cluster
     • 1 node, 3 shards, 0 replicas
     • 2 nodes, 3 shards, 1 replica
     • 3 nodes, 3 shards, 1 replica
     • 3 nodes, 3 shards, 2 replicas
     Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
  9. Elasticsearch analyzer vs. our pipeline
     • When emojis meet jieba…
       ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▆◣
       => ['◢', '▆', '▅', '▄', '▃', '崩', '╰', '(', '〒', '皿', '〒', ')', '╯', '潰', '▃', '▄', '▅', '▆', '◣']
     Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
  10. Choosing a tokenizer
     • jseg has a lower error rate, but ccjieba has better POS tagging.
     Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
  11. Elasticsearch Object vs. Nested Object
     • JSON with an array of objects: Lucene has no concept of inner objects, so Elasticsearch flattens the array into multi-value fields. The association between alice and white is lost.
     • Use Nested Object for arrays of objects. Nested documents are indexed as separate documents.
     Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
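The lost association can be demonstrated without a cluster. The sketch below imitates the flattening in plain Python; `flatten`, `matches_flat`, and `matches_nested` are hypothetical helpers, and the sample document (john smith, alice white) is modeled on the example in the referenced documentation page.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted multi-value fields, as for an object field."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


doc = {
    "group": "fans",
    "user": [
        {"first": "john", "last": "smith"},
        {"first": "alice", "last": "white"},
    ],
}
flat = flatten(doc)
print(flat)


def matches_flat(flat_doc, first, last):
    # Object-style match: each field is checked independently of the other.
    return first in flat_doc["user.first"] and last in flat_doc["user.last"]


def matches_nested(original_doc, first, last):
    # Nested-style match: both conditions must hold inside the same inner object.
    return any(u["first"] == first and u["last"] == last for u in original_doc["user"])


print(matches_flat(flat, "alice", "smith"))     # wrongly matches after flattening
print(matches_nested(doc, "alice", "smith"))    # no single user is "alice smith"
```

After flattening, a search for first name alice and last name smith matches even though no such user exists, which is the problem the nested type avoids.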
  12. Elasticsearch Relevance
     • Lucene (and thus Elasticsearch) uses the Boolean model, TF/IDF, and the vector space model, and combines them in a formula called the practical scoring function.
     • Vector space model:
       ◦ We have three documents and the query "happy hippopotamus":
         ▪ I am happy in summer.
         ▪ After Christmas I'm a hippopotamus.
         ▪ The happy hippopotamus helped Harry.
     Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
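The vector space idea for the three documents can be sketched as follows. The term weights (2 for "happy", 5 for "hippopotamus") are illustrative assumptions in the spirit of the referenced guide, and plain cosine similarity stands in for the practical scoring function, which is considerably more involved.

```python
import math
import re

# Illustrative term weights (assumption): the rarer term gets the higher weight.
WEIGHTS = {"happy": 2.0, "hippopotamus": 5.0}
TERMS = list(WEIGHTS)


def to_vector(text):
    """Represent a text as a vector with one weighted dimension per query term."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [WEIGHTS[term] if term in tokens else 0.0 for term in TERMS]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [
    "I am happy in summer.",
    "After Christmas I'm a hippopotamus.",
    "The happy hippopotamus helped Harry.",
]
query = to_vector("happy hippopotamus")

# Rank documents by how small the angle to the query vector is.
ranked = sorted(docs, key=lambda d: cosine(query, to_vector(d)), reverse=True)
for d in ranked:
    print(round(cosine(query, to_vector(d)), 3), d)
```

The document containing both terms points in the same direction as the query, so it ranks first; the document with only the lower-weighted term ranks last.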
  13. Query: Find the most relevant titles
     • Analyze titles in two ways, with unigrams/bigrams (shingle) and with jieba plus our pipeline, and query both fields at the same time.
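A rough sketch of what this could look like follows. The field names `title.shingle` and `title.jieba` are assumptions, as is the choice of a `multi_match` query with `most_fields`; the slides do not show the actual mapping or query. The `shingles` helper only imitates unigram/bigram output in Python and joins tokens without a separator for readability.

```python
def shingles(tokens, max_size=2):
    """Return all unigrams up to max_size-grams of a token list, in order of size."""
    out = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append("".join(tokens[start:start + size]))
    return out


def title_query(text):
    """Build a search body that scores a title on both analyzed fields at once."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical sub-fields: one analyzed with shingles, one with jieba.
                "fields": ["title.shingle", "title.jieba"],
                "type": "most_fields",
            }
        }
    }


# Character-level unigrams and bigrams of a short title.
print(shingles(list("聊天機器人")))
print(title_query("聊天機器人"))
```

Querying both fields lets the shingle field catch matches that the word segmenter splits badly, while the jieba field rewards matches on whole words.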
  14. Ranking the comments
     • Pointwise Mutual Information (PMI)
     • Use the termvectors or multi termvectors API to get TF/IDF statistics.
     • The information is only retrieved for the shard the requested document resides in.
     Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
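For reference, PMI compares how often two terms occur together with how often they would co-occur if they were independent: PMI(x, y) = log(p(x, y) / (p(x) p(y))). The sketch below computes it from raw counts; the function name, the log base 2, and the example counts are illustrative assumptions, and the slides do not say exactly which term pairs were scored.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from co-occurrence and marginal counts.

    count_xy: number of units (e.g. title/comment pairs) containing both x and y
    count_x, count_y: number of units containing x, respectively y
    total: total number of units
    """
    if count_xy == 0:
        return float("-inf")  # the terms were never observed together
    # p(x, y) / (p(x) * p(y)) simplifies to count_xy * total / (count_x * count_y).
    return math.log2(count_xy * total / (count_x * count_y))


# Made-up counts: x and y co-occur ten times more often than chance would predict.
print(pmi(count_xy=10, count_x=20, count_y=50, total=1000))
# Made-up counts: x and y co-occur exactly as often as independence predicts.
print(pmi(count_xy=1, count_x=10, count_y=100, total=1000))
```

A positive value means the pair appears together more often than chance, a value near zero means the terms look independent, and a negative value means they tend to avoid each other.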
  15. Challenges
     • We have title scores from the Lucene practical scoring function and comment scores from PMI. How should we combine them?
     • How do we evaluate a chit-chat bot's performance?