Building Chatbot using Elasticsearch

ifengc
March 19, 2018

Transcript

  1. 2.

    Outline
    • What I know about chatbots
    • Introduction to Elasticsearch
    • Building chatbot using Elasticsearch
  2. 3.

    What I know about chatbots: Task-Oriented
    Approach: Lots of rules, information retrieval and NLP components.
    Example user requests:
    • Are there any action movies to see this weekend?
    • I’d like to reserve a table for dinner.
    • I forgot my password.
  3. 5.

    What I know about chatbots: Marketing
    • How AsiaYo built a chatbot with over ten thousand users in one week
    • How AsiaYo destroyed the user experience with a chatbot in one day
    Reference: https://www.inside.com.tw/2017/10/31/asiayo-chatbot
  4. 6.

    What I know about chatbots: Chit-Chat
    Approach: Deep learning model vs. information retrieval…
    Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
  5. 9.

    Full text search: Inverted Index
    Doc1: Elasticsearch也能做聊天機器人 (Elasticsearch can build chatbots too)
    Doc2: 鄉民教我做的聊天機器人 (The chatbot that netizens taught me to build)

    Preprocessing
    Doc1: elasticsearch 也 能 做 聊天 機器人
    Doc2: 鄉民 教 我 做 的 聊天 機器人

    Inverted Index
    Term            Freq.   Doc List
    elasticsearch   1       1
    也              1       1
    能              1       1
    做              2       1, 2
    聊天            2       1, 2
    機器人          2       1, 2
    鄉民            1       2
    教              1       2
    我              1       2
    的              1       2
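
The inverted index on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the token lists are copied from the slide's preprocessing step, the function name `build_inverted_index` is made up for illustration, and "Freq." is taken to mean the number of documents containing the term.

```python
from collections import defaultdict

# Pre-tokenized documents, copied from the preprocessing step on the slide.
docs = {
    1: ["elasticsearch", "也", "能", "做", "聊天", "機器人"],
    2: ["鄉民", "教", "我", "做", "的", "聊天", "機器人"],
}


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
for term, doc_list in index.items():
    # Frequency here is the number of documents containing the term.
    print(term, len(doc_list), doc_list)
```

Looking up a query term is then a dictionary access, which is why full text search over an inverted index does not need to scan every document.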
  6. 10.

    Elasticsearch vs. RDBMS
    • RDBMS
      ◦ Specific data type and index for full text search (MySQL, PostgreSQL)
      ◦ SQL join
      ◦ ACID
    • MongoDB full text search
    • Elasticsearch
      ◦ RESTful HTTP API
      ◦ JSON document (nested structure)
      ◦ Designed for full text search
      ◦ Various text processing and ranking plugins
  7. 11.

    Elasticsearch
    • The number of primary shards in an index is fixed when the index is created, but the number of replica shards can be changed at any time.
    • Segments are immutable, so there is no need for locking. When a document is deleted or updated, the old version of the document is only marked as deleted.
    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/inside-a-shard.html
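
Changing the replica count on a live index goes through the index settings endpoint (`PUT /<index>/_settings` with `number_of_replicas`). The sketch below only builds the HTTP request and does not send it; the host `http://localhost:9200` and the index name `chatbot` are assumptions for illustration, not values from the talk.

```python
import json
import urllib.request


def build_replica_update(base_url, index_name, replicas):
    """Build (but do not send) a PUT request to the index settings endpoint."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name, for illustration only.
request = build_replica_update("http://localhost:9200", "chatbot", 2)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply it against a running cluster: urllib.request.urlopen(request)
```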
  8. 12.

    Elasticsearch Cluster
    • 1 node, 3 shards, 0 replicas
    • 2 nodes, 3 shards, 1 replica
    • 3 nodes, 3 shards, 1 replica
    • 3 nodes, 3 shards, 2 replicas
    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
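
The arithmetic behind these four cluster layouts can be sketched as follows. It relies on one rule: copies of the same shard are never placed on the same node, so each primary can have at most `nodes - 1` assigned replicas. The function name is hypothetical, and the per-node figure is just an average, not how Elasticsearch actually balances shards.

```python
def assigned_shard_copies(nodes, primaries, replicas):
    """Count shard copies (primaries plus replicas) that can be assigned.

    Copies of the same shard must live on different nodes, so each primary
    can have at most (nodes - 1) assigned replicas.
    """
    usable_replicas = min(replicas, nodes - 1)
    return primaries * (1 + usable_replicas)


# The four layouts listed on the slide: (nodes, primary shards, replicas).
for nodes, primaries, replicas in [(1, 3, 0), (2, 3, 1), (3, 3, 1), (3, 3, 2)]:
    total = assigned_shard_copies(nodes, primaries, replicas)
    print(f"{nodes} node(s), {primaries} shards, {replicas} replica(s): "
          f"{total} copies, about {total / nodes:.1f} per node")
```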
  9. 13.

    Elasticsearch analyzer vs. our pipeline
    When emojis meet jieba…
    ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▆◣
    => ['◢', '▆', '▅', '▄', '▃', '崩', '╰', '(', '〒', '皿', '〒', ')', '╯', '潰', '▃', '▄', '▅', '▆', '◣']
    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
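
The slide does not say how the authors' own pipeline handles this input. As one possible preprocessing step, and purely as an assumption, symbols could be dropped by Unicode category before the text reaches the tokenizer. The helper name `keep_letters_and_digits` is made up for this sketch.

```python
import unicodedata


def keep_letters_and_digits(text):
    """Drop every character whose Unicode category is not a letter or a number."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N")
    )


raw = "◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▆◣"
cleaned = keep_letters_and_digits(raw)
print(cleaned)
```

Note that 皿 survives the filter even though it is part of the emoticon face rather than the message, which illustrates why a simple character filter is not a complete answer for this kind of text.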
  10. 14.

    Choosing a tokenizer
    • jseg has a lower error rate, but ccjieba has better POS tagging.
    Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
  11. 17.

    Elasticsearch Object vs Nested Object
    • Given JSON with an array of objects: Lucene has no concept of inner objects, so Elasticsearch flattens the array into multi-value fields. The association between alice and white is lost.
    • Use Nested Object for arrays of objects.
    * Nested documents are indexed as separate documents.
    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
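
The lost association can be demonstrated without a cluster. The sketch below imitates the flattening in plain Python; the sample document is assumed to follow the john smith / alice white example from the referenced nested documentation, and `flatten`, `flat_match`, and `nested_match` are illustrative helpers, not Elasticsearch APIs.

```python
def flatten(doc, prefix=""):
    """Flatten nested JSON into dotted multi-value fields, object-type style."""
    fields = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, []).extend(sub_values)
            else:
                fields.setdefault(path, []).append(item)
    return fields


doc = {
    "group": "fans",
    "user": [
        {"first": "john", "last": "smith"},
        {"first": "alice", "last": "white"},
    ],
}
flat = flatten(doc)


def flat_match(fields, first, last):
    """Match against flattened fields: each condition is checked independently."""
    return first in fields["user.first"] and last in fields["user.last"]


def nested_match(document, first, last):
    """Match per inner object, which is what the nested type preserves."""
    return any(u["first"] == first and u["last"] == last for u in document["user"])


print(flat)
print(flat_match(flat, "alice", "smith"))    # True: a cross-object false positive
print(nested_match(doc, "alice", "smith"))   # False: no single user is alice smith
```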
  12. 18.

    Elasticsearch Relevance
    • Lucene (and thus Elasticsearch) uses the Boolean model, TF/IDF, and the vector space model, and combines them in a single formula called the practical scoring function.
    • Vector space model:
      ◦ We have three documents and the query “happy hippopotamus”
        ▪ I am happy in summer.
        ▪ After Christmas I’m a hippopotamus.
        ▪ The happy hippopotamus helped Harry.
    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
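
The vector space model for this example can be sketched with cosine similarity. The term weights below (2 for "happy", 5 for "hippopotamus") are assumed values chosen for illustration; in Lucene they would come from TF/IDF, and the practical scoring function includes further factors that this sketch leaves out.

```python
import math
import re

# Assumed term weights for illustration; real weights come from TF/IDF.
weights = {"happy": 2.0, "hippopotamus": 5.0}
terms = list(weights)


def to_vector(text):
    """One dimension per query term: its weight if the term occurs, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [weights[t] if t in tokens else 0.0 for t in terms]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [
    "I am happy in summer.",
    "After Christmas I'm a hippopotamus.",
    "The happy hippopotamus helped Harry.",
]
query = to_vector("happy hippopotamus")
ranked = sorted(docs, key=lambda d: cosine(query, to_vector(d)), reverse=True)
for d in ranked:
    print(round(cosine(query, to_vector(d)), 3), d)
```

The document containing both terms points in the same direction as the query and ranks first; the one containing only the lower-weighted term ranks last.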
  13. 19.

    Query: Find the most relevant titles
    • Index titles two ways, as unigrams/bi-grams (shingles) and as jieba tokens from our pipeline, and query both fields at the same time.
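
A rough sketch of the two ingredients: character unigrams and bi-grams for a title, and a `multi_match` query that searches two fields in one request. The field names `title.ngram` and `title.jieba` are hypothetical, and `most_fields` is just one plausible way to combine the two field scores; the slides do not state which query type the authors used.

```python
import json


def char_ngrams(text, n):
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_title_query(user_text):
    """multi_match over two hypothetical title fields, combining their scores."""
    return {
        "query": {
            "multi_match": {
                "query": user_text,
                "type": "most_fields",
                "fields": ["title.ngram", "title.jieba"],
            }
        }
    }


title = "聊天機器人"
print(char_ngrams(title, 1) + char_ngrams(title, 2))
print(json.dumps(build_title_query(title), ensure_ascii=False, indent=2))
```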
  14. 20.

    Ranking the comments
    • Pointwise Mutual Information
    • Use termvectors or multi termvectors to get tf/idf
    • Term statistics are retrieved only from the shard the requested document resides in.
    Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
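
Pointwise mutual information compares how often two events occur together with how often they would if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from raw counts. How the authors define x and y for comments (for example, a title term and a comment term) is not stated on the slide, so the counts here are made-up numbers.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts."""
    if count_xy == 0:
        # The pair never co-occurs, so the ratio is 0 and the log diverges.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts: x and y each occur once in 4 observations, always together.
print(pmi(1, 1, 1, 4))
# x and y each occur in half of the observations and co-occur as chance predicts.
print(pmi(1, 2, 2, 4))
```

A positive value means the pair co-occurs more often than chance, zero means independence, and a negative value means the pair co-occurs less often than chance.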
  15. 22.

    Challenges
    • We have title scores from Lucene's practical scoring function and comment scores from PMI. How should we combine them?
    • How to evaluate a chit-chat bot's performance?
  16. 23.