Building Chatbot using Elasticsearch

Elasticsearch can build chatbots too (well, actually I'm just here to introduce Elasticsearch)

Outline
• What I know about chatbots
• Introduction to Elasticsearch
• Building chatbot using Elasticsearch

What I know about chatbots: Task-Oriented
• Approach: lots of rules, information retrieval, and NLP components.
• Example requests:
  ◦ Are there any action movies to see this weekend?
  ◦ I'd like to reserve a table for dinner.
  ◦ I forgot my password.

What I know about chatbots: Task-Oriented
Reference: https://www.csie.ntu.edu.tw/~yvchen/s105-icb/syllabus.html

What I know about chatbots: Marketing
• How AsiaYo built a chatbot with over 10,000 users in one week
• How AsiaYo destroyed the user experience with a chatbot in one day
Reference: https://www.inside.com.tw/2017/10/31/asiayo-chatbot

What I know about chatbots: Chit-Chat
• Approach: deep learning model vs. information retrieval…
Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4

Corpus

Architecture

Full text search: Inverted Index

Documents:
  Doc1: Elasticsearch也能做聊天機器人 ("Elasticsearch can build chatbots too")
  Doc2: 鄉民教我做的聊天機器人 ("The chatbot that netizens taught me to build")

After preprocessing:
  Doc1: elasticsearch 也能做聊天機器人
  Doc2: 鄉民教我做的聊天機器人

Inverted index:
  Term            Freq.   Doc List
  elasticsearch   1       1
  也              1       1
  能              1       1
  做              2       1, 2
  聊天            2       1, 2
  機器人          2       1, 2
  鄉民            1       2
  教              1       2
  我              1       2
  的              1       2
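The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the token lists are typed in by hand from the slide rather than produced by a segmenter, and the `search` helper is a hypothetical addition to show how the index answers an AND query.

```python
from collections import defaultdict

# Pre-tokenized documents copied from the slide. A real pipeline would
# produce these token lists with a segmenter; here they are hard-coded.
docs = {
    1: ["elasticsearch", "也", "能", "做", "聊天", "機器人"],
    2: ["鄉民", "教", "我", "做", "的", "聊天", "機器人"],
}


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, *terms):
    """Return the IDs of documents that contain every given term (AND)."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


index = build_inverted_index(docs)
for term, doc_ids in index.items():
    # Columns: term, number of documents containing it, document list
    print(term, len(doc_ids), doc_ids)

print(search(index, "聊天", "鄉民"))
```

Looking up a term is a dictionary access, and a multi-term query is a set intersection over the posting lists, which is why full text search does not need to scan every document.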

Elasticsearch vs. RDBMS
• RDBMS
  ◦ Specific data type and index for full text search (MySQL, PostgreSQL)
  ◦ SQL join
  ◦ ACID
• MongoDB full text search
• Elasticsearch
  ◦ RESTful HTTP API
  ◦ JSON document (nested structure)
  ◦ Designed for full text search
  ◦ Various text processing and ranking plugins

Elasticsearch
• The number of primary shards in an index is fixed when the index is created, but the number of replica shards can be changed at any time.
• Segments are immutable, so there is no need for locking. When a document is deleted or updated, the old version of the document is only marked as deleted.
Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/inside-a-shard.html
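The difference between the two shard settings shows up in the request bodies. The sketch below only builds and prints those bodies, so it runs without a cluster. The index name `chatbot_titles` is a hypothetical placeholder, not a name from the talk.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX_NAME = "chatbot_titles"


def create_index_body(primary_shards, replicas):
    """Body for `PUT /<index>`: the primary shard count is chosen here, once."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def update_replicas_body(replicas):
    """Body for `PUT /<index>/_settings`: replicas can change on a live index."""
    return {"index": {"number_of_replicas": replicas}}


create_body = create_index_body(primary_shards=3, replicas=1)
update_body = update_replicas_body(replicas=2)

print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_body, indent=2))
print(f"PUT /{INDEX_NAME}/_settings")
print(json.dumps(update_body, indent=2))
```

There is no matching update body for `number_of_shards`: changing the primary count means building a new index.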

Elasticsearch Cluster
• 1 node, 3 shards, 0 replicas
• 2 nodes, 3 shards, 1 replica
• 3 nodes, 3 shards, 1 replica
• 3 nodes, 3 shards, 2 replicas
Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
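A quick way to read the four configurations is to count shard copies. Each replica setting adds one full copy of every primary shard, and a replica is never placed on the same node as its primary. The helper below is a small illustrative calculation, not part of any Elasticsearch client.

```python
def total_shard_copies(primary_shards, replicas):
    """Primaries plus one extra copy of each primary per replica."""
    return primary_shards * (1 + replicas)


# (nodes, primary shards, replicas) for the four cluster layouts on the slide
configs = [(1, 3, 0), (2, 3, 1), (3, 3, 1), (3, 3, 2)]

totals = []
for nodes, primaries, replicas in configs:
    total = total_shard_copies(primaries, replicas)
    totals.append(total)
    print(f"{nodes} node(s): {primaries} primaries x (1 + {replicas}) = {total} shard copies")
```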

Elasticsearch analyzer vs. our pipeline
When emojis meet jieba…
◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▆◣
=> ['◢', '▆', '▅', '▄', '▃', '崩', '╰', '(', '〒', '皿', '〒', ')', '╯', '潰', '▃', '▄', '▅', '▆', '◣']
Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
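One possible cleanup step before segmentation is to drop characters whose Unicode category is a symbol or punctuation, so decorative art does not flood the token list. This is a hypothetical sketch using only the standard library. The talk does not say how its own pipeline handles this input, and `strip_decorations` is an invented name.

```python
import unicodedata


def strip_decorations(text):
    """Drop characters in Unicode symbol (S*) or punctuation (P*) categories."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("S", "P")
    )


raw = "◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▆◣"
cleaned = strip_decorations(raw)
print(cleaned)
```

The filter is only a partial fix: 皿 is an ordinary ideograph that happens to be used as the mouth of the face, so a category-based filter keeps it.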

Choosing a tokenizer
• jseg has a lower error rate, but ccjieba has better POS tagging.
Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4

Elasticsearch Query with Python Client
elasticsearch-py vs. elasticsearch-dsl-py
Reference: https://github.com/elastic/elasticsearch-dsl-py
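The practical difference between the two clients is how the query is written. With the low-level client the query is a plain dictionary that mirrors the JSON request body, while elasticsearch-dsl-py builds an equivalent structure from Python objects. The sketch below shows only the raw dictionary form, so it runs without either library. The field name `title` and the query text are assumptions for illustration.

```python
import json


def match_query(field, text, size=10):
    """Build a raw `match` query body, as the low-level client expects it."""
    return {
        "size": size,
        "query": {"match": {field: text}},
    }


# `title` is a hypothetical field name; the talk does not show its mapping.
body = match_query("title", "聊天機器人", size=5)
print(json.dumps(body, ensure_ascii=False, indent=2))

# With elasticsearch-dsl-py, the same `match` clause would typically be
# written as Search().query("match", title="聊天機器人") instead of a dict.
```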

Elasticsearch Mapping

Elasticsearch Object vs. Nested Object
• JSON with array: Lucene has no concept of inner objects, so Elasticsearch flattens an array of objects into multi-value fields. The association between alice and white is lost.
• Use Nested Object for arrays of objects.
• Nested documents are indexed as separate documents.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
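The lost association is easy to see in a toy model of the flattening. The sketch below is plain Python, not Elasticsearch code. The field name `user` and the john/smith record are assumptions added so that the alice/white pair has something to be confused with.

```python
def flatten_objects(field, objects):
    """Flatten an array of objects into multi-value fields, as the object type does."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def flat_match(flat, conditions):
    """AND of term conditions against the flattened multi-value fields."""
    return all(value in flat.get(name, []) for name, value in conditions.items())


def nested_match(objects, conditions):
    """AND of term conditions that must all hold within one inner object."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


users = [
    {"first": "john", "last": "smith"},
    {"first": "alice", "last": "white"},
]

flat = flatten_objects("user", users)
print(flat)

# Nobody is called "alice smith", yet the flattened document matches.
false_positive = flat_match(flat, {"user.first": "alice", "user.last": "smith"})
nested_result = nested_match(users, {"first": "alice", "last": "smith"})
print(false_positive, nested_result)
```

Keeping each inner object as its own unit, which is what the nested type does by indexing them as separate documents, removes the false positive.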

Elasticsearch Relevance
• Lucene (and thus Elasticsearch) uses the Boolean model, TF/IDF, and the vector space model, and combines them in a single formula called the practical scoring function.
• Vector space model:
  ◦ We have three documents and the query “happy hippopotamus”:
    ▪ I am happy in summer.
    ▪ After Christmas I’m a hippopotamus.
    ▪ The happy hippopotamus helped Harry.
Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
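The vector space model can be worked through on these three documents. In the sketch below each document and the query become a two-dimensional vector over the terms "happy" and "hippopotamus", and documents are ranked by cosine similarity to the query. The term weights are illustrative assumptions (the rarer term gets the larger weight), and this is the textbook model rather than Lucene's actual scoring code.

```python
import math
import re

# Assumed, illustrative term weights: the rarer term weighs more.
WEIGHTS = {"happy": 2.0, "hippopotamus": 5.0}
TERMS = list(WEIGHTS)


def to_vector(text):
    """One dimension per query term: its weight if present, else 0."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [WEIGHTS[term] if term in tokens else 0.0 for term in TERMS]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [
    "I am happy in summer.",
    "After Christmas I’m a hippopotamus.",
    "The happy hippopotamus helped Harry.",
]

query_vec = to_vector("happy hippopotamus")
scores = [cosine(query_vec, to_vector(doc)) for doc in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The third document contains both terms and points in the same direction as the query, so it ranks first. The second ranks above the first because it matches the more heavily weighted term.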

Query: Find the most relevant titles
• Index each title two ways, as unigrams/bigrams (shingles) and as jieba tokens from our pipeline, and query both fields at the same time.
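A rough sketch of both halves of this idea follows. The first function shows what unigram plus bigram tokens look like for a short string, and the second builds a `multi_match` body that searches two fields in one request. The field names `title.shingle` and `title.jieba` are hypothetical, and joining bigrams without a separator is a simplification of what an Elasticsearch shingle filter emits.

```python
import json


def unigrams_and_bigrams(tokens):
    """Return the tokens followed by every adjacent pair joined together."""
    bigrams = [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]
    return tokens + bigrams


def both_fields_query(text):
    """Query two differently analyzed fields at once and add up their scores."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical field names for the two analyses of the title.
                "fields": ["title.shingle", "title.jieba"],
                "type": "most_fields",
            }
        }
    }


shingles = unigrams_and_bigrams(list("機器人"))
print(shingles)

query_body = both_fields_query("聊天機器人")
print(json.dumps(query_body, ensure_ascii=False, indent=2))
```

Character n-grams still match when the segmenter splits a word badly, while the jieba field rewards matches on whole words, so the two fields cover for each other.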

Ranking the comments
• Pointwise Mutual Information
• Use termvectors or multi termvectors to get TF/IDF
• The information is only retrieved for the shard the requested document resides in.
Reference: https://docs.google.com/presentation/d/1mzotBUwq55Dwio3XipFlZ2DpTEORX_BJZlw9mzCm5O4
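For reference, pointwise mutual information compares how often two events occur together with how often they would co-occur if they were independent: PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from raw counts. The counts are invented for illustration, and the talk does not specify which pairs of events its ranking scores.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.

    Returns -inf when any count is zero, since the ratio is then zero
    or undefined.
    """
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")
    # (count_xy / total) / ((count_x / total) * (count_y / total))
    return math.log2((count_xy * total) / (count_x * count_y))


# Toy counts out of 100 observations, invented for illustration.
associated = pmi(count_xy=10, count_x=10, count_y=20, total=100)
independent = pmi(count_xy=2, count_x=10, count_y=20, total=100)

print(f"associated:  {associated:.3f}")
print(f"independent: {independent:.3f}")
```

A positive value means the pair co-occurs more often than chance, zero means independence, and a negative value means the pair tends to avoid each other.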

Recap: Architecture

Challenges
• We have title scores from Lucene's practical scoring function and comment scores from PMI. How should we combine them?
• How should we evaluate a chit-chat bot's performance?

Members
• 吳啟豪 (Charlie)
• 許晉源 (Ed)
• 陳儀峰 (ifeng)
• 顏孜羲 (Joe)
• 趙彥翔 (Ryan)

Questions?

ifengc
