How to use Elasticsearch Analyzers by EmergiNet

Analyzers, Pablo Musa (EmergiNet), May 5, 2014

Outline
1. Motivation
2. Elasticsearch and EmergiNet
3. Basic Concepts
4. Creating an Analyzer
5. Common Problems
6. Other Work

Motivation
Use case: a shopping site.
"Full text search" in SQL is complex and slow.
We need a search system that is faster, more precise, and simpler to develop.

Elasticsearch
Fast (100x on average).
Excellent results.
Easy to consume.
Very simple, scalable installation.
Simple RESTful API using JSON.
"The schema is automatic."

Elasticsearch and EmergiNet
The default is not always the best choice.
Nobody knows your data better than you do.
Custom mapping.
EmergiNet offers consulting or full project execution.
Optimize the application and add features:
1. Sorting
2. Aggregations
3. Auto-complete, Suggester
4. Help with SEO (Search Engine Optimization)

Elasticsearch: Empty Index
{
  "settings": {
    "analysis": {
      "filter": {},
      "analyzer": {
        "my_analyzer": {
          "type": "",
          "char_filter": [],
          "tokenizer": "",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "",
          "index": "",
          "analyzer": ""
        }
      }
    }
  }
}
"Empty" analysis and mappings: an example of the structure to be filled in.

Stages of an Analyzer
1. Tidy up
2. Split
3. Normalize
Elasticsearch offers predefined analyzers, for example: standard, simple, whitespace, language.
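
The three stages can be pictured as three small functions applied in order. The sketch below is a plain-Python imitation for illustration only, not Elasticsearch code; the function names `tidy_up`, `split`, `normalize`, and `analyze` are hypothetical, and each stage is far cruder than the real character filters, tokenizers, and token filters.

```python
import re
import unicodedata


def tidy_up(text):
    """Stage 1 (character filter): replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def split(text):
    """Stage 2 (tokenizer): break the string into runs of letters and digits."""
    return re.findall(r"\w+", text)


def normalize(tokens):
    """Stage 3 (token filters): lowercase each token and drop accents."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFD", token.lower())
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text):
    """Run the three stages in the order an analyzer applies them."""
    return normalize(split(tidy_up(text)))


print(analyze("<b>Promoção</b> de Verão"))
```

The order matters: tags are removed before splitting so that markup never becomes a term, and normalization runs last because it works on individual tokens.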

Tidy Up: Character Filters
"Pre-processing": cleans up the string.
Optional.
There are currently 3 types:
mapping (e.g. "ph" => "f")
html strip (removes tags and maps entities, "&aacute;" => "á")
pattern replace (regular expression)
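
To make the three character filter types concrete, here is a rough Python imitation of each. It is a sketch under loud assumptions: the function names are hypothetical, and the real filters handle many more cases (for instance, the real html strip filter is not a simple regular expression).

```python
import html
import re


def mapping_filter(text, rules):
    """Imitates the mapping filter: replace each key with its value."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def html_strip_filter(text):
    """Imitates html strip: drop tags, then decode entities such as &aacute;."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def pattern_replace_filter(text, pattern, replacement):
    """Imitates pattern replace: a regular-expression substitution."""
    return re.sub(pattern, replacement, text)


print(mapping_filter("telephone", {"ph": "f"}))               # telefone
print(html_strip_filter("<p>ser&aacute;</p>"))
print(pattern_replace_filter("18h30", r"(\d+)h(\d+)", r"\1:\2"))  # 18:30
```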

Tidy Up: Analysis with Character Filters
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [ "html_strip" ],
      "tokenizer": "",
      "filter": []
    }
  }
}
Analysis with only the character filter stage filled in.

Split: Tokenizers
"Processing": splits the string into individual terms.
Required.
There are currently 10 types:
standard
keyword
whitespace
ngram, edge ngram
letter, lowercase (opt), pattern, uax url email, path hierarchy
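
The difference between tokenizers is easiest to see side by side. The following Python sketch imitates three of them; the function names and the `min_gram`/`max_gram` defaults are assumptions chosen for the example, not the library's own defaults.

```python
def keyword_tokenizer(text):
    """Imitates keyword: the whole input becomes a single term."""
    return [text]


def whitespace_tokenizer(text):
    """Imitates whitespace: split on whitespace only, punctuation stays attached."""
    return text.split()


def edge_ngram_tokenizer(text, min_gram=1, max_gram=3):
    """Imitates edge ngram: prefixes of the input, from min_gram to max_gram characters."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(keyword_tokenizer("Rio de Janeiro"))     # ['Rio de Janeiro']
print(whitespace_tokenizer("Rio de Janeiro"))  # ['Rio', 'de', 'Janeiro']
print(edge_ngram_tokenizer("rio"))             # ['r', 'ri', 'rio']
```

The keyword variant is what the sort analyzer later in the deck relies on, and prefix terms like the edge ngram ones are a common basis for auto-complete.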

Split: Analysis with Character Filters and Tokenizers
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [ "html_strip" ],
      "tokenizer": "standard",
      "filter": []
    }
  }
}
Analysis with the character filter and tokenizer stages filled in.

Normalize: Token Filters
"Post-processing": normalizes the tokens (changes or removes them).
Optional.
There are currently 33 types:
ascii folding
lowercase, uppercase
stop
stemmer
ngram, edge ngram, length, snowball, synonym, ...

Normalize: Complete Analysis
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [ "html_strip" ],
      "tokenizer": "standard",
      "filter": [ "lowercase", "asciifolding" ]
    }
  }
}
Analysis using all three stages.

Normalize: stop token filter
Stop words: removes unwanted words.
It is based on a list of words, and the filter must be defined manually.
"stop_noise": {
  "type": "stop",
  "stopwords_path": "sw.txt"
}
"stop_noise": {
  "type": "stop",
  "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
}
Stop word token filter definition, with the list read from a file or given inline. ignore_case and remove_trailing are boolean settings.
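
A stop filter is essentially a membership test against the word list. The Python sketch below imitates that behavior, including an `ignore_case` switch; the function name `stop_filter` is hypothetical, and the word list is the inline one from the definition above.

```python
STOP_WORDS = ["o", "a", "no", "na", "de", "da", "as", "os"]


def stop_filter(tokens, stopwords, ignore_case=False):
    """Drop every token that appears in the stop word list."""
    if ignore_case:
        blocked = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in blocked]
    blocked = set(stopwords)
    return [token for token in tokens if token not in blocked]


tokens = ["O", "meetup", "no", "dia"]
print(stop_filter(tokens, STOP_WORDS))                    # ['O', 'meetup', 'dia']
print(stop_filter(tokens, STOP_WORDS, ignore_case=True))  # ['meetup', 'dia']
```

The first call shows why an uppercase "O" survives a case-sensitive list: either the tokens are lowercased before the stop filter runs, or the comparison must ignore case.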

Normalize: Complete Analysis with Stop Words
"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [ "html_strip" ],
      "tokenizer": "standard",
      "filter": [ "lowercase", "stop_noise", "asciifolding" ]
    }
  }
}
Analysis using all three stages and my own stop words filter.

Normalize: stemmer token filter
Stemmer (derivations): "locks" words to a common stem ("jogar" => "joga" or "jogar" => "jog").
It is based on an existing rule set, but the filter must be defined manually.
"my_stemmer": {
  "type": "stemmer",
  "name": "light_portuguese"
}
Stemmer token filter definition. minimal_portuguese and portuguese are other Portuguese options.
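
To show what reducing words to a stem buys at search time, here is a deliberately tiny toy stemmer in Python. It is not the light_portuguese algorithm: the suffix list and the `toy_stem` name are invented for this example, and a real rule set is far larger and more careful.

```python
# Hypothetical suffix list, checked in order; a real stemmer has many more rules.
TOY_SUFFIXES = ("ando", "ou", "ar")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ["jogar", "jogando", "jogou", "mar"]:
    print(word, "=>", toy_stem(word))
```

Because "jogar", "jogando", and "jogou" all collapse to "jog", a search for one form can match documents containing the others, while a short word such as "mar" is left alone.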

Normalize: Complete Analysis with Stop Words and Stemmer
"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords_path": "sw.txt"
    },
    "light_pt": {
      "type": "stemmer",
      "name": "light_portuguese"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [ "html_strip" ],
      "tokenizer": "standard",
      "filter": [ "lowercase", "stop_noise", "asciifolding", "light_pt" ]
    }
  }
}
Analysis using all three stages, with my own stop words filter and the light Portuguese stemmer filter.

One Field Mapping
"mappings": {
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Simple mapping with one string field using my analyzer.

Problems
Sorting
Aggregation
SEO (Search Engine Optimization)

Problems: Sorting
Sorting on analyzed fields produces seemingly random results.
"Telha" < "casa"
New analyzer:
"sort": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": [ "lowercase", "asciifolding" ]
}
Sort analyzer. It uses the keyword tokenizer with the lowercase and asciifolding filters.
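
The "Telha" < "casa" surprise is ordinary code-point ordering, and it can be reproduced outside Elasticsearch. This Python sketch compares a raw sort with a sort on a normalized key that plays the role of the lowercase and asciifolding filters; `sort_key` is a hypothetical helper, not part of any Elasticsearch client.

```python
import unicodedata


def sort_key(value):
    """Lowercase the whole value and strip accents, keeping it as one term."""
    decomposed = unicodedata.normalize("NFD", value.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


titles = ["casa", "Telha", "água"]

# Raw comparison: uppercase letters sort before lowercase, accented letters last.
print(sorted(titles))

# Normalized comparison: the order a reader would expect.
print(sorted(titles, key=sort_key))
```

The value is never split into words here, which mirrors the choice of the keyword tokenizer: the whole title is compared as a single term.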

Problems: Aggregation
How it works: "sao", "paulo", "rio"
What we want: "São Paulo"
In other words, we do not want any analysis.
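
The effect on aggregation buckets can be imitated with a simple counter. In this Python sketch, `analyzed_terms` is a hypothetical stand-in for an analyzed field (lowercased, accent-folded, split into words), while counting the untouched values stands in for a field that is not analyzed; the sample city list is made up for the example.

```python
from collections import Counter
import unicodedata


def analyzed_terms(value):
    """Crude stand-in for an analyzed field: lowercase, fold accents, split."""
    decomposed = unicodedata.normalize("NFD", value.lower())
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return folded.split()


cities = ["São Paulo", "Rio", "São Paulo"]

# Buckets built from analyzed terms: the city name is broken apart.
analyzed_buckets = Counter(term for city in cities for term in analyzed_terms(city))
print(analyzed_buckets)

# Buckets built from the raw values: one bucket per city, as intended.
raw_buckets = Counter(cities)
print(raw_buckets)
```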

Problems: Search Engine Optimization
Stemming is bad here.
New analyzer:
"url_analyzer": {
  "type": "custom",
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop_noise", "asciifolding" ]
}
URL analyzer for SEO. It will not be used in mappings.

Problems: Search Engine Optimization
We do not need to map it to a field; we can call it through the analyze API.
curl -XPOST "http://localhost:9200/my_index/_analyze?analyzer=my_analyzer" -d '{ "O Meetup Elasticsearch RJ será no dia 05 de maio as 18h." }'
> meetup elasticsearch rj sera dia 05 maio 18h
analyze API example.
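
The terms returned by the analyze API can be joined into a URL slug. The Python sketch below imitates the lowercase, stop word, and asciifolding steps locally so the idea can be tried without a cluster. It assumes the stop word list is the inline one shown earlier in the deck (the actual contents of sw.txt are not given), and `url_terms` and `slug` are hypothetical helper names.

```python
import re
import unicodedata

# Assumption: same words as the inline "stopwords" example in the deck.
STOP_WORDS = {"o", "a", "no", "na", "de", "da", "as", "os"}


def fold(token):
    """Strip accents from a token, as an ASCII folding step would."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def url_terms(text):
    """Strip tags, split into words, lowercase, drop stop words, fold accents."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"\w+", without_tags.lower())
    return [fold(token) for token in tokens if token not in STOP_WORDS]


def slug(text):
    """Join the surviving terms with hyphens to form a URL-friendly slug."""
    return "-".join(url_terms(text))


print(slug("O Meetup Elasticsearch RJ será no dia 05 de maio as 18h."))
```

No stemming is applied, so every word in the slug stays readable, which is the point of keeping the stemmer out of this analyzer.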

Result
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_noise": {
          "type": "stop",
          "stopwords_path": "sw.txt"
        },
        "light_pt": {
          "type": "stemmer",
          "name": "light_portuguese"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop_noise", "asciifolding", "light_pt" ]
        },
        "sort": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase", "asciifolding" ]
        },
        "url_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop_noise", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_analyzer",
          "fields": {
            "sort": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "sort"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
Complete mapping for one field, using sub-fields for text search, sorting, and aggregation.

Other Work
Boost
Parent/Child
Log storage (Logstash + Kibana)
Infrastructure consulting for ELK

Thank you
www.emergi.net - [email protected]
"Keep it simple, but not simpler."
