Tryomeetup - Elasticsearch

Javier Rey
October 22, 2014

Transcript

  1. What is it?
    • elasticsearch.org
    • Open-source distributed search server based on Lucene.
    • https://github.com/elasticsearch/elasticsearch
    • HTTP JSON API
    • 2010 (as a rewrite of Compass, 2004)
    • elasticsearch.com
    • Company with ~US$100 million in funding
    • 2012
    • CTO := creator of ES
  2. Apache Lucene
    • “full-featured text search engine library written in Java”
    • Apache project since 2001 (top-level since 2005)
    • Open-source
    • Handles on-disk storage, indexing and search.
  3. Why not use a database?
    • Searches are not exact
    • Filters vs Queries
    • Scoring
    • Boosting
    • Databases were not built for full-text search.
    • Indexing support for n-grams, stemming, stop words, geo, …
    • Plugins!
  4. document: what gets indexed inside an index, with a given type, expressed in JSON
  5. API

  6. Create index and type
    curl -XPOST localhost:9200/blog -d '{
      "settings": { "number_of_shards": 3 },
      "mappings": {
        "post": {
          "properties": {
            "title": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }'
  7. Create type
    curl -XPUT localhost:9200/blog -d '{
      "mappings": {
        "comments": {
          "properties": {
            "author": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }'
  8. Index document
    Without ID
    curl -XPOST localhost:9200/blog/post -d '{ "title": "Elasticsearch rulz" }'
    With ID
    curl -XPOST localhost:9200/blog/post/super_id -d '{ "title": "Elasticsearch rulz" }'
  9. • Each document is indexed according to a mapping.
    • The mapping describes the data type of each document attribute and how to analyze it.
    • If there is no mapping, one is generated automatically (the schema-less magic of ES).
    • Once a field is mapped, its mapping cannot be changed.
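The automatic mapping mentioned above can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: the function names `infer_field_type` and `infer_mapping` are hypothetical, the date detection only recognizes `YYYY-MM-DD`, and arrays and nulls are not handled. It is not how Elasticsearch itself is implemented.

```python
from datetime import datetime


def infer_field_type(value):
    """Guess a field mapping from a JSON value (simplified, hypothetical rules)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "double"}
    if isinstance(value, dict):
        # Nested JSON objects get their own nested properties.
        return {"type": "object", "properties": infer_mapping(value)}
    if isinstance(value, str):
        try:
            # Assumption: only plain YYYY-MM-DD strings are treated as dates.
            datetime.strptime(value, "%Y-%m-%d")
            return {"type": "date"}
        except ValueError:
            return {"type": "string"}
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(document):
    """Build a properties dict for every attribute of the document."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {
    "title": "Elasticsearch rulz",
    "votes": 3,
    "published": "2014-10-22",
    "author": {"name": "Javier", "active": True},
}
mapping = infer_mapping(doc)
print(mapping)
```

The first document that introduces a field decides its type, which is why an explicit mapping is usually preferable for fields such as `title` that should be `not_analyzed`.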
  10. • type
    • properties
    • dynamic (on by default)
    • enabled
    • type specific attributes
    • date format
    • geo precision
    • analyzer
    …
  11. What are they?
    A configurable way to analyze and process document attributes at index time. They map directly to Lucene Analyzers.
    Composed of:
    • One Tokenizer
    • Zero or more TokenFilters
    • Zero or more CharFilters
  12. Tokenizer
    Splits a text stream into tokens.
    • standard (whitespace, punctuation, …)
    • n-gram
    • email url
    • Thai?
    TokenFilters
    Take the tokenizer's output and can modify, add, or delete tokens.
    • length
    • lowercase
    • stemmer (many)
    • phonetic
    • synonyms
    • stop words
    CharFilters
    • html strip
    • replace & -> and
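The three building blocks above can be chained as in the following Python sketch: CharFilters run on the raw text, the Tokenizer splits it, and TokenFilters post-process the tokens. Every function name and the stop-word list here are made up for illustration; the real components live in Lucene and behave more carefully (for example, the standard tokenizer follows Unicode segmentation rules, not a simple regex).

```python
import re


def html_strip(text):
    # CharFilter: drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", " ", text)


def ampersand_to_and(text):
    # CharFilter: replace "&" with the word "and".
    return text.replace("&", " and ")


def simple_tokenizer(text):
    # Tokenizer: keep runs of letters and digits, drop everything else.
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    # TokenFilter: modify tokens.
    return [token.lower() for token in tokens]


STOP_WORDS = {"the", "a", "to", "by", "in", "of", "on"}  # illustrative list


def stop(tokens):
    # TokenFilter: delete tokens.
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>San Diego Gas & Electric</b> protested the proposal",
    char_filters=[html_strip, ampersand_to_and],
    tokenizer=simple_tokenizer,
    token_filters=[lowercase, stop],
)
print(tokens)
# ['san', 'diego', 'gas', 'and', 'electric', 'protested', 'proposal']
```

Order matters: running `stop` before `lowercase` would let a capitalized "The" through, because the stop-word list is lowercase.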
  13. Example
    (Source: San Diego Union-Tribune) Solar customers in San Diego area protested the proposal by San Diego Gas & Electric to shift customers to a time-based billing system, during the first of four public hearings organized by CPUC on the proposal…

    curl -XGET 'localhost:9200/_analyze?tokenizer=standard&token_filters=standard,lowercase,stemmer' \
      -d '(Source: San Diego Union-Tribune) Solar customers in San Diego area protested the proposal by San Diego Gas & Electric to shift customers to a time-based billing system, during the first of four public hearings organized by CPUC on the proposal...' \
      | jq ".tokens[].token" \
      | xargs echo

    sourc san diego union tribun solar custom in san diego area protest the propos by san diego ga electr to shift custom to a time base bill system dure the first of four public hear organ by cpuc on the propos
  14. curl ci_es:9201/incoming/item/_mapping | jq .
    {
      "incoming_v3": {
        "mappings": {
          "item": {
            "properties": {
              "_type": { "type": "string", "index": "not_analyzed" },
              "body": {
                "type": "string",
                "analyzer": "mayhem_analyzer",
                "fields": {
                  "language": {
                    "type": "langdetect",
                    "fields": {
                      "language": { "type": "string" },
                      "lang": { "type": "string" }
                    }
                  }
                }
              },
              "ci_score": { "type": "double" },
              "cleaning_method": { "type": "string" },
              "content_type": { "type": "string" },
              "headers": { "type": "object", "enabled": false },
              "highlighted_body": { "type": "string" },
              "id": {
                "type": "string",

    curl ci_es:9201/incoming/item/_mapping | jq .
    {
      "incoming_v3": {
        "settings": {
          "index": {
            "uuid": "CK_gqNXiTummmdpDXO2FKw",
            "analysis": {
              "filter": {
                "custom_english_stemmer": { "type": "stemmer", "name": "light_english" }
              },
              "analyzer": {
                "mayhem_analyzer": {
                  "filter": [ "standard", "lowercase", "custom_english_stemmer" ],
                  "tokenizer": "uax_url_email"
                }
              }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "version": { "created": "1030299" }
          }
        }
      }
    }
  15. curl ci_es:9201/incoming/item/_mapping | jq .
    { "incoming_v3": { "mappings": { "item": {
      …
      "query": {
        "properties": {
          "span_near": {
            "properties": {
              "clauses": {
                "properties": {
                  "span_term": {
                    "properties": {
                      "body": {
                        "properties": {
                          "boost": { "type": "long" },
                          "value": { "type": "string" }
                        }
                      }
                    }
                  }
                }
              },
              "in_order": { "type": "boolean" },
              "slop": { "type": "long" }
            }
          }
        }
      },
      "raw_body": { "type": "string", "index": "no" },
      …
  16. curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
      "query": {
        "terms": {
          "tags": [ "blue", "pill" ],
          "minimum_should_match": 1
        }
      }
    }'
  17. curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
      "query": {
        "bool": {
          "must": { "term": { "user": "kimchy" } },
          "must_not": { "range": { "age": { "from": 10, "to": 20 } } },
          "should": [
            { "term": { "tag": "wow" } },
            { "term": { "tag": "elasticsearch" } }
          ],
          "minimum_should_match": 1,
          "boost": 1.0
        }
      }
    }'
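To make the semantics of the bool query above concrete, here is a small in-memory Python sketch. It only decides whether a document matches (no scoring, so `boost` is left out), supports just `term` and `range` clauses, and treats `from`/`to` as inclusive bounds. The helper names and the sample documents are hypothetical.

```python
def term_matches(doc, term_clause):
    # {"field": value}: exact match; for list fields, any element may match.
    field, wanted = next(iter(term_clause.items()))
    actual = doc.get(field)
    if isinstance(actual, list):
        return wanted in actual
    return actual == wanted


def range_matches(doc, range_clause):
    # {"field": {"from": x, "to": y}}: both bounds inclusive in this sketch.
    field, bounds = next(iter(range_clause.items()))
    actual = doc.get(field)
    return actual is not None and bounds["from"] <= actual <= bounds["to"]


def clause_matches(doc, clause):
    if "term" in clause:
        return term_matches(doc, clause["term"])
    if "range" in clause:
        return range_matches(doc, clause["range"])
    raise ValueError(f"clause type not supported in this sketch: {clause}")


def as_list(clauses):
    # The DSL accepts either a single clause or a list of clauses.
    return clauses if isinstance(clauses, list) else [clauses]


def bool_matches(doc, bool_query):
    must = as_list(bool_query.get("must", []))
    must_not = as_list(bool_query.get("must_not", []))
    should = as_list(bool_query.get("should", []))
    if not all(clause_matches(doc, c) for c in must):
        return False
    if any(clause_matches(doc, c) for c in must_not):
        return False
    matched_should = sum(clause_matches(doc, c) for c in should)
    return matched_should >= bool_query.get("minimum_should_match", 0)


BOOL_QUERY = {
    "must": {"term": {"user": "kimchy"}},
    "must_not": {"range": {"age": {"from": 10, "to": 20}}},
    "should": [{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    "minimum_should_match": 1,
}

docs = [
    {"id": 1, "user": "kimchy", "age": 30, "tag": ["wow"]},
    {"id": 2, "user": "kimchy", "age": 15, "tag": ["wow"]},       # hit by must_not
    {"id": 3, "user": "kimchy", "age": 30, "tag": ["other"]},     # no should clause
    {"id": 4, "user": "someone", "age": 30, "tag": ["elasticsearch"]},  # fails must
]
matching_ids = [d["id"] for d in docs if bool_matches(d, BOOL_QUERY)]
print(matching_ids)  # [1]
```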
  18. Queries vs Filters
    curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "tweet": "full text search" } },
          "filter": { "range": { "created": { "gte": "now-1d/d" } } }
        }
      }
    }'
  19. Other queries
    • dis max query
    • fuzzy like this query
    • function score query
    • fuzzy query
    • geoshape query
    • more like this query
    • nested query
    • prefix query
    • query string query
    • regexp query
    • span first query
    • span multi term query
    • span near query
    • span not query
    • span or query
    • span term query
    • wildcard query
    • minimum should match
    • template query
    • geo bounding box filter
    • geo distance filter
    • geo distance range filter
    • geo polygon filter
    • geoshape filter
    • geohash cell filter
    • script filter
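The list above mixes queries and filters. The difference, as in the filtered query on the "Queries vs Filters" slide, is that a filter only answers yes or no, while a query also computes a relevance score. The Python sketch below mimics that split with made-up data: the score is a naive count of matching terms (not Lucene's real scoring), and the cutoff is a hand-computed equivalent of "start of yesterday" for a fixed, assumed "now".

```python
from datetime import datetime, timedelta


def match_score(text, query_text):
    # Query part: naive relevance, counting occurrences of each query term.
    tokens = text.lower().split()
    return sum(tokens.count(term) for term in query_text.lower().split())


def filtered_search(docs, query_text, created_gte):
    hits = []
    for doc in docs:
        # Filter part: a plain yes/no test that never affects the score.
        if doc["created"] < created_gte:
            continue
        score = match_score(doc["tweet"], query_text)
        if score > 0:
            hits.append((doc["id"], score))
    # Highest score first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Assumed "now" so the example is deterministic.
now = datetime(2014, 10, 22, 18, 0)
# One day back, rounded down to the start of that day.
cutoff = (now - timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)

docs = [
    {"id": 1, "tweet": "full text search is fun", "created": datetime(2014, 10, 22, 9, 0)},
    {"id": 2, "tweet": "I like search", "created": datetime(2014, 10, 21, 8, 0)},
    {"id": 3, "tweet": "full text search full text search", "created": datetime(2014, 10, 19, 8, 0)},
    {"id": 4, "tweet": "nothing relevant here", "created": datetime(2014, 10, 22, 10, 0)},
]
hits = filtered_search(docs, "full text search", cutoff)
print(hits)  # [(1, 3), (2, 1)]
```

Document 3 would have scored highest, but the filter removes it before any scoring happens, which is the reason filters are cheap to evaluate and to cache.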
  20. Aggregations
    • Facets supercharged (allows nesting)
    • Histograms
    • Geobound aggregations
    • Percentiles aggregations
    • IPv4 range aggregation
    • Significant terms aggregation
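What "allows nesting" means can be shown with a tiny in-memory Python sketch: a terms-style bucketing whose buckets each run a sub-aggregation (here an average). The function names, the `avg_<field>` key, and the sample posts are hypothetical; real aggregations run distributed across shards and return a richer response.

```python
from collections import defaultdict


def avg_agg(field):
    """Return a sub-aggregation that averages `field` over a bucket's docs."""
    def run(bucket_docs):
        values = [doc[field] for doc in bucket_docs]
        return {"avg_" + field: sum(values) / len(values)}
    return run


def terms_agg(docs, field, sub_agg=None):
    """Group docs by the value of `field`, optionally nesting a sub-aggregation."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    buckets = []
    for key, members in groups.items():
        bucket = {"key": key, "doc_count": len(members)}
        if sub_agg is not None:
            # Nesting: the sub-aggregation sees only this bucket's documents.
            bucket.update(sub_agg(members))
        buckets.append(bucket)
    # Largest buckets first, ties broken by key for a stable order.
    buckets.sort(key=lambda b: (-b["doc_count"], str(b["key"])))
    return buckets


posts = [
    {"author": "ana", "votes": 4},
    {"author": "bob", "votes": 1},
    {"author": "ana", "votes": 6},
]
buckets = terms_agg(posts, "author", sub_agg=avg_agg("votes"))
print(buckets)
# [{'key': 'ana', 'doc_count': 2, 'avg_votes': 5.0},
#  {'key': 'bob', 'doc_count': 1, 'avg_votes': 1.0}]
```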
  21. • scripting (java, groovy, python)
    • custom function scores
    • low level queries
    • clustering
    • routing
    • logstash
    • kibana
    • marvel
    • warmers
    • rolling index
    • doc ttl
    • percolator
    • …
  22. IN DOCKER WE TRUST
    Trusted automated build
    docker run -d -p 9200:9200 -p 9300:9300 dockerfile/elasticsearch