
Tryomeetup - Elasticsearch - English

Javier Rey
October 22, 2014


Transcript

  1. What is Elasticsearch?
     • elasticsearch.org
       • Open-source distributed search server based on Lucene.
       • https://github.com/elasticsearch/elasticsearch
       • HTTP JSON API
       • 2010, as a rewrite of Compass (2004)
     • elasticsearch.com
       • Company with ~US$ 100 million of funding
       • 2012
       • The CTO is the creator of ES
  2. Apache Lucene
     • “full-featured text search engine library written in Java”
     • Apache project since 2001 (top-level since 2005)
     • Open-source
     • Handles on-disk storage, indexing and search.
  3. Why not use a DB?
     • Searches aren’t exact
     • Filters vs Queries
     • Scoring
     • Boosting
     • Databases weren’t made for full-text search.
     • Indexing with n-grams, stemming, stop words, geo, …
     • Plugins!
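To make "searches aren't exact" concrete, here is a minimal sketch of n-gram matching. The helper names `char_ngrams` and `ngram_similarity` are invented for this illustration, and the Jaccard overlap is only a stand-in for real scoring; it is not how Lucene ranks documents.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of `text`."""
    normalized = text.lower()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # An exact comparison fails on the typo, while the n-gram overlap
    # still separates the near miss from an unrelated word.
    print("elasticsearch" == "elasticsaerch")
    print(ngram_similarity("elasticsearch", "elasticsaerch"))
    print(ngram_similarity("elasticsearch", "postgresql"))
```

A database equality check gives a plain yes or no; the n-gram overlap gives a graded value that can be used to rank near misses.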
  4. API

  5. Create index and type
     curl -XPOST localhost:9200/blog -d '{
       "settings": { "number_of_shards": 3 },
       "mappings": {
         "post": {
           "properties": {
             "title": { "type": "string", "index": "not_analyzed" }
           }
         }
       }
     }'
  6. Create type
     curl -XPUT localhost:9200/blog/_mapping/comments -d '{
       "comments": {
         "properties": {
           "author": { "type": "string", "index": "not_analyzed" }
         }
       }
     }'
  7. Index document
     Without ID:
     curl -XPOST localhost:9200/blog/post -d '{ "title": "Elasticsearch rulz" }'
     With ID:
     curl -XPOST localhost:9200/blog/post/super_id -d '{ "title": "Elasticsearch rulz" }'
  8. • Every document is indexed with a mapping.
     • The mapping describes the data type of each attribute in the document and how to analyze it.
     • If there is no mapping, one is generated automagically (schema-less magic).
     • The mapping of an existing field cannot be changed afterwards.
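A rough illustration of the schema-less idea: guess a mapping from the JSON values of one document. The function name `infer_mapping` and its type rules are assumptions made for this sketch; Elasticsearch's real dynamic mapping does more (date detection, for instance).

```python
def infer_mapping(document):
    """Guess a field-type mapping from the JSON values of one document."""
    properties = {}
    for field, value in document.items():
        # bool must be checked before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            properties[field] = {"type": "boolean"}
        elif isinstance(value, int):
            properties[field] = {"type": "long"}
        elif isinstance(value, float):
            properties[field] = {"type": "double"}
        elif isinstance(value, dict):
            properties[field] = infer_mapping(value)  # nested object
        else:
            properties[field] = {"type": "string"}
    return {"properties": properties}


if __name__ == "__main__":
    print(infer_mapping({"title": "Elasticsearch rulz", "votes": 3}))
```

Because the type is guessed from the first document seen, declaring the mapping up front is the safer route whenever the analysis of a field matters.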
  9. • type
     • properties
     • dynamic (on by default)
     • enabled
     • type-specific attributes
       • date format
       • geo precision
       • analyzer
       • …
  10. Configurable way to analyze and process document attributes at index time. They map directly to Lucene’s Analyzers.
      Composed of:
      • One Tokenizer
      • Zero or more TokenFilters
      • Zero or more CharFilters
      What are they?
  11. Tokenizer
      Splits a stream of characters into tokens.
      • standard (whitespace, punctuation, …)
      • n-gram
      • email / URL
      • thai?
      TokenFilters
      Receive tokens from the tokenizer and can modify, add or delete tokens.
      • length
      • lowercase
      • stemmer (lots!)
      • phonetic
      • synonyms
      • stop words
      CharFilters
      • html strip
      • replace (& -> and)
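The three building blocks can be sketched as a small pipeline: char filters run on the raw text, one tokenizer splits it, and token filters rewrite the token list. All names here (`analyze`, `strip_html`, `STOP_WORDS`, …) are made up for the illustration, and the regex tokenizer is a crude stand-in for Lucene's standard tokenizer.

```python
import re

STOP_WORDS = {"the", "a", "by", "to", "of", "in", "on"}  # illustrative list only


def strip_html(text):
    """CharFilter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def replace_ampersand(text):
    """CharFilter: rewrite '&' as the word 'and'."""
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """TokenFilter that modifies every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """TokenFilter that deletes tokens found in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run the char filters, then the tokenizer, then the token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze(
        "<b>San Diego Gas & Electric</b>",
        [strip_html, replace_ampersand],
        tokenize,
        [lowercase, remove_stop_words],
    ))
```

Order matters: `lowercase` has to run before `remove_stop_words`, otherwise a capitalized "The" would slip past the stop-word list.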
  12. Example
      (Source: San Diego Union-Tribune) Solar customers in San Diego area protested the proposal by San Diego Gas & Electric to shift customers to a time-based billing system, during the first of four public hearings organized by CPUC on the proposal…

      curl -XGET 'localhost:9200/_analyze?tokenizer=standard&token_filters=standard,lowercase,stemmer' \
        -d '(Source: San Diego Union-Tribune) Solar customers in San Diego area protested the proposal by San Diego Gas & Electric to shift customers to a time-based billing system, during the first of four public hearings organized by CPUC on the proposal...' \
        | jq ".tokens[].token" \
        | xargs echo

      sourc san diego union tribun solar custom in san diego area protest the propos by san diego ga electr to shift custom to a time base bill system dure the first of four public hear organ by cpuc on the propos
  13. curl ci_es:9201/incoming/item/_mapping | jq .
      {
        "incoming_v3": {
          "mappings": {
            "item": {
              "properties": {
                "_type": { "type": "string", "index": "not_analyzed" },
                "body": {
                  "type": "string",
                  "analyzer": "custom_analyzer",
                  "fields": {
                    "language": {
                      "type": "langdetect",
                      "fields": {
                        "language": { "type": "string" },
                        "lang": { "type": "string" }
                      }
                    }
                  }
                },
                "score": { "type": "double" },
                "cleaning_method": { "type": "string" },
                "content_type": { "type": "string" },
                "headers": { "type": "object", "enabled": false },
                "highlighted_body": { "type": "string" },
                "id": { "type": "string",
                …

      curl ci_es:9201/incoming/_settings | jq .
      {
        "incoming_v3": {
          "settings": {
            "index": {
              "uuid": "CK_gqNXiTummmdpDXO2FKw",
              "analysis": {
                "filter": {
                  "custom_english_stemmer": { "type": "stemmer", "name": "light_english" }
                },
                "analyzer": {
                  "custom_analyzer": {
                    "filter": [ "standard", "lowercase", "custom_english_stemmer" ],
                    "tokenizer": "uax_url_email"
                  }
                }
              },
              "number_of_replicas": "1",
              "number_of_shards": "5",
              "version": { "created": "1030299" }
            }
          }
        }
      }
  14. curl ci_es:9201/incoming/item/_mapping | jq .
      {
        "incoming_v3": {
          "mappings": {
            "item": {
              …
              "query": {
                "properties": {
                  "span_near": {
                    "properties": {
                      "clauses": {
                        "properties": {
                          "span_term": {
                            "properties": {
                              "body": {
                                "properties": {
                                  "boost": { "type": "long" },
                                  "value": { "type": "string" }
                                }
                              }
                            }
                          }
                        }
                      },
                      "in_order": { "type": "boolean" },
                      "slop": { "type": "long" }
                    }
                  }
                }
              },
              "raw_body": { "type": "string", "index": "no" },
              …
  15. curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
        "query": {
          "terms": {
            "tags": [ "blue", "pill" ],
            "minimum_should_match": 1
          }
        }
      }'
  16. curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
        "query": {
          "bool": {
            "must": { "term": { "user": "kimchy" } },
            "must_not": { "range": { "age": { "from": 10, "to": 20 } } },
            "should": [
              { "term": { "tag": "wow" } },
              { "term": { "tag": "elasticsearch" } }
            ],
            "minimum_should_match": 1,
            "boost": 1.0
          }
        }
      }'
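How the clauses of a bool query combine can be sketched with a toy evaluator over plain dictionaries. `bool_matches`, `clause_matches` and `as_list` are hypothetical helpers written for this illustration; the sketch only decides match or no match and ignores scoring and `boost`.

```python
BOOL_QUERY = {
    "must": {"term": {"user": "kimchy"}},
    "must_not": {"range": {"age": {"from": 10, "to": 20}}},
    "should": [{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    "minimum_should_match": 1,
}


def clause_matches(doc, clause):
    """Evaluate one leaf clause: `term` (exact value) or `range` (inclusive from/to)."""
    if "term" in clause:
        field, expected = next(iter(clause["term"].items()))
        value = doc.get(field)
        # A list-valued field matches if any of its values equals the term.
        return expected in value if isinstance(value, list) else value == expected
    if "range" in clause:
        field, bounds = next(iter(clause["range"].items()))
        value = doc.get(field)
        return value is not None and bounds["from"] <= value <= bounds["to"]
    raise ValueError("unsupported clause: %r" % (clause,))


def as_list(clauses):
    """Accept a single clause or a list of clauses, as the JSON body does."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def bool_matches(doc, bool_query):
    """All `must` match, no `must_not` matches, enough `should` clauses match."""
    if not all(clause_matches(doc, c) for c in as_list(bool_query.get("must"))):
        return False
    if any(clause_matches(doc, c) for c in as_list(bool_query.get("must_not"))):
        return False
    matched = sum(clause_matches(doc, c) for c in as_list(bool_query.get("should")))
    return matched >= bool_query.get("minimum_should_match", 0)


if __name__ == "__main__":
    print(bool_matches({"user": "kimchy", "age": 30, "tag": ["wow"]}, BOOL_QUERY))
```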
  17. Queries vs Filters
      curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
        "query": {
          "filtered": {
            "query": { "match": { "tweet": "full text search" } },
            "filter": { "range": { "created": { "gte": "now-1d/d" } } }
          }
        }
      }'
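One way to picture the difference: a filter answers yes or no for each document, while a query produces a relevance score used for ranking. In the sketch below, `score` and `filtered_search` are invented names, and counting matching words is a deliberately naive stand-in for Lucene's scoring.

```python
def score(doc_text, query_text):
    """Query side: a relevance score, here simply the count of matching words."""
    doc_words = doc_text.lower().split()
    return sum(doc_words.count(word) for word in query_text.lower().split())


def filtered_search(docs, query_text, keep):
    """Apply the yes/no filter first, then score and rank what is left."""
    candidates = [doc for doc in docs if keep(doc)]
    scored = [(score(doc["tweet"], query_text), doc) for doc in candidates]
    hits = [(s, doc) for s, doc in scored if s > 0]
    # Sort on the score only, highest first.
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    tweets = [
        {"id": 1, "tweet": "full text search is fun", "created": 5},
        {"id": 2, "tweet": "search search", "created": 1},
        {"id": 3, "tweet": "full text search full", "created": 9},
    ]
    for s, doc in filtered_search(tweets, "full text search", lambda d: d["created"] >= 3):
        print(s, doc["id"])
```

Since a filter never contributes to the score, its yes/no result is cheap to reuse across searches, which is the usual argument for putting exact conditions such as date ranges in the filter part.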
  18. Other queries
      • dis max query
      • fuzzy like this query
      • function score query
      • fuzzy query
      • geoshape query
      • more like this query
      • nested query
      • prefix query
      • query string query
      • regexp query
      • span first query
      • span multi term query
      • span near query
      • span not query
      • span or query
      • span term query
      • wildcard query
      • minimum should match
      • template query
      • geo bounding box filter
      • geo distance filter
      • geo distance range filter
      • geo polygon filter
      • geoshape filter
      • geohash cell filter
      • script filter
  19. Aggregations
      • Facets supercharged (allows nested aggs)
      • Histograms
      • Geo bounds aggregation
      • Percentiles aggregation
      • IPv4 range aggregation
      • Significant terms aggregation
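What "nested aggs" buys over flat facets can be sketched in memory: bucket the documents by a term, then compute a histogram inside each bucket. `terms_agg`, `histogram_agg` and `nested_agg` are hypothetical helpers for this illustration and do not mirror the Elasticsearch response format.

```python
from collections import Counter, defaultdict


def terms_agg(docs, field):
    """Bucket documents by the exact value of `field`."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def histogram_agg(docs, field, interval):
    """Count documents per fixed-width interval of a numeric `field`."""
    counts = Counter((doc[field] // interval) * interval for doc in docs)
    return dict(sorted(counts.items()))


def nested_agg(docs, terms_field, histogram_field, interval):
    """A histogram computed separately inside every terms bucket."""
    return {
        key: histogram_agg(bucket, histogram_field, interval)
        for key, bucket in terms_agg(docs, terms_field).items()
    }


if __name__ == "__main__":
    users = [
        {"tag": "wow", "age": 12},
        {"tag": "wow", "age": 17},
        {"tag": "wow", "age": 25},
        {"tag": "elasticsearch", "age": 31},
    ]
    print(nested_agg(users, "tag", "age", 10))
```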
  20. • scripting (java, groovy, python)
      • custom function scores
      • low level queries
      • clustering
      • routing
      • logstash
      • kibana
      • marvel
      • warmers
      • rolling index
      • doc ttl
      • …
  21. IN DOCKER WE TRUST
      Docker trusted automated build:
      docker run -d -p 9200:9200 -p 9300:9300 dockerfile/elasticsearch