
ElasticSearch: The Missing Intro (Indexing and Querying) by Erik Rose

Elasticsearch provides an easy path to clusterable full-text search, with synonyms, faceting, and geographic math, but there's a paucity of written wisdom beyond its API docs. This talk, part 1 of a 2-part series, surveys its capabilities and shows how its internal data structures and algorithms work. With the groundwork laid, we explore how to choose efficient indexing and the right queries to make your apps go fast.

PyCon 2013

March 15, 2013

Transcript

  1. elasticsearch: the missing intro

    Part 1: Indexing & Querying, by Erik Rose
  2. None
  3. what it’s good for

    • Full-text search
    • Big data
    • Faceting
    • Geographical queries
  4. one (insanely productive) man

  5. the rest of us ?

  6. data structures

  7. JSON HTTP on port 9200

  8. index doctype another doctype {…}

  9. IDs 6a8ca01c-7896-48e9-81cc-9f70661fcb32

  10. diplodocus .... 333
    duodenum .... 201
    dwaal .... 500, 119

  11. row → 0,1,3
    boat → 0,1
    chicken → 2

    0: row row row your boat
    1: row the row boat
    2: chicken chicken chicken
    3: the front row
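The inverted index on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how Elasticsearch or Lucene implement it: the names `DOCS`, `build_inverted_index`, and `search_all` are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
from collections import defaultdict

# The four example documents from the slide, keyed by doc ID.
DOCS = {
    0: "row row row your boat",
    1: "row the row boat",
    2: "chicken chicken chicken",
    3: "the front row",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, *terms):
    """Return the doc IDs that contain every given term."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(DOCS)
print(index["row"])                      # [0, 1, 3]
print(search_all(index, "row", "boat"))  # [0, 1]
```

A lookup touches only the postings of the queried terms, which is why a term search does not need to scan every document.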
  16. term → doc [positions]

    row → 0 [0,1,2], 1 [0,2], 3 [2]
    boat → 0 [4], 1 [3]
    chicken → 2 [0,1,2]

    0: row row row your boat
    1: row the row boat
    2: chicken chicken chicken
    3: the front row
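Storing positions alongside doc IDs is what makes phrase matching possible. The sketch below rebuilds the slide's positional index and uses it to find consecutive words. It is an illustration under simple assumptions (whitespace tokenization, exact adjacency); `build_positional_index` and `phrase_search` are hypothetical names, not Elasticsearch APIs.

```python
from collections import defaultdict

DOCS = {
    0: "row row row your boat",
    1: "row the row boat",
    2: "chicken chicken chicken",
    3: "the front row",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_search(index, phrase):
    """Return doc IDs in which the words of `phrase` appear consecutively."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only docs that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at `start` if term i sits at position start + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[term][doc_id]
                   for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


pos_index = build_positional_index(DOCS)
print(pos_index["row"])                      # {0: [0, 1, 2], 1: [0, 2], 3: [2]}
print(phrase_search(pos_index, "row boat"))  # [1]
```

Both doc 0 and doc 1 contain "row" and "boat", but only doc 1 has them adjacent, so the positions decide the match.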
  20. analysis

  21. None
  22. stock analyzers

    original:   Red-orange gerbils live at #43A Franklin St.
    whitespace: Red-orange, gerbils, live, at, #43A, Franklin, St.
    standard:   red, orange, gerbils, live, 43a, franklin, st
    simple:     red, orange, gerbils, live, at, a, franklin, st
    stop:       red, orange, gerbils, live, franklin, st
    snowball:   red, orang, gerbil, live, 43a, franklin, st

    • stopwords
    • stemming
    • punctuation
    • case-folding
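To make the differences between analyzers concrete, here is a rough Python approximation of three of them (whitespace, simple, and stop). These functions only mimic the outputs shown on the slide; they are not the Lucene implementations, and the `STOPWORDS` set is a tiny illustrative assumption rather than the real stopword list.

```python
import re

TEXT = "Red-orange gerbils live at #43A Franklin St."

# Tiny illustrative stopword list (assumption); real lists are longer.
STOPWORDS = {"a", "an", "and", "at", "of", "the", "to"}


def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Keep runs of letters only, then lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def stop_analyzer(text):
    """Like the simple analyzer, but drop stopwords."""
    return [t for t in simple_analyzer(text) if t not in STOPWORDS]


print(whitespace_analyzer(TEXT))
# ['Red-orange', 'gerbils', 'live', 'at', '#43A', 'Franklin', 'St.']
print(simple_analyzer(TEXT))
# ['red', 'orange', 'gerbils', 'live', 'at', 'a', 'franklin', 'st']
print(stop_analyzer(TEXT))
# ['red', 'orange', 'gerbils', 'live', 'franklin', 'st']
```

Note how the simple analyzer turns "#43A" into the lone token "a", which the stop analyzer then discards; the house number is lost either way.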
  23. curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'Red-orange gerbils live at #43A Franklin St.'

    {
      "tokens" : [ {
        "token" : "Red-orange",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "gerbils",
        "start_offset" : 11,
        "end_offset" : 18,
        "type" : "word",
        "position" : 2
      }, ...
  24. 'address': {'type': 'string', 'analyzer': 'address_analyzer'}

    address_analyzer: CharFilter → Tokenizer → Token Filter → terms
  25. 'analysis': {
        'analyzer': {
            'name_analyzer': {
                'type': 'custom',
                'tokenizer': 'name_tokenizer',
                'filter': ['lowercase']
            }
        },
        'tokenizer': {
            'name_tokenizer': {
                'type': 'pattern',
                'pattern': "[^a-zA-Z']+"
            }
        }
    }

    name_analyzer: CharFilter → Tokenizer → Token Filter → terms
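The effect of this custom analyzer can be previewed in plain Python. The sketch assumes the pattern tokenizer treats the regex as a separator (its default behavior) and that the `lowercase` filter simply lowercases each token; the function name `name_analyzer` mirrors the config but is otherwise a hypothetical stand-in, and the sample name is made up.

```python
import re

# Same separator pattern as the slide's name_tokenizer: anything that is
# not a letter or an apostrophe splits tokens.
NAME_PATTERN = re.compile(r"[^a-zA-Z']+")


def name_analyzer(text):
    """Tokenize on the separator pattern, then lowercase (the token filter)."""
    tokens = [token for token in NAME_PATTERN.split(text) if token]
    return [token.lower() for token in tokens]


print(name_analyzer("Mary-Jane O'Brien, Jr."))
# ['mary', 'jane', "o'brien", 'jr']
```

Keeping the apostrophe out of the separator class is what lets a name like O'Brien survive as a single term.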
  26. synonyms

    "filter": {
      "synonym": {
        "type": "synonym",
        "synonyms": [
          "albert => albert, al",
          "allan => allan, al"
        ]
      }
    }

    original query: Allan Smith
    after synonyms: [allan, al] smith

    original query: Albert Smith
    after synonyms: [albert, al] smith
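The synonym expansion above can be sketched as a lookup that replaces one token with a list of alternatives at the same position. This is an illustrative model only: `SYNONYMS`, `expand_synonyms`, and `analyze_query` are hypothetical names, and lowercasing plus whitespace splitting stand in for whatever analyzer precedes the synonym filter.

```python
# Rules from the slide, in "term => replacements" form.
SYNONYMS = {
    "albert": ["albert", "al"],
    "allan": ["allan", "al"],
}


def expand_synonyms(tokens):
    """Return one list of alternative terms per token position."""
    return [SYNONYMS.get(token, [token]) for token in tokens]


def analyze_query(text):
    """Lowercase, split on whitespace (assumption), then expand synonyms."""
    return expand_synonyms(text.lower().split())


print(analyze_query("Allan Smith"))   # [['allan', 'al'], ['smith']]
print(analyze_query("Albert Smith"))  # [['albert', 'al'], ['smith']]
```

Because both names expand to include "al", a document indexed as "Al Smith" can match either query.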
  27. querying

  28. None
  29. {
      "bool" : {
        "must" : {
          "term" : { "user" : "fred" }
        },
        "must_not" : {
          "range" : {
            "age" : { "from" : 12, "to" : 21 }
          }
        },
        "should" : [
          { "term" : { "tag" : "crunchy" } },
          { "term" : { "tag" : "elasticsearch" } }
        ],
        "minimum_number_should_match" : 1,
        "boost" : 1.0
      }
    }
  30. None
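The matching logic of the bool query on slide 29 can be modeled with a small evaluator over plain dictionaries. This is a sketch of the semantics only: it ignores scoring and `boost`, supports just `term` and `range` clauses, and assumes range bounds are inclusive. The function names and the sample documents are hypothetical.

```python
QUERY = {
    "must": {"term": {"user": "fred"}},
    "must_not": {"range": {"age": {"from": 12, "to": 21}}},
    "should": [
        {"term": {"tag": "crunchy"}},
        {"term": {"tag": "elasticsearch"}},
    ],
    "minimum_number_should_match": 1,
}


def as_list(value):
    """Clauses may be given as a single dict or a list of dicts."""
    return value if isinstance(value, list) else [value]


def matches_clause(doc, clause):
    """Evaluate one term or range clause against a dict document."""
    (kind, body), = clause.items()
    (field, spec), = body.items()
    value = doc.get(field)
    if kind == "term":
        # A multi-valued field matches if any of its values equals the term.
        return spec in as_list(value)
    if kind == "range":
        # Inclusive bounds are an assumption of this sketch.
        return value is not None and spec["from"] <= value <= spec["to"]
    raise ValueError("unsupported clause: %s" % kind)


def matches_bool(doc, query):
    """All must clauses, no must_not clause, and enough should clauses."""
    if not all(matches_clause(doc, c) for c in as_list(query.get("must", []))):
        return False
    if any(matches_clause(doc, c) for c in as_list(query.get("must_not", []))):
        return False
    matched = sum(matches_clause(doc, c) for c in as_list(query.get("should", [])))
    return matched >= query.get("minimum_number_should_match", 0)


print(matches_bool({"user": "fred", "age": 30, "tag": ["crunchy"]}, QUERY))  # True
print(matches_bool({"user": "fred", "age": 15, "tag": ["crunchy"]}, QUERY))  # False
```

The second document fails only because its age falls inside the excluded range, even though every other clause matches.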
  31. filters            queries

    • Boolean            • Fuzzy, scoring
    • Fast               • Slower
    • Cacheable          • Not cacheable
  32. curl -XGET -s 'http://localhost:9200/blog/_search?pretty=true' -d \
    '{
      "query": {
        "filtered": {
          "filter": {
            "term": { "category": "rants" }
          },
          "query": {
            "bool": {
              "should": [
                { "match_phrase": { "body": "fix your little red wagon" } },
                { "match": { "body": "fix your little red wagon" } }
              ]
            }
          }
        }
      }
    }'
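The request body on slide 32 pairs a cheap, cacheable term filter with a scored query that accepts either a phrase match or a looser word match. Building that body programmatically might look like the sketch below; `filtered_phrase_query` is a hypothetical helper, and sending the JSON to the `_search` endpoint is deliberately left out so the example stays self-contained.

```python
import json


def filtered_phrase_query(field, text, filter_field, filter_value):
    """Build a filtered query: term filter plus phrase-or-words scoring."""
    return {
        "query": {
            "filtered": {
                # The filter narrows the candidate set without scoring.
                "filter": {"term": {filter_field: filter_value}},
                "query": {
                    "bool": {
                        "should": [
                            {"match_phrase": {field: text}},
                            {"match": {field: text}},
                        ]
                    }
                },
            }
        }
    }


body = filtered_phrase_query(
    "body", "fix your little red wagon", "category", "rants")
print(json.dumps(body, indent=2))
```

Listing both `match_phrase` and `match` under `should` means a document matching either clause is returned, and one matching both can score higher than one matching only the loose `match`.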
  33. libraries

    • pyes
    • pyelasticsearch

        query = {'query': {
            'filtered': {
                'query': {'match': {'name': 'test tester'}},
                'filter': {'range': {'age': {'from': 27, 'to': 37}}}}}}
        es.search(query, index='people')

    • elasticutils

        print [item['title'] for item in
               searcher.query(title__text='cookie')
                       .filter(topics='websites')]
  34. thank you

    twitter: ErikRose
    erik@mozilla.com

    Background image by Tim and Julie Wilson: https://secure.flickr.com/photos/secondtree/. This presentation is noncommercial sharealike in accordance with that image's license.

    Part 2: Sunday at 1:10pm