A primer to elasticsearch

My talk at the Berlin DigitalOcean meetup on July 13th, 2016

Felix Gilcher

July 20, 2016

Transcript

  1. backend developer/ops person; elasticsearch user since 0.10.something; co-founder of the Search UG Berlin
    3 — © 2016 asquera gmbh, creative commons cc by
  2. An overview of what ES does, what it doesn't do, and the relevant Google keywords for further research
  3. To the daily practitioners: I’ll gloss over a lot of points.
  4. logstash is a worker that takes individual input events, processes them in isolation and writes them to a data sink
  5. input events often are log lines
  6. the data sink often is Elasticsearch, but others are possible: kafka, influxdb, graphite
  7. parts of logstash will move to ES ingest nodes with 5.0
  8. Kibana visualizes the result of single Elasticsearch queries
  9. WHAT IS ELASTICSEARCH? a distributed json datastore based on apache lucene
  10. WHAT IS IT GOOD FOR? full text search; timeline data; distributed datastore for json docs; companion datastore to another NoSQL store
  11. WHAT IS IT NOT GOOD FOR? primary datastore; anything that requires transactions; volatile data; very heavy write loads across datacenter boundaries
  12. SO, HOW EXACTLY IS LUCENE DIFFERENT FROM MYSQL?
  13. shards are distributed and replicated in a cluster
  14. cluster nodes can be added and removed at any time, shards will be redistributed
  15. cluster can tolerate failure of individual nodes as long as shard replicas are available
  16. documents are json; all api interactions are json
  17. WRITE ONCE DATASTRUCTURES: Lucene segments are written once and never touched again. No in-place modification of documents. Update operations are insert + delete.
  18. queries cannot refer to other documents in other shards: no joins, no distinct queries
  19. doc id | content
      ------ | -------------
      0      | "Überlin ist auf Twitter"
      1      | "Ich bin auf Twitter"
      2      | "Ich folge Überlin"
  20. terms   | document ids
      ------- | -------------
      uberlin | 0,2
      twitter | 0,1
      bin     | 1
      ich     | 1,2
      auf     | 0,1
      folge   | 2
  21. Analysis determines which terms end up on the left side of the table in the first place.
  22. analysis   | result
      ---------- | ----------
      none       | "ich folge Überlin"
      whitespace | "ich" "folge" "Überlin"
      lowercase  | "ich" "folge" "überlin"
      normalize  | "ich" "folge" "uberlin"
      stemming   | "ich" "folg" "uberlin"
  23. This step happens both at index time and at query time.
  24. Manipulating analysis is the basis for manipulating matches.
  25. Does your system comfortably speak Unicode?
  26. Document:

      doc id | field value
      ------ | -----------
      1      | Test
      2      | test
      3      | Überlin

      Index:

      token   | doc ids
      ------- | -------
      test    | 1,2
      uberlin | 3
  27. search term | no. matches
      ----------- | -----------
      Test        | 2
      test        | 2
      Überlin     | 1
      überlin     | 0
  28. This particular issue has been fixed in 3.2, but others remain: stopwords, ...
  29. The recommendation is to preprocess the input yourself
  30. "\u0055\u0308" => Ü
      "\u0075\u0308" => ü
  31. PostgreSQL handles UCS-2 level 1, not UTF.
  32. “we should really reject combining chars, but can’t do that w/o breaking BC.”
  33. If you use PostgreSQL and text manipulation, you probably have a bug hiding in there.
  34. don't ask me about MySQL full text search
  35. Elasticsearch handles all of this gracefully and much, much more.
  36. TERM FREQUENCY / INVERSE DOCUMENT FREQUENCY
  37. Basic idea: rank documents based on terms matched.
  38. TERM FREQUENCY: terms that are frequent in a document get higher scores.
  39. INVERSE DOCUMENT FREQUENCY: terms that are frequent in the whole corpus get lower scores.
  40. score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
  41. Search is all about relevance and combinations thereof.
  42. Was the match in the title or the body of a document?
  43. Many systems can weight matches on fields differently.
  44. But many systems only support changing weights at index time. (PostgreSQL, MongoDB)
  45. With Elasticsearch, weights can be changed at query time.
  46. Scores can be manipulated at query time based on geo-coordinates, recency, manual boosting, ...
  47. PHRASE SEARCH AND STOPWORDS

      doc id | content
      ------ | -------------
      1      | "Ich bin auf Twitter"
      2      | "Ich folge Überlin"
  48. INDEXED WITH GERMAN STOPWORDS

      terms   | document ids
      ------- | -------------
      bin     | 1
      twitter | 1
      uberlin | 2
      folge   | 2
  49. Search systems are not binary. Faults in the system degrade its quality but rarely break it.
  50. Building a relevance model is usually the biggest part of a natural language search.
  51. It requires careful balancing of multiple, sometimes contradictory, demands to achieve the best result.
  52. WHERE TO GO FROM HERE
  53. IMAGE CREDITS page 58: Taken from Britta Weber's presentation at ES UG Berlin, Wednesday, October 30, 2013
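
The analysis chain and inverted index from items 19 to 22, the combining-character forms of Ü from item 30, and the TF/IDF idea from items 36 to 39 can be tied together in a small sketch. This is an illustration under stated assumptions, not how Lucene or Elasticsearch implement it: the function names `analyze`, `build_index` and `search` are hypothetical, there is no stemming or stopword handling, and the score is a simplified tf · idf sum that leaves out the `queryNorm`, `coord`, boost and `norm` factors from item 40.

```python
import math
import unicodedata
from collections import defaultdict


def analyze(text):
    """Whitespace tokenization, lowercasing and accent folding (no stemming)."""
    tokens = []
    for raw in text.split():
        lowered = raw.lower()
        # NFD splits "ü" into "u" + U+0308; dropping combining marks leaves "u".
        decomposed = unicodedata.normalize("NFD", lowered)
        tokens.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return tokens


def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, n_docs, query):
    """Rank documents by a simplified tf * idf sum over the query terms."""
    scores = defaultdict(float)
    # The query goes through the same analysis as the documents.
    for term in analyze(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms that occur in fewer documents get a larger idf.
        idf = 1.0 + math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += math.sqrt(tf) * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    0: "Überlin ist auf Twitter",
    1: "Ich bin auf Twitter",
    2: "Ich folge Überlin",
}
index = build_index(docs)
print(sorted(index["uberlin"]))
print(search(index, len(docs), "ich folge überlin"))
```

Because the query passes through the same `analyze` function as the documents, "überlin", "Überlin" and the decomposed form "U\u0308berlin" all resolve to the term `uberlin`. With these three documents, the query "ich folge überlin" ranks document 2 first, since it is the only one that contains all three terms.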