Elasticsearch at Ticketbis

Jose
November 19, 2016

The talk (in Spanish) can be found here: https://www.youtube.com/watch?v=vrtDe0sb09s

Transcript

  1. ~ 400 employees
    ~ 50 countries
    ~ 90 M€ in 2015
    ~ May 2016: Acquired by StubHub (eBay)
  2. Elasticsearch Intro
    ~ A distributed RESTful search engine
    ~ Lucene: the search engine library behind ES
    ~ Basic concepts
      - Index (~ database)
      - Type (~ table)
      - Document (~ row)
  3. Elasticsearch concepts
    [Diagram: a cluster of three nodes (Node 1, Node 2, Node 3) holding primary shards P0, P1, P2 and one replica of each (R0, R1, R2)]
    3 nodes, 3 shards per index and 1 replica per shard (up to 6 nodes)
  4. Elasticsearch concepts
    [Diagram: a cluster of three nodes (Node 1, Node 2, Node 3) holding primary shards P0, P1, P2 and two replicas of each (R0, R1, R2)]
    3 nodes, 3 shards per index and 2 replicas per shard (up to 9 nodes)
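
The node limits quoted for these two layouts ("up to 6 nodes", "up to 9 nodes") follow from counting shard copies: every primary and every replica can sit on its own node, and any node beyond that would hold no shard of the index. A minimal Python sketch of that arithmetic (the function name `max_useful_nodes` is made up for illustration and is not part of any Elasticsearch API):

```python
def max_useful_nodes(primary_shards: int, replicas_per_shard: int) -> int:
    """Total number of shard copies (primaries + replicas) for one index.

    With more nodes than this, the extra nodes would hold no shard of the index.
    """
    return primary_shards * (1 + replicas_per_shard)


print(max_useful_nodes(3, 1))  # 3 primaries + 3 replicas -> 6
print(max_useful_nodes(3, 2))  # 3 primaries + 6 replicas -> 9
```
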
  5. Capacity planning
    Overallocation (#shards > #nodes)
    [Diagram: a cluster of two nodes (Node 1, Node 2) holding primary shards P0, P1, P2 and replicas R0, R1, R2]
    3 shards for a 2-node cluster (1 replica)
  6. Capacity planning
    Avoid the "kagillion shards" problem
    ~ A shard is a Lucene index (+ resources)
    ~ A search request touches all shards
    ~ Poor document relevance (TF/IDF)
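  7. "Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde"
    (English: "It is the neighbour who elects the mayor, and it is the mayor who wants the neighbours to be the mayor")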
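  8. "Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde"
    (English: "It is the neighbour who elects the mayor, and it is the mayor who wants the neighbours to be the mayor")
    tf(freq=3.0) = 1.73
    ...
    idf(docFreq=1, maxDocs=1) = 0.30
    idf(docFreq=2, maxDocs=10) = 2.20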
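
The figures on this slide are consistent with Lucene's classic TF/IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The sketch below assumes that formula (the slide does not name it) and reproduces the numbers; tf(freq=3.0) matches a term that occurs three times, as "alcalde" does in the quoted sentence. The slide appears to truncate 0.3069 to 0.30.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor, assuming Lucene's classic similarity: sqrt(freq)."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming the classic formula 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


print(f"{tf(3.0):.4f}")     # 1.7321
print(f"{idf(1, 1):.4f}")   # 0.3069
print(f"{idf(2, 10):.4f}")  # 2.2040
```

Because idf depends on the document counts of the shard that scores the query, the same term can get very different weights on a nearly empty shard and on a well-populated one, which is the relevance problem the previous slides warn about.
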
  9. Capacity planning
    Replicas
    ~ Improve performance and throughput
    ~ High availability and strong performance for read-intensive search
    ~ Number of replicas can be changed at any time
  10. Ticketbis use case
    Size & number of documents
    ~ 1.2KB per document
    ~ 30K docs in 63 languages = 2M docs / year
    ~ 2.3GB / year
    ~ Rule of thumb: < 32GB / shard
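
The sizing figures above can be re-derived in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1,000,000 KB) and that the per-document size stays at 1.2KB; the variable names are illustrative.

```python
DOC_SIZE_KB = 1.2         # average document size from the slide
EVENTS_PER_YEAR = 30_000  # "30K docs"
LANGUAGES = 63
SHARD_LIMIT_GB = 32       # rule-of-thumb ceiling per shard

docs_per_year = EVENTS_PER_YEAR * LANGUAGES            # 1,890,000, i.e. roughly 2M
gb_per_year = docs_per_year * DOC_SIZE_KB / 1_000_000  # KB -> GB (decimal units)
years_to_fill_one_shard = SHARD_LIMIT_GB / gb_per_year

print(docs_per_year)                      # 1890000
print(round(gb_per_year, 2))              # 2.27
print(round(years_to_fill_one_shard, 1))  # 14.1
```

At this growth rate a single shard would stay under the 32GB rule of thumb for many years, so the shard count here is driven by other concerns than raw size.
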
  11. Event document
        {
          "Id": 18,
          "Date": "2016-12-04T20:00:00Z",
          "Timezone": "Europe/Madrid",
          "shipment_deadline": "2016-12-01T23:00:00Z",
          "translations": {
            "es_es": {
              "name": "Barcelona - Real Madrid",
              "title": "Entradas Barcelona - Real Madrid"
            },
            "en_gb": {
              "name": "Barcelona v Real Madrid",
              "title": "Barcelona v Real Madrid Tickets - La Liga"
            }
          }
        }
  12. I18n approaches
    ~ Parent / Child
    ~ Nested data
    ~ Shard per language
    ~ Type per language
    ~ Index per language
    ~ Field per language
  13. Parent / Child
        PUT /catalog
        {
          "mappings": {
            "event": {...},  // non-i18n fields (date, ...)
            "translations": {
              "_parent": { "type": "event" },
              "properties": {
                "locale": { "type": "string" },
                "name", ....  // rest of i18n fields
              }
            }
          }
        }
  14. Parent / Child
    ~ Parent and child live in the same shard
    ~ Can update a single translation
    ~ Always hits all shards
    ~ Doesn't support different analysers per locale
  16. Nested data
        PUT /catalog
        {
          "mappings": {
            "event": {
              "properties": {
                …  // non-i18n fields (date, ...)
                "translations": {
                  "type": "nested",
                  "properties": {
                    "locale": { "type": "string" },
                    "name", …  // rest of i18n fields
                  }
                }
              }
            }
          }
        }
  17. Nested data
    ~ Nested data lives in the same segment
    ~ Queries 10 times faster than parent/child
    ~ Always hits all shards
    ~ Can't update a single translation
    ~ Doesn't support different analysers per locale
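
The talk does not show how the nested translations are searched. As a hedged illustration, the sketch below builds the body of an Elasticsearch `nested` query against the `translations` path from the mapping slide; the helper name `nested_name_query` and the choice of `match` clauses are assumptions, not the speaker's code.

```python
import json


def nested_name_query(locale: str, text: str) -> dict:
    """Build a search body that matches one translation of an event.

    Both conditions must hold inside the same nested `translations` object,
    so a Spanish name cannot match together with an English locale.
    """
    return {
        "query": {
            "nested": {
                "path": "translations",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"translations.locale": locale}},
                            {"match": {"translations.name": text}},
                        ]
                    }
                },
            }
        }
    }


# The resulting JSON would be sent as the body of a search request on the index.
print(json.dumps(nested_name_query("es_es", "Barcelona"), indent=2))
```
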
  19. Shard per language (routing)
        PUT /catalog/event/18?routing=es_es
        {
          "Date": "2016-12-04T20:00:00Z",
          "Timezone": "Europe/Madrid",
          "shipment_deadline": "2016-12-01T23:00:00Z",
          "name": "Barcelona - Real Madrid",
          "title": "Entradas Barcelona - Real Madrid"
        }

        GET /catalog/event/18?routing=es_es
  20. Shard per language (routing)
    ~ Only hits one shard
    ~ Better query performance
    ~ Doesn't support different analysers per locale
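
Routing works because Elasticsearch picks the shard from a hash of the routing value modulo the number of primary shards, so every document and query carrying `routing=es_es` lands on the same shard. The sketch below shows only that idea: `zlib.crc32` is a stand-in for Elasticsearch's internal hash function, so the shard numbers it prints will not match a real cluster, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number: hash(routing) % number of primaries.

    CRC32 is used here only as a deterministic stand-in for the real hash.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


for locale in ("es_es", "en_gb", "fr_fr"):
    print(locale, "-> shard", shard_for(locale, 3))
```

With 63 locales and only a few shards, several locales necessarily share a shard, so this is "one shard per query" rather than a private shard for each language.
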
  22. Type per language
        PUT /catalog/event_es_es/18
        {
          "Date": "2016-12-04T20:00:00Z",
          "Timezone": "Europe/Madrid",
          "shipment_deadline": "2016-12-01T23:00:00Z",
          "name": "Barcelona - Real Madrid",
          "title": "Entradas Barcelona - Real Madrid"
        }

        GET /catalog/event_es_es/18
  23. Type per language
    ~ Supports different analysers per locale
    ~ Same fields in different types within the same index share the same inverted index
  24. locale es_es: "Una relajante taza de café con leche en Plaza Mayor" (English: "A relaxing cup of café con leche in Plaza Mayor")
    Taza: idf(docFreq=1, maxDocs=1) = 0.30
  25. locale es_es: "Una relajante taza de café con leche en Plaza Mayor"
    locale en_gb: "A relaxing cup of café con leche in Plaza Mayor"
    Taza: idf(docFreq=1, maxDocs=2) = 1.00
  26. Index per language
        PUT /catalog_es_es/event/18
        {
          "Date": "2016-12-04T20:00:00Z",
          "Timezone": "Europe/Madrid",
          "shipment_deadline": "2016-12-01T23:00:00Z",
          "name": "Barcelona - Real Madrid",
          "title": "Entradas Barcelona - Real Madrid"
        }

        GET /catalog_es_es/event/18
  27. Field per language
        PUT /catalog/event/18
        {
          "Date": "2016-12-04T20:00:00Z",
          "Timezone": "Europe/Madrid",
          "shipment_deadline": "2016-12-01T23:00:00Z",
          "name_es_es": "Barcelona - Real Madrid",
          "title_es_es": "Entradas Barcelona - Real Madrid",
          "name_en_gb": "Barcelona v Real Madrid",
          "title_en_gb": "Barcelona v Real Madrid Tickets"
        }

        GET /catalog/event/18
  28. Index / Field per language
    ~ Support different analysers per locale
    ~ Avoid TF/IDF and stemming problems
    ~ Hit all shards
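
To make the two document shapes concrete, the sketch below converts an event document with a `translations` object into either layout: one flat document per `catalog_<locale>` index, or a single document with `<field>_<locale>` fields. The helper names are hypothetical and the talk does not show how the conversion was actually done.

```python
def split_by_locale(event: dict, base_index: str = "catalog") -> dict:
    """Index per language: return {index_name: flat_document}, one per locale."""
    shared = {k: v for k, v in event.items() if k != "translations"}
    return {
        f"{base_index}_{locale}": {**shared, **fields}
        for locale, fields in event.get("translations", {}).items()
    }


def flatten_translations(event: dict) -> dict:
    """Field per language: return one document with <field>_<locale> keys."""
    flat = {k: v for k, v in event.items() if k != "translations"}
    for locale, fields in event.get("translations", {}).items():
        for field, value in fields.items():
            flat[f"{field}_{locale}"] = value
    return flat


event = {
    "Date": "2016-12-04T20:00:00Z",
    "translations": {
        "es_es": {"name": "Barcelona - Real Madrid"},
        "en_gb": {"name": "Barcelona v Real Madrid"},
    },
}

per_index = split_by_locale(event)
per_field = flatten_translations(event)
print(sorted(per_index))  # ['catalog_en_gb', 'catalog_es_es']
print(sorted(k for k in per_field if k.startswith("name_")))  # ['name_en_gb', 'name_es_es']
```

In the index-per-language shape each locale can be indexed on its own, while the field-per-language shape needs every translation present when the single document is written.
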
  29. Index compared to Field
    ~ More flexible
    ~ Doesn't need to index all languages at once
    ~ N indices (#languages) from the very beginning
  30. Benchmarking
    Index per language
    ~ Documents successfully indexed with up to 25 shards in 50+ indices
    ~ Happy with query performance for 63 indices, 3 shards per index and 3 nodes
  31. Ticketbis use case
    #documents in 2016 (after integration)
    ~ 130K docs in 63 languages = 8.5M docs / year
    ~ 10GB / year (2.3GB / year before integration)
    ~ Rule of thumb: < 32GB / shard
    ~ Field per language could be a problem
  32. Conclusions
    ~ Capacity planning is not the first step
    ~ It always depends on your needs
    ~ A little overallocation can be good
    ~ Don't mix languages and mind TF/IDF