Elasticsearch at Ticketbis

Slide 1

Slide 1 text

Elasticsearch at Ticketbis: a real story in 50+ languages

Slide 2

Slide 2 text

@jgargallo Jose Gargallo

Slide 3

Slide 3 text

A marketplace where fans can buy and sell tickets

Slide 4

Slide 4 text

~ 400 employees
~ 50 countries
~ €90M in 2015
~ May 2016: acquired by StubHub (eBay)

Slide 5

Slide 5 text

Contents
~ Elasticsearch intro
~ Capacity planning
~ Ticketbis use case for 50+ languages

Slide 6

Slide 6 text

Elasticsearch intro
~ A distributed RESTful search engine
~ Lucene: the search engine library behind ES
~ Basic concepts
  - Index (~ database)
  - Type (~ table)
  - Document (~ row)

Slide 7

Slide 7 text

Elasticsearch concepts
~ 3 nodes, 3 shards per index and 1 replica per shard (up to 6 nodes)
[Diagram: a cluster of Node 1, Node 2 and Node 3 holding primary shards P0, P1, P2 and replica shards R0, R1, R2]

Slide 8

Slide 8 text

Elasticsearch concepts
~ 3 nodes, 3 shards per index and 2 replicas per shard (up to 9 nodes)
[Diagram: the same cluster of Node 1, Node 2 and Node 3 holding primary shards P0, P1, P2 and two replicas of each shard]
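The "up to 6 nodes" and "up to 9 nodes" figures follow from counting shard copies: every primary and every replica can sit on its own node, so the total number of copies caps how far the index can spread. A minimal Python sketch of that arithmetic (the function name is illustrative and not part of any Elasticsearch API):

```python
def max_useful_nodes(primary_shards: int, replicas_per_shard: int) -> int:
    """Total shard copies of an index, i.e. the most nodes that can each hold one copy."""
    return primary_shards * (1 + replicas_per_shard)


print(max_useful_nodes(3, 1))  # 3 primaries + 3 replica copies
print(max_useful_nodes(3, 2))  # 3 primaries + 6 replica copies
```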

Slide 9

Slide 9 text

How many shards?

Slide 10

Slide 10 text

How large should our cluster be?

Slide 12

Slide 12 text

It depends!

Slide 13

Slide 13 text

Capacity planning
Growth expectation
~ Size & number of documents
~ Query volume
~ Write load

Slide 14

Slide 14 text

Capacity planning
Overallocation (#shards > #nodes)
~ 3 shards for a 2-node cluster (1 replica)
[Diagram: Node 1 and Node 2 sharing primary shards P0, P1, P2 and replica shards R0, R1, R2]

Slide 15

Slide 15 text

Capacity planning
Avoid the "kagillion shards" problem
~ A shard is a Lucene index (and consumes resources of its own)
~ A search request touches all shards
~ Poor document relevance (TF/IDF)

Slide 16

Slide 16 text

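“Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde”
(English: "It is the neighbour who elects the mayor, and it is the mayor who wants the neighbours to be the mayor")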

Slide 17

Slide 17 text

“Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde”
tf(freq=3.0) = 1.73
. . .
idf(docFreq=1, maxDocs=1) = 0.30
idf(docFreq=2, maxDocs=10) = 2.20
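The figures on this slide are consistent with Lucene's classic TF/IDF similarity, where tf is the square root of the term count and idf is 1 + ln(maxDocs / (docFreq + 1)). Assuming those formulas, the following Python sketch recomputes the values; the slide appears to truncate them to two decimals rather than round them.

```python
import math


def tf(freq: float) -> float:
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: terms found in fewer documents score higher."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


print(f"tf(freq=3.0)                  = {tf(3.0):.4f}")
print(f"idf(docFreq=1, maxDocs=1)     = {idf(1, 1):.4f}")
print(f"idf(docFreq=2, maxDocs=10)    = {idf(2, 10):.4f}")
```

The point for capacity planning is that docFreq and maxDocs are counted per shard, so the same term can receive a different weight depending on how documents are spread across shards.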

Slide 18

Slide 18 text

Capacity planning
Replicas
~ Improve performance and throughput
~ High availability and better performance for read-intensive search
~ The number of replicas can be changed at any time
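Because the replica count is a dynamic index setting, it can be adjusted with a settings update rather than a reindex. A minimal Python sketch that builds the request body; the helper name is illustrative, and the `catalog` index name is borrowed from the later slides:

```python
import json


def replica_settings(replicas: int) -> dict:
    """Body for an index settings update that changes only the replica count."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return {"index": {"number_of_replicas": replicas}}


# Intended to be sent as the body of: PUT /catalog/_settings
print(json.dumps(replica_settings(2)))
```

The number of primary shards, by contrast, is fixed when the index is created, which is why the shard count is the part that needs planning.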

Slide 19

Slide 19 text

Got it! So… How large should our cluster be?

Slide 21

Slide 21 text

Ticketbis use case
Size & number of documents
~ 1.2 KB per document
~ 30K docs in 63 languages ≈ 2M docs / year
~ 2.3 GB / year
~ Rule of thumb: < 32 GB / shard
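The sizing above is plain multiplication, sketched here in Python. It assumes the 30K figure is documents per language per year, uses decimal units (1 GB = 10^6 KB), and ignores any index overhead on top of the raw document size.

```python
DOC_SIZE_KB = 1.2           # average document size, from the slide
DOCS_PER_LANGUAGE = 30_000  # assumed to be per language, per year
LANGUAGES = 63
SHARD_LIMIT_GB = 32         # rule-of-thumb ceiling per shard


def yearly_volume(docs_per_language: int, languages: int, doc_size_kb: float):
    """Return (documents per year, gigabytes per year)."""
    docs = docs_per_language * languages
    gigabytes = docs * doc_size_kb / 1_000_000
    return docs, gigabytes


docs, gb = yearly_volume(DOCS_PER_LANGUAGE, LANGUAGES, DOC_SIZE_KB)
print(f"{docs:,} docs/year, {gb:.2f} GB/year")
print("fits in one shard:", gb < SHARD_LIMIT_GB)
```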

Slide 22

Slide 22 text

Easy! One shard! Let’s go for a beer...

Slide 23

Slide 23 text

Wait! How do we structure our multilingual document?

Slide 24

Slide 24 text

Event document
{
  "Id": 18,
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "translations": {
    "es_es": {
      "name": "Barcelona - Real Madrid",
      "title": "Entradas Barcelona - Real Madrid"
    },
    "en_gb": {
      "name": "Barcelona v Real Madrid",
      "title": "Barcelona v Real Madrid Tickets - La Liga"
    }
  }
}

Slide 25

Slide 25 text

I18n approaches
~ Parent / Child
~ Nested data
~ Shard per language
~ Type per language
~ Index per language
~ Field per language

Slide 26

Slide 26 text

Parent / Child
PUT /catalog
{
  "mappings": {
    "event": {...},  // non-i18n fields (date, ...)
    "translations": {
      "_parent": { "type": "event" },
      "properties": {
        "locale": { "type": "string" },
        "name", ....  // rest of i18n fields
      }
    }
  }
}

Slide 27

Slide 27 text

Parent / Child
~ Parent and child live in the same shard
~ Can update a single translation
~ Always hits all shards
~ Doesn’t support different analysers per locale

Slide 29

Slide 29 text

Nested data
PUT /catalog
{
  "mappings": {
    "event": {
      "properties": {
        …  // non-i18n fields (date, ...)
        "translations": {
          "type": "nested",
          "properties": {
            "locale": { "type": "string" },
            "name", …  // rest of i18n fields
          }
        }
      }
    }
  }
}

Slide 30

Slide 30 text

Nested data
~ Nested data lives in the same segment
~ Queries 10 times faster than parent/child
~ Always hits all shards
~ Can’t update a single translation
~ Doesn’t support different analysers per locale
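With the nested mapping, a search has to say which translation it means, otherwise a match in any locale counts. A sketch of such a request body, built in Python with the `nested` query; it assumes `locale` is indexed as a single lowercase token (for example as a not-analysed string), and the helper name is illustrative:

```python
import json


def nested_translation_query(locale: str, text: str) -> dict:
    """Match `text` in the name of the translation for one locale only."""
    return {
        "query": {
            "nested": {
                "path": "translations",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"translations.locale": locale}},
                            {"match": {"translations.name": text}},
                        ]
                    }
                },
            }
        }
    }


# Intended to be sent as the body of: GET /catalog/event/_search
print(json.dumps(nested_translation_query("es_es", "Barcelona"), indent=2))
```

Both clauses sit inside the same `nested` query so that they must hold for the same translation object, not for two different ones.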

Slide 32

Slide 32 text

Shard per language (routing)
PUT /catalog/event/18?routing=es_es
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog/event/18?routing=es_es

Slide 33

Slide 33 text

Shard per language (routing)
~ Only hits one shard
~ Better query performance
~ Doesn’t support different analysers per locale
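Routing works by hashing the routing value and taking it modulo the number of primary shards, so every request carrying the same locale lands on the same shard. The Python sketch below illustrates the idea only: `zlib.crc32` is a stand-in hash, so the shard numbers it prints will not match what Elasticsearch computes, and `pt_br` is a made-up extra locale.

```python
import zlib


def shard_for(routing: str, primary_shards: int) -> int:
    """Pick a shard from a routing value: hash it, then take it modulo the shard count.

    zlib.crc32 is a stand-in for illustration; Elasticsearch uses its own hash
    function, so real shard numbers will differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % primary_shards


for locale in ("es_es", "en_gb", "pt_br"):
    print(locale, "->", shard_for(locale, 3))
```

Since the shard is chosen by a hash, distinct locales can still end up on the same shard; routing guarantees that one locale stays together, not that each locale gets a shard to itself.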

Slide 35

Slide 35 text

Type per language
PUT /catalog/event_es_es/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog/event_es_es/18

Slide 36

Slide 36 text

Type per language
~ Supports different analysers per locale
~ Fields with the same name in different types of the same index share the same inverted index

Slide 37

Slide 37 text

locale es_es: “Una relajante taza de café con leche en Plaza Mayor”
Taza: idf(docFreq=1, maxDocs=1) = 0.30

Slide 38

Slide 38 text

locale es_es: “Una relajante taza de café con leche en Plaza Mayor”
locale en_gb: “A relaxing cup of café con leche in Plaza Mayor”
Taza: idf(docFreq=1, maxDocs=2) = 1.00
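The jump from 0.30 to 1.00 comes purely from the English document being counted in the same inverted index. A small Python sketch of that effect, assuming the classic Lucene-style idf of 1 + ln(maxDocs / (docFreq + 1)) and naive whitespace tokenisation (a real analyser would do more):

```python
import math

ES_ES = "Una relajante taza de café con leche en Plaza Mayor"
EN_GB = "A relaxing cup of café con leche in Plaza Mayor"


def idf(term: str, docs: list) -> float:
    """Classic Lucene-style idf computed over a tiny in-memory corpus."""
    doc_freq = sum(term in doc.lower().split() for doc in docs)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


print(f"taza, Spanish only: {idf('taza', [ES_ES]):.4f}")
print(f"taza, both locales: {idf('taza', [ES_ES, EN_GB]):.4f}")
print(f"café, both locales: {idf('café', [ES_ES, EN_GB]):.4f}")
```

A word such as "taza" looks rarer, and therefore more important, only because documents in another language share the index, while words common to both locales are pushed the other way.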

Slide 39

Slide 39 text

Index per language
PUT /catalog_es_es/event/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog_es_es/event/18

Slide 40

Slide 40 text

Field per language
PUT /catalog/event/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name_es_es": "Barcelona - Real Madrid",
  "title_es_es": "Entradas Barcelona - Real Madrid",
  "name_en_gb": "Barcelona v Real Madrid",
  "title_en_gb": "Barcelona v Real Madrid Tickets"
}
GET /catalog/event/18
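With one field per language, each field can be mapped to its own analyser. The sketch below generates such mapping properties in Python. The locale-to-analyser table is an assumption for illustration (`spanish` and `english` are built-in Elasticsearch language analysers), and `"type": "string"` follows the Elasticsearch version used on these slides; later versions use `text`.

```python
import json

# Illustrative assumption: which built-in language analyser serves each locale.
ANALYZERS = {"es_es": "spanish", "en_gb": "english"}
I18N_FIELDS = ("name", "title")


def i18n_properties(analyzers: dict, fields=I18N_FIELDS) -> dict:
    """Build one analysed string field per (field, locale) pair."""
    return {
        f"{field}_{locale}": {"type": "string", "analyzer": analyzer}
        for locale, analyzer in analyzers.items()
        for field in fields
    }


print(json.dumps(i18n_properties(ANALYZERS), indent=2, sort_keys=True))
```

With 63 locales and a handful of translated fields, the mapping grows to several hundred fields in a single index, which is part of the trade-off against one index per language.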

Slide 41

Slide 41 text

Index / Field per language
~ Supports different analysers per locale
~ Avoids TF/IDF and stemming problems
~ Hits all shards

Slide 42

Slide 42 text

Index compared to Field
~ More flexible
~ Doesn’t need to index all languages at once
~ N indices (#languages) from the very beginning
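Part of that flexibility is that a request can target one language index or several of them by name. A short Python sketch that builds the search path from locales, following the `catalog_<locale>` naming used on the earlier slide; the helper names are illustrative:

```python
def index_name(locale: str, base: str = "catalog") -> str:
    """Index naming used on the slides: catalog_es_es, catalog_en_gb, ..."""
    return f"{base}_{locale}"


def search_path(locales, base: str = "catalog", doc_type: str = "event") -> str:
    """Search URL path covering one or several language indices."""
    indices = ",".join(index_name(locale, base) for locale in locales)
    return f"/{indices}/{doc_type}/_search"


print(search_path(["es_es"]))
print(search_path(["es_es", "en_gb"]))
```

A new language then only means creating one more index, and a single locale can be reindexed without touching the others.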

Slide 43

Slide 43 text

We like flexibility! Are 50+ indices (shards) too much overallocation?

Slide 44

Slide 44 text

Benchmarking
Index per language
~ Documents indexed successfully with up to 25 shards in 50+ indices
~ Happy with query performance for 63 indices, 3 shards per index and 3 nodes

Slide 45

Slide 45 text

Ticketbis use case
#documents in 2016 (after integration)
~ 130K docs in 63 languages ≈ 8.5M docs / year
~ 10 GB / year (2.3 GB / year before integration)
~ Rule of thumb: < 32 GB / shard
~ Field per language could be a problem
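One way to read "Field per language could be a problem" is as shard headroom: with every locale in a single index, all of the yearly growth accumulates in the same shards. The Python sketch below is a rough estimate only. It assumes data is never deleted, spreads evenly across shards and locales, and that each index has a single shard.

```python
SHARD_LIMIT_GB = 32      # rule of thumb from the slide
GROWTH_GB_PER_YEAR = 10  # yearly growth after integration, from the slide
LANGUAGES = 63


def years_until_limit(growth_gb_per_year: float, shards: int) -> float:
    """Years until every shard reaches the limit, assuming an even spread."""
    return SHARD_LIMIT_GB * shards / growth_gb_per_year


# Field per language: all locales share one index (here, one shard).
print(f"field per language: {years_until_limit(GROWTH_GB_PER_YEAR, shards=1):.1f} years")
# Index per language: one single-shard index per locale.
print(f"index per language: {years_until_limit(GROWTH_GB_PER_YEAR, shards=LANGUAGES):.1f} years")
```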

Slide 46

Slide 46 text

Conclusions
~ Capacity planning is not the first step
~ It always depends on your needs
~ A little overallocation can be good
~ Don’t mix languages, and mind TF/IDF