Elasticsearch at Ticketbis

Elasticsearch at Ticketbis: a real story in 50+ languages

@jgargallo Jose Gargallo

A marketplace where fans can buy and sell tickets

~ 400 employees
~ 50 countries
~ 90 M€ in 2015
~ May 2016: acquired by StubHub (eBay)

Index
~ Elasticsearch intro
~ Capacity planning
~ Ticketbis use case for 50+ languages

Elasticsearch intro
~ A distributed RESTful search engine
~ Lucene: the search engine library behind ES
~ Basic concepts
  - Index (~ database)
  - Type (~ table)
  - Document (~ row)

Elasticsearch concepts
[Diagram: a cluster of 3 nodes holding primary shards P0, P1, P2 and replica shards R0, R1, R2]
~ 3 nodes, 3 shards per index and 1 replica per shard (up to 6 nodes)

Elasticsearch concepts
[Diagram: a cluster of 3 nodes holding primary shards P0, P1, P2 and two replicas of each shard]
~ 3 nodes, 3 shards per index and 2 replicas per shard (up to 9 nodes)
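
The "up to 6 nodes" and "up to 9 nodes" figures follow from counting shard copies: every primary plus each of its replicas can live on its own node. A minimal sketch of that arithmetic (the function name `shard_copies` is made up for illustration):

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies in one index: each primary plus its replicas."""
    return primaries * (1 + replicas)


for replicas in (1, 2):
    total = shard_copies(primaries=3, replicas=replicas)
    # With at most one shard copy per node, `total` is also the largest
    # number of nodes that can each hold a piece of this index.
    print(f"3 primaries, {replicas} replica(s) per shard: {total} shard copies")
```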

How many shards?

How large should our cluster be?

It depends!

Capacity planning
Growth expectation
~ Size & number of documents
~ Query volume
~ Write load

Capacity planning
Overallocation (#shards > #nodes)
~ 3 shards for a 2-node cluster (1 replica)
[Diagram: a cluster of 2 nodes holding primary shards P0, P1, P2 and replica shards R0, R1, R2]

Capacity planning
Avoid the "kagillion shards" problem
~ A shard is a Lucene index (and consumes its own resources)
~ A search request touches all shards
~ Poor document relevance (TF / IDF)

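“Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde”
(English: "It is the neighbour who elects the mayor, and it is the mayor who wants the neighbours to be the mayor")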

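“Es el vecino el que elige al alcalde y es el alcalde el que quiere que sean los vecinos el alcalde”
(English: "It is the neighbour who elects the mayor, and it is the mayor who wants the neighbours to be the mayor")
tf(freq=3.0) = 1.73
. . .
idf(docFreq=1, maxDocs=1) = 0.30
idf(docFreq=2, maxDocs=10) = 2.20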
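
The scores on this slide are consistent with Lucene's classic TF/IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The sketch below assumes that formula; the function names are illustrative, not Lucene's API. It shows why the same term scores differently depending on how many documents a shard holds.

```python
import math


def tf(freq: float) -> float:
    """Classic term-frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic inverse document frequency, computed per shard."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# "alcalde" appears three times in the sentence.
print(f"tf(freq=3)                  = {tf(3.0):.4f}")
# A shard holding a single document vs. a shard holding ten documents.
print(f"idf(docFreq=1, maxDocs=1)   = {idf(1, 1):.4f}")
print(f"idf(docFreq=2, maxDocs=10)  = {idf(2, 10):.4f}")
```

The slide shows these values cut to two decimals. Because idf depends on per-shard document counts, spreading few documents over many shards makes relevance scores inconsistent between shards.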

Capacity planning
Replicas
~ Improve performance and throughput
~ Provide high availability and better performance for read-intensive search
~ The number of replicas can be changed at any time
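
Changing the replica count on a live index goes through the index settings endpoint with the `number_of_replicas` setting. The sketch below only builds the request (no network call is made); the helper name `replica_update_request` and the index name `catalog` are assumptions for illustration.

```python
import json


def replica_update_request(index: str, replicas: int):
    """Return (method, path, body) for an update-settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return "PUT", f"/{index}/_settings", body


method, path, body = replica_update_request("catalog", 2)
print(method, path)
print(body)
```

The number of primary shards, by contrast, is fixed when the index is created, which is why it is the part that needs planning.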

Got it! So… How large should our cluster be?

Ticketbis use case
Size & number of documents
~ 1.2KB per document
~ 30K docs in 63 languages = 2M docs / year
~ 2.3GB / year
~ Rule of thumb: < 32GB / shard
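
The sizing above is plain multiplication. A small sketch that reproduces it, assuming decimal units (1GB = 1,000,000KB) and using a made-up helper name, `yearly_volume`:

```python
def yearly_volume(docs_per_language: int, languages: int, kb_per_doc: float):
    """Return (documents per year, gigabytes per year)."""
    docs = docs_per_language * languages
    gigabytes = docs * kb_per_doc / 1_000_000  # decimal units: 1GB = 10^6 KB
    return docs, gigabytes


docs, gb = yearly_volume(30_000, 63, 1.2)
print(f"{docs:,} docs / year, about {gb:.1f}GB / year")

# How long one shard would last under the 32GB rule of thumb,
# assuming the yearly volume stays constant.
print(f"about {32 / gb:.0f} years to fill a 32GB shard")
```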

Easy! One shard! Let’s go for a beer...

Wait! How do we structure our multilingual document?

Event document
{
  "Id": 18,
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "translations": {
    "es_es": {
      "name": "Barcelona - Real Madrid",
      "title": "Entradas Barcelona - Real Madrid"
    },
    "en_gb": {
      "name": "Barcelona v Real Madrid",
      "title": "Barcelona v Real Madrid Tickets - La Liga"
    }
  }
}

I18n approaches
~ Parent / Child
~ Nested data
~ Shard per language
~ Type per language
~ Index per language
~ Field per language

Parent / Child
PUT /catalog
{
  "mappings": {
    "event": {...},   // non i18n fields (date, ...)
    "translations": {
      "_parent": { "type": "event" },
      "properties": {
        "locale": { "type": "string" },
        "name", ....   // rest of i18n fields
      }
    }
  }
}

Parent / Child
~ Parent and children live in the same shard
~ Can update a single translation
~ Always hits all shards
~ Doesn’t support different analysers per locale

Nested data
PUT /catalog
{
  "mappings": {
    "event": {
      "properties": {
        …   // non i18n fields (date, ...)
        "translations": {
          "type": "nested",
          "properties": {
            "locale": { "type": "string" },
            "name", …   // rest of i18n fields
          }
        }
      }
    }
  }
}

Nested data
~ Nested data is stored in the same segment
~ Queries up to 10 times faster than parent/child
~ Always hits all shards
~ Can’t update a single translation
~ Doesn’t support different analysers per locale

Shard per language (routing)
PUT /catalog/event/18?routing=es_es
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog/event/18?routing=es_es

Shard per language (routing)
~ Only hits one shard
~ Better query performance
~ Doesn’t support different analysers per locale
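
Routing works because the target shard is derived from a hash of the routing value modulo the number of primary shards, so every request carrying the same locale lands on the same shard. The sketch below illustrates that shape only: `zlib.crc32` is a stand-in, not the hash Elasticsearch actually uses, so the real shard numbers will differ.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing) modulo the primary shard count.

    Stand-in hash: Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


locales = ["es_es", "en_gb", "fr_fr", "de_de"]
for locale in locales:
    print(locale, "-> shard", shard_for(locale, 3))
```

With more locales than shards, several locales necessarily share a shard, so "shard per language" is only exact when the hash happens to spread locales evenly.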

Type per language
PUT /catalog/event_es_es/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog/event_es_es/18

Type per language
~ Supports different analysers per locale
~ Same fields in different types within the same index share the same inverted index

locale es_es: “Una relajante taza de café con leche en Plaza Mayor”
Taza: idf(docFreq=1, maxDocs=1) = 0.30

locale es_es: “Una relajante taza de café con leche en Plaza Mayor”
locale en_gb: “A relaxing cup of café con leche in Plaza Mayor”
Taza: idf(docFreq=1, maxDocs=2) = 1.00
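
The shift in idf for "taza" comes purely from the English document being counted in the same inverted index. A rough sketch of that effect, assuming the classic idf formula 1 + ln(maxDocs / (docFreq + 1)) and naive whitespace tokenisation (real analysers do more); the helper names are illustrative:

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def doc_freq(term: str, docs) -> int:
    """Number of documents containing the term (whitespace tokens)."""
    return sum(term in doc.lower().split() for doc in docs)


es_doc = "Una relajante taza de café con leche en Plaza Mayor"
en_doc = "A relaxing cup of café con leche in Plaza Mayor"

spanish_only = [es_doc]
mixed = [es_doc, en_doc]

for term in ("taza", "café"):
    alone = idf(doc_freq(term, spanish_only), len(spanish_only))
    shared = idf(doc_freq(term, mixed), len(mixed))
    print(f"{term}: Spanish-only index {alone:.2f}, mixed index {shared:.2f}")
```

A Spanish word looks rarer, and so more relevant, simply because documents in other languages inflate the document count.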

Index per language
PUT /catalog_es_es/event/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name": "Barcelona - Real Madrid",
  "title": "Entradas Barcelona - Real Madrid"
}
GET /catalog_es_es/event/18

Field per language
PUT /catalog/event/18
{
  "Date": "2016-12-04T20:00:00Z",
  "Timezone": "Europe/Madrid",
  "shipment_deadline": "2016-12-01T23:00:00Z",
  "name_es_es": "Barcelona - Real Madrid",
  "title_es_es": "Entradas Barcelona - Real Madrid",
  "name_en_gb": "Barcelona v Real Madrid",
  "title_en_gb": "Barcelona v Real Madrid Tickets"
}
GET /catalog/event/18

Index / Field per language
~ Supports different analysers per locale
~ Avoids TF/IDF and stemming problems
~ Hits all shards

Index compared to Field
~ More flexible
~ Doesn’t need to index all languages at once
~ N indices (#languages) from the very beginning
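
With one index per language, each index can declare its own analyser for the translated fields. The sketch below generates one index definition per locale. The locale-to-analyser table and the helper name are assumptions for illustration; `spanish` and `english` are built-in Elasticsearch language analysers, and the `string` field type follows the mappings shown earlier in the deck.

```python
import json

# Hypothetical locale -> analyser table; extend it as languages are added.
ANALYZERS = {"es_es": "spanish", "en_gb": "english"}


def index_definition(locale: str):
    """Return (path, body) for creating the index of one locale."""
    text_field = {"type": "string", "analyzer": ANALYZERS[locale]}
    body = {
        "mappings": {
            "event": {
                "properties": {"name": text_field, "title": text_field}
            }
        }
    }
    return f"/catalog_{locale}", body


for locale in ANALYZERS:
    path, body = index_definition(locale)
    print("PUT", path)
    print(json.dumps(body, indent=2))
```

A new language then only needs a new entry and a new index, without reindexing the existing ones.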

We like flexibility! Are 50+ indices (shards) too much overallocation?

Benchmarking index per language
~ Documents indexed successfully with up to 25 shards in 50+ indices
~ Happy with query performance for 63 indices, 3 shards per index and 3 nodes

Ticketbis use case
#documents in 2016 (after integration)
~ 130K docs in 63 languages = 8.5M docs / year
~ 10GB / year (2.3GB / year before integration)
~ Rule of thumb: < 32GB / shard
~ Field per language could be a problem

Conclusions
~ Capacity planning is not the first step
~ It always depends on your needs
~ A little overallocation can be good
~ Don’t mix languages, and mind TF/IDF

Thank you! Jose Gargallo @jgargallo

We are hiring! [email protected] @TicketbisEng ...and we are remote friendly!
