search
  _ one main video index
  _ 5 shards with 1 replica
  _ nodes: 32 cores, 48 GB RAM, 15k disks
> 600 to 1000 search requests per second
> end-to-end response time < 40 ms
control of the query base score
> problem is our text content is thin
  _ short title, a few tags
  _ a more or less relevant description
> bare-bones TF-IDF may not be suitable
  _ TF not that relevant to us
length
> why common terms query
  _ increase performance
  _ ignore popular terms when searching
  _ but still use them for scoring
  _ like a real-time specialized stop words list

similarity:
  my_bm25:
    type: BM25
    b: 0.001
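
To illustrate what a near-zero `b` does, here is a minimal sketch of the term-frequency part of BM25 (IDF left out). The function name, the document lengths and the average length are made up for the example; `k1 = 1.2` and `b = 0.75` are the usual Lucene defaults, and only `b = 0.001` comes from the settings above.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (IDF omitted).

    b controls how strongly field length normalizes the score:
    b = 0 ignores length entirely, b = 1 applies full normalization.
    """
    norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Hypothetical lengths: a 5-term field and a 500-term field, average 50.
short_default = bm25_tf(1, doc_len=5, avg_len=50)
long_default = bm25_tf(1, doc_len=500, avg_len=50)

# Same fields with b = 0.001, as in the similarity settings.
short_flat = bm25_tf(1, doc_len=5, avg_len=50, b=0.001)
long_flat = bm25_tf(1, doc_len=500, avg_len=50, b=0.001)

print(f"default b: short={short_default:.3f} long={long_default:.3f}")
print(f"b=0.001:   short={short_flat:.3f} long={long_flat:.3f}")
```

With the default `b` the short field scores far higher than the long one for the same single match; with `b = 0.001` the two scores are nearly identical, so field length stops driving the ranking.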
(TF) only if repeated in query
a doc titled “A A A game” has a better score than “A game” only when explicitly searching for “A A A”
> boost term by position in query and documents

[figure: search query and score. query: "brown fox zerzer"; document: "the quick brown fox jumps."; boosts by position: ^1.1 ^1.07 ^1.05 ^1.03 ^1.02]
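
A rough sketch of position-based boosting, using the five boost values from the slide. How the boosts are actually wired into the query and the index is not shown in the source, so the helper names, the fallback boost of 1.0 for later positions, and the choice of Lucene's `term^boost` query-string syntax are all assumptions.

```python
# Boost values taken from the slide, indexed by term position.
POSITION_BOOSTS = [1.1, 1.07, 1.05, 1.03, 1.02]


def boost_by_position(tokens, boosts=POSITION_BOOSTS):
    """Pair each token with the boost for its position.

    Tokens beyond the last known boost get a neutral 1.0 (an assumption).
    """
    return [
        (tok, boosts[i] if i < len(boosts) else 1.0)
        for i, tok in enumerate(tokens)
    ]


def to_query_string(tokens):
    """Render boosted tokens in Lucene query-string syntax (term^boost)."""
    return " ".join(f"{tok}^{boost:g}" for tok, boost in boost_by_position(tokens))


query = to_query_string("brown fox zerzer".split())
print(query)  # brown^1.1 fox^1.07 zerzer^1.05
```

Earlier terms get slightly larger boosts, so a match on the first word of a query or title weighs a little more than a match further along.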
runs at 100 % for each query
> fewer requests per second for the same hardware

[figure: time spent per shard]
16 shards: 9 ms + 10 ms + 6 ms + (…) = 140 ms spent by the shards
  shard 0 (9 ms)    shard 1 (10 ms)   shard 2 (6 ms)    shard 3 (9 ms)
  shard 4 (10 ms)   shard 5 (10 ms)   shard 6 (9 ms)    shard 7 (10 ms)
  shard 8 (10 ms)   shard 9 (8 ms)    shard 10 (5 ms)   shard 11 (10 ms)
  shard 12 (7 ms)   shard 13 (7 ms)   shard 14 (10 ms)  shard 15 (10 ms)
1 shard: shard 0, 110 ms spent
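
The arithmetic behind the figure can be checked with a few lines of Python. The per-shard timings and the 110 ms single-shard figure are the ones on the slide; treating the slowest shard as the query latency assumes the shards are searched in parallel and ignores coordination overhead.

```python
# Per-shard timings (ms) for the 16-shard layout, as shown on the slide.
shard_ms = [9, 10, 6, 9, 10, 10, 9, 10, 10, 8, 5, 10, 7, 7, 10, 10]

# Time spent by the single shard in the 1-shard layout (from the slide).
single_shard_ms = 110

total_ms = sum(shard_ms)                # work done across all shards
slowest_ms = max(shard_ms)              # rough latency if shards run in parallel
extra_ms = total_ms - single_shard_ms   # extra work caused by splitting

print(f"total work:  {total_ms} ms")
print(f"slowest:     {slowest_ms} ms")
print(f"extra work:  {extra_ms} ms")
```

Splitting the index lowers the time each query waits, but the cluster as a whole does more work per query, which is why the same hardware serves fewer requests per second.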
40 shards on 18 nodes
  _ ~2 million docs per shard
  _ 3 GB per shard
  _ ~120 GB total index size
> cluster was very loaded
  _ every single query was hitting all the nodes
  _ response times could have been better
per shard
  _ 4 GB per shard
  _ ~25 GB total index size
> only data we need right now
  _ { "_source" : false }
  _ round numbers and dates
  _ { "precision_step" : 2147483647 }
> fewer updates, faster indexing, rebalancing, merges...
with Tsung
  _ dedicated test cluster
  _ run real queries, lots of them
  _ aim for our expected load
  _ monitor everything
  _ reshard, change schema
  _ set masters, data-only nodes...
repeat
Elasticsearch to just filter and sort
> these queries match millions of documents
  _ they are slow
  _ even when terms are cached
  _ iterating, scoring and sorting is tedious
are not enough hits?
  _ re-run the query without the filter
> we use a custom query to do just that!
  _ breaks once it matches enough hits
  _ runs at segment level
  _ no round-trips
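
The fallback logic can be sketched on plain Python lists. This is not the custom query itself, which runs inside Elasticsearch at segment level. It is a simplified client-side model, and every name in it (`take_matches`, `search_with_fallback`, the sample documents) is hypothetical.

```python
from itertools import islice


def take_matches(docs, predicate, limit):
    """Collect at most `limit` matching docs, stopping the scan early."""
    return list(islice((d for d in docs if predicate(d)), limit))


def search_with_fallback(docs, matches_query, matches_filter, wanted):
    """Try query + filter first; drop the filter if too few hits come back."""
    strict = take_matches(
        docs, lambda d: matches_query(d) and matches_filter(d), wanted
    )
    if len(strict) >= wanted:
        return strict
    # Not enough hits: re-run the query without the filter.
    return take_matches(docs, matches_query, wanted)


# Made-up sample data for illustration only.
docs = [
    {"title": "fox", "lang": "en"},
    {"title": "fox", "lang": "fr"},
    {"title": "cat", "lang": "en"},
]
is_fox = lambda d: d["title"] == "fox"
is_english = lambda d: d["lang"] == "en"

one_hit = search_with_fallback(docs, is_fox, is_english, wanted=1)
two_hits = search_with_fallback(docs, is_fox, is_english, wanted=2)
```

Done from the client, the second pass costs an extra request. Doing the same thing inside a single query is what removes the round-trip.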