in the form by …

Before: only track name
− Track name
Query time ≈ 1 sec

After: various search items
− Track name items: Track name, Additional track name, Editable additional track name
− Album name items
− Artist name items
− Label Product Code
− ISRC
Query time ≈ 5 sec
field. The field holds a raw string.
− Query: wildcard (term-level query)
− Returns documents that contain terms matching a wildcard pattern
− Patterns beginning with * (i.e., backward match) can slow search performance
− Example: the query *bc* matches bc, abc, bcd, abcd, …
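To illustrate why a leading-wildcard pattern is slow, here is a minimal Python sketch (not Elasticsearch itself): a term-level wildcard query must test the pattern against terms in the field's term dictionary, and with a leading * the engine cannot seek into the sorted term index, so it ends up scanning every term.

```python
from fnmatch import fnmatch

# Toy term dictionary for the field. A pattern like "*bc*" cannot use
# the sorted term index, so every term must be checked: O(#terms).
terms = ["bc", "abc", "bcd", "abcd", "xyz"]

matches = [t for t in terms if fnmatch(t, "*bc*")]
print(matches)  # ['bc', 'abc', 'bcd', 'abcd']
```

With millions of distinct terms, this linear scan is what makes the backward match slow.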
release
− The text data type requires tokenizing
− Track, album, and artist names are proper nouns and short
− They include neologisms or slang, so they may be difficult to decompose into morphemes
− Example: yonkey, Haunter
− Query: wildcard (term-level query)

After:
− Data type: text
− Analyzed field
− Tokenizer: ?
− Query: match (full-text query)
− Uses an inverted index
− High search performance even with huge data
performing a search
− Query: match (full-text query)
− Tokenizer: n-gram
− The tokenizer breaks a word into a contiguous sequence of n characters
  − 1-gram: abcde → [a, b, c, d, e]
  − 2-gram: abcde → [ab, bc, cd, de]
  − 3-gram: abcde → [abc, bcd, cde]

Documents (analyzed with 1-gram):
  ID 1 (tokyo) → [t, o, k, y, o]
  ID 2 (kyoto) → [k, y, o, t, o]

The query "tokyo" is analyzed with 1-gram into [t, o, k, y, o] and matched against the inverted index:
  t → 1, 2, 4
  o → 1, 2, 3
  k → 1, 2
  y → 1, 2
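A minimal Python sketch of the idea above: an n-gram tokenizer and a 1-gram inverted index built over the two example documents. It also shows the weakness the next slide addresses: with 1-grams, the query "tokyo" matches "kyoto" as well, because character order is lost.

```python
def ngrams(text, n):
    """Break a word into a contiguous sequence of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("abcde", 1))  # ['a', 'b', 'c', 'd', 'e']
print(ngrams("abcde", 2))  # ['ab', 'bc', 'cd', 'de']
print(ngrams("abcde", 3))  # ['abc', 'bcd', 'cde']

# Build a 1-gram inverted index (term -> set of document IDs).
docs = {1: "tokyo", 2: "kyoto"}
index = {}
for doc_id, word in docs.items():
    for gram in ngrams(word, 1):
        index.setdefault(gram, set()).add(doc_id)

# Every 1-gram of the query "tokyo" appears in both documents,
# so "kyoto" also matches: word order is lost.
query_grams = set(ngrams("tokyo", 1))
hits = set.intersection(*(index[g] for g in query_grams))
print(sorted(hits))  # [1, 2]
```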
the order of a word! Change the query dynamically
− Change the n-gram size by the length of the search word
  − length 1: 1-gram → [a]
  − length 2: 2-gram → [ab]
  − length ≧ 3: 3-gram → [tok, oky, kyo]

Documents (analyzed with 1-, 2-, and 3-grams):
  ID 1 (tokyo) → [t, o, k, y, o] / [to, ok, ky, yo] / [tok, oky, kyo]
  ID 2 (kyoto) → [k, y, o, t, o] / [ky, yo, ot, to] / [kyo, yot, oto]

The query "tokyo" is analyzed with 3-gram into [tok, oky, kyo] and matched against the inverted index:
  tok → 1
  oky → 1
  kyo → 1, 2
  yot → 2
  oto → 2
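A minimal Python sketch of the dynamic query: pick the n-gram size from the query length (capped at 3), then require a document to contain every query gram of that size. With 3-grams, "tokyo" no longer matches "kyoto". The `search` helper and the cap of 3 are illustrative assumptions, not the talk's exact implementation.

```python
def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def query_grams(query, max_n=3):
    """Choose the n-gram size dynamically from the query length:
    1-gram for 1 char, 2-gram for 2 chars, 3-gram for 3+ chars."""
    n = min(len(query), max_n)
    return ngrams(query, n)

docs = {1: "tokyo", 2: "kyoto"}

def search(query):
    grams = query_grams(query)
    n = len(grams[0])
    # A document matches when it contains every query gram of size n.
    return sorted(
        doc_id for doc_id, word in docs.items()
        if all(g in ngrams(word, n) for g in grams)
    )

print(search("tokyo"))  # 3-grams [tok, oky, kyo] -> [1] only
print(search("kyo"))    # 3-gram [kyo] -> [1, 2]
```

Requiring all grams of the largest feasible size preserves enough character order to separate anagram-like words while still supporting very short queries.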
CMS
− A backward match as a wildcard query slows search performance with huge data
− A match query gives better search performance than a wildcard query
− N-gram tokenizing with a dynamic query solves the word-order problem
− The approach can be applied to other data consisting of short words and neologisms
− It is not a complete solution, although it may be no problem in practice
Scaling out by adding nodes (Master Node ×5, Data Node ×7) → more running cost …
to work efficiently with the same number of data nodes? Shard and replica

Shards: 3, Replicas: 2
− The data is split into Primary Shards A, B, and C
− Each primary shard has two replica shards (Replica A ×2, Replica B ×2, Replica C ×2)
− Primaries and replicas are distributed across the data nodes
Data Nodes ≦ Shards × (Replicas + 1)
− Shards: 2, Replicas: 2 → 2 × 3 = 6 shard copies (Primary A, Replica A ×2, Primary B, Replica B ×2): all 6 data nodes are used
− Shards: 2, Replicas: 1 → 2 × 2 = 4 shard copies: 2 data nodes are not used
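The inequality above can be sketched as a quick Python check (a hypothetical helper, not an Elasticsearch API): since each shard has one primary plus its replicas, and two copies of the same shard never share a node, at most shards × (replicas + 1) data nodes can hold shard data.

```python
def usable_nodes(shards, replicas):
    """Upper bound on data nodes that can hold shard data:
    each shard contributes 1 primary + `replicas` copies, and
    copies of the same shard must live on different nodes."""
    return shards * (replicas + 1)

nodes = 6
for shards, replicas in [(3, 2), (2, 2), (2, 1)]:
    capacity = usable_nodes(shards, replicas)
    idle = max(0, nodes - capacity)
    print(f"shards={shards} replicas={replicas}: "
          f"up to {capacity} nodes used, {idle} idle")
# shards=3 replicas=2: up to 9 nodes used, 0 idle
# shards=2 replicas=2: up to 6 nodes used, 0 idle
# shards=2 replicas=1: up to 4 nodes used, 2 idle
```

This is why, with 6 data nodes, the (shards=2, replicas=1) layout leaves two nodes unused.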
match as a wildcard query slows search performance with huge data
− N-gram tokenizing with a dynamic query solves both search performance and the word-order problem
− Case: API for record labels
− The load on Elasticsearch increases due to use from the labels' systems
− We added data nodes and optimized the number of shards and replicas
− Data Nodes ≦ Shards × (Replicas + 1)
− Let's benchmark with different numbers of shards