Knowledge Derived from Search for Huge Track Metadata with Elasticsearch

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Speaker
− Taku Tada
− LINE Corporation
− Development Center 3
− Service Development 1
− Development Team T
− Server-Side Engineer of LINE MUSIC

Slide 3

Slide 3 text

LINE MUSIC
- Music streaming service
- Platform: iOS, Android, Web
- About 85 million tracks
- Music recommendation for you
- MV, BGM setting in your LINE profile

Slide 4

Slide 4 text

Services for Record Labels: CMS, API
− Search and check metadata
  − Track
  − Album
  − Artist
  − Video
  − Video Album
− Get play count reports
− Get ranking data

Slide 5

Slide 5 text

Meta Search API
[Diagram: the CMS Backend Server and the API Server send search requests to the Meta Search API Server, which sends search queries to Elasticsearch. Label: Today’s Topic]

Slide 6

Slide 6 text

About 85,000,000 Tracks

Slide 7

Slide 7 text

Agenda
- Case: Search feature in CMS
- Case: API for Record Labels
- Summary

Slide 8

Slide 8 text

Search feature in CMS
The form to search metadata

Slide 9

Slide 9 text

How to improve the search feature
You can search tracks in the form by …
Before: only track name (query time ≈ 1 sec)
− Track name
After: various search items (query time ≈ 5 sec)
− Track name items
  − Track name
  − Additional track name
  − Editable additional track name
− Album name items
− Artist name items
− Label Product Code
− ISRC

Slide 10

Slide 10 text

Why so slow?
Before:
− Data type: keyword
  − Non-analyzed field; the field holds the raw string.
− Query: wildcard (term-level query)
  − Returns documents that contain terms matching a wildcard pattern
  − Patterns beginning with * (i.e., a backward match) can slow search performance.
Example: query → *bc*; accepted words → bc, abc, bcd, abcd, …
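
The wildcard example on this slide can be reproduced outside Elasticsearch. The following is only a minimal sketch, assuming Python's standard `fnmatch` module as a stand-in for the wildcard query; `wildcard_scan` and the term list are hypothetical names and data chosen for illustration. It tests every term, which hints at why a pattern starting with * is costly: there is no fixed prefix to narrow down the candidate terms.

```python
from fnmatch import fnmatchcase


def wildcard_scan(terms, pattern):
    """Return the terms that match a shell-style wildcard pattern.

    Every term is tested one by one; this is an illustration of the
    matching rule, not of how Elasticsearch evaluates the query.
    """
    return [term for term in terms if fnmatchcase(term, pattern)]


# Hypothetical raw (non-analyzed) terms, as a keyword field would hold them.
terms = ["bc", "abc", "bcd", "abcd", "acb", "b"]

matched = wildcard_scan(terms, "*bc*")
print(matched)  # ['bc', 'abc', 'bcd', 'abcd']
```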

Slide 11

Slide 11 text

Why keyword type?
− Fast delivery was the priority for the first release
− The text data type requires tokenization
− Track, album, and artist names are short proper nouns
− They include neologisms and slang, so they may be difficult to decompose into morphemes
Example: yonkey, Haunter

Slide 12

Slide 12 text

Data type
Before:
− Data type: keyword
  − Non-analyzed field
− Query: wildcard (term-level query)
After:
− Data type: text
  − Analyzed field
  − Tokenize: ?
− Query: match (full-text query)
  − Uses the inverted index
  − High search performance even with huge data

Slide 13

Slide 13 text

Tokenize
Query: match (full-text query)
- The match query analyzes any provided text before performing a search
Tokenize: n-gram
- The tokenizer breaks a word down into contiguous sequences of n characters
1-gram: abcde → [a, b, c, d, e]
2-gram: abcde → [ab, bc, cd, de]
3-gram: abcde → [abc, bcd, cde]
Documents
ID 1 (tokyo) [t, o, k, y, o]
ID 2 (kyoto) [k, y, o, t, o]
Inverted index (token → IDs)
t → 1, 2, 4
o → 1, 2, 3
k → 1, 2
y → 1, 2
Analyze (tokenize: 1-gram): tokyo → [t, o, k, y, o] → match against the inverted index
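
The tokenization and the inverted index on this slide can be sketched in a few lines. This is a plain-Python illustration, not Elasticsearch's tokenizer; `ngrams` and `build_inverted_index` are hypothetical helper names, and only the two documents shown on the slide (IDs 1 and 2) are indexed.

```python
from collections import defaultdict


def ngrams(text, n):
    """Split text into its contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_inverted_index(documents, n):
    """Map each n-gram token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in ngrams(text, n):
            index[token].add(doc_id)
    return index


documents = {1: "tokyo", 2: "kyoto"}
index = build_inverted_index(documents, 1)

print(ngrams("abcde", 2))  # ['ab', 'bc', 'cd', 'de']
print(sorted(index["t"]))  # [1, 2]
```

With 1-grams, every token of "tokyo" also appears in "kyoto", so a lookup by single characters alone cannot tell the two documents apart.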

Slide 14

Slide 14 text

N-gram & Dynamic query
The full-text query can't guarantee the order of the characters in a word!
Change the query dynamically: change the n-gram size according to the length of the search word
- a → 1-gram: [a]
- ab → 2-gram: [ab]
- tokyo (length ≧ 3) → 3-gram: [tok, oky, kyo]
Documents
ID 1 (tokyo) [t, o, k, y, o] [to, ok, ky, yo] [tok, oky, kyo]
ID 2 (kyoto) [k, y, o, t, o] [ky, yo, ot, to] [kyo, yot, oto]
Inverted index, 3-gram (token → IDs)
tok → 1
oky → 1
kyo → 1, 2
yot → 2
oto → 2
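
The effect of choosing the n-gram size from the search word's length can be sketched as follows. This is an illustrative model only: `choose_n`, `matching_ids`, and `dynamic_search` are hypothetical names, and the rule that a document must contain every n-gram of the search word is an assumption made for this sketch, not something the slides state about the production query.

```python
def ngrams(text, n):
    """Split text into its contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def choose_n(word, max_n=3):
    """Pick the n-gram size from the word length: 1, 2, or 3 (length >= 3)."""
    return max(1, min(len(word), max_n))


def matching_ids(documents, word, n):
    """Return IDs of documents that contain every n-gram of the word.

    Assumption for this sketch: all query tokens are required.
    """
    query_tokens = set(ngrams(word, n))
    if not query_tokens:
        return []
    return [doc_id for doc_id, text in documents.items()
            if query_tokens <= set(ngrams(text, n))]


def dynamic_search(documents, word):
    """Search with the n-gram size chosen from the word length."""
    return matching_ids(documents, word, choose_n(word))


documents = {1: "tokyo", 2: "kyoto"}

print(matching_ids(documents, "tokyo", 1))  # [1, 2]: character order is lost
print(dynamic_search(documents, "tokyo"))   # [1]: 3-grams keep local order
```

Even with 3-grams this remains an approximation: a document could contain all the required tokens in separate places without containing the search word itself.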

Slide 15

Slide 15 text

Wildcard vs Match
[Chart: response time [sec] versus number of fields for wildcard and match queries. Labeled values: wildcard 4.6 sec, match 0.2 sec]

Slide 16

Slide 16 text

Wildcard vs Match
[Chart: response time [sec] versus number of fields for wildcard and match queries. Labeled values: wildcard 4.6 sec, match 0.2 sec]
Wildcard query clauses:
    [
      { "wildcard": { "name": { "value": "*test*" } } },
      { "wildcard": { "additional_name": { "value": "*test*" } } }
    ]
Match query clauses:
    [
      { "match": { "name.3gram": "test" } },
      { "match": { "additional_name.3gram": "test" } },
      { "match": { "editable_additional_name.3gram": "test" } },
      { "match": { "artist_name.3gram": "test" } }
    ]
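
A list of match clauses like the one on this slide could be generated from the search word. The sketch below is hypothetical: the `.3gram` sub-field name comes from the slide, while the `.1gram` and `.2gram` sub-fields and the helpers `ngram_subfield` and `build_match_clauses` are assumptions for illustration. How the clauses are combined into the final query is not shown in the slides and is left out here.

```python
import json


def ngram_subfield(field, word_length, max_n=3):
    """Return the n-gram sub-field name for a search word of this length.

    Assumption: sub-fields are named <field>.1gram, <field>.2gram, <field>.3gram.
    """
    n = max(1, min(word_length, max_n))
    return f"{field}.{n}gram"


def build_match_clauses(fields, word):
    """Build one match clause per field, targeting the chosen sub-field."""
    return [{"match": {ngram_subfield(field, len(word)): word}}
            for field in fields]


fields = ["name", "additional_name", "editable_additional_name", "artist_name"]
clauses = build_match_clauses(fields, "test")
print(json.dumps(clauses, indent=2))
```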

Slide 17

Slide 17 text

Knowledge derived from the case of the search feature in CMS
− A backward match with a wildcard query slows search performance with huge data
− A match query gives better search performance than a wildcard query
− N-gram & dynamic query solve the problem of character order within a word
− It can be applied to other data consisting of short words that include neologisms
− It is not a complete solution, although it may be no problem in practice

Slide 18

Slide 18 text

Agenda
- Case: Search feature in CMS
- Case: API for Record Labels
- Summary

Slide 19

Slide 19 text

API for Record Labels
[Diagram: the CMS Backend Server and the API Server send search requests to the Meta Search API Server, which sends search queries to Elasticsearch.]

Slide 20

Slide 20 text

API for Record Labels
[Diagram: the CMS Backend Server, API Server, Meta Search API Server, and Elasticsearch, with a Record Label (label staff and label system) added as clients.]

Slide 21

Slide 21 text

API for Record Labels
[Diagram: the CMS Backend Server, API Server, Meta Search API Server, and Elasticsearch, with a Record Label (label staff and label system) added as clients. Label: Load increases]

Slide 22

Slide 22 text

Adding Data Nodes
[Diagram: a cluster of master nodes and data nodes, shown before and after adding data nodes.]
More running cost …

Slide 23

Slide 23 text

Efficient use of Data Nodes
− What should you set to work efficiently with the same number of data nodes?
Shard and Replica
[Diagram: the data is split into Shard A, Shard B, and Shard C. Shard: 3]

Slide 24

Slide 24 text

Efficient use of Data Nodes
− What should you set to work efficiently with the same number of data nodes?
Shard and Replica
[Diagram: the data is split into Primary Shards A, B, and C, each with two replica shards; the primaries and replicas are distributed across the data nodes. Shard: 3, Replica: 2]

Slide 25

Slide 25 text

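Shards Arrangement
Data Nodes ≦ Shards × (Replicas + 1)
[Diagram: two arrangements. Shard: 2, Replica: 2 gives six shard copies (Primary A, Replica A, Replica A, Primary B, Replica B, Replica B). Shard: 2, Replica: 1 gives four shard copies (Primary A, Replica A, Primary B, Replica B), and two data nodes are marked "Not used".]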
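
The inequality on this slide is simple arithmetic: an index has Shards × (Replicas + 1) shard copies in total, and a data node that receives no copy does no work for that index. A minimal sketch, assuming a cluster of 6 data nodes (the slide does not state the node count; 6 is chosen here because it reproduces the two "Not used" nodes), with hypothetical helper names:

```python
def shard_copies(shards, replicas):
    """Total shard copies of an index: each primary plus its replicas."""
    return shards * (replicas + 1)


def unused_data_nodes(data_nodes, shards, replicas):
    """Number of data nodes left without any shard copy of the index."""
    return max(0, data_nodes - shard_copies(shards, replicas))


DATA_NODES = 6  # assumption for illustration; not stated on the slide

print(shard_copies(2, 2), unused_data_nodes(DATA_NODES, 2, 2))  # 6 0
print(shard_copies(2, 1), unused_data_nodes(DATA_NODES, 2, 1))  # 4 2
```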

Slide 26

Slide 26 text

Benchmark of Shard
Real data, 12 data nodes
[Chart: response time [ms] (mean, min, max, p95) for five settings: Shard: 2 Replica: 5; Shard: 3 Replica: 3; Shard: 4 Replica: 2; Shard: 6 Replica: 1; Shard: 12 Replica: 1. Labeled values: 71.70, 65.25]
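
Candidate settings for such a benchmark can be listed by enumerating the shard and replica counts whose total number of shard copies equals the number of data nodes. This is a sketch under that assumption only; `exact_fill_configs` is a hypothetical helper, and it says nothing about which setting is fastest, which is what the benchmark measures.

```python
def exact_fill_configs(data_nodes):
    """List (shards, replicas) pairs with shards * (replicas + 1) == data_nodes."""
    configs = []
    for shards in range(1, data_nodes + 1):
        if data_nodes % shards == 0:
            replicas = data_nodes // shards - 1
            configs.append((shards, replicas))
    return configs


print(exact_fill_configs(12))
# [(1, 11), (2, 5), (3, 3), (4, 2), (6, 1), (12, 0)]
```

Four of the benchmarked settings, (2, 5), (3, 3), (4, 2), and (6, 1), appear in this list. The fifth, Shard: 12 Replica: 1, has 24 shard copies for 12 data nodes.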

Slide 27

Slide 27 text

Elasticsearch Setting
Number of nodes, shards, and replicas
− Data Nodes: 18
− Shards: 6
− Replicas: 2
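
The shard and replica counts on this slide map to the index settings `number_of_shards` and `number_of_replicas`. The sketch below builds such a settings body as a Python dictionary and checks it against Data Nodes ≦ Shards × (Replicas + 1). The variable names are hypothetical, and the slides do not show the actual index definition, so the body is an assumption about its shape.

```python
import json

DATA_NODES = 18  # from the slide

# Settings body as it might be sent when creating the index (assumed shape).
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 6,
            "number_of_replicas": 2,
        }
    }
}

idx = index_settings["settings"]["index"]
copies = idx["number_of_shards"] * (idx["number_of_replicas"] + 1)

print(json.dumps(index_settings, indent=2))
print(copies, DATA_NODES <= copies)  # 18 True
```

With these values the 18 shard copies match the 18 data nodes, so no data node is left without a copy.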

Slide 28

Slide 28 text

Summary
− Case: search feature in CMS
  − A backward match with a wildcard query slows search performance with huge data
  − N-gram & dynamic query solve both the search performance and the character-order problem
− Case: API for record labels
  − Load on Elasticsearch increases due to use from the labels' systems
  − We added data nodes and optimized the number of shards and replicas
  − Data Nodes ≦ Shards × (Replicas + 1)
  − Try benchmarking with different numbers of shards

Slide 29

Slide 29 text

Thank you