Knowledge Derived from Search for Huge Track Metadata with Elasticsearch

Speaker
− Taku Tada
− LINE Corporation
− Development Center 3
− Service Development 1
− Development Team T
− Server-Side Engineer of LINE MUSIC

LINE MUSIC
- Music streaming service
- Platform: iOS, Android, Web
- About 85 million tracks
- Music recommendation for you
- MV, BGM setting in your LINE profile

Services for Record Labels: CMS, API
− Search and check metadata
  − Track
  − Album
  − Artist
  − Video
  − Video Album
− Get play count reports
− Get ranking data

Meta Search API (today's topic)
[Diagram: the CMS Backend Server and the API Server send search requests to the Meta Search API Server, which sends search queries to Elasticsearch.]

About 85,000,000 Tracks

Agenda
- Case: Search feature in CMS
- Case: API for Record Labels
- Summary

Search feature in CMS
The form to search metadata

How to improve the search feature
You can search tracks in the form by …
Before: only track name (query time ≈ 1 sec)
− Track name
After: various search items (query time ≈ 5 sec)
− Track name items
  − Track name
  − Additional track name
  − Editable additional track name
− Album name items
− Artist name items
− Label Product Code
− ISRC

Why so slow?
Before:
− Data type: keyword
  − Not-analyzed field. The field holds the raw string.
− Query: wildcard (term-level query)
  − Returns documents that contain terms matching a wildcard pattern
  − Patterns beginning with * (i.e., backward match) can slow search performance.
Example: query → *bc*; accepted words → bc, abc, bcd, abcd, …
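The *bc* example can be reproduced with a minimal sketch. It uses Python's fnmatch as a stand-in for wildcard matching and is not how Elasticsearch is implemented; the term list is hypothetical. The point it illustrates is that a pattern starting with * gives no fixed prefix to narrow the candidates, so every term has to be checked.

```python
from fnmatch import fnmatchcase

# Hypothetical terms standing in for the raw strings of a keyword field.
terms = ["bc", "abc", "bcd", "abcd", "xyz"]

def wildcard_scan(pattern, terms):
    """Check every term against the pattern; a leading * leaves no prefix to narrow the scan."""
    return [t for t in terms if fnmatchcase(t, pattern)]

matches = wildcard_scan("*bc*", terms)
print(matches)
```

With tens of millions of tracks and several fields, this kind of per-term checking is repeated for each field in the query.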

Why keyword type?
− Fast delivery was the priority for the first release
− The text data type requires tokenization
− Track, album, and artist names are proper nouns and short
− They include neologisms and slang, so it may be difficult to decompose them into morphemes
Example: yonkey, Haunter

Data type
Before:
− Data type: keyword
− Not-analyzed field
− Query: wildcard (term-level query)
After:
− Data type: text
− Analyzed field
− Tokenize: ?
− Query: match (full-text query)
− Uses the inverted index
− High search performance even with huge data

Tokenize
Query: match (full-text query)
- The match query analyzes any provided text before performing a search
Tokenize: n-gram
- The tokenizer breaks a word down into contiguous sequences of n characters
  1-gram: abcde → [a, b, c, d, e]
  2-gram: abcde → [ab, bc, cd, de]
  3-gram: abcde → [abc, bcd, cde]
Example: the search word tokyo is analyzed (Tokenize: 1-gram) into [t, o, k, y, o] and matched against the inverted index.
Inverted index:
  Token | ID
  t     | 1, 2, 4
  o     | 1, 2, 3
  k     | 1, 2
  y     | 1, 2
Documents:
  ID 1 (tokyo): [t, o, k, y, o]
  ID 2 (kyoto): [k, y, o, t, o]
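The tokenization and lookup on this slide can be sketched in plain Python. This is an illustration only, not Elasticsearch's implementation: the helper names are hypothetical, only the two documents shown on the slide are indexed, and requiring every token of the search word to be present is an assumption about how the match is evaluated.

```python
from collections import defaultdict

def ngrams(word, n):
    """Return the contiguous n-character tokens of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_index(docs, n):
    """Map each n-gram token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in ngrams(text, n):
            index[token].add(doc_id)
    return index

def match_all(index, query, n):
    """IDs of documents that contain every n-gram of the query."""
    postings = [index.get(tok, set()) for tok in ngrams(query, n)]
    return set.intersection(*postings) if postings else set()

docs = {1: "tokyo", 2: "kyoto"}
unigram_index = build_index(docs, 1)
print(ngrams("abcde", 2))
print(sorted(match_all(unigram_index, "tokyo", 1)))
```

With 1-grams, the search word tokyo matches both tokyo and kyoto, because the two documents contain the same characters in a different order.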

N-gram & Dynamic query
A full-text query can't guarantee the order of characters within a word!
Change the query dynamically: change the n-gram size according to the length of the search word.
  a     → 1-gram → [a]
  ab    → 2-gram → [ab]
  tokyo → 3-gram (length ≧ 3) → [tok, oky, kyo]
Inverted index (3-gram):
  Token | ID
  tok   | 1
  oky   | 1
  kyo   | 1, 2
  yot   | 2
  oto   | 2
Documents:
  ID 1 (tokyo): [t, o, k, y, o], [to, ok, ky, yo], [tok, oky, kyo]
  ID 2 (kyoto): [k, y, o, t, o], [ky, yo, ot, to], [kyo, yot, oto]
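A minimal sketch of the dynamic choice of n-gram size, under the same assumption that every token of the search word must be present in a document. The function names are hypothetical and the matching is done in plain Python rather than in Elasticsearch.

```python
def ngrams(word, n):
    """Return the contiguous n-character tokens of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def gram_size(search_word, max_n=3):
    """Pick 1-gram, 2-gram, or 3-gram from the length of the search word."""
    return max(1, min(len(search_word), max_n))

def matches(doc_text, search_word):
    """True if every n-gram of the search word occurs in the document."""
    n = gram_size(search_word)
    doc_tokens = set(ngrams(doc_text, n))
    return all(tok in doc_tokens for tok in ngrams(search_word, n))

docs = {1: "tokyo", 2: "kyoto"}
hits = [doc_id for doc_id, text in docs.items() if matches(text, "tokyo")]
print(gram_size("a"), gram_size("ab"), gram_size("tokyo"))
print(hits)
```

With 3-grams, kyoto no longer matches the search word tokyo, since it contains kyo but neither tok nor oky.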

Wildcard vs Match
[Chart: response time [sec] against number of fields for wildcard and match queries. Labeled values: wildcard 4.6 sec, match 0.2 sec.]
Wildcard query:
  [
    { "wildcard": { "name": { "value": "*test*" } } },
    { "wildcard": { "additional_name": { "value": "*test*" } } }
  ]
Match query:
  [
    { "match": { "name.3gram": "test" } },
    { "match": { "additional_name.3gram": "test" } },
    { "match": { "editable_additional_name.3gram": "test" } },
    { "match": { "artist_name.3gram": "test" } }
  ]
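The match clauses shown on the slide could be generated per search word along these lines. This is a sketch under assumptions: the subfield names for shorter words (.1gram, .2gram) are guessed by analogy with the .3gram fields on the slide, and wrapping the clauses in a bool/should query is one possible way to combine them, not something the slide confirms.

```python
import json

def build_match_clauses(search_word, fields, max_n=3):
    """Build one match clause per field, targeting an n-gram subfield."""
    n = max(1, min(len(search_word), max_n))
    return [{"match": {f"{field}.{n}gram": search_word}} for field in fields]

fields = ["name", "additional_name", "editable_additional_name", "artist_name"]
clauses = build_match_clauses("test", fields)

# Assumption: the clauses are combined with a bool/should query.
body = {"query": {"bool": {"should": clauses}}}
print(json.dumps(body, indent=2))
```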

Knowledge derived from the case of the search feature in CMS
− A wildcard query with a backward match (a pattern beginning with *) slows search performance on huge data
− A match query gives better search performance than a wildcard query
− N-gram & dynamic query solve the problem of character order within a word
− The approach can be applied to other data made up of short words that include neologisms
− It is not a complete solution, although it may cause no problems in practice

Agenda
- Case: Search feature in CMS
- Case: API for Record Labels
- Summary

API for Record Labels
[Diagram: the CMS Backend Server and the API Server send search requests to the Meta Search API Server, which sends search queries to Elasticsearch. On the record label side, label staff and the label's system are the clients.]
Load increases

Adding Data Nodes
[Diagram: a cluster of master nodes and data nodes, shown before and after more data nodes are added.]
More running cost …

Efficient use of Data Nodes
− What should you set to work efficiently with the same number of data nodes?
Shard and Replica
[Diagram: with Shard: 3, the data is split into Shard A, Shard B, and Shard C.]
[Diagram: with Shard: 3 and Replica: 2, each of Primary Shards A, B, and C has two replica shards, and the primaries and replicas are distributed across the data nodes.]

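Shards Arrangement
Data Nodes ≦ Shards × (Replicas + 1)
[Diagram: with Shard: 2 and Replica: 1, there are four shard copies (Primary A, Replica A, Primary B, Replica B) and two data nodes are not used. With Shard: 2 and Replica: 2, there are six shard copies (Primary A, Replica A, Replica A, Primary B, Replica B, Replica B).]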
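The inequality on this slide can be checked with a few lines of arithmetic. This is a sketch for a single index; the function names are hypothetical, and the figure of 6 data nodes is an assumption inferred from the two "Not used" nodes in the Shard: 2, Replica: 1 arrangement.

```python
def shard_copies(shards, replicas):
    """Total shard copies: each primary plus its replicas."""
    return shards * (replicas + 1)

def idle_nodes(data_nodes, shards, replicas):
    """Data nodes left without any shard copy (0 when copies >= nodes)."""
    return max(0, data_nodes - shard_copies(shards, replicas))

# Assumption: 6 data nodes, inferred from the slide's two unused nodes.
print(idle_nodes(6, shards=2, replicas=1))
print(idle_nodes(6, shards=2, replicas=2))
```

When the number of data nodes exceeds Shards × (Replicas + 1), the extra nodes hold no copy of the index and add cost without taking search load.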

Benchmark of Shard (real data, 12 data nodes)
[Chart: response time [ms] (mean, min, max, p95) for Shard: 2 Replica: 5; Shard: 3 Replica: 3; Shard: 4 Replica: 2; Shard: 6 Replica: 1; Shard: 12 Replica: 1. Labeled values: 71.70, 65.25.]
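For anyone repeating such a benchmark, the four statistics in the chart can be computed from raw response times as sketched below. The sample values are hypothetical and are not the measurements from the talk; the nearest-rank method for p95 is one common definition and may differ from the one used for the chart.

```python
import math
import statistics

def summarize(samples_ms):
    """Return mean, min, max, and nearest-rank p95 of response times."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
    return {
        "mean": statistics.mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }

# Hypothetical samples, not the measurements from the talk.
samples = [60.0, 70.0, 80.0, 90.0, 200.0]
stats = summarize(samples)
print(stats)
```

Running the same summary for each shard and replica combination gives a like-for-like comparison across configurations.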

Elasticsearch Setting
Number of nodes, shards and replicas
  Data Nodes | 18
  Shards     | 6
  Replicas   | 2
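As an illustration, the shard and replica counts above correspond to the standard index settings number_of_shards and number_of_replicas. The sketch below only builds the settings body and checks it against the slide's inequality; how the body is sent to the cluster, and any other index settings, are outside what the slides show.

```python
import json

# Settings body for creating an index with 6 primary shards and 2 replicas each.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 6,
            "number_of_replicas": 2,
        }
    }
}

data_nodes = 18
shards = index_settings["settings"]["index"]["number_of_shards"]
replicas = index_settings["settings"]["index"]["number_of_replicas"]
copies = shards * (replicas + 1)

print(json.dumps(index_settings, indent=2))
print(copies, data_nodes <= copies)
```

Here 6 × (2 + 1) = 18, so the 18 data nodes can each hold exactly one shard copy.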

Summary
− Case: search feature in CMS
  − A wildcard query with a backward match slows search performance on huge data
  − N-gram & dynamic query solve both search performance and the problem of character order within a word
− Case: API for record labels
  − Load on Elasticsearch increases due to use from the labels' systems
  − We added data nodes and optimized the number of shards and replicas
  − Data Nodes ≦ Shards × (Replicas + 1)
  − Try benchmarking with different numbers of shards

Thank you


LINE DEVDAY 2021
