Knowledge Derived from Search for Huge Track Metadata with Elasticsearch

LINE DEVDAY 2021

November 11, 2021

Transcript

  1. Speaker
    − Taku Tada
    − LINE Corporation
    − Development Center 3
    − Service Development 1
    − Development Team T
    − Server-Side Engineer of LINE MUSIC
  2. LINE MUSIC
    − Music streaming service
    − Platform: iOS, Android, Web
    − About 85 million tracks
    − Music recommendation for you
    − MV, BGM setting in your LINE profile
  3. Services for Record Labels (CMS, API)
    − Search and check metadata: Track, Album, Artist, Video, Video Album
    − Get play count reports
    − Get ranking data
  4. Meta Search API (Today's Topic)
    [Diagram: CMS Backend Server and API Server send search requests to the Meta Search API Server, which sends a search query to Elasticsearch]
  5. How to improve the search feature
    You can search tracks in the form by …
    − Before: only track name (query time ≈ 1 sec)
      − Track name
    − After: various search items (query time ≈ 5 sec)
      − Track name items: Track name, Additional track name, Editable additional track name
      − Album name items
      − Artist name items
      − Label Product Code
      − ISRC
  6. Why so slow?
    Before:
    − Data Type: keyword
      − Not-analyzed field: the field holds the raw string.
    − Query: wildcard (term-level query)
      − Returns documents that contain terms matching a wildcard pattern
      − Patterns that begin with * (i.e., a backward match) can slow search performance.
    Example: query → *bc*, accepted words → bc, abc, bcd, abcd, …
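
Why a leading * is costly can be illustrated with a toy model. The sketch below is not how Elasticsearch is implemented; it only assumes that the terms of a field are kept in sorted order, so a prefix can be located by binary search while a pattern such as *bc* has to be tested against every term. The term list and function names are hypothetical.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

# Hypothetical sorted term dictionary for one keyword field.
TERMS = sorted(["abc", "abcd", "bc", "bcd", "xyz", "zzz"])


def wildcard_scan(pattern, terms):
    """Leading-wildcard lookup: every term is tested against the pattern."""
    matches = [t for t in terms if fnmatchcase(t, pattern)]
    return matches, len(terms)  # (matching terms, number of terms examined)


def prefix_lookup(prefix, terms):
    """Prefix lookup: binary search to the first candidate, then walk forward."""
    start = bisect_left(terms, prefix)
    matches = []
    examined = 0
    for term in terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(term)
    return matches, examined


print(wildcard_scan("*bc*", TERMS))  # examines all 6 terms
print(prefix_lookup("bc", TERMS))    # examines only 3 terms
```

In this model the cost of the *bc* lookup grows with the total number of distinct terms, which is the situation the slide describes for huge data.
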
  7. Why keyword type?
    − Fast delivery was the priority for the first release
    − The text data type requires tokenization
    − Track, album, and artist names are short proper nouns
    − They include neologisms and slang, so they can be difficult to decompose into morphemes
    Example: yonkey, Haunter
  8. Data type
    Before:
    − Data Type: keyword
      − Not-analyzed field
    − Query: wildcard (term-level query)
    After:
    − Data Type: text
      − Analyzed field
      − Tokenize: ?
    − Query: match (full text query)
      − Uses the inverted index
      − High search performance even with huge data
  9. Tokenize
    Query: match (full text query)
    − The match query analyzes any provided text before performing a search
    Tokenize: n-gram
    − The tokenizer breaks down a word into contiguous sequences of n characters
      1-gram: abcde → [a, b, c, d, e]
      2-gram: abcde → [ab, bc, cd, de]
      3-gram: abcde → [abc, bcd, cde]
    Example (Tokenize: 1-gram): the search word tokyo is analyzed into [t, o, k, y, o] and matched against the inverted index
    Documents:
      ID 1 (tokyo) → [t, o, k, y, o]
      ID 2 (kyoto) → [k, y, o, t, o]
    Inverted index (term → ID):
      t → 1, 2, 4
      o → 1, 2, 3
      k → 1, 2
      y → 1, 2
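
The n-gram tokenization and the inverted index lookup on this slide can be sketched in plain Python. This is a simplified model rather than Elasticsearch itself: the function names are hypothetical, and the matching rule (a document must contain every n-gram of the search word) is an assumption made for the illustration.

```python
from collections import defaultdict


def ngrams(text, n):
    """Split text into contiguous sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n):
    """Map each n-gram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def match_all(query, index, n):
    """IDs of documents that contain every n-gram of the query; order is ignored."""
    grams = ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


docs = {1: "tokyo", 2: "kyoto"}
index_1gram = build_index(docs, 1)

print(ngrams("abcde", 2))
print(sorted(match_all("tokyo", index_1gram, 1)))
```

With 1-grams, searching for tokyo also returns kyoto, because both words consist of the same characters and the lookup does not consider their order.
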
  10. N-gram & Dynamic query
    The full text query can't guarantee the order of the characters in a word!
    Change the query dynamically: change the n-gram size according to the length of the search word
      a → 1-gram → [a]
      ab → 2-gram → [ab]
      tokyo → 3-gram (length ≧ 3) → [tok, oky, kyo]
    Documents:
      ID 1 (tokyo) → [t, o, k, y, o] [to, ok, ky, yo] [tok, oky, kyo]
      ID 2 (kyoto) → [k, y, o, t, o] [ky, yo, ot, to] [kyo, yot, oto]
    Inverted index for 3-grams (term → ID):
      tok → 1
      oky → 1
      kyo → 1, 2
      yot → 2
      oto → 2
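
The dynamic choice of n-gram size can be sketched as follows. The slides do not show how the n-grams of the search word are combined, so this sketch assumes that a document matches only when it contains all of them; the function names and the MAX_N constant are hypothetical.

```python
MAX_N = 3  # largest n-gram size shown on the slide


def ngrams(text, n):
    """Split text into contiguous sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def choose_n(word):
    """1-gram for one character, 2-gram for two, 3-gram for three or more."""
    return min(len(word), MAX_N)


def search(word, docs):
    """IDs of documents whose n-grams include every n-gram of the search word."""
    if not word:
        return []
    n = choose_n(word)
    wanted = set(ngrams(word, n))
    return sorted(
        doc_id for doc_id, text in docs.items()
        if wanted <= set(ngrams(text, n))
    )


docs = {1: "tokyo", 2: "kyoto"}
print(search("tokyo", docs))  # 3-grams tok, oky, kyo: only ID 1 has all three
print(search("kyo", docs))    # 3-gram kyo: contained in both documents
print(search("to", docs))     # 2-gram to: contained in both documents
```

Under this assumption the check is still on sets of n-grams, not on substrings: search("abcab", {1: "bcabc"}) returns [1] even though abcab does not occur in bcabc. That kind of false positive may be one reason the approach is not a complete solution.
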
  11. Wildcard vs Match
    [Chart: response time [sec] against number of fields for wildcard and match queries; annotated values: wildcard 4.6 sec, match 0.2 sec]
  12. Wildcard vs Match
    [Chart: response time [sec] against number of fields for wildcard and match queries; annotated values: wildcard 4.6 sec, match 0.2 sec]
    Wildcard query clauses:
      [ { "wildcard": { "name": { "value": "*test*" } } },
        { "wildcard": { "additional_name": { "value": "*test*" } } } ]
    Match query clauses:
      [ { "match": { "name.3gram": "test" } },
        { "match": { "additional_name.3gram": "test" } },
        { "match": { "editable_additional_name.3gram": "test" } },
        { "match": { "artist_name.3gram": "test" } } ]
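
Field names such as name.3gram suggest a multi-field mapping in which each keyword field has an n-gram-analyzed text sub-field, although the slides do not show the index definition. The sketch below builds such a definition and the match clauses as Python dictionaries. The analyzer and tokenizer names, the keyword parent type, and both helper functions are assumptions; only the ngram tokenizer settings (min_gram, max_gram) and the multi-field "fields" syntax come from Elasticsearch itself.

```python
import json

FIELDS = ["name", "additional_name", "editable_additional_name", "artist_name"]


def index_body(fields, n=3):
    """Settings and mappings with one n-gram sub-field per field (assumed layout)."""
    analyzer = f"{n}gram_analyzer"    # hypothetical name
    tokenizer = f"{n}gram_tokenizer"  # hypothetical name
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    tokenizer: {"type": "ngram", "min_gram": n, "max_gram": n}
                },
                "analyzer": {
                    analyzer: {"type": "custom", "tokenizer": tokenizer}
                },
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "keyword",
                    "fields": {f"{n}gram": {"type": "text", "analyzer": analyzer}},
                }
                for field in fields
            }
        },
    }


def match_clauses(fields, word, n=3):
    """One match clause per field, targeting the n-gram sub-field."""
    return [{"match": {f"{field}.{n}gram": word}} for field in fields]


print(json.dumps(match_clauses(FIELDS, "test"), indent=2))
```

match_clauses(FIELDS, "test") reproduces the four match clauses listed on the slide; how those clauses are wrapped into a complete query is not shown in the slides and is left out here.
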
  13. Knowledge derived from the search feature in CMS
    − A backward match with a wildcard query slows search performance on huge data
    − A match query gives better search performance than a wildcard query
    − N-gram & dynamic query solve the problem of the order of a word
    − The approach can be applied to other data made up of short words that include neologisms
    − It is not a complete solution, although it may cause no problems in practice
  14. API for Record Labels
    [Diagram: CMS Backend Server and API Server send search requests to the Meta Search API Server, which sends a search query to Elasticsearch]
  15. API for Record Labels
    [Diagram: same architecture, with a Record Label (Label staff, Label System) added as a user of the services]
  16. API for Record Labels
    [Diagram: same architecture with the Record Label (Label staff, Label System); annotation: Load increases]
  17. Adding Data Nodes
    [Diagram: a cluster of Master Nodes and Data Nodes, shown before and after Data Nodes are added]
    More running cost …
  18. Efficient use of Data Nodes
    − What should you set to work efficiently with the same number of data nodes? Shard and Replica
    [Diagram: data split into Shard A, Shard B, Shard C and placed on data nodes. Shard: 3]
  19. Efficient use of Data Nodes
    − What should you set to work efficiently with the same number of data nodes? Shard and Replica
    [Diagram: data split into Primary Shards A, B, C, each with two Replica Shards, placed on data nodes. Shard: 3, Replica: 2]
  20. Shards Arrangement
    Data Nodes ≦ Shards × (Replicas + 1)
    [Diagram: with Shard: 2, Replica: 2 there are six shard copies (Primary A, Replica A ×2, Primary B, Replica B ×2); with Shard: 2, Replica: 1 there are four (Primary A, Replica A, Primary B, Replica B) and two data nodes are marked "Not used"]
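
The relation Data Nodes ≦ Shards × (Replicas + 1) can be checked with a few lines of arithmetic. The sketch assumes that a data node counts as used only if it holds at least one shard copy, and it takes six data nodes for the example, which is inferred from the diagram rather than stated. All function names are hypothetical.

```python
def shard_copies(shards, replicas):
    """Total shard copies: each primary shard plus its replicas."""
    return shards * (replicas + 1)


def unused_data_nodes(data_nodes, shards, replicas):
    """Data nodes left without any shard copy."""
    return max(0, data_nodes - shard_copies(shards, replicas))


def exact_fit_settings(data_nodes):
    """(shards, replicas) pairs whose copies equal the number of data nodes."""
    return [
        (shards, data_nodes // shards - 1)
        for shards in range(1, data_nodes + 1)
        if data_nodes % shards == 0
    ]


print(unused_data_nodes(6, shards=2, replicas=1))  # 4 copies on 6 nodes
print(unused_data_nodes(6, shards=2, replicas=2))  # 6 copies on 6 nodes
print(exact_fit_settings(6))
```

exact_fit_settings only lists combinations that place exactly one copy on each node; which of them responds fastest still has to be measured.
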
  21. Benchmark of Shard
    Real Data, 12 Data Nodes
    [Chart: response time [ms] (mean, min, max, p95) for Shard: 2 Replica: 5; Shard: 3 Replica: 3; Shard: 4 Replica: 2; Shard: 6 Replica: 1; Shard: 12 Replica: 1. Annotated values: 71.70, 65.25]
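
The four statistics in the chart (mean, min, max, p95) can be computed from a list of measured response times as sketched below. The slides do not describe how the benchmark was run or how p95 was calculated, so the nearest-rank percentile and the function name are assumptions, and the sample values are made up purely to exercise the function.

```python
import math
import statistics


def summarize(latencies_ms):
    """Mean, min, max, and nearest-rank 95th percentile of response times."""
    if not latencies_ms:
        raise ValueError("at least one measurement is required")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method, 1-based
    return {
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


# Made-up sample values, not measurements from the talk.
print(summarize([40, 10, 30, 20]))
```

Running the same set of queries against each shard and replica setting and comparing these summaries is one way to choose between them.
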
  22. Summary
    − Case: search feature in CMS
      − A backward match with a wildcard query slows search performance on huge data
      − N-gram & dynamic query solve both search performance and the order of a word
    − Case: API for record labels
      − The load on Elasticsearch increases due to use from the labels' systems
      − We added data nodes and optimized the number of shards and replicas
      − Data Nodes ≦ Shards × (Replicas + 1)
      − Try benchmarking with different numbers of shards