Knowledge Derived from Search for Huge Track Metadata with Elasticsearch

LINE DEVDAY 2021

November 11, 2021

Transcript

  1. None
  2. Speaker
    − Taku Tada
    − LINE Corporation
    − Development Center 3
    − Service Development 1
    − Development Team T
    − Server-Side Engineer of LINE MUSIC
  3. LINE MUSIC
    - Music streaming service
    - Platform: iOS, Android, Web
    - About 85 million tracks
    - Music recommendation for you
    - MV, BGM setting in your LINE profile
  4. Services for Record Labels (CMS, API)
    − Search and check metadata: Track, Album, Artist, Video, Video Album
    − Get play count reports
    − Get ranking data
  5. Meta Search API
    (Diagram: the CMS Backend Server and the API Server each send a search request to the Meta Search API Server, which sends a search query to Elasticsearch. Label: Today's Topic)
  6. About 85,000,000 Tracks

  7. Agenda
    - Case: Search feature in CMS
    - Case: API for Record Labels
    - Summary
  8. Search feature in CMS The form to search metadata

  9. How to improve the search feature
    You can search tracks in the form by …
    Before: only track name (query time ≈ 1 sec)
    − Track name
    After: various search items (query time ≈ 5 sec)
    − Track name items
      − Track name
      − Additional track name
      − Editable additional track name
    − Album name items
    − Artist name items
    − Label Product Code
    − ISRC
  10. Why so slow?
    Before:
    − Data Type: keyword
      − Not-analyzed field. The field holds the raw string.
    − Query: wildcard (term level query)
      − Returns documents that contain terms matching a wildcard pattern
      − Patterns that begin with * (i.e., backward match) can slow search performance.
    Example: query → *bc*, accepted words → bc, abc, bcd, abcd, …
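The cost of a leading `*` can be imitated in plain Python. This is only a sketch and not how Elasticsearch works internally: `fnmatchcase` stands in for the wildcard query, and the `terms` list is a made-up set of keyword values. Because the pattern has no fixed prefix, nothing narrows the candidates, so every stored term has to be tested.

```python
from fnmatch import fnmatchcase

# Hypothetical keyword terms: a keyword field stores each value as one raw term.
terms = ["bc", "abc", "bcd", "abcd", "xyz", "cb"]


def wildcard_scan(pattern, terms):
    """Test every term against the pattern; a leading * leaves no prefix to narrow the scan."""
    return [term for term in terms if fnmatchcase(term, pattern)]


matches = wildcard_scan("*bc*", terms)
print(matches)  # ['bc', 'abc', 'bcd', 'abcd']
```

With about 85 million tracks and several searched fields, this kind of full scan over terms is what the slide identifies as the slow part.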
  11. Why keyword type?
    − Fast delivery was the priority for the first release
    − The text data type requires tokenization
    − Track, album, and artist names are short proper nouns
    − They include neologisms and slang, so it may be difficult to decompose them into morphemes
    Example: yonkey, Haunter
  12. Data type
    Before:
    − Data Type: keyword
      − Not-analyzed field
    − Query: wildcard (term level query)
    After:
    − Data Type: text
      − Analyzed field
      − Tokenize: ?
    − Query: match (full text query)
      − Uses the inverted index
      − High search performance even with huge data
  13. Tokenize
    Query: match (full text query)
    - The match query analyzes any provided text before performing a search
    Tokenize: n-gram
    - The tokenizer breaks down a word into contiguous sequences of n characters
      1-gram: abcde → [a, b, c, d, e]
      2-gram: abcde → [ab, bc, cd, de]
      3-gram: abcde → [abc, bcd, cde]
    Example (Tokenize: 1-gram): the search word tokyo is analyzed into [t, o, k, y, o] and matched against the inverted index
    Documents:
      ID 1 (tokyo) → [t, o, k, y, o]
      ID 2 (kyoto) → [k, y, o, t, o]
    Inverted index (token → ID):
      t → 1, 2, 4
      o → 1, 2, 3
      k → 1, 2
      y → 1, 2
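The n-gram tokenization and inverted index on this slide can be sketched in a few lines of Python. The helper names `ngrams` and `build_inverted_index` are illustrative and not part of Elasticsearch. Only the two documents shown on the slide are indexed here, so the ID lists are shorter than the slide's, which also reference documents 3 and 4.

```python
from collections import defaultdict


def ngrams(word, n):
    """Split a word into contiguous n-character tokens."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_inverted_index(docs, n):
    """Map each n-gram token to the sorted IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in ngrams(text, n):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "tokyo", 2: "kyoto"}
index = build_inverted_index(docs, 1)

print(ngrams("abcde", 2))  # ['ab', 'bc', 'cd', 'de']
print(index["t"])          # [1, 2]
```

A lookup is then a dictionary access per token rather than a scan over all terms, which is why the match query stays fast on large data.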
  14. N-gram & Dynamic query
    The full text query can't guarantee the order of characters in a word!
    Change the query dynamically: change the n-gram size according to the length of the search word
      a → 1-gram → [a]
      ab → 2-gram → [ab]
      tokyo → 3-gram (length ≧ 3) → [tok, oky, kyo]
    Documents:
      ID 1 (tokyo) → [t, o, k, y, o], [to, ok, ky, yo], [tok, oky, kyo]
      ID 2 (kyoto) → [k, y, o, t, o], [ky, yo, ot, to], [kyo, yot, oto]
    Inverted index, 3-gram (token → ID):
      tok → 1
      oky → 1
      kyo → 1, 2
      yot → 2
      oto → 2
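The dynamic choice of n-gram size can be sketched as follows. The deck does not show the implementation, so this is a guess at its shape: `gram_size` and `build_match_query` are hypothetical helpers, and the `.1gram` and `.2gram` sub-field names are assumptions modelled on the `.3gram` suffix that appears in the deck's match queries.

```python
def ngrams(word, n):
    """Split a word into contiguous n-character tokens."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def gram_size(search_word, max_n=3):
    """Use 1-gram or 2-gram for very short words, 3-gram for length >= 3."""
    return max(1, min(len(search_word), max_n))


def build_match_query(field, search_word):
    """Build a match clause against the sub-field for the chosen n-gram size.

    The sub-field naming (<field>.<n>gram) is an assumption for illustration.
    """
    n = gram_size(search_word)
    return {"match": {f"{field}.{n}gram": search_word}}


# With 1-grams, tokyo and kyoto yield the same token set, so character order is lost.
same_unigrams = set(ngrams("tokyo", 1)) == set(ngrams("kyoto", 1))
# With 3-grams, the two words share only "kyo".
shared_trigrams = set(ngrams("tokyo", 3)) & set(ngrams("kyoto", 3))

print(same_unigrams)                       # True
print(shared_trigrams)                     # {'kyo'}
print(build_match_query("name", "tokyo"))  # {'match': {'name.3gram': 'tokyo'}}
```

Longer n-grams keep more of the local character order, which is the reason for using the largest size the search word allows.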
  15. Wildcard vs Match
    (Chart: Response Time [sec] vs. Number of fields, for wildcard and match queries. Labeled values: wildcard 4.6 sec, match 0.2 sec)
  16. Wildcard vs Match
    (Chart: Response Time [sec] vs. Number of fields, for wildcard and match queries. Labeled values: wildcard 4.6 sec, match 0.2 sec)
    Wildcard query:
      [
        { "wildcard": { "name": { "value": "*test*" } } },
        { "wildcard": { "additional_name": { "value": "*test*" } } }
      ]
    Match query:
      [
        { "match": { "name.3gram": "test" } },
        { "match": { "additional_name.3gram": "test" } },
        { "match": { "editable_additional_name.3gram": "test" } },
        { "match": { "artist_name.3gram": "test" } }
      ]
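The `name.3gram` field in the match queries suggests a sub-field analyzed with a 3-gram tokenizer. The deck does not show the index definition, so the following is only a sketch of what such a definition might look like, written as a Python dict. The tokenizer and analyzer names, and the choice of `keyword` for the parent field, are assumptions. The `ngram` tokenizer type with `min_gram` and `max_gram`, and multi-fields via `fields`, are standard Elasticsearch features.

```python
import json

# Assumed names: "trigram_tokenizer" and "trigram_analyzer" are illustrative.
# Only the "3gram" sub-field suffix is taken from the queries in the deck.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {"type": "custom", "tokenizer": "trigram_tokenizer"}
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword",  # assumption: the raw value stays available
                "fields": {
                    "3gram": {"type": "text", "analyzer": "trigram_analyzer"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The same pattern would be repeated for the other searched fields and, if the query is chosen dynamically, for 1-gram and 2-gram sub-fields as well.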
  17. Knowledge derived from the case of the search feature in CMS
    − A backward match with a wildcard query slows search performance on huge data
    − A match query gives better search performance than a wildcard query
    − N-gram & dynamic query solve the problem of character order within a word
    − The approach can be applied to other data made up of short words that include neologisms
    − It is not a complete solution, although it may cause no problems in practice
  18. Agenda
    - Case: Search feature in CMS
    - Case: API for Record Labels
    - Summary
  19. API for Record Labels
    (Diagram: the CMS Backend Server and the API Server each send a search request to the Meta Search API Server, which sends a search query to Elasticsearch)
  20. API for Record Labels
    (Diagram: Record Label with Label staff and Label System; CMS Backend Server, API Server, Meta Search API Server, Elasticsearch; arrow label: Search query)
  21. API for Record Labels
    (Diagram: Record Label with Label staff and Label System; CMS Backend Server, API Server, Meta Search API Server, Elasticsearch; arrow label: Search query)
    Load increases
  22. Adding Data Nodes
    (Diagram: a cluster of master nodes and data nodes, shown before and after data nodes are added)
    More running cost …
  23. Efficient use of Data Nodes
    − What should you set to work efficiently with the same number of data nodes? Shard and Replica
    (Diagram: the data is split into Shard A, Shard B, and Shard C)
    Shard: 3
  24. Efficient use of Data Nodes
    − What should you set to work efficiently with the same number of data nodes? Shard and Replica
    (Diagram: the data is split into Primary Shards A, B, and C; each primary shard has two replica shards, and the primaries and replicas are distributed across the data nodes)
    Shard: 3, Replica: 2
  25. Shards Arrangement
    Data Nodes ≦ Shards × (Replicas + 1)
    Shard: 2, Replica: 2 → Primary A, Replica A, Replica A, Primary B, Replica B, Replica B
    Shard: 2, Replica: 1 → Primary A, Replica A, Primary B, Replica B, Not used, Not used
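The rule Data Nodes ≦ Shards × (Replicas + 1) can be checked with a little arithmetic. The sketch below assumes a single index and that each data node is only useful if it holds at least one shard copy, which is how the slide presents it. The function names are illustrative, and the six-node cluster is inferred from the slide's two "Not used" nodes rather than stated in the deck.

```python
def shard_copies(shards, replicas):
    """Total shard copies: every primary plus its replicas."""
    return shards * (replicas + 1)


def unused_data_nodes(data_nodes, shards, replicas):
    """Data nodes left without any shard copy to hold."""
    return max(0, data_nodes - shard_copies(shards, replicas))


# Assumed cluster of 6 data nodes, matching the slide's two arrangements.
print(unused_data_nodes(6, shards=2, replicas=2))  # 0
print(unused_data_nodes(6, shards=2, replicas=1))  # 2
```

When the result is greater than zero, adding data nodes alone brings no benefit until the shard or replica count is raised.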
  26. Benchmark of Shard
    Real Data, 12 Data Nodes
    (Chart: Response Time [ms], showing mean, min, max, and p95 for each configuration. Labeled values: 71.70, 65.25)
    Configurations:
    − Shard: 2, Replica: 5
    − Shard: 3, Replica: 3
    − Shard: 4, Replica: 2
    − Shard: 6, Replica: 1
    − Shard: 12, Replica: 1
  27. Elasticsearch Setting
    Number of nodes, shards and replicas
    − Data Nodes: 18
    − Replicas: 2
    − Shards: 6
  28. Summary
    − Case: search feature in CMS
      − A backward match with a wildcard query slows search performance on huge data
      − N-gram & dynamic query solve both search performance and the problem of character order within a word
    − Case: API for record labels
      − The load on Elasticsearch increases because of use from the labels' systems
      − We added data nodes and optimized the number of shards and replicas
      − Data Nodes ≦ Shards × (Replicas + 1)
      − Try benchmarking with different numbers of shards
  29. Thank you