Estimating reach across 84 million users in LINE Ads

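加賀谷 北斗 (LINE / Development Center 4 / B2B Platform Development Team)
The "Estimated Audience Module", released in March 2020 for the LINE Ads management console, takes a targeting configuration that says "I want to deliver ads to people like this" and estimates how many users the ads could actually reach. Estimating the number of unique users who satisfy complex conditions across LINE's 84 million monthly active users (MAU) in Japan is not a simple problem, and a naive approach cannot cope with it at all. This session describes, from a server-side engineering perspective, how we tackle this problem, which is also known as the "Count-distinct Problem".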

LINE Developers

July 29, 2020

Transcript

  1. User Demographics
    - Attributes (e.g., marital status, mobile carrier, estimated salary)
    - Behaviors (e.g., how often you watch TV)
    - Interests (e.g., games, sports, fashion, music, books, travel)
    - Age, gender
    - Area (spot targeting, radius targeting)
    * These demographic data are basically estimates based on the behavior of LINE users
  2. Audience (Re)Targeting
    - Audience Group
      - IDFA/AAID/Phone number/E-mail address list audience
      - LINE tag audience
      - LINE Official Account friends audience
      - Mobile app reengagement audience
      - Lookalike audience
      - Cross platform audience (LINE Official Account, LINE POINT…)
      - etc.
  3. Example
    - You want to deliver your ads to users who:
      - live or work within 5 km of Shinjuku Station
      - are married
      - are over 40
      - use Android
      - belong to audience group A or audience group B
      - do NOT belong to audience group C
      - are interested in “Finance” OR “health and fitness”
      - are NOT interested in “Sports”
    - How many users is that?
  4. Example
    - You want to deliver your ads to users who:
      - live or work within 5 km of Shinjuku Station
      - are married
      - are over 40
      - use Android
      - belong to audience group A or audience group B
      - do NOT belong to audience group C
      - are interested in “Finance” or “health and fitness”
      - are NOT interested in “Sports”
    - Set operations involved: AND, OR, NOT
    - How many users is that?
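
The targeting example above boils down to nested set operations over user IDs. A minimal sketch with plain Python sets shows what the naive approach looks like; every set name and user ID here is hypothetical and only stands in for one targeting condition. It works for a handful of IDs, but materializing sets like these for tens of millions of users on every request is exactly what does not scale.

```python
# Hypothetical, tiny user-ID sets; each one stands in for a targeting condition.
near_shinjuku = {1, 2, 3, 4, 5, 6}
married = {1, 2, 3, 4, 5}
over_40 = {1, 2, 3, 4, 7}
android = {1, 2, 3, 4, 8}
group_a = {1, 2}
group_b = {3, 9}
group_c = {2}
finance = {1, 3}
health_fitness = {4}
sports = {3}

# AND is intersection (&), OR is union (|), NOT is difference (-).
# Parentheses are explicit because "-" binds tighter than "&" in Python.
matched = near_shinjuku & married & over_40 & android
matched = (matched & (group_a | group_b)) - group_c
matched = (matched & (finance | health_fitness)) - sports

reach = len(matched)
print(reach)
```

With these toy sets only user 1 satisfies every condition, so `reach` is 1.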
  5. The difficulties
    - Set operation
      - Must support multiple set operations: AND, OR and NOT
      - These set operations can be nested
    - Size
      - Multiple input sources
      - Numerous users
    - Updatability
      - Each set is updated day-by-day
  7. Handle set operations against massive sets
    - What’s the problem? In other words:
      - “To search (count) users who satisfy multiple conditions”
    - We can use any general-purpose search engine!
      - “A” OR “B” AND “C” -D
  8. "_source": {
        "country": "TH",
        "gender": "2",
        "age_range": { "gte": 25, "lte": 29 },
        "os_code": "ANDROID",
        "os_version": "9.0.0.0",
        "carrier": "9",
        "persona": [76, 127],
        "interests": ["1.999", "3.999", "6.999"],
        "area_geohash": "xxxyyyz",
        "area_updated_at": 1595898442725,
        "area_level_1": "xxx.u",
        "area_level_2": "xxx.u.y",
        "area_level_3": "xxx.u.y.z",
        "audience_groups": [1111111111111, 2222222222222, 3333333333333]
      }
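
Given documents shaped like the one above, a targeting condition could plausibly be expressed as an Elasticsearch `bool` query and sent to the `_count` endpoint. The sketch below only builds the request body as a Python dict. The field names come from the sample document, but the function name, the choice of clauses, and the assumption that these fields are mapped so that `term`/`terms` queries match exactly are illustrative; the talk does not show the actual query.

```python
import json

def build_count_query(os_code, include_groups, exclude_groups,
                      include_interests, exclude_interests):
    """Build a bool query body (hypothetical helper, not from the talk)."""
    return {
        "query": {
            "bool": {
                # "filter" clauses are ANDed together; a "terms" query matches
                # documents that contain ANY of the listed values (OR).
                "filter": [
                    {"term": {"os_code": os_code}},
                    {"terms": {"audience_groups": include_groups}},
                    {"terms": {"interests": include_interests}},
                ],
                # "must_not" clauses exclude matching documents (NOT).
                "must_not": [
                    {"terms": {"audience_groups": exclude_groups}},
                    {"terms": {"interests": exclude_interests}},
                ],
            }
        }
    }

body = build_count_query(
    os_code="ANDROID",
    include_groups=[1111111111111, 2222222222222],
    exclude_groups=[3333333333333],
    include_interests=["1.999", "3.999"],
    exclude_interests=["6.999"],
)
print(json.dumps(body, indent=2))
```

Using `filter` rather than `must` skips relevance scoring, which is not needed when only the number of hits matters.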
  10. Store data to Elasticsearch - Input Sources
    - Multiple input sources
      - audience groups ← Redis for Ad delivery
      - estimated attributes ← Hadoop cluster
      - location data ← Hadoop cluster
      - NRT (near-real-time) data (audience groups) ← Job server
  11. Store data to Elasticsearch - # of Users
    - Numerous users
      - Sampling -> About 29 million documents (≒ users, global)
    - Just “estimation”!
      - We don’t need exact results
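
Sampling trades exactness for size: count over the sample, then scale back up. The sketch below shows one common way to do this, assuming a deterministic hash-based sample and a known sampling rate. Both function names and the 25% rate in the usage lines are hypothetical; the talk states neither how users are sampled nor the rate.

```python
import hashlib

def in_sample(user_id: str, sampling_rate: float) -> bool:
    """Decide deterministically whether a user belongs to the sample."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # user if it falls below the requested fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sampling_rate * 2**64

def scale_estimate(sample_count: int, sampling_rate: float) -> int:
    """Scale a count measured on the sample up to the full population."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return round(sample_count / sampling_rate)

# Assumed numbers: 1,200 matching documents in a 25% sample.
print(scale_estimate(1200, 0.25))  # 4800
```

Because the decision depends only on the hash of the ID, the same users stay in the sample across batch runs, which keeps successive estimates comparable.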
  12. Store data to Elasticsearch - Audience Groups
    - Audience groups are updated day-by-day!
      - Tag events
      - Daily execution of lookalike algorithm
      - List upload by advertisers
      - Changes of Official Account Friends
      - Updated by other platforms
      - etc.
  13. Store data to Elasticsearch - Audience Groups
    - We were initially going to adopt stream processing
      - Consume ADD/REMOVE events
      - Store data to Elasticsearch by updating a document
    - Elasticsearch is NOT good at UPDATE operations
      - UPDATE ≒ DELETE and INSERT
      - ONE event means an operation on an “array” field (ADD/REMOVE)
      - It’s a costly operation
    - Max 200,000 qps!
  14. Store data to Elasticsearch - Audience Groups
    - Classic batch processing
      - Update “all” audience groups that a user belongs to
    - Just “estimation”!!!
      - We don’t need exact results
    - Execute a batch job bi-hourly
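
The batch approach above replaces many per-event array edits with one write per user. A minimal sketch of that idea groups a membership snapshot by user and emits newline-delimited actions in the shape the Elasticsearch Bulk API accepts. The function name, the index name `audience`, and the use of partial-document `update` actions are assumptions for illustration; the talk does not describe the actual job.

```python
import json
from collections import defaultdict

def build_bulk_lines(memberships, index_name="audience"):
    """Turn (user_doc_id, audience_group_id) pairs into Bulk API lines.

    Hypothetical helper: each user gets a single action that overwrites the
    whole "audience_groups" array instead of one ADD/REMOVE per event.
    """
    groups_by_user = defaultdict(set)
    for user_doc_id, group_id in memberships:
        groups_by_user[user_doc_id].add(group_id)

    lines = []
    for user_doc_id in sorted(groups_by_user):
        # Action line, followed by the partial document for that action.
        lines.append(json.dumps(
            {"update": {"_index": index_name, "_id": user_doc_id}}))
        lines.append(json.dumps(
            {"doc": {"audience_groups": sorted(groups_by_user[user_doc_id])}}))
    return lines

snapshot = [("u2", 5), ("u1", 3), ("u1", 1), ("u1", 3)]
bulk_lines = build_bulk_lines(snapshot)
print("\n".join(bulk_lines))
```

However many ADD/REMOVE events occurred between runs, each user document is rewritten at most once per batch.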
  15. Key Takeaways
    - We’ve built a system to estimate audience size for LINE Ads
    - We’ve used Elasticsearch as the main storage
    - Solutions to huge and frequently changing data
      - Sampling
      - Classic batch processing
  16. Other topics
    - Security / Privacy
      - Use one-way hash for user ID (document ID on Elasticsearch)
      - Introduce data retention to support opt-out
        - Periodic removal of obsolete data from Elasticsearch
    - ID conversion
      - IDFA/AAID <=> LINE (internal) user ID
      - We have a mapping on HBase
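
Using a one-way hash as the document ID means the search index never stores the raw user ID. A minimal sketch of that idea uses a keyed hash from the Python standard library. HMAC-SHA256, the function name, and the placeholder key are all assumptions; the talk does not say which hash function or keying scheme is used.

```python
import hashlib
import hmac

def document_id(user_id: str, secret_key: bytes) -> str:
    """Derive a stable, non-reversible document ID from an internal user ID."""
    return hmac.new(secret_key, user_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

key = b"example-secret-key"  # placeholder; a real key would come from a secret store
doc_id = document_id("internal-user-123", key)
print(len(doc_id))  # 64 hex characters
```

A keyed hash is shown rather than a bare digest because IDs drawn from a small or guessable space can otherwise be recovered by hashing every candidate.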
  17. (Appendix) Statistics
    - About 15,000 queries per weekday
    - About 1,500 unique users per weekday
    - Execution time percentiles (milliseconds)
      - 50% (median): 32
      - 90%: 73
      - 95%: 98
      - 99%: 171
  18. (Appendix) Statistics
    [Chart: efficient_targeting_option_size (0 to 9) / estimated_size (0 to 50,000,000)]