Technologies that Support the Data Solution Business

Hiroshi Yamaguchi (Yahoo! JAPAN / Data Application Division, Data Group, Technology Group / Manager)

https://tech-verse.me/ja/sessions/133
https://tech-verse.me/en/sessions/133
https://tech-verse.me/ko/sessions/133

Tech-Verse2022

November 18, 2022

Transcript

  1. About Me - Hiroshi Yamaguchi, Manager - Yahoo JAPAN Corporation, Data Solution Department, Data Application Division, Data Group
  2. Data Solution - Multi Big Data - Search log - Location log - etc. - Services - DS.INSIGHT - DS.API - DS.DATASET - DS.ANALYSIS
  3. About search log data - Unique queries per year: 8.1 billion - Total log data (raw data): 500 TB - Monthly active users: 84 million
  4. Agenda - Topic1: Do not keep the analyst waiting - Topic2: Provide flexible analysis - Topic3: Reduce pre-processing time
  5. - Topic1: Do not keep the analyst waiting - Topic2: Provide flexible analysis - Topic3: Reduce pre-processing time (Photo: Aflo)
  6. Keyword Ranking - Dimension: keyword - Sub-dimensions: generation, gender, region - Metric
  7. Issue: Storing Large-scale Data - How should we improve the efficiency of data storage? - Search Ranking - Search Trend - Search Volume - Search Journey - etc. (Diagram labels: Raw Data, Trend, Volume, Ranking)
  8. Data Mart for High Speed (Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI People, DSI BE API, Data API, In-house Tool, DS.API, Hadoop, Cassandra, Raw Data, Data Mart, Data Mart DB)
  9. Efficient Data Storage

      Metrics table:
      keyword_id  logdate   metric1  male_uu  ...  twenties_uu  ...  metricN
      hash1       20221118  1234     ...      ...
      hash2       20221118  1234     ...      ...
      hash3       20221118  1234     ...      ...

      Ranking table:
      keyword_id  logdate   Ranking
      hash1       20221118  1
      hash2       20221118  2
      hash3       20221118  3
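
The split between a narrow ranking table and a wide metrics table can be illustrated with a minimal sketch. It assumes both tables are keyed by (keyword_id, logdate); the in-memory dictionaries and the `top_keywords` helper are hypothetical stand-ins, not the actual Data Mart or its API.

```python
# Hypothetical in-memory stand-ins for the two tables on the slide.
# Both are keyed by (keyword_id, logdate); values are illustrative only.
ranking_mart = {
    ("hash1", "20221118"): 1,
    ("hash2", "20221118"): 2,
    ("hash3", "20221118"): 3,
}
metrics_mart = {
    ("hash1", "20221118"): {"metric1": 1234},
    ("hash2", "20221118"): {"metric1": 1234},
    ("hash3", "20221118"): {"metric1": 1234},
}


def top_keywords(logdate, limit):
    """Return (rank, keyword_id, metrics) for the best-ranked keywords of a day."""
    # Step 1: scan only the narrow ranking table to pick the keywords.
    ranked = sorted(
        (rank, keyword_id)
        for (keyword_id, day), rank in ranking_mart.items()
        if day == logdate
    )
    # Step 2: fetch the wide metrics rows by key for the selected keywords only.
    return [
        (rank, keyword_id, metrics_mart[(keyword_id, logdate)])
        for rank, keyword_id in ranked[:limit]
    ]


result = top_keywords("20221118", 2)
```

In this sketch the ranking lookup never touches the wide rows, so only the keywords that will actually be displayed incur the cost of reading their metrics.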
  10. To Encourage Users to Make Trial and Error - Average execution time: several seconds per request - Data Mart is divided and optimized - Limited combinations of dimensions
  11. - Topic1: Do not keep the analyst waiting - Topic2: Provide flexible analysis - Topic3: Reduce pre-processing time (Photo: Aflo)
  12. Issue: Flexibility of Extraction Conditions - How should the aggregation cost be reduced? - Specifying user attributes - Wide range of periods - Categorization and other summarization - etc. (Diagram labels: Data Mart, Trino (Presto))
  13. Processing for Flexible Analysis (Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI Persona, DS.DATASET, DSI Job Server, DS.DATASET Job Server, Hadoop, Trino (Presto), Raw Data, Data Mart)
  14. Aggregation Period and Execution Time (Chart: execution time in minutes, 0 to 10, against aggregation periods of 1, 2, 3, 6, and 12 months; a memory limit error is marked)
  15. Number of Parallel Executions and Execution Time (Chart: execution time in minutes, 0 to 24, against the number of parallel executions, 1 to 6, for 1-month and 2-month aggregation periods; a memory limit error is marked)
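
The slides do not say how the parallel jobs were built. One common way to trade aggregation period against memory is to split a long period into monthly jobs, run a bounded number of them at once, and merge the partial results. The sketch below assumes that approach; `aggregate_month` is a hypothetical stub standing in for a real query, and all names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def month_chunks(start_year, start_month, months):
    """Yield (year, month) pairs covering `months` consecutive months."""
    for offset in range(months):
        index = (start_month - 1) + offset
        yield start_year + index // 12, index % 12 + 1


def aggregate_month(chunk):
    """Hypothetical stand-in for one monthly aggregation query."""
    year, month = chunk
    # A real job would submit a query here; this stub returns fake counts.
    return Counter({"keyword_a": month, "keyword_b": 1})


def aggregate_period(start_year, start_month, months, max_parallel):
    """Run monthly jobs with bounded parallelism and merge the partial results."""
    total = Counter()
    chunks = month_chunks(start_year, start_month, months)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for partial in pool.map(aggregate_month, chunks):
            total.update(partial)  # Counter.update adds counts together
    return total


totals = aggregate_period(2022, 11, 3, max_parallel=2)
```

Capping `max_parallel` keeps the number of concurrent jobs, and therefore peak memory use, under control, at the cost of a longer wall-clock time for long periods.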
  16. Challenges in Execution - All running jobs fail when the cluster limit is reached - Finite computing resources - Job management within a product - Noisy neighbor - Split resource groups by product
  17. Achieve Flexible Analysis - Average execution time: several minutes per request or more - Use the computing resources efficiently - Aggregation period is limited
  18. System Configuration (Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI People, DSI Persona, DS.DATASET, DSI BE API, Data API, In-house Tool, DSI Job Server, DS.DATASET Job Server, DS.API, Hadoop, Trino (Presto), Cassandra, Raw Data, Data Mart, Data Mart DB)
  19. Response to Overall Issues

                        Generic Analysis   Flexible Analysis
      Term              Long               Short
      Latency           Several seconds    Minutes to hours
      Segments          Fixed              Flexible
      Frequency of Use  Frequent           Less frequent
      Scaling cost      Storage            Computing
  20. - Topic1: Do not keep the analyst waiting - Topic2: Provide flexible analysis - Topic3: Reduce pre-processing time (Photo: Aflo)
  21. System Configuration (same architecture diagram as slide 18)
  22. Pre-processing for Keywords - keyword = user's search query - Same context - with or without space - word order - spelling inconsistencies - Proper nouns - product names - professional jargon

      Before pre-processing:
      keywords                         metric1  metric2
      “yahoo! DataSolution”            1200     650
      “データソリューション Yahoo!”    600      400
      “yahoo!DataSolution”             400      300

      After pre-processing:
      keywords                         metric1  metric2
      “yahoo! DataSolution”            2200     1350

      (“データソリューション Yahoo!” is “Data Solution Yahoo!” written in Japanese.)
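
The merge shown on this slide can be sketched as follows. This is a minimal illustration, not the production pre-processing: it assumes a small hand-written dictionary (`TOKEN_DICTIONARY`, hypothetical) that maps each known surface form to a normalized token, and it treats queries as equal when they contain the same tokens regardless of case, spacing, or word order.

```python
from collections import defaultdict

# Hypothetical dictionary: lowercase surface form -> normalized token.
TOKEN_DICTIONARY = {
    "yahoo!": "yahoo!",
    "datasolution": "datasolution",
    "データソリューション": "datasolution",  # katakana spelling of "data solution"
}


def normalize(keyword):
    """Map a raw query to an order-independent tuple of normalized tokens."""
    text = "".join(keyword.lower().split())  # ignore case and whitespace
    surfaces = sorted(TOKEN_DICTIONARY, key=len, reverse=True)  # longest first
    tokens = []
    position = 0
    while position < len(text):
        for surface in surfaces:
            if text.startswith(surface, position):
                tokens.append(TOKEN_DICTIONARY[surface])
                position += len(surface)
                break
        else:
            # No dictionary entry matches here: keep the character as a token.
            tokens.append(text[position])
            position += 1
    return tuple(sorted(tokens))  # sorting makes word order irrelevant


def merge(rows):
    """Sum metric1 and metric2 over rows whose keywords normalize identically."""
    totals = defaultdict(lambda: [0, 0])
    for keyword, metric1, metric2 in rows:
        key = normalize(keyword)
        totals[key][0] += metric1
        totals[key][1] += metric2
    return {key: tuple(values) for key, values in totals.items()}


rows = [
    ("yahoo! DataSolution", 1200, 650),
    ("データソリューション Yahoo!", 600, 400),
    ("yahoo!DataSolution", 400, 300),
]
merged = merge(rows)
# merged == {("datasolution", "yahoo!"): (2200, 1350)}
```

The three input rows collapse into one key, and the summed metrics match the totals on the slide (2200 and 1350).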
  23. Issues in Pre-processing - Dividing keywords appropriately, including technical terms - Summarizing keywords by context - Example: “Yahoo!DataSolution” could be divided as “Yahoo! DataSolution”, “Yahoo!” & “DataSolution”, or “Yahoo!” & “Data” & “Solution”. Which one to choose?
  24. Three Rules for Flexible Pre-processing - Divide keywords appropriately and normalize them as tokens - Assign labels to the tokens to express context - Group the labels to achieve complex aggregation
  25. Dictionary System (Architecture diagram. Layers: UI, Service, Computing, Data. Components: DS.DATASET, Dictionary UI, DS.DATASET Job Server, dictionary API, Hadoop, Trino (Presto), Data Mart, dictionary. Flow label: data aggregation)
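  26. Normalization Processing for Keywords (Diagram: keywords are tokenized, then labeled) - Keywords: “製品 激安” (product, super cheap), “製品 価格 商品A” (product, price, Item A), “口コミ 製品 商品B” (reviews, product, Item B), “製品 人気” (product, popular), “商品A げきやす” (Item A, “super cheap” written in hiragana) - Tokens and labels shown: “激安” (super cheap), “価格” (price), “商品A” (Item A), “商品B” (Item B), “口コミ” (reviews), “人気” (popular), “評価” (rating)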
  27. Normalization Processing for Keywords (same diagram as slide 26)
  28. Summary of Labels - Groups and labels shown: 商品 (item), 価格 (price), “商品A” (Item A), “価格” (price), “商品B” (Item B), “商品C” (Item C)

      Virtual table (diagram annotations: label, group):
      logdate   region  label1   group 1   metrics
      20221118  tokyo   “価格”   “商品A”   1200
      20221118  tokyo   “価格”   “商品B”   600
      20221118  osaka   “価格”   “商品A”   800
      20221118  nagoya  “価格”   “商品A”   600
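
The tokenize, label, and group steps from the preceding slides can be sketched as below. The three dictionaries are hypothetical: the token-to-label and label-to-group mappings are one plausible reading of the slide diagrams, not the contents of the actual dictionary system, and `count_groups` is an illustrative helper.

```python
from collections import Counter

# Hypothetical dictionaries; the mappings are an assumed reading of the slides.
NORMALIZE = {"げきやす": "激安"}  # hiragana spelling -> canonical token
LABELS = {
    "激安": "価格",    # "super cheap" -> price
    "価格": "価格",    # "price" -> price
    "口コミ": "評価",  # "reviews" -> rating
    "人気": "評価",    # "popular" -> rating
    "商品A": "商品A",  # Item A
    "商品B": "商品B",  # Item B
}
GROUPS = {"商品A": "商品", "商品B": "商品", "商品C": "商品"}  # items -> item group


def tokenize(keyword):
    """Split on whitespace and map spelling variants to canonical tokens."""
    return [NORMALIZE.get(token, token) for token in keyword.split()]


def labels_for(keyword):
    """Return the set of labels found in a keyword; unknown tokens are skipped."""
    return {LABELS[token] for token in tokenize(keyword) if token in LABELS}


def count_groups(keywords):
    """Count how many keywords fall into each label group."""
    counts = Counter()
    for keyword in keywords:
        # A label without a group entry forms its own group.
        counts.update({GROUPS.get(label, label) for label in labels_for(keyword)})
    return counts


keywords = ["製品 激安", "製品 価格 商品A", "口コミ 製品 商品B", "製品 人気", "商品A げきやす"]
group_counts = count_groups(keywords)
```

Because each step is driven by a dictionary, adding a new spelling variant or regrouping labels changes only the dictionary entries, not the aggregation code.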
  29. Realization of Pre-processing - Construct rules to make pre-processing easier for analysts - Implement a virtual table that can be handled easily in analysis
  30. Response to Analytical Issues
      - Overall system configuration
        - Modify the configuration depending on the application, either generic or dedicated
        - Ensure that the master data is in one place
      - Eliminating the need for pre-processing
        - Create necessary rules
        - Design the table structure deliberately so that it can be used for analysis
  31. Future Prospects - Further optimization of the system - Strengthening the rule base and introducing AI - Options under consideration include AutoFM