Technologies that Support the Data Solution Business

About Me
Hiroshi Yamaguchi, Manager
Yahoo JAPAN Corporation
Data Solution Department, Data Application Division, Data Group

Data Solution
- Multi Big Data
  - Search log
  - Location log
  - etc.
- Services
  - DS.INSIGHT
  - DS.API
  - DS.DATASET
  - DS.ANALYSIS

DS Products
- DS.INSIGHT: customer insight
- DS.DATASET: in-house environment
- DS.API: connected data

About Search Log Data
- Unique queries per year: 8.1 billion
- Total log data (raw data): 500 TB
- Monthly active users: 84 million

Handling Large-scale Data Efficiently for Analysis

Agenda
- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Keyword Ranking
- dimension: keyword
- sub dimensions
  - generation
  - gender
  - region
- metric

Issue: Storing Large-scale Data
- How should we improve the efficiency of data storage?
  - Search Ranking
  - Search Trend
  - Search Volume
  - Search Journey
  - etc.
(Figure labels: Raw Data, Ranking, Volume, Trend, etc.)

Data Mart for High Speed
(Diagram labels: DSI People, DSI BE API, Hadoop, Data Mart, Raw Data, Data API, In-house Tool, Cassandra, DS API, Data Mart DB; layers: UI, Service, Computing, Data)

Efficient Data Storage
Metrics
keyword_id  logdate   metric1  male_uu  ...  twenties_uu  ...  metricN
hash1       20221118  1234     ...      ...
hash2       20221118  1234     ...      ...
hash3       20221118  1234     ...      ...
Ranking
keyword_id  logdate   Ranking
hash1       20221118  1
hash2       20221118  2
hash3       20221118  3
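
The relationship between the wide metrics table and the narrow ranking table can be illustrated with a short sketch. It is not the production pipeline: only the column names come from the slide, while the metric values, the function name build_ranking, and the choice of metric1 in descending order as the ranking key are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical metrics rows: (keyword_id, logdate, metric1).
# Only the column names come from the slide; the values are made up.
METRICS = [
    ("hash1", "20221118", 1234),
    ("hash2", "20221118", 987),
    ("hash3", "20221118", 456),
]


def build_ranking(rows):
    """Precompute a (keyword_id, logdate, rank) table from metric rows.

    Ranking by metric1 in descending order is an assumption; the slide
    does not say which metric drives the ranking.
    """
    by_date = defaultdict(list)
    for keyword_id, logdate, metric1 in rows:
        by_date[logdate].append((keyword_id, metric1))

    ranking = []
    for logdate in sorted(by_date):
        ordered = sorted(by_date[logdate], key=lambda kv: kv[1], reverse=True)
        for rank, (keyword_id, _) in enumerate(ordered, start=1):
            ranking.append((keyword_id, logdate, rank))
    return ranking


RANKING = build_ranking(METRICS)
```

Storing the rank as its own small table means a ranking request only reads three columns per keyword and never has to sort the wide metrics rows at query time.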

Encouraging Users to Experiment by Trial and Error
- Average execution time: several seconds per request
- Data Mart is divided and optimized
- Limited combinations of dimensions

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Issue: Flexibility of Extraction Conditions
- How should the aggregation cost be reduced?
  - Specifying user attributes
  - Wide range of periods
  - Categorization and other summarization
  - etc.
(Diagram labels: Data Mart, Trino (Presto))

Processing for Flexible Analysis
(Diagram labels: Hadoop, Trino (Presto), DSI Job Server, Raw Data, Data Mart, DSI Persona, DS.DATASET, DS.DATASET Job Server; layers: UI, Service, Computing, Data)

Issues of Flexible Aggregation Processing
- Finite computing resources
- Noisy neighbor

Aggregation Period and Execution Time
(Chart: execution time in minutes, axis 0 to 10, plotted against aggregation periods of 1, 2, 3, 6, and 12 months; a memory limit error is marked.)

Number of Parallel Executions and Execution Time
(Chart: execution time in minutes, axis 0 to 24, plotted against 1 to 6 parallel executions, with one series for 1 month (◆) and one for 2 months (▪); a memory limit error is marked.)

Challenges in Execution
All running jobs fail when the cluster limit is reached
- Finite computing resources
  - Job management within a product
- Noisy neighbor
  - Split resource groups by product
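
The two countermeasures above, managing jobs within a product and splitting resources by product, can be illustrated at the job-server level with a small admission gate. This is a sketch under stated assumptions, not the system described in the talk: the class name ProductJobGate and the per-product limits are hypothetical, and a query engine's own resource-group settings would normally do the equivalent work on the cluster side.

```python
import threading


class ProductJobGate:
    """Caps the number of concurrently running jobs per product.

    Hypothetical helper: the class name and the limits are assumptions.
    """

    def __init__(self, limits):
        # One bounded semaphore per product, so a busy product cannot
        # consume slots reserved for another one (noisy neighbor).
        self._slots = {
            product: threading.BoundedSemaphore(limit)
            for product, limit in limits.items()
        }

    def try_start(self, product):
        """Return True if a slot was acquired, False if the product is at its cap."""
        return self._slots[product].acquire(blocking=False)

    def finish(self, product):
        """Release the slot when a job completes."""
        self._slots[product].release()


# Example limits; the numbers are made up.
gate = ProductJobGate({"DSI": 2, "DS.DATASET": 1})
```

A job that is refused by try_start can be queued and retried later instead of being submitted and pushing the whole cluster over its limit.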

Achieving Flexible Analysis
- Average execution time: minutes per request or longer
- Use the computing resources efficiently
- Aggregation period is limited

System Configuration
(Diagram labels: DSI People, DSI BE API, Hadoop, Data Mart, Trino (Presto), Raw Data, DSI Persona, DS.DATASET, Data API, In-house Tool, DSI Job Server, DS.DATASET Job Server, Cassandra, DS.API, Data Mart DB; layers: UI, Service, Computing, Data)

Response to Overall Issues
                  Generic Analysis   Flexible Analysis
Term              Long               Short
Latency           Several seconds    Minutes to hours
Segments          Fixed              Flexible
Frequency of Use  Frequent           Less frequent
Scale cost        Storage            Computing

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Pre-processing for Keywords
- keyword = user's search query
- Same context
  - with or without space
  - word order
  - spelling inconsistencies
- Proper nouns
  - product names
  - professional jargon
Before pre-processing
keywords                        metric1  metric2
“yahoo! DataSolution”           1200     650
“データソリューション Yahoo!”     600      400
“yahoo!DataSolution”            400      300
After pre-processing
keywords                        metric1  metric2
“yahoo! DataSolution”           2200     1350
(“データソリューション” is "Data Solution" written in Japanese.)
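
The merge of the three keyword variants into one row can be sketched as follows. This is a minimal illustration, assuming a small hand-written dictionary: TOKEN_DICTIONARY, canonical_key, and merge_metrics are hypothetical names, and the real pre-processing is not described at this level of detail in the slides. Only the three keyword rows and their metrics are taken from the slide.

```python
import unicodedata
from collections import defaultdict

# Hypothetical dictionary: surface form -> canonical token.
TOKEN_DICTIONARY = {
    "yahoo!": "yahoo!",
    "datasolution": "datasolution",
    "データソリューション": "datasolution",  # Japanese spelling of "Data Solution"
}

# Rows from the slide: (keyword, metric1, metric2).
ROWS = [
    ("yahoo! DataSolution", 1200, 650),
    ("データソリューション Yahoo!", 600, 400),
    ("yahoo!DataSolution", 400, 300),
]


def canonical_key(keyword):
    """Map a keyword to an order-independent tuple of canonical tokens."""
    # NFKC folds full-width characters (including full-width spaces),
    # lower() removes case differences, and dropping spaces makes
    # "yahoo! DataSolution" and "yahoo!DataSolution" identical.
    text = unicodedata.normalize("NFKC", keyword).lower().replace(" ", "")
    tokens = {canon for surface, canon in TOKEN_DICTIONARY.items() if surface in text}
    # Sorting makes the key independent of word order.
    return tuple(sorted(tokens))


def merge_metrics(rows):
    """Sum metric1 and metric2 over keywords that share a canonical key."""
    totals = defaultdict(lambda: [0, 0])
    for keyword, metric1, metric2 in rows:
        key = canonical_key(keyword)
        totals[key][0] += metric1
        totals[key][1] += metric2
    return {key: tuple(values) for key, values in totals.items()}


MERGED = merge_metrics(ROWS)
```

With these rows, all three variants collapse to the same key and their metrics add up to the 2200 and 1350 shown in the merged table.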

Issues in Pre-processing
- Dividing keywords appropriately, including technical terms
- Summarizing keywords by context
Which split should be chosen?
- “Yahoo!DataSolution”
- “Yahoo! DataSolution”
- “Yahoo!” & “DataSolution”
- “Yahoo!” & “Data” & “Solution”

Three Rules for Flexible Pre-processing
- Divide keywords appropriately and normalize them as tokens
- Assign labels to the tokens to express context
- Group the labels to achieve complex aggregation
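
The three rules can be sketched end to end with a toy dictionary. Everything here is an assumption made for illustration: the dictionaries NORMALIZE, LABELS, and GROUPS, the greedy longest-match tokenizer, and the rule that a group matches when it shares at least one label with the keyword. The surface forms are borrowed from the examples in this deck, but the mappings between them are guesses, not the actual dictionary.

```python
# Rule 1 data: surface form -> normalized token (hypothetical).
NORMALIZE = {
    "激安": "激安",      # bargain-priced
    "げきやす": "激安",  # same word written in hiragana
    "価格": "価格",      # price
    "口コミ": "口コミ",  # word-of-mouth reviews
    "人気": "人気",      # popular
    "商品A": "商品A",    # item A
    "商品B": "商品B",    # item B
}

# Rule 2 data: token -> label expressing its context (hypothetical).
LABELS = {
    "激安": "価格",
    "価格": "価格",
    "口コミ": "評価",  # rating
    "人気": "評価",
    "商品A": "商品A",
    "商品B": "商品B",
}

# Rule 3 data: group name -> labels it bundles (hypothetical).
GROUPS = {"商品価格": {"価格", "商品A", "商品B"}}  # product price


def tokenize(keyword):
    """Greedy longest-match split; characters not in the dictionary are skipped."""
    surfaces = sorted(NORMALIZE, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(keyword):
        for surface in surfaces:
            if keyword.startswith(surface, i):
                tokens.append(NORMALIZE[surface])
                i += len(surface)
                break
        else:
            i += 1  # unknown character or space
    return tokens


def labels_for(tokens):
    """Map tokens to labels, keeping first-seen order and dropping repeats."""
    seen = []
    for token in tokens:
        label = LABELS[token]
        if label not in seen:
            seen.append(label)
    return seen


def groups_for(labels):
    """Return the groups that share at least one label with the keyword."""
    return sorted(name for name, members in GROUPS.items() if members & set(labels))
```

For example, tokenize("商品A げきやす") yields the tokens 商品A and 激安, which labels_for turns into the labels 商品A and 価格, and groups_for then places the keyword in the 商品価格 group.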

Dictionary System
(Diagram labels: Hadoop, Trino (Presto), Data Mart, DS.DATASET, DS.DATASET Job Server, data aggregation, dictionary, Dictionary UI, dictionary API; layers: UI, Service, Computing, Data)

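Normalization Processing for Keywords
(Diagram: keywords are tokenized, then the tokens are labeled.)
- keywords: “製品激安”, “製品価格商品A”, “口コミ 製品商品B”, “製品人気”, “商品A げきやす”
- tokens: “激安”, “価格”, “商品A”, “口コミ”, “商品B”, “人気”
- labels: “価格”, “評価”, “商品A”, “商品B”
(Glosses: 製品 = product; 商品A and 商品B = item A and item B; 激安 and げきやす = bargain-priced; 価格 = price; 口コミ = word-of-mouth reviews; 人気 = popular; 評価 = rating.)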

Summary of Labels
(Diagram labels: 商品価格 (product price), “商品A”, “価格”, “商品B”, “商品C”; annotations: label, group)
Virtual table
logdate   region  label1  group 1  metrics
20221118  tokyo   “価格”  “商品A”  1200
20221118  tokyo   “価格”  “商品B”  600
20221118  osaka   “価格”  “商品A”  800
20221118  nagoya  “価格”  “商品A”  600
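
Once keywords are expressed as labels in a table like this, summarization reduces to a plain group-by. The sketch below assumes the four rows shown on the slide and uses hypothetical column keys (group1 for the "group 1" column, metric for "metrics") and a hypothetical helper named aggregate; it is not the actual query layer.

```python
from collections import defaultdict

# Rows copied from the slide's virtual table.
VIRTUAL_TABLE = [
    {"logdate": "20221118", "region": "tokyo", "label1": "価格", "group1": "商品A", "metric": 1200},
    {"logdate": "20221118", "region": "tokyo", "label1": "価格", "group1": "商品B", "metric": 600},
    {"logdate": "20221118", "region": "osaka", "label1": "価格", "group1": "商品A", "metric": 800},
    {"logdate": "20221118", "region": "nagoya", "label1": "価格", "group1": "商品A", "metric": 600},
]


def aggregate(rows, key_columns):
    """Sum the metric column over the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        totals[key] += row["metric"]
    return dict(totals)


BY_GROUP = aggregate(VIRTUAL_TABLE, ["group1"])
BY_REGION = aggregate(VIRTUAL_TABLE, ["region"])
```

The analyst only chooses which columns to group by; no keyword-level cleaning is needed at analysis time.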

Realization of Pre-processing
- Construct rules to make pre-processing easier for analysts
- Implement a virtual table that can be handled easily in analysis

Response to Analytical Issues
- Overall system configuration
  - The configuration needs to be adapted to the application, either generic or dedicated
  - Ensure that the master data is in one place
- Eliminating the need for pre-processing
  - Create the necessary rules
  - Design the table structure deliberately so that it can be used for analysis

Future Prospects
- Further optimization of the system
- Strengthening of the rule base and introduction of AI
  - Under consideration, including AutoFM

Thank you
