Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

About Me
Hiroshi Yamaguchi
Manager, Yahoo JAPAN Corporation
Data Solution Department, Data Application Division, Data Group

Slide 3

Slide 3 text

Data Solution
- Multi Big Data
  - Search log
  - Location log
  - etc.
- Services
  - DS.INSIGHT
  - DS.API
  - DS.DATASET
  - DS.ANALYSIS

Slide 4

Slide 4 text

DS Products
- DS.INSIGHT: customer insight
- DS.DATASET: in-house environment
- DS.API: connected data

Slide 5

Slide 5 text

About search log data
- Unique queries: 8.1 billion per year
- Raw data: 500 TB of log data in total
- Active users: 84 million monthly active users

Slide 6

Slide 6 text

Handling Large-scale Data Efficiently for Analysis

Slide 7

Slide 7 text

Agenda
- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time

Slide 8

Slide 8 text

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Slide 9

Slide 9 text

Keyword Ranking
- dimension: keyword
- sub dimensions
  - generation
  - gender
  - region
- metric

Slide 10

Slide 10 text

Issue: Storing Large-scale Data
- How should we improve the efficiency of data storage?
  - Search Ranking
  - Search Trend
  - Search Volume
  - Search Journey
  - etc.
[Diagram labels: Raw Data, Ranking, Volume, Trend, etc.]

Slide 11

Slide 11 text

Data Mart for High Speed
[Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI People, In-house Tool, DSI BE API, Data API, DS API, Hadoop, Cassandra, Raw Data, Data Mart, Data Mart DB]

Slide 12

Slide 12 text

Efficient Data Storage

Metrics
keyword_id  logdate   metric1  male_uu  ...  twenties_uu  ...  metricN
hash1       20221118  1234     ...           ...
hash2       20221118  1234     ...           ...
hash3       20221118  1234     ...           ...

Ranking
keyword_id  logdate   Ranking
hash1       20221118  1
hash2       20221118  2
hash3       20221118  3
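As a rough illustration of keeping a small Ranking table next to the wide Metrics table, the sketch below derives a per-day rank from one metric. It is only a sketch: the `MetricRow` type, the `build_ranking` function, the choice of `metric1` as the ranking key, and the metric values are all assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRow:
    """One row of a (hypothetical) metrics table, reduced to three columns."""
    keyword_id: str
    logdate: str
    metric1: int


def build_ranking(rows):
    """Return {(keyword_id, logdate): rank}, ranking by metric1 within each logdate."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row.logdate, []).append(row)

    ranking = {}
    for logdate, day_rows in by_date.items():
        # Highest metric first; keyword_id breaks ties so the result is deterministic.
        ordered = sorted(day_rows, key=lambda r: (-r.metric1, r.keyword_id))
        for rank, row in enumerate(ordered, start=1):
            ranking[(row.keyword_id, logdate)] = rank
    return ranking


# Made-up metric values, used only to show the shape of the result.
rows = [
    MetricRow("hash2", "20221118", 200),
    MetricRow("hash1", "20221118", 300),
    MetricRow("hash3", "20221118", 100),
]
ranking = build_ranking(rows)
```

Precomputing the rank once means a ranking request only has to read the narrow table, which is one plausible reading of "Data Mart is divided and optimized".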

Slide 13

Slide 13 text

To Encourage Users to Make Trial and Error
- Average execution time: several sec/req
- Data Mart is divided and optimized
- Limited combinations of dimensions

Slide 14

Slide 14 text

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Slide 15

Slide 15 text

Issue: Flexibility of Extraction Conditions
- How should the aggregation cost be reduced?
  - Specifying user attributes
  - Wide range of periods
  - Categorization and other summarization
  - etc.
[Diagram labels: Data Mart, Trino (Presto)]

Slide 16

Slide 16 text

Processing for Flexible Analysis
[Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI Persona, DS.DATASET, DSI Job Server, DS.DATASET Job Server, Hadoop, Trino (Presto), Raw Data, Data Mart]

Slide 17

Slide 17 text

Issues of Flexible Aggregation Processing
- Finite computing resources
- Noisy neighbor

Slide 18

Slide 18 text

Aggregation Period and Execution Time
[Chart: execution time (min, axis 0 to 10) against aggregation period (1, 2, 3, 6 and 12 months), with a memory limit error marked]

Slide 19

Slide 19 text

Number of Parallel Executions and Execution Time
[Chart: execution time (min, axis 0 to 24) against number of parallel executions (1 to 6), with one series for 1 month and one for 2 months, and a memory limit error marked]

Slide 20

Slide 20 text

Challenges in Execution
All running jobs fail when the cluster limit is reached
- Finite computing resources
  - Job management within a product
- Noisy neighbor
  - Split resource groups by product
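One way to picture "job management within a product" is a per-product queue that caps how many jobs run at once, so a burst of requests waits instead of pushing the shared cluster past its limit. The sketch below is an assumption-laden illustration, not the system described in the talk: the `ProductJobQueue` class, its method names and the `max_running` limit are hypothetical, and any resource-group separation would be configured on the query engine itself.

```python
from collections import deque


class ProductJobQueue:
    """Caps the number of concurrently running jobs for one product (hypothetical)."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()

    def submit(self, job_id):
        """Start the job if a slot is free, otherwise queue it."""
        if len(self.running) < self.max_running:
            self.running.add(job_id)
            return "running"
        self.waiting.append(job_id)
        return "waiting"

    def finish(self, job_id):
        """Release the job's slot and promote the oldest waiting job, if any."""
        self.running.discard(job_id)
        if self.waiting and len(self.running) < self.max_running:
            promoted = self.waiting.popleft()
            self.running.add(promoted)
            return promoted
        return None


queue = ProductJobQueue(max_running=2)
states = [queue.submit(job) for job in ("job-a", "job-b", "job-c")]
promoted = queue.finish("job-a")
```

Queuing the third job rather than submitting it keeps one product's burst from failing every job already in execution.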

Slide 21

Slide 21 text

Achieve Flexible Analysis
- Average execution time: several min/req or more
- Use the computing resources efficiently
- Aggregation period is limited

Slide 22

Slide 22 text

System Configuration
[Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI People, DSI Persona, DS.DATASET, In-house Tool, DSI BE API, Data API, DS.API, DSI Job Server, DS.DATASET Job Server, Hadoop, Trino (Presto), Cassandra, Raw Data, Data Mart, Data Mart DB]

Slide 23

Slide 23 text

Response to Overall Issues

                  Generic Analysis  Flexible Analysis
Term              Long              Short
Latency           Several seconds   Minutes to hours
Segments          Fixed             Flexible
Frequency of Use  Frequent          Less frequent
Scale cost        Storage           Computing

Slide 24

Slide 24 text

- Topic1: Do not keep the analyst waiting
- Topic2: Provide flexible analysis
- Topic3: Reduce pre-processing time
Photo: Aflo

Slide 25

Slide 25 text

System Configuration
[Architecture diagram. Layers: UI, Service, Computing, Data. Components: DSI People, DSI Persona, DS.DATASET, In-house Tool, DSI BE API, Data API, DS.API, DSI Job Server, DS.DATASET Job Server, Hadoop, Trino (Presto), Cassandra, Raw Data, Data Mart, Data Mart DB]

Slide 26

Slide 26 text

Pre-processing for Keywords
- keyword = user's search query
- Same context
  - with or without space
  - word order
  - spelling inconsistencies
- Proper nouns
  - product names
  - professional jargon

Before summarization:
keywords                        metric1  metric2
“yahoo! DataSolution”           1200     650
“データソリューション Yahoo!”   600      400
“yahoo!DataSolution”            400      300

After summarization:
keywords                        metric1  metric2
“yahoo! DataSolution”           2200     1350
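The merge shown on this slide can be sketched as follows: each keyword is reduced to a key that ignores spacing, letter case, word order and known spelling variants, and the metrics are summed per key. This is a minimal sketch under stated assumptions. The `DICTIONARY` contents, the greedy longest-match splitting and the function names are invented for illustration; only the three input rows and the expected totals come from the slide.

```python
from collections import defaultdict

# Hypothetical dictionary: surface form (lowercase) -> canonical token.
DICTIONARY = {
    "yahoo!": "yahoo!",
    "datasolution": "datasolution",
    "データソリューション": "datasolution",
}


def normalize(keyword):
    """Return a key that ignores case, spaces, word order and known variants."""
    text = keyword.lower().replace(" ", "").replace("\u3000", "")
    tokens = []
    while text:
        # Greedy longest match of a dictionary entry at the start of the text.
        match = max((s for s in DICTIONARY if text.startswith(s)), key=len, default=None)
        if match is None:
            tokens.append(text)  # unknown remainder is kept as a single token
            break
        tokens.append(DICTIONARY[match])
        text = text[len(match):]
    return tuple(sorted(tokens))


def merge(rows):
    """Sum (metric1, metric2) over all keywords that share a normalized key."""
    totals = defaultdict(lambda: [0, 0])
    for keyword, metric1, metric2 in rows:
        key = normalize(keyword)
        totals[key][0] += metric1
        totals[key][1] += metric2
    return {key: tuple(values) for key, values in totals.items()}


rows = [
    ("yahoo! DataSolution", 1200, 650),
    ("データソリューション Yahoo!", 600, 400),
    ("yahoo!DataSolution", 400, 300),
]
merged = merge(rows)
```

With these inputs all three keywords collapse to a single key whose totals match the merged row on the slide (2200 and 1350).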

Slide 27

Slide 27 text

Issues in Pre-processing
- Dividing keywords appropriately, including technical terms
- Summarizing keywords by context

Which one to choose?
- “Yahoo!DataSolution”
- “Yahoo! DataSolution”
- “Yahoo!” & “DataSolution”
- “Yahoo!” & “Data” & “Solution”

Slide 28

Slide 28 text

Three Rules for Flexible Pre-processing
- Divide keywords appropriately and normalize them as tokens
- Assign labels to the tokens to express context
- Group the labels to achieve complex aggregation
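The three rules can be read as a small pipeline: tokenize and normalize, then label, then group. The sketch below shows that shape only. The `NORMALIZE`, `LABELS` and `GROUPS` dictionaries, their entries and the function names are assumptions made for the example; the talk does not specify the actual dictionary contents.

```python
# Hypothetical dictionaries, invented for this example.
NORMALIZE = {"げきやす": "激安"}  # spelling variant -> canonical token ("dirt cheap")
LABELS = {
    "激安": "価格",    # "dirt cheap" expresses a price context
    "価格": "価格",    # price
    "口コミ": "評価",  # reviews -> evaluation
    "人気": "評価",    # popular -> evaluation
    "商品A": "商品A",  # Item A
    "商品B": "商品B",  # Item B
}
GROUPS = {"商品A": "商品", "商品B": "商品", "価格": "価格", "評価": "評価"}


def tokenize(keyword):
    """Rule 1: split on whitespace and normalize spelling variants."""
    return [NORMALIZE.get(token, token) for token in keyword.split()]


def label(tokens):
    """Rule 2: map tokens to labels, dropping unknown tokens and duplicates."""
    labels = []
    for token in tokens:
        found = LABELS.get(token)
        if found is not None and found not in labels:
            labels.append(found)
    return labels


def group(labels):
    """Rule 3: map each label to its label group."""
    return [GROUPS[name] for name in labels]


tokens = tokenize("商品A げきやす")
labels = label(tokens)
groups = group(labels)
```

Whitespace splitting is the simplest possible tokenizer and would not handle keywords written without spaces; a dictionary-driven splitter would be needed for those.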

Slide 29

Slide 29 text

Dictionary System
[Architecture diagram. Layers: UI, Service, Computing, Data. Components: DS.DATASET, Dictionary UI, DS.DATASET Job Server, dictionary API, Hadoop, Trino (Presto), Data Mart, dictionary, data aggregation]

Slide 30

Slide 30 text

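Normalization Processing for Keywords
Steps: keywords, tokenize, labeling
Keywords:
- “製品 激安” (product, dirt cheap)
- “製品 価格 商品A” (product, price, Item A)
- “口コミ 製品 商品B” (reviews, product, Item B)
- “製品 人気” (product, popular)
- “商品A げきやす” (Item A, "dirt cheap" written in hiragana)
Tokens and labels shown in the diagram: “価格” (price), “激安” (dirt cheap), “商品A” (Item A), “口コミ” (reviews), “商品B” (Item B), “人気” (popular), “評価” (evaluation)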

Slide 31

Slide 31 text

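Normalization Processing for Keywords
Steps: keywords, tokenize, labeling
Keywords:
- “製品 激安” (product, dirt cheap)
- “製品 価格 商品A” (product, price, Item A)
- “口コミ 製品 商品B” (reviews, product, Item B)
- “製品 人気” (product, popular)
- “商品A げきやす” (Item A, "dirt cheap" written in hiragana)
Tokens and labels shown in the diagram: “価格” (price), “激安” (dirt cheap), “商品A” (Item A), “口コミ” (reviews), “商品B” (Item B), “人気” (popular), “評価” (evaluation)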

Slide 32

Slide 32 text

Summary of Labels

Label groups:
- 商品 (product): “商品A”, “商品B”, “商品C”
- 価格 (price): “価格”

Virtual table:
logdate   region  label1  group 1  metrics
20221118  tokyo   “価格”  “商品A”  1200
20221118  tokyo   “価格”  “商品B”  600
20221118  osaka   “価格”  “商品A”  800
20221118  nagoya  “価格”  “商品A”  600
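To show why a flat virtual table is easy to analyze, the sketch below sums the metrics column of the four rows on this slide along different columns. The tuple layout, the `sum_metrics` helper and the variable names are assumptions for illustration; in the system described, such aggregation would presumably run as a query against the data mart rather than in application code.

```python
from collections import defaultdict

# Rows of the virtual table from the slide: (logdate, region, label1, group1, metrics).
VIRTUAL_TABLE = [
    ("20221118", "tokyo", "価格", "商品A", 1200),
    ("20221118", "tokyo", "価格", "商品B", 600),
    ("20221118", "osaka", "価格", "商品A", 800),
    ("20221118", "nagoya", "価格", "商品A", 600),
]


def sum_metrics(rows, key):
    """Sum the metrics column, grouped by whatever key(row) returns."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)


# Price-related searches per region, and per item within the product group.
by_region = sum_metrics(VIRTUAL_TABLE, key=lambda row: row[1])
by_item = sum_metrics(VIRTUAL_TABLE, key=lambda row: row[3])
```

Because labels and groups are already columns, changing the question only means changing the grouping key, not re-running keyword pre-processing.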

Slide 33

Slide 33 text

Realization of Pre-processing
- Construct rules to make pre-processing easier for analysts
- Implement a virtual table that can be handled easily in analysis

Slide 34

Slide 34 text

Response to Analytical Issues
- Overall system configuration
  - Modify the configuration depending on the application, either generic or dedicated
  - Ensure that the master data is in one place
- Eliminating the need for pre-processing
  - Create the necessary rules
  - Design the table structure deliberately so that it can be used for analysis

Slide 35

Slide 35 text

Future Prospects
- Further optimization of the system
- Strengthening of the rule base and introduction of AI
  - Under consideration, including AutoFM

Slide 36

Slide 36 text

Thank you