10billion user analytics architecture using BigQuery

A large-scale analytics platform for gaining insights from the behavior of 10 billion users: making the most of BigQuery. @kargo113, Engineer, PLAID, Inc.

Agenda
1. About KARTE
2. KARTE Insight
3. Analytics platform architecture
4. Key points for using BigQuery
5. Summary

About KARTE 01

PLAID, Inc.
Address: GINZA SIX 10F, 6-10-1 Ginza, Chuo-ku, Tokyo
Founded: October 2011
Employees: 190 (as of September 30, 2020)
Capital: 2,632,013,778 yen (as of March 31, 2021)

Customer Experience Platform karte.io

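• 10 billion UU: cumulative users *1
• Over 105,000: events tracked per second *3
• 0.x sec per analysis: analysis speed
• 2.13 trillion yen: annual analyzed transaction volume *2
• 180+ PB: monthly analyzed data volume
• 8+ PB: stored data volume
*1 Unique users analyzed from launch through February 2021.
*2 Analyzed transaction volume in the EC (e-commerce) domain; single-year figure for March 2020 through February 2021.
*3 Analyzed events per second (all measured events, including views, purchases, and clicks; peak value for March 2021).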

Selected companies using KARTE

How KARTE is used

KARTE Insight 02

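Raising the "resolution" of the customer to drive business growth
Giving people (companies) a "customer's-eye view" and raising the "resolution" at which they see their customers leads to further business growth.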

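① User list: visualizes in real time what is happening "right at this moment" for users visiting the site. Per-user stories make it possible to grasp intuitively the experiences and changes in emotion that lead up to the present.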

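② KARTE Live: get to "knowing the customer" instantly and vividly. Watching customer behavior as video makes it possible to understand the actual experience and discover insights.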

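③ Customer journey exploration (β): analyzing the "behavior patterns" that tend to lead to a goal (purchase, etc.) on the basis of transition probabilities reveals the "why".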
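As an illustration of what "transition probabilities" between behaviors can mean, here is a minimal sketch that estimates the probability of the next event given the current one from ordered event sequences. The function name and the sample sessions are hypothetical; the deck does not describe how KARTE actually computes these values.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next event | current event) from ordered event sequences."""
    pair_counts = defaultdict(Counter)
    for events in sessions:
        # Count each adjacent (current, next) pair within one session.
        for current, nxt in zip(events, events[1:]):
            pair_counts[current][nxt] += 1

    probabilities = {}
    for current, nexts in pair_counts.items():
        total = sum(nexts.values())
        probabilities[current] = {nxt: n / total for nxt, n in nexts.items()}
    return probabilities


# Hypothetical sessions; the event names mirror the sample rows in the deck.
sessions = [
    ["view", "cart", "buy"],
    ["view", "leave"],
    ["view", "cart", "leave"],
    ["view", "view", "cart", "buy"],
]
probs = transition_probabilities(sessions)
```

In this toy data, `probs["cart"]["buy"]` is the share of cart events that are followed directly by a purchase, which is the kind of figure that helps rank paths toward a goal.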

④ Behavior chain (β): drilling down, one user at a time, into users who followed a specific behavior pattern yields "new discoveries".

Proprietary + Confidential Analytics platform architecture 03

KARTE architecture (diagram). Services shown: Route53, Amazon EC2 Autoscaling, Anthos clusters on AWS, GCE Autoscaling, GKE Autoscaling, Pub/Sub, Cloud Bigtable, and BigQuery (the analytics platform). Role labels shown: Track, Analyze, Admin, Ops.

Analytics platform view (KARTE Insight) (diagram). Components: Track (GCE Autoscaling), Analyze (GCE Autoscaling), Pub/Sub, Cloud Bigtable, KARTE Pockyevent (BigQuery), Session Summary (BigQuery), Batch Job (GKE), Batch Job (Dataflow), Reference Table (Cloud Spanner), Cache Tables By Conditions (BigQuery), ML Pipeline (AI Platform Pipeline), Admin (GKE Autoscaling). Actors: User, Client, Analyst.

Analytics platform diagram annotated with feature names: user list, user summary, customer journey exploration.

Key points for using BigQuery 04

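Key points for using BigQuery
① Sharding & Partitioning: sharded tables and partitioned tables coexist, each used where its pros and cons fit.
② Fast search through caching: queries go through intermediate tables so that the query cache can be used.
③ Column optimization: columns are optimized according to the data the analysis needs.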

Sharding / Partitioning
• Sharding: KARTE Pockyevent (BigQuery), one table per day with the suffixes _20210401, _20210402, _20210403, _20210404
• Partitioning: Session Summary (BigQuery), one table with the partitions $20210401, $20210402, $20210403, $20210404
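The difference between the two naming schemes can be sketched as follows. The base names `pockyevent` and `session_summary` are hypothetical placeholders; only the `_YYYYMMDD` suffix and the `$YYYYMMDD` partition decorator come from the slide.

```python
from datetime import date, timedelta


def shard_table_name(base, day):
    """Sharded layout: one separate table per day, named <base>_YYYYMMDD."""
    return f"{base}_{day:%Y%m%d}"


def partition_decorator(table, day):
    """Partitioned layout: a single table whose daily partition is <table>$YYYYMMDD."""
    return f"{table}${day:%Y%m%d}"


def days(start, count):
    """Return `count` consecutive dates beginning at `start`."""
    return [start + timedelta(days=i) for i in range(count)]


# Hypothetical base names, used only for illustration.
shards = [shard_table_name("pockyevent", d) for d in days(date(2021, 4, 1), 4)]
partitions = [partition_decorator("session_summary", d) for d in days(date(2021, 4, 1), 4)]
```

With shards, each name refers to an independent table with its own schema; with partitions, every decorator points into the same table and therefore the same schema.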

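Pros / Cons
Sharding:
• Multiple tables
• Flexible schema definition per table
• Poor performance efficiency with clustered tables
Partitioning:
• A single table
• The schema definition must be uniform across partitions
• Good performance efficiency with clustered tables
We want the advantages of both...!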

Sharding & Partitioning Architecture (diagram). KARTE Pockyevent (BigQuery) is labeled Sharding and Session Summary (BigQuery) is labeled Partitioning. Other components shown: Analyst, Client, Batch Job (GKE), Batch Job (Dataflow), Reference Table (Cloud Spanner), Admin (GKE Autoscaling).

Sharding & Partitioning Architecture (diagram, Sharding highlighted): prioritizes ease of change.

Prioritizing ease of change
sync_date | user_id | event_name | values | segments | shorten_segments | item.category_name | item.brand_name
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AABB", "price": 5000, ...}}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | T-shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "BBB", "price": 10000, ...}}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | Bottoms | Blocks
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x,C8s | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x,C8s | |
Annotations on the slide: Add columns, Remove columns.

Sharding & Partitioning Architecture (diagram, Partitioning highlighted): prioritizes performance.

Prioritizing performance
sync_date | api_key | user_id | session_id | total_active_time | view_category_name_top1 | referrer_url | page_type
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0001 | session-8fh | 30 | datahub | google | top
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0001 | session-tgw | 54 | datahub | google | top
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0002 | session-6hh | 120 | datahub | yahoo | item_detail
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0003 | session-g54 | 60 | datahub | yahoo | item_detail
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0004 | lsession-h43 | 10 | blocks | instagram |
Annotations on the slide: "Partition by" and "Clustered by" mark columns of this table.
An aggregate table covering all clients × all users:
- 60 million rows / day
- 2 billion rows / month
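A partitioned and clustered table of this shape could be declared with BigQuery DDL roughly as below. This is a sketch under assumptions: the table name, the column types, and the choice of `sync_date` as the partition column and `api_key`, `user_id` as clustering columns are guesses for illustration, since the transcript does not preserve which columns the "Partition by" and "Clustered by" labels point at.

```python
def partitioned_clustered_ddl(table, columns, partition_column, cluster_columns):
    """Build a BigQuery CREATE TABLE statement with daily partitions and clustering."""
    column_defs = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Column names come from the slide; the types are assumed.
columns = [
    ("sync_date", "TIMESTAMP"),
    ("api_key", "STRING"),
    ("user_id", "STRING"),
    ("session_id", "STRING"),
    ("total_active_time", "INT64"),
    ("view_category_name_top1", "STRING"),
    ("referrer_url", "STRING"),
    ("page_type", "STRING"),
]

# "analytics.session_summary" is a hypothetical dataset.table name.
ddl = partitioned_clustered_ddl(
    "analytics.session_summary", columns, "sync_date", ["api_key", "user_id"]
)
```

At 60 million rows per day, a 30-day month comes to about 1.8 billion rows, which matches the roughly 2 billion rows per month quoted on the slide.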

Broadening the range of data use (analytics platform diagram). Annotations: fast search and display via queries; User Summary; turned into segments and used for customer engagement.

Fast search through caching: overview of the overall configuration (diagram). Components: Client, Admin (GKE Autoscaling), KARTE Pockyevent (BigQuery), Intermediate Tables By Conditions (BigQuery), GCE.

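The problem with querying log data: free-form search is possible, but the search cost is high. (Diagram: Client, Admin (GKE Autoscaling), KARTE Pockyevent (BigQuery).)
• Conditions can be specified freely.
• Because it is log data, the query cache does not take effect.
• Results take a long time to display.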

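Using intermediate tables (diagram: Client, Admin (GKE Autoscaling), KARTE Pockyevent (BigQuery), Intermediate Tables By Conditions (BigQuery), GCE)
• An intermediate table is prepared for each search condition.
• Search conditions are saved and matched against later requests.
• Each intermediate table is created per search condition, with an expiration set.
• The query cache is used transparently.
• Conditions can still be specified freely; if there is no cached table, the request goes to the log data.
• Fast search is possible under a wide variety of conditions.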
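The save-and-match flow for search conditions can be sketched in memory as follows. Everything here is illustrative: the class, the `cond_` naming scheme, and the dictionary standing in for BigQuery intermediate tables are assumptions, not the actual KARTE implementation.

```python
import hashlib
import json
import time


class IntermediateTableCache:
    """In-memory stand-in for per-condition intermediate tables with an expiration."""

    def __init__(self, ttl_seconds, run_query):
        self.ttl_seconds = ttl_seconds
        self.run_query = run_query  # callable standing in for a query on the log data
        self.tables = {}            # table name -> (expires_at, rows)

    @staticmethod
    def table_name(conditions):
        # Canonical JSON, so the same conditions always map to the same table name.
        canonical = json.dumps(conditions, sort_keys=True, separators=(",", ":"))
        return "cond_" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def search(self, conditions, now=None):
        """Return (table name, rows, served_from_intermediate_table)."""
        now = time.time() if now is None else now
        name = self.table_name(conditions)
        entry = self.tables.get(name)
        if entry is not None and entry[0] > now:
            return name, entry[1], True
        rows = self.run_query(conditions)  # no valid table: fall back to the log data
        self.tables[name] = (now + self.ttl_seconds, rows)
        return name, rows, False


calls = []


def fake_log_query(conditions):
    # Hypothetical replacement for the expensive query against the log table.
    calls.append(conditions)
    return [{"user_id": "user-0001"}]


cache = IntermediateTableCache(ttl_seconds=3600, run_query=fake_log_query)
first = cache.search({"event_name": "buy", "page_type": "top"}, now=0)
second = cache.search({"page_type": "top", "event_name": "buy"}, now=10)
third = cache.search({"page_type": "top", "event_name": "buy"}, now=4000)
```

The second search reuses the table created by the first even though the keys arrive in a different order; the third runs after the expiration and goes back to the log data. In BigQuery itself, reading an unchanged intermediate table is what lets the built-in query cache apply.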

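Intermediate Table vs Materialized View
Why not use materialized views?
Intermediate table:
• Must be managed table by table
• Up to 50,000 per dataset
• No restrictions on queries
Materialized View:
• Always kept up to date without any special management
• Up to 20 per dataset
• Restricted queries
To guarantee "freedom of search", ordinary intermediate tables were adopted.
• Each individual table is very small
• Tables are managed by assigning a specific expiration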

Making in-house analysis more efficient (analytics platform diagram). Highlighted: more efficient analysis; ML Pipeline (AI Platform Pipeline).

The analysis problem with Pockyevent: highly flexible data is hard to analyze
sync_date | user_id | event_name | values
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAA", "price": 5000, "category_name": "T-Shirt", "brand": "Datahub", ...}}}
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAB", "price": 10000, "category_name": "Bottoms", "brand": "Blocks", ...}}}
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}}
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}}
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}}
Extraction at query time: JSON_EXTRACT_SCALAR(values, '$.view.item.category_name')
The scan volume also balloons; ** TB is a common level.
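To make the cost concrete, here is a rough Python stand-in for what a `JSON_EXTRACT_SCALAR` call with a simple `$.a.b.c` path has to do for every row: parse the entire `values` document just to read one field. It is a simplified sketch, not BigQuery's implementation, and it supports only dotted key paths.

```python
import json


def json_extract_scalar(json_text, path):
    """Simplified stand-in for JSON_EXTRACT_SCALAR, limited to '$.a.b.c' paths."""
    if not path.startswith("$."):
        raise ValueError("this sketch only supports simple '$.a.b' paths")
    node = json.loads(json_text)  # the whole document is parsed for every row
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    if node is None or isinstance(node, (dict, list)):
        return None  # only scalar values are returned
    return node if isinstance(node, str) else json.dumps(node)


# Sample payload modeled on the slide, with the elided fields left out.
values = (
    '{"view": {"item": {"item_id": "AAA", "price": 5000, '
    '"category_name": "T-Shirt", "brand": "Datahub"}}}'
)
category = json_extract_scalar(values, "$.view.item.category_name")
price = json_extract_scalar(values, "$.view.item.price")
missing = json_extract_scalar(values, "$.cart.price")
```

Because the field lives inside one large string column, a query cannot read the category without scanning the full `values` payload of every row it touches.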

Cutting out columns: decompose the data and cut out columns according to the analysis use case
sync_date | user_id | event_name | values | item.category_name | item.brand_name
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAA", "price": 5000, "category_name": "T-Shirt", "brand": "Datahub", ...}}} | T-shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAB", "price": 10000, "category_name": "Bottoms", "brand": "Blocks", ...}}} | Bottoms | Blocks
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}} | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}} | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}} | |
• Scan volume: 128 GB → 7.1 GB
• Slot time consumed: 1 hr 19 min → 27 min
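Cutting columns out ahead of time can be sketched as a small transformation applied once per row, so that later queries read narrow columns instead of the whole JSON payload. The mapping below is an assumption for illustration: it fills the two columns only from `view.item`, whereas the slide also shows values on cart and buy rows whose source fields are elided.

```python
import json

# Output column -> key path inside `values` (assumed mapping, based on the view rows).
PROMOTED_COLUMNS = {
    "item.category_name": ("view", "item", "category_name"),
    "item.brand_name": ("view", "item", "brand"),
}


def dig(node, keys):
    """Follow a key path through nested dicts; return None if any key is missing."""
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def cut_out_columns(row):
    """Return a copy of the row with the promoted columns filled in from `values`."""
    parsed = json.loads(row["values"])
    out = dict(row)
    for column, keys in PROMOTED_COLUMNS.items():
        out[column] = dig(parsed, keys)
    return out


view_row = {
    "user_id": "user-0001",
    "event_name": "view",
    "values": '{"view": {"item": {"item_id": "AAA", "price": 5000, '
              '"category_name": "T-Shirt", "brand": "Datahub"}}}',
}
leave_row = {
    "user_id": "user-0001",
    "event_name": "leave",
    "values": '{"leave": {"url": "https://****", "spend_time": 300}}',
}
flat_view = cut_out_columns(view_row)
flat_leave = cut_out_columns(leave_row)
```

The JSON is parsed once when the row is written rather than on every query, which is the idea behind the drop from 128 GB to 7.1 GB scanned (about 18 times less) reported on the slide.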

Cutting out columns: the columns used most often in analysis are surveyed regularly in-house and adjusted
Field name | Type | Mode
view.session_spend_time | INTEGER | NULLABLE
view.auto_page_group | STRING | NULLABLE
view.access.in_referrer.url | STRING | NULLABLE
view.access.os.name | STRING | NULLABLE
_merge_user.source_user_id | STRING | NULLABLE
... | ... | ...

Integration with the ML Pipeline (analytics platform diagram). Highlighted as the ML platform: ML Pipeline (AI Platform Pipeline).

Integration with the ML Pipeline (diagram)
• ML Pipeline System: Experiment (AI Platform Notebook), ML Pipeline (AI Platform Pipelines), Batch Train / Prediction (AI Platform), Feature Store (Firestore), Feature Store (BigQuery), KARTE Pockyevent (BigQuery), Admin (GKE Autoscaling)
• ML Realtime System: Prediction Job (GKE AutoPilot), Event Data (Cloud Spanner), API trigger (Cloud Pub/Sub), Data Process (Dataflow), Prediction Result (Cloud Bigtable), Event Data (Pub/Sub)
• Core System: Core Server (Compute Engine), Admin (GKE Autoscaling), Client, KARTE Pockyevent (BigQuery)

Summary 05

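Today's summary
• A large-scale analytics platform centered on BigQuery
◦ Sharding & Partitioning
◦ Fast search through caching
◦ Column optimization
For an analytics platform, a development and operations setup that can keep improvements going is especially important.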
