10 billion user analytics architecture using BigQuery

Slide 1

Slide 1 text

A large-scale analytics platform for gaining insights from the behavior of 10 billion users: making the most of BigQuery. @kargo113, Engineer, PLAID, Inc.

Slide 2

Slide 2 text

Agenda
1. About KARTE
2. KARTE Insight
3. Analytics platform architecture
4. Key points in using BigQuery
5. Summary

Slide 3

Slide 3 text

About KARTE 01

Slide 4

Slide 4 text

PLAID, Inc.
Address: GINZA SIX 10F, 6-10-1 Ginza, Chuo-ku, Tokyo
Founded: October 2011
Employees: 190 (as of September 30, 2020)
Capital: 2,632,013,778 yen (as of March 31, 2021)

Slide 5

Slide 5 text

Customer Experience Platform karte.io

Slide 6

Slide 6 text

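10 billion UU: cumulative number of users (*1)
Over 105,000: events tracked per second (*3)
0.x seconds per analysis: analysis speed
2.13 trillion yen: annual analyzed transaction value (*2)
180+ PB: data analyzed per month
8+ PB: data stored
*1 Unique users analyzed from launch through February 2021.
*2 Analyzed transaction value in the e-commerce domain, for the single year from March 2020 through February 2021.
*3 Events analyzed per second (all measured events, including views, purchases, and clicks; peak value in March 2021).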

Slide 7

Slide 7 text

A selection of companies that have adopted KARTE

Slide 8

Slide 8 text

How KARTE is used

Slide 9

Slide 9 text

KARTE Insight 02

Slide 10

Slide 10 text

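Raising customer "resolution" to drive business growth
By giving people (companies) a "customer's-eye view" and raising the "resolution" at which they see their customers, KARTE Insight helps lead to further business growth.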

Slide 11

Slide 11 text

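(1) User list
Visualizes in real time what users visiting the site are doing "at this very moment". Per-user stories make it possible to intuitively grasp the changes in experience and emotion that led up to the present.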

Slide 12

Slide 12 text

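(2) KARTE Live
Getting to "know the customer", instantly and vividly. Watching customer behavior as video makes it possible to understand the actual experience and discover insights.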

Slide 13

Slide 13 text

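(3) Customer journey exploration (beta)
Analyzing the "behavior patterns" that tend to lead to a goal (a purchase, etc.) on the basis of transition probabilities helps surface the "why".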
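The slide does not describe how KARTE computes transition probabilities. As a rough illustration of the general idea only, the sketch below estimates first-order transition probabilities (the chance that one event is followed directly by another) from ordered per-session event lists. The function name and the sample sessions are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next_event | current_event) from ordered event lists."""
    counts = defaultdict(Counter)
    for events in sessions:
        # Count each adjacent (current, next) pair within a session.
        for current, nxt in zip(events, events[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[current] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


# Hypothetical sessions, using event names that appear later in the deck.
sessions = [
    ["view", "cart", "buy"],
    ["view", "leave"],
    ["view", "cart", "leave"],
    ["view", "view", "cart", "buy"],
]
probs = transition_probabilities(sessions)
print(probs["cart"])  # share of "cart" events followed by each next event
```

Sorting the outgoing transitions of each event by probability gives a simple view of which paths most often continue toward the goal event.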

Slide 14

Slide 14 text

(4) Behavior chain (beta)
Digging into users who followed a specific behavior pattern, one by one, yields "new discoveries".

Slide 15

Slide 15 text

Proprietary + Confidential
Analytics platform architecture 03

Slide 16

Slide 16 text

KARTE architecture
Diagram labels: AWS; Route53; Track (Amazon EC2, Autoscaling); Analyze (Amazon EC2, Autoscaling); Admin (Anthos clusters on AWS); Track (GCE, Autoscaling); Analyze (GCE, Autoscaling); Admin (GKE, Autoscaling); GCE; Pub/Sub; Cloud Bigtable; Ops; analytics platform (BigQuery).

Slide 17

Slide 17 text

The analytics platform view (KARTE Insight)
Diagram labels: User; Track (GCE, Autoscaling); Analyze (GCE, Autoscaling); Pub/Sub; Cloud Bigtable; KARTE Pockyevent (BigQuery); Batch Job (GKE); Batch Job (Dataflow); Session Summary (BigQuery); Reference Table (Cloud Spanner); Cache Tables By Conditions (BigQuery); ML Pipeline (AI Platform Pipeline); Admin (GKE, Autoscaling); Client; Analyst.

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Key points in using BigQuery 04

Slide 21

Slide 21 text

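Key points in using BigQuery
(1) Sharding & Partitioning: let sharded tables and partitioned tables coexist, playing to the pros and cons of each.
(2) Fast search through caching: route searches through intermediate tables so that the query cache can be used.
(3) Column optimization: optimize columns according to the data that analysis needs.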

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Sharding / Partitioning
Sharding: KARTE Pockyevent (BigQuery), one table per day with suffixes _20210401, _20210402, _20210403, _20210404.
Partitioning: Session Summary (BigQuery), one table with partitions $20210401, $20210402, $20210403, $20210404.
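To make the two naming schemes concrete, here is a minimal sketch of how a date maps to a shard name and to a partition decorator, and how a date range might be queried in each case. The dataset and table names (`karte.pockyevent`, `karte.session_summary`) and the choice of `sync_date` as the partitioning column are assumptions for illustration; the slides do not give them. The wildcard table with `_TABLE_SUFFIX` and the `$YYYYMMDD` decorator are standard BigQuery features.

```python
from datetime import date, timedelta


def shard_name(base, day):
    """Date-sharded table: one physical table per day, e.g. base_20210401."""
    return f"{base}_{day:%Y%m%d}"


def partition_decorator(table, day):
    """Partitioned table: one table, one partition per day, e.g. table$20210401."""
    return f"{table}${day:%Y%m%d}"


def sharded_range_query(base, start, end):
    """Query several daily shards at once through a wildcard table."""
    return (
        f"SELECT * FROM `{base}_*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


def partitioned_range_query(table, column, start, end):
    """Query a partitioned table; the filter on the column limits the partitions read."""
    upper = end + timedelta(days=1)  # half-open upper bound
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE {column} >= TIMESTAMP('{start:%Y-%m-%d}') "
        f"AND {column} < TIMESTAMP('{upper:%Y-%m-%d}')"
    )


start, end = date(2021, 4, 1), date(2021, 4, 4)
print(shard_name("pockyevent", start))
print(partition_decorator("session_summary", start))
print(sharded_range_query("karte.pockyevent", start, end))
print(partitioned_range_query("karte.session_summary", "sync_date", start, end))
```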

Slide 24

Slide 24 text

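Pros / Cons
Sharding: multiple tables; the schema can be defined flexibly for each table; clustered tables deliver poor performance efficiency.
Partitioning: a single table; the schema must be uniform across partitions; clustered tables deliver good performance efficiency.
We want the benefits of both!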

Slide 25

Slide 25 text

Sharding & Partitioning Architecture
Diagram labels: Analyst; KARTE Pockyevent (BigQuery), marked Sharding; Session Summary (BigQuery), marked Partitioning; Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client.

Slide 26

Slide 26 text

Sharding & Partitioning Architecture
Sharding: ease of change is the priority.
Diagram labels: Analyst; KARTE Pockyevent (BigQuery); Session Summary (BigQuery); Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client.

Slide 27

Slide 27 text

Ease of change is the priority (slide annotations: Add columns / Remove columns)
sync_date | user_id | event_name | values | segments | shorten_segments | item.category_name | item.brand_name
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AABB", "price": 5000, ...}}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "BBB", "price": 10000, ...}}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | Bottoms | Blocks
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x,C8s | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}} | 927ccb74-83fa-4e46-b317-702f230a9923,... | A5n,B6x,C8s | |
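Because every daily shard is its own table, a column can be added or dropped from one day to the next without touching older shards. The sketch below only illustrates that idea by diffing the column lists of two shards. The shard names and the choice of which columns differ between the two days are hypothetical; the slide shows the "Add columns / Remove columns" annotations but not an actual change history.

```python
def schema_diff(old_columns, new_columns):
    """Return the columns added and removed between two shard schemas."""
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical per-shard schemas (dotted names stand for nested fields).
shard_schemas = {
    "pockyevent_20210401": [
        "sync_date", "user_id", "event_name", "values", "segments",
    ],
    "pockyevent_20210402": [
        "sync_date", "user_id", "event_name", "values",
        "shorten_segments", "item.category_name", "item.brand_name",
    ],
}

diff = schema_diff(
    shard_schemas["pockyevent_20210401"],
    shard_schemas["pockyevent_20210402"],
)
print(diff)
```

With a single partitioned table, the same change would have to be applied to one shared schema, which is the trade-off listed on the Pros / Cons slide.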

Slide 28

Slide 28 text

Sharding & Partitioning Architecture
Partitioning: performance is the priority.
Diagram labels: Analyst; KARTE Pockyevent (BigQuery); Session Summary (BigQuery); Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client.

Slide 29

Slide 29 text

Performance is the priority (slide annotations: Clustered by / Partition by)
sync_date | api_key | user_id | session_id | total_active_time | view_category_name_top1 | referrer_url | page_type
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0001 | session-8fh | 30 | datahub | google | top
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0001 | session-tgw | 54 | datahub | google | top
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0002 | session-6hh | 120 | datahub | yahoo | item_detail
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0003 | session-g54 | 60 | datahub | yahoo | item_detail
2021-04-01 20:33:17.689 UTC | 738adfgb | user-0004 | lsession-h43 | 10 | blocks | instagram |
An aggregate table covering all clients and all users:
- 60 million rows / day
- 2 billion rows / month
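The transcript keeps the "Partition by" and "Clustered by" annotations but not which columns they point at. As a sketch under stated assumptions, the helper below builds BigQuery DDL for a table partitioned on the date of `sync_date` and clustered on `api_key` and `user_id`. Those column choices, the column types, and the table name `karte.session_summary` are guesses for illustration, not the author's confirmed design.

```python
# Column names come from the slide; the types are assumed.
COLUMNS = [
    ("sync_date", "TIMESTAMP"),
    ("api_key", "STRING"),
    ("user_id", "STRING"),
    ("session_id", "STRING"),
    ("total_active_time", "INT64"),
    ("view_category_name_top1", "STRING"),
    ("referrer_url", "STRING"),
    ("page_type", "STRING"),
]


def create_table_ddl(table, columns, partition_column, cluster_columns):
    """Build a CREATE TABLE statement with daily partitions and clustering."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = create_table_ddl(
    "karte.session_summary",       # hypothetical table name
    COLUMNS,
    partition_column="sync_date",  # assumed partitioning column
    cluster_columns=["api_key", "user_id"],  # assumed clustering columns
)
print(ddl)
```

Clustering on a client key would keep one client's rows close together inside each daily partition, which is the kind of benefit the Pros / Cons slide attributes to partitioned tables.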

Slide 30

Slide 30 text

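Broadening how the data can be used
Diagram labels: Analyst; KARTE Pockyevent (BigQuery); Session Summary (BigQuery); User Summary; Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client.
Annotations: search and display quickly with queries; turn the results into segments and use them for customer engagement.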

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Fast search through caching: overview of the overall configuration
Diagram labels: Client; Admin (GKE, Autoscaling); GCE; KARTE Pockyevent (BigQuery); Intermediate Tables By Conditions (BigQuery).

Slide 33

Slide 33 text

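The problem with querying log data
Searches can be as free-form as you like, but the cost of each search is high.
Diagram labels: Client; Admin (GKE, Autoscaling); KARTE Pockyevent (BigQuery).
Annotations: conditions can be specified freely; because this is log data, the query cache does not take effect; results take a long time to display.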

Slide 34

Slide 34 text

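Using intermediate tables
An intermediate table is prepared for each search condition.
Diagram labels: Client; Admin (GKE, Autoscaling); GCE; KARTE Pockyevent (BigQuery); Intermediate Tables By Conditions (BigQuery).
Annotations: conditions can be specified freely; search conditions are saved and matched; an intermediate table is created per search condition (with an Expiration set); the query cache is used transparently; if there is no cache, the query goes to the log table; fast search is possible under a wide variety of conditions.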
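The flow on this slide (match the search condition, reuse an intermediate table if one exists, otherwise fall through to the log table and create one with an expiration) can be sketched as follows. This is an in-memory simulation for illustration, not the actual KARTE implementation: the table naming by SHA-256 hash, the `IntermediateTables` class, and the one-hour TTL are all assumptions, and a real system would create BigQuery tables rather than store rows in a dict.

```python
import hashlib
import json


def table_name_for(conditions, prefix="intermediate_"):
    """Derive a stable table name from a search condition (key order ignored)."""
    canonical = json.dumps(conditions, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return prefix + digest


class IntermediateTables:
    """Simulates per-condition intermediate tables with an expiration time."""

    def __init__(self, run_log_query, ttl_seconds):
        self._run_log_query = run_log_query  # expensive query on the log table
        self._ttl = ttl_seconds
        self._tables = {}  # table name -> (expires_at, rows)

    def search(self, conditions, now):
        """Return (rows, hit); hit is True when an unexpired table was reused."""
        name = table_name_for(conditions)
        entry = self._tables.get(name)
        if entry is not None and entry[0] > now:
            return entry[1], True
        rows = self._run_log_query(conditions)  # no usable table: hit the logs
        self._tables[name] = (now + self._ttl, rows)
        return rows, False


log_queries = []


def fake_log_query(conditions):
    """Stand-in for the slow query against the raw event log."""
    log_queries.append(conditions)
    return [{"user_id": "user-0001"}]


store = IntermediateTables(fake_log_query, ttl_seconds=3600)
_, hit1 = store.search({"event_name": "buy", "min_revenue": 5000}, now=0)
_, hit2 = store.search({"min_revenue": 5000, "event_name": "buy"}, now=10)
_, hit3 = store.search({"event_name": "buy", "min_revenue": 5000}, now=4000)
print(hit1, hit2, hit3, len(log_queries))
```

In BigQuery itself, repeated queries against an unchanged intermediate table are the kind of query the built-in query cache can serve, which matches the slide's note that the cache is used transparently.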

Slide 35

Slide 35 text

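Intermediate Table vs Materialized View
Why not use Materialized Views?
Intermediate table: each table has to be managed; up to 50,000 per dataset; no restrictions on queries.
Materialized View: always kept up to date with no special management; up to 20 per dataset; restrictions on queries.
To guarantee "freedom of search", ordinary intermediate tables were adopted.
● Each individual table is very small
● Tables are managed by giving them a specific expiration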

Slide 36

Slide 36 text

Slide 37

Slide 37 text

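Making in-house analysis more efficient
Diagram labels: Analyst; KARTE Pockyevent (BigQuery); Session Summary (BigQuery); Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client; ML Pipeline (AI Platform Pipeline).
Highlighted area: analysis efficiency.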

Slide 38

Slide 38 text

Analytical problems with Pockyevent
Highly flexible data is hard to analyze.
sync_date | user_id | event_name | values
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAA", "price": 5000, "category_name": "T-Shirt", "brand": "Datahub", ...}}}
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAB", "price": 10000, "category_name": "Bottoms", "brand": "Blocks", ...}}}
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}}
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}}
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}}
Reading a single field requires JSON_EXTRACT_SCALAR(values, '$.view.item.category_name').
The volume of data scanned also balloons; scans of ** TB are commonplace.
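To show why this is awkward, the sketch below mimics what a JSON_EXTRACT_SCALAR call has to do for every row: parse the whole `values` string and walk a path just to read one field. It is a rough Python stand-in that supports only simple `$.a.b.c` paths, not a reimplementation of the BigQuery function, and the sample rows are the slide's rows with the elided parts left out.

```python
import json


def json_extract_scalar(json_text, path):
    """Return the scalar at a simple '$.a.b.c' path as a string, else None."""
    if not path.startswith("$."):
        raise ValueError("this sketch supports only simple '$.a.b' paths")
    node = json.loads(json_text)  # the whole payload is parsed for one field
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    if node is None or isinstance(node, (dict, list)):
        return None  # objects and arrays are not scalars
    if isinstance(node, bool):
        return "true" if node else "false"
    return str(node)


view_row = (
    '{"view": {"item": {"item_id": "AAA", "price": 5000, '
    '"category_name": "T-Shirt", "brand": "Datahub"}}}'
)
cart_row = '{"cart": {"price": 5000, "status": true, "item_ids": ["AABB"]}}'

category = json_extract_scalar(view_row, "$.view.item.category_name")
missing = json_extract_scalar(cart_row, "$.view.item.category_name")
print(category, missing)
```

Since the field lives inside one large string column, a query that needs only `category_name` still has to read the full `values` column for every row it touches, which is consistent with the large scan volumes noted on the slide.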

Slide 39

Slide 39 text

Extracting columns
Break the payload down and extract columns according to how the data is analyzed.
sync_date | user_id | event_name | values | item.category_name | item.brand_name
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAA", "price": 5000, "category_name": "T-Shirt", "brand": "Datahub", ...}}} | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | view | {"view": {"item": {"item_id": "AAB", "price": 10000, "category_name": "Bottoms", "brand": "Blocks", ...}}} | Bottoms | Blocks
2021-04-01 20:33:17.689 UTC | user-0001 | cart | {"cart": {"price": 5000, "status": true, "item_ids": ["AABB"], ...}} | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | buy | {"buy": {"transaction_id": "5fdg3", "revenue": 5000, ...}} | T-Shirt | Datahub
2021-04-01 20:33:17.689 UTC | user-0001 | leave | {"leave": {"url": "https://****", "spend_time": 300, ...}} | |
● Data scanned: 128 GB → 7.1 GB
● Slot time consumed: 1 hr 19 min → 27 min
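The two before/after figures on this slide translate into improvement factors with simple arithmetic. The sketch below only restates the slide's numbers; the variable names are made up for the example.

```python
# Figures taken from the slide.
scan_before_gb, scan_after_gb = 128.0, 7.1
slot_before_min, slot_after_min = 79, 27  # 1 hr 19 min = 79 min

scan_ratio = scan_before_gb / scan_after_gb    # how many times less data is scanned
slot_ratio = slot_before_min / slot_after_min  # how many times less slot time is used

scan_saving_pct = (1 - scan_after_gb / scan_before_gb) * 100
slot_saving_pct = (1 - slot_after_min / slot_before_min) * 100

print(f"scan: {scan_ratio:.1f}x less ({scan_saving_pct:.1f}% saved)")
print(f"slot time: {slot_ratio:.1f}x less ({slot_saving_pct:.1f}% saved)")
```

That works out to roughly 18 times less data scanned and a little under 3 times less slot time for the example query.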

Slide 40

Slide 40 text

Extracting columns
The columns most often used in analysis are surveyed in-house on a regular basis and adjusted.
Field name | Type | Mode
view.session_spend_time | INTEGER | NULLABLE
view.auto_page_group | STRING | NULLABLE
view.access.in_referrer.url | STRING | NULLABLE
view.access.os.name | STRING | NULLABLE
_merge_user.source_user_id | STRING | NULLABLE
... | ... | ...
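A list like this can drive the extraction step: given the field paths and types, pull each one out of the event payload once, at write time, and store it as its own nullable column. The sketch below is one way that could look, not the batch job used in KARTE. It assumes the dotted field names correspond to nested JSON keys, and the sample payload is invented for the example.

```python
import json

# Field paths and types from the slide (a subset); every column is NULLABLE.
PROMOTED_COLUMNS = [
    ("view.session_spend_time", "INTEGER"),
    ("view.auto_page_group", "STRING"),
    ("view.access.in_referrer.url", "STRING"),
    ("view.access.os.name", "STRING"),
]

CASTS = {"INTEGER": int, "STRING": str}


def lookup(node, dotted_path):
    """Walk nested dicts along a dotted path; return None if any key is absent."""
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def promote_columns(values_json, columns=PROMOTED_COLUMNS):
    """Extract the configured fields from a payload into flat, typed columns."""
    payload = json.loads(values_json)
    row = {}
    for path, type_name in columns:
        raw = lookup(payload, path)
        row[path] = None if raw is None else CASTS[type_name](raw)
    return row


# Hypothetical payload; real events carry many more fields.
sample = '{"view": {"session_spend_time": "300", "access": {"os": {"name": "iOS"}}}}'
row = promote_columns(sample)
print(row)
```

Keeping the list in configuration means the periodic review described on the slide only has to add or remove entries rather than change code.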

Slide 41

Slide 41 text

Integration with the ML Pipeline
Diagram labels: Analyst; KARTE Pockyevent (BigQuery); Session Summary (BigQuery); Batch Job (GKE); Batch Job (Dataflow); Reference Table (Cloud Spanner); Admin (GKE, Autoscaling); Client; ML Pipeline (AI Platform Pipeline).
Highlighted area: ML platform.

Slide 42

Slide 42 text

Integration with the ML Pipeline
Diagram labels, in the order they appear: KARTE Pockyevent (BigQuery); Admin (GKE, Autoscaling); ML Pipeline System; Experiment (AI Platform Notebook); ML Pipeline (AI Platform Pipelines); Batch Train / Prediction (AI Platform); Feature Store (Firestore); Feature Store (BigQuery); ML Realtime System; Prediction Job (GKE AutoPilot); Event Data (Cloud Spanner); API trigger (Cloud Pub/Sub); Data Process (Dataflow); Prediction Result (Cloud Bigtable); Event Data (Pub/Sub); Core Server (Compute Engine); Core System; Admin (GKE, Autoscaling); Client; KARTE Pockyevent (BigQuery).

Slide 43

Slide 43 text

Summary 05

Slide 44

Slide 44 text

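Today's summary
● A large-scale analytics platform centered on BigQuery
○ Sharding & Partitioning
○ Fast search through caching
○ Column optimization
For an analytics platform, what matters most is a development and operations setup that can keep improvements going continuously.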