Building and Operating a Data Integration SaaS with Embulk

Slide 1

Slide 1 text

2020/07/09 @Embulk&Digdag Online Meetup 2020
primeNumber Inc., Kenta Suzuki (鈴木健太)
Building and Operating a Data Integration SaaS with Embulk

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

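©2020 primeNumber Inc.
primeNumber
Company name: primeNumber Inc. (株式会社primeNumber)
Founded: November 2015
Capital: 148 million yen (including reserves)
Location: Meguro Station
Employees: 14
- Provides B2B services in the data domain
- Released "trocco®" at the end of 2018
- Series A funding round in July 2019
Data Engineering Study #1, 2020/07/15 (Wed) 19:30-, co-hosted by Forkwell and primeNumber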

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

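Supporting the peripheral features needed to operate a data platform
● Slack notifications
  ○ On success / failure / record-count conditions / execution-time conditions
● Variable embedding in settings
  ○ The execution start time can be used, e.g. s3://bucket/$date$/
● Scheduled execution
● Parallelism control
● Data mart generation
  ○ Generates summary tables on the DWH
● Data checks and validation
  ○ Validation based on the results of arbitrary queries
● Workflows
  ○ Steps such as data transfer → data mart generation → validation can be combined
● Access control
  ○ Settings can be shared within a group
● CDC (Change Data Capture)
  ○ Reads the MySQL binlog and syncs changes, including DELETEs, to BigQuery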
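The variable embedding listed above can be pictured with a small sketch. The `$date$` placeholder and the `s3://bucket/$date$/` path come from the slide; the function name `expand_variables` and the `%Y%m%d` date format are assumptions made for illustration, not trocco's actual behavior.

```python
from datetime import datetime


def expand_variables(template: str, started_at: datetime, fmt: str = "%Y%m%d") -> str:
    """Replace the $date$ placeholder with the formatted job start time.

    The output format (fmt) is a hypothetical default; the slide does not
    say how the date is rendered.
    """
    return template.replace("$date$", started_at.strftime(fmt))


# A job that started on 2020-07-09 19:30 reads from that day's prefix.
path = expand_variables("s3://bucket/$date$/", datetime(2020, 7, 9, 19, 30))
print(path)  # s3://bucket/20200709/
```

Resolving the placeholder at job start means one saved setting can serve every scheduled run.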

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

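Launching jobs can be left to K8s
- As long as the create job request succeeds, K8s looks at node resource availability and launches the Job (container)
  - K8s handles the queuing
  - Jobs are launched one after another based on the resources they require and the free capacity on the nodes
- With Cluster Autoscaler, nodes are scaled out when there are no node resources left to launch a Job
  - Nodes are also scaled in when they are idle
  - If a Job is still running on a node, scale-in waits until the Job completes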
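As a rough illustration of what a create job request might carry, the sketch below builds a Kubernetes Job manifest as a plain Python dictionary. The field names (`apiVersion: batch/v1`, `kind: Job`, `resources.requests`, `restartPolicy`, `backoffLimit`) are standard Kubernetes Job fields. The job name, image name, and resource values are hypothetical and not taken from the talk.

```python
import json


def build_job_manifest(job_id: str, image: str, cpu: str = "1", memory: str = "2Gi") -> dict:
    """Build a Kubernetes Job manifest for one transfer job.

    The resource requests are what the scheduler compares against free node
    capacity; if no node has room, the Pod stays pending until one does
    (or until Cluster Autoscaler adds a node).
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"transfer-{job_id}"},  # hypothetical naming scheme
        "spec": {
            "backoffLimit": 0,  # assumption: retries are decided by the application
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "embulk",
                            "image": image,
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest("12345", "example.com/transfer-job:tag1")
print(json.dumps(manifest, indent=2))
```

Submitting this manifest to the cluster would be the create job request itself, for example with `kubectl create -f` or a Kubernetes API client. That step is left out here.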

Slide 17

Slide 17 text

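Fan-Out or Dispatcher
Fan-Out approach: each container polls SQS and runs a job. When the job finishes, it polls again and runs the next job.
Dispatcher approach: a Dispatcher app polls the queue and launches a new container for each job.
[Diagram: Fan-Out: SQS → Container, Container. Dispatcher: SQS → Dispatcher App → Container, Container.]
We originally used the Fan-Out approach with ECS services, then migrated to the Dispatcher approach on EKS.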

Slide 18

Slide 18 text

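Why we adopted the Dispatcher approach
● Simpler deployment
  ○ At initial release we used the Fan-Out approach
  ○ Some transfer jobs take a long time to run
  ○ We do not want to terminate running jobs at deploy time, so a deploy completes only after the running jobs finish
    ■ Deploys would not complete for hours
  ○ With the Dispatcher approach, a deploy completes just by swapping the container image tag
    ■ Running jobs are not affected
[Diagram: the container tag that Jobs reference is changed. Newly launched containers reference the other tag (TAG 2). Existing containers (TAG 1) are unaffected.]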
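The tag-swap deployment described above can be simulated in a few lines. This is only a sketch of the idea: an in-memory `queue.Queue` stands in for SQS, a dictionary stands in for a launched container, and the `Dispatcher` class, its method names, and the `transfer-job` image name are all hypothetical.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Dispatcher:
    """Polls a job queue and launches one container per job."""

    image_tag: str
    running: list = field(default_factory=list)

    def deploy(self, new_tag: str) -> None:
        """Deploying only changes the tag used for containers launched from now on."""
        self.image_tag = new_tag

    def poll_once(self, jobs: queue.Queue):
        """Take one job from the queue, if any, and launch a container for it."""
        try:
            job_id = jobs.get_nowait()
        except queue.Empty:
            return None
        container = {"job": job_id, "image": f"transfer-job:{self.image_tag}"}
        self.running.append(container)
        return container


jobs = queue.Queue()
for job_id in ("job-a", "job-b"):
    jobs.put(job_id)

dispatcher = Dispatcher(image_tag="tag1")
first = dispatcher.poll_once(jobs)   # launched before the deploy
dispatcher.deploy("tag2")            # the deploy finishes immediately
second = dispatcher.poll_once(jobs)  # launched after the deploy

print(first["image"], second["image"])  # transfer-job:tag1 transfer-job:tag2
```

The container launched before `deploy` keeps its original tag, so a long-running transfer is never interrupted by a release.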

Slide 19

Slide 19 text

Slide 20

Slide 20 text

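Why EKS (K8s)? Why not ECS tasks?
- ECS tasks on Fargate
  - Slow startup. Cannot allocate more than the default volume size (as of February 2019)
- ECS tasks on EC2
  - If EC2 resources are short, ECS tasks (containers) cannot start
  - To make sure ECS tasks start reliably, we would need success/failure monitoring of task launches and a retry mechanism (as of February 2019)
  - The development and production environments diverge
    - Plain Docker containers are launched in development, ECS tasks in production
- EKS (K8s) Jobs on EC2
  - Container launch can be left to K8s
    - Monitoring and scaling EC2 resources and launching containers are K8s's job
  - Development and production environments can be kept the same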

Slide 21

Slide 21 text

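A simple job platform with K8s
- K8s takes care of launching jobs
  - Node scaling and job startup become K8s's responsibility
  - We only need to decide which jobs to submit via create job, for example from the standpoint of parallel execution, which keeps things simple
- Adopting the Dispatcher approach also simplifies deployment
  - Swap the image tag
  - Deploys complete without affecting running jobs
- Development and production environments can be aligned, which simplifies development and implementation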

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

who am i
Software Engineer
Ken Takagiwa (高際 兼一)
github: @giwa
twitter: @gi_world
fb: @giwacchi
futsal (often at the pick-up futsal games in Yoyogi Park) / drink

Slide 27

Slide 27 text

embulk-output-bigquery plugin
Brief history of this plugin
- v0.1.0: Java, file-output based, developed by Treasure Data (sakama et al.) in March 2015
- v0.3.0.pre1: JRuby output based, developed by Sonots in March 2016
- embulk-output-bigquery_java: Java output based, compatible with the JRuby version, developed by trocco since February 2020
Our motivation
1. With JRuby, we hit the limit on the transfer time that customer requirements allow
2. Minimizing transfer time reduces our infrastructure cost; most of our customers use BigQuery as their DWH

Slide 28

Slide 28 text

Other issues
- Embulk itself has prioritized Java since 0.10
- JRuby plugin problem (https://twitter.com/hiroysato/status/1269822234454982656)
- google-api-client incompatibility
- JRuby and JRE related problem (https://github.com/embulk/embulk-output-bigquery/issues/122)
- Sometimes fails with the error "output plugin bigquery not found" (embulk/embulk#1214)
- Sometimes fails to run because the embedded jar cannot be copied to the temp directory (embulk/embulk#1148)
- Occasionally fails with the error "JRuby runtime is not loaded successfully" (embulk/embulk#978)

Slide 29

Slide 29 text

Performance
EC2: m5ad.large (2 cores, 8 GB)
transfer: local_file -> BigQuery
embulk: 0.9.23
embulk-output-bigquery (v0.6.4)
embulk-output-bigquery_java (0.0.14)
Throughput measured with speedometer

    in:
      type: file
      path_prefix: /home/ec2-user/bq_rb/data.csv
      parser:
        charset: UTF-8
        newline: LF
        type: csv
        delimiter: ','
        quote: '"'
        escape: '"'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
        - {name: c1, type: long}
        - {name: c2, type: string}
        - {name: c3, type: long}
        - {name: c4, type: double}
        - {name: c5, type: string}
    out:
      type: bigquery_java
      auth_method: service_account
      json_keyfile: ***
      dataset: ****
      table: bq_performance_java
      auto_create_dataset: false
      auto_create_table: true
      mode: replace
      location: US
      open_timeout_sec: 300
      timeout_sec: 300
      send_timeout_sec: 300
      read_timeout_sec: 300
      retries: 5
      allow_quoted_newlines: true
      source_format: NEWLINE_DELIMITED_JSON
      compression: GZIP
      path_prefix: "/home/ec2-user/bq_java/"

Slide 30

Slide 30 text

Performance

Slide 31

Slide 31 text

Performance

Slide 32

Slide 32 text

Performance 2.56x

Slide 33

Slide 33 text

Current status of output-bigquery_java
- The Java output plugin covers around 40% of the JRuby version's config options
- Major function not yet implemented: GCS
- Config differences from the JRuby version: before_load, column_options.description

Slide 34

Slide 34 text

Road map
- embulk/embulk-output-bigquery: keep the JRuby version and fix bugs
- trocco-io/embulk-output-bigquery_java: replace embulk/embulk-output-bigquery and release it as 0.7.0 in embulk/embulk-output-bigquery in spring 2021
- Drop some options related to JRuby performance issues, e.g. payload_column_index
- Number of contributors: 1 (only me). The more contributors join this project, the faster we can reach the milestones in the road map :)

Slide 35

Slide 35 text

Please give us feedback and contributions