Data processing, workflow and us
~How to manage automated jobs~
SRE-SET Automation Night
Mercari SRE team :: @syu_cream
Slide 2
whoami
● @syu_cream
● Mercari SRE newbie :)
● Develop/maintain data processing
systems and some middleware
● My interests:
○ Go, mruby, and automation
Slide 3
My concerns...
● Data processing is already automated.
○ Fluentd, Embulk and some tools …
○ Shell scripts, cron …
● How can we improve job management?
Slide 4
Agenda
● BigQuery log uploader
○ Previous status and issues
○ Job management w/ digdag
● Statistics analyser
○ Previous status and issues
○ Re-implementation w/ Cloud Dataflow
○ Job management w/ Airflow
Slide 5
BigQuery log uploader
Slide 6
Mercari log analysis infrastructure
See details at https://speakerdeck.com/cubicdaiya/mercari-data-analysis-infrastructure
Slide 7
Deep dive into the BigQuery side
● Receive forwarded logs via Fluentd
● Manage the upload job with cron
○ Split and compress large log files
○ Upload logs to GCS
○ Load logs from GCS into BigQuery
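Concretely, the pre-digdag setup boils down to a crontab entry driving a single script; the schedule, path, and script name below are hypothetical:

```
# Hypothetical crontab entry: split, compress, upload to GCS, load to BigQuery
0 * * * * /opt/bq-uploader/upload_logs.sh >> /var/log/bq_upload.log 2>&1
```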
Slide 8
Issues...
● It was hard to retry a failed job
○ We get a Slack alert when it fails
○ But we have to reproduce some commands by hand
● Job status was not visualized
● Uploading logic became complex
○ Due to a BigQuery resource limit
■ Compressed JSON has a 4 GB upper limit
○ So we had to split large files before uploading
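The splitting step can be sketched in Python: write the newline-delimited JSON log into gzipped parts, starting a new part before a part could exceed BigQuery's 4 GB compressed-file limit. Splitting on uncompressed size is a conservative proxy, since gzipped JSON ends up far smaller than the raw input. File names and the threshold are illustrative, not the team's actual script:

```python
import gzip

# BigQuery rejects compressed JSON load files larger than 4 GB, so large
# logs must be split before upload. Splitting on *uncompressed* size is a
# conservative proxy: gzipped JSON logs end up far smaller than the raw input.
def split_and_compress(src_path, out_prefix, max_bytes=4 * 1024 ** 3):
    """Split a newline-delimited JSON log into gzipped parts, each holding
    at most ~max_bytes of raw input. Returns the part file names."""
    parts = []
    part_no = 0
    written = 0
    out = None
    with open(src_path, "rb") as src:
        for line in src:
            if out is None:
                name = "%s-%05d.json.gz" % (out_prefix, part_no)
                out = gzip.open(name, "wb")
                parts.append(name)
                part_no += 1
                written = 0
            out.write(line)
            written += len(line)
            if written >= max_bytes:  # start a fresh part for the next line
                out.close()
                out = None
    if out is not None:
        out.close()
    return parts
```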
Slide 9
Improve jobs w/ digdag
● We started using digdag, a simple workflow
engine.
○ https://www.digdag.io/
● And split log files in Fluentd
○ By adjusting buffer_chunk_limit
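A digdag workflow for this pipeline might look like the sketch below; task names, paths, and retry counts are hypothetical, while `+task`, the `sh>:` shell operator, and `_retry` are digdag's own syntax:

```
# upload_logs.dig -- hypothetical workflow definition
timezone: UTC

+upload_to_gcs:
  _retry: 3
  sh>: gsutil cp /var/log/app/access-*.json.gz gs://example-bucket/logs/

+load_to_bq:
  _retry: 3
  sh>: bq load --source_format=NEWLINE_DELIMITED_JSON app_logs.access gs://example-bucket/logs/access-*.json.gz
```

Because digdag records each task's state, a failed run can be retried from the point of failure instead of reproducing commands by hand.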
Slide 10
Improvements
● We can retry failed jobs.
● We can view job progress.
● Splitting logic is offloaded.
○ So uploading logic becomes simple :)
Slide 11
Statistics analyser
Slide 12
Mercari basic statistics analyser
● Collect some data from DB and logs
● Process it with Node.js
● Report basic KPIs via Slack and email
Slide 13
Issues...
● It’s hard to maintain …
○ The analyser became very complex :(
○ Memory usage is very high
● Its performance is poor.
○ Single-threaded
○ Sequential processing
Slide 14
Reimplement the statistics analyser
● Replace w/ Cloud Dataflow
○ https://cloud.google.com/dataflow/
○ Fully managed ETL service
○ It offloads machine resource issues.
● Reimplement the analyser logic in Scala
○ Using https://github.com/spotify/scio
○ Performance is good and it auto-scales
○ Job status is visualized automatically
Slide 15
Job management w/ Airflow
● I’m trying Apache Airflow
○ https://airflow.apache.org/
● It can define complex job relations.
○ Job dependencies
○ Waiting for data sources
○ Reporting via email/Slack
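The core of what a workflow engine like Airflow provides, running each task only after its dependencies succeed and retrying failures, can be sketched in plain Python. The DAG below is hypothetical; a real Airflow DAG would use `airflow.DAG` and operators instead:

```python
# Minimal sketch of dependency-ordered execution with retries --
# the core of what a workflow engine like Airflow provides.
def run_dag(tasks, deps, retries=2):
    """tasks: name -> callable; deps: name -> list of upstream task names.
    Runs every task after its dependencies, retrying each up to `retries`
    extra times on failure. Returns the set of completed task names."""
    done = set()

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # ensure upstream tasks finish first
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # give up after the final retry
        done.add(name)

    for name in tasks:
        run(name)
    return done
```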
Slide 16
Improvements
● Memory usage is offloaded
● Job status is visualized
● Each job becomes retryable
● Slack/email reporting is integrated
Slide 17
Summary
● I’ve tried to improve our data processing
○ w/ workflow engines
○ And fully managed services