Data processing, workflow and us ~How to manage automated jobs~

Ryo Okubo
December 12, 2017


Transcript

  1. Data processing, workflow and us ~How to manage automated jobs~ (SRE-SET Automation Night, Mercari SRE team :: @syu_cream)
  2. whoami • @syu_cream • Mercari SRE newbie :) • Develops/maintains data processing systems and some middleware • My interests: ◦ Go, mruby, and automation
  3. My concerns... • Data processing is already automated. ◦ Fluentd, Embulk, and some other tools … ◦ Shell scripts, cron … • How can we improve job management?
  4. Agenda • BigQuery log uploader ◦ Previous status and issues ◦ Job management w/ digdag • Statistics analyser ◦ Previous status and issues ◦ Re-implementation w/ Cloud Dataflow ◦ Job management w/ Airflow
  5. BigQuery log uploader

  6. Mercari log analysis infrastructure See details at https://speakerdeck.com/cubicdaiya/mercari-data-analysis-infrastructure

  7. Deep dive into the BigQuery side • Receive forwarded logs via Fluentd • Manage the upload job with cron ◦ Split and compress large log files ◦ Upload logs to GCS ◦ Load logs from GCS into BigQuery
  8. Issues... • It was hard to retry a failed job ◦ We get a Slack alert when a job fails ◦ But then we need to reproduce some commands by hand • Job status was not visualized • The upload logic became complex ◦ Due to BigQuery resource limits ▪ Compressed JSON has a 4 GB upper limit ◦ So we had to split large files before uploading
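The 4 GB compressed-JSON limit is what forced the splitting step. A minimal sketch of that split-and-compress logic in Python (the function and path names are hypothetical, not from the talk; the subsequent GCS upload and BigQuery load steps are omitted):

```python
import gzip
import os


def split_and_compress(src_path, dst_dir, max_bytes):
    """Split a newline-delimited log file into gzip parts whose
    uncompressed size stays under max_bytes. Hypothetical helper
    illustrating the pre-upload splitting step."""
    os.makedirs(dst_dir, exist_ok=True)
    parts = []
    part_no = 0
    written = 0
    out = None
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part when the current one would exceed the limit.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part = os.path.join(dst_dir, "part-%04d.json.gz" % part_no)
                out = gzip.open(part, "wb")
                parts.append(part)
                part_no += 1
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

In the real pipeline the limit would be set comfortably below 4 GB of compressed output; here it only bounds the uncompressed chunk size, which is a simplification.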
  9. Improve the job w/ digdag • We started using digdag, a simple workflow engine. ◦ https://www.digdag.io/ • And split log files in Fluentd instead ◦ By adjusting buffer_chunk_limit
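A digdag workflow for this kind of pipeline might look roughly like the following (a sketch; the task, bucket, and table names are hypothetical, not from the talk). The `_retry` parameter is what makes a failed step retryable without reproducing commands by hand:

```yaml
# upload_logs.dig -- hypothetical sketch of a digdag workflow
timezone: UTC

+upload_to_gcs:
  _retry: 3
  sh>: gsutil cp /var/log/buffered/*.gz gs://example-bucket/logs/

+load_to_bq:
  _retry: 3
  sh>: bq load --source_format=NEWLINE_DELIMITED_JSON example_dataset.access_log 'gs://example-bucket/logs/*.gz'
```

Because Fluentd now emits chunks bounded by buffer_chunk_limit, no explicit splitting task is needed here, and the digdag UI shows the status of each task.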
  10. Improvements • We can retry failed jobs. • We can view job progress. • The splitting logic is offloaded. ◦ So the upload logic becomes simple :)
  11. Statistics analyser

  12. Mercari basic statistics analyser • Collects data from the DB and logs • Processes it with Node.js • Reports basic KPIs via Slack and email
  13. Issues... • It's hard to maintain … ◦ The analyser became very complex :( ◦ Memory usage is very high • Its performance is poor. ◦ Single-threaded ◦ Sequential processing
  14. Reimplement the statistics analyser • Replace it w/ Cloud Dataflow ◦ https://cloud.google.com/dataflow/ ◦ A fully managed ETL service ◦ It offloads machine-resource issues. • Reimplement the analyser logic in Scala ◦ Using https://github.com/spotify/scio ◦ Performance is good and it auto-scales ◦ Job status is visualized automatically
  15. Job management w/ Airflow • I'm trying Apache Airflow ◦ https://airflow.apache.org/ • It can define complex job relations. ◦ Job dependencies ◦ Waiting for data sources ◦ Reporting via email/Slack
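A rough sketch of what such a DAG could look like with the Airflow 1.x API of the time (DAG, task, and script names are hypothetical; in practice a Sensor would implement the "waiting for a data source" step, but a BashOperator stands in here to keep the sketch short):

```python
# Hypothetical Airflow DAG: wait for source data, run the
# Dataflow analyser, then report KPIs; email on failure.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data",
    "retries": 2,                          # each task is retryable
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # built-in failure reporting
    "email": ["data-alerts@example.com"],
}

dag = DAG(
    "stats_analyser",
    default_args=default_args,
    start_date=datetime(2017, 12, 1),
    schedule_interval="@daily",
)

wait = BashOperator(task_id="wait_for_source",
                    bash_command="scripts/wait_for_source.sh", dag=dag)
run = BashOperator(task_id="run_dataflow_job",
                   bash_command="scripts/submit_dataflow.sh", dag=dag)
report = BashOperator(task_id="report_kpis",
                      bash_command="scripts/report.sh", dag=dag)

# Job dependencies: wait -> run -> report
wait >> run >> report
```

The `>>` operator expresses the job dependencies from the slide, and the retry/email settings in `default_args` cover the retryability and reporting points on the next slide.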
  16. Improvements • Memory usage is offloaded • Job status is visualized • Each job becomes retryable • Slack/email reporting is integrated
  17. Summary • I've tried to improve our data processing ◦ w/ workflow engines ◦ And fully managed services
  18. We’re hiring • Software Engineer (Backend System) ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/4245 • Software Engineer (Site Reliability) ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/4246 • Software Engineer (MySQL Reliability) ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/6415 • … and anyone who loves automation!
  19. References • Replacing KPI-related aggregation jobs with Cloud Dataflow ◦ http://tech.mercari.com/entry/2017/11/02/142931 • Introduction to Mercari's data analysis infrastructure ~around BigQuery~ ◦ http://tech.mercari.com/entry/2017/12/09/103000