Data processing, workflow and us
~How to manage automated jobs~
SRE-SET Automation Night
Mercari SRE team :: @syu_cream
Slide 2
whoami
● @syu_cream
● Mercari SRE newbie :)
● Develop/maintain data processing
systems and some middleware
● My interests:
○ Go, mruby, and automation
Slide 3
My concerns...
● Data processing is already automated.
○ Fluentd, Embulk and some tools …
○ Shell scripts, cron …
● How can we improve job management?
Slide 4
Agenda
● BigQuery log uploader
○ Previous status and issues
○ Job management w/ digdag
● Statistics analyser
○ Previous status and issues
○ Re-implementation w/ Cloud Dataflow
○ Job management w/ Airflow
Slide 5
BigQuery log uploader
Slide 6
Mercari log analysis infrastructure
See details at https://speakerdeck.com/cubicdaiya/mercari-data-analysis-infrastructure
Slide 7
Deep dive into the BigQuery side
● Receive forwarded logs via Fluentd
● Manage the upload job with cron
○ Split and compress large log files
○ Upload logs to GCS
○ Load logs from GCS into BigQuery
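Concretely, the pre-digdag setup boils down to a crontab entry driving a single script; the schedule, path, and script name below are hypothetical:

```
# Hypothetical crontab entry: split, compress, upload to GCS, load to BigQuery
0 * * * * /opt/bq-uploader/upload_logs.sh >> /var/log/bq_upload.log 2>&1
```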
Slide 8
Issues...
● It was hard to retry a failed job
○ We get a Slack alert when it fails
○ But we have to reproduce some commands by hand
● Job status was not visualized
● Uploading logic became complex
○ Due to a BigQuery resource limit
■ Compressed JSON has a 4 GB upper limit
○ So we had to split large files before uploading
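The splitting step can be sketched in Python: write the newline-delimited JSON log into gzipped parts, starting a new part before a part could exceed BigQuery's 4 GB compressed-file limit. Splitting on uncompressed size is a conservative proxy, since gzipped JSON ends up far smaller than the raw input. File names and the threshold are illustrative, not the team's actual script:

```python
import gzip

# BigQuery rejects compressed JSON load files larger than 4 GB, so large
# logs must be split before upload. Splitting on *uncompressed* size is a
# conservative proxy: gzipped JSON logs end up far smaller than the raw input.
def split_and_compress(src_path, out_prefix, max_bytes=4 * 1024 ** 3):
    """Split a newline-delimited JSON log into gzipped parts, each holding
    at most ~max_bytes of raw input. Returns the part file names."""
    parts = []
    part_no = 0
    written = 0
    out = None
    with open(src_path, "rb") as src:
        for line in src:
            if out is None:
                name = "%s-%05d.json.gz" % (out_prefix, part_no)
                out = gzip.open(name, "wb")
                parts.append(name)
                part_no += 1
                written = 0
            out.write(line)
            written += len(line)
            if written >= max_bytes:  # start a fresh part for the next line
                out.close()
                out = None
    if out is not None:
        out.close()
    return parts
```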
Slide 9
Improve jobs w/ digdag
● We started using digdag, a simple workflow
engine.
○ https://www.digdag.io/
● And split log files in Fluentd
○ By adjusting buffer_chunk_limit
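A digdag workflow for this pipeline might look like the sketch below; task names, paths, and retry counts are hypothetical, while `+task`, the `sh>:` shell operator, and `_retry` are digdag's own syntax:

```
# upload_logs.dig -- hypothetical workflow definition
timezone: UTC

+upload_to_gcs:
  _retry: 3
  sh>: gsutil cp /var/log/app/access-*.json.gz gs://example-bucket/logs/

+load_to_bq:
  _retry: 3
  sh>: bq load --source_format=NEWLINE_DELIMITED_JSON app_logs.access gs://example-bucket/logs/access-*.json.gz
```

Because digdag records each task's state, a failed run can be retried from the point of failure instead of reproducing commands by hand.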
Slide 10
Improvements
● We can retry failed jobs.
● We can view job progress.
● Splitting logic is offloaded.
○ So uploading logic becomes simple :)
Slide 11
Statistics analyser
Slide 12
Mercari basic statistics analyser
● Collect some data from DB and logs
● Process it with Node.js
● Report basic KPIs via Slack and email
Slide 13
Issues...
● It’s hard to maintain …
○ The analyser became very complex :(
○ Memory usage is very high
● Its performance is poor.
○ Single-threaded
○ Sequential processing
Slide 14
Reimplement the statistics analyser
● Replace w/ Cloud Dataflow
○ https://cloud.google.com/dataflow/
○ Fully managed ETL service
○ It offloads machine resource issues.
● Reimplement the analyser logic in Scala
○ Using https://github.com/spotify/scio
○ Performance is good and it auto-scales
○ Job status is visualized automatically
Slide 15
Job management w/ Airflow
● I’m trying Apache Airflow
○ https://airflow.apache.org/
● It can define complex job relations.
○ Job dependencies
○ Waiting for data sources
○ Reporting via email/Slack
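The core of what a workflow engine like Airflow provides, running each task only after its dependencies succeed and retrying failures, can be sketched in plain Python. The DAG below is hypothetical; a real Airflow DAG would use `airflow.DAG` and operators instead:

```python
# Minimal sketch of dependency-ordered execution with retries --
# the core of what a workflow engine like Airflow provides.
def run_dag(tasks, deps, retries=2):
    """tasks: name -> callable; deps: name -> list of upstream task names.
    Runs every task after its dependencies, retrying each up to `retries`
    extra times on failure. Returns the set of completed task names."""
    done = set()

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # ensure upstream tasks finish first
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # give up after the final retry
        done.add(name)

    for name in tasks:
        run(name)
    return done
```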
Slide 16
Improvements
● Memory usage is offloaded
● Job status is visualized
● Each job becomes retryable
● Slack/email reporting is integrated
Slide 17
Summary
● I’ve tried to improve our data processing
○ w/ workflow engines
○ And fully managed services