Data processing, workflow and us ~How to manage automated jobs~

Ryo Okubo
December 12, 2017

Transcript

  1. Data processing, workflow and us ~How to manage automated jobs~
    SRE-SET Automation Night
    Mercari SRE team :: @syu_cream
  2. whoami
    • @syu_cream
    • Mercari SRE newbie :)
    • Develop and maintain data processing systems and some middleware
    • My interests:
      ◦ Go, mruby, and automation
  3. My concerns...
    • Data processing is already automated.
      ◦ Fluentd, Embulk and some tools …
      ◦ Shell scripts, cron …
    • How can we improve job management?
  4. Agenda
    • BigQuery log uploader
      ◦ Previous status and issues
      ◦ Job management w/ digdag
    • Statistics analyser
      ◦ Previous status and issues
      ◦ Re-implementation w/ Cloud Dataflow
      ◦ Job management w/ Airflow
  5. Deep dive: around BigQuery
    • Receive forwarded logs with Fluentd
    • Manage the upload job with cron
      ◦ Split and compress large log files
      ◦ Upload logs to GCS
      ◦ Load logs from GCS into BigQuery
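
The upload and load steps on this slide could be sketched as a command sequence like the one below. This is only an illustration: the slides do not say which tools the cron job called, so the use of the `gsutil` and `bq` CLIs, the function name `upload_commands`, and the bucket and table names are all assumptions. It also assumes the destination table already exists with a schema.

```python
import posixpath
from typing import List


def upload_commands(local_path: str, bucket: str, table: str) -> List[List[str]]:
    """Build the two commands for one already split and compressed log file.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    # Keep the local file name as the object name in the bucket.
    gcs_uri = f"gs://{bucket}/{posixpath.basename(local_path)}"
    return [
        # Step 1: copy the local file to Cloud Storage.
        ["gsutil", "cp", local_path, gcs_uri],
        # Step 2: load the newline-delimited JSON object into BigQuery.
        ["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON", table, gcs_uri],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`, so that a failing step raises instead of silently continuing to the next one.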
  6. Issues...
    • It was hard to retry a failed job
      ◦ We get a Slack alert when it fails
      ◦ But we need to re-run some commands manually
    • Job status was not visualized
    • The upload logic became complex
      ◦ Due to a BigQuery resource limit
        ▪ Compressed JSON has a 4 GB upper limit
      ◦ So we split large files before uploading
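
The splitting step that made the uploader complex can be pictured roughly as follows. This is a minimal sketch, not the original script: `split_and_compress`, the part file naming, and the choice to cap the uncompressed bytes per part are assumptions. Capping uncompressed bytes is a conservative proxy for the compressed size limit, since gzip rarely makes log text larger.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def split_and_compress(lines: Iterable[bytes], out_dir: Path, max_bytes: int) -> List[Path]:
    """Write newline-terminated records into gzip parts.

    Each part holds at most max_bytes of uncompressed data, except that a
    single record longer than max_bytes still gets a part of its own.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    writer = None
    written = 0
    for line in lines:
        # Start a new part when the current one would exceed the limit.
        if writer is None or (written > 0 and written + len(line) > max_bytes):
            if writer is not None:
                writer.close()
            path = out_dir / f"part-{len(parts):05d}.json.gz"
            writer = gzip.open(path, "wb")
            parts.append(path)
            written = 0
        writer.write(line)
        written += len(line)
    if writer is not None:
        writer.close()
    return parts
```

Records are never cut in the middle, so every part stays valid newline-delimited JSON.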
  7. Improve the job w/ digdag
    • We started using digdag, a simple workflow engine.
      ◦ https://www.digdag.io/
    • And we split log files in Fluentd
      ◦ By adjusting buffer_chunk_limit
  8. Improvements
    • We can retry failed jobs.
    • We can view job progress.
    • Splitting logic is offloaded.
      ◦ So the upload logic became simple :)
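
What "retry failed jobs" buys can be illustrated with a toy runner that records finished tasks and skips them on the next attempt. This is not digdag's implementation or API; `run_workflow`, the task names, and the in-memory `done` set are assumptions made for the sketch.

```python
from typing import Callable, List, Set, Tuple

Task = Tuple[str, Callable[[], None]]


def run_workflow(tasks: List[Task], done: Set[str]) -> bool:
    """Run tasks in order, skipping names already in done.

    Stops at the first failure and returns False, so a later call with the
    same done set resumes from the task that failed.
    """
    for name, fn in tasks:
        if name in done:
            continue  # finished in an earlier attempt
        try:
            fn()
        except Exception as exc:
            print(f"task {name} failed: {exc}")
            return False
        done.add(name)
    return True
```

With cron and shell scripts, this bookkeeping is exactly what had to be reproduced by hand after each alert.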
  9. Mercari basic statistics analyzer
    • Collects some data from the DB and logs
    • Processes it with Node.js
    • Reports basic KPIs via Slack and email
  10. Issues...
    • It’s hard to maintain …
      ◦ The analyser became very complex :(
      ◦ Memory usage is very high
    • Its performance is poor.
      ◦ Single thread
      ◦ Sequential processing
  11. Reimplement the statistics analyser
    • Replace it w/ Cloud Dataflow
      ◦ https://cloud.google.com/dataflow/
      ◦ Fully managed ETL service
      ◦ It offloads machine resource issues.
    • Reimplement the analyser logic in Scala
      ◦ Using https://github.com/spotify/scio
      ◦ Performance is good and it’s auto-scaled
      ◦ Job status is visualized automatically
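
The move away from single-threaded, sequential processing rests on aggregations that can be computed per partition and then merged. The sketch below shows that shape in plain Python rather than in Scala with scio, and it is not the analyser's code: `count_by_key`, `parallel_count`, and the `event` field are hypothetical, and a thread pool only stands in for the workers a managed service would provide.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def count_by_key(records: Iterable[Dict[str, str]], key: str) -> Counter:
    """Partial aggregate: count the records of one partition per key value."""
    return Counter(record[key] for record in records)


def parallel_count(partitions: List[List[Dict[str, str]]], key: str, workers: int = 4) -> Counter:
    """Count each partition independently, then merge the partial counts."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda part: count_by_key(part, key), partitions):
            total.update(partial)  # Counter.update adds counts together
    return total
```

Because each partition is counted on its own, no single worker needs the whole data set in memory.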
  12. Job management w/ Airflow
    • I’m trying Apache Airflow
      ◦ https://airflow.apache.org/
    • It can define complex job relations.
      ◦ Job dependencies
      ◦ Waiting for a data source
      ◦ Reporting via email/Slack
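
Two of the job relations listed here, dependencies and waiting for a data source, can be sketched with the Python standard library alone. This is deliberately not Airflow's API; the job names in `JOB_DEPENDENCIES` and the helpers `execution_order` and `wait_for` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each job maps to the set of jobs that must finish before it starts.
JOB_DEPENDENCIES: Dict[str, Set[str]] = {
    "extract_db": set(),
    "extract_logs": set(),
    "aggregate": {"extract_db", "extract_logs"},
    "report": {"aggregate"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every job comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def wait_for(is_ready: Callable[[], bool], timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll is_ready until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A workflow engine adds scheduling, retries, and reporting on top of this ordering, which is the part that is tedious to maintain in cron and shell scripts.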
  13. Improvements
    • Memory usage is offloaded
    • Job status is visualized
    • Each job became retryable
    • Slack/email reporting is integrated
  14. Summary
    • I’ve tried to improve data processing
      ◦ w/ workflow engines
      ◦ And fully managed services
  15. We’re hiring
    • Software Engineer (Backend System)
      ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/4245
    • Software Engineer (Site Reliability)
      ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/4246
    • Software Engineer (MySQL Reliability)
      ◦ https://open.talentio.com/1/c/mercari/requisitions/detail/6415
    • … and anyone who loves automation!
  16. References
    • Replacing the aggregation of KPI-related figures with Cloud Dataflow (in Japanese)
      ◦ http://tech.mercari.com/entry/2017/11/02/142931
    • Introducing Mercari’s data analysis platform ~Around BigQuery~ (in Japanese)
      ◦ http://tech.mercari.com/entry/2017/12/09/103000