Data processing, workflow and us ~How to manage automated jobs~

Ryo Okubo
December 12, 2017

Transcript

  1. Data processing, workflow and us
    ~How to manage automated jobs~
    SRE-SET Automation Night
    Mercari SRE team :: @syu_cream

  2. whoami
    ● @syu_cream
    ● Mercari SRE newbie :)
    ● Develop/maintain data processing
    systems and some middleware
    ● My interests:
    ○ Go, mruby, and automation

  3. My concerns...
    ● Data processing is already automated.
    ○ Fluentd, Embulk and some tools …
    ○ Shell scripts, cron …
    ● How can we improve job management?

  4. Agenda
    ● BigQuery log uploader
    ○ Previous status and issues
    ○ Job management w/ digdag
    ● Statistics analyser
    ○ Previous status and issues
    ○ Re-implementation w/ Cloud Dataflow
    ○ Job management w/ Airflow

  5. BigQuery log uploader

  6. Mercari log analysis infrastructure
    See details at https://speakerdeck.com/cubicdaiya/mercari-data-analysis-infrastructure

  7. Deep dive around BigQuery
    ● Receive logs forwarded by Fluentd
    ● Manage the upload job with cron
    ○ Split and compress large log files
    ○ Upload logs to GCS
    ○ Load logs from GCS into BigQuery

  8. Issues...
    ● It was hard to retry a failed job
    ○ We got a Slack alert when it failed
    ○ But we had to re-run some commands by hand
    ● Job status was not visualized
    ● Upload logic became complex
    ○ Related to a BQ resource limit
    ■ Compressed JSON has a 4 GB upper
    limit
    ○ So we had to split large files before
    uploading
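
The split-before-upload step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the script used at Mercari: the function name `split_and_compress`, the part naming scheme, and the choice to count uncompressed bytes (a conservative stand-in for the compressed-size limit) are all hypothetical.

```python
import gzip
from pathlib import Path


def split_and_compress(src: Path, out_dir: Path, max_bytes: int) -> list[Path]:
    """Split src on line boundaries into gzip parts.

    Each part holds at most max_bytes of uncompressed data, except that a
    single line longer than max_bytes is written to a part of its own.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out = None      # currently open gzip part, if any
    written = 0     # uncompressed bytes written to the current part
    try:
        with src.open("rb") as f:
            for line in f:
                # Start a new part when the current one would exceed the limit.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    part = out_dir / f"{src.name}.part{len(parts):04d}.gz"
                    out = gzip.open(part, "wb")
                    parts.append(part)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

A call such as `split_and_compress(Path("app.log"), Path("out"), max_bytes=4 * 1024**3)` would keep every part under 4 GB before compression; splitting on line boundaries keeps each JSON record intact.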

  9. Improve jobs w/ digdag
    ● We started using digdag, a simple workflow
    engine.
    ○ https://www.digdag.io/
    ● And we split log files in Fluentd
    ○ By adjusting buffer_chunk_limit

  10. Improvements
    ● We can retry failed jobs.
    ● We can view job progress.
    ● Splitting logic is offloaded.
    ○ So the upload logic became simple :)
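
What "retryable jobs with visible progress" means can be illustrated with a small plain-Python sketch. It is not digdag's API or configuration format; `run_workflow`, the step names, and the status strings are hypothetical and only mimic the behaviour a workflow engine provides out of the box.

```python
import time
from typing import Callable


def run_workflow(
    steps: list[tuple[str, Callable[[], None]]],
    max_retries: int = 2,
    delay_sec: float = 0.0,
) -> dict[str, str]:
    """Run named steps in order, retrying a failing step up to max_retries times.

    Returns a per-step status map. Steps after a permanently failed step
    are not run and stay "pending".
    """
    status = {name: "pending" for name, _ in steps}
    for name, func in steps:
        for attempt in range(max_retries + 1):
            try:
                func()
                status[name] = "success"
                break
            except Exception as exc:
                status[name] = f"failed (attempt {attempt + 1}): {exc}"
                if attempt < max_retries:
                    time.sleep(delay_sec)  # wait before the next attempt
        if status[name] != "success":
            break  # stop the workflow; downstream steps remain pending
    return status
```

Passing the three upload stages as steps, for example `run_workflow([("split", split), ("upload", upload), ("load", load)])` with hypothetical callables, returns a status map that shows which stage failed, so only that stage needs attention.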

  11. Statistics analyser

  12. Mercari basic statistics analyser
    ● Collect some data from DB and logs
    ● Process it with Node.js
    ● Report basic KPIs via Slack and email

  13. Issues...
    ● It’s hard to maintain …
    ○ The analyser became very complex :(
    ○ Memory usage is very high
    ● Its performance is poor.
    ○ Single thread
    ○ Sequential processing

  14. Reimplement statistics analyser
    ● Replace w/ Cloud Dataflow
    ○ https://cloud.google.com/dataflow/
    ○ Fully managed ETL service
    ○ It lets us offload machine resource issues.
    ● Reimplement the analyser logic in Scala
    ○ Using https://github.com/spotify/scio
    ○ Performance is good and it’s auto-scaled
    ○ Job status is visualized automatically

  15. Job management w/ Airflow
    ● I’m trying Apache Airflow
    ○ https://airflow.apache.org/
    ● It can define complex relations between jobs.
    ○ Job dependencies
    ○ Waiting for data sources
    ○ Reporting via email/Slack
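
The idea of job dependencies listed above can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Airflow's DAG API; the job names in `JOBS` and the helper `execution_batches` are hypothetical and only show how a dependency graph determines what may run, and when.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOBS = {
    "wait_for_logs": set(),
    "wait_for_db_dump": set(),
    "aggregate_kpis": {"wait_for_logs", "wait_for_db_dump"},
    "report_slack": {"aggregate_kpis"},
    "report_email": {"aggregate_kpis"},
}


def execution_batches(jobs: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into batches; every job in a batch has all of its
    dependencies satisfied by earlier batches, so a batch could run in parallel."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock successors
    return batches
```

For the graph above, `execution_batches(JOBS)` yields the two wait jobs first, then `aggregate_kpis`, then the two reporting jobs. A workflow engine adds scheduling, retries, and a status UI on top of this ordering.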

  16. Improvements
    ● Memory usage is offloaded
    ● Job status is visualized
    ● Each job becomes retryable
    ● Slack/email reporting is integrated

  17. Summary
    ● I’ve tried to improve data processing
    ○ w/ workflow engines
    ○ And fully managed services

  18. We’re hiring
    ● Software Engineer (Backend System)
    ○ https://open.talentio.com/1/c/mercari/requisitions/detail/4245
    ● Software Engineer (Site Reliability)
    ○ https://open.talentio.com/1/c/mercari/requisitions/detail/4246
    ● Software Engineer (MySQL Reliability)
    ○ https://open.talentio.com/1/c/mercari/requisitions/detail/6415
    ● … and anyone who loves automation!

  19. References
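    ● How we are replacing the aggregation of KPI-related
    figures with Cloud Dataflow (in Japanese)
    ○ http://tech.mercari.com/entry/2017/11/02/142931
    ● Introducing Mercari's data analysis infrastructure
    ~Around BigQuery~ (in Japanese)
    ○ http://tech.mercari.com/entry/2017/12/09/103000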
