
Production-ready stream data pipeline in Merpay, Inc

Ryo Okubo
September 12, 2019


This is a slide for https://www.apachecon.com/acna19/s/#/scheduledEvent/1329


We’ve provided our stream-based data pipeline using Google Cloud Dataflow and Apache Beam since fall 2018. It collects event logs from microservices running on GKE, then transforms and forwards the logs to GCS and BigQuery for analytics and other uses. As you know, implementing and operating streaming jobs is challenging, and we’ve encountered various issues along the way.
I’d like to share our knowledge from both the development and the operations perspectives. The development part covers 3 topics: 1) implementing stream jobs with spotify/scio, 2) how to debug the jobs, especially those with DynamicDestinations, 3) how to load test, to ensure our jobs are stable. The operations part covers: 1) how to deploy new jobs safely (avoiding data loss), 2) how to monitor the jobs and surrounding systems, and miscellaneous topics.



  1. Objectives
 • Deep knowledge of Apache Beam/Cloud Dataflow is not enough in our team
 • We want to find a better way
 So, in this presentation:
 • Describe and share our use case and activities
 • Get better approaches or knowledge we don’t have
  2. Agenda
 01 What is Merpay?
 02 Overview of Our Stream Data Pipeline
 03 How We Make it Production Ready
 04 How We Operate it in Production
  3. What is Merpay?
 Merpay is a mobile payments service operated by Merpay, Inc. The company belongs to the Mercari Group, widely recognized for its service ‘Mercari,’ the top C2C marketplace app in Japan. Money earned by selling unwanted items on the Mercari app and money charged to the app through users’ bank accounts can be used to make payments at stores and on the Mercari app itself.
  4. Compatible with both iOS and Android. Support for NFC payments with iD at 900,000 merchants nationwide.
  5. Code Payments
 Coverage of merchants that do not accept iD. Customers pay by scanning their barcode at the store.

  6. Our stream data pipeline: background
 • microservices on GKE
 • using many GCP services
 [Architecture diagram: an API gateway and microservices on Google Kubernetes Engine behind Google Cloud Load Balancer, spread across several GCP projects using Cloud Spanner, Cloud Pub/Sub, and Cloud Storage]
  7. Our stream data pipeline: overview
 • Aggregate microservice logs
 • Dataflow + Pub/Sub chain ◦ 6 stream jobs ◦ 3 + many Pub/Sub topics
 • GCS + BQ as data lake
 • Started in Oct 2018
 • input: ~2,500 rps

  8. Our stream data pipeline: technical stack
 • Data Sources ◦ Cloud Pub/Sub topics/subscriptions per microservice ◦ Created by an in-house Terraform module
 • Dataflow jobs ◦ Written in Scala with spotify/scio
 • Data Sinks ◦ Store Apache Avro files on GCS ◦ Use Streaming Insert, and bq-load from Apache Airflow, to write to BigQuery
  9. Issues in development
 01 Test => Unit tests and JobTests in spotify/scio
 02 Benchmark => Pub/Sub load generator
 03 Profiling => Stackdriver Profiler (with magic spells)
 04 Debug deeply => Stackdriver Debugger / dockerize + debugger
  10. Test
 • Just plain Java/Scala code ◦ Easy! Just write unit tests
 • Each PTransform ◦ Same as plain code if using spotify/scio
 • But the integration parts...?

  11. Test: simplify code
 • Keep the code as plain Java/Scala as possible ◦ Makes it possible to write unit tests and run them fast ◦ spotify/scio is helpful: it wraps lambda functions into DoFns and converts them into transforms
 • Separate the I/O parts ◦ Makes it possible to write integration tests over the usual transform chains ◦ Requires no mocks or emulators
 • Including I/O … JobTest in spotify/scio! ◦ A simple error case:
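The code sample for this slide did not survive extraction; below is a minimal sketch of what a scio `JobTest` for such an error case could look like. The job object `DatahubJob`, the option names, and the topic/subscription paths are hypothetical, and the `PubsubIO.string` test-IO helper varies slightly across scio versions.

```scala
import com.spotify.scio.pubsub.PubsubIO
import com.spotify.scio.testing._

// Hypothetical job under test: it reads strings from a Pub/Sub subscription,
// forwards parseable records to the output topic and routes broken ones to
// a dead letter topic.
class DatahubJobTest extends PipelineSpec {
  "DatahubJob" should "route broken records to the dead letter topic" in {
    JobTest[DatahubJob.type]
      .args(
        "--input=projects/p/subscriptions/in",
        "--output=projects/p/topics/out",
        "--deadLetter=projects/p/topics/dead")
      // feed one valid and one broken record into the mocked input
      .input(PubsubIO.string("projects/p/subscriptions/in"),
             Seq("""{"id":1}""", "not-json"))
      // assert on each mocked output without touching real Pub/Sub
      .output(PubsubIO.string("projects/p/topics/out")) {
        _ should containSingleValue("""{"id":1}""")
      }
      .output(PubsubIO.string("projects/p/topics/dead")) {
        _ should containSingleValue("not-json")
      }
      .run()
  }
}
```

The point of `JobTest` is that the whole pipeline, including its I/O edges, runs on mocked inputs and outputs, so the dead letter path can be exercised without any emulator.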
  12. Test: property-based testing
 • Our pipeline handles many kinds of logs encoded in Apache Avro and Protocol Buffers ◦ Preparing test data is costly and boring
 • scalacheck + spotify/ratatool is awesome for us!! ◦ protobufOf[GeneratedProtoClass] and avroOf[GeneratedAvroClass] generate random data
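As a sketch, combining the ratatool generators named above with a scalacheck property might look like this. `EventLog` (an Avro-generated class) and `Transforms.toBigQueryRow` are hypothetical stand-ins for the deck's own types.

```scala
import com.spotify.ratatool.scalacheck._
import org.scalacheck.{Gen, Prop}

// EventLog is a hypothetical Avro-generated SpecificRecord class;
// avroOf derives a random-record generator for it
val eventGen: Gen[EventLog] = avroOf[EventLog]

// use the generator in a property: every random record must survive our
// (hypothetical) conversion step without producing an empty row
val roundTrip: Prop = Prop.forAll(eventGen) { event =>
  Transforms.toBigQueryRow(event).nonEmpty
}
roundTrip.check()
```

This removes the need to hand-craft fixtures for every log schema, which is what made test data preparation "high cost and boring".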
  13. Benchmark
 • Concerned about the ability to accept the loads we expect ◦ Recent max objective: 15,000+ rps ◦ There are some shuffle tasks and writes to GCS ◦ Cloud Pub/Sub quota
 • Implemented load-testing Dataflow template jobs that target Pub/Sub ◦ They get the benefits of Cloud Dataflow: Streaming Engine, multiple workers… ◦ Template jobs are reusable and easy to use ◦ We can also use random data generated by scalacheck + ratatool
  14. Benchmark
 • Input: unbounded source from GenerateSequence
 • Transform: just getting random data and packing it as PubsubMessage
 • Output: PubsubIO
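The three bullets above could be sketched as a small scio job like the following. The rate, payload shape, and `saveAsPubsub` helper name are assumptions (the write API differs between scio versions), and in the real jobs the payload would come from the scalacheck + ratatool generators.

```scala
import com.spotify.scio.ContextAndArgs
import org.apache.beam.sdk.io.GenerateSequence
import org.joda.time.Duration

object LoadGenerator {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Input: Beam's built-in unbounded source, emitting 1,000 elements/sec;
    // scaling the rate up lets us push toward the 15,000+ rps objective
    sc.customInput("ticks",
        GenerateSequence.from(0).withRate(1000L, Duration.standardSeconds(1)))
      // Transform: replace the sequence number with a random payload
      .map(_ => randomPayload())
      // Output: publish to the target topic passed as --output=...
      .saveAsPubsub(args("output"))

    sc.run()
  }

  // placeholder payload; the real jobs use ratatool-generated records
  def randomPayload(): String = scala.util.Random.alphanumeric.take(64).mkString
}
```

Built as a Dataflow template, the same job can be re-launched with different rates and topics without recompiling.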

  15. Profiling
 • We need to dive deep when load testing reveals critical bottlenecks ◦ But it isn’t easy, because the jobs are actually hosted on the Dataflow Service
 • There’s a magic spell to enable profiling on the jobs with Stackdriver ◦ --profilingAgentConfiguration='{ "APICurated": true }', a pipeline option you pass from the CLI ◦ https://medium.com/google-cloud/profiling-dataflow-pipelines-ddbbef07761d ◦ [trivial issue] The Dataflow service accepts job names with upper case, but the profiling agent doesn’t
  16. Debug
 • Stackdriver Debugger ◦ If it’s possible to deploy to the Dataflow Service ◦ If any problem occurs in production
 • Dockerize + JVM debugger ◦ Feel free to use it ◦ Set breakpoints in the jobs and do step execution
 • ex. Dump the heap and analyze it ◦ If OOM occurs
  17. Debug: Stackdriver Debugger
 • Just pass the --enableCloudDebugger option! ◦ It requires the Dataflow execution SA to have the Stackdriver Debugger related permissions
 • The application ID is not human readable … :( ◦ You can find the ID on Stackdriver Logging: it’s written by worker-setup in the “uniquifier” field
  18. Debug: Stackdriver Debugger
 • It captures variables and a stacktrace at specified points in running Dataflow jobs

  19. Debug: dockerize
 • Local execution & testing is great for debugging!
 • Pub/Sub alternative: ◦ Pub/Sub emulator
 • GCS alternative: ◦ local filesystem
 • BQ alternative: ◦ nothing…
  20. Debug: dockerize
 • The containers require a lot of CPU/RAM ◦ sbt, jobs on DirectRunner, Pub/Sub emulator, etc. ◦ Possible on a Mac, but running everything on the laptop is a severe option
 • docker-machine helps! ◦ docker-machine has a google driver, which hosts the docker daemon on GCE ◦ Debug ports can be forwarded with docker-machine ssh
  21. Debug: dockerize
 • Start a remote instance on GCE ◦ $ docker-machine create --driver google --google-machine-type <strong type> docker-remote
 • Point docker-compose at the instance ◦ $ eval $(docker-machine env docker-remote)
 • Bring up the docker-compose services ◦ $ docker-compose -f docker-compose-basic.yml -f docker-compose-deadletter.yml up ◦ The files are separated so we can focus on specific jobs
 • Attach to a debug port via ssh forwarding ◦ $ docker-machine ssh docker-remote -L 5005:localhost:5005 ◦ Attach from jdb or IDEs ◦ Set breakpoints, dig, dig…
 • Stop the instance ◦ $ docker-machine stop docker-remote
  22. Debug: heap dump
 • The Dataflow service supports writing heap dumps to GCS on OOM ◦ It needs --dumpHeapOnOOM=true and --saveHeapDumpsToGcsPath=gs://bucket/path/to ◦ We can analyze the dump with Eclipse Memory Analyzer or other tools
  23. Our issues in production
 01 PipelineOptions management => yaml-based configuration
 02 CI/CD => CircleCI
 03 Monitoring => OpenCensus + Stackdriver Monitoring
 04 Alert, OnCall => PagerDuty + Stackdriver Monitoring
  24. PipelineOptions management
 • There are various PipelineOptions ◦ DataflowPipelineOptions actually extends 10+ sub-options ◦ And users define even more…
 • Managing the options is sometimes painful ◦ Structured options… ◦ Using different settings between dev / prod… ◦ We want to record the reasons why we selected each value…
  25. PipelineOptions management: yaml settings
 • PipelineOptions accepts JSON objects ◦ but JSON is not user-friendly: it doesn’t support comments
 • Our solution: use yaml! ◦ Supports complex structures, comments, etc.
 • For example:
 # basic
 runner: org.apache.beam.runners.direct.DirectRunner
 region: us-central1
 streaming: true
 autoscalingAlgorithm: THROUGHPUT_BASED
 maxNumWorkers: 4
 tempLocation: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/tmp/rawdatahub2structureddatahub
 enableStreamingEngine: true
 experiments:
   - enable_stackdriver_agent_metrics
 # for debuggability
 enableCloudDebug: true
 dumpHeapOnOOM: true
 saveHeapDumpsToGcsPath: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/heapdumps/rawdatahub2structureddatahub
 APICurated: true
 # I/O
 input: projects/merpay-dataplatform-jp-test/subscriptions/raw_datahub_to_structured_datahub
 output: projects/merpay-dataplatform-jp-test/topics/structured_datahub
 deadLetter: projects/merpay-dataplatform-jp-test/topics/deadletter_hub
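The slides don't show how the yaml is turned into PipelineOptions. One possible sketch is to parse the file with SnakeYAML and flatten it into the `--key=value` arguments that `PipelineOptionsFactory` already understands; the library choice and helper names here are our assumption, not necessarily what the deck's authors did.

```scala
import org.yaml.snakeyaml.Yaml
import org.apache.beam.sdk.options.{PipelineOptions, PipelineOptionsFactory}
import scala.jdk.CollectionConverters._

object YamlOptions {
  // Flatten a yaml mapping into the --key=value form PipelineOptionsFactory
  // expects; list values (e.g. experiments) are joined with commas.
  def toArgs(yamlText: String): Array[String] = {
    val map = new Yaml().load[java.util.Map[String, AnyRef]](yamlText).asScala
    map.map {
      case (k, v: java.util.List[_]) => s"--$k=${v.asScala.mkString(",")}"
      case (k, v)                    => s"--$k=$v"
    }.toArray
  }

  def load(yamlText: String): PipelineOptions =
    PipelineOptionsFactory.fromArgs(toArgs(yamlText): _*).create()
}
```

Because yaml comments are stripped during parsing, the "why we chose this value" notes live only in the config file, exactly where reviewers look.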

  26. CI/CD
 • CI: CircleCI ◦ sbt test, sbt it ◦ trying template job builds
 • CD: CircleCI ◦ Automatically deploys to the development env ◦ Template build -> drain -> run pipeline ▪ To simplify deployments ▪ Update is good if it keeps things simple ◦ Basically keep compatibility between jobs ▪ To avoid requiring in-order deployments
  27. Monitoring
 • GCP supports basic metrics ◦ Dataflow Service: system lag, watermark, … ◦ Cloud Pub/Sub: unacked messages, … ◦ Stackdriver Logging: log-based custom metrics, e.g. number of OOM exceptions ◦ JVM: CPU util, GC time, … ▪ Needs --experiments=enable_stackdriver_agent_metrics
 • Application-level metrics ◦ We implement a metrics collector with OpenCensus ◦ Processed entry count ◦ Dead letter count ◦ Transform duration
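The application-level counters above could be wired up with the OpenCensus Java stats API roughly as follows. The measure names mirror the bullets but are otherwise our invention, and the Stackdriver exporter setup is assumed to happen elsewhere.

```scala
import io.opencensus.stats.{Aggregation, Stats, View}
import io.opencensus.stats.Measure.MeasureLong
import io.opencensus.stats.View.Name
import scala.jdk.CollectionConverters._

object PipelineMetrics {
  private val recorder = Stats.getStatsRecorder

  // measures matching the slide: processed entries and dead-lettered entries
  val Processed: MeasureLong =
    MeasureLong.create("pipeline/processed_count", "Processed entries", "1")
  val DeadLetter: MeasureLong =
    MeasureLong.create("pipeline/deadletter_count", "Dead-lettered entries", "1")

  // register cumulative count views so an exporter (e.g. Stackdriver's)
  // can ship the aggregated values
  def registerViews(): Unit =
    Seq(Processed, DeadLetter).foreach { m =>
      Stats.getViewManager.registerView(
        View.create(Name.create(m.getName), m.getDescription, m,
          Aggregation.Count.create(), List.empty.asJava))
    }

  // call from inside a DoFn / map step each time an element is handled
  def inc(measure: MeasureLong, n: Long = 1L): Unit =
    recorder.newMeasureMap().put(measure, n).record()
}
```

Recording via a shared object keeps the per-element cost to a single `MeasureMap` write, which matters inside high-throughput transforms.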
  28. Alerts, OnCall
 • Create monitors on Stackdriver Monitoring ◦ Configured in Terraform ◦ The major metric targets are system lag and watermark
 • Trigger alerts to PagerDuty and catch the call

  29. Closing
 • We are ◦ Providing a mobile payment service ◦ Using Apache Beam and Cloud Dataflow to run stream jobs ◦ Making full use of Stackdriver, plus several techniques for debuggability ◦ Keeping operations very simple
 • Share your related knowledge with us!

  30. Schema management, evolution
 • We accept both schema-on-read and schema-on-write strategies ◦ PubsubMessage.payload is just Array[Byte]; unknown formats are also OK ◦ We’ve defined an original input-side protocol ▪ In any case, we store the incoming raw data ▪ The sender can specify schema information in Pub/Sub attributes ▪ If the Dataflow job knows the schema, it tries to parse the payload and convert it to Avro files and/or BQ records ▪ We have a Protocol Buffers -> Avro conversion layer ▪ Next, we are thinking about introducing a schema registry
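The input-side protocol described above could be sketched as a small routing function over the Pub/Sub attributes. The attribute keys and the result type are hypothetical, since the slides don't name them; only the overall behavior (parse when the schema is known, otherwise store raw) comes from the deck.

```scala
// What to do with an incoming message, decided from the schema information
// the sender put into the Pub/Sub attributes.
sealed trait Handling
case object StoreRawOnly extends Handling                      // schema-on-read path
case class ParseAndConvert(format: String, schemaId: String) extends Handling

def route(attributes: Map[String, String]): Handling =
  (attributes.get("schema-format"), attributes.get("schema-id")) match {
    // known format + schema id: parse and convert to Avro / BQ records
    case (Some(fmt @ ("avro" | "protobuf")), Some(id)) => ParseAndConvert(fmt, id)
    // anything else: the payload stays an opaque Array[Byte], store it raw
    case _ => StoreRawOnly
  }
```

Keeping the raw bytes in every branch is what makes the two strategies coexist: schema-on-write is an optimization layered on top, never a prerequisite.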
  31. BigQuery and streaming insert
 • BigQuery is an awesome DWH: many features, high performance! ◦ Streaming insert is a fast way to realize a lambda architecture ◦ But schema evolution is super painful… ▪ We do crazy backward-compatibility checks on BigQuery schemas, call the patch API when evolution is possible, and finally insert ▪ If it fails? There is no easy way to prevent data loss!
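The backward-compatibility check mentioned above might look like the sketch below: a new schema is safe to apply via the tables patch API only if it keeps every existing field unchanged and adds nothing but NULLABLE fields. The `Field` model is a simplification of BigQuery's schema (no nested records), and the actual patch call is out of scope here.

```scala
// simplified view of a BigQuery field: name, type (e.g. STRING), mode
case class Field(name: String, fieldType: String, mode: String)

// true if `next` can be applied without breaking existing data:
// all old fields are kept as-is, and every added field is NULLABLE
def isBackwardCompatible(current: Seq[Field], next: Seq[Field]): Boolean = {
  val nextByName = next.map(f => f.name -> f).toMap
  // every existing field must still be present with identical type and mode
  val keepsOldFields = current.forall(f => nextByName.get(f.name).contains(f))
  // fields that exist only in the new schema must be NULLABLE
  val addedFields = next.filterNot(f => current.exists(_.name == f.name))
  keepsOldFields && addedFields.forall(_.mode == "NULLABLE")
}
```

When this check fails, the streaming insert has nowhere safe to go, which is exactly the data-loss risk the slide calls out.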