Production-ready stream data pipeline in Merpay, Inc

Ryo Okubo
September 12, 2019

This is a slide for


We have been running a stream-based data pipeline on Google Cloud Dataflow and Apache Beam since fall 2018. It collects event logs from microservices running on GKE, then transforms and forwards the logs to GCS and BigQuery for analytics and other uses. As you know, implementing and operating streaming jobs is challenging, and we have encountered various issues along the way.
I’d like to share what we have learned from both the development and the operation perspective. The development part covers three topics: 1) implementing stream jobs with spotify/scio, 2) how to debug the jobs, especially those using DynamicDestinations, and 3) how to load test the jobs to make sure they are stable. The operation part covers 1) how to deploy new jobs safely (avoiding data loss), 2) how to monitor the jobs and the surrounding systems, and a few miscellaneous topics.




  1. Production-Ready Stream Data Pipeline in Merpay
 Ryo Okubo

  2. Objectives
 Background
 • Deep knowledge for Apache Beam/Cloud Dataflow is not
 • We want to find a better way
 So, in this presentation:
 • Describe/share our use case and activities
 • Get better ways or knowledge we don’t have
  3. Agenda
 01 What is Merpay?
 02 Overview of Our Stream Data Pipeline
 03 How We Make it Production Ready
 04 How We Operate it in Production
  4. What is Merpay?

  5. What is Merpay?
 Merpay is a mobile payments service operated by Merpay, Inc. The company belongs to the Mercari Group, widely recognized for its service ‘Mercari,’ the top C2C marketplace app in Japan. Money earned by selling unwanted items on the Mercari app and money charged to the app through users’ bank accounts can be used to make payments at stores and on the Mercari app itself.
  6. Compatible with both iOS and Android
 Support for NFC payments with iD at 900,000 merchants nationwide
  7. Code Payments
 Coverage of merchants that do not accept iD.
 Customers pay by scanning their barcode at the store.

  8. Overview of Our Stream Data Pipeline 

  9. Our stream data pipeline: background
 • microservices on GKE
 • using many GCP services
 [Architecture diagram; labels: API Gateway, Authority API, API Service X, API Service Y, Google Cloud Load Balancer, Service A, Service B, Service C, Web Service Z, Google Kubernetes Engine, Cloud Spanner, Cloud Pub/Sub, Cloud Storage, Project A, Project B, Project C, Project GKE]
  10. Our stream data pipeline: overview
 • Aggregate microservice logs
 • Dataflow + Pub/Sub chain
 ◦ 6 stream jobs
 ◦ 3 + many Pub/Sub topics
 • GCS + BQ as DataLake
 • Running since Oct/2018
 • input: ~2500 rps

  11. Our stream data pipeline: technical stack
 • Data Sources
 ◦ Cloud Pub/Sub topics/subscriptions per microservice
 ◦ Created by in-house Terraform module
 • Dataflow jobs
 ◦ Written in Scala with spotify/scio
 • Data Sinks
 ◦ Store Apache Avro files on GCS
 ◦ Using Streaming Insert and bq-load from Apache Airflow to write to BigQuery
  12. How to Make it Production Ready

  13. Issues in development
 01 Test
 02 Benchmark
 03 Profiling
 04 Debug deeply
  14. Issues in development
 01 Test => Unit tests and JobTest in spotify/scio
 02 Benchmark => Pub/Sub load generator
 03 Profiling => Stackdriver Profiler (with magic spells)
 04 Debug deeply => Stackdriver Debugger / Dockerize + debugger
  15. Test
 • Just plain Java/Scala code
 ◦ Easy! Just write unit tests
 • Each PTransform
 ◦ Same as plain code if using spotify/scio
 • But the integration parts...?

  16. Test: simplify code
 • Keep the code as plain Java/Scala as possible
 ◦ Plain code can be unit tested and executes fast
 ◦ spotify/scio is helpful: it wraps lambda functions in DoFns and converts them to transforms
 • Separate the I/O parts
 ◦ Transform integration tests can then be written as ordinary transform chains
 ◦ They don’t require any mocks or emulators
 • Including I/O … JobTest in spotify/scio!
 ◦ A simple error case:
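The code from the original slide is not reproduced here. As a separate illustration of the "plain code first" idea, the sketch below keeps the parsing logic free of any Beam or scio dependency so it can be unit tested directly. The `LogEvent` type, the `parseLine` function, and the `service,level,message` line format are hypothetical and chosen only for this example; in a scio job a function like this would typically be passed to `map` or `flatMap`.

```scala
// Hypothetical event type, not taken from the slides.
final case class LogEvent(service: String, level: String, message: String)

object LogParsing {
  // Pure function: no Pub/Sub, GCS, or BigQuery involved, so it can be
  // unit tested without mocks or emulators and runs fast.
  def parseLine(line: String): Either[String, LogEvent] =
    line.split(",", 3) match {
      case Array(service, level, message) if service.nonEmpty =>
        Right(LogEvent(service.trim, level.trim.toUpperCase, message.trim))
      case _ =>
        // Malformed input comes back as Left so that a caller can route
        // it to a dead-letter output instead of failing the job.
        Left(line)
    }
}
```

Returning `Either` rather than throwing keeps the error path testable in the same way as the success path.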
  17. Test: Property-based testing
 • Our pipeline handles many kinds of logs encoded in Apache Avro and Protocol Buffers
 ◦ Preparing test data is costly and boring
 • scalacheck + spotify/ratatool is awesome for us!!
 ◦ protobufOf[GeneratedProtoClass], avroOf[GeneratedAvroClass] generate random data
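To show what a property-based test checks, here is a hand-rolled sketch that uses only the Scala standard library in place of scalacheck and ratatool. The `Entry` type and the toy `encode`/`decode` codec are hypothetical stand-ins for generated Avro or Protocol Buffers classes; the property is that decoding an encoded record returns the original record for many random inputs.

```scala
import scala.util.Random

object RoundTripProperty {
  // Hypothetical record type standing in for a generated Avro/protobuf class.
  final case class Entry(id: Long, tags: List[String])

  // Toy codec standing in for a real serializer.
  def encode(e: Entry): String = (e.id.toString :: e.tags).mkString("|")

  def decode(s: String): Entry = {
    val parts = s.split("\\|", -1).toList
    Entry(parts.head.toLong, parts.tail)
  }

  // Random record generator; tags are non-empty alphanumeric strings,
  // so they never contain the "|" separator.
  def randomEntry(rnd: Random): Entry = {
    val tags = List.fill(rnd.nextInt(4))(rnd.alphanumeric.take(1 + rnd.nextInt(8)).mkString)
    Entry(rnd.nextLong(), tags)
  }

  // Returns the first counterexample, or None if the property held
  // for every sample. A fixed seed keeps failures reproducible.
  def check(samples: Int, seed: Long): Option[Entry] = {
    val rnd = new Random(seed)
    Iterator.fill(samples)(randomEntry(rnd)).find(e => decode(encode(e)) != e)
  }
}
```

A real generator library adds shrinking of counterexamples and schema-aware generation, which this sketch does not attempt.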
  18. Benchmark
 • Concerned about ability to accept the loads we
 ◦ Recent max objective: 15,000+ rps
 ◦ There are some shuffle tasks and writes to GCS
 ◦ Cloud Pub/Sub quota
 • Implemented load-testing Dataflow template jobs that target Pub/Sub
 ◦ They get the benefits of Cloud Dataflow: Streaming Engine, multiple workers…
 ◦ Template jobs are reusable and easy to use
 ◦ We can also use random data generated by scalacheck + ratatool
  19. Benchmark
 • Input: UnboundedSource from GenerateSequence
 • Transform: Just getting random data and packing it as PubsubMessage
 • Output: PubsubIO
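The three stages above can be sketched without Beam as follows. This is only an illustration of the shape of the generator: `Message` is a hypothetical stand-in for Beam's `PubsubMessage`, the sequence is bounded rather than unbounded, and nothing is actually published.

```scala
import scala.util.Random

object LoadGenerator {
  // Stand-in for a Pub/Sub message: payload bytes plus attributes.
  final case class Message(payload: Array[Byte], attributes: Map[String, String])

  // Transform stage: turn one sequence number into one message.
  // Seeding by the sequence number makes each message reproducible.
  def toMessage(seq: Long, payloadBytes: Int): Message = {
    val rnd = new Random(seq)
    val payload = new Array[Byte](payloadBytes)
    rnd.nextBytes(payload)
    Message(payload, Map("seq" -> seq.toString))
  }

  // Input stage: a bounded stand-in for an unbounded sequence source.
  def generate(count: Long, payloadBytes: Int): Iterator[Message] =
    Iterator.iterate(0L)(_ + 1).takeWhile(_ < count).map(toMessage(_, payloadBytes))
}
```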

  20. Benchmark
 • An execution example on Google Cloud Console

  21. Profiling
 • We need to dive deeply when we find critical bottlenecks from load testing
 ◦ But it isn’t easy because the jobs are actually hosted on Dataflow Service
 • There’s magic to enable profiling on the jobs with Stackdriver
 ◦ --profilingAgentConfiguration='{ \"APICurated\": true }', a pipeline option you give from the CLI
 ◦ [trivial issue] Dataflow service accepts job names with upper case, but the profiler agent doesn’t
  22. Profiling
 • A Stackdriver Profiler result example: 

  23. Debug
 • Stackdriver Debugger
 ◦ If possible to deploy to Dataflow Service
 ◦ If any problem occurs in production
 • Dockerize + JVM debugger
 ◦ Feel free to use
 ◦ Set breakpoints in the jobs and step through execution
 • ex. Dump heap and analyze
 ◦ If OOM occurs
  24. Debug: Stackdriver Debugger
 • Just give the --enableCloudDebugger option!
 ◦ The service account that executes the Dataflow job needs Stackdriver Debugger related permissions
 • Application ID is not human readable … :(
 ◦ You can find the ID on Stackdriver Logging: it’s written by worker-setup in the “uniquifier” field
  25. Debug: Stackdriver Debugger
 • It captures variables and a stacktrace at specified points for running Dataflow jobs

  26. Debug: Dockerize
 • Local execution & testing is great to debug!
 • Pub/Sub alternative:
 ◦ Pub/Sub emulator
 • GCS alternative:
 ◦ Local Filesystem
 • BQ alternative:
 ◦ nothing…
  27. Debug: Dockerize
 • The containers require a lot of CPU/RAM
 ◦ sbt, jobs on DirectRunner, Pub/Sub emulator, etc.
 ◦ Possible on a Mac, but running everything on that machine is a tough option
 • docker-machine helps!
 ◦ docker-machine has a google driver that hosts the Docker daemon on GCE
 ◦ Debug ports can be forwarded with docker-machine ssh
  28. Debug: Dockerize
 • Start a remote instance on GCE
 ◦ $ docker-machine create --driver google --google-machine-type <strong type> docker-remote
 • Use the instance from docker-compose
 ◦ $ eval (docker-machine env docker-remote)
 • Bring up the docker-compose services
 ◦ $ docker-compose -f docker-compose-basic.yml -f docker-compose-deadletter.yml up
 ◦ To focus on specific jobs, the files are separated
 • Attach to a debug port via ssh forwarding
 ◦ $ docker-machine ssh docker-remote -L 5005:localhost:5005
 ◦ Attach from jdb or IDEs
 ◦ Set breakpoints, dig, dig....
 • Stop the instance
 ◦ $ docker-machine stop docker-remote
  29. Debug: Demo • Demo time!

  30. Debug: Heap dump
 • Dataflow service supports writing heap dumps to GCS on OOM
 ◦ It needs --dumpHeapOnOOM=true and --saveHeapDumpsToGcsPath=gs://bucket/path/to
 ◦ We can analyze it with Eclipse Memory Analyzer or other tools
  31. How We Operate it in Production

  32. Our issues in production
 01 PipelineOptions management
 02 CI/CD
 03 Monitoring
 04 Alert, On Call
  33. Our issues in production
 01 PipelineOptions management => YAML-based configuration
 02 CI/CD => CircleCI
 03 Monitoring => OpenCensus + Stackdriver Monitoring
 04 Alert, On Call => PagerDuty + Stackdriver Monitoring
  34. PipelineOptions management
 • There are various PipelineOptions
 ◦ Actually DataflowPipelineOptions extends 10+ sub options
 ◦ And more defined by users…
 • The option management is sometimes painful
 ◦ Structured options…
 ◦ Using different settings between dev / prod...
 ◦ We want to describe the reasons why we selected each value…
  35. PipelineOptions management: yaml settings
 • PipelineOptions accepts JSON objects
 ◦ but JSON is not user-friendly - doesn’t support comments
 • Our solution: using yaml!
 ◦ Supports complex structures, comments, etc.
 • For example:
 # basic
 region: us-central1
 streaming: true
 autoscalingAlgorithm: THROUGHPUT_BASED
 maxNumWorkers: 4
 tempLocation: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/tmp/rawdatahub2structureddatahub
 enableStreamingEngine: true
 experiments:
   - enable_stackdriver_agent_metrics
 # for debuggability
 enableCloudDebugger: true
 dumpHeapOnOOM: true
 saveHeapDumpsToGcsPath: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/heapdumps/rawdatahub2structureddatahub
 profilingAgentConfiguration:
   APICurated: true
 # I/O
 input: projects/merpay-dataplatform-jp-test/subscriptions/raw_datahub_to_structured_datahub
 output: projects/merpay-dataplatform-jp-test/topics/structured_datahub
 deadLetter: projects/merpay-dataplatform-jp-test/topics/deadletter_hub
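One way to feed such a file into a job is to flatten it into `--key=value` arguments. The sketch below is an assumption about how that could look, not the talk's implementation: it expects the YAML to have been parsed already (by whatever YAML library is in use) into nested Scala values, and `OptionArgs` and `toArgs` are hypothetical names. Nested maps are rendered as JSON object strings and lists as comma-separated values; string escaping is deliberately left out.

```scala
object OptionArgs {
  // Render one already-parsed YAML value as a command-line value.
  private def render(value: Any): String = value match {
    case m: Map[_, _] =>
      // Nested structures become a JSON object string.
      m.iterator
        .map { case (k, v) => "\"" + k + "\":" + renderJson(v) }
        .mkString("{", ",", "}")
    case xs: Seq[_] => xs.map(render).mkString(",")
    case other      => other.toString
  }

  // JSON rendering for values inside a nested object.
  // Note: quotes inside strings are not escaped in this sketch.
  private def renderJson(value: Any): String = value match {
    case s: String    => "\"" + s + "\""
    case m: Map[_, _] => render(m)
    case xs: Seq[_]   => xs.map(renderJson).mkString("[", ",", "]")
    case other        => other.toString
  }

  // One "--key=value" argument per top-level option, in input order.
  def toArgs(options: Seq[(String, Any)]): Seq[String] =
    options.map { case (key, value) => s"--$key=${render(value)}" }
}
```

Keeping the options as an ordered sequence of pairs preserves the order of the file, which makes the generated command line easier to compare against the YAML.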

  36. CI/CD
 • CI: CircleCI
 ◦ sbt test, sbt it
 ◦ trying template job builds
 • CD: CircleCI
 ◦ Automatically deploying to development env
 ◦ Template build -> drain -> run pipeline
 ▪ To simplify deployments
 ▪ Update is good if it can be done while keeping things simple
 ◦ Basically keep compatibility between jobs
 ▪ To avoid requiring in-order deployments
  37. Monitoring
 • GCP supports basic metrics
 ◦ Dataflow Service: system lag, watermark,…
 ◦ Cloud Pub/Sub: unacked messages,…
 ◦ Stackdriver Logging: log-based custom metrics, e.g. number of OOM exceptions
 ◦ JVM: CPU util, GC time, …
 ▪ It needs --experiments=enable_stackdriver_agent_metrics
 • Application level metrics
 ◦ We implement metrics collector with OpenCensus
 ◦ Processed entries count
 ◦ Deadletter count
 ◦ Transform duration time
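The three application-level metrics listed above can be illustrated with a small in-memory collector. This is a stand-in written with the JDK only, not OpenCensus code; `PipelineMetrics` and `measured` are hypothetical names, and a real collector would export these values rather than keep them in local counters.

```scala
import java.util.concurrent.atomic.AtomicLong

// In-memory stand-in for a metrics collector (hypothetical, not OpenCensus).
final class PipelineMetrics {
  val processed = new AtomicLong(0)      // processed entries count
  val deadLetters = new AtomicLong(0)    // dead-letter count
  val transformNanos = new AtomicLong(0) // total transform duration

  // Wraps a transform function so that every call records its count,
  // its duration, and whether the result went to the dead-letter side.
  def measured[A, B](f: A => Either[String, B]): A => Either[String, B] = { in =>
    val start = System.nanoTime()
    val out = f(in)
    transformNanos.addAndGet(System.nanoTime() - start)
    processed.incrementAndGet()
    if (out.isLeft) deadLetters.incrementAndGet()
    out
  }
}
```

Wrapping the function instead of editing it keeps the transform itself free of metrics code, so it stays easy to unit test.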
  38. Monitoring • Example: Our dashboard on Stackdriver Monitoring:

  39. Alerts, OnCall
 • Create monitors on Stackdriver Monitoring
 ◦ configured in Terraform
 ◦ Major target metrics are system lag and watermark
 • Trigger alerts to PagerDuty and catch the call

  40. Closing
 • We are
 ◦ Providing a mobile payment service
 ◦ Using Apache Beam and Cloud Dataflow to run stream jobs
 ◦ Making full use of Stackdriver and a few other techniques for debuggability
 ◦ Keeping operations very simple
 • Share your related knowledge with us!

  41. Appendix
 (will talk if I have time)

  42. Schema management, evolution
 • We accept both schema-on-read and schema-on-write strategies
 ◦ PubsubMessage.payload is just Array[Byte], unknown formats are also ok
 ◦ We’ve defined an original input-side protocol
 ▪ Anyway we are storing incoming raw data
 ▪ Sender can specify schema information in Pubsub attributes
 ▪ If the Dataflow job knows the schema, it tries to parse the data and convert it to Avro files and/or BQ records
 ▪ We have a Protocol Buffer -> Avro conversion layer
 ▪ Next, we are thinking of introducing a schema registry
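The routing described above can be sketched as follows. Everything named here is an assumption for illustration: the attribute key `schema`, the `Incoming`, `Structured`, and `RawOnly` types, and the `Parser` signature are hypothetical, and the actual input-side protocol is not shown in the slides.

```scala
object SchemaRouting {
  // Stand-in for a Pub/Sub message: opaque payload plus attributes.
  final case class Incoming(payload: Array[Byte], attributes: Map[String, String])

  sealed trait Routed
  // Payload was parsed with a known schema (schema-on-write path).
  final case class Structured(schema: String, fields: Map[String, String]) extends Routed
  // Payload is kept only as raw data (schema-on-read path).
  final case class RawOnly(reason: String) extends Routed

  // A parser returns None when the payload does not fit its schema.
  type Parser = Array[Byte] => Option[Map[String, String]]

  def route(msg: Incoming, parsers: Map[String, Parser]): Routed =
    msg.attributes.get("schema") match { // "schema" key is an assumption
      case None => RawOnly("no schema attribute")
      case Some(name) =>
        parsers.get(name) match {
          case None => RawOnly(s"unknown schema: $name")
          case Some(parse) =>
            parse(msg.payload)
              .map(fields => Structured(name, fields))
              .getOrElse(RawOnly(s"payload does not match schema: $name"))
        }
    }
}
```

Because every branch yields a value, unknown or malformed input still ends up stored as raw data rather than being dropped.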
  43. BigQuery and streaming insert
 • BigQuery is an awesome DWH - many features, high performance!
 ◦ Streaming insert is a fast way to provide lambda architecture
 ◦ But schema evolution is super painful…
 ▪ We are doing crazy schema backward compatibility checks for BigQuery, calling the patch API if the schema can evolve, and finally inserting
 ▪ If it fails? No way to prevent data loss easily!
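A backward compatibility check of the kind mentioned above might look like the sketch below. It is a simplified assumption, not the talk's implementation: it treats a schema change as patchable only if no column is removed, no type changes, the only mode change is REQUIRED to NULLABLE, and new columns are not REQUIRED. Nested RECORD fields are ignored, and `Field` and `incompatibilities` are hypothetical names.

```scala
object BqSchemaCompat {
  // Flat column description; nested RECORD fields are out of scope here.
  final case class Field(name: String, tpe: String, mode: String)

  // Returns a human-readable list of problems; empty means the proposed
  // schema is treated as a compatible evolution of the current one.
  def incompatibilities(current: Seq[Field], proposed: Seq[Field]): Seq[String] = {
    val proposedByName = proposed.map(f => f.name -> f).toMap
    val currentNames = current.map(_.name).toSet

    // Checks on columns that already exist.
    val existingProblems = current.flatMap { old =>
      proposedByName.get(old.name) match {
        case None =>
          Some(s"${old.name}: column removed")
        case Some(n) if n.tpe != old.tpe =>
          Some(s"${old.name}: type changed ${old.tpe} -> ${n.tpe}")
        case Some(n) if n.mode != old.mode && !(old.mode == "REQUIRED" && n.mode == "NULLABLE") =>
          Some(s"${old.name}: mode changed ${old.mode} -> ${n.mode}")
        case _ => None
      }
    }

    // Checks on newly added columns.
    val addedProblems = proposed.collect {
      case f if !currentNames(f.name) && f.mode == "REQUIRED" =>
        s"${f.name}: new column must not be REQUIRED"
    }

    existingProblems ++ addedProblems
  }
}
```

Returning the full list of problems instead of a boolean makes it easier to log why a record was not inserted.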