
Production-ready stream data pipeline in Merpay, Inc

Ryo Okubo
September 12, 2019


These are the slides for https://www.apachecon.com/acna19/s/#/scheduledEvent/1329


We have been running our stream data pipeline, built on Google Cloud Dataflow and Apache Beam, since fall 2018. It collects event logs from microservices running on GKE, then transforms and forwards them to GCS and BigQuery for analytics and other uses. As you know, implementing and operating streaming jobs is challenging, and we encountered various issues along the way.
I’d like to share our knowledge from both the development and operation perspectives. The development part covers three topics: 1) implementing stream jobs with spotify/scio, 2) how to debug the jobs, especially those with DynamicDestinations, and 3) how to load-test to ensure our jobs are stable. The operation part covers: 1) how to deploy new jobs safely (avoiding data loss), 2) how to monitor the jobs and surrounding systems, and miscellaneous topics.



  1. Ryo Okubo

    Production-Ready Stream Data 

    Pipeline in Merpay


  2. Background

    ● Our deep knowledge of Apache Beam/Cloud Dataflow is still not enough

    ● We want to find better ways

    So, in this presentation:

    ● We describe and share our use case and activities

    ● We hope to get better approaches or knowledge we don’t have


  3. What is Merpay?
    Overview of Our Stream Data Pipeline
    How We Make it Production Ready
    How We Operate it in Production


  4. What is Merpay?



  5. What is Merpay?
    Merpay is a mobile payments service
    operated by Merpay, Inc. The company
    belongs to the Mercari Group, widely
    recognized for its service ‘Mercari,’ the
    top C2C marketplace app in Japan.
    Money earned by selling unwanted
    items on the Mercari app and money
    charged to the app through users’ bank
    accounts can be used to make
    payments at stores and on the Mercari
    app itself.


  6. Compatible with both iOS and Android
    Support for NFC payments with iD at 900,000 merchants nationwide


  7. Code Payments
    Coverage of merchants that do not accept iD. Customers pay by scanning their barcode at the store.


  8. Overview of Our Stream Data Pipeline 



  9. Our stream data pipeline: background
    ● microservices on GKE
    ● using many GCP services

    [Architecture diagram: microservices (Service A, B, C, X, Y, Z) on Google Kubernetes Engine behind an API gateway and Google Cloud Load Balancer, spread across GCP projects (Project A, B, C, GKE)]


  10. Our stream data pipeline: overview
    ● Aggregate microservice logs 

    ● Dataflow + PubSub chain 

    ○ 6 stream jobs
    ○ 3+many pubsub topics
    ● GCS + BQ as DataLake 

    ● Running since Oct 2018 

    ● Input: ~2,500 rps 


  11. Our stream data pipeline: technical stack
    ● Data Sources

    ○ Cloud Pub/Sub topics/subscriptions per microservice
    ○ Created by in-house Terraform module
    ● Dataflow jobs

    ○ Written in Scala with spotify/scio
    ● Data Sinks

    ○ Store Apache Avro files on GCS (see the sketch below)
    ○ Using streaming insert, plus bq-load from Apache Airflow, to write to BigQuery

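    For the GCS sink, a minimal sketch using scio’s Avro support; the bucket path and the Avro-generated LogEvent class are hypothetical:

    import com.spotify.scio.avro._
    import com.spotify.scio.values.SCollection

    object GcsSink {
      // events is an SCollection of an Avro-generated SpecificRecord class.
      def sinkToGcs(events: SCollection[LogEvent]): Unit = {
        // Writes sharded Avro files under the given GCS prefix.
        events.saveAsAvroFile("gs://example-bucket/datalake/logevent")
      }
    }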

  12. How to Make it Production Ready



  13. Issues in development
    Test
    Benchmark
    Profiling
    Debug deeply


  14. Issues in development

    Test => unit tests and JobTests in spotify/scio
    Benchmark => Pub/Sub load generator
    Profiling => Stackdriver Profiler (with magic spells)
    Debug deeply => Stackdriver Debugger / dockerize + debugger


  15. Test
    ● Just plain Java/Scala code 

    ○ Easy! Just write unit tests
    ● Each PTransform

    ○ Same as plain code if using spotify/scio
    ● But the integration parts...? 


  16. Test: simplify code
    ● Keep the code as plain Java/Scala as much as possible 
    ○ Plain code is easy to unit-test and fast to execute
    ○ spotify/scio helps: it wraps lambda functions into DoFns and converts them into transforms
    ● Separate the I/O parts
    ○ Then integration tests over the usual transform chains can be written
    ○ without any mocks or emulators
    ● Including I/O … JobTest in spotify/scio! 
    ○ A simple error case (see the sketch below):

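    A minimal sketch of such a JobTest, using a toy job object LogJob (hypothetical: it reads --input, writes parsable lines to --output and everything else to --deadLetter):

    import com.spotify.scio.ContextAndArgs
    import com.spotify.scio.io.TextIO
    import com.spotify.scio.testing._

    object LogJob {
      // Toy rule: lines starting with '{' count as parsable; the rest are dead-lettered.
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        val lines = sc.textFile(args("input"))
        lines.filter(_.startsWith("{")).saveAsTextFile(args("output"))
        lines.filterNot(_.startsWith("{")).saveAsTextFile(args("deadLetter"))
        sc.run()
      }
    }

    class LogJobTest extends PipelineSpec {
      "LogJob" should "route unparsable records to the dead letter output" in {
        JobTest[LogJob.type]
          .args("--input=in.txt", "--output=out.txt", "--deadLetter=dead.txt")
          .input(TextIO("in.txt"), Seq("""{"ok":true}""", "not-json"))
          .output(TextIO("out.txt")) { _ should containSingleValue("""{"ok":true}""") }
          .output(TextIO("dead.txt")) { _ should containSingleValue("not-json") }
          .run()
      }
    }

    JobTest stubs the named inputs and asserts on the named outputs, so the whole job graph runs without touching real Pub/Sub or GCS.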

  17. Test: property-based testing
    ● Our pipeline handles many kinds of logs encoded in Apache Avro and Protocol Buffers 
    ○ Preparing test data by hand is costly and boring
    ● scalacheck + spotify/ratatool is awesome for us!! (see the sketch below) 
    ○ protobufOf[GeneratedProtoClass] and avroOf[GeneratedAvroClass] generate random data

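    A minimal sketch, assuming an Avro-generated class LogEvent and a hypothetical LogParser under test; avroOf derives a scalacheck Gen, so no fixture data is written by hand:

    import com.spotify.ratatool.scalacheck._
    import org.scalacheck.Gen
    import org.scalatest.flatspec.AnyFlatSpec
    import org.scalatestplus.scalacheck.ScalaCheckPropertyChecks

    class LogParserSpec extends AnyFlatSpec with ScalaCheckPropertyChecks {
      // Random but schema-valid records derived from the generated Avro class.
      val events: Gen[LogEvent] = avroOf[LogEvent]

      "LogParser" should "convert any schema-valid event" in {
        forAll(events) { event =>
          assert(LogParser.toRecord(event).isSuccess) // hypothetical conversion under test
        }
      }
    }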

  18. Benchmark
    ● We were concerned about whether the pipeline can accept the load we expect 
    ○ Recent max objective: 15,000+ rps
    ○ There are some shuffle tasks and writes to GCS
    ○ Cloud Pub/Sub quotas
    ● We implemented load-testing Dataflow template jobs that publish to a Pub/Sub target 
    ○ They get the benefits of Cloud Dataflow: Streaming Engine, multiple workers, …
    ○ Template jobs are reusable and easy to use
    ○ We can also use the random data generated by scalacheck + ratatool


  19. Benchmark

    ● Input: an unbounded source from GenerateSequence (see the sketch below) 

    ● Transform: just taking random data and packing it into PubsubMessages

    ● Output: PubsubIO

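    A minimal sketch of such a load-generator job; the --output topic and --rps option names are assumptions:

    import com.spotify.scio.ContextAndArgs
    import org.apache.beam.sdk.io.GenerateSequence
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
    import org.joda.time.Duration
    import scala.util.Random

    object LoadGenerator {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        val rps = args("rps").toLong

        // Unbounded input: GenerateSequence emits `rps` elements per second.
        sc.customInput("sequence",
            GenerateSequence.from(0).withRate(rps, Duration.standardSeconds(1)))
          // Pack each element into a random payload (ratatool generators also fit here).
          .map(i => s"""{"seq":$i,"payload":"${Random.alphanumeric.take(64).mkString}"}""")
          // Publish to the target topic.
          .saveAsCustomOutput("publish", PubsubIO.writeStrings().to(args("output")))

        sc.run()
      }
    }

    Because it is built as a template job, the same binary can be launched repeatedly with different rates and topics.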

  20. Benchmark

    ● An execution example on Google Cloud Console WebUI


  21. Profiling

    ● We need to dive deep when load testing reveals critical bottlenecks 
    ○ But that isn’t easy, because the jobs are hosted on the Dataflow service
    ● There is a magic spell to enable profiling of the jobs with Stackdriver 
    ○ --profilingAgentConfiguration='{ "APICurated": true }', a pipeline option passed on the CLI
    ○ https://medium.com/google-cloud/profiling-dataflow-pipelines-ddbbef07761d
    ○ [trivial issue] The Dataflow service accepts job names containing uppercase letters, but the profiling agent does not


  22. Profiling

    ● A Stackdriver Profiler result example: 


  23. Debug
    ● Stackdriver Debugger 
    ○ When the job can be deployed to the Dataflow service
    ○ When a problem occurs in production
    ● Dockerize + JVM debugger 
    ○ Free to use at any time
    ○ Set breakpoints in the jobs and step through the execution
    ● e.g. dump the heap and analyze it 
    ○ When an OOM occurs


  24. Debug: Stackdriver Debugger
    ● Just give the --enableCloudDebugger option!
    ○ It requires the Dataflow worker service account to have the Stackdriver Debugger permissions
    ● The application ID is not human readable … :(
    ○ You can find the ID in Stackdriver Logging: it is written by worker-setup in the “uniquifier” field


  25. Debug: Stackdriver Debugger
    ● It captures variables and a stack trace at specified points in running Dataflow jobs


  26. Debug: Dockerize
    ● Local execution & testing is great
    ● Pub/Sub alternative:
    ○ Pub/Sub emulator
    ● GCS alternative:
    ○ local filesystem
    ● BQ alternative:
    ○ nothing…


  27. Debug: Dockerize
    ● The containers require a lot of CPU/RAM
    ○ sbt, jobs on DirectRunner, the Pub/Sub emulator, etc.
    ○ Running it all on a Mac is possible, but a painful option
    ● docker-machine helps!
    ○ docker-machine has a google driver, which hosts the docker daemon on a GCE instance
    ○ Debug ports can be forwarded with docker-machine ssh


  28. Debug: Dockerize
    ● Start a remote instance on GCE
    ○ $ docker-machine create --driver google --google-machine-type <machine-type> docker-remote
    ● Point docker-compose at the instance
    ○ $ eval $(docker-machine env docker-remote)
    ● Bring up the docker-compose services
    ○ $ docker-compose -f docker-compose-basic.yml -f docker-compose-deadletter.yml up
    ○ The compose files are separated so we can focus on specific jobs
    ● Attach to a debug port via ssh forwarding
    ○ $ docker-machine ssh docker-remote -L 5005:localhost:5005
    ○ Attach from jdb or an IDE
    ○ Set breakpoints, dig, dig…
    ● Stop the instance
    ○ $ docker-machine stop docker-remote


  29. Debug: Demo
    ● Demo time!


  30. Debug: Heap dump
    ● The Dataflow service supports writing heap dumps to GCS on OOM
    ○ It needs --dumpHeapOnOOM=true and --saveHeapDumpsToGcsPath
    ○ We can analyze the dump with Eclipse Memory Analyzer or other tools


  31. How We Operate it in Production



  32. Our issues in production
    PipelineOptions management
    CI/CD
    Monitoring
    Alert, OnCall


  33. Our issues in production
    PipelineOptions management => yaml-based configuration
    CI/CD => CircleCI
    Monitoring => OpenCensus + Stackdriver Monitoring
    Alert, OnCall => PagerDuty + Stackdriver Monitoring


  34. PipelineOptions management
    ● There are various PipelineOptions
    ○ DataflowPipelineOptions actually extends 10+ sub-options
    ○ And users define even more…
    ● Managing the options is sometimes painful
    ○ Some options are structured…
    ○ We use different settings between dev / prod…
    ○ We want to record the reasons why we chose each value…


  35. PipelineOptions management: yaml settings
    ● PipelineOptions accepts JSON objects
    ○ but JSON is not user-friendly: it doesn’t support comments
    ● Our solution: yaml!
    ○ It supports complex structures, comments, etc. (a loader sketch follows the example)
    ● For example:
    # basic
    runner: org.apache.beam.runners.direct.DirectRunner
    region: us-central1
    streaming: true
    autoscalingAlgorithm: THROUGHPUT_BASED
    maxNumWorkers: 4
    tempLocation: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/tmp/rawdatahub2structureddatahub
    enableStreamingEngine: true
    experiments:
      - enable_stackdriver_agent_metrics

    # for debuggability
    enableCloudDebug: true
    dumpHeapOnOOM: true
    saveHeapDumpsToGcsPath: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/heapdumps/rawdatahub2structureddatahub
    profilingAgentConfiguration:
      APICurated: true

    # I/O
    input: projects/merpay-dataplatform-jp-test/subscriptions/raw_datahub_to_structured_datahub
    output: projects/merpay-dataplatform-jp-test/topics/structured_datahub
    deadLetter: projects/merpay-dataplatform-jp-test/topics/deadletter_hub

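    A minimal sketch of how such a file can be turned into PipelineOptions, assuming a flat yaml layout and snakeyaml on the classpath (nested values such as experiments need extra handling, e.g. JSON-encoding):

    import java.io.FileInputStream
    import scala.jdk.CollectionConverters._
    import org.yaml.snakeyaml.Yaml
    import org.apache.beam.sdk.options.{PipelineOptions, PipelineOptionsFactory}

    object YamlOptions {
      // Turn each top-level yaml entry into a "--key=value" CLI argument and let
      // Beam's PipelineOptionsFactory do its usual parsing and validation.
      def load(path: String): PipelineOptions = {
        val entries = new Yaml().load[java.util.Map[String, Any]](new FileInputStream(path))
        val args = entries.asScala.map { case (k, v) => s"--$k=$v" }.toArray
        PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
      }
    }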

  36. CI/CD
    ● CI: CircleCI
    ○ sbt test, sbt it
    ○ plus trial builds of the template jobs
    ● CD: CircleCI
    ○ Automatically deploys to the development env
    ○ Template build -> drain -> run pipeline
    ■ To keep deployments simple
    ■ Dataflow’s update would be fine if it could be kept this simple
    ○ We basically keep compatibility across jobs
    ■ To avoid requiring in-order deployments


  37. Monitoring
    ● GCP provides basic metrics
    ○ Dataflow service: system lag, watermark, …
    ○ Cloud Pub/Sub: unacked messages, …
    ○ Stackdriver Logging: log-based custom metrics, e.g. the number of OOMs
    ○ JVM: CPU util, GC time, …
    ■ Needs --experiments=enable_stackdriver_agent_metrics
    ● Application-level metrics (see the sketch below)
    ○ We implemented a metrics collector with OpenCensus
    ○ Processed entry count
    ○ Deadletter count
    ○ Transform duration

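    A minimal sketch of such a collector, using the OpenCensus Java API with the Stackdriver stats exporter; the measure name and helper are hypothetical:

    import java.util.Collections
    import io.opencensus.exporter.stats.stackdriver.StackdriverStatsExporter
    import io.opencensus.stats.{Aggregation, Measure, Stats, View}
    import io.opencensus.tags.TagKey

    object PipelineMetrics {
      // Count of entries processed by the pipeline.
      val ProcessedEntries: Measure.MeasureLong =
        Measure.MeasureLong.create("processed_entries", "Number of processed log entries", "1")

      def register(): Unit = {
        // Aggregate the measure as a running count and register the view.
        Stats.getViewManager.registerView(
          View.create(
            View.Name.create("processed_entries_count"),
            "Count of processed entries",
            ProcessedEntries,
            Aggregation.Count.create(),
            Collections.emptyList[TagKey]()))
        // Ship all registered views to Stackdriver Monitoring.
        StackdriverStatsExporter.createAndRegister()
      }

      // Call this from inside a DoFn or map function for each processed element.
      def markProcessed(n: Long = 1L): Unit =
        Stats.getStatsRecorder.newMeasureMap().put(ProcessedEntries, n).record()
    }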

  38. Monitoring
    ● Example: Our dashboard on Stackdriver Monitoring:


  39. Alerts, OnCall
    ● We create monitors on Stackdriver Monitoring 
    ○ configured in Terraform
    ○ The main metrics we alert on are system lag and watermark
    ● Alerts are routed to PagerDuty, and the on-call engineer takes the call


  40. Closing
    ● We 
    ○ provide a mobile payments service
    ○ use Apache Beam and Cloud Dataflow to run stream jobs
    ○ make full use of Stackdriver plus several techniques for debuggability
    ○ keep operations very simple
    ● Please share your related knowledge with us!


  41. Appendix

    (will talk if I have time)



  42. Schema management, evolution
    ● We accept both schema-on-read and schema-on-write strategies
    ○ PubsubMessage.payload is just Array[Byte], so unknown formats are also OK
    ○ We’ve defined our own input-side protocol (see the sketch below)
    ■ In any case, we store the incoming raw data
    ■ The sender can specify schema information in Pub/Sub attributes
    ■ If the Dataflow job knows the schema, it tries to parse the payload and convert it to Avro files and/or BQ records
    ■ We have a Protocol Buffers -> Avro conversion layer
    ■ Next, we are considering a schema registry

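    A minimal sketch of that input-side protocol; the attribute names, the StructuredRecord type, and the registry helpers are hypothetical stand-ins for the in-house pieces:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage

    object SchemaRouter {
      // Left: unknown format, keep the raw bytes. Right: parsed, structured record.
      def parseIfKnown(msg: PubsubMessage): Either[Array[Byte], StructuredRecord] = {
        val format = Option(msg.getAttribute("schema_format")) // e.g. "protobuf" or "avro"
        val name   = Option(msg.getAttribute("schema_name"))   // fully qualified schema name
        (format, name) match {
          case (Some("protobuf"), Some(n)) => ProtoRegistry.toStructured(n, msg.getPayload) // hypothetical
          case (Some("avro"), Some(n))     => AvroRegistry.toStructured(n, msg.getPayload)  // hypothetical
          case _                           => Left(msg.getPayload) // schema-on-read: store raw
        }
      }
    }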

  43. BigQuery and streaming insert
    ● BigQuery is an awesome DWH: many features, high performance!
    ○ Streaming insert is a fast way to realize a lambda architecture
    ○ But schema evolution is super painful…
    ■ We do crazy backward-compatibility checks on the BigQuery schema, call the patch API when the schema can be evolved, and finally insert (a sketch follows)
    ■ And if that fails? There is no easy way to prevent data loss!

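    A minimal sketch of the evolve-then-insert check, using the google-cloud-bigquery Java client and a deliberately simplified compatibility rule (additive field changes only; the real checks are more involved):

    import com.google.cloud.bigquery.{BigQueryOptions, Schema, StandardTableDefinition, TableId}
    import scala.jdk.CollectionConverters._

    object BqSchemaEvolver {
      private val bq = BigQueryOptions.getDefaultInstance.getService

      // Returns true when the table schema now accepts the incoming records.
      def evolveIfCompatible(tableId: TableId, incoming: Schema): Boolean = {
        val table = bq.getTable(tableId)
        val current = table.getDefinition[StandardTableDefinition].getSchema
        val currentNames = current.getFields.asScala.map(_.getName).toSet
        val incomingNames = incoming.getFields.asScala.map(_.getName).toSet

        if (currentNames.subsetOf(incomingNames)) {
          // Additive-only change: patch the table with the widened schema.
          bq.update(table.toBuilder.setDefinition(StandardTableDefinition.of(incoming)).build())
          true
        } else {
          false // dropped or renamed fields: not safely evolvable
        }
      }
    }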