Slide 1

Slide 1 text

Ryo Okubo
 Production-Ready Stream Data 
 Pipeline in Merpay


Slide 2

Slide 2 text

Objectives
 Background
 ● Deep knowledge of Apache Beam/Cloud Dataflow is not enough
 ● We want to find a better way
 So, in this presentation, we:
 ● Describe and share our use case and activities
 ● Hope to get better approaches or knowledge we don't have

Slide 3

Slide 3 text

Agenda
 01 What is Merpay?
 02 Overview of Our Stream Data Pipeline
 03 How We Make it Production Ready
 04 How We Operate it in Production

Slide 4

Slide 4 text

What is Merpay?

Slide 5

Slide 5 text

What is Merpay? Merpay is a mobile payments service operated by Merpay, Inc. The company belongs to the Mercari Group, widely recognized for its service ‘Mercari,’ the top C2C marketplace app in Japan. Money earned by selling unwanted items on the Mercari app and money charged to the app through users’ bank accounts can be used to make payments at stores and on the Mercari app itself.

Slide 6

Slide 6 text

● Compatible with both iOS and Android
 ● Support for NFC payments with iD at 900,000 merchants nationwide

Slide 7

Slide 7 text

Code Payments
 ● Coverage of merchants that do not accept iD
 ● Customers pay by scanning their barcode at the store


Slide 8

Slide 8 text

Overview of Our Stream Data Pipeline 

Slide 9

Slide 9 text

Our stream data pipeline: background
 ● Microservices on GKE
 ● Using many GCP services
 [Architecture diagram: microservices (API Gateway, Authority, Services A–C, X–Z, Web) on Google Kubernetes Engine behind a Google Cloud Load Balancer, spread across multiple GCP projects and using Cloud Spanner, Cloud Pub/Sub, and Cloud Storage]

Slide 10

Slide 10 text

Our stream data pipeline: overview
 ● Aggregates microservice logs
 ● Dataflow + Pub/Sub chain
 ○ 6 streaming jobs
 ○ 3 + many Pub/Sub topics
 ● GCS + BQ as data lake
 ● Started in Oct 2018
 ● Input: ~2,500 rps


Slide 11

Slide 11 text

Our stream data pipeline: technical stack
 ● Data sources
 ○ Cloud Pub/Sub topics/subscriptions per microservice
 ○ Created by an in-house Terraform module
 ● Dataflow jobs (see the sketch below)
 ○ Written in Scala with spotify/scio
 ● Data sinks
 ○ Apache Avro files stored on GCS
 ○ Streaming inserts and bq-load from Apache Airflow to write to BigQuery
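A rough illustration of what one of these jobs looks like, as a minimal spotify/scio sketch; RawToStructured, the LogRecord Avro class, and parseToAvro are hypothetical stand-ins, and API details vary by scio version:

 import com.spotify.scio.ContextAndArgs
 import com.spotify.scio.avro._
 import org.joda.time.Duration

 object RawToStructured {
   // Hypothetical payload -> Avro conversion; LogRecord stands in for a generated
   // Avro SpecificRecord class. The real jobs handle many log formats.
   def parseToAvro(payload: String): LogRecord = ???

   def main(cmdlineArgs: Array[String]): Unit = {
     val (sc, args) = ContextAndArgs(cmdlineArgs)

     sc.pubsubSubscription[String](args("input"))       // projects/.../subscriptions/...
       .map(parseToAvro)
       .withFixedWindows(Duration.standardMinutes(10))  // window so file writes can be finalized
       .saveAsAvroFile(args("output"), numShards = 8)   // gs://bucket/path
     sc.run()
   }
 }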

Slide 12

Slide 12 text

How to Make it Production Ready

Slide 13

Slide 13 text

Issues in development
 01 Test
 02 Benchmark
 03 Profiling
 04 Debug deeply

Slide 14

Slide 14 text

Issues in development
 01 Test => unit tests and JobTests in spotify/scio
 02 Benchmark => Pub/Sub load generator
 03 Profiling => Stackdriver Profiler (with magic spells)
 04 Debug deeply => Stackdriver Debugger / Dockerize + debugger

Slide 15

Slide 15 text

Test
 ● Just plain Java/Scala code
 ○ Easy! Just write unit tests
 ● Each PTransform
 ○ Same as plain code if using spotify/scio
 ● But the integration parts…?


Slide 16

Slide 16 text

Test: simplify code
 ● Write as much plain Java/Scala code as possible
 ○ Plain code is easy to unit test and fast to execute
 ○ spotify/scio is helpful: it wraps lambda functions into DoFns and converts them into transforms
 ● Separate the I/O parts
 ○ Transform chains can then be integration-tested as usual
 ○ No mocks or emulators required
 ● Including I/O … JobTest in spotify/scio!
 ○ A simple error case (see the sketch below):
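A minimal JobTest sketch for such an error case, assuming a hypothetical MyJob that routes unparsable payloads to a dead-letter topic; the exact PubsubIO test helpers vary by scio version:

 import com.spotify.scio.pubsub.PubsubIO
 import com.spotify.scio.testing.PipelineSpec

 // MyJob is hypothetical: it reads a Pub/Sub subscription and writes valid
 // records to --output and unparsable ones to --deadLetter.
 class MyJobTest extends PipelineSpec {
   "MyJob" should "route unparsable payloads to the dead letter topic" in {
     JobTest[MyJob.type]
       .args(
         "--input=projects/p/subscriptions/in",
         "--output=projects/p/topics/out",
         "--deadLetter=projects/p/topics/deadletter")
       // Feed fake Pub/Sub messages instead of touching real GCP resources.
       .input(PubsubIO.string("projects/p/subscriptions/in"), Seq("""{"id":1}""", "broken"))
       // Assert on what each sink would receive.
       .output(PubsubIO.string("projects/p/topics/out"))(_ should haveSize(1))
       .output(PubsubIO.string("projects/p/topics/deadletter"))(_ should haveSize(1))
       .run()
   }
 }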

Slide 17

Slide 17 text

Test: property-based testing
 ● Our pipeline handles many kinds of logs encoded in Apache Avro and Protocol Buffers
 ○ Preparing test data by hand is costly and boring
 ● scalacheck + spotify/ratatool is awesome for us!! (see the sketch below)
 ○ protobufOf[GeneratedProtoClass] and avroOf[GeneratedAvroClass] generate random data
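A small sketch of how these generators can be used; PaymentLog (a generated Avro class) and Converter are hypothetical, and exact generator names may differ slightly by ratatool version:

 import com.spotify.ratatool.scalacheck._
 import org.scalacheck.{Gen, Prop}

 // PaymentLog and Converter are hypothetical stand-ins for our generated
 // classes and conversion code.
 object PaymentLogProps {
   // Random, schema-valid Avro records for free; no hand-written fixtures.
   val logs: Gen[PaymentLog] = avroOf[PaymentLog]

   // Property: Avro -> internal model -> Avro is lossless for any valid record.
   val roundTrip: Prop = Prop.forAll(logs) { log =>
     Converter.toAvro(Converter.fromAvro(log)) == log
   }
 }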

Slide 18

Slide 18 text

Benchmark
 ● Concerned about the ability to accept the load we expect
 ○ Recent max objective: 15,000+ rps
 ○ There are some shuffle tasks and writes to GCS
 ○ Cloud Pub/Sub quotas
 ● Implemented load-testing Dataflow template jobs that publish to a target Pub/Sub topic
 ○ They get the benefits of Cloud Dataflow: Streaming Engine, multiple workers, …
 ○ Template jobs are reusable and easy to use
 ○ We can also use random data generated by scalacheck + ratatool

Slide 19

Slide 19 text

Benchmark
 ● Input: an unbounded source from GenerateSequence
 ● Transform: just generate random data and pack it as a PubsubMessage
 ● Output: PubsubIO (see the sketch below)
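A minimal sketch of such a load-generator job; randomPayload is a hypothetical wrapper around the scalacheck + ratatool generators, and scio API details vary by version:

 import com.spotify.scio.ContextAndArgs
 import org.apache.beam.sdk.io.GenerateSequence
 import org.joda.time.Duration

 object LoadGenerator {
   // Hypothetical: sample a random, schema-valid log and serialize it.
   def randomPayload(): String = ???

   def main(cmdlineArgs: Array[String]): Unit = {
     val (sc, args) = ContextAndArgs(cmdlineArgs)
     val rps = args.int("rps", 15000)   // target publish rate

     // Unbounded input: an endless sequence emitted at the desired rate.
     sc.customInput(
         "ticks",
         GenerateSequence.from(0).withRate(rps, Duration.standardSeconds(1)))
       // Replace each tick with a random payload and publish it to the target topic.
       .map(_ => randomPayload())
       .saveAsPubsub(args("output"))    // projects/.../topics/...
     sc.run()
   }
 }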


Slide 20

Slide 20 text

Benchmark
 
 ● An execution example on Google Cloud Console WebUI

Slide 21

Slide 21 text

Profiling
 ● We need to dig in deeply when load testing reveals critical bottlenecks
 ○ But it isn't easy because the jobs are actually hosted on the Dataflow Service
 ● There is magic to enable profiling of the jobs with Stackdriver
 ○ --profilingAgentConfiguration='{ "APICurated": true }', a pipeline option you pass from the CLI
 ○ https://medium.com/google-cloud/profiling-dataflow-pipelines-ddbbef07761d
 ○ [trivial issue] The Dataflow service accepts job names with uppercase characters, but the profiling agent doesn't

Slide 22

Slide 22 text

Profiling
 ● A Stackdriver Profiler result example: 
 


Slide 23

Slide 23 text

Debug
 ● Stackdriver Debugger
 ○ If it's possible to deploy to the Dataflow Service
 ○ If any problem occurs in production
 ● Dockerize + JVM debugger
 ○ Feel free to use it
 ○ Set breakpoints in the jobs and do step execution
 ● e.g. dump the heap and analyze it
 ○ If OOM occurs

Slide 24

Slide 24 text

Debug: Stackdriver Debugger
 ● Just give the --enableCloudDebugger option!
 ○ It requires the Dataflow execution service account to have the Stackdriver Debugger related permissions
 ● The application ID is not human readable … :(
 ○ You can find the ID on Stackdriver Logging: it's written by worker-setup in the “uniquifier” field

Slide 25

Slide 25 text

Debug: Stackdriver Debugger ● It captures variables and a stacktrace at specified points for running Dataflow jobs 


Slide 26

Slide 26 text

Debug: Dockerize
 ● Local execution & testing is great for debugging! (see the sketch below)
 ● Pub/Sub alternative:
 ○ Pub/Sub emulator
 ● GCS alternative:
 ○ Local filesystem
 ● BQ alternative:
 ○ Nothing…
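A minimal sketch of pointing a locally run job at the Pub/Sub emulator, assuming DirectRunner and an emulator reachable at pubsub-emulator:8085 inside the compose network (the hostname is an assumption):

 import com.spotify.scio.ScioContext
 import org.apache.beam.runners.direct.DirectRunner
 import org.apache.beam.sdk.io.gcp.pubsub.PubsubOptions
 import org.apache.beam.sdk.options.PipelineOptionsFactory

 object LocalDebug {
   def context(): ScioContext = {
     val opts = PipelineOptionsFactory.create().as(classOf[PubsubOptions])
     opts.setRunner(classOf[DirectRunner])
     // Point Beam's Pub/Sub IO at the emulator instead of the real service.
     opts.setPubsubRootUrl("http://pubsub-emulator:8085")
     ScioContext(opts)
   }
 }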

Slide 27

Slide 27 text

Debug: Dockerize
 ● The containers require a lot of CPU/RAM
 ○ sbt, jobs on DirectRunner, the Pub/Sub emulator, etc.
 ○ Possible on a Mac, but running everything on one laptop is a harsh option
 ● docker-machine helps!
 ○ docker-machine has a google driver which hosts the Docker daemon on GCE
 ○ Debug ports can be forwarded with docker-machine ssh

Slide 28

Slide 28 text

Debug: Dockerize
 ● Start a remote instance on GCE
 ○ $ docker-machine create --driver google --google-machine-type <machine-type> docker-remote
 ● Use the instance from docker-compose
 ○ $ eval $(docker-machine env docker-remote)
 ● Bring up the docker-compose services
 ○ $ docker-compose -f docker-compose-basic.yml -f docker-compose-deadletter.yml up
 ○ The files are separated so we can focus on specific jobs
 ● Attach to a debug port via SSH forwarding
 ○ $ docker-machine ssh docker-remote -L 5005:localhost:5005
 ○ Attach from jdb or an IDE
 ○ Set breakpoints, dig, dig…
 ● Stop the instance
 ○ $ docker-machine stop docker-remote

Slide 29

Slide 29 text

Debug: Demo ● Demo time!

Slide 30

Slide 30 text

Debug: Heap dump
 ● The Dataflow service supports writing heap dumps to GCS on OOM
 ○ It needs --dumpHeapOnOOM=true and --saveHeapDumpsToGcsPath=gs://bucket/path/to
 ○ We can analyze the dump with Eclipse Memory Analyzer or other tools

Slide 31

Slide 31 text

How We Operate it in Production

Slide 32

Slide 32 text

Our issues in production
 01 PipelineOptions management
 02 CI/CD
 03 Monitoring
 04 Alert, On Call

Slide 33

Slide 33 text

Our issues in production
 01 PipelineOptions management => YAML-based configuration
 02 CI/CD => CircleCI
 03 Monitoring => OpenCensus + Stackdriver Monitoring
 04 Alert, OnCall => PagerDuty + Stackdriver Monitoring

Slide 34

Slide 34 text

PipelineOptions management
 ● There are various PipelineOptions
 ○ DataflowPipelineOptions actually extends 10+ sub-options
 ○ And more are defined by users…
 ● Managing the options is sometimes painful
 ○ Structured options…
 ○ Different settings between dev / prod…
 ○ We want to describe the reasons why we chose each value…

Slide 35

Slide 35 text

PipelineOptions management: yaml settings
 ● PipelineOptions accepts JSON objects
 ○ But JSON is not user-friendly: it doesn't support comments
 ● Our solution: using yaml! (see the loading sketch after the example)
 ○ Supports complex structures, comments, etc.
 ● For example:
 # basic
 runner: org.apache.beam.runners.direct.DirectRunner
 region: us-central1
 streaming: true
 autoscalingAlgorithm: THROUGHPUT_BASED
 maxNumWorkers: 4
 tempLocation: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/tmp/rawdatahub2structureddatahub
 enableStreamingEngine: true
 experiments:
 - enable_stackdriver_agent_metrics
 # for debuggability
 enableCloudDebug: true
 dumpHeapOnOOM: true
 saveHeapDumpsToGcsPath: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/heapdumps/rawdatahub2structureddatahub
 profilingAgentConfiguration:
   APICurated: true
 # I/O
 input: projects/merpay-dataplatform-jp-test/subscriptions/raw_datahub_to_structured_datahub
 output: projects/merpay-dataplatform-jp-test/topics/structured_datahub
 deadLetter: projects/merpay-dataplatform-jp-test/topics/deadletter_hub
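A minimal sketch of how such a yaml file can be turned into PipelineOptions; a real loader also needs to handle nested values such as profilingAgentConfiguration and per-environment overrides:

 import java.nio.file.{Files, Paths}
 import scala.jdk.CollectionConverters._
 import org.yaml.snakeyaml.Yaml
 import org.apache.beam.sdk.options.{PipelineOptions, PipelineOptionsFactory}

 object YamlOptions {
   def load(path: String): PipelineOptions = {
     // Parse the yaml file into a flat key -> value map.
     val values = new Yaml()
       .load[java.util.Map[String, AnyRef]](Files.newInputStream(Paths.get(path)))
       .asScala

     // Re-encode each entry as a --key=value flag understood by Beam.
     val args = values.map {
       case (k, v: java.util.List[_]) => s"--$k=${v.asScala.mkString(",")}"
       case (k, v)                    => s"--$k=$v"
     }.toArray

     PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
   }
 }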


Slide 36

Slide 36 text

CI/CD
 ● CI: CircleCI
 ○ sbt test, sbt it
 ○ Try template job builds
 ● CD: CircleCI
 ○ Automatically deploy to the development env
 ○ Template build -> drain -> run pipeline
 ■ To simplify deployments
 ■ Update would be nice if we could keep it simple
 ○ Basically keep compatibility between jobs
 ■ To avoid requiring in-order deployments

Slide 37

Slide 37 text

Monitoring
 ● GCP provides basic metrics
 ○ Dataflow Service: system lag, watermark, …
 ○ Cloud Pub/Sub: unacked messages, …
 ○ Stackdriver Logging: log-based custom metrics, e.g. number of OOM exceptions
 ○ JVM: CPU util, GC time, …
 ■ Needs --experiments=enable_stackdriver_agent_metrics
 ● Application-level metrics (see the sketch below)
 ○ We implement a metrics collector with OpenCensus
 ○ Processed entries count
 ○ Dead-letter count
 ○ Transform duration time
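A minimal sketch of recording one such application-level metric with OpenCensus; the metric and object names are illustrative, and the Stackdriver stats exporter is assumed to be on the classpath:

 import io.opencensus.exporter.stats.stackdriver.StackdriverStatsExporter
 import io.opencensus.stats.{Aggregation, Stats, View}
 import io.opencensus.stats.Measure.MeasureLong
 import io.opencensus.tags.TagKey

 // Names below (PipelineMetrics, processed_entries) are illustrative.
 object PipelineMetrics {
   private val recorder = Stats.getStatsRecorder

   // Count of processed entries, exported to Stackdriver Monitoring.
   val ProcessedEntries: MeasureLong =
     MeasureLong.create("processed_entries", "Number of processed log entries", "1")

   def init(): Unit = {
     Stats.getViewManager.registerView(
       View.create(
         View.Name.create("pipeline/processed_entries"),
         "Processed log entries",
         ProcessedEntries,
         Aggregation.Count.create(),
         java.util.Collections.emptyList[TagKey]()))
     StackdriverStatsExporter.createAndRegister()
   }

   // Called from the transforms, e.g. once per bundle.
   def countProcessed(n: Long): Unit =
     recorder.newMeasureMap().put(ProcessedEntries, n).record()
 }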

Slide 38

Slide 38 text

Monitoring ● Example: Our dashboard on Stackdriver Monitoring:

Slide 39

Slide 39 text

Alerts, OnCall
 ● Create monitors on Stackdriver Monitoring
 ○ Configured in Terraform
 ○ The main target metrics are system lag and watermark
 ● Alerts trigger PagerDuty and we catch the call
 


Slide 40

Slide 40 text

Closing
 ● We are
 ○ Providing a mobile payments service
 ○ Using Apache Beam and Cloud Dataflow to run streaming jobs
 ○ Making full use of Stackdriver and other techniques for debuggability
 ○ Keeping operations very simple
 ● Please share your related knowledge with us!


Slide 41

Slide 41 text

Appendix
 (will talk if I have time)

Slide 42

Slide 42 text

Schema management, evolution
 ● We accept both schema-on-read and schema-on-write strategies
 ○ PubsubMessage.payload is just Array[Byte]; unknown formats are also OK
 ○ We've defined our own input-side protocol (see the sketch below)
 ■ In any case we store the incoming raw data
 ■ Senders can specify schema information in Pub/Sub attributes
 ■ If the Dataflow job knows the schema, it tries to parse the payload and convert it to Avro files and/or BQ records
 ■ We have a Protocol Buffer -> Avro conversion layer
 ■ Next, we are thinking about having a schema registry
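A minimal sketch of the attribute-based dispatch described above; the attribute keys ("schemaType", "schemaName") and the converter functions are hypothetical, not our actual protocol:

 import org.apache.avro.generic.GenericRecord
 import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage

 object SchemaRouter {
   // Hypothetical converters from known schemas to Avro records.
   def protoToAvro(schemaName: String, payload: Array[Byte]): GenericRecord = ???
   def avroDecode(schemaName: String, payload: Array[Byte]): GenericRecord = ???

   /** Left: unknown format, keep the raw bytes (they still land in GCS).
     * Right: known schema, parsed into an Avro record for GCS/BQ. */
   def route(msg: PubsubMessage): Either[Array[Byte], GenericRecord] = {
     val attrs = Option(msg.getAttributeMap)
       .getOrElse(java.util.Collections.emptyMap[String, String]())
     (Option(attrs.get("schemaType")), Option(attrs.get("schemaName"))) match {
       case (Some("protobuf"), Some(name)) => Right(protoToAvro(name, msg.getPayload))
       case (Some("avro"), Some(name))     => Right(avroDecode(name, msg.getPayload))
       case _                              => Left(msg.getPayload)
     }
   }
 }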

Slide 43

Slide 43 text

BigQuery and streaming insert
 ● BigQuery is an awesome DWH: many features, high performance!
 ○ Streaming insert is a fast way to support a lambda architecture
 ○ But schema evolution is super painful…
 ■ We do crazy backward-compatibility checks against the BigQuery schema, call the patch API to evolve it if possible, and finally insert (see the sketch below)
 ■ If that fails? There is no easy way to prevent data loss!
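A simplified sketch of that check-and-patch step with the BigQuery Java client; the real compatibility rules (type widening, nested fields, and so on) are much messier:

 import com.google.cloud.bigquery.{BigQueryOptions, Field, Schema, StandardTableDefinition, TableId}
 import scala.jdk.CollectionConverters._

 object BqSchemaEvolver {
   private val bigquery = BigQueryOptions.getDefaultInstance.getService

   // Backward compatible = keeps every existing field and only adds NULLABLE ones.
   def isBackwardCompatible(current: Schema, next: Schema): Boolean = {
     val currentNames = current.getFields.asScala.map(_.getName).toSet
     val nextFields   = next.getFields.asScala.map(f => f.getName -> f).toMap
     currentNames.subsetOf(nextFields.keySet) &&
       (nextFields.keySet -- currentNames)
         .forall(name => nextFields(name).getMode == Field.Mode.NULLABLE)
   }

   // Patch the table schema if the evolution is safe; otherwise leave it alone.
   def patchIfPossible(tableId: TableId, next: Schema): Boolean = {
     val table   = bigquery.getTable(tableId)
     val current = table.getDefinition[StandardTableDefinition]().getSchema
     isBackwardCompatible(current, next) && {
       table.toBuilder.setDefinition(StandardTableDefinition.of(next)).build().update()
       true
     }
   }
 }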