
Production-ready stream data pipeline in Merpay, Inc

Ryo Okubo
September 12, 2019


Slides for https://www.apachecon.com/acna19/s/#/scheduledEvent/1329

--

We started providing our stream-based data pipeline, built on Google Cloud Dataflow and Apache Beam, in fall 2018. It collects event logs from microservices running on GKE, then transforms and forwards the logs to GCS and BigQuery for analytics and other uses. As you know, implementing and operating streaming jobs is challenging, and we encountered various issues along the way.
I'd like to share our knowledge from both the development and the operations perspective. There are three topics in the development part: 1) implementing stream jobs with spotify/scio, 2) how to debug the jobs, especially ones with DynamicDestinations, 3) how to load test, to make sure our jobs are stable. And the topics in the next part: 1) how to deploy new jobs safely (avoiding data loss), 2) how to monitor the jobs and surrounding systems, plus miscellanea.


Transcript

1. Objectives
   Background:
   • Our knowledge of Apache Beam/Cloud Dataflow is not deep enough
   • We want to find a better way
   So, in this presentation, we:
   • Describe and share our use case and activities
   • Hope to get better approaches or knowledge we don't have
2. Agenda
   01 What is Merpay?
   02 Overview of Our Stream Data Pipeline
   03 How We Make it Production Ready
   04 How We Operate it in Production
3. What is Merpay?
   Merpay is a mobile payments service operated by Merpay, Inc. The company belongs to the Mercari Group, widely recognized for its service ‘Mercari’, the top C2C marketplace app in Japan. Money earned by selling unwanted items on the Mercari app and money charged to the app through users’ bank accounts can be used to make payments at stores and on the Mercari app itself.
4. Compatible with both iOS and Android
   Support for NFC payments with iD at 900,000 merchants nationwide
5. Code Payments
   Coverage for merchants that do not accept iD. Customers pay by scanning their barcode at the store.

6. Our stream data pipeline: background
   • Microservices on GKE
   • Using many GCP services
   [Architecture diagram: an API gateway and microservices (Service A, B, C, X, Y, Z) behind a Google Cloud Load Balancer on Google Kubernetes Engine, spanning several GCP projects and using Cloud Spanner, Cloud Pub/Sub, and Cloud Storage]
7. Our stream data pipeline: overview
   • Aggregates microservice logs
   • Dataflow + Pub/Sub chain
     ◦ 6 streaming jobs
     ◦ 3 + many Pub/Sub topics
   • GCS + BQ as the data lake
   • Started in Oct 2018
   • Input: ~2,500 rps

8. Our stream data pipeline: technical stack
   • Data sources
     ◦ Cloud Pub/Sub topics/subscriptions per microservice
     ◦ Created by an in-house Terraform module
   • Dataflow jobs
     ◦ Written in Scala with spotify/scio
   • Data sinks
     ◦ Store Apache Avro files on GCS
     ◦ Use streaming inserts and bq-load from Apache Airflow to write to BigQuery
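
   To illustrate the stack, here is a minimal scio job sketch. The job name RawLogJob and the input/output argument names are hypothetical, and scio's Pub/Sub API has changed across versions:

    import com.spotify.scio._

    // Minimal sketch of one pipeline stage: read from a Pub/Sub
    // subscription, transform, and publish to the next topic.
    object RawLogJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubSubscription[String](args("input"))
          .map(identity) // parsing/transform logic goes here
          .saveAsPubsub(args("output"))
        sc.run()
      }
    }
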
9. Issues in development
   01 Test => unit tests and JobTests in spotify/scio
   02 Benchmark => Pub/Sub load generator
   03 Profiling => Stackdriver Profiler (with magic spells)
   04 Debug deeply => Stackdriver Debugger / dockerize + debugger
10. Test
    • Just plain Java/Scala code
      ◦ Easy! Just write unit tests
    • Each PTransform
      ◦ Same as plain code if you use spotify/scio
    • But the integration parts...?

11. Test: simplify code
    • Keep the code as plain Java/Scala as possible
      ◦ Makes it possible to write unit tests with fast execution
      ◦ spotify/scio helps: it wraps lambda functions into DoFns and converts them to transforms
    • Separate the I/O parts
      ◦ Makes it possible to write integration tests over the usual transform chains
      ◦ No mocks or emulators required
    • Including I/O... use JobTest in spotify/scio!
      ◦ A simple error case (sketched below):
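
    A minimal sketch of such a JobTest, reusing the hypothetical RawLogJob above with made-up subscription/topic names; the exact PubsubIO test-ID spelling varies by scio version:

     import com.spotify.scio.io.PubsubIO
     import com.spotify.scio.testing._

     class RawLogJobTest extends PipelineSpec {
       "RawLogJob" should "forward good records and dead-letter bad ones" in {
         JobTest[RawLogJob.type]
           .args("--input=in-sub", "--output=out-topic", "--deadLetter=dl-topic")
           // one valid record, one that should fail parsing
           .input(PubsubIO[String]("in-sub"), Seq("""{"event":"ok"}""", "not-json"))
           .output(PubsubIO[String]("out-topic")) { out =>
             out should haveSize(1)
           }
           .output(PubsubIO[String]("dl-topic")) { dl =>
             dl should haveSize(1)
           }
           .run()
       }
     }
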
12. Test: property-based testing
    • Our pipeline handles many kinds of logs encoded in Apache Avro and Protocol Buffers
      ◦ Preparing test data by hand is costly and boring
    • scalacheck + spotify/ratatool is awesome for us!!
      ◦ protobufOf[GeneratedProtoClass] and avroOf[GeneratedAvroClass] generate random data
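
    For instance, a property-based test might look like this; ExampleAvroRecord and Transforms.toRow are hypothetical stand-ins for a generated Avro class and our own transform code:

     import com.spotify.ratatool.scalacheck._
     import org.scalacheck.{Gen, Prop}

     // avroOf derives a Gen for a generated Avro class, so every run
     // exercises the transform with fresh random records.
     val recordGen: Gen[ExampleAvroRecord] = avroOf[ExampleAvroRecord]

     val prop = Prop.forAll(recordGen) { record =>
       Transforms.toRow(record).nonEmpty // the property under test
     }
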
13. Benchmark
    • Concerned about the ability to accept the load we expect
      ◦ Recent max objective: 15,000+ rps
      ◦ There are some shuffle tasks and writes to GCS
      ◦ Cloud Pub/Sub quotas
    • Implemented load-testing Dataflow template jobs that target Pub/Sub
      ◦ They get the benefits of Cloud Dataflow: Streaming Engine, multiple workers, ...
      ◦ Template jobs are reusable and easy to use
      ◦ We can also use the random data generated by scalacheck + ratatool
14. Benchmark
    • Input: unbounded source from GenerateSequence
    • Transform: just generate random data and pack it into PubsubMessages
    • Output: PubsubIO
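
    Put together, the load generator could look roughly like this; randomPayload stands in for the scalacheck + ratatool generators, and the rps argument name is hypothetical:

     import com.spotify.scio._
     import org.apache.beam.sdk.io.GenerateSequence
     import org.joda.time.Duration

     object LoadGenerator {
       def main(cmdlineArgs: Array[String]): Unit = {
         val (sc, args) = ContextAndArgs(cmdlineArgs)
         val rps = args.int("rps", 1000)

         // GenerateSequence emits an unbounded stream of Longs at a fixed
         // rate, which we turn into payloads and publish to the target topic.
         sc.customInput(
             "ticks",
             GenerateSequence.from(0).withRate(rps, Duration.standardSeconds(1))
           )
           .map(randomPayload)
           .saveAsPubsub(args("output"))
         sc.run()
       }

       // hypothetical: sample a pre-generated random record instead
       private def randomPayload(i: Long): String = s"""{"seq":$i}"""
     }
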

15. Profiling
    • We need to dive deep when load testing reveals critical bottlenecks
      ◦ But that isn't easy, because the jobs are hosted on the Dataflow service
    • There is a magic spell to enable profiling of the jobs with Stackdriver
      ◦ --profilingAgentConfiguration='{ \"APICurated\": true }', a pipeline option you pass from the CLI
      ◦ https://medium.com/google-cloud/profiling-dataflow-pipelines-ddbbef07761d
      ◦ [trivial issue] the Dataflow service accepts job names with upper-case letters, but the profiling agent doesn't
16. Debug
    • Stackdriver Debugger
      ◦ If the job can be deployed to the Dataflow service
      ◦ If a problem occurs in production
    • Dockerize + JVM debugger
      ◦ Feel free to use it
      ◦ Set breakpoints in the jobs and step through the execution
    • e.g. dump the heap and analyze it
      ◦ If OOM occurs
17. Debug: Stackdriver Debugger
    • Just pass the --enableCloudDebugger option!
      ◦ It requires the Dataflow execution service account to have the Stackdriver Debugger related permissions
    • The application ID is not human-readable ... :(
      ◦ You can find the ID in Stackdriver Logging: it is written by worker-setup in the "uniquifier" field
18. Debug: Stackdriver Debugger
    • It captures variables and a stacktrace at specified points in running Dataflow jobs

19. Debug: Dockerize
    • Local execution & testing is great for debugging!
    • Pub/Sub alternative:
      ◦ Pub/Sub emulator
    • GCS alternative:
      ◦ local filesystem
    • BQ alternative:
      ◦ nothing...
20. Debug: Dockerize
    • The containers require a lot of CPU/RAM
      ◦ sbt, jobs on DirectRunner, the Pub/Sub emulator, etc.
      ◦ Possible on a Mac, but running everything on one machine is a severe option
    • docker-machine helps!
      ◦ docker-machine has a google driver, which lets you host the Docker daemon on GCE
      ◦ Debug ports can be forwarded with docker-machine ssh
21. Debug: Dockerize
    • Start a remote instance on GCE
      ◦ $ docker-machine create --driver google --google-machine-type <strong type> docker-remote
    • Point docker-compose at the instance
      ◦ $ eval $(docker-machine env docker-remote)
    • Bring up the docker-compose services
      ◦ $ docker-compose -f docker-compose-basic.yml -f docker-compose-deadletter.yml up
      ◦ The compose files are separated so we can focus on specific jobs
    • Attach to a debug port via SSH forwarding
      ◦ $ docker-machine ssh docker-remote -L 5005:localhost:5005
      ◦ Attach from jdb or an IDE
      ◦ Set breakpoints, dig, dig...
    • Stop the instance
      ◦ $ docker-machine stop docker-remote
22. Debug: Heap dump
    • The Dataflow service supports writing heap dumps to GCS on OOM
      ◦ It needs --dumpHeapOnOOM=true and --saveHeapDumpsToGcsPath=gs://bucket/path/to
      ◦ We can analyze a dump with Eclipse Memory Analyzer or other tools
23. Our issues in production
    01 PipelineOptions management => YAML-based configuration
    02 CI/CD => CircleCI
    03 Monitoring => OpenCensus + Stackdriver Monitoring
    04 Alerts, on-call => PagerDuty + Stackdriver Monitoring
24. PipelineOptions management
    • There are various PipelineOptions
      ◦ DataflowPipelineOptions actually extends 10+ sub-options
      ◦ And users define even more...
    • Managing the options is sometimes painful
      ◦ Structured options...
      ◦ Using different settings between dev and prod...
      ◦ We want to record the reasons why we chose each value...
25. PipelineOptions management: YAML settings
    • PipelineOptions accepts JSON objects
      ◦ But JSON is not user-friendly: it doesn't support comments
    • Our solution: use YAML!
      ◦ Supports complex structures, comments, etc.
    • For example:
    # basic
 runner: org.apache.beam.runners.direct.DirectRunner
 region: us-central1
 streaming: true
 autoscalingAlgorithm: THROUGHPUT_BASED
 maxNumWorkers: 4
 tempLocation: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/tmp/rawdatahub2structureddatahub
 enableStreamingEngine: true
 experiments:
 - enable_stackdriver_agent_metrics
 # for debuggability
 enableCloudDebug: true
 dumpHeapOnOOM: true
 saveHeapDumpsToGcsPath: gs://merpay_dataplatform_jp_dev_clouddataflow_testing/heapdumps/rawdatahub2structureddatahub
 profilingAgentConfiguration:
 APICurated: true
 # I/O
 input: projects/merpay-dataplatform-jp-test/subscriptions/raw_datahub_to_structured_datahub
 output: projects/merpay-dataplatform-jp-test/topics/structured_datahub
 deadLetter: projects/merpay-dataplatform-jp-test/topics/deadletter_hub
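
    One way to wire such a file into a job is to flatten the YAML into --key=value arguments for PipelineOptionsFactory. A minimal sketch, assuming SnakeYAML; nested values such as profilingAgentConfiguration would need proper JSON serialization, which is glossed over here:

     import java.io.FileInputStream
     import scala.jdk.CollectionConverters._
     import org.apache.beam.sdk.options.{PipelineOptions, PipelineOptionsFactory}
     import org.yaml.snakeyaml.Yaml

     object YamlOptions {
       // Flatten the YAML map into --key=value args, which
       // PipelineOptionsFactory already knows how to parse.
       def load(path: String): PipelineOptions = {
         val yaml = new Yaml()
           .load[java.util.Map[String, AnyRef]](new FileInputStream(path))
         val args = yaml.asScala.map {
           case (k, v: java.util.List[_]) => s"--$k=${v.asScala.mkString(",")}"
           case (k, v)                    => s"--$k=$v"
         }.toArray
         PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
       }
     }
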

26. CI/CD
    • CI: CircleCI
      ◦ sbt test, sbt it:test
      ◦ Trying out template job builds
    • CD: CircleCI
      ◦ Automatically deploys to the development environment
      ◦ Template build -> drain -> run pipeline
        ▪ To keep deployments simple
        ▪ Update would be good, if it could stay that simple
      ◦ Basically keep each job backward compatible
        ▪ To avoid requiring in-order deployments
27. Monitoring
    • GCP supports basic metrics
      ◦ Dataflow service: system lag, watermark, ...
      ◦ Cloud Pub/Sub: unacked messages, ...
      ◦ Stackdriver Logging: log-based custom metrics, e.g. the number of OOM exceptions
      ◦ JVM: CPU utilization, GC time, ...
        ▪ Needs --experiments=enable_stackdriver_agent_metrics
    • Application-level metrics
      ◦ We implemented a metrics collector with OpenCensus (sketched below)
      ◦ Processed entry count
      ◦ Dead letter count
      ◦ Transform duration
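
    A minimal sketch of such a collector with the OpenCensus Java API and the Stackdriver stats exporter; the measure and view names are hypothetical:

     import io.opencensus.exporter.stats.stackdriver.StackdriverStatsExporter
     import io.opencensus.stats.{Aggregation, Measure, Stats, View}
     import io.opencensus.stats.View.Name
     import io.opencensus.tags.TagKey

     object PipelineMetrics {
       val ProcessedEntries: Measure.MeasureLong =
         Measure.MeasureLong.create("processed_entries", "Processed log entries", "1")

       // Register a count view and the Stackdriver exporter once per worker.
       def init(): Unit = {
         Stats.getViewManager.registerView(
           View.create(
             Name.create("processed_entries_count"),
             "Number of processed log entries",
             ProcessedEntries,
             Aggregation.Count.create(),
             java.util.Collections.emptyList[TagKey]()
           )
         )
         StackdriverStatsExporter.createAndRegister()
       }

       // Called from DoFns after each element is handled.
       def recordProcessed(n: Long): Unit =
         Stats.getStatsRecorder.newMeasureMap().put(ProcessedEntries, n).record()
     }
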
28. Alerts, on-call
    • Create monitors in Stackdriver Monitoring
      ◦ Configured in Terraform
      ◦ The main target metrics are system lag and watermark
    • Alerts go to PagerDuty, and we catch the call
 

29. Closing
    • We are
      ◦ Providing a mobile payments service
      ◦ Using Apache Beam and Cloud Dataflow to run stream jobs
      ◦ Making full use of Stackdriver, plus some tricks for debuggability
      ◦ Keeping operations very simple
    • Please share your related knowledge with us!

30. Schema management, evolution
    • We accept both schema-on-read and schema-on-write strategies
      ◦ PubsubMessage.payload is just Array[Byte], so unknown formats are also OK
      ◦ We've defined our own input-side protocol (see the sketch below)
        ▪ In any case, we store the incoming raw data
        ▪ Senders can specify schema information in Pub/Sub attributes
        ▪ If the Dataflow job knows the schema, it tries to parse the payload and convert it to Avro files and/or BQ records
        ▪ We have a Protocol Buffers -> Avro conversion layer
        ▪ Next, we are thinking about introducing a schema registry
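
    In spirit, the input-side protocol could be handled like this; the attribute names, SchemaRegistry, and ParsedRecord are hypothetical placeholders, not our actual protocol:

     import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage

     // Returns the parsed record when the schema is known,
     // otherwise the raw bytes, so nothing is ever dropped.
     def route(msg: PubsubMessage): Either[Array[Byte], ParsedRecord] = {
       val format = Option(msg.getAttribute("format")) // e.g. "protobuf", "avro"
       val schema = Option(msg.getAttribute("schema")) // e.g. a fully-qualified name
       (format, schema) match {
         case (Some(f), Some(s)) if SchemaRegistry.knows(f, s) =>
           Right(SchemaRegistry.parse(f, s, msg.getPayload)) // -> Avro / BQ records
         case _ =>
           Left(msg.getPayload) // unknown format: store the raw data as-is
       }
     }
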
31. BigQuery and streaming insert
    • BigQuery is an awesome DWH: many features, high performance!
      ◦ Streaming insert is a fast way to realize a lambda architecture
      ◦ But schema evolution is super painful...
        ▪ We do crazy backward-compatibility checks against the BigQuery schema, call the patch API when evolution is possible, and finally insert
        ▪ And if that fails? There is no easy way to prevent data loss!
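
    The patch-then-insert step could look roughly like this with the BigQuery Java client; mergeSchemas is a hypothetical helper for the backward-compatibility check:

     import com.google.cloud.bigquery.{BigQueryOptions, Schema, StandardTableDefinition, TableId}

     // Widen the table schema before inserting records that carry new fields.
     // BigQuery only accepts additive, nullable changes via the patch/update API.
     def patchSchema(tableId: TableId, incoming: Schema): Unit = {
       val bq = BigQueryOptions.getDefaultInstance.getService
       val table = bq.getTable(tableId)
       val current = table.getDefinition[StandardTableDefinition].getSchema
       val merged = mergeSchemas(current, incoming) // hypothetical compatibility check
       table.toBuilder
         .setDefinition(StandardTableDefinition.newBuilder().setSchema(merged).build())
         .build()
         .update()
     }
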