Reconcile as a Service

1 RaaS Reconcile as a Service A configuration based data
processing platform August, 2021

2 Background  1. We have a lot of separated recon
projects     2. According to the kafka improvement plan, we may need more recon jobs to compare data between each other    Problems    1. How can we create recon jobs with as low as possible development cost?     2. How can we manage all recon jobs in one place?    

3 Overview  Importer Importer

4 Components  Copier:  Download files from external source to S3
  Support Protocol: SFTP, FTP    Importer:  Read data, filter and create data snapshot with SparkSQL   Support Protocols: S3 file, Data-as-a-service platform Hudi    Processor:  Process data from multiple importers with SparkSQL   Custom process logic(Spark, Scala) can be defined     Comparator:  Compare input data with SparkSQL   - save compared result in S3   - save metadata of the result in DB   -   Uploader:  Upload the results from Importer, Processor   Support Protocols: S3, SFTP 

5 How to create recon jobs   order table order_check
data Importer Importer comparator Recon Result We have a order table and a order_check table   And we want to compare and find discrepancies from them  

6 How to create recon jobs   A Yaml file! 

How it works (What is Airflow)   • Platform to
programmatically author, schedule and monitor workflows • Periodic recurring runs can be set using cron expression • Workflows are called DAG ( Directed Acyclic Graphs ) • Nodes = individual processing unit. Can be run parallely or sequentially • To simplify writing Airflow code, we convert YAML written by you to python. Compare

8 How it works (airflow)  

9 How it works (pipeline)   S3 Importer Processor Comparator
Importer DaaS (S3) Other T1 T2 T3 Recon Result Uploader Other Comparator: only one   1. Load data with spark as table   2. Using spark sql to join and query the tables and export data to S3   3. Using spark sql to summarize result and save to DB  Processor: multiple, optional   Importer: multiple  Uploader: multiple, optional  

10 How it works (data level)   order table order_check
data Importer Importer comparator Recon Result Let’s move back to this   S3 result

11 How it works (data level)   Let’s see the
config again  

How it works (data level)   SELECT   t2.order_id as
order_id,  COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff  FROM order t1  JOIN order_check t2 on t1.id = t2.order_id  WHERE created_at = ‘2021/03/28’  Table: order  Table: order_check  SELECT   COUNT(  CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int  ) as mismatch_count,  SUM(amount_diff) as mismatch_amount  FROM result_table  Table: result_table  will be created automatically  and saved to output path(s3)  Table: metadata  will be stored in DB as json  We also offer a local offline tool to test SQL

How it works (data level)   Metadata will be shown
as result on Backoffice page (Tool to view results for non-tech people)   Data of result_table can be downloaded from there itself.   raas_testing_db raas_testing_db testing testing testing testing testing recon_test

14 How it works (notification)   Gordon Service Exporter Processor
Comparator Exporter Uploader Notification id  keywords matching   channels  matching  Each component will send messages to kafka,   Gordon service will redirect the message based  on notification settings  

How it works (monitoring)   We will push metrics to
NR, like recon-duration-ms, job-failure   Each team could create alert condition based on the metrics  

Thank you for listening  Currently we are managing over 50+
recon jobs, created in last 4-5 months.

Reconcile as a Service

Reconcile as a Service

PayPay Corporation. PRO

More Decks by PayPay Corporation.

Other Decks in Technology

Featured

Transcript

1 RaaS Reconcile as a Service A configuration based data

2 Background  1. We have a lot of separated recon

3 Overview  Importer Importer

4 Components  Copier:  Download files from external source to S3

5 How to create recon jobs   order table order_check

6 How to create recon jobs   A Yaml file!

How it works (What is Airflow)   • Platform to

8 How it works (airflow)

9 How it works (pipeline)   S3 Importer Processor Comparator

10 How it works (data level)   order table order_check

11 How it works (data level)   Let’s see the

How it works (data level)   SELECT   t2.order_id as

How it works (data level)   Metadata will be shown

14 How it works (notification)   Gordon Service Exporter Processor

How it works (monitoring)   We will push metrics to

Thank you for listening  Currently we are managing over 50+