Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Reconcile as a Service

Reconcile as a Service

PayPay Corporation.

August 25, 2021

More Decks by PayPay Corporation.

Other Decks in Technology


  1. 2 Background
 1. We have a lot of separated recon

 2. According to the kafka improvement plan, we may need more recon jobs to compare data between each other
 1. How can we create recon jobs with as low as possible development cost? 
 2. How can we manage all recon jobs in one place? 

  2. 4 Components
 Download files from external source to S3

 Support Protocol: SFTP, FTP
 Read data, filter and create data snapshot with SparkSQL 
 Support Protocols: S3 file, Data-as-a-service platform Hudi
 Process data from multiple importers with SparkSQL 
 Custom process logic(Spark, Scala) can be defined 
 Compare input data with SparkSQL 
 - save compared result in S3 
 - save metadata of the result in DB 
 Upload the results from Importer, Processor 
 Support Protocols: S3, SFTP

  3. 5 How to create recon jobs 
 order table order_check

    data Importer Importer comparator Recon Result We have a order table and a order_check table 
 And we want to compare and find discrepancies from them 

  4. How it works (What is Airflow) 
 • Platform to

    programmatically author, schedule and monitor workflows • Periodic recurring runs can be set using cron expression • Workflows are called DAG ( Directed Acyclic Graphs ) • Nodes = individual processing unit. Can be run parallely or sequentially • To simplify writing Airflow code, we convert YAML written by you to python. Compare
  5. 9 How it works (pipeline) 
 S3 Importer Processor Comparator

    Importer DaaS (S3) Other T1 T2 T3 Recon Result Uploader Other Comparator: only one 
 1. Load data with spark as table 
 2. Using spark sql to join and query the tables and export data to S3 
 3. Using spark sql to summarize result and save to DB
 Processor: multiple, optional 
 Importer: multiple
 Uploader: multiple, optional 

  6. 10 How it works (data level) 
 order table order_check

    data Importer Importer comparator Recon Result Let’s move back to this 
 S3 result
  7. How it works (data level) 
 t2.order_id as

 COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff
 FROM order t1
 JOIN order_check t2 on t1.id = t2.order_id
 WHERE created_at = ‘2021/03/28’
 Table: order
 Table: order_check
 CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int
 ) as mismatch_count,
 SUM(amount_diff) as mismatch_amount
 FROM result_table
 Table: result_table
 will be created automatically
 and saved to output path(s3)
 Table: metadata
 will be stored in DB as json
 We also offer a local offline tool to test SQL
  8. How it works (data level) 
 Metadata will be shown

    as result on Backoffice page (Tool to view results for non-tech people) 
 Data of result_table can be downloaded from there itself. 
 raas_testing_db raas_testing_db testing testing testing testing testing recon_test
  9. 14 How it works (notification) 
 Gordon Service Exporter Processor

    Comparator Exporter Uploader Notification id
 keywords matching 
 Each component will send messages to kafka, 
 Gordon service will redirect the message based
 on notification settings 

  10. How it works (monitoring) 
 We will push metrics to

    NR, like recon-duration-ms, job-failure 
 Each team could create alert condition based on the metrics 

  11. Thank you for listening
 Currently we are managing over 50+

    recon jobs, created in last 4-5 months.