Slide 1

Slide 1 text

1 RaaS Reconcile as a Service A configuration based data processing platform August, 2021

Slide 2

Slide 2 text

2 Background
 1. We have a lot of separated recon projects 
 
 2. According to the kafka improvement plan, we may need more recon jobs to compare data between each other
 
 Problems
 
 1. How can we create recon jobs with as low as possible development cost? 
 
 2. How can we manage all recon jobs in one place? 
 


Slide 3

Slide 3 text

3 Overview
 Importer Importer

Slide 4

Slide 4 text

4 Components
 Copier:
 Download files from external source to S3 
 Support Protocol: SFTP, FTP
 
 Importer:
 Read data, filter and create data snapshot with SparkSQL 
 Support Protocols: S3 file, Data-as-a-service platform Hudi
 
 Processor:
 Process data from multiple importers with SparkSQL 
 Custom process logic(Spark, Scala) can be defined 
 
 Comparator:
 Compare input data with SparkSQL 
 - save compared result in S3 
 - save metadata of the result in DB 
 - 
 Uploader:
 Upload the results from Importer, Processor 
 Support Protocols: S3, SFTP


Slide 5

Slide 5 text

5 How to create recon jobs 
 order table order_check data Importer Importer comparator Recon Result We have a order table and a order_check table 
 And we want to compare and find discrepancies from them 


Slide 6

Slide 6 text

6 How to create recon jobs 
 A Yaml file!


Slide 7

Slide 7 text

How it works (What is Airflow) 
 ● Platform to programmatically author, schedule and monitor workflows ● Periodic recurring runs can be set using cron expression ● Workflows are called DAG ( Directed Acyclic Graphs ) ● Nodes = individual processing unit. Can be run parallely or sequentially ● To simplify writing Airflow code, we convert YAML written by you to python. Compare

Slide 8

Slide 8 text

8 How it works (airflow) 


Slide 9

Slide 9 text

9 How it works (pipeline) 
 S3 Importer Processor Comparator Importer DaaS (S3) Other T1 T2 T3 Recon Result Uploader Other Comparator: only one 
 1. Load data with spark as table 
 2. Using spark sql to join and query the tables and export data to S3 
 3. Using spark sql to summarize result and save to DB
 Processor: multiple, optional 
 Importer: multiple
 Uploader: multiple, optional 


Slide 10

Slide 10 text

10 How it works (data level) 
 order table order_check data Importer Importer comparator Recon Result Let’s move back to this 
 S3 result

Slide 11

Slide 11 text

11 How it works (data level) 
 Let’s see the config again 


Slide 12

Slide 12 text

How it works (data level) 
 SELECT 
 t2.order_id as order_id,
 COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff
 FROM order t1
 JOIN order_check t2 on t1.id = t2.order_id
 WHERE created_at = ‘2021/03/28’
 Table: order
 Table: order_check
 SELECT 
 COUNT(
 CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int
 ) as mismatch_count,
 SUM(amount_diff) as mismatch_amount
 FROM result_table
 Table: result_table
 will be created automatically
 and saved to output path(s3)
 Table: metadata
 will be stored in DB as json
 We also offer a local offline tool to test SQL

Slide 13

Slide 13 text

How it works (data level) 
 Metadata will be shown as result on Backoffice page (Tool to view results for non-tech people) 
 Data of result_table can be downloaded from there itself. 
 raas_testing_db raas_testing_db testing testing testing testing testing recon_test

Slide 14

Slide 14 text

14 How it works (notification) 
 Gordon Service Exporter Processor Comparator Exporter Uploader Notification id
 keywords matching 
 channels
 matching
 Each component will send messages to kafka, 
 Gordon service will redirect the message based
 on notification settings 


Slide 15

Slide 15 text

How it works (monitoring) 
 We will push metrics to NR, like recon-duration-ms, job-failure 
 Each team could create alert condition based on the metrics 


Slide 16

Slide 16 text

Thank you for listening
 Currently we are managing over 50+ recon jobs, created in last 4-5 months.