Reconcile as a Service - Speaker Deck

Slide 1

Slide 1 text

1 RaaS Reconcile as a Service A configuration based data processing platform August, 2021

Slide 2

Slide 2 text

2 Background  1. We have a lot of separated recon projects     2. According to the kafka improvement plan, we may need more recon jobs to compare data between each other    Problems    1. How can we create recon jobs with as low as possible development cost?     2. How can we manage all recon jobs in one place?    

Slide 3

Slide 3 text

3 Overview  Importer Importer

Slide 4

Slide 4 text

4 Components  Copier:  Download files from external source to S3   Support Protocol: SFTP, FTP    Importer:  Read data, filter and create data snapshot with SparkSQL   Support Protocols: S3 file, Data-as-a-service platform Hudi    Processor:  Process data from multiple importers with SparkSQL   Custom process logic(Spark, Scala) can be defined     Comparator:  Compare input data with SparkSQL   - save compared result in S3   - save metadata of the result in DB   -   Uploader:  Upload the results from Importer, Processor   Support Protocols: S3, SFTP 

Slide 5

Slide 5 text

5 How to create recon jobs   order table order_check data Importer Importer comparator Recon Result We have a order table and a order_check table   And we want to compare and find discrepancies from them  

Slide 6

Slide 6 text

6 How to create recon jobs   A Yaml file! 

Slide 7

Slide 7 text

How it works (What is Airflow)   ● Platform to programmatically author, schedule and monitor workflows ● Periodic recurring runs can be set using cron expression ● Workflows are called DAG ( Directed Acyclic Graphs ) ● Nodes = individual processing unit. Can be run parallely or sequentially ● To simplify writing Airflow code, we convert YAML written by you to python. Compare

Slide 8

Slide 8 text

8 How it works (airflow)  

Slide 9

Slide 9 text

9 How it works (pipeline)   S3 Importer Processor Comparator Importer DaaS (S3) Other T1 T2 T3 Recon Result Uploader Other Comparator: only one   1. Load data with spark as table   2. Using spark sql to join and query the tables and export data to S3   3. Using spark sql to summarize result and save to DB  Processor: multiple, optional   Importer: multiple  Uploader: multiple, optional  

Slide 10

Slide 10 text

10 How it works (data level)   order table order_check data Importer Importer comparator Recon Result Let’s move back to this   S3 result

Slide 11

Slide 11 text

11 How it works (data level)   Let’s see the config again  

Slide 12

Slide 12 text

How it works (data level)   SELECT   t2.order_id as order_id,  COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff  FROM order t1  JOIN order_check t2 on t1.id = t2.order_id  WHERE created_at = ‘2021/03/28’  Table: order  Table: order_check  SELECT   COUNT(  CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int  ) as mismatch_count,  SUM(amount_diff) as mismatch_amount  FROM result_table  Table: result_table  will be created automatically  and saved to output path(s3)  Table: metadata  will be stored in DB as json  We also offer a local offline tool to test SQL

Slide 13

Slide 13 text

How it works (data level)   Metadata will be shown as result on Backoffice page (Tool to view results for non-tech people)   Data of result_table can be downloaded from there itself.   raas_testing_db raas_testing_db testing testing testing testing testing recon_test

Slide 14

Slide 14 text

14 How it works (notification)   Gordon Service Exporter Processor Comparator Exporter Uploader Notification id  keywords matching   channels  matching  Each component will send messages to kafka,   Gordon service will redirect the message based  on notification settings  

Slide 15

Slide 15 text

How it works (monitoring)   We will push metrics to NR, like recon-duration-ms, job-failure   Each team could create alert condition based on the metrics  

Slide 16

Slide 16 text

Thank you for listening  Currently we are managing over 50+ recon jobs, created in last 4-5 months.