projects 2. According to the kafka improvement plan, we may need more recon jobs to compare data between each other Problems 1. How can we create recon jobs with as low as possible development cost? 2. How can we manage all recon jobs in one place?
Support Protocol: SFTP, FTP Importer: Read data, filter and create data snapshot with SparkSQL Support Protocols: S3 file, Data-as-a-service platform Hudi Processor: Process data from multiple importers with SparkSQL Custom process logic(Spark, Scala) can be defined Comparator: Compare input data with SparkSQL - save compared result in S3 - save metadata of the result in DB - Uploader: Upload the results from Importer, Processor Support Protocols: S3, SFTP
programmatically author, schedule and monitor workflows • Periodic recurring runs can be set using cron expression • Workflows are called DAG ( Directed Acyclic Graphs ) • Nodes = individual processing unit. Can be run parallely or sequentially • To simplify writing Airflow code, we convert YAML written by you to python. Compare
Importer DaaS (S3) Other T1 T2 T3 Recon Result Uploader Other Comparator: only one 1. Load data with spark as table 2. Using spark sql to join and query the tables and export data to S3 3. Using spark sql to summarize result and save to DB Processor: multiple, optional Importer: multiple Uploader: multiple, optional
order_id, COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff FROM order t1 JOIN order_check t2 on t1.id = t2.order_id WHERE created_at = ‘2021/03/28’ Table: order Table: order_check SELECT COUNT( CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int ) as mismatch_count, SUM(amount_diff) as mismatch_amount FROM result_table Table: result_table will be created automatically and saved to output path(s3) Table: metadata will be stored in DB as json We also offer a local offline tool to test SQL
as result on Backoffice page (Tool to view results for non-tech people) Data of result_table can be downloaded from there itself. raas_testing_db raas_testing_db testing testing testing testing testing recon_test
Comparator Exporter Uploader Notification id keywords matching channels matching Each component will send messages to kafka, Gordon service will redirect the message based on notification settings