Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Reconcile as a Service

Reconcile as a Service

PayPay Corporation.
PRO

August 25, 2021
Tweet

More Decks by PayPay Corporation.

Other Decks in Technology

Transcript

  1. 1
    RaaS
    Reconcile as a Service
    A configuration based data processing platform
    August, 2021

    View Slide

  2. 2
    Background

    1. We have a lot of separated recon projects 


    2. According to the kafka improvement plan, we may need more recon jobs to compare data between
    each other


    Problems


    1. How can we create recon jobs with as low as possible development cost? 


    2. How can we manage all recon jobs in one place? 


    View Slide

  3. 3
    Overview

    Importer
    Importer

    View Slide

  4. 4
    Components

    Copier:

    Download files from external source to S3 

    Support Protocol: SFTP, FTP


    Importer:

    Read data, filter and create data snapshot with SparkSQL 

    Support Protocols: S3 file, Data-as-a-service platform Hudi


    Processor:

    Process data from multiple importers with SparkSQL 

    Custom process logic(Spark, Scala) can be defined 


    Comparator:

    Compare input data with SparkSQL 

    - save compared result in S3 

    - save metadata of the result in DB 

    - 

    Uploader:

    Upload the results from Importer, Processor 

    Support Protocols: S3, SFTP


    View Slide

  5. 5
    How to create recon jobs 

    order
    table
    order_check
    data
    Importer
    Importer
    comparator
    Recon
    Result
    We have a order table and a order_check table 

    And we want to compare and find discrepancies from them 


    View Slide

  6. 6
    How to create recon jobs 

    A Yaml file!


    View Slide

  7. How it works (What is Airflow) 

    ● Platform to programmatically author, schedule and monitor workflows
    ● Periodic recurring runs can be set using cron expression
    ● Workflows are called DAG ( Directed Acyclic Graphs )
    ● Nodes = individual processing unit. Can be run parallely or sequentially
    ● To simplify writing Airflow code, we convert YAML written by you to python.
    Compare

    View Slide

  8. 8
    How it works (airflow) 


    View Slide

  9. 9
    How it works (pipeline) 

    S3
    Importer
    Processor
    Comparator
    Importer
    DaaS
    (S3)
    Other
    T1 T2 T3
    Recon
    Result
    Uploader Other
    Comparator: only one

    1. Load data with spark as table

    2. Using spark sql to join and query the
    tables and export data to S3

    3. Using spark sql to summarize result
    and save to DB

    Processor: multiple, optional

    Importer: multiple

    Uploader: multiple, optional

    View Slide

  10. 10
    How it works (data level) 

    order
    table
    order_check
    data
    Importer
    Importer
    comparator
    Recon
    Result
    Let’s move back to this 

    S3 result

    View Slide

  11. 11
    How it works (data level) 

    Let’s see the config again 


    View Slide

  12. How it works (data level) 

    SELECT 

    t2.order_id as order_id,

    COALESCE(t2.amount, 0) - COALESCE(t1.amount, 0) as amount_diff

    FROM order t1

    JOIN order_check t2 on t1.id = t2.order_id

    WHERE created_at = ‘2021/03/28’

    Table: order

    Table: order_check

    SELECT 

    COUNT(

    CASE WHEN amount_diff = 0 THEN 0 ELSE 1 END) as int

    ) as mismatch_count,

    SUM(amount_diff) as mismatch_amount

    FROM result_table

    Table: result_table

    will be created automatically

    and saved to output path(s3)

    Table: metadata

    will be stored in DB as json

    We also offer a local offline tool to test SQL

    View Slide

  13. How it works (data level) 

    Metadata will be shown as result on Backoffice page (Tool to view results for non-tech people) 

    Data of result_table can be downloaded from there itself. 

    raas_testing_db
    raas_testing_db
    testing
    testing
    testing
    testing
    testing
    recon_test

    View Slide

  14. 14
    How it works (notification) 

    Gordon
    Service
    Exporter
    Processor
    Comparator
    Exporter
    Uploader
    Notification id

    keywords matching

    channels

    matching

    Each component will send messages to kafka, 

    Gordon service will redirect the message based

    on notification settings 


    View Slide

  15. How it works (monitoring) 

    We will push metrics to NR, like recon-duration-ms, job-failure 

    Each team could create alert condition based on the metrics 


    View Slide

  16. Thank you for listening

    Currently we are managing over 50+ recon jobs, created in last 4-5 months.

    View Slide