Near Real-Time Data Lake at PayPay

PayPay Corporation.

August 25, 2021

Transcript

  1. Near Real-Time Data Lake at PayPay
    Uddhav Arote
    Software Engineer, PayPay

  2. Uddhav Arote
    ● From India
    ● MTech, Computer Science
    ● Work History
    ○ Yahoo Japan (2017 - 2020)
    ○ PayPay (2019 - Present)

  3. Agenda
    ● (Big) Data At PayPay
    ● PayPay’s Data Lake

  4. (Big) Data At PayPay

  5. Data at PayPay
    Every day, huge amounts of data are generated by PayPay services:
    ● Payment Transactions
    ● User Signups
    ● Active Users
    ● Merchant Stores
    ● Cashback Transactions
    ● ...
    Source systems: Amazon ElastiCache, Amazon DynamoDB, TiDB, Amazon Aurora, Apache Kafka
    More than 600 data sources

  6. Data Lake At PayPay

  7. Data Lake at PayPay
    Range-query-based pipeline: the ‘Old Pipeline’
    ● Efficient for small tables
    ● These tables have a column named updated_at
    ○ it keeps track of when the record was last updated
    but ...

  8. Data Lake at PayPay
    The ‘Old Pipeline’ had problems
    1. Missing data
    2. Could not query real-time data
    3. Slow write speed
    4. Could not delete data from the data lake
    5. Incremental data is about 7 TB to 10 TB
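
A range-query pipeline of this kind can be sketched as follows. This is a minimal illustration using an in-memory SQLite table, not PayPay's actual pipeline; the `payments` table, its columns, and the `pull_incremental` helper are hypothetical. It shows one general limitation of pulling by `updated_at`: a row that is physically deleted between two pulls leaves no trace in the next pull.

```python
import sqlite3


def pull_incremental(conn, low, high):
    """Return rows whose updated_at falls in the half-open range (low, high]."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM payments "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY updated_at",
        (low, high),
    )
    return cur.fetchall()


# Hypothetical source table with an updated_at column (integer timestamps).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)", [(1, 100, 10), (2, 250, 20)]
)

# First pull covers everything up to timestamp 20.
first_batch = pull_incremental(conn, 0, 20)

# Between two pulls: row 1 is updated, row 2 is physically deleted.
conn.execute("UPDATE payments SET amount = 120, updated_at = 30 WHERE id = 1")
conn.execute("DELETE FROM payments WHERE id = 2")

# Second pull sees the update, but nothing tells it that row 2 is gone.
second_batch = pull_incremental(conn, 20, 40)
```

The range query also needs an index on `updated_at` to stay efficient, which is cheap for small tables and increasingly costly as tables grow.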

  9. Data Lake at PayPay
    And table sizes grew. Rows started getting upserted more frequently ...
    and the problems started getting bigger.
    How can we mitigate these problems?

  10. Welcome
    Near Real-Time Data Lake

  11. Data Lake at PayPay
    Change Data Capture (CDC) based pipeline using Apache Hudi: the ‘New Pipeline’
    ● Table size does not matter
    ● No additional index required
    ● No missing data
    ● Data is no longer stale
    ● Captures all DDL/DML events on the table

  12. Data Lake at PayPay
    NO MISSING DATA
    No missing binlogs, since they are published AT LEAST ONCE
    De-duplication of data is handled in stream processing by
    ● Merging: choose the latest record
    ● Compaction: compare the record commit time
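
With at-least-once delivery, the same binlog event can arrive more than once, so the consumer has to collapse duplicates. The "choose the latest record" rule can be sketched in plain Python as below. The event fields (`id`, `status`, `commit_ts`) and the `merge_latest` helper are hypothetical, and this is not Hudi's own implementation, only an illustration of the idea.

```python
def merge_latest(events):
    """Keep one record per key: the one with the highest commit_ts.

    On equal commit_ts the later-arriving event wins, so exact
    duplicates collapse into a single record.
    """
    latest = {}
    for event in events:
        key = event["id"]
        current = latest.get(key)
        if current is None or event["commit_ts"] >= current["commit_ts"]:
            latest[key] = event
    return sorted(latest.values(), key=lambda e: e["id"])


# Hypothetical change events; the second one is a redelivered duplicate.
events = [
    {"id": 1, "status": "CREATED", "commit_ts": 100},
    {"id": 1, "status": "CREATED", "commit_ts": 100},
    {"id": 2, "status": "CREATED", "commit_ts": 105},
    {"id": 1, "status": "COMPLETED", "commit_ts": 110},
]

deduplicated = merge_latest(events)
```

Because the result depends only on the key and the commit time, replaying the same events any number of times gives the same final state.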

  13. Data Lake at PayPay
    Support for Fast Writes using Apache Hudi
    ● Support for upserting records in DFS
    ● Support for (pluggable) indexing of records for fast updates and deletes
    ● Supports two table types [1]
    ○ Copy on Write (COW): data stored in a purely columnar format
    ○ Merge on Read (MOR): data stored using a combination of columnar and row formats
    ● In a MOR table, updates are added to delta (Avro) files, which are later compacted with the columnar files synchronously or asynchronously
    ● Compaction can be tuned to one’s requirements
    ○ We compact on every write, since Amazon Athena only serves the Read Optimized View [2]
    ● Amazon EMR supports Hudi [3, 4]
    We use the MOR table type
    We use Apache Hudi 0.7.0 with Amazon EMR 6.0.0
    1. https://hudi.apache.org/docs/overview.html#table-types
    2. https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
    3. https://aws.amazon.com/emr/features/hudi/
    4. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html (support is for Hudi 0.5.0)
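
As a rough sketch of what a MOR upsert with compaction on every write might look like, the snippet below builds a set of Hudi Spark datasource write options. It is not PayPay's actual configuration: the table name, key fields, and the `hudi_mor_upsert_options` helper are hypothetical, and the option keys should be checked against the documentation for the Hudi version in use.

```python
def hudi_mor_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi write options for upserts into a Merge on Read table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Key that identifies a record, and the field used to pick the
        # latest version when several records share that key.
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Compact inline after every delta commit, so the read-optimized
        # view stays up to date with each write.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": "1",
    }


# Hypothetical table and column names, for illustration only.
options = hudi_mor_upsert_options("payments", "id", "updated_at", "created_date")

# With PySpark and the Hudi bundle available, the options would be passed as:
# df.write.format("hudi").options(**options).mode("append").save("<table base path>")
```

Compacting on every write trades some write latency for fresher read-optimized queries; a table that is read through a snapshot view could compact less often.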

  14. Data Lake at PayPay
    Number of rows per write
    ● Record count between 100K and 700K
    ● Input rows in the millions for some tables
    And the time taken to process this number of rows is ...

  15. Data Lake at PayPay
    Write Speed
    About 99% of table writes finish in under 10 minutes
    Chart: average write time

  16. Data Lake at PayPay
    Can handle Data Deletion
    ● When records are deleted, the generated binlog events are applied to the data lake tables
    ● Perform a soft delete: do not delete the record physically
    ○ update an internal column xxx_is_deleted to true
    ● To query the data, one has to add the condition xxx_is_deleted = false
    ○ batch job to physically delete records
    Amazon S3 usage: LOW
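
The soft-delete flow can be illustrated with a small in-memory SQLite example. Only the `xxx_is_deleted` column name comes from the slide; the `users` table and the helper functions are hypothetical, and SQLite stands in for the data lake purely for illustration (it stores the boolean flag as 0 or 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "xxx_is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")]
)


def apply_delete_event(conn, record_id):
    """Soft delete: flag the record instead of removing it."""
    conn.execute("UPDATE users SET xxx_is_deleted = 1 WHERE id = ?", (record_id,))


def live_rows(conn):
    """Readers must filter out soft-deleted records themselves."""
    return conn.execute(
        "SELECT id, name FROM users WHERE xxx_is_deleted = 0 ORDER BY id"
    ).fetchall()


def purge_deleted(conn):
    """Batch job: physically remove the flagged records."""
    return conn.execute("DELETE FROM users WHERE xxx_is_deleted = 1").rowcount


apply_delete_event(conn, 2)  # a delete event for record 2 arrives
visible = live_rows(conn)
total_before_purge = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
purged = purge_deleted(conn)
total_after_purge = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Until the batch job runs, the flagged record still occupies storage but is hidden from any query that applies the filter.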

  17. Data Lake at PayPay
    How is this Data Lake used?
    ● Hourly KPI Dashboards
    ○ Using views created on Amazon Athena
    ● A reconciliation platform built on top of this data is used for verifying data correctness
    ● Business strategy
    ● Map based visualization
    ○ https://map.yahoo.co.jp/congestion?lat=35.67717&lon=139.72141&zoom=13&maptype=basic

  18. Thank you

  19. What’s Next?

  20. What’s Next?
    ● Securing our Data Lake with AWS Lake Formation
    ● Adding Recommendation Services with Amazon Personalize
    ● Data Archival
    ● Enable Time Travel
