
Near Real-Time Data Lake at PayPay

PayPay Corporation.

August 25, 2021

Transcript

  1. 2 Uddhav Arote
    • From India
    • MTech, Computer Science
    • Work History
      ◦ Yahoo Japan (2017 - 2020)
      ◦ PayPay (2019 - Present)
  2. 5 Data at PayPay
    Every day, huge amounts of data are generated by PayPay services:
    • Payment Transactions
    • User Signups
    • Active Users
    • Merchant Stores
    • Cashback Transactions
    • ...
    Amazon ElastiCache, Amazon DynamoDB, TiDB, Amazon Aurora, Apache Kafka
    More than 600 data sources
  3. 7 Data Lake at PayPay
    Range-query-based pipeline: ‘Old Pipeline’
    • Efficient for small tables
    • These tables have a column named updated_at
      ◦ it keeps track of when the record was last updated, but
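A range-query extraction over updated_at can be sketched as follows. This is only an illustration and not PayPay's actual job: the in-memory SQLite database, the `payments` table, its columns, and the helper `fetch_updated_rows` are all hypothetical stand-ins for a real source database.

```python
import sqlite3


def fetch_updated_rows(conn, table, start, end):
    """Return rows whose updated_at falls in the half-open window [start, end)."""
    # `table` is interpolated only because this is a sketch with a trusted name.
    query = (
        f"SELECT id, amount, updated_at FROM {table} "
        "WHERE updated_at >= ? AND updated_at < ? ORDER BY updated_at"
    )
    return conn.execute(query, (start, end)).fetchall()


# Hypothetical source table, standing in for a real service database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        (1, 100, "2021-08-25 00:10:00"),
        (2, 250, "2021-08-25 01:05:00"),
        (3, 75, "2021-08-25 01:40:00"),
    ],
)

# One incremental run: pull only rows touched in the 01:00 to 02:00 window.
batch = fetch_updated_rows(
    conn, "payments", "2021-08-25 01:00:00", "2021-08-25 02:00:00"
)
```

Each run scans the table by updated_at, so the cost of a run depends on how the source database can serve that range predicate.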
  4. 8 Data Lake at PayPay
    The ‘Old Pipeline’ had problems:
    1. Missing data
    2. Could not query real-time data
    3. Slow write speed
    4. Could not delete data from the data lake
    5. Incremental data is about 7 TB to 10 TB
  5. 9 Data Lake at PayPay
    And the tables grew. Rows started getting upserted more frequently, and the problems started getting bigger.
    How can we mitigate these problems?
  6. 11 Data Lake at PayPay
    Change Data Capture (CDC) based pipeline using Apache Hudi: ‘New Pipeline’
    • Table size does not matter
    • No additional index required
    • No missing data
    • Data is no longer stale
    • Captures all DDL/DML events on the table
  7. 12 Data Lake at PayPay
    NO MISSING DATA
    No missing binlogs, since they are published AT LEAST ONCE
    De-duplication of data is handled in stream processing by
    • Merging: choose the latest record
    • Compaction: compare the record commit time
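The "choose the latest record" merge can be sketched in a few lines. This is a minimal illustration, assuming each binlog event carries a primary key and a commit timestamp; the field names `id` and `commit_ts` and the function `merge_latest` are hypothetical, not taken from the deck.

```python
def merge_latest(events):
    """Keep one record per key: the one with the highest commit timestamp.

    At-least-once delivery can replay an event or deliver it late, so an
    older or duplicate event must never overwrite a newer one.
    """
    latest = {}
    for event in events:
        current = latest.get(event["id"])
        if current is None or event["commit_ts"] > current["commit_ts"]:
            latest[event["id"]] = event
    return latest


events = [
    {"id": 1, "amount": 100, "commit_ts": 10},
    {"id": 1, "amount": 100, "commit_ts": 10},  # duplicate delivery
    {"id": 1, "amount": 120, "commit_ts": 12},  # newer version of row 1
    {"id": 2, "amount": 50, "commit_ts": 11},
    {"id": 1, "amount": 100, "commit_ts": 10},  # late redelivery of the old version
]

merged = merge_latest(events)
```

Because the comparison uses the commit time rather than arrival order, replayed events are harmless and the result is the same however often a binlog is redelivered.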
  8. 13 Data Lake at PayPay
    Support for Fast Writes using Apache Hudi
    • Support for upserting records in DFS
    • Support for (pluggable) indexing of records for fast updates and deletes
    • Support for two table types [1]
      ◦ Copy on Write (COW): data stored in purely columnar format
      ◦ Merge on Read (MOR): data stored using a combination of columnar and row formats
    • In a MOR table, updates are added to delta (Avro) files, which are later compacted with the columnar files synchronously or asynchronously
    • Depending on one’s requirements, compaction can be tuned
      ◦ We compact on every write, since Amazon Athena only serves the Read Optimized View [2]
    • And Amazon EMR supports Hudi [3, 4]
    We use the MOR table type
    We use Apache Hudi 0.7.0 with Amazon EMR 6.0.0
    [1] https://hudi.apache.org/docs/overview.html#table-types
    [2] https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
    [3] https://aws.amazon.com/emr/features/hudi/
    [4] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html (support is for Hudi 0.5.0)
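As a rough sketch of what "MOR table, upsert, compact on every write" could look like as Hudi Spark datasource write options: the option keys below are standard Hudi configuration names, but the table name, the record key `id`, and the precombine field `updated_at` are assumptions for illustration, not PayPay's actual settings.

```python
# Hypothetical Hudi write options for a Merge on Read table with inline compaction.
hudi_options = {
    "hoodie.table.name": "payments",                           # assumed table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # MOR table type
    "hoodie.datasource.write.operation": "upsert",             # insert or update by key
    "hoodie.datasource.write.recordkey.field": "id",           # assumed primary key column
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed: latest value wins
    "hoodie.compact.inline": "true",                           # compact as part of the write
    "hoodie.compact.inline.max.delta.commits": "1",            # compact after every delta commit
}

# With a Spark DataFrame `df` and a target path (neither defined here), the
# options would typically be passed to the writer along these lines:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

Compacting after every delta commit trades some write time for a read-optimized view that always reflects the latest write, which matches the constraint the slide describes for Amazon Athena.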
  9. 14 Data Lake at PayPay
    Number of rows per write
    • Record count is between 100K and 700K
    • Input rows are in the millions for some tables
    And the time taken to process this number of rows is ..
  10. 15 Data Lake at PayPay
    Write Speed
    About 99% of table writes take under 10 minutes
    Avg write time
  11. 16 Data Lake at PayPay
    Can handle Data Deletion
    • When records are deleted, the generated binlog events are applied to the data lake tables
    • Perform a soft delete: do not delete the record physically
      ◦ update an internal column xxx_is_deleted to true
    • To query the data, one has to add the condition xxx_is_deleted = false
      ◦ a batch job physically deletes the records
    Amazon S3 usage: LOW
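The soft-delete flow above can be sketched with plain dictionaries. This is an assumption-laden illustration rather than the actual pipeline: the event shape (`op`, `id`, `row`) and the helpers `apply_event`, `live_rows`, and `purge_deleted` are hypothetical; only the column name xxx_is_deleted comes from the slide.

```python
DELETED_FLAG = "xxx_is_deleted"  # internal column named on the slide


def apply_event(table, event):
    """Apply one binlog event; deletes only flip the flag (soft delete)."""
    if event["op"] == "delete":
        row = table.get(event["id"])
        if row is not None:
            row[DELETED_FLAG] = True
    else:  # insert or update
        table[event["id"]] = {**event["row"], DELETED_FLAG: False}


def live_rows(table):
    """Equivalent of querying with the condition xxx_is_deleted = false."""
    return [row for row in table.values() if not row[DELETED_FLAG]]


def purge_deleted(table):
    """Batch job: physically remove soft-deleted rows; return how many were removed."""
    doomed = [key for key, row in table.items() if row[DELETED_FLAG]]
    for key in doomed:
        del table[key]
    return len(doomed)


table = {}
apply_event(table, {"op": "upsert", "id": 1, "row": {"id": 1, "amount": 100}})
apply_event(table, {"op": "upsert", "id": 2, "row": {"id": 2, "amount": 50}})
apply_event(table, {"op": "delete", "id": 1})

visible = live_rows(table)      # only row 2 is visible to queries
removed = purge_deleted(table)  # row 1 is physically removed later
```

Keeping the delete as a flag update means it flows through the same upsert path as any other change, while the separate purge step reclaims storage on its own schedule.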
  12. 17 Data Lake at PayPay
    How is this Data Lake used?
    • Hourly KPI dashboards
      ◦ Using views created on Amazon Athena
    • A reconciliation platform built on top of this data is used to verify data correctness
    • Business strategy
    • Map-based visualization
      ◦ https://map.yahoo.co.jp/congestion?lat=35.67717&lon=139.72141&zoom=13&maptype=basic
  13. 20 What’s Next?
    • Securing our Data Lake with AWS Lake Formation
    • Adding Recommendation Services with Amazon Personalize
    • Data Archival
    • Enable Time Travel