• Merchant Stores
• Cashback Transactions
• ...

Data at PayPay
Every day, a huge amount of data is generated by PayPay services.
Amazon ElastiCache, Amazon DynamoDB, TiDB, Amazon Aurora, Apache Kafka
More than 600 data sources
using Apache Hudi: ‘New Pipeline’
• Table size does not matter
• No additional index required
• No missing data
• Data is no longer stale
• Capture all DDL/DML events on the table
are published AT LEAST ONCE
De-duplication of data is handled in stream processing by
• Merging: choose the latest record
• Compaction: compare the record commit time
NO MISSING DATA
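The "choose the latest record" rule for at-least-once delivery can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the field names `id` and `commit_time` and the function `merge_latest` are hypothetical, and the events are assumed to be plain dictionaries.

```python
from typing import Dict, Iterable, List


def merge_latest(events: Iterable[dict]) -> List[dict]:
    """Keep one record per key: the one with the greatest commit time.

    Assumes each event carries a record key under "id" and a comparable
    commit time under "commit_time" (both names are hypothetical).
    """
    latest: Dict[object, dict] = {}
    for event in events:
        key = event["id"]
        current = latest.get(key)
        # A redelivered duplicate has the same commit time, so it simply
        # overwrites an identical record; an older event never wins.
        if current is None or event["commit_time"] >= current["commit_time"]:
            latest[key] = event
    return list(latest.values())


if __name__ == "__main__":
    stream = [
        {"id": 1, "commit_time": 10, "amount": 100},
        {"id": 1, "commit_time": 10, "amount": 100},  # duplicate delivery
        {"id": 1, "commit_time": 12, "amount": 120},  # newer update
    ]
    print(merge_latest(stream))
```

Because the comparison only looks at the commit time, replaying the same event any number of times leaves the result unchanged, which is what makes at-least-once delivery safe here.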
for (pluggable) indexing records for fast updates and deletes
• Supports two table types [1]
  ◦ Copy on Write (COW): data stored in purely columnar format
  ◦ Merge on Read (MOR): data stored using a combination of columnar and row formats
• In a MOR table, updates are added to delta (Avro) files, which are later compacted with the columnar files synchronously or asynchronously
• Depending on one’s requirements, compaction can be tuned
  ◦ We compact on every write, since Amazon Athena only serves the Read Optimized View [2]
• And Amazon EMR supports Hudi [3, 4]

Data Lake at PayPay
We use MOR table type
Amazon EMR
We use Apache Hudi 0.7.0 with EMR 6.0.0

1. https://hudi.apache.org/docs/overview.html#table-types
2. https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
3. https://aws.amazon.com/emr/features/hudi/
4. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html (support is for Hudi 0.5.0)

Support for Fast Writes using Apache Hudi
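A MOR table that is compacted on every write could be configured roughly as below. This is a sketch under assumptions, not PayPay's actual settings: the helper `hudi_mor_write_options` and the table and column names in the usage example are hypothetical. The option keys are standard Hudi Spark datasource and compaction settings.

```python
from typing import Dict


def hudi_mor_write_options(
    table_name: str, record_key: str, precombine_field: str
) -> Dict[str, str]:
    """Build Hudi write options for a Merge on Read table with inline compaction.

    All argument values are supplied by the caller; nothing here reflects
    the real table layout used in the deck.
    """
    return {
        "hoodie.table.name": table_name,
        # Merge on Read: updates land in delta files first.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two records share a key, the larger precombine value wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Compact synchronously, after every delta commit, so the
        # read-optimized view stays current.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": "1",
    }


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    opts = hudi_mor_write_options("payments", "payment_id", "updated_at")
    for key, value in opts.items():
        print(f"{key}={value}")
    # With a Spark DataFrame `df`, these would typically be passed as:
    # df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Compacting after each delta commit trades write latency for freshness of the read-optimized view, which matches the stated reason of serving queries through Amazon Athena.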
applied to the data lake tables
• Perform soft delete: do not delete the record physically
  ◦ update an internal column xxx_is_deleted to true
• To query the data, one has to add the condition xxx_is_deleted = false
  ◦ batch job to physically delete records

Data Lake at PayPay
Can handle Data Deletion
Amazon S3 Usage LOW
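The soft-delete flow (flag the record, filter on read, purge in a batch job) can be sketched with an in-memory table. This is an illustration only: the table is modelled as a dictionary keyed by a hypothetical integer record ID, the function names are made up, and `xxx_is_deleted` is used literally because the slide does not give the real column prefix.

```python
from typing import Dict, List

# Column name as written on the slide; "xxx" stands for an unspecified prefix.
DELETED_FLAG = "xxx_is_deleted"

Table = Dict[int, dict]


def soft_delete(table: Table, record_id: int) -> None:
    """Mark a record as deleted without removing it physically."""
    if record_id in table:
        table[record_id][DELETED_FLAG] = True


def query_live(table: Table) -> List[dict]:
    """Return only rows where the deleted flag is false (or absent)."""
    return [row for row in table.values() if not row.get(DELETED_FLAG, False)]


def purge_deleted(table: Table) -> int:
    """Batch step: physically remove flagged rows; return how many were removed."""
    doomed = [rid for rid, row in table.items() if row.get(DELETED_FLAG, False)]
    for rid in doomed:
        del table[rid]
    return len(doomed)


if __name__ == "__main__":
    table: Table = {
        1: {"id": 1, DELETED_FLAG: False},
        2: {"id": 2, DELETED_FLAG: False},
    }
    soft_delete(table, 1)
    print(query_live(table))     # only the row that is not flagged
    print(purge_deleted(table))  # number of rows physically removed
```

Separating the flag update from the physical purge keeps deletes as cheap as any other update on the write path and leaves the expensive rewrite to a scheduled batch job.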
used?
• Hourly KPI Dashboards
  ◦ Using views created on Amazon Athena
• Reconciliation platform is built on top of this data which is used for verifying data correctness
• Business strategy
• Map based visualization
  ◦ https://map.yahoo.co.jp/congestion?lat=35.67717&lon=139.72141&zoom=13&maptype=basic