Near Real-Time Data Lake at PayPay

Uddhav Arote
Software Engineer, PayPay

Uddhav Arote
• From India
• MTech, Computer Science
• Work History
◦ Yahoo Japan (2017 - 2020)
◦ PayPay (2019 - Present)

Agenda
• (Big) Data At PayPay
• PayPay’s Data Lake

(Big) Data At PayPay

Every day, PayPay services generate huge amounts of data:
• Payment Transactions
• User Signups
• Active Users
• Merchant Stores
• Cashback Transactions
• ...
Sources include Amazon ElastiCache, Amazon DynamoDB, TiDB, Amazon Aurora, and Apache Kafka.
More than 600 data sources

Data Lake At PayPay

Range Query based Pipeline: ‘Old Pipeline’
• Efficient for small tables
• These tables have a column named updated_at
◦ it keeps track of when the record was last updated, but
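A range query of this kind can be sketched as follows. This is only an illustration and not the pipeline's actual code: the `payments` table, its columns other than `updated_at`, and the `fetch_updated_rows` helper are hypothetical, and an in-memory SQLite database stands in for the real source database.

```python
import sqlite3


def fetch_updated_rows(conn, table, since, until):
    """Return rows whose updated_at falls in the half-open range [since, until)."""
    # The table name is interpolated only because this is a sketch with a
    # fixed, trusted name; the range bounds are passed as bound parameters.
    query = (
        f"SELECT * FROM {table} "
        "WHERE updated_at >= ? AND updated_at < ? "
        "ORDER BY updated_at"
    )
    return conn.execute(query, (since, until)).fetchall()


# Hypothetical source table with an updated_at column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        (1, 500, "2021-01-01 00:10:00"),
        (2, 300, "2021-01-01 01:05:00"),
        (3, 800, "2021-01-01 01:40:00"),
    ],
)

# One incremental pull: everything updated during the 01:00 hour.
rows = fetch_updated_rows(
    conn, "payments", "2021-01-01 01:00:00", "2021-01-01 02:00:00"
)
```

Each run scans the table for one time window, which is why this approach depends on the updated_at column being present on every table it reads.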

‘Old Pipeline’ had problems
1. Missing data
2. Could not query real-time data
3. Slow write speed
4. Could not delete data from the data lake
5. Incremental data is about 7TB ~ 10TB

And, table size grew. Rows started getting upserted more frequently ... and problems started getting bigger.
How can we mitigate these problems?

Welcome to the Near Real-Time Data Lake

Change Data Capture based Pipeline using Apache Hudi: ‘New Pipeline’
• Table size does not matter
• No additional index required
• No missing data
• Data is no longer stale
• Captures all DDL/DML events on a table

NO MISSING DATA
No missing binlogs, since they are published AT LEAST ONCE.
De-duplication of data is handled in stream processing by
• Merging: choose the latest record
• Compaction: compare the record commit time
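The merging step can be sketched in a few lines. This is a minimal illustration under assumptions, not the stream-processing code itself: the event shape (dicts with `id`, `commit_time`, and `amount` fields) and the `merge_latest` helper are hypothetical.

```python
def merge_latest(events):
    """Keep one event per record key, choosing the one with the latest commit time.

    With at-least-once delivery the same binlog event may arrive more than
    once, so exact duplicates collapse into a single entry here.
    """
    latest = {}
    for event in events:
        key = event["id"]
        current = latest.get(key)
        # On equal commit times the later arrival wins; for exact
        # duplicates the result is the same either way.
        if current is None or event["commit_time"] >= current["commit_time"]:
            latest[key] = event
    return latest


# Hypothetical batch: record 1 is updated once, and that update is delivered twice.
events = [
    {"id": 1, "commit_time": 100, "amount": 500},
    {"id": 2, "commit_time": 101, "amount": 300},
    {"id": 1, "commit_time": 105, "amount": 550},
    {"id": 1, "commit_time": 105, "amount": 550},  # duplicate delivery
]

merged = merge_latest(events)
```

Because duplicates are harmless after merging, the publisher can favor redelivery over the risk of dropping an event.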

Support for Fast Writes using Apache Hudi
• support for upserting records in DFS
• support for (pluggable) indexing of records for fast updates and deletes
• support for two table types [1]
◦ Copy on Write (COW): data stored in purely columnar format
◦ Merge On Read (MOR): data stored using a combination of columnar and row formats
• In a MOR table, updates are added to delta (Avro) files, which are later compacted with the columnar files synchronously or asynchronously
• Depending on one’s requirements, compaction can be tuned
◦ We compact on every write, since Amazon Athena only serves the Read Optimized View [2]
• And Amazon EMR supports Hudi [3, 4]
We use the MOR table type
We use Apache Hudi 0.7.0 with Amazon EMR 6.0.0
1. https://hudi.apache.org/docs/overview.html#table-types
2. https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
3. https://aws.amazon.com/emr/features/hudi/
4. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html (support is for Hudi 0.5.0)
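As a rough sketch of what "MOR table, compact on every write" could look like in Hudi write options: the keys below are standard Hudi configuration names, but the values, the `mor_upsert_options` helper, and the example table and field names are assumptions for illustration, not PayPay's actual settings.

```python
def mor_upsert_options(table_name, record_key, precombine_field):
    """Build Hudi write options for upserts into a Merge On Read table."""
    return {
        "hoodie.table.name": table_name,
        # Merge On Read: updates land in delta files first.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two incoming rows share a key, the larger value here wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Assumed way to get "compact on every write": run compaction
        # inline, after each single delta commit.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": "1",
    }


# Hypothetical table and column names.
options = mor_upsert_options("payments", "id", "updated_at")

# With a Spark DataFrame `df` and a base path, these options would
# typically be passed along the lines of:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Compacting after every delta commit trades some write time for a read-optimized view that always reflects the latest write.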

Number of rows per write
The record count per write is between 100K and 700K, with input rows in the millions for some tables.
And the time taken to process this number of rows is ...

Write Speed
About 99% of table writes finish in under 10 minutes.
(Chart: average write time)

Can handle Data Deletion
• When records are deleted, the generated binlog events are applied to the data lake tables
• Perform soft delete: do not delete the record physically
◦ update an internal column xxx_is_deleted to true
• To query the data, one has to add the condition xxx_is_deleted = false
◦ batch job to physically delete records
Amazon S3 usage: LOW
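The soft-delete flow above can be sketched end to end. This is an illustration only: an in-memory SQLite table stands in for a data lake table, and the `users` table and the three helper functions are hypothetical. Only the `xxx_is_deleted` column name comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, xxx_is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)


def apply_delete_event(conn, record_id):
    """Soft delete: flag the row instead of removing it."""
    conn.execute("UPDATE users SET xxx_is_deleted = 1 WHERE id = ?", (record_id,))


def live_rows(conn):
    """Readers must filter on the flag to see only non-deleted rows."""
    return conn.execute(
        "SELECT id, name FROM users WHERE xxx_is_deleted = 0 ORDER BY id"
    ).fetchall()


def purge_deleted(conn):
    """Batch job: physically remove rows that were flagged earlier."""
    return conn.execute("DELETE FROM users WHERE xxx_is_deleted = 1").rowcount


apply_delete_event(conn, 2)  # a delete event for record 2 arrives
visible = live_rows(conn)
rows_before_purge = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
purged = purge_deleted(conn)
rows_after_purge = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The flagged row stays in storage until the batch job runs, so queries that omit the filter would still see it in the meantime.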

How is this Data Lake used?
• Hourly KPI Dashboards
◦ Using views created on Amazon Athena
• Reconciliation platform is built on top of this data, which is used for verifying data correctness
• Business strategy
• Map based visualization
◦ https://map.yahoo.co.jp/congestion?lat=35.67717&lon=139.72141&zoom=13&maptype=basic

Thank you

What’s Next?

• Securing our Data Lake with AWS Lake Formation
• Adding Recommendation Services with Amazon Personalize
• Data Archival
• Enable Time Travel
