Slide 1

Slide 1 text

Near Real-Time Data Lake at PayPay
Uddhav Arote, Software Engineer, PayPay

Slide 2

Slide 2 text

Uddhav Arote
● From India
● MTech, Computer Science
● Work History
  ○ Yahoo Japan (2017 - 2020)
  ○ PayPay (2019 - Present)

Slide 3

Slide 3 text

Agenda
● (Big) Data at PayPay
● PayPay's Data Lake

Slide 4

Slide 4 text

(Big) Data At PayPay

Slide 5

Slide 5 text

Data at PayPay
Every day, huge amounts of data are generated by PayPay services:
● Payment Transactions
● User Signups
● Active Users
● Merchant Stores
● Cashback Transactions
● ...
Sources include Amazon ElastiCache, Amazon DynamoDB, TiDB, Amazon Aurora, and Apache Kafka: more than 600 data sources in total.

Slide 6

Slide 6 text

Data Lake At PayPay

Slide 7

Slide 7 text

Data Lake at PayPay
Range-query based pipeline: the 'Old Pipeline'
● Efficient for small tables
● These tables have a column named updated_at
  ○ it keeps track of when the record was last updated, but ...
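A minimal sketch of what such a range-query pull can look like, assuming a JDBC-readable source table (here called payments) with an updated_at column; the connection details, table name, and watermark handling are illustrative assumptions, not the actual pipeline code.

```python
# Range-query ("Old Pipeline") style incremental pull, sketched with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-query-pull").getOrCreate()

# Pull only rows updated since the last successful run (watermark kept by the job).
last_watermark = "2021-01-01 00:00:00"   # e.g. loaded from job metadata

query = f"""
    (SELECT * FROM payments
     WHERE updated_at > '{last_watermark}') AS incremental_batch
"""

incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db:3306/paypay")  # hypothetical source
    .option("dbtable", query)
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Append the slice into the data lake. Rows whose updated_at was never bumped
# (or was bumped during the read window) can be missed, which is one of the
# problems listed on the next slide.
incremental_df.write.mode("append").parquet("s3://data-lake/payments/")
```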

Slide 8

Slide 8 text

Data Lake at PayPay
The 'Old Pipeline' had problems:
1. Missing data
2. Could not query real-time data
3. Slow write speed
4. Could not delete data from the data lake
5. Incremental data is about 7 TB ~ 10 TB

Slide 9

Slide 9 text

Data Lake at PayPay
And table sizes grew, rows started getting upserted more frequently, and the problems got bigger.
How can we mitigate these problems?

Slide 10

Slide 10 text

Welcome to the Near Real-Time Data Lake

Slide 11

Slide 11 text

Data Lake at PayPay
Change Data Capture (CDC) based pipeline using Apache Hudi: the 'New Pipeline'
● Table size does not matter
● No additional index required
● No missing data
● Data is no longer stale
● Captures all DDL/DML events on a table
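A hedged sketch of what a CDC-based ingestion job of this kind could look like: change events read from a Kafka topic and upserted into a Hudi table with Spark Structured Streaming. The topic name, event schema, and S3 paths are assumptions for illustration, not PayPay's actual implementation.

```python
# CDC events from Kafka upserted into Hudi via Structured Streaming (sketch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, BooleanType)

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Assumed flattened change-event schema (one row per DML event).
event_schema = StructType([
    StructField("id", StringType()),
    StructField("amount", StringType()),
    StructField("updated_at", TimestampType()),
    StructField("xxx_is_deleted", BooleanType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "payments.binlog")          # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def upsert_batch(batch_df, batch_id):
    # Each micro-batch is upserted; Hudi merges duplicates by record key,
    # keeping the row with the latest precombine value (updated_at).
    (batch_df.write.format("org.apache.hudi")
        .option("hoodie.table.name", "payments")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .mode("append")
        .save("s3://data-lake/hudi/payments/"))

events.writeStream.foreachBatch(upsert_batch).start().awaitTermination()
```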

Slide 12

Slide 12 text

Data Lake at PayPay
NO MISSING DATA
No missing binlogs, since they are published at least once.
De-duplication of data is handled in stream processing by:
● Merging: choose the latest record
● Compaction: compare the record commit times
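To illustrate the merging rule ("choose the latest record"), here is a simplified stand-alone PySpark sketch; in the real pipeline Hudi's precombine field performs the equivalent merge during the upsert. The column names are assumptions.

```python
# Keep only the latest event per record key when duplicates arrive
# (binlogs are published at least once, so duplicates are expected).
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

events = spark.createDataFrame(
    [("r1", "2021-06-01 10:00:00", 100),
     ("r1", "2021-06-01 10:00:00", 100),   # duplicate delivery of the same event
     ("r1", "2021-06-01 10:05:00", 120),   # newer update for the same record
     ("r2", "2021-06-01 09:00:00", 50)],
    ["id", "updated_at", "amount"],
)

latest_per_key = Window.partitionBy("id").orderBy(col("updated_at").desc())
deduped = (events.withColumn("rn", row_number().over(latest_per_key))
                 .filter(col("rn") == 1)
                 .drop("rn"))
deduped.show()
```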

Slide 13

Slide 13 text

Data Lake at PayPay
Support for fast writes using Apache Hudi
● Support for upserting records in DFS
● Support for (pluggable) indexing of records for fast updates and deletes
● Support for two table types1
  ○ Copy on Write (COW): data stored in a purely columnar format
  ○ Merge on Read (MOR): data stored using a combination of columnar and row formats
● In a MOR table, updates are written to delta (Avro) files which are later compacted with the columnar files, synchronously or asynchronously
● Compaction can be tuned to one's requirements
  ○ We compact on every write, since Amazon Athena only serves the Read Optimized view2
● Amazon EMR supports Hudi3,4
We use the MOR table type, with Apache Hudi 0.7.0 on Amazon EMR 6.0.0.
1. https://hudi.apache.org/docs/overview.html#table-types
2. https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
3. https://aws.amazon.com/emr/features/hudi/
4. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html (support is for Hudi 0.5.0)
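As a sketch of the configuration described above (MOR table type, compaction on every write), the Hudi write options might look roughly like this. The table name, key fields, partition field, and paths are illustrative assumptions; the option keys themselves are standard Hudi 0.7.x configuration.

```python
# Writing a Merge on Read (MOR) Hudi table with inline compaction on every commit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor-write").getOrCreate()

updates = spark.createDataFrame(
    [("r1", 120, "2021-06-01 10:05:00", "2021-06-01")],
    ["id", "amount", "updated_at", "created_date"],
)

hudi_options = {
    "hoodie.table.name": "payments",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "created_date",
    # Compact on every write so Athena's Read Optimized view stays fresh.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "1",
}

(updates.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://data-lake/hudi/payments/"))
```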

Slide 14

Slide 14 text

Data Lake at PayPay
Number of rows per write: record counts range between 100K and 700K, with input rows in the millions for some tables.
And the time taken to process this number of rows is ..

Slide 15

Slide 15 text

Data Lake at PayPay
Write speed: about 99% of table writes complete in under 10 minutes (average write time).

Slide 16

Slide 16 text

Data Lake at PayPay
Can handle data deletion (Amazon S3 usage stays low)
● When records are deleted, the generated binlog events are applied to the data lake tables
● Perform a soft delete: do not delete the record physically
  ○ update an internal column xxx_is_deleted to true
● To query the data, one has to add the condition xxx_is_deleted = false
  ○ a batch job later deletes the records physically
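A rough sketch of the soft-delete flow under these assumptions: a delete binlog event is applied as an upsert that sets xxx_is_deleted (the column naming used on the slide) to true, and readers add the filter to hide such rows. Schema handling and read paths are simplified for illustration and may differ by Hudi version.

```python
# Soft delete: apply a delete event as an upsert that flips the deletion flag.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("soft-delete-demo").getOrCreate()

# A delete event arrives for record r1; instead of removing the row,
# upsert it with the deletion flag set to true.
delete_event = (spark.createDataFrame(
        [("r1", "2021-06-02 08:00:00")], ["id", "updated_at"])
    .withColumn("xxx_is_deleted", lit(True)))

(delete_event.write.format("org.apache.hudi")
    .option("hoodie.table.name", "payments")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("s3://data-lake/hudi/payments/"))

# Consumers (e.g. Athena views) must add the condition to hide soft-deleted rows.
live_rows = (spark.read.format("org.apache.hudi")
    .load("s3://data-lake/hudi/payments/")
    .filter("xxx_is_deleted = false"))
live_rows.show()
```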

Slide 17

Slide 17 text

Data Lake at PayPay
How is this data lake used?
● Hourly KPI dashboards
  ○ Using views created on Amazon Athena
● A reconciliation platform built on top of this data, used for verifying data correctness
● Business strategy
● Map-based visualization
  ○ https://map.yahoo.co.jp/congestion?lat=35.67717&lon=139.72141&zoom=13&maptype=basic

Slide 18

Slide 18 text

Thank you

Slide 19

Slide 19 text

What’s Next?

Slide 20

Slide 20 text

What's Next?
● Securing our data lake with AWS Lake Formation
● Adding recommendation services with Amazon Personalize
● Data archival
● Enable time travel