Slide 1

Slide 1 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Modern Transactional Data Lake using Apache Iceberg in Amazon S3 Sungmin Kim Sr. Solutions Architect AWS

Slide 2

Slide 2 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • Pitfalls of Data Lake using Append-Only Distributed File System • CDC-based UPSERT in Data Lake § Using Views to UPSERT § Using Open Table Formats – Apache Iceberg, Hudi, Delta Lake • Modern Transactional Data Lake Architecture

Slide 3

Slide 3 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CRM IoT WEB Messages CDC* Event Streams * CDC: Change Data Capture Data Analytics System RDBMS Data Insights

Slide 4

Slide 4 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. ... Not Easy to Scale RDBMS RDBMS (Replica) RDBMS (Primary) Query Engine (1) Storage Query Engine (2) Query Engine (3) Storage interface Scale-Out Scale-Out Primary-Replica Cluster RDBMS (Primary) Scale-Up RDBMS (Replica) Scale-Out Replica Primary Distributed File System RDBMS

Slide 5

Slide 5 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. DFS* Stream Storage Data Lake Data Mart AI/ML Data Insights CRM IoT WEB Messages CDC Event Streams Data Lake * DFS: Distributed File System Data Ware house Stream Delivery

Slide 6

Slide 6 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CRM IoT WEB Messages CDC Event Streams Data Lake on Amazon Kinesis Data Streams Amazon Kinesis Data Firehose Amazon Athena Amazon S3 Data Lake Amazon QuickSight

Slide 7

Slide 7 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. IMMUTABLE Objects Distributed CAN NOT Update/Delete In-Place Insert (Append)-Only interface (HTTPS, SDK APIs) Transactional (X) MUTABLE Records Files per tables Update/Delete In-Place Insert/Update/Delete table1 table2 table3 RDBMS Transactional (O) RDBMS vs. S3 (≈ Distributed Object Storage) File System File System File System Amazon S3

Slide 8

Slide 8 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. RDBMS CDC How to Update/Delete CDC in Amazon S3? Amazon Kinesis Data Streams Amazon Kinesis Data Firehose Amazon Athena Amazon S3 AWS DMS datalake/ year=2023/month=05/day=03/hour=01/ obj1.parquet obj2.parquet … year=2023/month=05/day=03/hour=02/ updated-obj1.parquet … Data Lake Operation Changed Data I, pk1, c1, c2, t1 U, pk1, c1, c2, t2 D, pk0, c1, c2, t3

Slide 9

Slide 9 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Using Views to UPSERT: Merge-On-Read RDBMS Updated/ Deleted Data Inserted Data View Table Operation Changed Data I, pk1, c1, c2, t1 U, pk1, c1, c2, t2 I, pk1, c1, c2, t1 U, pk1, c1, c2, t2 D, pk0, c1, c2, t3 I, pk1, c1, c2, t1 U, pk1, c1, c2, t2 I, pk0, c1, c2, t0 D, pk0, c1, c2, t3 I, pk0, c1, c2, t0

Slide 10

Slide 10 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Using Views to UPSERT: Merge-On-Read RDBMS Updated/Deleted Data Inserted Data View Table Amazon S3 Amazon Athena Amazon Redshift Logical View Materialized View CDC

Slide 11

Slide 11 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Logical Views vs. Materialized Views CREATE VIEW view_tbl AS SELECT * FROM org_tbl, delta_tbl SELECT * FROM view_tbl SELECT * FROM ( SELECT * FROM org_tbl, delta_tbl ) SELECT * FROM view_tbl Materialized View Logical View org_tbl Amazon S3 view_tbl + delta_tbl

Slide 12

Slide 12 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Materialized Views in Amazon Redshift

Slide 13

Slide 13 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Kinesis Data Streams Amazon Redshift / Redshift Serverless Permanent Tables Real-time Materialized View Streaming Table … … Amazon QuickSight Amazon MSK Amazon Redshift Streaming Ingestion M A T E R I A L I Z E D V I E W Auto Refresh Data Source

Slide 14

Slide 14 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Routine Merge & Compaction t1 t2 Inserted Data (t1) Amazon S3 Inserted Data (t2) + + a b c d e f Merge & Compaction time Data Size Updated/ Deleted Data (t1) Updated/ Deleted Data (t2)

Slide 15

Slide 15 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. year=2022/month=01/day=01/hour=00/ p1.parquet p2.parauet year=2022/month=02/day=01/hour=00/ ... year=2022/month=12/day=01/hour=00/ ... year=2023/month=01/day=02/hour=00/ p1.parquet p2.parauet year=2023/month=01/day=02/hour=01/ p1.parquet p2.parauet S3 Glacier Deep Archive S3 Standard Pitfalls of Logical Views Update/ Delete View Merge-On-Read

Slide 16

Slide 16 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pitfalls of Logical Views • Complexity – Read SQL Query, Architecture • Operational Overhead • Cost = Merge & Compaction + Storage • Real-time analysis inefficiencies

Slide 17

Slide 17 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Real-time Materialized View org_tbl delta_tbl Auto Refresh Streaming Table Permanent Table Pitfalls of Materialized Views Amazon Redshift Data Volume Data Volume Data Volume t1 tN time t2 Data Size Unlimited Data Volume .....

Slide 18

Slide 18 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Real-time Materialized View org_tbl delta_tbl Auto Refresh Table data files commit log Merge-On-Read Streaming Table Permanent Table Amazon S3 How to build Materialized Views in Amazon S3? Amazon Redshift

Slide 19

Slide 19 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Table data files commit log Merge-On-Read Amazon S3 “Table Format” = Layout of Files in Table commit_log date=2023-01-01

Slide 20

Slide 20 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bring Database-like features to Data Lake RDBMS Index Field1 (v1, t1) Files binlog Read Field1 (v2, t2) my_table/ date=2023-01-01/ file-1.parquet ...... file-2.parquet ...... commit_log/ 00000.json 00001.json ...... Amazon S3 Write t1 t2 time Table data files Merge-On-Read commit log Insert file-1.parquet Insert file-2.parquet Delete file-1.parquet

Slide 21

Slide 21 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. “Table Format” = Layout of Files in Table O P E N T A B L E F O R M A T S

Slide 22

Slide 22 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Hudi © hudi.apache.org

Slide 23

Slide 23 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Hudi © hudi.apache.org

Slide 24

Slide 24 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Iceberg s0 Data Snapshots t0 t1 Partition File Location Schema Format Stats Write & Commit time Snapshots: State of table at some time s1

Slide 25

Slide 25 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Iceberg M E T A D A T A F I L E S T O T R A C K D A T A schema, partitions, snapshots list of files and mappings to snapshots tracks data files and statistics © iceberg.apache.org

Slide 26

Slide 26 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Iceberg M E T A D A T A F I L E S T O T R A C K D A T A my_table/ ├── metadata/ │ ├── 00000.metadata.json │ ├── 00001.metadata.json │ ├── 00002.metadata.json │ ....... │ ├── a39f-e190-b871-ac8e5b-m0.avro │ ├── a39f-e190-b871-ac8e5b-m1.avro │ ├── a39f-e190-b871-ac8e5b-m2.avro │ ....... │ ├── snap-1954-1-2e934.avro │ ├── snap-4381-1-255b.avro │ ├── snap-4866-1-8bf57.avro └── data/ ├── date=2023-01-01 │ └── file-1.parquet └── date=2023-01-02 └── file-2.parquet © iceberg.apache.org

Slide 27

Slide 27 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Delta Lake my_table/ ├── _delta_log │ ├── 00000.json │ ├── 00001.json │ ├── 00002.json │ ....... │ ├── 00010.json │ └── 00010.checkpoint.parquet ├── date=2023-01-01 │ └── file-1.parquet └── date=2023-01-02 └── file-2.parquet Transaction Log Single commits Checkpoint Files (Optional) Partition Directories Data Files Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet

Slide 28

Slide 28 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Open Table Formats – Iceberg, Hudi, Delta Lake Apache Iceberg Hudi Delta Lake ACID Yes Yes Yes Partition Evolution Yes No No Schema Evolution Yes Partial Limited Time Travel Yes Yes Yes Merge Yes Yes Yes Compaction API based Manual Automated Data Format Parquet, Avro, ORC, CSV Parquet, ORC Parquet Current Pointer Metastore, File system with version File Timeline commit Transaction log Conflict Resolution Optimistic Optimistic Optimistic Programming Language Java & Python Scala, Java & Python Java & Python

Slide 29

Slide 29 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Modernizing Data Lakes

Slide 30

Slide 30 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Typical Data Pipeline & Data Lake AWS DMS Amazon Kinesis Data Streams Amazon Athena Amazon S3 Amazon RDS Payments • Sign-Up: Insert • Change Personal Information: Update • Unsubscribe: Delete • Payment History: Append Only Amazon Kinesis Data Firehose Data Source Data Pipeline Data Lake User Profile

Slide 31

Slide 31 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CDC-based UPSERT in Data Lake AWS DMS Amazon Kinesis Data Streams Amazon Athena Amazon S3 Amazon RDS Amazon Kinesis Data Firehose S3 User Profile iceberg Payments parquet, orc, avro iceberg, hudi, delta lake Athena Hudi Iceberg Delta Lake Insert X O X Delete X O X Select O O O

Slide 32

Slide 32 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CDC-based UPSERT in Data Lake AWS DMS Amazon Kinesis Data Streams Amazon Athena Amazon S3 Amazon RDS S3 User Profile iceberg Payments parquet, orc, avro iceberg, hudi, delta lake Athena Hudi Iceberg Delta Lake Insert X O X Delete X O X Select O O O AWS Glue Flink / Spark Amazon EMR Open Source Serverless Fully Managed

Slide 33

Slide 33 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CDC-based UPSERT in Data Lake AWS DMS Amazon Kinesis Data Streams Amazon Athena Amazon S3 Amazon RDS AWS Glue Streaming Operation Changed Data I, pk1, c1, c2, t1 U, pk1, c1, c2, t2 D, pk0, c1, c2, t3 CDC { JSON }

Slide 34

Slide 34 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Transactional Data Lake AWS DMS Amazon Athena Amazon S3 Amazon RDS AWS DMS Amazon Kinesis Data Streams Amazon Athena Amazon S3 Amazon RDS Amazon Kinesis Data Firehose {JSON} {JSON} Amazon Kinesis Data Streams AWS Glue Streaming

Slide 35

Slide 35 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo

Slide 36

Slide 36 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reference Architecture https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue

Slide 37

Slide 37 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 38

Slide 38 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Set up Spark & Glue Context Create a connection to Kinesis Data Streams Insert/Update/Delete CDC in Apache Iceberg Table 1 2 3 Glue Streaming Job – PySpark Script

Slide 39

Slide 39 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Glue Streaming Job – PySpark Script

Slide 40

Slide 40 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. 데모 시간을 5분 이내로 줄이기 Glue Streaming 코드 설명하기 Deduplicate CDCs Upsert into Iceberg table Delete from Iceberg table 1 2 3

Slide 41

Slide 41 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Summary

Slide 42

Slide 42 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. “Table Format” = Layout of Files in Table O P E N T A B L E F O R M A T S Amazon S3 Update/Delete In-Place table1 table2 table3 RDBMS Transactional Bring Database-like features to Data Lake

Slide 43

Slide 43 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Transactional Data Lake: Batch AWS DMS Amazon Kinesis Data Streams AWS Glue ETL Amazon Athena Amazon S3 Amazon RDS (Apache Iceberg, Hudi, Delta Lake) Amazon S3 Amazon Kinesis Data Firehose Raw Zone Curated Zone

Slide 44

Slide 44 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Transactional Data Lake: Batch + Real-time L A M B D A A R C H I T E C T U R E AWS DMS Amazon Kinesis Data Streams AWS Glue ETL Amazon Athena Amazon S3 Amazon RDS Amazon Redshift / Redshift Serverless Real-Time Materialized View Streaming Table Permanent Tables (Apache Iceberg, Hudi, Delta Lake) Amazon S3 Amazon Kinesis Data Firehose Raw Zone Curated Zone Batch Layer Speed Layer

Slide 45

Slide 45 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Transactional Data Lake in Real-time AWS DMS Amazon Kinesis Data Streams AWS Glue Streaming Amazon Athena Amazon S3 Amazon RDS (Apache Iceberg, Hudi, Delta Lake) Amazon Redshift / Redshift Serverless Real-Time Materialized View Streaming Table Permanent Tables

Slide 46

Slide 46 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. On-Premise Transactional Data Lake Generic database Corporate data center Long Time-to-build High Cost in TCO Deep Expertise Required Security HDFS Kafka Connect Connect Hive / Presto Flink / Spark Streaming

Slide 47

Slide 47 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Generic database AWS DMS Amazon Kinesis Data Streams AWS Glue Streaming Amazon Athena Amazon S3 Corporate data center AWS Cloud Streaming Migrations for Analytics on Generic database Corporate data center HDFS Hive / Presto Kafka Connect Connect (Apache Iceberg, Hudi, Delta Lake) (Apache Iceberg, Hudi, Delta Lake) Flink / Spark Streaming

Slide 48

Slide 48 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Design Process for Transactional Data Lake

Slide 49

Slide 49 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Resources • Transactional Data Lake using Apache Iceberg with AWS Glue Streaming and DMS § https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue • Building Serverless Business Intelligent System from Scratch § https://serverless-bi-system-from-scratch.workshop.aws/ • Data Pipeline using AWS DMS and Kinesis § https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en- US • Amazon Redshift Streaming Ingestion Patterns § https://github.com/aws-samples/redshift-streaming-ingestion-patterns

Slide 50

Slide 50 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you