Real-Time CDC Data Processing! Building a Modern Transactional Data Lake

Agenda

1. Drawbacks of a data lake built on an append-only distributed file system
2. Building a data lake that supports CDC-based UPSERT
(1) Using view tables
(2) Using open table formats: Apache Iceberg, Hudi, Delta Lake
3. Modern Transactional Data Lake Architecture

Sungmin Kim

May 08, 2023

Transcript

  1. Real-Time CDC Data Processing! Building a Modern Transactional Data Lake

    Sungmin Kim, Solutions Architect, AWS
    © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda

    • Drawbacks of a data lake built on an append-only distributed file system
    • Building a data lake that supports CDC-based UPSERT
      § Using view tables
      § Using open table formats: Apache Iceberg, Hudi, Delta Lake
    • Modern Transactional Data Lake Architecture
  3. Diagram: CRM, IoT, web, messages, CDC*, and event streams feed a data analytics system (RDBMS), which produces data insights.

    * CDC: Change Data Capture
  4. Scalability limits of an RDBMS

    • RDBMS (primary-replica cluster): the primary scales up; replicas scale out.
    • Distributed file system: query engines (1), (2), and (3) scale out on top of a storage interface, and the storage scales out as well.
  5. Building a data lake

    Diagram: CRM, IoT, web, messages, CDC, and event streams → stream storage → stream delivery → data lake (DFS*) → data warehouse, data mart, AI/ML → data analytics.

    * DFS: Distributed File System
  6. Building a data lake

    Diagram: CRM, IoT, web, messages, CDC, and event streams → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 (data lake) → Amazon Athena → Amazon QuickSight.
  7. RDBMS vs. S3 (≈ Distributed Object Storage)

    • RDBMS: mutable records; files per table (table1, table2, table3) on a file system; update/delete in place; insert/update/delete; transactional (O).
    • Amazon S3: immutable objects; distributed; cannot update/delete in place; insert (append)-only; accessed through an interface (HTTPS, SDK APIs); transactional (X).
  8. How do we handle updates and deletes in CDC data?

    Pipeline: RDBMS → AWS DMS (CDC) → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 (data lake) → Amazon Athena

    Objects in the data lake:
    datalake/
      year=2023/month=05/day=03/hour=01/
        obj1.parquet
        obj2.parquet
        …
      year=2023/month=05/day=03/hour=02/
        updated-obj1.parquet
        …

    Changed data, as (operation, changed data) records:
    I, pk1, c1, c2, t1
    U, pk1, c1, c2, t2
    D, pk0, c1, c2, t3
  9. View-table-based UPSERT handling: Merge-On-Read

    Diagram: change records from the RDBMS are stored as inserted data and as updated/deleted data, and a view table merges the two when it is read.

    Change records shown, as (operation, changed data):
    I, pk0, c1, c2, t0
    I, pk1, c1, c2, t1
    U, pk1, c1, c2, t2
    D, pk0, c1, c2, t3
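The merge that such a view performs at read time can be sketched in plain Python. This is a minimal illustration, not the deck's implementation: the `Change` type, the `merge_on_read` function, and the sample column values are hypothetical, and timestamps are simplified to integers.

```python
from typing import NamedTuple

class Change(NamedTuple):
    op: str   # "I" (insert), "U" (update) or "D" (delete)
    pk: str   # primary key
    c1: str
    c2: str
    ts: int   # change timestamp; larger means newer

def merge_on_read(inserted, changed):
    """Return the current rows by replaying all change records in timestamp order."""
    current = {}
    for rec in sorted([*inserted, *changed], key=lambda r: r.ts):
        if rec.op == "D":
            current.pop(rec.pk, None)   # a delete hides any earlier version
        else:
            current[rec.pk] = rec       # insert or update: newest version wins
    return sorted(current.values(), key=lambda r: r.pk)

# Hypothetical data mirroring the I/U/D records on the slide.
inserted = [Change("I", "pk0", "a", "b", 0), Change("I", "pk1", "a", "b", 1)]
changed = [Change("U", "pk1", "a2", "b2", 2), Change("D", "pk0", "a", "b", 3)]
view = merge_on_read(inserted, changed)
print(view)
```

Only the updated version of `pk1` survives; `pk0` is dropped because its newest record is a delete. Every read pays for this merge, which is the cost the later slides try to remove.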
  10. View-table-based UPSERT handling: Merge-On-Read

    Diagram: RDBMS → CDC → Amazon S3 (inserted data, updated/deleted data) → view table. The view table is either a logical view or a materialized view; the query engines shown are Amazon Athena and Amazon Redshift.
  11. Logical View vs. Materialized View

    • Logical view: CREATE VIEW view_tbl AS SELECT * FROM org_tbl, delta_tbl. A query such as SELECT * FROM view_tbl is expanded to SELECT * FROM ( SELECT * FROM org_tbl, delta_tbl ), so the merge runs on every read.
    • Materialized view: SELECT * FROM view_tbl reads a stored result (view_tbl = org_tbl + delta_tbl).

    Diagram labels: org_tbl, delta_tbl, view_tbl, Amazon S3.
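The difference between the two kinds of view can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite has no materialized views, so a plain table (`mview_tbl`, a hypothetical name) stands in for one and is refreshed by hand, and the view combines the tables with `UNION ALL` rather than the slide's shorthand `FROM org_tbl, delta_tbl`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE org_tbl   (pk TEXT, val TEXT);
    CREATE TABLE delta_tbl (pk TEXT, val TEXT);
    INSERT INTO org_tbl VALUES ('pk0', 'v0');

    -- Logical view: only the query text is stored.
    CREATE VIEW view_tbl AS
        SELECT * FROM org_tbl UNION ALL SELECT * FROM delta_tbl;

    -- Stand-in for a materialized view: the query result itself is stored.
    CREATE TABLE mview_tbl AS SELECT * FROM view_tbl;
""")

# A new change arrives after both views were created.
con.execute("INSERT INTO delta_tbl VALUES ('pk1', 'v1')")

def count(table):
    return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

logical_rows = count("view_tbl")   # re-runs the underlying query on every read
stale_rows = count("mview_tbl")    # still the result stored earlier

# "Refresh": recompute the result and store it again.
con.executescript(
    "DELETE FROM mview_tbl; INSERT INTO mview_tbl SELECT * FROM view_tbl;"
)
refreshed_rows = count("mview_tbl")
print(logical_rows, stale_rows, refreshed_rows)
```

The logical view sees the new row immediately but pays the query cost on each read; the stored copy is cheap to read but stays stale until it is refreshed.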
  12. Amazon Redshift Materialized Views
  13. Amazon Redshift Streaming Ingestion

    Diagram: data sources (Amazon Kinesis Data Streams, Amazon MSK) → Amazon Redshift / Redshift Serverless → Amazon QuickSight. Inside Redshift, a streaming table is exposed as a real-time materialized view with auto refresh, alongside permanent tables.
  14. Running the change-data merge job periodically

    Chart (data size over time): in Amazon S3, inserted data and updated/deleted data accumulate at t1 and t2, and a merge & compaction job combines them. Points on the chart are labeled a through f.
  15. A situation that is hard to solve with logical view tables

    year=2022/month=01/day=01/hour=00/
      p1.parquet
      p2.parquet
    year=2022/month=02/day=01/hour=00/
      ...
    year=2022/month=12/day=01/hour=00/
      ...
    year=2023/month=01/day=02/hour=00/
      p1.parquet
      p2.parquet
    year=2023/month=01/day=02/hour=01/
      p1.parquet
      p2.parquet

    Storage classes shown: S3 Glacier Deep Archive and S3 Standard.
    Other labels: update/delete, view, merge-on-read.
  16. Drawbacks of logical view tables

    • Increased complexity: read queries, architecture
    • Increased operational burden
    • Cost = merge & compaction cost + data storage cost
    • Difficulty querying and processing data in real time
  17. Limits of materialized views

    Diagram: in Amazon Redshift, a streaming table (a real-time materialized view with auto refresh over org_tbl and delta_tbl) sits next to a permanent table. The chart plots data size over time (t1, t2, ..., tN): the data volume keeps growing toward an unlimited data volume.
  18. Could a materialized view be implemented on S3?

    Diagram: the Amazon Redshift setup (streaming table, permanent table, real-time materialized view with auto refresh over org_tbl and delta_tbl) next to Amazon S3, where a table consists of data files plus a commit log and is read with merge-on-read.
  19. “Table Format” = Layout of Files in Table

    Diagram: a table in Amazon S3 consists of data files (for example a date=2023-01-01 partition) and a commit log (commit_log), read with merge-on-read.
  20. Using Amazon S3 like an RDBMS

    • RDBMS: index, files, and binlog. A write changes Field1 from (v1, t1) to (v2, t2) over time (t1 → t2); a read returns the current value.
    • Amazon S3: table data files plus a commit log, read with merge-on-read.

    my_table/
      date=2023-01-01/
        file-1.parquet ......
        file-2.parquet ......
      commit_log/
        00000.json
        00001.json
        ......

    Commit log entries:
    Insert file-1.parquet
    Insert file-2.parquet
    Delete file-1.parquet
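The role of the commit log can be sketched as a replay: a reader walks the log entries in order to find out which immutable data files currently make up the table. The `live_files` function and the tuple encoding of log entries are hypothetical simplifications; real table formats store much richer metadata per commit.

```python
def live_files(commit_log):
    """Replay (action, file) entries in order and return the files a reader should scan."""
    files = set()
    for action, name in commit_log:
        if action == "insert":
            files.add(name)
        elif action == "delete":
            files.discard(name)
        else:
            raise ValueError(f"unknown action: {action}")
    return files

# The three log entries shown on the slide.
commit_log = [
    ("insert", "file-1.parquet"),
    ("insert", "file-2.parquet"),
    ("delete", "file-1.parquet"),
]

print(live_files(commit_log))      # state after the whole log
print(live_files(commit_log[:2]))  # state as of the second entry
```

No object is ever modified in place: a "delete" only adds a log entry, so replaying a prefix of the log yields an earlier state of the table.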
  21. “Table Format” = Layout of Files in Table

    Open Table Formats
  22. Apache Hudi (figure © hudi.apache.org)
  23. Apache Hudi (figure © hudi.apache.org)
  24. Apache Iceberg

    Snapshots: state of the table at some time.

    Diagram: each write & commit to the data produces a new snapshot along the time axis (s0 at t0, s1 at t1). Metadata labels: partition, file location, schema, format, stats.
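Because each snapshot records the state of the table at its commit time, reading the table "as of" a point in time reduces to picking the latest snapshot committed at or before that time. The sketch below illustrates only this selection step; the `Snapshot` type, the `snapshot_as_of` function, and the integer timestamps are hypothetical and do not reflect Iceberg's actual metadata structures.

```python
from bisect import bisect_right
from typing import NamedTuple

class Snapshot(NamedTuple):
    snapshot_id: str
    committed_at: int       # commit time; larger means later
    data_files: frozenset   # files that make up the table in this snapshot

def snapshot_as_of(snapshots, ts):
    """Return the latest snapshot committed at or before ts, or None if there is none.

    Assumes snapshots are sorted by commit time.
    """
    times = [s.committed_at for s in snapshots]
    idx = bisect_right(times, ts)
    return snapshots[idx - 1] if idx else None

# Hypothetical snapshots s0 (t0 = 0) and s1 (t1 = 10).
snapshots = [
    Snapshot("s0", 0, frozenset({"file-1.parquet"})),
    Snapshot("s1", 10, frozenset({"file-1.parquet", "file-2.parquet"})),
]

chosen = snapshot_as_of(snapshots, 5)
print(chosen.snapshot_id, sorted(chosen.data_files))
```

A query at time 5 sees only the files of `s0`, even though `s1` has since added `file-2.parquet`.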
  25. Apache Iceberg: metadata files to track data

    • schema, partitions, snapshots
    • list of files and mappings to snapshots
    • tracks data files and statistics

    Figure © iceberg.apache.org
  26. Apache Iceberg: metadata files to track data

    my_table/
    ├── metadata/
    │   ├── 00000.metadata.json
    │   ├── 00001.metadata.json
    │   ├── 00002.metadata.json
    │   .......
    │   ├── a39f-e190-b871-ac8e5b-m0.avro
    │   ├── a39f-e190-b871-ac8e5b-m1.avro
    │   ├── a39f-e190-b871-ac8e5b-m2.avro
    │   .......
    │   ├── snap-1954-1-2e934.avro
    │   ├── snap-4381-1-255b.avro
    │   ├── snap-4866-1-8bf57.avro
    └── data/
        ├── date=2023-01-01
        │   └── file-1.parquet
        └── date=2023-01-02
            └── file-2.parquet

    Figure © iceberg.apache.org
  27. Delta Lake

    my_table/
    ├── _delta_log                        (transaction log)
    │   ├── 00000.json                    (single commits)
    │   ├── 00001.json
    │   ├── 00002.json
    │   .......
    │   ├── 00010.json
    │   └── 00010.checkpoint.parquet      (checkpoint files, optional)
    ├── date=2023-01-01                   (partition directories)
    │   └── file-1.parquet                (data files)
    └── date=2023-01-02
        └── file-2.parquet

    Actions recorded in the log: Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet
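Why checkpoint files are useful can be shown with a simplified replay: a checkpoint stores the table state after some commit, so a reader can start from it and apply only the later commits. This is an illustration, not Delta Lake's actual log format; the `replay` function, the tuple encoding, and the grouping of the slide's five actions into four commits are assumptions.

```python
def replay(commits, state=frozenset()):
    """Apply ("add" | "remove", file) actions, commit by commit, on top of a starting state."""
    files = set(state)
    for commit in commits:
        for action, name in commit:
            if action == "add":
                files.add(name)
            elif action == "remove":
                files.discard(name)
            else:
                raise ValueError(f"unknown action: {action}")
    return files

# One list of actions per commit file; the grouping is hypothetical.
commits = [
    [("add", "1.parquet")],
    [("add", "2.parquet")],
    [("remove", "1.parquet")],
    [("remove", "2.parquet"), ("add", "3.parquet")],
]

full_replay = replay(commits)                        # read every commit
checkpoint = replay(commits[:3])                     # state stored after the third commit
from_checkpoint = replay(commits[3:], checkpoint)    # read the checkpoint plus later commits

print(full_replay, checkpoint, from_checkpoint)
```

Both paths arrive at the same set of live files, but the second one reads far fewer log entries once the log grows long.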
  28. Open Table Formats: Iceberg, Hudi, Delta Lake

    | Feature              | Apache Iceberg                            | Hudi                 | Delta Lake      |
    |----------------------|-------------------------------------------|----------------------|-----------------|
    | ACID                 | Yes                                       | Yes                  | Yes             |
    | Partition Evolution  | Yes                                       | No                   | No              |
    | Schema Evolution     | Yes                                       | Partial              | Limited         |
    | Time Travel          | Yes                                       | Yes                  | Yes             |
    | Merge                | Yes                                       | Yes                  | Yes             |
    | Compaction           | API based                                 | Manual               | Automated       |
    | Data Format          | Parquet, Avro, ORC, CSV                   | Parquet, ORC         | Parquet         |
    | Current Pointer      | Metastore, file system with version file  | Timeline commit      | Transaction log |
    | Conflict Resolution  | Optimistic                                | Optimistic           | Optimistic      |
    | Programming Language | Java & Python                             | Scala, Java & Python | Java & Python   |
  29. Modern Transactional Data Lake
  30. Typical Data Pipeline & Data Lake

    Data source → data pipeline → data lake: Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 → Amazon Athena

    Tables in the data source:
    • User Profile: sign-up = insert, change = update, account deletion = delete
    • Payments: history keeping = append only
  31. A data lake that supports CDC-based UPSERT

    Pipeline: Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 → Amazon Athena

    Storage in S3:
    • User Profile: iceberg (open table formats: iceberg, hudi, delta lake)
    • Payments: parquet, orc, avro

    Athena support by table format (O = supported, X = not supported):

    | Operation | Hudi | Iceberg | Delta Lake |
    |-----------|------|---------|------------|
    | Insert    | X    | O       | X          |
    | Delete    | X    | O       | X          |
    | Select    | O    | O       | O          |
  32. A data lake that supports CDC-based UPSERT

    Pipeline: Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → stream processing → Amazon S3 → Amazon Athena

    Stream processing options shown: AWS Glue, Amazon EMR, Flink / Spark (labels: serverless, fully managed, open source).

    Storage in S3 and the Athena support table (Insert X/O/X, Delete X/O/X, Select O/O/O for Hudi/Iceberg/Delta Lake) are the same as on slide 31.
  33. A data lake that supports CDC-based UPSERT

    Pipeline: Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → AWS Glue Streaming → Amazon S3 → Amazon Athena

    CDC records travel through the pipeline as JSON. Changed data, as (operation, changed data) records:
    I, pk1, c1, c2, t1
    U, pk1, c1, c2, t2
    D, pk0, c1, c2, t3
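Turning a JSON CDC message into the (operation, key, columns, timestamp) form shown on the slide could look like the sketch below. The message shape used here, a `data` object with the row and a `metadata` object with `record-type`, `operation`, and `timestamp`, is an assumption modeled on how AWS DMS describes records written to a Kinesis target; verify it against the messages your own task emits. The `to_change` function and the sample values are hypothetical.

```python
import json

# Map the assumed operation names to the slide's one-letter codes.
OP_CODES = {"insert": "I", "update": "U", "delete": "D"}

def to_change(message, pk_column):
    """Convert one JSON CDC message into (op, pk, columns, timestamp), or None to skip it."""
    doc = json.loads(message)
    meta = doc.get("metadata", {})
    op = OP_CODES.get(meta.get("operation"))
    if meta.get("record-type") != "data" or op is None:
        return None   # control records or operations this sketch does not handle
    row = doc["data"]
    return op, row[pk_column], row, meta["timestamp"]

# Hypothetical message for the "U, pk1, c1, c2, t2" record.
sample = json.dumps({
    "data": {"pk": "pk1", "c1": "a", "c2": "b"},
    "metadata": {"record-type": "data", "operation": "update", "timestamp": "t2"},
})
change = to_change(sample, pk_column="pk")
print(change)
```

Messages that are not row changes are skipped rather than raising, so a stream that mixes control and data records can be processed in one pass.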
  34. Transactional Data Lake

    Two pipelines side by side, both carrying {JSON} records:
    • Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → AWS Glue Streaming → Amazon S3 → Amazon Athena
    • Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 → Amazon Athena
  35. Demo
  36. Reference Architecture

    https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue
  37. Analyzing the Glue Streaming job code

    1. Set up the Spark + Glue context
    2. Read data from Kinesis Data Streams
    3. Insert/update/delete into the Apache Iceberg table
  38. Analyzing the Glue Streaming job code
  39. Explaining the Glue Streaming code (keeping the demo under 5 minutes)

    1. Deduplication
    2. Upsert handling
    3. Delete handling
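The three steps can be sketched for a single micro-batch in plain Python. This is not the code from the referenced repository, which runs on Spark and Apache Iceberg; it only illustrates the logic, and the `plan_micro_batch` function and the record layout are hypothetical.

```python
def plan_micro_batch(records):
    """Deduplicate a micro-batch and split it into rows to upsert and keys to delete.

    Each record is a dict with "op" ("I", "U" or "D"), "pk", "ts", and payload columns.
    """
    latest = {}
    for rec in records:
        seen = latest.get(rec["pk"])
        if seen is None or rec["ts"] >= seen["ts"]:
            latest[rec["pk"]] = rec   # 1. dedup: keep the newest change per key
    upserts = [r for r in latest.values() if r["op"] in ("I", "U")]   # 2. upsert
    deletes = [r["pk"] for r in latest.values() if r["op"] == "D"]    # 3. delete
    return upserts, deletes

# Hypothetical batch mirroring the I/U/D records used throughout the deck.
batch = [
    {"op": "I", "pk": "pk1", "c1": "a", "c2": "b", "ts": 1},
    {"op": "U", "pk": "pk1", "c1": "a2", "c2": "b2", "ts": 2},
    {"op": "D", "pk": "pk0", "c1": "a", "c2": "b", "ts": 3},
]
upserts, deletes = plan_micro_batch(batch)
print(upserts, deletes)
```

Deduplicating first matters because a merge into the target table generally expects at most one source row per key; the surviving rows are then either merged in as upserts or used to delete matching keys.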
  40. Summary
  41. Using a transactional data lake like an RDBMS

    “Table Format” = Layout of Files in Table

    Open table formats bring RDBMS-style in-place update/delete on tables (table1, table2, table3) to Amazon S3.
  42. Transactional Data Lake: batch processing

    Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 (raw zone) → AWS Glue ETL → Amazon S3 (curated zone: Apache Iceberg, Hudi, Delta Lake) → Amazon Athena
  43. Transactional Data Lake: batch + real-time processing (Lambda architecture)

    • Batch layer: Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Kinesis Data Firehose → Amazon S3 (raw zone) → AWS Glue ETL → Amazon S3 (curated zone: Apache Iceberg, Hudi, Delta Lake) → Amazon Athena
    • Speed layer: Amazon Kinesis Data Streams → Amazon Redshift / Redshift Serverless (streaming table as a real-time materialized view, plus permanent tables)
  44. Transactional Data Lake: real-time processing

    • Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → AWS Glue Streaming → Amazon S3 (Apache Iceberg, Hudi, Delta Lake) → Amazon Athena
    • Amazon Kinesis Data Streams → Amazon Redshift / Redshift Serverless (streaming table as a real-time materialized view, plus permanent tables)
  45. Building a transactional data lake on premises

    Components in the corporate data center: generic database, Kafka with Connect, Flink / Spark Streaming, HDFS, Hive / Presto

    Challenges: long time-to-build, high cost in TCO, deep expertise required, security
  46. Streaming Migrations for Analytics on

    • Corporate data center: generic database, Kafka with Connect, Flink / Spark S, HDFS (Apache Iceberg, Hudi, Delta Lake), Hive / Presto
    • AWS Cloud: generic database (in the corporate data center) → AWS DMS → Amazon Kinesis Data Streams → AWS Glue Streaming → Amazon S3 (Apache Iceberg, Hudi, Delta Lake) → Amazon Athena
  47. How the data lake architecture evolved
  48. Resources

    • Transactional Data Lake using Apache Iceberg with AWS Glue Streaming and DMS
      § https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue
    • Building Serverless Business Intelligent System from Scratch
      § https://serverless-bi-system-from-scratch.workshop.aws/
    • Data Pipeline using AWS DMS and Kinesis
      § https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en-US
    • Amazon Redshift Streaming Ingestion Patterns
      § https://github.com/aws-samples/redshift-streaming-ingestion-patterns