Slide 1

Slide 1 text

Running Apache Iceberg, Apache Hudi, and Delta Lake on AWS
Akira Ajisaka
Senior Software Development Engineer, AWS Japan
© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 2

Slide 2 text

Akira Ajisaka
• Working for AWS Glue Spark team since 2022
• Responsible for architecting, building, and improving our features
• Responsible for Apache Hudi, Apache Iceberg, Delta Lake native support
• Apache Hadoop committer and PMC member, ASF member

Slide 3

Slide 3 text

Agenda
• Use Hudi, Delta, Iceberg on AWS Glue Spark
• Open Table Format support in AWS Glue Crawler, AWS Glue Data Catalog, and AWS analytics services
• OSS Contribution

Slide 4

Slide 4 text

AWS Glue
Discover, prepare, and integrate all your data at any scale
• Tailored tools to support all data users
• Support all workloads in one place
• All-in-one data integration service
• Cost effective, serverless, and scalable

Slide 5

Slide 5 text

Use Hudi, Delta, Iceberg on AWS Glue Spark jobs
• Native integration
§ Now Hudi, Delta, and Iceberg are supported in Glue jobs natively
• Marketplace connectors
§ The centralized repository for cataloging the available Glue connectors provided by multiple vendors
§ More flexibility in versions
• Add libraries to the job
§ Custom connectors (BYOL): Register your own libraries located in Amazon S3 as Glue connectors
§ Extra library dependencies: Configure in Dependent JARs path
§ More flexibility in versions

Slide 6

Slide 6 text

Hudi, Delta, and Iceberg become easy to use
You can use them in just one option!
• No need to choose version
• No need to find appropriate artifacts
Example:
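
A minimal sketch of how that single option might be set, assuming it refers to the Glue job parameter `--datalake-formats` (the slide's own example is not reproduced in this text). The helper name `datalake_job_arguments` is hypothetical; the dictionary it returns is the kind of value that would be supplied as a job's default arguments.

```python
from typing import Dict

# Values assumed to be accepted by --datalake-formats.
SUPPORTED_FORMATS = ("hudi", "delta", "iceberg")


def datalake_job_arguments(*formats: str) -> Dict[str, str]:
    """Hypothetical helper: build job arguments enabling native table formats."""
    if not formats:
        raise ValueError("at least one format is required")
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported format(s): {unknown}")
    # Assumption: several formats are passed as one comma-separated value.
    return {"--datalake-formats": ",".join(formats)}


# Enable a single format for a job.
print(datalake_job_arguments("iceberg"))
```

No library version or artifact path appears in the arguments, which matches the two bullets above.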

Slide 7

Slide 7 text

Glue is optimized for AWS Services
• High performance
• Secure
• Easy to use
• Glue Catalog integration
• Low maintenance
Diagram components: AWS Glue, AWS Glue Data Catalog, AWS Glue Crawler
Diagram services: Amazon Athena, Amazon Aurora, Amazon DocumentDB, Amazon DynamoDB, AWS Lake Formation, Amazon Redshift, Amazon RDS, Amazon S3, Amazon AppFlow

Slide 8

Slide 8 text

Multiple ways to register table definitions in Glue Data Catalog
• Glue Crawler
• Spark job (EMR, Glue)
• Athena DDL (Iceberg only)
You can choose whichever you prefer
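
As an illustration of the Athena DDL route, the sketch below assembles a `CREATE TABLE` statement that sets `'table_type' = 'ICEBERG'` in `TBLPROPERTIES`, which is how Athena marks a table as Iceberg. The database, table, column, and bucket names are hypothetical, and `iceberg_ddl` is an illustrative helper, not part of any AWS SDK.

```python
from typing import Dict


def iceberg_ddl(database: str, table: str, columns: Dict[str, str], location: str) -> str:
    """Hypothetical helper: build an Athena CREATE TABLE statement for Iceberg."""
    # One "name type" pair per line, in insertion order.
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE {database}.{table} (\n  {cols}\n)\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


# All identifiers below are placeholders for illustration only.
ddl = iceberg_ddl(
    "sales_db",
    "orders",
    {"order_id": "bigint", "status": "string"},
    "s3://example-bucket/orders/",
)
print(ddl)
```

Running the statement in Athena would register the table definition in the Glue Data Catalog without a crawler or a Spark job.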

Slide 9

Slide 9 text

Use Glue Crawler to register table definitions
All three formats are natively supported.
• Delta Lake
§ Need to choose Native table or Symlink table
• Hudi, Iceberg
§ You can define the maximum depth that the Crawler traverses to discover the table metadata
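
The crawler choices above can be sketched as a `Targets` structure for the Glue `CreateCrawler` API. The field names (`DeltaTargets`, `CreateNativeDeltaTable`, `WriteManifest`, `IcebergTargets`, `HudiTargets`, `MaximumTraversalDepth`) reflect my understanding of that API and should be checked against the current reference. The helper names and S3 paths are hypothetical, and the code only builds a dictionary without calling AWS.

```python
from typing import Any, Dict, List


def delta_target(paths: List[str], native: bool, write_manifest: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: Delta target as a native table (True) or symlink table (False)."""
    return {
        "DeltaTables": paths,
        "CreateNativeDeltaTable": native,
        "WriteManifest": write_manifest,
    }


def depth_limited_target(paths: List[str], max_depth: int) -> Dict[str, Any]:
    """Hypothetical helper: Hudi or Iceberg target with a traversal depth limit."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    return {"Paths": paths, "MaximumTraversalDepth": max_depth}


# Placeholder S3 locations; a real crawler would point at actual table roots.
targets = {
    "DeltaTargets": [delta_target(["s3://example-bucket/delta/orders/"], native=True)],
    "IcebergTargets": [depth_limited_target(["s3://example-bucket/iceberg/"], 3)],
    "HudiTargets": [depth_limited_target(["s3://example-bucket/hudi/"], 3)],
}
print(targets)
```

A dictionary like this would be passed as the `Targets` argument when creating the crawler, for example through an AWS SDK client.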

Slide 10

Slide 10 text

Native Delta table and manifest table
• Native Delta table
§ Natively created Delta table.
§ Extends Parquet data files with a file-based transaction log and metadata.
• Manifest table
§ Consistent snapshot exported from native Delta table using ACID transactions.
§ Represented as symlink table based on Parquet files.
§ Generated with: GENERATE symlink_format_manifest FOR TABLE delta.``
Diagram labels: AWS Glue for Apache Spark, Amazon EMR, GENERATE, Glue Crawler (Crawl), Amazon Athena, Amazon Redshift
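
A minimal sketch of the manifest step, assuming the statement is run from a Spark job with Delta Lake enabled. The slide leaves the table path empty, so the path used here is a hypothetical placeholder, and `generate_manifest_sql` is an illustrative helper that only builds the SQL text.

```python
def generate_manifest_sql(table_path: str) -> str:
    """Hypothetical helper: build the Delta Lake manifest generation statement."""
    if not table_path:
        raise ValueError("table_path must not be empty")
    if "`" in table_path:
        # A backtick would terminate the quoted path early.
        raise ValueError("table_path must not contain backticks")
    return f"GENERATE symlink_format_manifest FOR TABLE delta.`{table_path}`"


# Placeholder path for illustration; not taken from the slide.
statement = generate_manifest_sql("s3://example-bucket/delta/orders/")
print(statement)
```

In a Spark session the resulting string would typically be passed to `spark.sql(...)`, after which the table can be crawled as a symlink table.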

Slide 11

Slide 11 text

Example: Ad-hoc query from AWS Analytics
1. Write Delta table files to S3
2. Crawl Delta files and populate table definition
3. Ad-hoc query from AWS Analytics Services
Diagram labels: Users, Amazon S3 (Delta table files), AWS Glue Crawler, AWS Glue Data Catalog (Delta table definition), Amazon Athena, Amazon Redshift
Video: https://youtu.be/o6Wd84-lxCI?si=M7Xh-1Hg8rOw3_e5&t=884

Slide 12

Slide 12 text

Contribution to Apache Hudi
• HUDI-6805: Print detailed error message in clustering
• HUDI-5866: Fix unnecessary log messages during bulk insert in Spark
• And many more

Slide 13

Slide 13 text

Contribution to Apache Iceberg
• AWS: Print logs whether Glue optimistic locking is used or not (#6358)
• AWS, Docs: Add AWS Glue in Run Iceberg on AWS section (#6623)
• AWS: Upgrade AWS Java SDK version to 2.20.131 (#8379)
• AWS: Support S3 DSSE-KMS encryption (#8370)
• AWS: Fix the missing StorageDescriptor parameter after renameTo (#3468)
• AWS: Add supporting only s3 bucket name for S3URI (#6352)
• And many more

Slide 14

Slide 14 text

Contribution to Delta Lake
[Design Doc] Catalog implementation for AWS Glue Data Catalog
https://github.com/delta-io/delta/issues/1679
• For better integration with Glue Data Catalog

Slide 15

Slide 15 text

Thank you!
Akira Ajisaka
X: @ajis_ka