Running Apache Iceberg, Apache Hudi, and Delta Lake on AWS

Akira Ajisaka

October 18, 2023

Transcript

  1. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Running Apache Iceberg, Apache Hudi, and Delta Lake on AWS
    Akira Ajisaka, Senior Software Development Engineer, AWS Japan
  2. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Akira Ajisaka
    • Working for the AWS Glue Spark team since 2022
    • Responsible for architecting, building, and improving our features
    • Responsible for native support of Apache Hudi, Apache Iceberg, and Delta Lake
    • Apache Hadoop committer and PMC member, ASF member
  3. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Agenda
    • Use Hudi, Delta, Iceberg on AWS Glue Spark
    • Open table format support in AWS Glue Crawler, AWS Glue Data Catalog, and AWS analytics services
    • OSS contribution
  4. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    AWS Glue
    Discover, prepare, and integrate all your data at any scale
    • Tailored tools to support all data users
    • Support all workloads in one place
    • All-in-one data integration service
    • Cost effective, serverless, and scalable
  5. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Use Hudi, Delta, Iceberg on AWS Glue Spark jobs
    Options: Native integration; Marketplace connectors; Custom connectors (BYOL); Extra library dependencies
    • Now Hudi, Delta, and Iceberg are supported in Glue jobs natively
    • The centralized repository for cataloging the available Glue connectors provided by multiple vendors
    • More flexibility in versions
    • Add libraries to the job
    • Register your own libraries located in Amazon S3 as Glue connectors
    • Configure in Dependent JARs path
    • More flexibility in versions
  6. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Hudi, Delta, and Iceberg become easy to use
    You can enable them with just one job option!
    • No need to choose a version
    • No need to find appropriate artifacts
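The slide's own example did not survive in the transcript. As a minimal sketch, assuming the "one option" is the AWS Glue job parameter `--datalake-formats`, the job arguments could be built as follows. The helper name `datalake_job_arguments` is hypothetical and used only for illustration.

```python
from typing import Dict, Iterable

# Format names accepted by this sketch (the three formats covered in the talk).
SUPPORTED_FORMATS = ("hudi", "delta", "iceberg")


def datalake_job_arguments(formats: Iterable[str]) -> Dict[str, str]:
    """Build Glue job arguments that enable native table-format support.

    Hypothetical helper: it only assembles a dictionary and does not call AWS.
    """
    chosen = [name.lower() for name in formats]
    unknown = [name for name in chosen if name not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported format(s): {unknown}")
    # Assumption: the parameter takes a comma-separated list of format names.
    return {"--datalake-formats": ",".join(chosen)}


if __name__ == "__main__":
    print(datalake_job_arguments(["iceberg"]))
```

A dictionary like this could be supplied as the job's default arguments when the job is defined. Depending on the format, extra Spark configuration may still be needed, which this sketch does not cover.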
  7. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Glue is optimized for AWS Services
    • High performance
    • Secure
    • Easy to use
    • Glue Catalog integration
    • Low maintenance
    [Diagram labels: AWS Glue, AWS Glue Data Catalog, AWS Glue Crawler, Amazon Athena, Amazon Aurora, Amazon DocumentDB, Amazon DynamoDB, AWS Lake Formation, Amazon Redshift, Amazon RDS, Amazon S3, Amazon AppFlow]
  8. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Multiple ways to register table definitions to Glue Data Catalog
    • Glue Crawler
    • Spark job (EMR, Glue)
    • Athena DDL (Iceberg only)
    You can choose whichever you prefer
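To illustrate the Athena DDL route for Iceberg, here is a minimal sketch that assembles a `CREATE TABLE` statement with the `table_type` property set to `ICEBERG`. The helper name, database, table, columns, and S3 location are all placeholders, not taken from the talk.

```python
from typing import Dict


def iceberg_create_table_ddl(
    database: str, table: str, columns: Dict[str, str], location: str
) -> str:
    """Return an Athena DDL string that registers an Iceberg table.

    Hypothetical helper: it only builds the SQL text and does not run it.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE {database}.{table} ({column_list}) "
        f"LOCATION '{location}' "
        # This table property marks the table as Iceberg for Athena.
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


if __name__ == "__main__":
    print(
        iceberg_create_table_ddl(
            "example_db",
            "example_table",
            {"id": "bigint", "data": "string"},
            "s3://example-bucket/example-prefix/",
        )
    )
```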
  9. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Use Glue Crawler to register table definitions
    All three formats are natively supported.
    • Delta Lake
      § Need to choose native table or symlink table
    • Hudi, Iceberg
      § You can define the maximum depth that the crawler traverses to discover the table metadata
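The crawler choices above can be sketched as a targets structure of the kind the Glue crawler API accepts. This is a sketch under assumptions: the field names follow the Glue API's Delta, Iceberg, and Hudi target shapes as I understand them, and the helper name, paths, and default depth are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def build_crawler_targets(
    delta_paths: Sequence[str],
    iceberg_paths: Sequence[str],
    hudi_paths: Sequence[str],
    native_delta: bool = True,
    max_depth: int = 10,
) -> Dict[str, List[Dict[str, Any]]]:
    """Assemble crawler targets for the three table formats.

    Hypothetical helper: it only builds a dictionary and does not call AWS.
    """
    targets: Dict[str, List[Dict[str, Any]]] = {}
    if delta_paths:
        targets["DeltaTargets"] = [
            {
                "DeltaTables": list(delta_paths),
                # Delta Lake: choose a native table or a symlink (manifest) table.
                "CreateNativeDeltaTable": native_delta,
                "WriteManifest": not native_delta,
            }
        ]
    if iceberg_paths:
        targets["IcebergTargets"] = [
            {"Paths": list(iceberg_paths), "MaximumTraversalDepth": max_depth}
        ]
    if hudi_paths:
        targets["HudiTargets"] = [
            {"Paths": list(hudi_paths), "MaximumTraversalDepth": max_depth}
        ]
    return targets


if __name__ == "__main__":
    print(build_crawler_targets(["s3://example-bucket/delta/"], [], []))
```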
  10. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Native Delta table and manifest table
    • Native Delta table: natively created Delta table. Extends Parquet data files with a file-based transaction log and metadata.
    • Manifest table: consistent snapshot exported from the native Delta table using ACID transactions. Represented as a symlink table based on Parquet files.
    GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
    [Diagram labels: AWS Glue for Apache Spark, Glue Crawler (Crawl), GENERATE, Amazon Athena, Amazon Redshift, Amazon EMR]
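As a small sketch of the manifest step, the GENERATE statement from the slide can be assembled for a concrete table path before it is handed to a Spark session with Delta Lake enabled. The helper name and the example path are hypothetical.

```python
def symlink_manifest_sql(delta_table_path: str) -> str:
    """Return the GENERATE statement for a Delta table stored at the given path.

    Hypothetical helper: it only builds the SQL text and does not run it.
    """
    if "`" in delta_table_path:
        # The path is wrapped in backticks, so a backtick inside it would break the statement.
        raise ValueError("path must not contain a backtick")
    return f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_table_path}`"


if __name__ == "__main__":
    print(symlink_manifest_sql("s3://example-bucket/delta/events/"))
```

The resulting string would typically be run through the session's SQL interface in a Spark job.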
  11. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Example: Ad-hoc query from AWS Analytics
    1. Write Delta table files to S3
    2. Crawl Delta files and populate the table definition
    3. Run ad-hoc queries from AWS analytics services
    [Diagram labels: Users, Amazon S3 (Delta table files), AWS Glue Crawler, AWS Glue Data Catalog (Delta table definition), Amazon Athena, Amazon Redshift]
    Video: https://youtu.be/o6Wd84-lxCI?si=M7Xh-1Hg8rOw3_e5&t=884
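Steps 2 and 3 of this flow can be sketched in Python, assuming boto3-style Glue and Athena clients are passed in by the caller. The function name, polling settings, and all argument values are hypothetical. The sketch also assumes the crawler reports the state READY once a run has finished.

```python
import time
from typing import Any


def crawl_then_query(
    glue: Any,
    athena: Any,
    crawler_name: str,
    database: str,
    sql: str,
    output_location: str,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> str:
    """Run a crawler, wait for it to finish, then submit an Athena query.

    glue and athena are expected to behave like boto3 clients for those services.
    """
    # Step 2: crawl the Delta files so the table definition is populated.
    glue.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"crawler {crawler_name} did not finish in time")

    # Step 3: submit the ad-hoc query against the cataloged table.
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Passing the clients in keeps the function easy to exercise with stand-in objects before pointing it at a real account.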
  12. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Contribution to Apache Hudi
    • HUDI-6805: Print detailed error message in clustering
    • HUDI-5866: Fix unnecessary log messages during bulk insert in Spark
    • And many more
  13. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Contribution to Apache Iceberg
    • AWS: Print logs whether Glue optimistic locking is used or not (#6358)
    • AWS, Docs: Add AWS Glue in Run Iceberg on AWS section (#6623)
    • AWS: Upgrade AWS Java SDK version to 2.20.131 (#8379)
    • AWS: Support S3 DSSE-KMS encryption (#8370)
    • AWS: Fix the missing StorageDescriptor parameter after renameTo (#3468)
    • AWS: Add supporting only s3 bucket name for S3URI (#6352)
    • And many more
  14. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Contribution to Delta Lake
    [Design Doc] Catalog implementation for AWS Glue Data Catalog
    https://github.com/delta-io/delta/issues/1679
    • For better integration with Glue Data Catalog
  15. RUNNING APACHE ICEBERG, APACHE HUDI, AND DELTA LAKE ON AWS

    Thank you!
    Akira Ajisaka
    X: @ajis_ka