Slide 1

Slide 1 text

"Apache Iceberg: The Definitive Guide" reading group
Chapter 6. Apache Spark
Tomohiro Tanaka, 2024 Sep. 2

Slide 2

Slide 2 text

Tomohiro Tanaka
• Hobby: Iceberg. I contribute to Iceberg from time to time
• Co-author of "Serverless ETL and Analytics with AWS Glue"
• GitHub: tomtongue, LinkedIn: in/ttomtan
This presentation reflects my personal views and does not represent my employer.

Slide 3

Slide 3 text

What this talk covers:
1. Use Iceberg with Spark
2. Iceberg operations for Spark (Advanced)
3. Dive Deep into Iceberg configurations for Spark
Based on Iceberg version 1.6.1.

Slide 4

Slide 4 text

Use Iceberg with Spark

Slide 5

Slide 5 text

What is Apache Spark?
• A unified analytics engine for large-scale data processing
• Lets applications easily distribute in-memory processing
• Well suited to iterative computations
• Lets developers easily implement high-level control flow for distributed data processing applications

Slide 6

Slide 6 text

What is Apache Spark? Example: ETL

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json('s3://src-bucket/path/')
df.createOrReplaceTempView('tmp_tbl')
df_2 = spark.sql("""
    SELECT category, count(*) AS cnt
    FROM tmp_tbl
    GROUP BY category
""")
df_2.write.parquet('s3://dst-bucket/path/')

+--------+---+
|category|cnt|
+--------+---+
|   drink| 35|
|    book| 12|
|  health|  9|
+--------+---+

Slide 7

Slide 7 text

SparkSQL & DataFrame APIs
• Enable data processing with SQL syntax
• Provide:
  1. External data source connectivity through friendly APIs,
  2. High performance drawn from database techniques, and
  3. Support for new data sources such as semi-structured data
• The DataFrame API is the main abstraction of SparkSQL.

Slide 8

Slide 8 text

SparkSQL & DataFrame APIs: Query optimization by Catalyst
• Queries are optimized by SparkSQL's query optimizer, called Catalyst
• Catalyst provides rule-based, cost-based, and runtime (Adaptive Query Execution) optimizations for your queries
(Diagram: a driver program running plain Spark RDDs without Catalyst, versus a driver program using SparkSQL with Catalyst query optimization.)

Slide 9

Slide 9 text

Using Iceberg with Spark: Prerequisites for Iceberg 1.6.x
• Spark 3.3+
• Java (8), 11, 17 or (21)
  • Java 8 will (most likely) no longer be supported in Iceberg 1.7.0 (a vote was held in the community until just recently)
    • (Community thread subject) [VOTE] Drop Java 8 support in Iceberg 1.7.0
    • PR #10518 - Core: Drop support for Java 8
    • The README already has no mention of Java 8
  • Also note that Spark 4.0 (in preview as of 2024-09-01) moves to Java 17/21
• JARs:
  • iceberg-spark-runtime-<spark-version>_<scala-version>-<iceberg-version>.jar
  • Plus, if needed, additional JARs such as iceberg-aws-bundle-<version>.jar (1.4.0+)
    • Before 1.4.0, the following two JARs were required instead (for AWS):
      • url-connection-client-<version>.jar
      • bundle-<version>.jar

Slide 10

Slide 10 text

Using Iceberg with Spark: Add Iceberg JARs

$ spark-shell (spark-sql/spark-submit/pyspark) \
    --jars /path/iceberg-spark-runtime.jar,/path/iceberg-aws-bundle.jar \
    --master yarn --conf <key>=<value> --conf <key>=<value>

$ spark-shell (spark-sql/spark-submit/pyspark) \
    --packages org.apache.iceberg:iceberg-spark-runtime-<spark-version>_<scala-version>:<iceberg-version> \
    --master yarn --conf <key>=<value> --conf <key>=<value>

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("yarn")\
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-<spark-version>_<scala-version>:<iceberg-version>")\
    .config("<key>", "<value>")\
    .config("<key>", "<value>")\
    .getOrCreate()

Slide 11

Slide 11 text

Walkthrough Iceberg with Spark: Scenario
1. Configure Iceberg on the SparkSession
2. Create an Iceberg table
3. Write data into the created Iceberg table
4. Read the data back from the Iceberg table

Slide 12

Slide 12 text

Walkthrough Iceberg with Spark: PySpark script

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

spark.sql("""
    CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg
""")
spark.sql("INSERT INTO my_catalog.db.tbl VALUES (1, 'Alice'), (2, 'Bob')")
spark.sql("SELECT * FROM my_catalog.db.tbl").show(n=2, truncate=False)

Slide 13

Slide 13 text

Walkthrough Iceberg with Spark: PySpark script
(Same script as the previous slide, annotated by section:)
• The SparkSession builder .config(...) calls: the Iceberg configuration
• CREATE TABLE ... USING iceberg: creating the Iceberg table
• INSERT INTO ...: writing data
• SELECT * FROM ...: reading data

Slide 14

Slide 14 text

Walkthrough Iceberg with Spark: Iceberg configuration
(Same script as before; this slide highlights the Iceberg configuration, i.e. the .config(...) calls on the SparkSession builder:)

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 15

Slide 15 text

Walkthrough Iceberg with Spark: Example: Iceberg configuration to use Glue Data Catalog and S3

For Iceberg 1.5.0+:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Before Iceberg 1.5.0:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 16

Slide 16 text

Walkthrough Iceberg with Spark: Example: Iceberg configuration to use Hive catalog and S3

Hive catalog only (HDFS is used as the default storage):

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "hive")\
    .config("spark.sql.catalog.my_catalog.uri", "thrift://<metastore-host>:<port>")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Hive catalog + S3:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "hive")\
    .config("spark.sql.catalog.my_catalog.uri", "thrift://<metastore-host>:<port>")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 17

Slide 17 text

Walkthrough Iceberg with Spark: Example: Iceberg configuration to use the Snowflake catalog

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.snowflake.SnowflakeCatalog")\
    .config("spark.sql.catalog.my_catalog.uri", "jdbc:snowflake://<account>.snowflakecomputing.com")\
    .config("spark.sql.catalog.my_catalog.jdbc.role", "<role>")\
    .config("spark.sql.catalog.my_catalog.jdbc.user", "<user>")\
    .config("spark.sql.catalog.my_catalog.jdbc.password", "<password>")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

For the Snowflake catalog, the ResolvingFileIO class is specified as the FileIO by default, so the implementation can also be resolved from the location's scheme (s3, gs, abfs, etc.). The fallback scheme is HDFS.*

*See the following source code:
https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java
https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/io/ResolvingFileIO.java
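The scheme-based resolution described above can be modeled in a few lines of plain Python. This is a simplified sketch, not Iceberg's actual implementation: the class names mirror real Iceberg FileIO implementations, but the mapping here is illustrative and incomplete.

```python
from urllib.parse import urlparse

# Simplified model of ResolvingFileIO: pick a FileIO implementation
# class name based on the location's URI scheme, falling back to
# HadoopFileIO (HDFS) for unknown schemes.
SCHEME_TO_FILE_IO = {
    "s3": "org.apache.iceberg.aws.s3.S3FileIO",
    "s3a": "org.apache.iceberg.aws.s3.S3FileIO",
    "gs": "org.apache.iceberg.gcp.gcs.GCSFileIO",
    "abfs": "org.apache.iceberg.azure.adlsv2.ADLSFileIO",
}
FALLBACK_FILE_IO = "org.apache.iceberg.hadoop.HadoopFileIO"

def resolve_file_io(location: str) -> str:
    """Return the FileIO class name for a table/file location."""
    scheme = urlparse(location).scheme
    return SCHEME_TO_FILE_IO.get(scheme, FALLBACK_FILE_IO)

print(resolve_file_io("s3://bucket/path"))     # resolves to S3FileIO
print(resolve_file_io("hdfs://nn:8020/path"))  # falls back to HadoopFileIO
```

See ResolvingFileIO.java (linked above) for the real scheme table and loading logic.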

Slide 18

Slide 18 text

Walkthrough Iceberg with Spark: PySpark script
(The full PySpark script from Slide 12, shown again unchanged before its output.)

Slide 19

Slide 19 text

Walkthrough Iceberg with Spark: Result output

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

spark.sql("""
    CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg
""")
spark.sql("INSERT INTO my_catalog.db.tbl VALUES (1, 'Alice'), (2, 'Bob')")
spark.sql("SELECT * FROM my_catalog.db.tbl").show(n=2, truncate=False)

+--+-------+
|id| name  |
+--+-------+
|1 | Alice |
|2 | Bob   |
+--+-------+

Slide 20

Slide 20 text

Iceberg Operations for Spark

Slide 21

Slide 21 text

Iceberg operations: 4 categories

• DDLs: CREATE TABLE (PARTITIONED BY), CTAS, DROP TABLE; ALTER TABLE RENAME/ALTER COLUMN (schema evolution); ALTER TABLE ADD/DROP/REPLACE PARTITION FIELD (partition evolution); ALTER TABLE CREATE BRANCH (branch and tag); CREATE|ALTER|DROP VIEW (views); etc.
• Reads: SELECT; SELECT ... AS OF (time travel); SELECT cols FROM history/snapshots/refs (table inspections); etc.
• Writes: INSERT INTO, UPDATE, DELETE FROM, MERGE INTO (upsert), INSERT OVERWRITE; etc.
• Procedures: rollback_to_snapshot/rollback_to_timestamp; rewrite_data_files/rewrite_manifests/rewrite_position_delete_files (compaction); expire_snapshots, remove_orphan_files; fast_forward, publish_changes; etc.

Slide 22

Slide 22 text

DDLs: CREATE TABLE with partitioning and bucketing

CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg

CREATE TABLE my_catalog.db.tbl (id int, name string, year int) USING iceberg
LOCATION 's3://bucket/path'
PARTITIONED BY (year)

s3://bucket/path/db.db/tbl/
- data/
  - year=2024/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00003.parquet
  - year=2023/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00002.parquet
- metadata/
  - 00001-00ce06e3-54d3-4f36-94c6-7e277b2aec3f.metadata.json
  - a184a338-62d9-4b67-ba06-5757972f64a6-m0.avro
  - snap-6993945275787927359-1-a184a338-62d9-4b67-ba06-5757972f64a6.avro

Slide 23

Slide 23 text

DDLs: CREATE TABLE with partitioning and bucketing

CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg

CREATE TABLE my_catalog.db.tbl (id int, name string, year int) USING iceberg
LOCATION 's3://bucket/path'
PARTITIONED BY (year)

CREATE TABLE my_catalog.db.tbl (id int, name string, ts timestamp) USING iceberg
LOCATION 's3
PARTITIONED BY (year(ts), month(ts), day(ts))

Partition transforms such as year, month, day, bucket, etc.
Doc: https://iceberg.apache.org/spec/#partition-transforms
Src: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/api/src/main/java/org/apache/iceberg/transforms/Transforms.java
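The time-based transforms above can be sketched in plain Python. Per the Iceberg spec, each transform maps a timestamp to an integer offset from the Unix epoch (years, months, or days since 1970-01-01); the bucket transform additionally requires Murmur3 hashing, so it is omitted here.

```python
from datetime import datetime, date

EPOCH = date(1970, 1, 1)

# Sketch of Iceberg's time-based partition transforms: each returns an
# integer partition value counted from the Unix epoch.
def year_transform(ts: datetime) -> int:
    return ts.year - 1970

def month_transform(ts: datetime) -> int:
    return (ts.year - 1970) * 12 + (ts.month - 1)

def day_transform(ts: datetime) -> int:
    return (ts.date() - EPOCH).days

ts = datetime(2024, 9, 2, 7, 45)
print(year_transform(ts))   # 54  (2024 is 54 years after 1970)
print(month_transform(ts))  # 656 (54 * 12 + 8)
```

Rows with the same transform value land in the same partition, which is what the `PARTITIONED BY (year(ts), month(ts), day(ts))` clause expresses.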

Slide 24

Slide 24 text

DDLs: CREATE TABLE with partitioning and bucketing

CREATE TABLE my_catalog.db.tbl (id int, name string, ts timestamp) USING iceberg
LOCATION 's3
PARTITIONED BY (year(ts), month(ts), day(ts), bucket(4, id))

s3://bucket/path/db.db/tbl/
- data/
  - year=2024/
    - id_bucket=3/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00003.parquet
    - id_bucket=0/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00002.parquet
- metadata/
  - 00001-00ce06e3-54d3-4f36-94c6-7e277b2aec3f.metadata.json
  - a184a338-62d9-4b67-ba06-5757972f64a6-m0.avro
  - snap-6993945275787927359-1-a184a338-62d9-4b67-ba06-5757972f64a6.avro

Slide 25

Slide 25 text

DDLs: ALTER TABLE

SQL extensions NOT required:
• ALTER TABLE my_catalog.db.tbl RENAME TO my_catalog.db.tbl_new_name
• ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES ('KEY'='VALUE')
• ALTER TABLE my_catalog.db.tbl ADD COLUMN(S) (new_col string)
• ALTER TABLE my_catalog.db.tbl RENAME COLUMN id TO user_id
• ALTER TABLE my_catalog.db.tbl ALTER COLUMN id TYPE string
• ALTER TABLE my_catalog.db.tbl DROP COLUMN id

SQL extensions required:
• ALTER TABLE my_catalog.db.tbl ADD|DROP PARTITION FIELD day(ts)
• ALTER TABLE my_catalog.db.tbl REPLACE PARTITION FIELD id WITH req_id
• ALTER TABLE my_catalog.db.tbl WRITE ORDERED BY id, year
• ALTER TABLE my_catalog.db.tbl WRITE DISTRIBUTED BY PARTITION year, month
• ALTER TABLE my_catalog.db.tbl SET|DROP IDENTIFIER FIELDS id, year
• ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP BRANCH 'branchname'
• ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP TAG 'tagname'

Slide 26

Slide 26 text

DDLs: Views

CREATE VIEW product_category_view AS
SELECT category, count(*) AS cnt FROM my_catalog.db.tbl GROUP BY category

DROP VIEW product_category_view
ALTER VIEW product_category_view

Spark 3.4+; catalogs such as Nessie, JDBC, etc. are supported
Ref: https://iceberg.apache.org/docs/latest/spark-ddl/#iceberg-views-in-spark

Slide 27

Slide 27 text

Reads: Time travel

Time travel by version (snapshot ID), timestamp, and branch/tag:

SELECT * FROM my_catalog.db.tbl
SELECT * FROM my_catalog.db.tbl TIMESTAMP AS OF '<timestamp>'
SELECT * FROM my_catalog.db.tbl VERSION AS OF '<snapshot_id or branch/tag>'

Current table:
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |2024|
|2  |Bob    |2023|
|3  |Charlie|2022|
|4  |Dave   |2021|
|5  |Elly   |2022|
+---+-------+----+

SELECT * FROM my_catalog.db.tbl VERSION AS OF 7394805868859573195
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |2024|
|2  |Bob    |2023|
|3  |Charlie|2022|
+---+-------+----+

Ref: https://iceberg.apache.org/docs/latest/spark-queries/#time-travel
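Conceptually, `TIMESTAMP AS OF` selects the latest snapshot committed at or before the requested timestamp. A minimal sketch, using hypothetical millisecond timestamps (the snapshot IDs are borrowed from the slides, the timestamps are made up):

```python
# (committed_at_ms, snapshot_id), sorted oldest-first like a snapshot log
snapshot_log = [
    (1725262531031, 7394805868859573195),
    (1725264836313, 7217724094206392902),
]

def snapshot_as_of(log, ts_ms):
    """Pick the newest snapshot committed at or before ts_ms."""
    eligible = [snap_id for committed, snap_id in log if committed <= ts_ms]
    if not eligible:
        raise ValueError("no snapshot exists as of the given timestamp")
    return eligible[-1]  # log is sorted by commit time, so last wins

print(snapshot_as_of(snapshot_log, 1725263000000))  # first snapshot
```

`VERSION AS OF` is simpler still: it looks the snapshot up directly by ID, branch, or tag.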

Slide 28

Slide 28 text

Reads: Table inspections

SELECT * FROM my_catalog.db.tbl.<metadata_table>

SELECT * FROM my_catalog.db.tbl.history
+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2024-09-02 07:45:31.031|7394805868859573195|NULL               |true               |
|2024-09-02 08:23:56.313|7217724094206392902|7394805868859573195|true               |
+-----------------------+-------------------+-------------------+-------------------+

Ref: https://iceberg.apache.org/docs/latest/spark-queries/#inspecting-tables

Slide 29

Slide 29 text

Writes: MERGE INTO (= upsert)

MERGE INTO my_catalog.db.tbl t
USING (SELECT * FROM tmp) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.year = s.year
WHEN NOT MATCHED THEN INSERT *

Target:
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |2024|
|2  |Bob    |2023|
|3  |Charlie|2022|
|4  |Dave   |2021|
|5  |Elly   |2022|
+---+-------+----+

tmp:
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |1999|
|2  |Bob    |1998|
|8  |Tommy  |2024|
+---+-------+----+

Result:
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |1999|
|2  |Bob    |1998|
|3  |Charlie|2022|
|4  |Dave   |2021|
|5  |Elly   |2022|
|8  |Tommy  |2024|
+---+-------+----+
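The upsert semantics of the MERGE statement above can be shown in pure Python (a toy model keyed by id, mirroring the slide's target and tmp tables, not how Iceberg executes merges internally):

```python
# Target table and merge source, keyed by id (mirrors the slide's data).
target = {
    1: {"name": "Alice", "year": 2024},
    2: {"name": "Bob", "year": 2023},
    3: {"name": "Charlie", "year": 2022},
    4: {"name": "Dave", "year": 2021},
    5: {"name": "Elly", "year": 2022},
}
source = {
    1: {"name": "Alice", "year": 1999},
    2: {"name": "Bob", "year": 1998},
    8: {"name": "Tommy", "year": 2024},
}

def merge_into(target, source):
    """Apply MERGE semantics: matched rows update year, others insert."""
    merged = {k: dict(v) for k, v in target.items()}
    for key, row in source.items():
        if key in merged:
            merged[key]["year"] = row["year"]  # WHEN MATCHED THEN UPDATE SET t.year = s.year
        else:
            merged[key] = dict(row)            # WHEN NOT MATCHED THEN INSERT *
    return merged

result = merge_into(target, source)
print(result[1]["year"], result[8]["name"])  # 1999 Tommy
```

The output matches the result table on the slide: ids 1 and 2 get new years, id 8 is inserted, and the untouched rows pass through unchanged.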

Slide 30

Slide 30 text

Procedures (all invoked as CALL my_catalog.system.<procedure>)

• Snapshot management: rollback_to_snapshot, rollback_to_timestamp, set_current_snapshot, cherrypick_snapshot
• Branch and tag: publish_changes, fast_forward
• Data lifecycle management: expire_snapshots, remove_orphan_files
• Compaction: rewrite_data_files, rewrite_manifests, rewrite_position_delete_files
• Table migration: snapshot, migrate, add_files
• Table registration: register_table
• CDC for an Iceberg table: create_changelog_view
• Metadata information: ancestors_of
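As one example, the retention logic behind expire_snapshots can be sketched conceptually: drop snapshots older than a cutoff while always retaining the most recent N. This is an illustrative model only; the real procedure also deletes data and metadata files that become unreferenced.

```python
def expire_snapshots(snapshots, older_than_ms, retain_last=1):
    """Keep snapshots newer than the cutoff, plus the last retain_last ones.

    snapshots: list of (committed_at_ms, snapshot_id), oldest first.
    """
    keep_tail = snapshots[-retain_last:] if retain_last else []
    return [s for s in snapshots if s[0] >= older_than_ms or s in keep_tail]

snaps = [(100, "s1"), (200, "s2"), (300, "s3")]
print(expire_snapshots(snaps, older_than_ms=250, retain_last=1))
# s1 and s2 are older than the cutoff, but s3 survives and retain_last
# would also protect s2 if retain_last=2
```

The real procedure exposes matching parameters (older_than, retain_last) in its CALL signature.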

Slide 31

Slide 31 text

Dive Deep into Iceberg configurations for Spark

Slide 32

Slide 32 text

Recap: Iceberg configurations for Spark

For Iceberg 1.5.0+:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Before Iceberg 1.5.0:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 33

Slide 33 text

Recap: Iceberg configurations for Spark

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 34

Slide 34 text

Recap: Iceberg configurations for Spark

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

The settings fall into three groups:
• Catalog: spark.sql.catalog.my_catalog and .catalog-impl
• Storage: .io-impl and .warehouse
• Extensions: spark.sql.extensions

Slide 35

Slide 35 text

Initialization of SparkSession with Iceberg configurations

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

• Spark catalog: my_catalog = org.apache.iceberg.spark.SparkCatalog
• Catalog implementation class: org.apache.iceberg.aws.glue.GlueCatalog
• Storage implementation class and location: org.apache.iceberg.aws.s3.S3FileIO, warehouse = s3://bucket/path
• SparkSQL extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Slide 36

Slide 36 text

What is the "Spark Catalog"?
• Spark 3 added the Catalog plugin API and multiple catalog support
  • SPARK-27066 - SPIP: Identifiers for multi-catalog support
  • SPARK-27067 - SPIP: Catalog API for table metadata
• With this feature:
  • Users can plug their own "catalog" implementations into Spark
  • Because the catalog becomes a new top-level namespace, a single SparkSession can use multiple catalog implementations (e.g. SELECT * FROM catalog.db.tbl)
  • Default catalog name: spark_catalog (can be changed with spark.sql.defaultCatalog=<catalog-name>)
• Iceberg's Spark integration uses this Catalog API to access each catalog implementation
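The multi-catalog name resolution can be modeled in a few lines: the first identifier part selects a registered catalog, and unqualified names fall back to the default catalog. This is a deliberately simplified model of Spark's behavior, not its actual implementation.

```python
# Registered catalogs, as if configured via spark.sql.catalog.<name>.
catalogs = {
    "spark_catalog": "builtin session catalog",
    "my_catalog": "org.apache.iceberg.spark.SparkCatalog",
}
DEFAULT_CATALOG = "spark_catalog"

def resolve(identifier: str):
    """Split an identifier into (catalog, remaining name parts)."""
    parts = identifier.split(".")
    if parts[0] in catalogs:
        return parts[0], parts[1:]
    return DEFAULT_CATALOG, parts

print(resolve("my_catalog.db.tbl"))  # ('my_catalog', ['db', 'tbl'])
print(resolve("db.tbl"))             # ('spark_catalog', ['db', 'tbl'])
```

This is why `SELECT * FROM my_catalog.db.tbl` reaches the Iceberg catalog while `SELECT * FROM db.tbl` goes to spark_catalog.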

Slide 37

Slide 37 text

How the Spark catalog and its implementation class are resolved

There are two layers:
1. SparkCatalog is first resolved at the "Spark" layer
2. The Iceberg catalog implementation (e.g. HiveCatalog, GlueCatalog, etc.) is then loaded at the "Iceberg" layer

Spark layer:   spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
Iceberg layer: spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    ...

Slide 38

Slide 38 text

How the Spark catalog and its implementation class are resolved: Catalyst query optimization process

Before looking at how SparkCatalog is resolved at the Spark layer, let's review the Catalyst plan. SparkSQL flows through the SQLParser, Analyzer, Optimizer, and Planner.
For details, see https://github.com/apache/spark/blob/v3.5.2/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

Slide 39

Slide 39 text

How the Spark catalog and its implementation class are resolved: Catalyst query optimization process

SQLParser → Analyzer → Optimizer → Planner:

Unresolved LogicalPlan
→ Analyzed LogicalPlan (rule-based analysis by the Analyzer)
→ Optimized LogicalPlan (rule-based and cost-based optimization by the Optimizer)
→ WithCachedData LogicalPlan
→ PhysicalPlan (sparkPlan; conversion by the SparkPlanner)
→ ExecutablePhysicalPlan (executedPlan)
→ RDD / CodeGen

Runtime optimization: if needed, Adaptive Query Execution re-optimizes the logical plan.
For details, see https://github.com/apache/spark/blob/v3.5.2/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
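The "rule-based" phases above all share one shape: rules are plan-to-plan functions applied repeatedly until nothing changes (a fixed point). A toy illustration, where the "plan" is a nested tuple rather than a real Catalyst tree:

```python
def remove_double_not(plan):
    """Toy rewrite rule: NOT(NOT(x)) -> x, applied recursively."""
    if isinstance(plan, tuple) and plan[0] == "NOT":
        child = plan[1]
        if isinstance(child, tuple) and child[0] == "NOT":
            return remove_double_not(child[1])
        return ("NOT", remove_double_not(child))
    return plan

def optimize(plan, rules):
    """Apply every rule in order, repeating until a fixed point."""
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:  # no rule changed anything: done
            return plan
        plan = new_plan

plan = ("NOT", ("NOT", ("filter", "x > 1")))
print(optimize(plan, [remove_double_not]))  # ('filter', 'x > 1')
```

Catalyst organizes its real rules into batches with execution strategies (run once vs. run to fixed point), but the core loop is the same idea.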

Slide 40

Slide 40 text

How the Spark catalog and its implementation class are resolved: Catalyst query optimization process

(Same Catalyst pipeline diagram as the previous slide, highlighting that the Catalog is resolved during the Analyzer phase, between the Unresolved LogicalPlan and the Analyzed LogicalPlan.)

Slide 41

Slide 41 text

How the Spark catalog and its implementation class are resolved: resolution at the "Spark" layer

During analysis (Unresolved LogicalPlan → Analyzed LogicalPlan), the catalog is resolved:
1. Parse SQL with SparkSqlParser
2. Resolve in SparkAnalyzer (e.g. ResolveCatalogs, LookupCatalog)
3. CatalogManager.catalog → Catalogs.load
4. Initialize iceberg.spark.SparkCatalog (iceberg.spark.SparkCatalog.initialize) via the CatalogPlugin interface

With the DataFrame API (DataSourceV2), the SQL parsing phase is skipped and the Iceberg catalog is loaded by Spark's CatalogV2Util class.

Slide 42

Slide 42 text

How the Spark catalog and its implementation class are resolved: resolution at the "Iceberg" layer

Spark layer: Parse SQL with SparkSqlParser → resolve in SparkAnalyzer (e.g. ResolveCatalogs, LookupCatalog) → CatalogManager.catalog → Catalogs.load → initialize iceberg.spark.SparkCatalog via CatalogPlugin

Iceberg layer: SparkCatalog.buildIcebergCatalog → CatalogUtil.buildIcebergCatalog

Slide 43

Slide 43 text

Iceberg catalog types: Iceberg catalogs loaded by CatalogUtil.buildIcebergCatalog

Iceberg catalogs fall into two categories: 1. catalogs with a shorthand (selectable via `type`), and 2. custom catalogs.

Catalogs selectable via `type`:
• hive (default): org.apache.iceberg.hive.HiveCatalog (default storage: org.apache.iceberg.hadoop.HadoopFileIO)
• hadoop: org.apache.iceberg.hadoop.HadoopCatalog (default storage: HadoopFileIO)
• jdbc: org.apache.iceberg.jdbc.JdbcCatalog (default storage: HadoopFileIO)
• rest: org.apache.iceberg.rest.RESTCatalog (default storage: org.apache.iceberg.io.ResolvingFileIO)
• glue: org.apache.iceberg.aws.glue.GlueCatalog (default storage: org.apache.iceberg.aws.s3.S3FileIO)
• nessie: org.apache.iceberg.nessie.NessieCatalog (default storage: HadoopFileIO)

Custom catalogs (the implementation class must be set via catalog-impl; some ship with Iceberg packages; default storage: n/a):
• org.apache.iceberg.inmemory.InMemoryCatalog
• org.apache.iceberg.snowflake.SnowflakeCatalog
• org.apache.iceberg.dell.ecs.EcsCatalog
• etc.
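The shorthand resolution can be sketched as a simple lookup (an illustrative model of the idea; the real CatalogUtil.buildIcebergCatalog, linked on the next slide, also validates conflicting options and handles Hive specially):

```python
# Shorthand `type` values and the implementation classes they map to.
SHORTHAND_TYPES = {
    "hive": "org.apache.iceberg.hive.HiveCatalog",
    "hadoop": "org.apache.iceberg.hadoop.HadoopCatalog",
    "jdbc": "org.apache.iceberg.jdbc.JdbcCatalog",
    "rest": "org.apache.iceberg.rest.RESTCatalog",
    "glue": "org.apache.iceberg.aws.glue.GlueCatalog",
    "nessie": "org.apache.iceberg.nessie.NessieCatalog",
}

def resolve_catalog_impl(options: dict) -> str:
    """Return the catalog class: catalog-impl wins, else the `type` shorthand."""
    if "catalog-impl" in options:
        return options["catalog-impl"]
    return SHORTHAND_TYPES[options.get("type", "hive")]

print(resolve_catalog_impl({"type": "glue"}))
print(resolve_catalog_impl(
    {"catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"}))
# both print org.apache.iceberg.aws.glue.GlueCatalog
```

Both calls resolve to GlueCatalog, which is why the two Glue configurations shown earlier are interchangeable.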

Slide 44

Slide 44 text

Iceberg catalog types: Iceberg catalogs loaded by CatalogUtil.buildIcebergCatalog

Therefore, either configuration can be used:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")\
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 45

Slide 45 text

Ref: CatalogUtil.buildIcebergCatalog
https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L273

Slide 46

Slide 46 text

Loading the FileIO class: after CatalogUtil.buildIcebergCatalog

1. SparkCatalog.buildIcebergCatalog → CatalogUtil.buildIcebergCatalog: sets the catalog implementation class. For type=hive this is org.apache.iceberg.hive.HiveCatalog; otherwise it is whatever catalog-impl specifies, e.g. org.apache.iceberg.snowflake.SnowflakeCatalog
2. CatalogUtil.loadCatalog: the catalog implementation is loaded through a dynamic constructor (the DynConstructors class)
   Ref: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L233
3. CatalogUtil.loadFileIO: the catalog is initialized from the loaded implementation class, and the FileIO is loaded at this initialization time
   Refs:
   * https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L256
   * (HiveCatalog) https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L111
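Dynamic loading of a class by its fully qualified name, as DynConstructors does on the JVM, has a direct Python analogue using importlib. This sketch demonstrates the pattern with a stdlib class, since the Java implementations are obviously not importable here:

```python
import importlib

def load_class(fqcn: str):
    """Load a class from a fully qualified name, e.g. 'collections.Counter'."""
    module_name, _, class_name = fqcn.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# The class object can now be instantiated reflectively, the way
# CatalogUtil.loadCatalog instantiates the configured CatalogImpl.
counter_cls = load_class("collections.Counter")
print(counter_cls("iceberg").most_common(1))  # [('e', 2)]
```

The key property in both worlds is that the catalog class name is just configuration data until load time, so users can point catalog-impl at any class on the classpath.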

Slide 47

Slide 47 text

Summary: Catalog resolution

Spark layer:
1. Parse SQL with SparkSqlParser
2. Resolve in SparkAnalyzer (e.g. ResolveCatalogs, LookupCatalog)
3. CatalogManager.catalog → Catalogs.load
4. Initialize iceberg.spark.SparkCatalog (iceberg.spark.SparkCatalog.initialize) via CatalogPlugin

Iceberg layer:
5. SparkCatalog.buildIcebergCatalog → CatalogUtil.buildIcebergCatalog (set the catalog impl)
6. CatalogUtil.loadCatalog (load the catalog impl)
7. CatalogUtil.loadFileIO (load the FileIO impl)
→ Ready to use

Slide 48

Slide 48 text

What is "IcebergSparkSessionExtensions"?
• Rules in Spark Catalyst can be extended via the SparkSessionExtensions class
• Developers can implement their own rules and add them to Spark Catalyst
• Users enable the added rules by setting them on spark.sql.extensions

Ref: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/spark/v3.5/spark-extensions/src/main/scala/org/apache/iceberg/spark/extensions/IcebergSparkSessionExtensions.scala

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

Slide 49

Slide 49 text

Iceberg Catalyst rule injection: IcebergSparkSessionExtensions (Spark 3.5 & Iceberg 1.6.1)

• SQLParser: injectParser
• Analyzer (Unresolved LogicalPlan → Analyzed LogicalPlan):
  • injectResolutionRule: ResolveProcedures, ResolveViews, ProcedureArgumentCoercion
  • injectCheckRule: CheckViews
• Optimizer (Analyzed LogicalPlan → Optimized LogicalPlan):
  • injectOptimizerRule: ReplaceStaticInvoke
  • injectPreCBORule
• Planner (Optimized LogicalPlan → SparkPlan, + AQE):
  • injectPlannerStrategy: ExtendedDataSourceV2Strategy

Slide 50

Slide 50 text

Many Iceberg Catalyst rules are added to Spark: Iceberg 1.4.0 release notes
Tabular blog post, 2023 Oct. 5: https://tabular.io/blog/iceberg-1-4-0/

Slide 51

Slide 51 text

Many Iceberg Catalyst rules are added to Spark

Phase: Parse, method injectParser
• Spark 3.4: IcebergSparkSqlExtensionsParser
• Spark 3.5: IcebergSparkSqlExtensionsParser

Phase: Analyzer, method injectResolutionRule
• Spark 3.4: ResolveProcedures, ResolveViews, ResolveMergeIntoTableReferences, CheckMergeIntoTableConditions, ProcedureArgumentCoercion, AlignRowLevelCommandAssignments, RewriteUpdateTable, RewriteMergeIntoTable
• Spark 3.5: ResolveProcedures, ResolveViews, ProcedureArgumentCoercion

Phase: Analyzer, method injectCheckRule
• Spark 3.4: CheckViews, MergeIntoIcebergTableResolutionCheck, AlignedRowLevelIcebergCommandCheck
• Spark 3.5: CheckViews

Phase: Optimizer, method injectOptimizerRule
• Spark 3.4: ExtendedSimplifyConditionalsInPredicate, ExtendedReplaceNullWithFalseInPredicate, ReplaceStaticInvoke
• Spark 3.5: ReplaceStaticInvoke

Phase: Optimizer, method injectPreCBORule
• Spark 3.4: RowLevelCommandScanRelationPushdown, ExtendedV2Writes, RowLevelCommandDynamicPruning, ReplaceRewrittenRowLevelCommand
• Spark 3.5: n/a

Phase: Planner, method injectPlannerStrategy
• Spark 3.4: ExtendedDataSourceV2Strategy
• Spark 3.5: ExtendedDataSourceV2Strategy

(e.g.) https://github.com/apache/spark/blob/v3.5.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteUpdateTable.scala

Slide 52

Slide 52 text

What is "SparkSessionCatalog" (for Iceberg)?
• In addition to the Iceberg catalog implementation, it also wraps Spark's built-in catalog (spark_catalog). That means table operations on plain Spark (non-Iceberg) tables can be executed as well
• It is known as a "fallback" catalog: it first tries the Iceberg catalog implementation, and falls back to the Spark catalog when the table is not an Iceberg table
• It is sometimes used when running the migrate procedure, which needs both the Spark catalog and the Iceberg catalog
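The fallback behavior can be sketched as a toy model in pure Python (the table data is hypothetical; the real SparkSessionCatalog delegates actual loadTable calls between the two catalog implementations):

```python
# Tables known to each catalog (hypothetical example data).
iceberg_tables = {"db.iceberg_tbl": "iceberg table metadata"}
session_tables = {"db.hive_tbl": "hive/parquet table metadata"}

def load_table(name: str) -> str:
    """Try the Iceberg catalog first, then fall back to spark_catalog."""
    if name in iceberg_tables:   # Iceberg catalog implementation
        return f"iceberg:{name}"
    if name in session_tables:   # fallback to the built-in session catalog
        return f"session:{name}"
    raise KeyError(f"Table {name} not found in either catalog")

print(load_table("db.iceberg_tbl"))  # handled by the Iceberg catalog
print(load_table("db.hive_tbl"))     # falls back to the session catalog
```

This is what lets one catalog name (spark_catalog) serve both Iceberg and legacy tables, which the migrate procedure relies on.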

Slide 53

Slide 53 text

SparkSessionCatalog configuration

SparkCatalog:

spark = SparkSession.builder\
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\
    .config("spark.sql.catalog.my_catalog.type", "glue")\
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

spark.sql("SELECT * FROM my_catalog.db.tbl").show(truncate=False)

SparkSessionCatalog:

spark = SparkSession.builder\
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")\
    .config("spark.sql.catalog.spark_catalog.type", "glue")\
    .config("spark.sql.catalog.spark_catalog.warehouse", "s3://bucket/path")\
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")\
    .getOrCreate()

spark.sql("SELECT * FROM db.tbl").show(truncate=False)

Slide 54

Slide 54 text

Summary
• To use Iceberg with Spark, import the Iceberg package and set the SparkCatalog-related parameters
• Many Iceberg operations can be run from Spark, and various maintenance tasks can be executed as procedures
• The Iceberg catalog is configured through Spark's CatalogPlugin API; after the catalog is initialized, the FileIO is initialized. Each catalog has a default FileIO class
• IcebergSparkSessionExtensions makes it possible to run queries that have no implementation in Spark itself (such as procedures)