for large-scale data processing
• Enables applications to easily distribute in-memory processing
• Well suited for iterative computations
• Enables developers to easily implement the high-level control flow of their distributed data processing applications
syntax
• Provides:
  1. External data source connectivity with friendly APIs,
  2. High performance drawn from database techniques, and
  3. Support for new data sources such as semi-structured data
• The DataFrame API is the main abstraction of SparkSQL.
you to get optimizations from the query optimizer (called Catalyst)
• Catalyst provides rule-based, cost-based and runtime (Adaptive Query Execution) optimizations for your queries.
(Diagram: query optimization without Catalyst, driver program running Spark RDD code directly, vs. with Catalyst, driver program going through SparkSQL)
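To see what Catalyst produces, you can ask Spark SQL to print the plan stages with EXPLAIN; a minimal sketch against the my_catalog.db.tbl table used later in this deck (table and column names are taken from those examples):

-- EXPLAIN EXTENDED prints the parsed, analyzed, optimized (Catalyst) and physical plans
EXPLAIN EXTENDED
SELECT year, count(*) AS cnt
FROM my_catalog.db.tbl
WHERE year >= 2023
GROUP BY year;

-- EXPLAIN FORMATTED (Spark 3.0+) gives a more compact view of the chosen physical plan
EXPLAIN FORMATTED
SELECT * FROM my_catalog.db.tbl WHERE id = 1;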
BY), CTAS, DROP TABLE
• ALTER TABLE RENAME/ALTER COLUMN (schema evolution)
• ALTER TABLE ADD/DROP/REPLACE PARTITION FIELD (partition evolution)
• ALTER TABLE CREATE BRANCH (branches and tags)
• CREATE|ALTER|DROP VIEW (views), etc.
Reads
• SELECT, SELECT ... AS OF (time travel)
• SELECT cols FROM history/snapshots/refs (table inspection), etc.
Writes
• INSERT INTO, UPDATE, DELETE FROM, MERGE INTO (upsert), INSERT OVERWRITE, etc.
Procedures
• rollback_to_snapshot/timestamp, rewrite_data_files/manifests, rewrite_position_delete_files (compaction)
• expire_snapshots, remove_orphan_files, fast_forward, publish_changes, etc.
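As a concrete illustration of the write and procedure rows above, a hedged sketch of an upsert plus routine maintenance; the staging table db.tbl_updates is hypothetical, and the procedure arguments follow the Iceberg Spark procedure docs for recent releases:

-- Upsert: rows matching on id are updated, new ids are inserted
MERGE INTO my_catalog.db.tbl t
USING my_catalog.db.tbl_updates u   -- hypothetical staging table with the same schema
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Maintenance procedures are invoked with CALL on the catalog's system namespace
CALL my_catalog.system.rewrite_data_files(table => 'db.tbl');
CALL my_catalog.system.expire_snapshots(table => 'db.tbl', older_than => TIMESTAMP '2024-01-01 00:00:00');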
(id int, name string) USING iceberg

CREATE TABLE my_catalog.db.tbl (id int, name string, year int)
USING iceberg
LOCATION 's3://bucket/path'
PARTITIONED BY (year)

s3://bucket/path/db.db/tbl/
- data/
  - year=2024/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00003.parquet
  - year=2023/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00002.parquet
- metadata/
  - 00001-00ce06e3-54d3-4f36-94c6-7e277b2aec3f.metadata.json
  - a184a338-62d9-4b67-ba06-5757972f64a6-m0.avro
  - snap-6993945275787927359-1-a184a338-62d9-4b67-ba06-5757972f64a6.avro
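The data/ and metadata/ files laid out above can also be inspected from SQL through Iceberg's metadata tables; a minimal sketch (output columns abbreviated, names as documented for Iceberg's Spark metadata tables):

-- Data files currently referenced by the table, with their partition and record counts
SELECT file_path, partition, record_count FROM my_catalog.db.tbl.files;

-- Manifest (metadata) files behind the current snapshot
SELECT path, added_data_files_count FROM my_catalog.db.tbl.manifests;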
(id int, name string) USING iceberg

CREATE TABLE my_catalog.db.tbl (id int, name string, year int)
USING iceberg
LOCATION 's3://bucket/path'
PARTITIONED BY (year)

CREATE TABLE my_catalog.db.tbl (id int, name string, ts timestamp)
USING iceberg
LOCATION 's3://bucket/path'
PARTITIONED BY (year(ts), month(ts), day(ts))

Partition transforms like year, month, day, bucket, etc.
Doc: https://iceberg.apache.org/spec/#partition-transforms
Src: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/api/src/main/java/org/apache/iceberg/transforms/Transforms.java
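The bucket transform mentioned above is used the same way; a hedged sketch of a table hash-bucketed on id in addition to a daily time partition (the table name my_catalog.db.events and the bucket count 16 are made up for illustration):

-- Hidden partitioning: queries filter on id/ts directly, with no extra partition columns to manage
CREATE TABLE my_catalog.db.events (id int, name string, ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), day(ts));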
my_catalog.db.tbl RENAME TO my_catalog.db.tbl_new_name
ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES ('KEY'='VALUE')
ALTER TABLE my_catalog.db.tbl ADD COLUMN(S) (new_col string)
ALTER TABLE my_catalog.db.tbl RENAME COLUMN id TO user_id
ALTER TABLE my_catalog.db.tbl ALTER COLUMN id TYPE bigint
ALTER TABLE my_catalog.db.tbl DROP COLUMN id
→ Iceberg SQL extensions required: No

ALTER TABLE my_catalog.db.tbl ADD|DROP PARTITION FIELD day(ts)
ALTER TABLE my_catalog.db.tbl REPLACE PARTITION FIELD id WITH req_id
ALTER TABLE my_catalog.db.tbl WRITE ORDERED BY id, year
ALTER TABLE my_catalog.db.tbl WRITE DISTRIBUTED BY PARTITION year, month
ALTER TABLE my_catalog.db.tbl SET|DROP IDENTIFIER FIELDS id, year
ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP BRANCH 'branchname'
ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP TAG 'tagname'
→ Iceberg SQL extensions required: Yes
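To show how the branch DDL above is typically used, a hedged sketch of writing to a branch and reading it back; the branch name audit and the inserted row are made up, and the tbl.branch_<name> write identifier follows the Iceberg Spark docs for recent releases:

ALTER TABLE my_catalog.db.tbl CREATE BRANCH audit;

-- Writes land only on the branch; main is untouched until the branch is merged or fast-forwarded
INSERT INTO my_catalog.db.tbl.branch_audit VALUES (6, 'Fred', 2024);

-- Read the branch (or a tag) with VERSION AS OF
SELECT * FROM my_catalog.db.tbl VERSION AS OF 'audit';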
count(*) as cnt FROM my_catalog.db.tbl

DROP VIEW product_category_view
ALTER VIEW product_category_view

Iceberg views require Spark 3.4+ and a view-enabled catalog (Nessie, JDBC, etc.)
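For completeness, a hedged sketch of the corresponding CREATE VIEW and a query against it; the view body here is an assumption (the slide only shows its tail), grouping by the year column from the earlier table definition:

CREATE VIEW my_catalog.db.product_category_view AS
SELECT year, count(*) AS cnt
FROM my_catalog.db.tbl
GROUP BY year;

SELECT * FROM my_catalog.db.product_category_view ORDER BY year;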
my_catalog.db.tbl VERSION AS OF <SNAPSHOT_ID>

Time travel by version, timestamp and branch/tag:
SELECT * FROM my_catalog.db.tbl TIMESTAMP AS OF '<TIMESTAMP>'
SELECT * FROM my_catalog.db.tbl VERSION AS OF '<TAG_NAME | BRANCH_NAME>'

Current table:
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |2024|
|2  |Bob    |2023|
|3  |Charlie|2022|
|4  |Dave   |2021|
|5  |Elly   |2022|
+---+-------+----+

SELECT * FROM my_catalog.db.tbl VERSION AS OF 7394805868859573195
+---+-------+----+
|id |name   |year|
+---+-------+----+
|1  |Alice  |2024|
|2  |Bob    |2023|
|3  |Charlie|2022|
+---+-------+----+

Ref: https://iceberg.apache.org/docs/latest/spark-queries/#time-travel
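To find snapshot IDs, timestamps and reference names to plug into these queries, Iceberg's metadata tables can be queried directly; a minimal sketch (column lists abbreviated):

-- Every snapshot the table still retains, with when it was committed
SELECT committed_at, snapshot_id, operation FROM my_catalog.db.tbl.snapshots;

-- Table history: which snapshot was current when, and whether it is an ancestor of the current state
SELECT made_current_at, snapshot_id, is_current_ancestor FROM my_catalog.db.tbl.history;

-- Named references (branches and tags) and the snapshots they point to
SELECT name, type, snapshot_id FROM my_catalog.db.tbl.refs;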