
DeltaCAT: A Scalable Data Catalog for Ray Datasets

Most of today’s open-source data catalogs and data lakes are written for Java, with Python support either unavailable or tacked on as an afterthought. This can lead to awkward programming models and cross-language integration overhead for Python developers. To better connect Python developers to their data, we introduced the DeltaCAT project to the Ray ecosystem.
We’ll discuss how DeltaCAT leverages Ray Datasets to manage petabyte-scale data catalog tables. We’ll also review the goals of the project, how Amazon is using it internally, its current state, and future roadmap.


Anyscale

January 21, 2022


Transcript

  1. A Scalable Data Catalog for Ray Datasets

  2. Data Catalogs vs. Data Lakes
     • Data Lakes
       • Centralized repository for structured and unstructured data.
       • Durably store and retrieve raw data.
     • Data Catalogs
       • Enrich the data lake with additional metadata:
         • Discovery Mechanisms (namespaces, table names, partitions, versions, etc.)
         • Interpretation Instructions (schema, constraints, content type, encoding, etc.)
         • Key Insights (descriptions, audit logs, pre-computed aggregates, etc.)
  3. Current State of Data Catalogs & Python
     • Integrated Compute Engines
       • PyHive
         • HiveQL
       • PySpark
         • Spark SQL
         • DataFrame
         • Streaming
         • Pandas on Spark
       • PyFlink
         • DataStream
         • Table
     • Storage & Compute Based Largely on Preceding Java/SQL APIs and Architecture
       • Limited Feature Support
       • Not Pythonic
       • Inefficient
       • Cross-Language Integration Overhead
  4. Graph from SlashData, licensed under CC 4.0

  5. Graph from SlashData, licensed under CC 4.0

  6. DeltaCAT Project Goals
     1. Provide Intuitive, Pythonic Data Catalog APIs
     2. Bring Fast, Scalable ACID Transactions to Ray
     3. Give Ray Users a Consistent, Portable Interface for Interacting with Data Catalogs
     4. Provide Scalable, Efficient, and Reliable Implementations of Common Data Catalog Table Management Jobs
  8. DeltaCAT Interfaces
     • Storage Interface
       • Low-Level
       • Target Audience: Data Catalog Developers Integrating Their Catalog w/ Ray
       • Exposes:
         • Catalog Metadata Models and Structural Details
         • Table Partition Details
         • Table Revision and CDC/Delta Log Details
       • https://github.com/ray-project/deltacat/blob/main/deltacat/storage/interface.py
     • Catalog Interface
       • High-Level
       • Target Audience: Data Catalog Users Integrating Their Data w/ Ray
       • Default Implementation Uses the Storage Interface
       • Exposes:
         • Table-Level Writes of Ray Datasets
         • Table, Partition, and Incremental/Time-Bound Reads into Ray Datasets
       • https://github.com/ray-project/deltacat/blob/main/deltacat/catalog/delegate.py
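As a rough sketch of what a catalog developer might build against the low-level storage layer, the toy adapter below mimics the shape of a storage interface. All class and method names here are illustrative simplifications, not the actual DeltaCAT API (which lives at deltacat/storage/interface.py):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class ExampleStorage(ABC):
    """Illustrative stand-in for a DeltaCAT-style storage interface.

    Hypothetical method names; the real interface is at
    deltacat/storage/interface.py.
    """

    @abstractmethod
    def list_partitions(self, namespace: str, table: str) -> List[str]:
        """Return partition identifiers for a table."""

    @abstractmethod
    def list_deltas(
        self, namespace: str, table: str, partition: str
    ) -> List[Dict[str, Any]]:
        """Return the CDC/delta log entries for a partition."""


class InMemoryStorage(ExampleStorage):
    """Toy in-memory implementation for demonstration only."""

    def __init__(self) -> None:
        self._deltas: Dict[Tuple[str, str, str], List[Dict[str, Any]]] = {}

    def append_delta(self, namespace, table, partition, delta) -> None:
        # Record a delta-log entry under its (namespace, table, partition) key.
        key = (namespace, table, partition)
        self._deltas.setdefault(key, []).append(delta)

    def list_partitions(self, namespace, table):
        return sorted(
            {p for (ns, t, p) in self._deltas if ns == namespace and t == table}
        )

    def list_deltas(self, namespace, table, partition):
        return list(self._deltas.get((namespace, table, partition), []))
```

A real implementation would back these calls with durable storage (e.g. S3) rather than a dict, but the division of labor is the same: the storage layer exposes partitions and delta logs, and the higher-level catalog layer builds table reads and writes on top of them.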
  9. DeltaCAT Target Audience
     • Data Catalog Developers
       • Usually Implement the Storage Interface
         • Can reuse existing Ray catalog management job implementations.
         • Can reuse the existing Catalog implementation.
       • Sometimes Implement the Catalog Interface
         • Can’t reuse existing Ray catalog management job implementations.
         • Can’t reuse the existing Catalog implementation.
         • Supports more flexible storage models.
         • Supports more flexible external catalog management job implementations.
     • Data Catalog Users
       • Usually Use the Catalog Interface
         • Simpler, safer, standard API
       • Sometimes Use the Storage Interface
         • Complex, fewer guardrails, but more powerful API
  10. DeltaCAT Compute
     • Powered by Ray
     • Invoked Transparently by the DeltaCAT Catalog API
     • Leverages the DeltaCAT Storage API
     • Current Implementations
       • Compaction
         • Merges Table Partition CDC/Delta Logs
       • Statistics
         • Column, Delta, Partition, and Table-Level
     • Upcoming Implementations
       • Schema Evolution
       • Table Repair
       • Anomaly Detection
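Conceptually, compaction replays a partition's delta log in order and keeps only the latest record per primary key. The sketch below illustrates that idea in plain Python; it is not DeltaCAT's actual implementation (which runs distributed on Ray), and the "upsert"/"delete" record shapes are invented for illustration:

```python
def compact(deltas):
    """Replay a delta log and return the compacted table contents.

    deltas: ordered list of (op, record) pairs, where op is "upsert" or
    "delete" and record is a dict with a "pk" primary-key field.
    """
    table = {}
    for op, record in deltas:
        if op == "upsert":
            # A later delta for the same primary key wins.
            table[record["pk"]] = record
        elif op == "delete":
            table.pop(record["pk"], None)
    # Sort by primary key for a deterministic result.
    return [table[k] for k in sorted(table)]


# Example delta log: pk 1 is upserted twice, pk 2 is upserted then deleted.
deltas = [
    ("upsert", {"pk": 1, "v": "a"}),
    ("upsert", {"pk": 2, "v": "b"}),
    ("upsert", {"pk": 1, "v": "a2"}),  # later delta wins
    ("delete", {"pk": 2}),
]
# compact(deltas) -> [{"pk": 1, "v": "a2"}]
```

The real job additionally has to shard this work across partitions and workers, spill to object storage, and emit new compacted files, but the merge-by-key semantics are the core of it.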
  11. import deltacat as dc

      # Start by initializing DeltaCAT and registering available Catalogs.
      # Ray will be initialized automatically via `ray.init(address="auto")`.
      # Only the `prod` data catalog is provided so it will become the default.
      # A default catalog name and `ray.init()` args can also be given.
      dc.init(
          catalogs={
              "prod": dc.Catalog(
                  impl=example.s3.catalog,
                  uri="s3://sample-catalog"
              )
          }
      )
  12. import ray
      import pandas as pd

      # Create a Ray Distributed Dataset from a Pandas DataFrame.
      df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
      dataset = ray.data.from_pandas(df)

      # Write the Dataset to a new Parquet table named `sample-table` in the
      # `example` namespace of the `prod` catalog.
      dc.write_to_table(
          data=dataset,
          catalog="prod",
          namespace="example",
          table="sample-table",
          mode=dc.TableWriteMode.CREATE,
          content_type=dc.ContentType.PARQUET)

      # Equivalently, write the Dataset to `table-foo` in the default namespace
      # of the default Catalog (`prod`). `TableWriteMode.CREATE` will be
      # inferred if the table doesn’t exist, and the content type will default
      # to Parquet.
      dc.write_to_table(data=dataset, table="table-foo")
  20. # Read the table back into a Ray Distributed Dataset.
      dataset = dc.read_table(
          catalog="prod",
          namespace="example",
          table="sample-table")

      # Convert to a Distributed Pandas DataFrame using Modin.
      distributed_df = dataset.to_modin()

      prod.example.sample-table:
          column1  column2
          1        a
          2        b
          3        c
  23. # Create a Pandas DataFrame to append.
      df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

      # Append the DataFrame to the previously created table as a new
      # Parquet file.
      dc.write_to_table(
          data=df,
          catalog="prod",
          namespace="example",
          table="sample-table",
          mode=dc.TableWriteMode.APPEND,
          content_type=dc.ContentType.PARQUET)

      prod.example.sample-table:
          column1  column2
          1        a
          2        b
          3        c
          1        d
          2        e
          4        f
  38. # Create a Pandas DataFrame to append.
      df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

      # Replace the previously created table with this DataFrame.
      dc.write_to_table(
          data=df,
          catalog="prod",
          namespace="example",
          table="sample-table",
          mode=dc.TableWriteMode.REPLACE,
          content_type=dc.ContentType.PARQUET)

      prod.example.sample-table:
          column1  column2
          1        d
          2        e
          4        f
  43. # Create a Pandas DataFrame.
      df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})

      # Write the DataFrame to a new Parquet table using `column1` as the
      # primary key.
      dc.write_to_table(
          data=df,
          catalog="prod",
          namespace="example",
          table="sample-pk-table",
          mode=dc.TableWriteMode.CREATE,
          content_type=dc.ContentType.PARQUET,
          primary_keys=["column1"])

      prod.example.sample-pk-table:
          column1  column2
          1        a
          2        b
          3        c
  44. # Create a Pandas DataFrame to upsert.
      df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

      # Merge/upsert the DataFrame into the previously created table.
      dc.write_to_table(
          data=df,
          catalog="prod",
          namespace="example",
          table="sample-pk-table",
          mode=dc.TableWriteMode.MERGE,
          content_type=dc.ContentType.PARQUET)

      prod.example.sample-pk-table:
          column1  column2
          1        d
          2        e
          3        c
          4        f
  46. Supported Write Modes by Table Type

      Table Type         Append  Replace  Merge Upsert  Merge Delete
      Standard Table     Y       Y        N             N
      Primary Key Table  N       Y        Y             Y
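The support matrix on this slide is easy to encode in code. The helper below is hypothetical (not part of DeltaCAT) and only demonstrates the rules stated above: standard tables support append and replace, while primary-key tables support replace, merge upsert, and merge delete:

```python
# Hypothetical helper encoding the write-mode support matrix above.
SUPPORTED_MODES = {
    "standard": {"append", "replace"},
    "primary_key": {"replace", "merge_upsert", "merge_delete"},
}


def check_write_mode(table_type: str, mode: str) -> None:
    """Raise ValueError if `mode` is not supported for `table_type`."""
    if mode not in SUPPORTED_MODES[table_type]:
        raise ValueError(
            f"{mode!r} writes are not supported on {table_type} tables"
        )


check_write_mode("standard", "append")          # OK
check_write_mode("primary_key", "merge_upsert") # OK
```

Note the asymmetry: appends to a primary-key table would bypass key uniqueness, so merge is the only way to add rows there; conversely, standard tables have no keys to merge on.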
  47. Dataset Read Benchmarks
      deltacat.read_table(), 100 samples from prod tables <= 1 TiB

      • S3 / Parquet
        • Cluster: 8 r5n.8xlarge (256 vCPUs, 200 Gbps)
        • Dataset Files (avg): 304
        • Dataset Size (avg): 161 GB (67,220,538 rows)
        • Performance (avg): 63 Gbps (7.84 GB/s, 28 TB/hour)
        • Efficiency (avg): $0.76/TB, 12,656 rows/s-core
      • S3 / CSV
        • Cluster: 8 r5n.8xlarge (256 vCPUs, 200 Gbps)
        • Dataset Files (avg): 283
        • Dataset Size (avg): 143 GB (88,917,356 rows)
        • Performance (avg): 41 Gbps (5.13 GB/s, 18.5 TB/hour)
        • Efficiency (avg): $1.16/TB, 12,460 rows/s-core
  48. Primary Key Merge Benchmarks
      432 merge upsert samples w/ prod datasets >= 10 TiB

      • S3 / Parquet, 250 r5n.8xlarge (8000 vCPUs, 6250 Gbps)
        • Input Size: 129 TB (2 GB/file)
        • Performance: 3176 Gbps (397 GB/s, 1429 TB/hour)
        • Efficiency: $0.42/TB, 29,233 rows/s-core
      • S3 / Parquet, 110 r5n.8xlarge (3520 vCPUs, 2750 Gbps)
        • Input Size: 1.2 PB (2 GB/file)
        • Performance: 862 Gbps (108 GB/s, 388 TB/hour)
        • Efficiency: $0.68/TB, 18,115 rows/s-core
      • S3 / Parquet, 11 r5n.8xlarge (352 vCPUs, 275 Gbps)
        • Input Size: 22 TB (2 GB/file)
        • Performance: 264 Gbps (33 GB/s, 119 TB/hour)
        • Efficiency: $0.22/TB, 61,477 rows/s-core
      • Averages: 130.5 r5n.8xlarge (4176 vCPUs, 3262.5 Gbps)
        • Input Size: 75 TB (2 GB/file)
        • Performance: 1664 Gbps (208 GB/s, 749 TB/hour)
        • Efficiency: $0.33/TB, 44,877 rows/s-core
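The benchmark slides report the same throughput three ways (Gbps, GB/s, TB/hour). As a quick arithmetic sanity check, assuming 8 bits per byte and decimal units (1000 GB per TB), the conversions are:

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert network gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbps / 8.0


def gbytes_per_s_to_tb_per_hour(gb_s: float) -> float:
    """Convert GB/s to TB/hour (3600 s/hour, 1000 GB/TB)."""
    return gb_s * 3600.0 / 1000.0


# E.g. the 250-node merge benchmark above reports 3176 Gbps:
gb_s = gbps_to_gbytes_per_s(3176)        # 397.0 GB/s
tb_h = gbytes_per_s_to_tb_per_hour(gb_s) # 1429.2 TB/hour
```

These agree with the figures reported on the slides (397 GB/s and 1429 TB/hour for the 250-node run; 33 GB/s and ~119 TB/hour for the 11-node run), which confirms the columns are just unit conversions of one measured rate.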
  49. Current State at Amazon
     • DeltaCAT Storage Implemented for Prod S3-Based Catalog
     • Use Cases
       • New Initiatives
         • Near-Real-Time Table Statistics and Data Quality Analysis
         • Interactive Data Exploration and Analysis for ML Workflows
       • Compaction Migration
         • Shadowing Production Jobs on Spark EMR
         • Reduced Cost (91% in terms of $/byte)
         • Improved Scalability (12X larger input datasets)
         • Improved Throughput (13X in terms of bytes/s)
         • Job Completion Time SLA Guarantees
  50. Next Steps
     • Documentation!
     • Reference DeltaCAT Storage Implementation
     • Reference DeltaCAT Catalog Implementation
     • Native Ray Dataset API Integration
     • Introductory and Technical How-To Blog Posts
     • Apache Iceberg Integration (PyIceberg)
     • Transition Core Amazon Table Management Jobs to DeltaCAT
       • Exabyte-Scale Catalog
       • >70K jobs/day average
       • >1 PiB-deltas/day (>5.5 trillion records/day) average
       • >17 PiB-deltas/day (>80 trillion records/day) peak
  51. Thanks! Questions? Comments? Want to Get Involved?
      Visit us on GitHub: https://github.com/ray-project/deltacat
      Feel free to reach out to me on the Ray Community Slack!