
DeltaCAT: A Scalable Data Catalog for Ray Datasets

Anyscale
January 21, 2022


Most of today’s open-source data catalogs and data lakes are written for Java, with Python support either unavailable or tacked on as an afterthought. This can lead to awkward programming models and cross-language integration overhead for Python developers. To better connect Python developers to their data, we introduced the DeltaCAT project to the Ray project ecosystem.
We’ll discuss how DeltaCAT leverages Ray Datasets to manage petabyte-scale data catalog tables. We’ll also review the goals of the project, how Amazon is using it internally, its current state, and its future roadmap.



Transcript

1. Data Catalogs vs. Data Lakes
   • Data Lakes
     • Centralized repository for structured and unstructured data.
     • Durably store and retrieve raw data.
   • Data Catalogs
     • Enrich the data lake with additional metadata.
     • Discovery mechanisms (namespaces, table names, partitions, versions, etc.)
     • Interpretation instructions (schema, constraints, content type, encoding, etc.)
     • Key insights (descriptions, audit logs, pre-computed aggregates, etc.)
2. Current State of Data Catalogs & Python
   • Integrated Compute Engines
     • PyHive
       • HiveQL
     • PySpark
       • Spark SQL
       • DataFrame
       • Streaming
       • Pandas on Spark
     • PyFlink
       • DataStream
       • Table
   • Storage & Compute Based Largely on Preceding Java/SQL APIs and Architecture
     • Limited Feature Support
     • Not Pythonic
     • Inefficient
     • Cross-Language Integration Overhead
3. DeltaCAT Project Goals
   1. Provide Intuitive, Pythonic Data Catalog APIs
   2. Bring Fast, Scalable, ACID Transactions to Ray
   3. Give Ray Users a Consistent, Portable Interface for Interacting with Data Catalogs
   4. Provide Scalable, Efficient, and Reliable Implementations of Common Data Catalog Table Management Jobs
4. DeltaCAT Interfaces
   • Storage Interface
     • Low-Level
     • Target Audience: Data Catalog Developers Integrating Their Catalog w/ Ray
     • Exposes
       • Catalog metadata models and structural details
       • Table partition details
       • Table revision and CDC/delta log details
     • https://github.com/ray-project/deltacat/blob/main/deltacat/storage/interface.py
   • Catalog Interface
     • High-Level
     • Target Audience: Data Catalog Users Integrating Their Data w/ Ray
     • Default implementation uses the Storage Interface
     • Exposes
       • Table-level writes of Ray Datasets
       • Table, partition, and incremental/time-bound reads into Ray Datasets
     • https://github.com/ray-project/deltacat/blob/main/deltacat/catalog/delegate.py
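As a rough sketch of where the Storage Interface sits, a catalog developer wires their backend to Ray by implementing module-level functions along the lines below. The function names and signatures here are assumptions inferred from the bullets above, not the actual contract; see interface.py at the link for the real definitions.

    # Hypothetical sketch only: names and signatures are assumptions, not the
    # real deltacat/storage/interface.py contract.
    def list_namespaces(*args, **kwargs):
        """Expose catalog metadata models and structural details."""
        ...

    def list_partitions(namespace, table_name, *args, **kwargs):
        """Expose table partition details."""
        ...

    def list_deltas(namespace, table_name, partition_values, *args, **kwargs):
        """Expose table revision and CDC/delta log details."""
        ...

Data catalog users, by contrast, mostly stay at the Catalog Interface level (dc.write_to_table / dc.read_table), as the examples later in this deck show.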
5. DeltaCAT Target Audience
   • Data Catalog Developers
     • Usually Implement the Storage Interface
       • Can reuse existing Ray catalog management job implementations.
       • Can reuse the existing Catalog implementation.
     • Sometimes Implement the Catalog Interface
       • Can’t reuse existing Ray catalog management job implementations.
       • Can’t reuse the existing Catalog implementation.
       • Supports more flexible storage models.
       • Supports more flexible external catalog management job implementations.
   • Data Catalog Users
     • Usually Use the Catalog Interface
       • Simpler, safer, standard API
     • Sometimes Use the Storage Interface
       • Complex API with fewer guardrails, but more powerful
6. DeltaCAT Compute
   • Powered by Ray
   • Invoked Transparently by the DeltaCAT Catalog API
   • Leverages the DeltaCAT Storage API
   • Current Implementations
     • Compaction (conceptual sketch below)
       • Merges Table Partition CDC/Delta Logs
     • Statistics
       • Column, Delta, Partition, and Table-Level
   • Upcoming Implementations
     • Schema Evolution
     • Table Repair
     • Anomaly Detection
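To make the compaction job above concrete, here is a minimal conceptual sketch (plain pandas, not the DeltaCAT compute implementation) of what merging a partition’s CDC/delta log by primary key amounts to: replay the deltas in commit order and keep the latest row per key.

    import pandas as pd

    # Conceptual illustration only, not the DeltaCAT compaction job:
    # previously compacted rows plus two upsert deltas, replayed in order.
    compacted = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
    delta_log = [
        pd.DataFrame({"column1": [1, 4], "column2": ["d", "f"]}),  # upsert delta 1
        pd.DataFrame({"column1": [2], "column2": ["e"]}),          # upsert delta 2
    ]
    merged = pd.concat([compacted, *delta_log], ignore_index=True)
    # Later deltas win: keep the last occurrence of each primary key value.
    compacted = (merged.drop_duplicates(subset=["column1"], keep="last")
                       .sort_values("column1"))
    print(compacted)  # column1: 1, 2, 3, 4 -> column2: d, e, c, f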
7. import deltacat as dc

   # Start by initializing DeltaCAT and registering available Catalogs.
   # Ray will be initialized automatically via `ray.init(address="auto")`.
   # Only the `prod` data catalog is provided, so it will become the default.
   # A default catalog name and `ray.init()` args can also be given.
   dc.init(
       catalogs={
           "prod": dc.Catalog(
               impl=example.s3.catalog,
               uri="s3://sample-catalog"
           )
       }
   )
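The comments above note that multiple catalogs, a default catalog name, and `ray.init()` arguments can also be supplied. A hedged sketch of what that might look like follows; the `default_catalog_name` and `ray_init_args` keyword names (and the second `dev` catalog) are illustrative assumptions, not confirmed parts of the `dc.init()` signature.

    # Sketch only: keyword names below are assumptions inferred from the
    # comments on the previous slide, not confirmed dc.init() parameters.
    dc.init(
        catalogs={
            "prod": dc.Catalog(impl=example.s3.catalog, uri="s3://sample-catalog"),
            # Hypothetical second catalog registered alongside `prod`.
            "dev": dc.Catalog(impl=example.s3.catalog, uri="s3://sample-dev-catalog"),
        },
        default_catalog_name="prod",        # assumed parameter name
        ray_init_args={"address": "auto"},  # assumed parameter name
    )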
8. import ray
   import pandas as pd

   # Create a Ray Distributed Dataset from a Pandas DataFrame.
   df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
   dataset = ray.data.from_pandas(df)

   # Write the Dataset to a new Parquet table named `sample-table` in the
   # `example` namespace of the `prod` catalog.
   dc.write_to_table(
       data=dataset,
       catalog="prod",
       namespace="example",
       table="sample-table",
       mode=dc.TableWriteMode.CREATE,
       content_type=dc.ContentType.PARQUET)

   # Equivalently, write the Dataset to `table-foo` in the default namespace of
   # the default Catalog (`prod`). `TableWriteMode.CREATE` will be inferred if
   # the table doesn't exist, and the content type will default to Parquet.
   dc.write_to_table(data=dataset, table="table-foo")
9. # Read the table back into a Ray Distributed Dataset.
   dataset = dc.read_table(
       catalog="prod",
       namespace="example",
       table="sample-table")

   # Convert to a Distributed Pandas DataFrame using Modin.
   distributed_df = dataset.to_modin()

   prod.example.sample-table:
       column1  column2
       1        a
       2        b
       3        c
10. # Create a Pandas DataFrame to append.
    df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

    # Append the DataFrame to the previously created table as a new Parquet file.
    dc.write_to_table(
        data=df,
        catalog="prod",
        namespace="example",
        table="sample-table",
        mode=dc.TableWriteMode.APPEND,
        content_type=dc.ContentType.PARQUET)

    prod.example.sample-table:
        column1  column2
        1        a
        2        b
        3        c
        1        d
        2        e
        4        f
11. # Create a Pandas DataFrame to replace the table contents with.
    df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

    # Replace the previously created table with this DataFrame.
    dc.write_to_table(
        data=df,
        catalog="prod",
        namespace="example",
        table="sample-table",
        mode=dc.TableWriteMode.REPLACE,
        content_type=dc.ContentType.PARQUET)

    prod.example.sample-table:
        column1  column2
        1        d
        2        e
        4        f
12. # Create a Pandas DataFrame.
    df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})

    # Write the DataFrame to a new Parquet table using `column1` as the primary key.
    dc.write_to_table(
        data=df,
        catalog="prod",
        namespace="example",
        table="sample-pk-table",
        mode=dc.TableWriteMode.CREATE,
        content_type=dc.ContentType.PARQUET,
        primary_keys=["column1"])

    prod.example.sample-pk-table:
        column1  column2
        1        a
        2        b
        3        c
13. # Create a Pandas DataFrame to upsert.
    df = pd.DataFrame({"column1": [1, 2, 4], "column2": ["d", "e", "f"]})

    # Merge/upsert the DataFrame into the previously created table.
    dc.write_to_table(
        data=df,
        catalog="prod",
        namespace="example",
        table="sample-pk-table",
        mode=dc.TableWriteMode.MERGE,
        content_type=dc.ContentType.PARQUET)

    prod.example.sample-pk-table:
        column1  column2
        1        d
        2        e
        3        c
        4        f
14. Supported Write Modes by Table Type

    Table Type          Append   Replace   Merge Upsert   Merge Delete
    Standard Table      Y        Y         N              N
    Primary Key Table   N        Y         Y              Y
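Read together with the earlier examples: plain appends target standard tables, merge upserts (and merge deletes) require a primary key table, and replace works for both. A brief recap using the calls shown above:

    # Standard table (no primary keys): APPEND and REPLACE are supported.
    dc.write_to_table(data=df, catalog="prod", namespace="example",
                      table="sample-table", mode=dc.TableWriteMode.APPEND)

    # Primary key table: REPLACE and MERGE (upsert/delete) are supported;
    # plain APPEND is not, per the table above.
    dc.write_to_table(data=df, catalog="prod", namespace="example",
                      table="sample-pk-table", mode=dc.TableWriteMode.MERGE)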
15. Dataset Read Benchmarks
    deltacat.read_table(), 100 samples from prod tables <= 1 TiB

    • S3 / Parquet
      • Cluster: 8 r5n.8xlarge (256 vCPUs, 200 Gbps)
      • Dataset (avg): 304 files, 161 GB, 67,220,538 rows
      • Performance (avg): 63 Gbps (7.84 GB/s, 28 TB/hour)
      • Efficiency (avg): $0.76/TB, 12,656 rows/s-core
    • S3 / CSV
      • Cluster: 8 r5n.8xlarge (256 vCPUs, 200 Gbps)
      • Dataset (avg): 283 files, 143 GB, 88,917,356 rows
      • Performance (avg): 41 Gbps (5.13 GB/s, 18.5 TB/hour)
      • Efficiency (avg): $1.16/TB, 12,460 rows/s-core
16. Primary Key Merge Benchmarks
    432 merge upsert samples with prod datasets >= 10 TiB

    • S3 / Parquet
      • Cluster: 250 r5n.8xlarge (8000 vCPUs, 6250 Gbps)
      • Input Size: 129 TB (2 GB/file)
      • Performance: 3176 Gbps (397 GB/s, 1429 TB/hour)
      • Efficiency: $0.42/TB, 29,233 rows/s-core
    • S3 / Parquet
      • Cluster: 110 r5n.8xlarge (3520 vCPUs, 2750 Gbps)
      • Input Size: 1.2 PB (2 GB/file)
      • Performance: 862 Gbps (108 GB/s, 388 TB/hour)
      • Efficiency: $0.68/TB, 18,115 rows/s-core
    • S3 / Parquet
      • Cluster: 11 r5n.8xlarge (352 vCPUs, 275 Gbps)
      • Input Size: 22 TB (2 GB/file)
      • Performance: 264 Gbps (33 GB/s, 119 TB/hour)
      • Efficiency: $0.22/TB, 61,477 rows/s-core
    • Averages
      • Cluster: 130.5 r5n.8xlarge (4176 vCPUs, 3262.5 Gbps)
      • Input Size: 75 TB (2 GB/file)
      • Performance: 1664 Gbps (208 GB/s, 749 TB/hour)
      • Efficiency: $0.33/TB, 44,877 rows/s-core
17. Current State at Amazon
    • DeltaCAT Storage Implemented for Prod S3-Based Catalog
    • Use Cases
      • New Initiatives
        • Near-Real-Time Table Statistics and Data Quality Analysis
        • Interactive Data Exploration and Analysis for ML Workflows
      • Compaction Migration
        • Shadowing Production Jobs on Spark EMR
        • Reduced Cost (91% in terms of $/byte)
        • Improved Scalability (12X larger input datasets)
        • Improved Throughput (13X in terms of bytes/s)
        • Job Completion Time SLA Guarantees
18. Next Steps
    • Documentation!
    • Reference DeltaCAT Storage Implementation
    • Reference DeltaCAT Catalog Implementation
    • Native Ray Dataset API Integration
    • Introductory and Technical How-To Blog Posts
    • Apache Iceberg Integration (PyIceberg)
    • Transition Core Amazon Table Management Jobs to DeltaCAT
      • Exabyte-Scale Catalog
      • >70K jobs/day average
      • >1 PiB-deltas/day (>5.5 trillion records/day) average
      • >17 PiB-deltas/day (>80 trillion records/day) peak
19. Thanks! Questions? Comments? Want to Get Involved?
    Visit us on GitHub: https://github.com/ray-project/deltacat
    Feel free to reach out to me on the Ray Community Slack!