Theseus: A composable, distributed, hardware agnostic processing engine

Theseus is a composable, scalable, distributed, high-performance data analytics engine. This talk covers how the accelerator-native GPU engine became hardware agnostic by leveraging Velox as a CPU backend.

William Malpica
Distinguished Engineer & Co-Founder at Voltron Data

Amin Aramoon, PhD
Software Developer at Voltron Data

Ali LeClerc

April 09, 2024

Transcript

  1. Three Mega Trends

     • Compute: We need data processing engines built to leverage system-level acceleration, from compute to memory, networking, and storage. We call this accelerator-native. The time is now to run distributed ETL workloads on hardware-accelerated infrastructure.
     • Composability: Composability lets teams augment, adapt, or replace components of their data systems to unlock new use cases. Sometimes you can augment; other times you can adapt for new paradigms. By embracing composability, you can be proactive and flexible.
     • Community: Open standards allow faster innovation. We are building standards for data systems, just like internet protocols. We build, maintain, and contribute to core open standards and open source projects such as Arrow, Ibis, RAPIDS, Substrait, Velox, and more.
  2. The Wall

     Spark vs. Theseus: TPC-H 10 TB benchmark.
     [Chart: Total Runtime (seconds) vs. Cluster Cost per Hour ($) for Spark EMR and Theseus]
     CPU performance is capped. No amount of money will jump over this wall.
     Note: Theseus: 1 node, 8 x A100 80 GB; Spark: 1 node r5.8xlarge (AWS), 32 VCPU, 32 GB.
  3. Scaling Datasets

     TPC-H (10 TB, 30 TB, 100 TB)
     ✓ Up to 10 DGX servers ✓ Parquet files ✓ Remote file system ✓ Lots of spilling
     ✘ No sorting ✘ No indexing ✘ No caching ✘ No warm-up (cold queries)
     [Chart: total runtime for Spark EMR (10 TB, 30 TB) and Theseus (10 TB, 30 TB, 100 TB) against "The Total Wall"]
     Note: Theseus: 1 node, 8 x A100 80 GB; Spark: 1 node r5.8xlarge (AWS), 32 VCPU, 32 GB.
  4. Theseus: A composable, accelerated data processing engine built on GPUs

     • Scale to problems too big for Spark: the only distributed GPU engine, 72x faster and 71x cheaper, with efficient spilling out of GPU memory, linear scaling to massive data problems, and a seamless move from dev to prod with the same code.
     • Built from the ground up for accelerators: upgrade and diversify hardware by leveraging GPUs, x86, ARM, InfiniBand, RoCE, NVLink, and more; harness underutilized, generally available GPUs to improve economics; reduce data center footprint by 6x.
     • Evolve as enterprise needs change: confidently leverage GPUs, x86, ARM, and future hardware innovations for data preprocessing; process analytics and AI workflows on the same silicon; use multiple programming languages (Python, R, Java, Rust, C++); operate on data where it is.
     • What it's NOT: a framework, database, data lake, walled garden, monolith, SaaS, data warehouse, file system, cloud, or support service.
  5. Voltron Data Theseus: A compute mesh unifying hardware, languages, and applications

     File formats: CSV, Avro, JSON, Parquet, ORC. Runs on laptops, servers, and the cloud.
     1. Accelerator-Native: A distributed query engine built from the ground up to take advantage of full-system hardware acceleration.
     2. Petabyte Scale: Focused on problems too big and too time-sensitive for Spark.
     3. Composable: Built on open source standards that enable interoperability from storage to application.
     4. Evolutionary: A composable engine that seamlessly adapts to new hardware and languages.
  6. Arrow Everywhere

     Modular standards, widely adopted at more than 80M monthly downloads.
     • Internally, all data is Arrow.
     • All data is returned as Arrow.
     This allows users to:
     • interoperate with Arrow-native libraries,
     • compose efficient data pipelines faster, and
     • extend functionality for evolving needs.
     For more on our thoughts on modularity, interoperability, composability, and extensibility, check out https://voltrondata.com/codex
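Because results come back as Arrow, they can be handed straight to Arrow-native tooling. Below is a minimal sketch of that interoperability using PyArrow; the table contents and column names are made up for illustration and are not from the deck.

```python
# Minimal interoperability sketch: a result table in Arrow format can be
# consumed by Arrow-native libraries without converting or copying the data.
import pyarrow as pa

# Stand-in for a result batch an Arrow-producing engine might return.
result = pa.table({
    "n_name": ["BRAZIL", "CANADA", "BRAZIL"],
    "customers": [120, 45, 80],
})

# Arrow-native compute works directly on the same columnar buffers.
totals = result.group_by("n_name").aggregate([("customers", "sum")])
print(totals.to_pydict())

# The same table flows into other Arrow-aware tools, e.g. pandas or DuckDB.
df = result.to_pandas()
```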
  7. Design Freedom with Theseus

     Ibis Python API: write in your language of choice and operate seamlessly on the same data inside more than 22 data engines.
     • One analytics interface, many data engines
     • No lock-in or rewrites
     • Support for multiple user skill sets
     • No proprietary formats
     • Reads and writes all common big data formats
     • Fully integrates with existing data lakes
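As a rough illustration of "one analytics interface, many data engines", the same Ibis expression can be executed on a local backend or rendered as SQL for another engine; the in-memory data and column names below are assumptions made for this sketch.

```python
# One Ibis expression, many possible engines behind it. The data and column
# names here are hypothetical; only the Ibis API usage is the point.
import ibis

orders = ibis.memtable({"region": ["EU", "EU", "US"], "amount": [10.0, 20.0, 5.0]})

expr = orders.group_by("region").aggregate(total=orders.amount.sum())

print(expr.execute())     # runs on the default local backend (DuckDB)
print(ibis.to_sql(expr))  # the same expression rendered as SQL for another engine
```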
  8. Why make Theseus run on CPU?

     Theseus: Velox backend, hardware agnostic for greater flexibility.
     • Gives Theseus the flexibility to run on clusters without GPUs, making it easier and cheaper for users to test their workloads before scaling up.
     • Helps us break down system assumptions during architectural development, ensuring the composability of our architecture while maintaining performance.
     • Paves the way for running CPU-only UDFs.
     • Opens the door to optimizations via hybrid compute: running some operations on the CPU while running others on the GPU.
  9. Theseus: Query Graph

     A query becomes a query graph:

     SELECT n.n_name, count(c.c_custkey)
     FROM customer AS c
     INNER JOIN nation AS n ON c.c_nationkey = n.n_nationkey
     WHERE n.n_nationkey < 10
     GROUP BY n.n_name
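The slide shows only the SQL; as an illustration of what the corresponding operator graph could look like, here is a rough sketch. The node names and structure are illustrative assumptions, not Theseus internals.

```python
# Illustrative only: one plausible operator graph for the query above.
# The node names are assumptions, not Theseus internals.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                      # e.g. "TableScan", "Filter", "HashJoin"
    detail: str = ""
    inputs: list = field(default_factory=list)

scan_nation   = Node("TableScan", "nation")
filter_nation = Node("Filter", "n_nationkey < 10", [scan_nation])
scan_customer = Node("TableScan", "customer")
join          = Node("HashJoin", "c_nationkey = n_nationkey", [scan_customer, filter_nation])
aggregate     = Node("Aggregate", "group by n_name; count(c_custkey)", [join])

def show(node: Node, depth: int = 0) -> None:
    print("  " * depth + f"{node.op}({node.detail})")
    for child in node.inputs:
        show(child, depth + 1)

show(aggregate)  # prints the graph from the final operator down to the scans
```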
  10. Theseus-Velox Backend: Development

      • cuDF executor: compute resources and GPU memory management.
      • Velox executor: compute resources and CPU memory management.
      Each SQL operator was implemented using Velox.
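To make the two-executor idea concrete, here is a purely hypothetical sketch of dispatching the same operator to a GPU or CPU path; the class and method names are invented for this illustration and are not Theseus, cuDF, or Velox APIs.

```python
# Hypothetical sketch of backend dispatch: the same operator interface backed
# by either GPU (cuDF-style) or CPU (Velox-style) execution. Class and method
# names are invented for illustration.
import pyarrow as pa
import pyarrow.compute as pc

class CpuExecutor:
    """Stand-in for a Velox-backed executor: runs the operator in host memory."""
    def run_filter(self, table: pa.Table, column: str, less_than: int) -> pa.Table:
        return table.filter(pc.less(table[column], less_than))

class GpuExecutor:
    """Stand-in for a cuDF-backed executor: runs the same operator on the GPU."""
    def run_filter(self, table: pa.Table, column: str, less_than: int) -> pa.Table:
        import cudf  # only available on machines with an NVIDIA GPU
        gdf = cudf.DataFrame.from_arrow(table)
        return gdf[gdf[column] < less_than].to_arrow()

def make_executor(use_gpu: bool):
    # The engine and operator stay the same; only the executor behind it changes.
    return GpuExecutor() if use_gpu else CpuExecutor()

nation = pa.table({"n_nationkey": [1, 5, 12], "n_name": ["a", "b", "c"]})
print(make_executor(use_gpu=False).run_filter(nation, "n_nationkey", 10))
```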
  11. Theseus - Velox Backend: Current Deficiencies

      Each SQL operator is independent, so there is no internal pipelining.
      [Diagram: a Theseus pipeline (without internal pipelines) compared with a Velox pipeline]
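A rough way to picture the difference, using made-up toy operators: without internal pipelining, each operator materializes its full output before the next operator starts, whereas a pipelined executor streams each batch through all operators.

```python
# Toy contrast between operator-at-a-time execution and internal pipelining.
# The operators are made up for the sketch; only the execution shape matters.
def scan():
    for i in range(3):
        yield list(range(i * 4, i * 4 + 4))   # produce one batch at a time

def keep_even(batch):
    return [x for x in batch if x % 2 == 0]

def double(batch):
    return [x * 2 for x in batch]

# Without internal pipelining: each operator runs over the entire input and
# materializes an intermediate result before the next operator begins.
def run_materialized():
    batches = list(scan())
    filtered = [keep_even(b) for b in batches]
    return [double(b) for b in filtered]

# With pipelining: each batch flows through every operator before the next
# batch is read, so intermediates stay small and stay hot in memory.
def run_pipelined():
    return [double(keep_even(b)) for b in scan()]

assert run_materialized() == run_pipelined()
```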
  12. Theseus - Velox Backend: Current Deficiencies

      Unnecessary copies and conversions are resulting in a 50%+ performance loss. We currently need to convert between Velox RowVectors and Arrow tables:
      • Primitive types -> zero copy
      • String columns -> copy
      • Columns with dictionary encoding -> flatten (copy)
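To illustrate only the Arrow side of these costs (the Velox RowVector bridge itself is not shown), the sketch below uses PyArrow to show why flattening a dictionary-encoded column is a copy that scales with row count, while primitive buffers can be re-windowed with zero copy.

```python
# Arrow-side illustration of the conversion costs: flattening a dictionary-
# encoded column rebuilds a dense array (a copy), while primitive data can
# be viewed without copying. The Velox bridge itself is not shown here.
import pyarrow as pa
import pyarrow.compute as pc

# Dictionary-encoded string column: a small dictionary plus integer indices.
dict_col = pa.array(["BRAZIL", "CANADA", "BRAZIL", "BRAZIL"]).dictionary_encode()
print(dict_col.type)  # dictionary<values=string, indices=int32, ...>

# "Flatten" materializes one dense string per row: a real copy that grows
# with the number of rows, not with the dictionary size.
flattened = pc.take(dict_col.dictionary, dict_col.indices)
print(flattened)

# By contrast, a primitive column can be re-windowed without copying buffers.
ints = pa.array([1, 2, 3, 4], type=pa.int64())
view = ints.slice(1, 2)  # zero-copy view over the same underlying buffer
```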
  13. Apache Arrow - Velox Alignment: Key Accomplishments & Plan

      • Jun '22 - Meta / VoDa partnership: drive alignment in the Apache Arrow and Velox OSS communities for seamless interoperability at zero cost.
      • Dec '23 - Prior Arrow releases: three new format layouts developed (StringView, ListView & Run-End Encoding); adding the new layouts to PyArrow & DuckDB is in progress (some of this code has already merged).
      • Jan '24 - Arrow 15 release: Velox & Apache Arrow unified on composable data management systems design (see the blog); Apache Arrow ListView and StringView support in the Velox bridge is coming soon.
      • Mar '24 - Announced the partnership (see blog), with the goal of aligning and converging Apache Arrow with Velox.
  14. Theseus - Infrastructure Recommendations

      What would an ideal data system look like?
      • Minimum hardware: NVIDIA DGX V100 (32 GB GPUs) servers; 64 GB RAM per NVIDIA GPU; 100 Gbps Ethernet networking, 1 NIC shared between 2 GPUs.
      • Better hardware: NVIDIA DGX A100 (40 GB GPUs) servers; 80 GB RAM per NVIDIA GPU; 100 Gbps RoCE networking, 1 NIC per GPU; RoCE network-attached storage.
      • Ideal hardware: NVIDIA DGX H100/A100 80 GB servers; 160 GB RAM per NVIDIA GPU; 200+ Gbps InfiniBand networking; InfiniBand network-attached storage.