
Safeguarding Against OOM Kills: Strategies for Compute-Efficient Systems

As compute resources grow more expensive and the demand for efficiency keeps rising, we need a fundamental shift in how we approach computing. Our reliance on in-memory systems has led to an unsustainable practice: overprovisioning infrastructure to avoid memory-related crashes (OOM kills). To achieve truly scalable and cost-effective computing, we must unlock the full potential of underutilized CPUs and GPUs and build OOM-proof systems that are less memory-intensive, ultimately reducing our reliance on expensive infrastructure.

Vikram Joshi
Founder, President & CTO at ComputeAI

Ali LeClerc

April 05, 2024

Transcript

  1. ComputeAI Mission — Make compute abundant and infinitely scalable
     • Processor cores and memory are two sides of the same coin
     • Efficient memory management frees up cores needed for compute
     • Why: TCO
     • Targeted applications: AI/ML (GPU), analytics (CPU)
  2. Open Architecture
     • SQL front ends: Presto SQL, Spark SQL
     • Drivers: Spark JDBC Server
     • Deployment: AWS, Azure, GCP, on-prem
     • ComputeAI Microkernel
     • Raw data: Parquet; metadata: Apache Iceberg
     • PostgreSQL grants: row, column, and table ACLs
  3. Architecture
     • Complex Compute Microkernel
     • Open-source SQL engines: Spark SQL, Presto SQL
     • Open data lake: Parquet, Iceberg
     • Workloads: normal workloads alongside complex/AI-generated SQL from GPT-4, autonomous BI & ELT, AI-powered BI apps, and low/no-code apps
  4. Problem
     1. Compute inefficiency & memory failures: infra over-provisioning, low CPU/GPU utilization, OOM kills
     2. I/O waits: network I/O waits for shuffles, disk I/O waits for demand paging
     3. Memory stalls: processor cores stalling on memory latency (not on lack of bandwidth)
  5. Solution
     • Custom microkernel runtime for masking I/O waits and reducing memory stalls
     • Data hypervisor for memory tiering & SLAs
     • AI/ML-driven JIT spill-to-disk
     • TLS-based thread context switching
     • Multithreading infra lets operator-layer code run in parallel like "tensor cores"
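The slide names JIT spill-to-disk but shows no implementation. A minimal sketch of the idea, with hypothetical names (not ComputeAI's actual code): an operator buffers rows in memory and spills its largest partition to disk only at the moment an insert would exceed its budget, rather than pre-reserving worst-case memory.

```python
import os
import pickle
import tempfile

class SpillableBuffer:
    """Toy operator buffer that spills partitions just-in-time.

    Hypothetical sketch: a real engine tracks reservations through a
    memory pool and spills in larger, I/O-friendly batches.
    """

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.mem = {}      # partition id -> (rows, bytes held in memory)
        self.spilled = {}  # partition id -> list of spill file paths

    def insert(self, pid, row, row_bytes):
        # JIT spill: free memory only when this insert would exceed the budget.
        while self._used() + row_bytes > self.budget and self.mem:
            self._spill_largest()
        rows, size = self.mem.get(pid, ([], 0))
        rows.append(row)
        self.mem[pid] = (rows, size + row_bytes)

    def _used(self):
        return sum(size for _, size in self.mem.values())

    def _spill_largest(self):
        # Spill the biggest in-memory partition to free the most memory.
        pid = max(self.mem, key=lambda p: self.mem[p][1])
        rows, _ = self.mem.pop(pid)
        fd, path = tempfile.mkstemp(suffix=f".part{pid}")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(rows, f)
        self.spilled.setdefault(pid, []).append(path)
```

Because spilling happens only under actual pressure, workloads that fit in memory pay no I/O cost, while oversized workloads degrade gradually instead of being OOM-killed.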
  6. Cursors and Pagers — What do we know?
     • Available memory, number of concurrent plans, current operator
     • Minimum pages per operator needed to make forward progress
  7. Cursors and Pagers — How we use SLAs
     • Divide memory for operators across plans
     • Each plan has info on when to drop tables, and the "distance" between operators for table reuse
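The two slides above describe the inputs (available memory, concurrent plans, per-operator minimum pages) and the policy (SLA-weighted division). As a hedged illustration of one way those pieces could fit together (this is my sketch, not ComputeAI's algorithm), the page budget can first reserve every operator's forward-progress minimum, then split the remainder by SLA weight:

```python
def divide_pages(total_pages, plans):
    """Split a page budget across plans by SLA weight.

    plans: {plan_id: {"weight": sla_weight,
                      "min_pages": [per-operator minimum pages]}}
    Guarantees every operator its minimum pages for forward progress,
    then distributes the spare pages proportionally to SLA weight.
    Hypothetical sketch of the slide's idea.
    """
    # Reserve the floor each plan's operators need to make progress.
    floors = {p: sum(info["min_pages"]) for p, info in plans.items()}
    reserved = sum(floors.values())
    if reserved > total_pages:
        raise MemoryError("cannot guarantee forward progress for all plans")

    # Distribute the remainder proportionally to SLA weight.
    spare = total_pages - reserved
    total_weight = sum(info["weight"] for info in plans.values())
    return {
        p: floors[p] + spare * plans[p]["weight"] // total_weight
        for p in plans
    }
```

Guaranteeing the per-operator floor first is what keeps every plan making forward progress even when a high-SLA plan takes most of the spare pages.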
  8. Cursors and Pagers — How do we steal pages?
     • Cursors and pagers write back dirty pages and evict clean pages
     • Pagers do JIT page-ins (best effort)
     • Steal within a plan before stealing outside the plan
     • AI/ML for recognizing operators' memory needs and for gradual performance decay when memory is low
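The victim-selection order the slide implies can be sketched directly (hypothetical code, not ComputeAI's pager): prefer pages of the requesting plan over other plans' pages, and within each plan prefer clean pages, which can simply be evicted, over dirty ones, which must be written back first.

```python
def steal_pages(pages, plan_id, need):
    """Pick `need` victim pages for the plan `plan_id`.

    pages: list of dicts like {"id": ..., "plan": ..., "dirty": bool}
    Ordering follows the slide: steal within the plan before stealing
    outside it, and evict clean pages before writing back dirty ones.
    Hypothetical sketch.
    """
    def cost(page):
        # Lower tuple sorts first: same plan beats other plans,
        # clean beats dirty.
        return (page["plan"] != plan_id, page["dirty"])

    victims = sorted(pages, key=cost)[:need]
    for v in victims:
        if v["dirty"]:
            # A real pager would write the page back to storage here
            # before reusing the frame.
            pass
    return [v["id"] for v in victims]
```

Stealing within the plan first confines the performance impact to the plan that is over its budget, so other plans' SLAs are touched only as a last resort.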
  9. OOM Safeguard Takeaways
     • OOM-kill prevention for ~0.1% additional memory
     • Slow path does not interfere with the fast path
     • Generic applicability to multiple data models
  10. Relational Compute Reference Implementation — Spark & Presto SQL
     • Spark JDBC Driver → Catalyst Logical Plan → ComputeAI Optimizer → ComputeAI Query Execution → ComputeAI Execution Runtime → Velox
  11. Contributing to Velox
     • ComputeAI IP + Velox → Velox++
     • No OOM kills
     • Unlimited memory overcommits
     • SLAs
  12. Internal Benchmark — Performance of the ComputeAI engine
     • ~2x faster than AWS EMR Spark for TPC-H & TPC-DS
     • 5-10x faster than AWS EMR Spark for real-world workloads
     • No theoretical limit on memory overcommitment; ~10x memory overcommitment without sacrificing performance
     • 3-5x reduction in cloud infrastructure
  13. Resources (Jitsu, Inc. Confidential)
     https://www.linkedin.com/in/vikramjoshi/
     • Why Was Compute.AI Founded
     • Harnessing the Power of Iceberg & Delta
     • Compute.AI's Vertically Integrated vs Distributed Shared Memory Spark
     • Do We Need Another Version of Spark
     • An Approach to Database Fine-Grained Access Controls
     [email protected]