
Safeguarding Against OOM Kills: Strategies for Compute-Efficient Systems

As compute resources grow more expensive and the demand for efficiency keeps rising, we need a fundamental shift in how we approach computing. Our reliance on in-memory systems has led to an unsustainable practice: overprovisioning infrastructure to avoid memory-related crashes (OOM kills). To achieve truly scalable and cost-effective computing, we must unlock the full potential of underutilized CPUs and GPUs and build OOM-proof systems that are less memory-intensive, ultimately reducing our reliance on expensive infrastructure.

Vikram Joshi
Founder, President & CTO at ComputeAI

Ali LeClerc

April 05, 2024

Transcript

  1. ComputeAI Mission — Make compute abundant and infinitely scalable
     • Processor cores and memory are two sides of the same coin
     • Efficient memory management frees up cores needed for compute
     • Why: TCO
     • Targeted applications: AI/ML (GPU), analytics (CPU)
  2. Open Architecture
     • SQL front ends: Presto SQL, Spark SQL
     • Drivers: Spark JDBC Server
     • Deployment: AWS, Azure, GCP, on-prem
     • ComputeAI Microkernel
     • Raw data: Parquet; metadata: Apache Iceberg
     • PostgreSQL grants: row, column, and table ACLs
  3. Architecture
     • Complex Compute Microkernel
     • Open-source SQL engines: Spark SQL, Presto SQL
     • Open data lake: Parquet, Iceberg
     • Workloads: normal workloads alongside complex/AI-generated SQL from GPT-4, autonomous BI & ELT, AI-powered BI apps, and low/no-code apps
  4. Problem
     1. Compute inefficiency & memory failures: infra over-provisioning, low CPU/GPU utilization, OOM kills
     2. I/O waits: network I/O waits for shuffles, disk I/O waits for demand paging
     3. Memory stalls: processor cores stalling on memory latency (not on lack of bandwidth)
  5. Solution
     • Custom microkernel runtime for masking I/O waits and reducing memory stalls
     • Data hypervisor for memory tiering & SLAs
     • AI/ML-driven JIT spill-to-disk
     • TLS-based thread context switching
     • Multithreading infra lets operator-layer code run in parallel like "tensor cores"
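The slide names JIT spill-to-disk but shows no implementation. A minimal sketch of the idea, with hypothetical names (not ComputeAI's actual code): an operator buffers rows in memory and spills its largest partition to disk only at the moment an insert would exceed its budget, rather than pre-reserving worst-case memory.

```python
import os
import pickle
import tempfile

class SpillableBuffer:
    """Toy operator buffer that spills partitions just-in-time.

    Hypothetical sketch: a real engine tracks reservations through a
    memory pool and spills in larger, I/O-friendly batches.
    """

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.mem = {}      # partition id -> (rows, bytes held in memory)
        self.spilled = {}  # partition id -> list of spill file paths

    def insert(self, pid, row, row_bytes):
        # JIT spill: free memory only when this insert would exceed the budget.
        while self._used() + row_bytes > self.budget and self.mem:
            self._spill_largest()
        rows, size = self.mem.get(pid, ([], 0))
        rows.append(row)
        self.mem[pid] = (rows, size + row_bytes)

    def _used(self):
        return sum(size for _, size in self.mem.values())

    def _spill_largest(self):
        # Spill the biggest in-memory partition to free the most memory.
        pid = max(self.mem, key=lambda p: self.mem[p][1])
        rows, _ = self.mem.pop(pid)
        fd, path = tempfile.mkstemp(suffix=f".part{pid}")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(rows, f)
        self.spilled.setdefault(pid, []).append(path)
```

Because spilling happens only under actual pressure, workloads that fit in memory pay no I/O cost, while oversized workloads degrade gradually instead of being OOM-killed.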
  6. Cursors and Pagers — What do we know?
     • Available memory, number of concurrent plans, current operator
     • Minimum pages per operator needed to make forward progress
  7. Cursors and Pagers — How we use SLAs
     • Divide memory for operators across plans
     • Each plan has info on when to drop tables, and the "distance" between operators for table reuse
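The two slides above describe the inputs (available memory, concurrent plans, per-operator minimum pages) and the policy (SLA-weighted division). As a hedged illustration of one way those pieces could fit together (this is my sketch, not ComputeAI's algorithm), the page budget can first reserve every operator's forward-progress minimum, then split the remainder by SLA weight:

```python
def divide_pages(total_pages, plans):
    """Split a page budget across plans by SLA weight.

    plans: {plan_id: {"weight": sla_weight,
                      "min_pages": [per-operator minimum pages]}}
    Guarantees every operator its minimum pages for forward progress,
    then distributes the spare pages proportionally to SLA weight.
    Hypothetical sketch of the slide's idea.
    """
    # Reserve the floor each plan's operators need to make progress.
    floors = {p: sum(info["min_pages"]) for p, info in plans.items()}
    reserved = sum(floors.values())
    if reserved > total_pages:
        raise MemoryError("cannot guarantee forward progress for all plans")

    # Distribute the remainder proportionally to SLA weight.
    spare = total_pages - reserved
    total_weight = sum(info["weight"] for info in plans.values())
    return {
        p: floors[p] + spare * plans[p]["weight"] // total_weight
        for p in plans
    }
```

Guaranteeing the per-operator floor first is what keeps every plan making forward progress even when a high-SLA plan takes most of the spare pages.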
  8. Cursors and Pagers — How do we steal pages?
     • Cursors and pagers write back dirty pages and evict clean pages
     • Pagers do JIT page-ins (best effort)
     • Steal within a plan before stealing outside the plan
     • AI/ML for recognizing operators' memory needs and for gradual performance decay when memory is low
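The victim-selection order the slide implies can be sketched directly (hypothetical code, not ComputeAI's pager): prefer pages of the requesting plan over other plans' pages, and within each plan prefer clean pages, which can simply be evicted, over dirty ones, which must be written back first.

```python
def steal_pages(pages, plan_id, need):
    """Pick `need` victim pages for the plan `plan_id`.

    pages: list of dicts like {"id": ..., "plan": ..., "dirty": bool}
    Ordering follows the slide: steal within the plan before stealing
    outside it, and evict clean pages before writing back dirty ones.
    Hypothetical sketch.
    """
    def cost(page):
        # Lower tuple sorts first: same plan beats other plans,
        # clean beats dirty.
        return (page["plan"] != plan_id, page["dirty"])

    victims = sorted(pages, key=cost)[:need]
    for v in victims:
        if v["dirty"]:
            # A real pager would write the page back to storage here
            # before reusing the frame.
            pass
    return [v["id"] for v in victims]
```

Stealing within the plan first confines the performance impact to the plan that is over its budget, so other plans' SLAs are touched only as a last resort.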
  9. OOM Safeguard Takeaways
     • OOM-kill prevention for ~0.1% additional memory
     • Slow path does not interfere with the fast path
     • Generic applicability to multiple data models
  10. Relational Compute Reference Implementation — Spark & Presto SQL
     • Spark JDBC Driver → Catalyst Logical Plan → ComputeAI Optimizer → ComputeAI Query Execution → ComputeAI Execution Runtime → Velox
  11. Contributing to Velox
     • ComputeAI IP + Velox → Velox++
     • No OOM kills
     • Unlimited memory overcommits
     • SLAs
  12. Internal Benchmark — Performance of the ComputeAI engine
     • ~2x faster than AWS EMR Spark for TPC-H & TPC-DS
     • 5-10x faster than AWS EMR Spark for real-world workloads
     • No theoretical limit on memory overcommitment; ~10x memory overcommitment without sacrificing performance
     • 3-5x reduction in cloud infrastructure
  13. Resources (Jitsu, Inc. Confidential)
     https://www.linkedin.com/in/vikramjoshi/
     • Why Was Compute.AI Founded
     • Harnessing the Power of Iceberg & Delta
     • Compute.AI's Vertically Integrated vs Distributed Shared Memory Spark
     • Do We Need Another Version of Spark
     • An Approach to Database Fine-Grained Access Controls
     [email protected]