Safeguarding Against OOM Kills: Strategies for Compute-Efficient Systems

As compute resources grow more expensive and the demand for efficiency keeps rising, we need a fundamental shift in how we approach computing. Our reliance on in-memory systems has led to an unsustainable practice: overprovisioning infrastructure to avoid memory-related crashes (OOM kills). To achieve truly scalable and cost-effective computing, we must unlock the full potential of underutilized CPUs and GPUs and build OOM-proof systems that are less memory-intensive, ultimately reducing our reliance on expensive infrastructure.

Vikram Joshi
Founder, President & CTO at ComputeAI

Ali LeClerc

April 05, 2024

Transcript

  1. ComputeAI Mission
     Make compute abundant and infinitely scalable
     • Processor cores and memory are two sides of the same coin
     • Efficient memory management frees up cores needed for compute
     • Why: TCO
     Targeted applications: AI/ML (GPU), Analytics (CPU)
  2. Open Architecture
     • Presto SQL and Spark SQL drivers; Spark JDBC Server
     • Runs on AWS, Azure, GCP, or on-prem
     • ComputeAI Microkernel
     • Raw data: Parquet; metadata: Apache Iceberg
     • PostgreSQL grants: row, column, and table ACLs
  3. Architecture
     • Complex Compute Microkernel beneath open-source SQL engines (Spark SQL, Presto SQL)
     • Open data lake: Parquet, Iceberg
     • Normal workloads plus complex/AI-generated SQL: GPT4, autonomous BI & ELT, AI-powered BI apps, low/no-code apps
  4. Problem
     1. Compute inefficiency & memory failures
        • Infra over-provisioning
        • Low CPU/GPU utilization
        • OOM kills
     2. I/O waits
        • Network I/O waits for shuffles
        • Disk I/O waits for demand paging
     3. Memory stalls
        • Processor cores stalling on memory (not on lack of bandwidth)
  5. Solution
     • Custom microkernel runtime for masking I/O waits and reducing memory stalls
     • Data hypervisor for memory tiering & SLAs
     • AI/ML-driven JIT spill-to-disk (see the sketch after this list)
     • TLS-based thread context switching
     • Multithreading infra lets operator-layer code run in parallel like "tensor cores"
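
The deck names the technique but not its interfaces, so the following is a minimal C++ sketch of one way an AI/ML-driven JIT spill could pick victims. OperatorState, spillUntilFits, and the per-operator reuseScore (standing in for the model's output) are all assumptions, not ComputeAI's actual design.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical types; the deck only names the technique, not its interfaces.
    struct OperatorState {
        const char* name;
        std::size_t residentPages;  // pages this operator currently holds in memory
        double reuseScore;          // assumed model output: likelihood pages are reused soon
    };

    // Spill the coldest operators just in time, until the incoming demand fits.
    std::size_t spillUntilFits(std::vector<OperatorState>& ops,
                               std::size_t demandPages, std::size_t freePages) {
        std::size_t spilled = 0;
        while (freePages + spilled < demandPages) {
            OperatorState* victim = nullptr;
            for (auto& op : ops)  // lowest reuse score = least likely to be needed soon
                if (op.residentPages > 0 && (!victim || op.reuseScore < victim->reuseScore))
                    victim = &op;
            if (!victim) break;  // nothing left to spill
            std::printf("spill %zu pages of %s\n", victim->residentPages, victim->name);
            spilled += victim->residentPages;
            victim->residentPages = 0;  // pages now backed by disk
        }
        return spilled;
    }

    int main() {
        std::vector<OperatorState> ops = {{"hash-join build", 400, 0.9},
                                          {"sort run", 300, 0.2},
                                          {"aggregation", 200, 0.6}};
        spillUntilFits(ops, /*demandPages=*/600, /*freePages=*/100);  // spills sort, then agg
    }

Spilling just in time, rather than eagerly, is what lets the fast path stay untouched when memory is plentiful.
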
  6. Cursors and Pagers: What Do We Know?
     • Available memory, number of concurrent plans, current operator
     • Minimum pages per operator to make forward progress (sketched below)
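
As an illustration only (PagerView and its field names are assumptions, not ComputeAI's data model), the facts this slide lists fit in a small bookkeeping struct, and the forward-progress floor becomes a one-line check:

    #include <cstddef>

    // Assumed bookkeeping for what the pager "knows" per the slide.
    struct PagerView {
        std::size_t availablePages;   // free memory, in pages
        std::size_t concurrentPlans;  // plans executing right now
        std::size_t currentOperator;  // operator the cursor is positioned on
        std::size_t minPagesPerOp;    // floor an operator needs to make forward progress
    };

    // Every concurrent plan must be able to get the per-operator minimum;
    // otherwise some plan cannot advance and pages must be stolen or spilled.
    bool canMakeForwardProgress(const PagerView& v) {
        return v.availablePages >= v.concurrentPlans * v.minPagesPerOp;
    }
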
  7. Cursors and Pagers: How We Use SLAs
     • Divide memory for operators across plans (a weighted sketch follows)
     • Plan has info on when to drop tables, and the "distance" between ops for table reuse
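
A minimal sketch of SLA-driven division, assuming each plan carries a weight derived from its SLA; the proportional formula and the PlanBudget type are my illustration, since the deck does not specify the policy:

    #include <cstddef>
    #include <vector>

    struct PlanBudget {
        double slaWeight;    // assumed: tighter SLA => larger weight
        std::size_t budget;  // pages granted to this plan's operators
    };

    // Split the page pool across plans in proportion to SLA weight. Within a
    // plan, the plan itself knows when a table can be dropped and how far away
    // ("distance") the next operator that reuses it is.
    void divideMemory(std::vector<PlanBudget>& plans, std::size_t totalPages) {
        double totalWeight = 0.0;
        for (const auto& p : plans) totalWeight += p.slaWeight;
        if (totalWeight <= 0.0) return;
        for (auto& p : plans)
            p.budget = static_cast<std::size_t>(totalPages * (p.slaWeight / totalWeight));
    }
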
  8. Cursors and Pagers: How Do We Steal Pages?
     • Cursors and pagers write dirty and evict clean pages
     • Pagers do JIT page-ins (best effort)
     • Steal within plan before stealing outside of plan (see the sketch below)
     • AI/ML for recognizing memory needs of operators and gradual perf decay when low on memory
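
A sketch of the stated stealing order under assumed types (Page, stealPages, and the writeBack stub are hypothetical): same-plan victims go first, then other plans; dirty pages are written back before reclaim while clean pages are evicted directly.

    #include <cstddef>
    #include <vector>

    struct Page {
        int planId;
        bool dirty;
        bool resident;
    };

    // Reclaim `need` pages for `requestingPlan`, preferring its own pages first.
    std::size_t stealPages(std::vector<Page>& pool, int requestingPlan, std::size_t need) {
        std::size_t stolen = 0;
        for (int pass = 0; pass < 2 && stolen < need; ++pass) {
            bool wantSamePlan = (pass == 0);  // pass 0: within plan; pass 1: outside
            for (auto& pg : pool) {
                if (stolen >= need) break;
                if (!pg.resident || (pg.planId == requestingPlan) != wantSamePlan) continue;
                if (pg.dirty) {
                    // writeBack(pg) would flush the page here; omitted in this sketch.
                    pg.dirty = false;
                }
                pg.resident = false;  // clean pages are simply evicted
                ++stolen;
            }
        }
        return stolen;  // caller falls back to spilling if stolen < need
    }

Stealing within the requesting plan first keeps the damage local: a memory-hungry plan degrades itself before it degrades its neighbors.
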
  9. OOM Safeguard Takeaways
     • OOM-kill prevention for ~0.1% additional memory
     • Slow path does not interfere with fast path
     • Generic applicability to multiple data models
  10. Relational Compute Reference Implementation (Spark & Presto SQL)
     Spark JDBC Driver → Catalyst Logical Plan → ComputeAI Optimizer → ComputeAI Query Execution → ComputeAI Execution Runtime → Velox
  11. Contributing to Velox
     ComputeAI IP + Velox → Velox++
     • No OOM kills
     • Unlimited memory overcommits
     • SLAs
  12. Internal Benchmark
     Performance of the ComputeAI engine:
     • ~2x faster than AWS EMR Spark for TPC-H & TPC-DS
     • 5-10x faster than AWS EMR Spark for real-world workloads
     • No theoretical limit on memory overcommitment; ~10x overcommitment without sacrificing performance
     • 3-5x reduction in cloud infrastructure
  13. Resources
     https://www.linkedin.com/in/vikramjoshi/
     • Why Was Compute.AI Founded
     • Harnessing the Power of Iceberg & Delta
     • Compute.AI's Vertically Integrated vs Distributed Shared Memory Spark
     • Do We Need Another Version of Spark
     • An Approach to Database Fine-Grained Access Controls
     [email protected]