Slide 1

Slide 1 text

Unleashing Maximum GPU Performance and Utilization for AI and HPC
Michael Buchel, CTO, Arc Compute

Slide 2

Slide 2 text

Genesis
• Discovery of GPU Inefficiencies
• Innovative Solutions: ArcHPC Suite

Slide 3

Slide 3 text

Vision
• Address the rising dependency on high-performance computing and accelerated hardware by reducing hardware and power requirements.

Slide 4

Slide 4 text

Mission
• Harness low-level optimization to maximize efficiency, achieve peak performance, and reduce environmental impact.

Slide 5

Slide 5 text

Existing Solutions to the GPU Inefficiency Problem

Slide 6

Slide 6 text

EXISTING SOLUTIONS: Pains and Challenges
SOLUTION 1: Ignore
SUMMARY: No method or solution to address the scarcity of hardware or to increase utilization.
PROS:
• None
CONS:
• Not feasible for business continuity

Slide 7

Slide 7 text

EXISTING SOLUTIONS
SOLUTION 2: Use of ineffective/incomplete software solutions
SUMMARY: Addressing utilization with de facto solutions such as job schedulers and fractional-GPU software.
PROS:
• Increases user density
• Easy to use
• Readily available
• Scalable
CONS:
• Cannot address low-level utilization points, such as memory-access latencies during which additional arithmetic operations could occur
• Can lead to performance degradation
• Cannot fine-tune GPU environments to deploy tasks optimally for performance
• Cannot set or prioritize performance to align with business objectives; lacks user-governance policy settings for performance

Slide 8

Slide 8 text

EXISTING SOLUTIONS
SOLUTION 3: Purchase additional hardware
SUMMARY: Buy more hardware to add capacity.
PROS:
• Scalable
• Easy to use
• Low technical barrier to entry
CONS:
• Market resource scarcity
• Expensive
• Does not address utilization
• Cannot increase performance
• Vendor may prioritize other customers and limit your supply
• Unreliable supply
• Limited deployment locations
• Requires additional supporting resources
• Dependent on the hardware vendor

Slide 9

Slide 9 text

EXISTING SOLUTIONS
SOLUTION 4: Manual task matching
SUMMARY: Intertwining and pairing task code so that matched tasks raise the fundamental utilization of the underlying hardware; can increase both utilization and performance by achieving memory-access-level parallelism. (A toy sketch of the pairing idea follows.)
PROS:
• Addresses utilization at the core problem: opportunities for additional arithmetic operations during memory-access latencies
• Can increase the performance of the accelerated hardware if executed correctly
• Full control over the code-optimization cycle
CONS:
• Scarce technical human capital for execution
• Long process
• Not scalable
• Limited by the human ability to execute correct task-matching operations and practices
• Cannot capture the performance gains available from rewriting schedulers and ISA commands for the underlying hardware
• Code must be manually re-tuned for each hardware architecture
• Product managers and technical leads are constrained in shipping code updates because their tasks are matched with other teams' tasks
• More bureaucratic red tape for execution
• Must trust the paired task's code-security posture
• Cannot absorb operational business changes on the fly; one task cannot be prioritized over the other without disrupting both, even when resources are available
• Cannot adjust in dynamic, complex settings
• Not feasible for large organizations at scale
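To make the pairing principle concrete: a memory-bound task leaves arithmetic units idle while it waits on memory, so co-scheduling it with a compute-bound task lets those idle cycles do useful work. Below is a minimal, hypothetical Python sketch of that idea; the task names, the arithmetic-intensity metric, and the greedy heuristic are illustrative assumptions, not Arc Compute's actual method.

```python
# Toy illustration of task matching: pair the most memory-bound task with the
# most compute-bound one so arithmetic can fill memory-latency stalls.
# All names and numbers are made up for illustration.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    arithmetic_intensity: float  # FLOPs per byte moved; low = memory-bound

def pair_tasks(tasks: list[Task]) -> list[tuple[Task, Task]]:
    """Greedily pair lowest-intensity tasks with highest-intensity ones."""
    ordered = sorted(tasks, key=lambda t: t.arithmetic_intensity)
    pairs = []
    while len(ordered) >= 2:
        mem_bound = ordered.pop(0)     # stalls on memory access
        compute_bound = ordered.pop()  # saturates the arithmetic units
        pairs.append((mem_bound, compute_bound))
    return pairs

if __name__ == "__main__":
    tasks = [Task("stencil", 0.4), Task("gemm", 60.0),
             Task("graph-walk", 0.1), Task("md-force", 25.0)]
    for a, b in pair_tasks(tasks):
        print(f"co-schedule {a.name} (memory-bound) with {b.name} (compute-bound)")
```

Doing this by hand across a real codebase, per hardware architecture, is what the CONS above describe.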

Slide 10

Slide 10 text

THE SUITE
• Mercury
• Nexus
• Oracle

Slide 11

Slide 11 text

Nexus
• ArcHPC Nexus is a management solution for advanced GPUs and other accelerated hardware.
• The software allows users to maximize user/task density and GPU performance.
• It achieves this by increasing throughput to compute resources while granularly tuning compute environments for task execution, and by providing recommendations for further improvement.

Slide 12

Slide 12 text

Environment Creation
• ArcHPC Nexus creates the environment in which GPU utilization and performance can be maximized.
⚬ ArcHPC Nexus provides management protocols that remove the limitations and performance-degradation pitfalls found in other solutions, which attempt to maximize utilization but cannot address performance at the same time.

Slide 13

Slide 13 text

HPC Environment Management
• ArcHPC Nexus ships with a GUI and a command-line interface, and integrates with prominent job schedulers such as SLURM (used by Meta and by large exascale institutions and universities) for scalable management of HPC environments. (A generic SLURM submission sketch follows.)
• It includes tools for a granular view of the operational health of HPC environments and running tasks, along with enterprise governance capabilities and control.
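For readers unfamiliar with SLURM, here is a minimal sketch of plain sbatch usage driven from Python, the kind of scheduler workflow Nexus integrates with. The script contents, paths, and binary name are placeholders; Nexus's actual SLURM plugin interface is not shown here.

```python
# Minimal sketch of submitting a GPU job to SLURM from Python.
# "./my_gpu_binary" is a placeholder; the sbatch/srun flags are standard.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=gpu-task
#SBATCH --gres=gpu:1        # request one GPU from the scheduler
#SBATCH --time=00:30:00
srun ./my_gpu_binary        # placeholder for the actual workload
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(JOB_SCRIPT)
    script_path = f.name

# On success, sbatch prints a line like "Submitted batch job 12345".
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```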

Slide 14

Slide 14 text

Increased Throughput for Increased User and Task Density
• ArcHPC Nexus increases throughput to accelerated hardware, unlocking the ability to increase utilization and performance.

Slide 15

Slide 15 text

Simultaneous Management of Multiple Accelerator Types
• ArcHPC Nexus simultaneously manages multiple accelerator hardware architectures and generations in an HPC environment, enabling users to mix and match compute resources and remain agile.

Slide 16

Slide 16 text

Oracle
• Automates task matching and task deployment
• Manages low-level operational execution of instructions in the HPC environment
• Increases accelerated-hardware performance through enterprise-scalable control

Slide 17

Slide 17 text

Automated Task Matching and Task Deployment
• ArcHPC Oracle automates task matching and task deployment, making them scalable, streamlined, and instantly applicable in dynamic environments. (A toy re-matching sketch follows.)
⚬ Manual task matching is a gruelling, cumbersome operation currently performed by some large companies to maximize utilization of their HPC investments as best a human can.
⚬ Humans cannot sustain the operation: slight code changes require a rework of the entire matching.
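The advantage of automation is that re-matching after a change is just a recomputation. Below is a toy sketch of that loop, under the same illustrative arithmetic-intensity assumption as the earlier pairing example; names and numbers are hypothetical.

```python
# Toy illustration of automated re-matching: when the task mix changes,
# pairings are recomputed instantly instead of a human reworking them.
def rebalance(running: dict[str, float],
              incoming: dict[str, float]) -> list[tuple[str, str]]:
    """Recompute memory-bound/compute-bound pairings over the whole task set."""
    pool = {**running, **incoming}  # task name -> arithmetic intensity (FLOPs/byte)
    by_intensity = sorted(pool, key=pool.get)
    return [(by_intensity[i], by_intensity[-1 - i])
            for i in range(len(by_intensity) // 2)]

running = {"stencil": 0.4, "gemm": 60.0}
print(rebalance(running, {"graph-walk": 0.1}))              # odd pool: one task waits
print(rebalance(running, {"md-force": 25.0, "spmv": 0.2}))  # new arrivals re-pair everything
```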

Slide 18

Slide 18 text

Granular Instruction Execution Management
• ArcHPC Oracle can adjust kernels in flight as their instructions execute. This enables users to capture the highest performance increases possible at scale, during intervals and across vectors that humans cannot address.

Slide 19

Slide 19 text

Enterprise Scalability and Control
• ArcHPC Oracle provides flexible enterprise management tools and governance so large entities can maintain granular control of operations and keep them aligned with business objectives.

Slide 20

Slide 20 text

Mercury
• Resolves task matching to maximize the number of unique tasks running.
• Selects the hardware that will maximize throughput for the average task running in the datacenter. (A toy scoring sketch follows.)
• Provides datacenter owners with information to help scale their datacenters for new and growing workloads.
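As a picture of the hardware-selection idea: weight each GPU model's throughput by the datacenter's task mix and pick the highest scorer. The throughput table and task profile below are made-up numbers for illustration, not Mercury's actual model.

```python
# Hypothetical sketch of throughput-weighted hardware selection.
def pick_hardware(task_profile: dict[str, float],
                  gpu_throughput: dict[str, dict[str, float]]) -> str:
    """Return the GPU whose weighted throughput over the task mix is highest."""
    def score(gpu: str) -> float:
        return sum(share * gpu_throughput[gpu][kind]
                   for kind, share in task_profile.items())
    return max(gpu_throughput, key=score)

task_profile = {"memory_bound": 0.6, "compute_bound": 0.4}  # illustrative mix
gpu_throughput = {  # relative per-GPU throughput per task class (made up)
    "A100": {"memory_bound": 1.0, "compute_bound": 1.0},
    "H100": {"memory_bound": 1.6, "compute_bound": 2.1},
}
print(pick_hardware(task_profile, gpu_throughput))  # -> H100
```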

Slide 21

Slide 21 text

Case Studies
• LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator)

Slide 22

Slide 22 text

LAMMPS
• Effective code developed by labs such as Sandia National Laboratories.
• Very hard to optimize due to high occupancy/pipeline saturation in the code.
• Still benefits from multi-GPU setups.
• Serious dead time when running the lmp component, spent just moving data to and from the GPU. (A sketch of measuring that transfer share follows.)
• We ran a few tests to show how Nexus can speed this up.
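One way to see the dead time for yourself is to measure what share of a run goes to host-device copies. The sketch below uses CuPy as a stand-in (LAMMPS's actual data path is C++/CUDA), so the sizes, kernel, and resulting numbers only illustrate the measurement idea, not LAMMPS itself.

```python
# Rough sketch: measure the host<->device transfer share of a toy GPU run.
# Requires CuPy and an NVIDIA GPU; the array size and kernel are placeholders.
import time
import numpy as np
import cupy as cp

x = np.random.rand(50_000_000)       # ~400 MB of doubles

t0 = time.perf_counter()
d = cp.asarray(x)                    # host -> device copy
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
d = cp.sqrt(d) * 2.0                 # stand-in for actual kernel work
cp.cuda.Stream.null.synchronize()
t2 = time.perf_counter()
x = cp.asnumpy(d)                    # device -> host copy
t3 = time.perf_counter()

transfer = (t1 - t0) + (t3 - t2)
print(f"transfer share of runtime: {transfer / (t3 - t0):.0%}")
```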

Slide 23

Slide 23 text

LAMMPS Baseline
• Thanks to George Mason University, we have an initial baseline of what to expect on A100s (4,484.309 tau/day).
• We ran 5 experiments of the system on our machine in a virtualized state, with no tricks to make our system look better.
• We used CUDA 11.2.1 (the version available when Sandia produced their benchmark) to ensure no driver speedups played a role in our results.
• We still got a consistent 2% performance increase (roughly 4,574 tau/day).

Slide 24

Slide 24 text

LAMMPS Double GPU
• Since we are limited by GPU memory size, we slightly reduced the problem size in in.lj.txt: on line 7, the 20 was changed to a 10.
• Using our virtual machines, we shared one GPU between 2 identical LAMMPS processes.
• The wide variance is due to CPUs haphazardly sending data to the GPU. Even so, with our system in place behind the scenes, we still see massive improvements. Once we control the CPU component even more, we expect the variance to stabilize around 12k tau/day.

Slide 25

Slide 25 text

Competition and Differentiators
• No direct competitors
• Indirect competitors (low threat)
⚬ Job schedulers
⚬ Application optimizers
• Why Arc Compute is poised to win

Slide 26

Slide 26 text

Roadmap

Feb 1st, 2024: ArcHPC Nexus R1 (Version 1)
Summary: Easier installation process for a better user experience
• Heterogeneous vGPU support on a physical GPU
• Initial support for NVIDIA Ampere and Hopper; previous architectures supported for key accounts
• Technical documentation
• Installation ISO/medium for scaling installations

March 4th, 2024: ArcHPC Nexus R2 (Version 2)
Summary: Streamlined client-management workflow for general release and centralized network management in HPC environments
• ahpc-guest plugin documentation
• Metric visualization
• Network topology to libvirt
• License server

May 25th, 2024: ArcHPC Nexus R3 (Version 3)
Summary: General release of support for NVIDIA networking solutions
• Cluster-based virtual networks
• Clustered database

Slide 27

Slide 27 text

MILESTONES

November 11th, 2024
Summary: Release of ArcHPC Oracle; scalable automated task matching, task deployment, and HPC environment tuning/calibration to remove human operational inefficiency
• Cross-datacenter ideal VM deployment
• Papers on how to speed up GPU tasks and demystify NVIDIA propaganda
• AST comparisons of different compute tasks
• Selection of node vGPU selectors (data binning)

December 15th, 2024
Summary: Allows users to use custom scheduling systems; ArcHPC Nexus becomes technically resilient to architecture redesigns and changes
• Documentation for building custom scheduling mechanisms
• ISA translations between NVIDIA architectures
• Nexus port to use the NVIDIA free system

Slide 28

Slide 28 text

GTM, Partner Ecosystem, OEM, Distribution/Reseller

Slide 29

Slide 29 text

GTM
• Direct sales for strategic accounts
• Focus on large AI/ML companies and supercomputers

Slide 30

Slide 30 text

Partner Ecosystem

Slide 31

Slide 31 text

OEM
• In discussions with leading datacenter providers

Distribution/Reseller
• Currently in development with a few major players in the datacenter market

Slide 32

Slide 32 text

Pricing Model
• Per GPU
⚬ Volume pricing
■ Range: $4,000 to $8,800 per GPU per year
• Cloud
⚬ Cost per hour: USD $0.370964 (annualized comparison below)
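A quick sanity check annualizing the quoted cloud rate against the per-GPU license range; the only assumption here is 24/7 usage.

```python
# Annualize the quoted cloud rate for comparison with the per-GPU license range.
HOURLY = 0.370964          # USD per GPU-hour (from the slide)
HOURS_PER_YEAR = 24 * 365  # 8760

annual = HOURLY * HOURS_PER_YEAR
print(f"${annual:,.2f} per GPU-year at full utilization")  # ~= $3,249.64
# vs. the quoted per-GPU license range of $4,000 to $8,800 per year
```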

Slide 33

Slide 33 text

Q&A

Slide 34

Slide 34 text

Thank you