Arc Compute - IT Press Tour #56 June 2024

Unleashing Maximum GPU Performance and Utilization for AI and HPC
Michael Buchel, CTO, Arc Compute

Genesis
• Discovery of GPU Inefficiencies
• Innovative Solutions: ArcHPC Suite

Vision
• Address the rising dependency on high-performance computing and accelerated hardware by reducing hardware and power requirements.

Mission
• Harness low-level optimization to maximize efficiency, achieve peak performance, and reduce environmental impact.

Existing Solutions to the GPU Inefficiency Problem

Pains and challenges

SOLUTION 1: Ignore
Summary: No method or solution to address the scarcity of hardware or to increase utilization.
Pros:
• None
Cons:
• Not feasible for business continuity

SOLUTION 2: Use of ineffective/incomplete software solutions
Summary: Addressing utilization using de facto solutions such as job schedulers and fractional GPU software.
Pros:
• Increase user density
• Easy to use
• Readily available
• Scalable
Cons:
• Cannot address low-level utilization points such as memory access latencies, where additional arithmetic operations could occur
• Can lead to performance degradation
• Cannot address fine-tuning of GPU environments for optimal task deployment and performance
• Cannot set or prioritize performance to align with business objectives; no user governance policy settings for performance

SOLUTION 3: Purchase additional hardware
Summary: Purchase additional hardware.
Pros:
• Scalable
• Easy to use
• Low technical entry
Cons:
• Market resource scarcity
• Expensive
• Doesn't address utilization
• Cannot increase performance
• Vendor prioritizes other entities and may limit your supply
• Unreliable
• Limited deployment locations
• Requirement of additional supporting resources
• Dependent on vendor of hardware

SOLUTION 4: Manual task matching
Summary: Intertwining and pairing of task codes (task matching) to increase the fundamental utilization of the underlying hardware; able to increase utilization and performance of hardware by achieving memory-access-level parallelism.
Pros:
• Addresses utilization at the core problem: opportunities for additional arithmetic operations during memory access latencies
• Can increase the performance of the accelerated hardware if executed correctly
• Full control over the code optimization cycle
Cons:
• Scarcity of technical human capital for execution
• Long process
• Not scalable
• Limited by the human ability to execute correct task-matching operations and practices
• Cannot address performance increase opportunities in the rewriting of schedulers and ISA commands for the underlying hardware
• Code must be manually re-tuned for varying hardware architectures
• Product managers and technical leads are limited in the code updates they can make, because their tasks are matched with those of other product managers and technical leads
• More bureaucratic red tape for execution
• Must trust the paired task's code security posture
• Incapable of addressing operational business changes on the fly; cannot prioritize the execution of one task over the other without disrupting both tasks, even if there are available resources
• Inability to adjust in dynamic, complex settings
• Not feasible for large organizations at scale
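The idea behind task matching, filling one task's memory-access stalls with another task's arithmetic, can be illustrated with a toy model. The sketch below is not Arc Compute's method or any real scheduler: the `Task` type, the `stall_fraction` metric, and the greedy pairing rule are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    # Hypothetical metric: share of runtime the task spends stalled on memory (0..1).
    stall_fraction: float


def pair_tasks(tasks):
    """Greedily pair the most memory-bound task with the most compute-bound one.

    Returns (pairs, leftover), where leftover holds at most one unpaired task.
    """
    ordered = sorted(tasks, key=lambda t: t.stall_fraction)
    pairs = []
    while len(ordered) >= 2:
        compute_bound = ordered.pop(0)  # smallest stall fraction
        memory_bound = ordered.pop()    # largest stall fraction
        pairs.append((memory_bound, compute_bound))
    return pairs, ordered


def estimated_busy_fraction(a, b):
    """Toy estimate of arithmetic-unit utilization when a and b share a GPU.

    Assumes the arithmetic phases of the two tasks never collide, so this is
    an upper bound rather than a prediction.
    """
    return min(1.0, (1.0 - a.stall_fraction) + (1.0 - b.stall_fraction))


if __name__ == "__main__":
    # Illustrative task names and stall fractions, not measurements.
    tasks = [
        Task("stencil", 0.70),
        Task("gemm", 0.20),
        Task("reduce", 0.55),
        Task("fft", 0.45),
    ]
    pairs, leftover = pair_tasks(tasks)
    for memory_bound, compute_bound in pairs:
        busy = estimated_busy_fraction(memory_bound, compute_bound)
        print(f"{memory_bound.name} + {compute_bound.name}: up to {busy:.0%} busy")
```

Even in this simplified form, the pairing has to be recomputed whenever a task's profile changes, which is the scaling problem the cons above describe.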

The Suite: Mercury, Nexus, Oracle

Nexus
• ArcHPC Nexus is a management solution for advanced GPUs and other accelerated hardware.
• This software allows users to maximize user/task density and GPU performance.
• It achieves this by increasing throughput to compute resources while granularly tuning compute environments for task execution and providing recommendations for further improvements.

Environment Creation
• ArcHPC Nexus creates the environment where the maximization of utilization and performance for GPUs can occur.
  ⚬ ArcHPC Nexus provides management protocols that remove limitations and performance degradation pitfalls found in other solutions that attempt to maximize utilization but can't address performance at the same time.

Confidentiality Notice: This presentation, including any attachments, is for the exclusive and confidential use of the intended recipient(s). If you are not the intended recipient, please note that any form of dissemination, distribution, or copying of this communication is strictly prohibited and may be unlawful. If you have received this presentation in error, please immediately notify the sender and delete all copies.

HPC Environment Management
• ArcHPC Nexus comes with a GUI and a command-line interface, with integration into prominent job schedulers such as SLURM (used by Meta, large exascale institutions and universities) for scalable management of HPC environments. It also includes tools for a granular understanding of the operational health of HPC environments and running tasks, along with enterprise governance capabilities and control.

Increased Throughput for Increased User and Task Density
• ArcHPC Nexus increases the throughput to accelerated hardware, revealing the ability to increase utilization and performance.

Simultaneous Management of Multiple Accelerator Types
• ArcHPC Nexus simultaneously manages various accelerator hardware architectures and generations in an HPC environment, enabling users to mix and match various compute resources to remain agile.

Oracle
• Automates task matching and task deployment
• Manages low-level operational execution of instructions in the HPC environment
• Increases accelerated hardware performance through enterprise scalable control

Automated Task Matching and Task Deployment
• ArcHPC Oracle automates task matching and task deployment so that it is scalable, streamlined and instantly applicable in dynamic environments.
  ⚬ Manual task matching is a gruelling, daunting and cumbersome operation that some large companies currently perform to maximize utilization of their underlying HPC investments as well as a human can.
  ⚬ Humans can't sustain manual matching, because even slight code changes require a rework of the entire operation.

Granular Instruction Execution Management
• ArcHPC Oracle can make adjustments to kernels in flight for the execution of their instructions. This enables users to achieve the highest possible performance increases at scale, during intervals and along vectors that humans cannot address manually.

Enterprise Scalability and Control
• ArcHPC Oracle has flexible enterprise management tools and governance, so large entities can maintain granular control of operations to align with business objectives.

Mercury
• Resolves task matching for the maximum number of unique tasks running.
• Selects hardware that will maximize the throughput for the average task running in the datacenter.
• Provides datacenter owners with information to help scale their datacenters to accommodate new and growing workloads.
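The hardware-selection point above can be read as "pick the hardware with the best mean throughput across the tasks that actually run". The following is a minimal sketch of that reading only; the function name `pick_hardware`, the input layout, and the placeholder labels and numbers are hypothetical and do not describe how Mercury is implemented.

```python
from statistics import mean


def pick_hardware(throughput_by_hardware):
    """Return the hardware whose mean per-task throughput is highest.

    throughput_by_hardware maps a hardware label to a dict of
    {task name: measured throughput}; all entries are caller-supplied.
    """
    if not throughput_by_hardware:
        raise ValueError("no hardware candidates supplied")
    averages = {
        hardware: mean(list(per_task.values()))
        for hardware, per_task in throughput_by_hardware.items()
    }
    best = max(averages, key=averages.get)
    return best, averages


if __name__ == "__main__":
    # Placeholder numbers in arbitrary units, not measurements.
    candidates = {
        "accelerator_a": {"task_1": 10.0, "task_2": 30.0},
        "accelerator_b": {"task_1": 25.0, "task_2": 20.0},
    }
    best, averages = pick_hardware(candidates)
    print(best, averages)
```

A plain mean treats every task as equally important; weighting tasks by how often they run would be a natural variation, but the slide does not say which definition of "average task" is used.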

Case Studies
• LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator)

LAMMPS
• Effective code created by labs such as Sandia National Laboratories.
• Very hard to optimize due to high occupancy/pipeline saturation in the code.
• Still benefits from multi-GPU setups.
• Serious dead time when running the lmp component, spent just moving data to and from the GPU.
• We ran a few tests to show how Nexus can speed this up.

LAMMPS Baseline
• Thanks to George Mason University, we have an initial baseline for what to expect on A100s (4484.309 tau/day).
• We ran 5 experiments with the system simply running on our machine in a virtualized state, with no magic tricks to make our system look better.
• We used CUDA 11.2.1 (available at the time that Sandia made their benchmark) to ensure that no driver speedups played a role in our results.
• We still got a consistent 2% performance increase.
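To put the 2% figure in absolute terms, the arithmetic can be sketched as below, assuming the increase is measured against the 4484.309 tau/day baseline (the slide does not state this explicitly). Under that assumption the implied throughput is roughly 4574 tau/day. The helper names are hypothetical.

```python
BASELINE_TAU_PER_DAY = 4484.309  # A100 baseline quoted on the slide


def apply_relative_gain(baseline, gain_percent):
    """Throughput after a relative gain, e.g. gain_percent=2 for +2%."""
    return baseline * (1.0 + gain_percent / 100.0)


def relative_gain_percent(baseline, measured):
    """Relative change of measured versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Assumes the 2% gain is relative to the quoted baseline.
    implied = apply_relative_gain(BASELINE_TAU_PER_DAY, 2.0)
    print(f"Implied throughput: {implied:.1f} tau/day")
```

`relative_gain_percent` goes the other way: given a measured tau/day value from a run, it reports the change against the baseline.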

LAMMPS Double GPU
• Since we are limited by GPU memory size, we lowered a value in in.lj.txt slightly: on line 7, the 20 was changed to a 10.
• With our virtual machines, we mapped our GPU so that it was shared between 2 identical LAMMPS processes.
• The wide variation is due to the CPUs haphazardly sending data to the GPU. However, with our system in place behind the scenes, we still see massive improvements. Once we control the CPU component even more, we expect the results to stabilize around 12k tau/day.

Competition and Differentiators
• No direct competitors
• Low - Indirect Competitors
  ⚬ Job Schedulers
  ⚬ Application Optimizers
• Why is Arc Compute poised to win?

Roadmap

Feb 1st, 2024: ArcHPC Nexus R1 - Version 1
Summary: Easy installation process for a better user experience
• Heterogeneous vGPU support on a physical GPU
• Initial support for NVIDIA Ampere and Hopper; previous architectures supported for key accounts
• Technical documentation
• Installation ISO/medium for scaling installations

March 4th, 2024: ArcHPC Nexus R2 - Version 2
Summary: Streamlined client management workflow for general release and centralized network management in HPC environments
• ahpc-guest plugin documentation
• Metric visualization
• Network topology to libvirt
• License server

May 25th, 2024: ArcHPC Nexus R3 - Version 3
Summary: General release of support for NVIDIA networking solutions
• Cluster-based virtual networks
• Database cluster

Milestones

November 11th, 2024
Summary: Release of ArcHPC Oracle; scalable automated task matching, task deployment and HPC environment tuning/calibrating to remove human operational inefficiency
• Cross-datacenter ideal VM deployment
• Papers on how to speed up GPU tasks and demystify NVIDIA propaganda
• AST comparisons of different compute tasks
• Selection of node vGPU selectors (databinning)

December 15th, 2024
Summary: Allows users to use custom scheduling systems, and ArcHPC Nexus becomes technically resilient to architecture redesigns and changes
• Documentation for how to build custom scheduling mechanisms
• ISA translations between NVIDIA architectures
• Nexus port to use NVIDIA free system

GTM, Partner Ecosystem, OEM, Distribution/Reseller

GTM
• Direct for strategic accounts
• Focus on large AI/ML companies and supercomputers

Partner Ecosystem

OEM
• In discussions with leading datacenter providers

Distribution/Reseller
• Currently in development with a few major players in the datacenter market

Pricing Model
• Per GPU
  ⚬ Volume Pricing
    ￭ Range: $4000 to $8800 per GPU per year
• Cloud
  ⚬ Cost per hour: USD $0.370964
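To set the hourly cloud price next to the annual per-GPU range, the rate can be annualized. This is a minimal sketch under two assumptions the slide does not confirm: that the hourly rate is billed per GPU, and that the GPU runs around the clock for a 365-day year. Under those assumptions the annual figure comes to about $3,249.64 per GPU. The function and constant names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760; ignores leap years

CLOUD_USD_PER_HOUR = 0.370964            # from the slide
LICENSE_USD_PER_GPU_YEAR = (4000, 8800)  # volume-pricing range from the slide


def annual_cloud_cost(rate_per_hour, hours=HOURS_PER_YEAR, gpus=1):
    """Yearly cost at an hourly rate, assuming the rate is billed per GPU-hour."""
    return rate_per_hour * hours * gpus


if __name__ == "__main__":
    yearly = annual_cloud_cost(CLOUD_USD_PER_HOUR)
    low, high = LICENSE_USD_PER_GPU_YEAR
    print(f"Cloud, 24x7 for one GPU: ${yearly:,.2f} per year")
    print(f"Per-GPU volume pricing:  ${low:,} to ${high:,} per year")
```

Passing a smaller `hours` value models part-time usage; the two price lines may also cover different things, so the comparison is indicative only.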

Thank you
