What would you do with a million cores? High Performance Computing HPC on AWS 2019

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

What Would You Do with a Million Cores? High-Performance Computing on AWS
Frank Munz, PhD
Senior Technical Evangelist, Amazon Web Services
ECC5
@frankmunz

Extreme Scale
“Using AWS to easily shrink simulation time allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital
HPC cluster of 1 million vCPUs

1.1M vCPUs for ML

From worrying about capex, capacity, and technology to focusing on innovation
HPC on AWS is a fundamental rethink of what is possible

We think the metric for success for any business should be time-to-results
[Figure: cores, with a fixed data centre capacity limit]
• Fixed data centre: finite capacity, usually with long queues to wait in
• AWS: massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
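The time-to-results argument can be illustrated with simple arithmetic. The sketch below is an idealisation, not a benchmark: it assumes a perfectly parallel job with no queueing or scaling losses, and the workload size and core counts are hypothetical values chosen for illustration.

```python
def time_to_results_hours(core_hours: float, cores: int) -> float:
    """Wall-clock hours for an idealised, perfectly parallel job.

    Assumption: no queue wait and no parallel-scaling losses.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return core_hours / cores


# Hypothetical workload of 100,000 core-hours.
workload_core_hours = 100_000

# Hypothetical fixed data centre limit versus elastic capacity.
fixed = time_to_results_hours(workload_core_hours, cores=1_000)
elastic = time_to_results_hours(workload_core_hours, cores=100_000)

print(f"fixed capacity:   {fixed:.0f} h")
print(f"elastic capacity: {elastic:.0f} h")
```

Under these assumptions the total core-hours consumed are the same in both cases; only the wall-clock time changes.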

HPC Solution Components
• Automation and orchestration: AWS Batch, AWS ParallelCluster, NICE EnginFrame
• Visualization: NICE DCV, Amazon AppStream 2.0
• Compute: Amazon EC2 instances (compute and accelerated), Amazon EC2 Spot, AWS Auto Scaling
• Networking: enhanced networking, placement groups, Elastic Fabric Adapter
• Storage: Amazon EBS, Amazon EFS, Amazon S3, Amazon FSx for Lustre

Broadest and deepest platform choice

EC2 Instance Types: p3dn.24xlarge
• Instance family (p): general-purpose GPU instance
• Instance generation (3): 3
• Additional capabilities (dn): high-speed, low-latency NVMe SSD physically connected to the server (d); 100 Gbit/s networking (n)
• Instance size (24xlarge): 96 vCPUs, 768 GiB RAM, 8x NVIDIA V100 (256 GB of GPU memory in total)
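The naming breakdown above can be sketched as a small parser. This is an illustrative simplification, not an official AWS grammar: the regular expression only covers names of the form family, generation, optional capability letters, and size, and the capability meanings are limited to the letters this deck explains (d, n, and a for the AMD variants).

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str        # e.g. "p" for GPU instances
    generation: int    # e.g. 3
    capabilities: str  # e.g. "dn"
    size: str          # e.g. "24xlarge"


# Simplified pattern (an assumption): letters, digits, optional letters, ".", size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")

# Only the letters described in this deck; anything else is reported as unknown.
_CAPABILITY_MEANING = {
    "d": "local NVMe SSD",
    "n": "high-bandwidth networking",
    "a": "AMD EPYC processor",
}


def parse_instance_type(name: str) -> InstanceType:
    """Split an EC2 instance type name into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type name: {name!r}")
    family, generation, capabilities, size = match.groups()
    return InstanceType(family, int(generation), capabilities, size)


def describe_capabilities(capabilities: str) -> list[str]:
    """Map each capability letter to a short description."""
    return [_CAPABILITY_MEANING.get(letter, f"unknown ({letter})") for letter in capabilities]


for name in ("p3dn.24xlarge", "z1d.large", "c5n.18xlarge", "m5a.large"):
    parsed = parse_instance_type(name)
    print(name, parsed, describe_capabilities(parsed.capabilities))
```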

P3dn: High-bandwidth GPU compute instances
Optimized for distributed ML training
Launch with Deep Learning AMIs (which include most of the popular ML frameworks) or with Amazon SageMaker
Featuring
• 8 NVIDIA Tesla V100 GPUs with 32 GB of memory per GPU
• 96 custom Intel® Xeon® Scalable (Skylake) vCPUs
• 100 Gbps of networking throughput
• 1.8 TB of local NVMe-based SSD storage

Z1d: High clock speed instances
Optimized for applications that require sustained core performance
Memory- and compute-intensive applications, e.g. applications with license restrictions that require few, fast cores
Featuring
• Up to 4 GHz sustained, all-core turbo performance
• Custom Intel Xeon Scalable processor
• 8:1 GiB to vCPU ratio
• Enhanced networking, up to 25 Gbps throughput

C5n: High-bandwidth compute instances
Massively scalable performance
Network-bound workloads, including distributed cluster and database workloads, HPC, real-time communications, and video streaming
Featuring
• Significant improvements in maximum bandwidth, packets per second, and packet processing
• Up to 100 Gbps of network bandwidth
• Custom-designed Nitro network cards
• 5.33 GB/core with full memory bandwidth, 2:1 Memory:vCPU ratio
• Elastic Fabric Adapter compatible

EC2 Instance Types with AMD EPYC™
• Lower prices: M5a, R5a, and T3a are 10% less expensive than M5, R5, and T3
• Application compatible: M5 (general purpose), R5 (memory workloads), T3 (burstable workloads)
• Processor choice: choice of Intel Xeon or AMD EPYC processors

Amazon EC2 A1: first instance powered by the AWS Graviton Processor
AWS Graviton Processor with 64-bit ARM Neoverse cores and custom AWS silicon
• Lower cost: up to 45% cost savings
• Run scale-out and ARM-based workloads in the cloud
• Maximize resource efficiency with the AWS Nitro System
• Flexibility and choice for your workloads: Amazon Linux, Red Hat, Ubuntu

Broadest choice of processors and architectures

Elastic Fabric Adapter (EFA)
Scale tightly-coupled HPC applications on AWS
• Best for large HPC workloads
• 15-microsecond network latencies
• Instances: C5n, P3dn, i3en

HPC software stack in Amazon EC2
https://www.youtube.com/watch?time_continue=1&v=MjzbY74WNeI

Amazon FSx for Lustre: High-performance file system
• Massively scalable performance: each TB provides 200 MB/second, and the file system scales to hundreds of GB/s and millions of IOPS
• Parallel file system
• Consistent sub-millisecond latencies
• SSD-based
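The per-terabyte throughput figure lends itself to a quick sizing calculation. The sketch below takes the 200 MB/s per TB number from the slide at face value and assumes linear scaling; the capacities in the loop are hypothetical, and real file systems may have minimum sizes and capacity increments that this ignores.

```python
MB_PER_SECOND_PER_TB = 200  # figure quoted on the slide


def baseline_throughput_mb_s(capacity_tb: int) -> int:
    """Aggregate throughput in MB/s, assuming linear scaling with capacity."""
    if capacity_tb < 0:
        raise ValueError("capacity must not be negative")
    return capacity_tb * MB_PER_SECOND_PER_TB


# Hypothetical file system sizes, in TB.
for capacity_tb in (1, 10, 100, 1000):
    mb_s = baseline_throughput_mb_s(capacity_tb)
    print(f"{capacity_tb:>5} TB -> {mb_s:>7} MB/s ({mb_s / 1000:g} GB/s)")
```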

Seamless integration with Amazon S3
Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then…
• Data stored in Amazon S3 is loaded to Amazon FSx for processing
• Output of processing is returned to Amazon S3 for retention
• When your workload finishes, simply delete your file system
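As a rough sketch of the linking step, the function below builds the request parameters that the boto3 FSx client's `create_file_system` call accepts for a Lustre file system linked to S3. The bucket name, subnet ID, capacity, and the `results` export prefix are hypothetical placeholders; the code only builds the dictionary and does not contact AWS.

```python
def lustre_request(bucket: str, capacity_gib: int, subnet_id: str) -> dict:
    """Build parameters for an S3-linked FSx for Lustre file system.

    ImportPath links the S3 data set; ExportPath is where results are
    written back. The "results" prefix is an assumption for illustration.
    """
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,  # in GiB
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            "ImportPath": f"s3://{bucket}",
            "ExportPath": f"s3://{bucket}/results",
        },
    }


# Hypothetical identifiers; replace with your own.
request = lustre_request("my-hpc-dataset", 3600, "subnet-0123456789abcdef0")
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   fsx = boto3.client("fsx")
#   file_system = fsx.create_file_system(**request)
# and, once the workload has finished:
#   fsx.delete_file_system(FileSystemId=file_system["FileSystem"]["FileSystemId"])
```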

AWS ParallelCluster: Easy cluster management
Quickly build an HPC compute environment in AWS
CLI to create an HPC cluster
• Successor of the open source CfnCluster
• Released via the Python Package Index (PyPI)
• Built on AWS CloudFormation
• Supported schedulers: sge (default), torque, and slurm
• Integrated with AWS Batch
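To make the cluster definition concrete, here is a minimal sketch that generates an INI-style configuration in the layout used by the ParallelCluster 2.x CLI current at the time of this talk. The key names follow that configuration format as far as I know it; the key pair, VPC, subnet, instance type, and queue sizes are hypothetical, and the scheduler check mirrors only the three schedulers listed on the slide.

```python
import configparser
import io

SLIDE_SCHEDULERS = {"sge", "torque", "slurm"}  # schedulers listed on the slide


def build_pcluster_config(key_name: str, vpc_id: str, subnet_id: str,
                          scheduler: str = "sge") -> str:
    """Return the text of a minimal ParallelCluster-style config file."""
    if scheduler not in SLIDE_SCHEDULERS:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")

    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["cluster default"] = {
        "key_name": key_name,
        "scheduler": scheduler,
        "compute_instance_type": "c5n.18xlarge",  # hypothetical choice
        "initial_queue_size": "0",                # start with no compute nodes
        "max_queue_size": "10",                   # hypothetical upper bound
        "vpc_settings": "public",
    }
    config["vpc public"] = {"vpc_id": vpc_id, "master_subnet_id": subnet_id}

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical identifiers; replace with your own.
print(build_pcluster_config("my-key", "vpc-0123456789abcdef0",
                            "subnet-0123456789abcdef0", scheduler="slurm"))
```

Assuming the CLI has been installed from PyPI (`pip install aws-parallelcluster`), a file like this would typically be saved as the CLI's config file and the cluster started with `pcluster create <cluster-name>`.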

HPC on AWS
Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around

Thank you!