Slide 1

Slide 1 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT
What Would You Do with a Million Cores?
High-Performance Computing on AWS
Frank Munz, PhD
Senior Technical Evangelist, Amazon Web Services
@frankmunz

Slide 2

Slide 2 text

Extreme Scale
HPC cluster of 1 million vCPUs
“Using AWS to easily shrink simulation time allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital

Slide 3

Slide 3 text

1.1M vCPUs for ML

Slide 4

Slide 4 text

HPC on AWS is a fundamental rethink of what is possible
From worrying about capex, capacity, and technology to focusing on innovation

Slide 5

Slide 5 text

We think the metric for success for any business should be time-to-results
[Figure: two charts of cores in use, one capped by a fixed data centre capacity limit]
Fixed data centre capacity limit: finite capacity, usually with long queues to wait in
Massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”

Slide 6

Slide 6 text

HPC Solution Components
Automation and orchestration: AWS Batch, AWS ParallelCluster, NICE EnginFrame
Visualization: NICE DCV, Amazon AppStream 2.0
Compute: Amazon EC2 instances (compute and accelerated), Amazon EC2 Spot, AWS Auto Scaling
Networking: enhanced networking, placement groups, Elastic Fabric Adapter
Storage: Amazon EBS, Amazon EFS, Amazon S3, Amazon FSx for Lustre

Slide 7

Slide 7 text

Broadest and deepest platform choice

Slide 8

Slide 8 text

EC2 Instance Types
p3dn.24xlarge
Instance Family: general purpose GPU instance
Instance Generation: 3
Additional Capabilities:
- high speed, low latency NVMe SSD physically connected to server
- 100 Gbit/s networking
Instance Size: 96 vCPUs, 8x NVIDIA V100 (256 GB GPU memory in total)
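The naming breakdown on this slide (family, generation, additional capabilities, size) can be sketched as a small parser. This is only an illustration: the regular expression is an assumption that covers names shaped like the ones in this deck (p3dn.24xlarge, c5n.18xlarge, a1.medium), not every EC2 instance type name, and `parse_instance_type` is a hypothetical helper, not an AWS API.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str        # e.g. "p" for the GPU family
    generation: int    # e.g. 3
    capabilities: str  # e.g. "dn": local NVMe disk plus faster networking
    size: str          # e.g. "24xlarge"


# Assumed shape: letters, digits, optional letters, a dot, then the size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split an instance type name into the four parts shown on the slide."""
    match = _PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognised instance type name: {name!r}")
    family, generation, capabilities, size = match.groups()
    return InstanceType(family, int(generation), capabilities, size)


print(parse_instance_type("p3dn.24xlarge"))
# InstanceType(family='p', generation=3, capabilities='dn', size='24xlarge')
```

Names that do not follow this shape raise a ValueError rather than being guessed at.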

Slide 9

Slide 9 text

P3dn
High bandwidth GPU compute instances
Optimized for distributed ML training
Launch with Deep Learning AMIs (which include most of the popular ML frameworks) or with Amazon SageMaker
Featuring
• 8 NVIDIA Tesla V100 GPUs with 32 GB of memory per GPU
• 96 custom Intel® Xeon® Scalable (Skylake) vCPUs
• 100 Gbps of networking throughput
• 1.8 TB of local NVMe-based SSD storage

Slide 10

Slide 10 text

Z1d
High clock speed instances
Optimized for applications that require sustained core performance
Memory- and compute-intensive applications, e.g. applications with license restrictions that require few, fast cores
Featuring
• Up to 4 GHz sustained, all-turbo performance
• Custom Intel Xeon Scalable processor
• 8:1 GiB to vCPU ratio
• Enhanced networking, up to 25 Gbps throughput

Slide 11

Slide 11 text

C5n
High bandwidth compute instances
Massively scalable performance
Network-bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming
Featuring
• Significant improvements in max bandwidth, packets per second, and packet processing
• Up to 100 Gbps of network bandwidth
• Custom designed Nitro network cards
• 5.33 GB/core with full memory bandwidth, 2:1 Memory:vCPU ratio
• Elastic Fabric Adapter compatible

Slide 12

Slide 12 text

EC2 Instance Types with AMD EPYC™
M5a / M5, R5a / R5, T3a / T3
Lower Prices: 10% less expensive
Application Compatible: M5 (general purpose), R5 (memory workloads), T3 (burstable workloads)
Processor Choice: choice of Intel Xeon or AMD EPYC processors

Slide 13

Slide 13 text

Amazon EC2 A1
First instance powered by AWS Graviton Processor
AWS Graviton Processor with 64-bit ARM Neoverse cores and custom AWS silicon
Lower cost
Up to 45% cost savings
Run scale-out and ARM-based workloads in the cloud
Maximize resource efficiency with AWS Nitro System
Flexibility and choice for your workloads
Amazon Linux, Red Hat, Ubuntu

Slide 14

Slide 14 text

Broadest choice of processors and architectures

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Elastic Fabric Adapter (EFA)
Scale tightly-coupled HPC applications on AWS
Best for large HPC workloads
15 microsecond network latencies
C5n, P3dn, I3en

Slide 17

Slide 17 text

HPC software stack in Amazon EC2
https://www.youtube.com/watch?time_continue=1&v=MjzbY74WNeI

Slide 18

Slide 18 text


Slide 19

Slide 19 text

Amazon FSx for Lustre
High performance file system
Massively scalable performance
Parallel file system
SSD-based
Consistent sub-millisecond latencies
Each TB provides 200 MB/second, scales to hundreds of GB/s and millions of IOPS
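The per-TB throughput figure on this slide lends itself to a quick sizing calculation. The sketch below only restates the slide's arithmetic (200 MB/second per TB of storage, assumed to scale linearly); the function names are hypothetical, and real file systems come in fixed capacity increments that this ignores.

```python
import math

MB_PER_S_PER_TB = 200  # figure quoted on the slide


def throughput_mb_per_s(capacity_tb: float) -> float:
    """Aggregate throughput for a file system of the given capacity."""
    return capacity_tb * MB_PER_S_PER_TB


def capacity_tb_for(target_mb_per_s: float) -> int:
    """Smallest whole number of TB that reaches the target throughput."""
    return math.ceil(target_mb_per_s / MB_PER_S_PER_TB)


print(throughput_mb_per_s(10))   # 2000
print(capacity_tb_for(100_000))  # 500
```

Under these assumptions, reaching 100 GB/s (taken here as 100,000 MB/s) takes about 500 TB of storage, which is the scale at which "hundreds of GB/s" starts.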

Slide 20

Slide 20 text

Seamless integration with Amazon S3
Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then:
• Data stored in Amazon S3 is loaded to Amazon FSx for processing
• Output of processing is returned to Amazon S3 for retention
• When your workload finishes, simply delete your file system
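The link between an S3 data set and a file system can be sketched with boto3's FSx client. This is a minimal sketch, not the presenter's code: the bucket name, subnet ID, export prefix, and default capacity are placeholder assumptions, and the request is built by a plain function so that it can be inspected without AWS credentials.

```python
def build_fsx_lustre_request(bucket, subnet_id, capacity_gib=3600,
                             export_prefix="results"):
    """Build keyword arguments for the FSx create_file_system call.

    bucket, subnet_id, and export_prefix are hypothetical placeholders.
    """
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,  # GiB; valid sizes depend on the service
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            # Objects under ImportPath appear as files on the file system.
            "ImportPath": f"s3://{bucket}",
            # Results are written back under ExportPath in the same bucket.
            "ExportPath": f"s3://{bucket}/{export_prefix}",
        },
    }


def create_linked_file_system(request):
    """Send the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without boto3
    response = boto3.client("fsx").create_file_system(**request)
    return response["FileSystem"]["FileSystemId"]


request = build_fsx_lustre_request("my-hpc-dataset", "subnet-0123456789abcdef0")
print(request["LustreConfiguration"]["ImportPath"])  # s3://my-hpc-dataset
```

Once the workload finishes, the same client's `delete_file_system(FileSystemId=...)` call removes the file system, matching the last step on the slide.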

Slide 21

Slide 21 text


Slide 22

Slide 22 text

AWS ParallelCluster
Easy Cluster Management
Quickly build an HPC compute environment in AWS
CLI to create HPC cluster
• Successor of open source CfnCluster
• Released via the Python Package Index (PyPI)
• Built on AWS CloudFormation
• Supported schedulers: sge (default), torque, and slurm
• Integrated with AWS Batch
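A cluster definition for the CLI described above is an INI-style file, which can be sketched with Python's configparser. Treat this as an assumption-laden illustration: the section and key names follow the ParallelCluster 2.x configuration format as best recalled and should be checked against the version you install, and the key pair, VPC, and subnet values are placeholders. `render_cluster_config` is a hypothetical helper, not part of the tool.

```python
import configparser
import io

SUPPORTED_SCHEDULERS = ("sge", "torque", "slurm")  # as listed on the slide


def render_cluster_config(scheduler="sge",
                          compute_instance_type="c5n.18xlarge",
                          max_queue_size=10):
    """Render an INI-style cluster definition as a string."""
    if scheduler not in SUPPORTED_SCHEDULERS:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["cluster default"] = {
        "key_name": "my-key-pair",  # placeholder EC2 key pair name
        "scheduler": scheduler,
        "compute_instance_type": compute_instance_type,
        "max_queue_size": str(max_queue_size),  # upper bound on compute nodes
        "vpc_settings": "public",
    }
    config["vpc public"] = {
        "vpc_id": "vpc-REPLACE_ME",              # placeholder
        "master_subnet_id": "subnet-REPLACE_ME",  # placeholder
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_cluster_config(scheduler="slurm"))
```

The package is published on PyPI as `aws-parallelcluster`; after installing it, a file like this is typically passed to the `pcluster create` command, though the exact options vary between releases.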

Slide 23

Slide 23 text


Slide 24

Slide 24 text


Slide 25

Slide 25 text

HPC on AWS
Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around

Slide 26

Slide 26 text

Thank you!
frankmunz
@frankmunz
https://medium.com/@frank.munz (Blog)
https://speakerdeck.com/fmunz (Slides)