What would you do with a million cores? High Performance Computing (HPC) on AWS 2019

Frank Munz
October 05, 2019

Stop press. There is a conference recording for this talk:
https://www.youtube.com/watch?v=-QCGam1DDqQ

This talk presents the technical innovations that make HPC in the cloud possible. I will cover everything from best-in-class, custom-built Intel processors to the low-energy, low-cost ARM processors that have quietly made their way into server-class machines. I will speak about purpose-built networking that bypasses the TCP/IP stack to guarantee low-latency, low-jitter networking beyond what is possible with classical NICs. And I will cover the open source software that makes it possible to spin up very large-scale HPC clusters in the cloud.

Oh, and of course I will answer the question of whether 1,000,000 cores is pure marketecture or whether it really happened. Of course it really happened.


Transcript

  1. What would You do with a Million cores? High-Performance Computing on AWS

    Frank Munz, PhD, Senior Technical Evangelist, Amazon Web Services, @frankmunz
    SUMMIT, ECC 5
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Extreme Scale

    HPC cluster of 1 million vCPUs
    “Using AWS to easily shrink simulation time allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital
  3. HPC on AWS is a Fundamental Rethink of what is Possible

    From worrying about capex, capacity, and technology to focusing on innovation
  4. We think the metric for success for any business should be time-to-results

    [Figure: two charts of cores in use; the first is capped by a fixed data centre capacity limit]
    • Finite capacity, usually with long queues to wait in
    • Massive capacity when needed to speed up time to results, and agile environment when additional hardware and software experimentation is needed
    “For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
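The time-to-results argument on this slide can be illustrated with a little arithmetic. The sketch below assumes a perfectly parallel workload (runtime shrinks linearly with the number of cores) and a flat price per core-hour. Both are simplifying assumptions, and the function names and numbers are hypothetical rather than taken from the talk.

```python
def time_to_results_hours(core_hours, cores, queue_wait_hours=0.0):
    """Wall-clock hours until results, assuming ideal linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return queue_wait_hours + core_hours / cores


def compute_cost(core_hours, price_per_core_hour):
    """Cost depends on total core-hours, not on how many cores run at once."""
    return core_hours * price_per_core_hour


if __name__ == "__main__":
    work = 100_000  # hypothetical job size in core-hours
    price = 0.5     # hypothetical price per core-hour
    for cores in (1_000, 10_000, 100_000):
        hours = time_to_results_hours(work, cores)
        print(f"{cores:>7} cores: {hours:6.1f} h, cost {compute_cost(work, price):.2f}")
```

Under these assumptions the cost stays constant while the waiting time drops with every added core, which is the point of measuring time-to-results instead of utilization. Real workloads rarely scale perfectly, so treat the numbers as an upper bound on the benefit.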
  5. HPC Solution Components

    • Automation and orchestration: AWS Batch, AWS ParallelCluster, NICE EnginFrame
    • Visualization: NICE DCV, Amazon AppStream 2.0
    • Compute: Amazon EC2 instances (compute and accelerated), Amazon EC2 Spot, AWS Auto Scaling
    • Networking: Enhanced networking, placement groups, Elastic Fabric Adapter
    • Storage: Amazon EBS, Amazon EFS, Amazon S3, Amazon FSx for Lustre
  6. EC2 Instance Types: p3dn.24xlarge

    • Instance family: general purpose GPU instance
    • Instance generation: 3
    • Additional capabilities: high speed, low latency NVMe SSD physically connected to the server; 100 Gbit/s networking
    • Instance size: 96 vCPUs, 256 GB RAM, 8x NVIDIA V100
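The naming scheme on this slide (family, generation, additional capabilities, size) can be sketched as a small parser. This is a minimal illustration, not an official AWS grammar: the regular expression only covers simple names such as those in this deck, and the letter-to-capability table is an assumption read off the slides ("d" and "n" from this slide, "a" from the AMD EPYC slide).

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str        # e.g. "p" for GPU instances
    generation: int    # e.g. 3
    capabilities: str  # e.g. "dn"
    size: str          # e.g. "24xlarge"


# Assumed pattern: letters, digits, optional letters, a dot, then the size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")

# Assumed meanings, inferred from the slides in this deck.
CAPABILITY_NOTES = {
    "d": "local NVMe SSD storage",
    "n": "high-bandwidth networking",
    "a": "AMD EPYC processors",
}


def parse_instance_type(name):
    """Split a name like 'p3dn.24xlarge' into its four parts."""
    match = _PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, capabilities, size = match.groups()
    return InstanceType(family, int(generation), capabilities, size)


def describe_capabilities(instance):
    """Translate capability letters; unknown letters are reported as such."""
    return [CAPABILITY_NOTES.get(letter, "not covered by this sketch")
            for letter in instance.capabilities]


if __name__ == "__main__":
    parsed = parse_instance_type("p3dn.24xlarge")
    print(parsed, describe_capabilities(parsed))
```

The same function handles the other names mentioned in the deck, such as `z1d`, `c5n`, `m5a`, and `a1` sizes. Names with a different shape would need a more complete pattern.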
  7. P3dn: High bandwidth GPU compute instances

    Optimized for distributed ML training
    Launch with Deep Learning AMIs (which include most of the popular ML frameworks) or with Amazon SageMaker
    Featuring
    • 8 NVIDIA Tesla GPUs with 32 GB of memory per GPU
    • 96 custom Intel® Xeon® Scalable (Skylake) vCPUs
    • 100 Gbps of networking throughput
    • 1.8 TB of local NVMe-based SSD storage
  8. Z1d: High clock speed instances

    Optimized for applications that require sustained core performance
    Memory and compute-intensive applications, e.g. applications with license restrictions that require few, fast cores
    Featuring
    • Up to 4 GHz sustained, all-turbo performance
    • Custom Intel Xeon Scalable processor
    • 8:1 GiB to vCPU ratio
    • Enhanced networking, up to 25 Gbps throughput
  9. C5n: High bandwidth compute instances

    Massively scalable performance
    Network-bound workloads including distributed cluster and database workloads, HPC, real-time communications, and video streaming
    Featuring
    • Significant improvements in max bandwidth, packets per second, and packet processing
    • Up to 100 Gbps of network bandwidth
    • Custom designed Nitro network cards
    • 5.33 GB/core with full memory bandwidth, 2:1 Memory:vCPU ratio
    • Elastic Fabric Adapter compatible
  10. EC2 Instance Types with AMD EPYC™

    • Lower prices: 10% less expensive (M5a vs. M5, R5a vs. R5, T3a vs. T3)
    • Application compatible: M5 (general purpose), R5 (memory workloads), T3 (burstable workloads)
    • Processor choice: choice of Intel Xeon or AMD EPYC processors
    © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  11. Amazon EC2 A1: First instance powered by AWS Graviton Processor

    AWS Graviton Processor with 64-bit ARM Neoverse cores and custom AWS silicon
    • Lower cost: up to 45% cost savings
    • Run scale-out and ARM-based workloads in the cloud
    • Maximize resource efficiency with AWS Nitro System
    • Flexibility and choice for your workloads
    • Amazon Linux, Red Hat, Ubuntu
  12. Broadest choice of processors and architectures
  14. Elastic Fabric Adapter (EFA)

    Scale tightly-coupled HPC applications on AWS
    • Elastic Fabric Adapter, best for large HPC workloads
    • 15 microsecond network latencies
    • Instances: C5n, P3dn, i3en
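As a rough illustration of how an EFA-enabled cluster node might be requested, the sketch below builds the keyword arguments for boto3's `create_placement_group` and `run_instances` EC2 calls without actually calling AWS. The helper names, the default instance size, and all IDs are hypothetical placeholders; the deck itself does not show this code.

```python
def placement_group_params(name):
    """Arguments for ec2.create_placement_group; 'cluster' packs nodes close together."""
    return {"GroupName": name, "Strategy": "cluster"}


def efa_launch_params(image_id, subnet_id, security_group_id,
                      placement_group, count, instance_type="c5n.18xlarge"):
    """Arguments for ec2.run_instances with an EFA network interface."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": placement_group},
        "NetworkInterfaces": [{
            "DeviceIndex": 0,
            "SubnetId": subnet_id,
            "Groups": [security_group_id],
            # Requests an Elastic Fabric Adapter instead of a standard interface.
            "InterfaceType": "efa",
        }],
    }


if __name__ == "__main__":
    # Placeholder IDs; with real values and credentials one could run:
    #   import boto3
    #   ec2 = boto3.client("ec2")
    #   ec2.create_placement_group(**placement_group_params("hpc-demo"))
    #   ec2.run_instances(**params)
    params = efa_launch_params("ami-EXAMPLE", "subnet-EXAMPLE", "sg-EXAMPLE",
                               "hpc-demo", count=4)
    print(params)
```

Keeping the parameter construction separate from the API call makes the request easy to inspect before any instances are started.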
  15. HPC software stack in Amazon EC2

    https://www.youtube.com/watch?time_continue=1&v=MjzbY74WNeI
  17. Amazon FSx for Lustre: High performance file system

    • Massively scalable performance
    • Parallel file system
    • Consistent sub-millisecond latencies
    • Each TB provides 200 MB/second, scales to hundreds of GB/s and millions of IOPS
    • SSD-based
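The throughput figure on this slide (200 MB/second per TB of storage) lends itself to a quick sizing calculation. The sketch below is a simple linear model based only on that number; the function names are hypothetical, and actual service limits and capacity increments are not modeled.

```python
import math

MB_PER_SECOND_PER_TB = 200  # figure quoted on the slide


def throughput_mb_per_second(capacity_tb):
    """Aggregate throughput for a file system of the given size."""
    if capacity_tb < 0:
        raise ValueError("capacity must not be negative")
    return capacity_tb * MB_PER_SECOND_PER_TB


def capacity_tb_for_throughput(target_mb_per_second):
    """Smallest whole number of TB that reaches the target throughput."""
    if target_mb_per_second < 0:
        raise ValueError("target must not be negative")
    return math.ceil(target_mb_per_second / MB_PER_SECOND_PER_TB)


if __name__ == "__main__":
    print(throughput_mb_per_second(10))      # 10 TB file system
    print(capacity_tb_for_throughput(1100))  # TB needed for 1100 MB/s
```

Because throughput scales with capacity in this model, a workload that needs more bandwidth than its data set implies may have to provision more storage than it strictly needs.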
  18. Seamless integration with Amazon S3

    Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then…
    • Data stored in Amazon S3 is loaded to Amazon FSx for processing
    • Output of processing is returned to Amazon S3 for retention
    When your workload finishes, simply delete your file system.
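The link-process-delete workflow on this slide might look roughly like the following with boto3's FSx client. The sketch only builds the arguments for `create_file_system`; the bucket name, subnet ID, capacity value, and export prefix are hypothetical placeholders, and the helper function is not part of any AWS library.

```python
def fsx_lustre_params(bucket, subnet_id, storage_capacity_gib=3600,
                      export_prefix="results"):
    """Arguments for fsx.create_file_system with an S3-linked Lustre file system."""
    if not bucket:
        raise ValueError("bucket name is required")
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": storage_capacity_gib,  # placeholder size in GiB
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            # Data set in S3 that becomes visible in the file system.
            "ImportPath": f"s3://{bucket}",
            # Location in the same bucket where output is written back.
            "ExportPath": f"s3://{bucket}/{export_prefix}",
        },
    }


if __name__ == "__main__":
    # With real values and credentials one could run:
    #   import boto3
    #   fsx = boto3.client("fsx")
    #   fs = fsx.create_file_system(**params)
    #   ... run the workload, export results ...
    #   fsx.delete_file_system(FileSystemId=fs["FileSystem"]["FileSystemId"])
    params = fsx_lustre_params("my-example-bucket", "subnet-EXAMPLE")
    print(params)
```

Since the durable copy of the data lives in S3, the file system itself can be treated as temporary scratch space and removed once the job is done.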
  20. AWS ParallelCluster: Easy Cluster Management

    Quickly build an HPC compute environment in AWS
    CLI to create HPC cluster
    • Successor of open source CfnCluster
    • Released via the Python Package Index (PyPI)
    • Built on AWS CloudFormation
    • Supported schedulers: sge (default), torque, and slurm
    • Integrated with AWS Batch
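To give a feel for what a cluster definition involves, the sketch below generates an INI-style configuration in the format used by ParallelCluster 2.x, the version current at the time of this talk (later major versions use a different format). The key pair, region, VPC, and subnet values are hypothetical placeholders, the helper functions are illustrative only, and the set of keys shown is a small subset chosen for the example.

```python
import configparser
import io

# Schedulers listed on the slide; "sge" is named there as the default.
SUPPORTED_SCHEDULERS = ("sge", "torque", "slurm")


def build_cluster_config(scheduler="sge", instance_type="c5n.18xlarge", max_nodes=10):
    """Return a ConfigParser holding a minimal ParallelCluster 2.x-style config."""
    if scheduler not in SUPPORTED_SCHEDULERS:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["aws"] = {"aws_region_name": "eu-west-1"}  # placeholder region
    config["cluster default"] = {
        "key_name": "my-key-pair",  # placeholder EC2 key pair
        "scheduler": scheduler,
        "compute_instance_type": instance_type,
        "max_queue_size": str(max_nodes),
        "vpc_settings": "public",
    }
    config["vpc public"] = {
        "vpc_id": "vpc-EXAMPLE",               # placeholder
        "master_subnet_id": "subnet-EXAMPLE",  # placeholder
    }
    return config


def render(config):
    """Serialize the configuration to INI text."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render(build_cluster_config(scheduler="slurm")))
```

With the tool installed from PyPI (`pip install aws-parallelcluster`) and a configuration like this saved as its config file, a 2.x cluster would be started with `pcluster create <cluster-name>`. Check the documentation of the installed version for the exact keys and commands.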
  22. HPC on AWS

    Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
  23. Thank you!

    frankmunz, @frankmunz
    https://medium.com/@frank.munz (Blog)
    https://speakerdeck.com/fmunz (Slides)