HPC-Updates@jawshpc#19

JAWS-HPC #19, 2024/1/26
HPC-Related Updates: SC23 → re:Invent → Now
Hiroshi Kobayashi, Solution Architect
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.

Super Exciting Weeks
• Nov 12 - 17
• Nov 28 – Dec 2

Supercomputing (SC) 23: Event Overview
• The top conference in the HPC field
• 14,295 attendees, 438 exhibitors
• November 12 – 17, 2023, Denver, CO
• Top500, Green500, Gordon Bell Prize, and other results were announced

AWS re:Invent 2023: Event Overview
• The world's largest "learning-oriented" conference on cloud computing, hosted by AWS
  § Monday, November 27 to Friday, December 1, 2023
• A large amount of content
  § Keynotes on 5 themes, 17 innovation talks, 2,000+ breakout sessions
• Many customers attended
  § On-site attendance: 50,000+
  § Customers from Japan: 1,700+

Breakout content
Dive deep into topics such as generative AI, cloud operations, compute, security, and more. With more than 2,000 sessions to choose from, catching up on the latest cloud technologies has never been easier. You’ll find something for every learning style and skill level, whether you’re listening to lectures in breakout sessions or attending hands-on workshops. Find the content that interests you the most by browsing our session catalog. For HPC-interested attendees we recommend CMP213, CMP214, and WPS207.
Recommended sessions
• CMP213 Confidently run your HPC production workloads on AWS
• CMP214 Simulating the Future: HPC on AWS for Semiconductors and Healthcare Life Sciences
• WPS207 Bridging research and computing to tackle the world's grand challenges
Speakers
• Compute track: Srinivas Tadepalli, Global Head of HPC & Accelerated Computing GTM, AWS
• Compute track: Ian Colle, Director & GM, Advanced Computing and Simulation, AWS
• Public Sector track: Debra Goldfarb, Director, Product Strategy, Adv Computing & Simulation, AW

Looking for more HPC & related sessions?
Breakout
• CMP214 Simulating the Future: HPC on AWS for Semiconductors and Healthcare Life Sciences | Mon, Nov. 27 | 8:30 AM - 9:30 AM (PST) | MGM Grand | Level 3 | Premier 318
• CMP213 Confidently run your HPC production workloads on AWS | Tue, Nov. 28 | 2:30 PM - 3:30 PM (PST) | Mandalay Bay | Level 3 | South | Jasmine F
• WPS207 Bridging research and computing to tackle the world's grand challenges | Thu, Nov. 30 | 1:00 PM - 2:00 PM (PST) | MGM Grand | Level 3 | Chairmans 364
Chalk Talk
• CMP416 Performance at scale: Running hero HPC workloads on AWS | Mon, Nov. 27 | 10:30 AM - 11:30 AM (PST) | Caesars Forum | Level 1 | Academy 411
• CMP415-R Machine learning for engineering simulations | Mon, Nov. 27 | 12:00 PM - 1:00 PM (PST) | Mandalay Bay | Level 1 | North | South Pacific B
• CMP417 Using HPC-optimized Amazon EC2 instances for optimal price performance | Wed, Nov. 29 | 4:00 PM - 5:00 PM (PST) | Caesars Forum | Level 1 | Summit 220
• CMP414 Architectural best practices for running hybrid HPC workloads on AWS | Thu, Nov. 30 | 3:30 PM - 4:30 PM (PST) | Caesars Forum | Level 1 | Alliance 305
Workshop
• CMP305 Best practices for high performance computing in the cloud | Mon, Nov 27 | 11:00 AM - 1:00 PM | Caesars Forum | Level 1 | Alliance 307
Innovation
• HYB207-INT | Thu, Nov. 30 | 11:00 AM - 12:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B | Emerging Tech: Bill Vass, Vice President of Engineering, AWS

Other HPC related sessions
• Compute innovation for any application, anywhere (CMP219)
• Accelerate ML and HPC with high performance file storage (STG340)
• Conquer cloud challenges with a competitive edge for less with AMD (CMP104)
• Deep dive into the AWS Nitro System (CMP306)
• Deep dive on Amazon FSx for NetApp ONTAP scale-out file systems (STG229)

New Instances!

History of Graviton: continuously improving performance
• Graviton has evolved through four generations in the five years since 2018
• AWS announced that more than 2 million Graviton-series chips have been produced in total

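AWS Graviton4 announced
Status: Preview
The most powerful and energy-efficient processor in the Graviton series, suited to a broad range of cloud workloads
Successor to Graviton3
• 1.5x the vCPU count of Graviton3 (Arm Neoverse V2, 96 vCPUs/socket)
• 2 MB of L2 cache per core
• DDR5-5600, 12 channels
• Supports coherent multi-socket configurations
Compared with Graviton3
• Up to 40% faster for databases, up to 30% faster for web applications, and up to 45% faster for large Java applications
• Up to 3x the cores and 3x the DRAM available in a single system
Links: Press Release | AWS Blog | Adam's keynote | Deep dive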

R8g instances powered by Graviton4 announced
Status: Preview
Category: Graviton4, memory optimized
Successor to R7g instances
• R8g is the first instance type to feature Graviton4
• DDR5-5600 memory for memory-intensive workloads
Compared with R7g instances
• Scales up to 3x the vCPUs and 3x the memory of the Graviton3-based R7g
• Up to 30% higher compute performance than R7g, with the best price performance among EC2 instances
Use cases
• Databases, in-memory caches, real-time big data analytics
Links: Press Release | AWS Blog | Join the preview

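AWS Trainium2 / Amazon EC2 Trn2 instances announced
Status: Announcement
Provides the highest level of compute performance for training foundation models on AWS: train foundation models faster, at lower cost, and with less energy
Successor to first-generation Trainium
• Compared with first-generation Trainium: up to 4x the training performance, 3x the memory capacity, and 2x the energy efficiency
• Can shorten training of a 300-billion-parameter LLM from months to weeks
Trn2 instances
• Each instance consists of 16 Trainium2 chips
• Delivers up to 65 exaflops of compute when deployed in an EC2 UltraCluster of up to 100,000 chips
Links: Press Release | YouTube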
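As a rough sanity check on the UltraCluster figures above, the short Python sketch below derives what they imply per chip and per cluster. It uses only the numbers quoted on the slide (65 exaflops, 100,000 chips, 16 chips per instance). The function names are illustrative, and the numeric precision behind the exaflops figure is not stated on the slide, so treat the result as a peak-value estimate only.

```python
# Figures quoted on the slide (peak, "up to" values).
ULTRACLUSTER_EXAFLOPS = 65
ULTRACLUSTER_CHIPS = 100_000
CHIPS_PER_INSTANCE = 16


def per_chip_teraflops(exaflops: float, chips: int) -> float:
    """Average peak teraflops per chip implied by a cluster-level figure."""
    return exaflops * 1_000_000 / chips  # 1 exaflop = 1,000,000 teraflops


def instances_needed(chips: int, chips_per_instance: int) -> int:
    """Number of instances required to hold the given chip count (rounded up)."""
    return -(-chips // chips_per_instance)


print(per_chip_teraflops(ULTRACLUSTER_EXAFLOPS, ULTRACLUSTER_CHIPS))  # 650.0
print(instances_needed(ULTRACLUSTER_CHIPS, CHIPS_PER_INSTANCE))       # 6250
```

In other words, the quoted maximum corresponds to roughly 650 teraflops per chip across about 6,250 Trn2 instances.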

Strengthened partnership with NVIDIA announced (1/2)
Status: Announcement
NVIDIA CEO Jensen Huang appeared as a guest in Adam Selipsky's keynote to announce a strengthened partnership between AWS and NVIDIA
• 2 million NVIDIA GPUs were deployed on AWS over the past year. This is equivalent to 3 zettaflops, or 3,000 exascale supercomputers
• Going forward, 1 zettaflops of compute will be deployed on AWS every quarter
• New EC2 instances announced, planned to become available in 2024
  § P5e (NVIDIA H200 Tensor Core)
  § G6e (NVIDIA L40S Tensor Core)
  § G6 (NVIDIA L4 Tensor Core)
Links: NVIDIA Blog | Adam's keynote

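Strengthened partnership with NVIDIA announced (2/2)
Status: Announcement
AWS is the first to host NVIDIA DGX Cloud with GH200 Grace Hopper
• AWS becomes the first cloud provider to bring the NVIDIA GH200 Grace Hopper Superchip to the cloud
• NVIDIA DGX Cloud will be hosted on AWS. It will be the first DGX Cloud featuring GH200
Project Ceiba accelerates NVIDIA's AI development
• Under "Project Ceiba", which builds an AI supercomputer for NVIDIA's own research and development, a large-scale system featuring GH200 NVL32 and the Amazon EFA interconnect will be built on AWS
Links: NVIDIA Blog | Adam's keynote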

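Three new NVIDIA GPU-based instances announced
Status: Announcement (* planned availability in 2024)
Category: NVIDIA, accelerated computing
EC2 P5e instances
• NVIDIA H200 Tensor Core GPU
• Successor to EC2 P5 (H100), with 1.7x larger and 1.4x faster memory
• 3200 Gbps of EFA network bandwidth
EC2 G6e instances
• NVIDIA L40S Tensor Core GPU
• Up to 1.45 petaflops of FP8 performance and up to 209 teraflops of ray-tracing performance
• Suited to LLM training and digital twins
EC2 G6 instances
• NVIDIA L4 Tensor Core GPU
• Suited to machine learning models for natural language processing, language translation, video/image analysis, and speech recognition, as well as graphics processing
Links: AWS Blog | NVIDIA Press Release | YouTube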

Amazon EC2 U7i instances
Status: Preview (* preview started in Oregon, Seoul, and Frankfurt)
Category: Intel, high memory
Successor to the U-1 high-memory instances
• Up to 32 TiB of DDR5 memory
• Custom 4th Gen Intel Xeon Scalable Processor (Sapphire Rapids)
• Supports up to 896 vCPUs, the most of any instance on AWS
Use cases
• High-memory, mission-critical workloads such as SAP HANA, Oracle, and SQL Server
Compared with existing U-1 instances
• Up to 125% more compute performance
• More than 2.5x the EBS bandwidth, shortening data load times
Links: What's New | AWS Blog | YouTube

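Amazon EC2 DL2q instances
Status: Generally available (* generally available in Oregon and Frankfurt)
Category: Qualcomm, accelerated computing
Instances equipped with Qualcomm AI 100 Standard accelerators, suited to machine learning inference
• For inference tasks in natural language processing, image processing, and generative AI
• Particularly useful for validating AI workloads before deploying them to smartphones, automobiles, robotics, and extended reality headsets
• Comes with Qualcomm's AI stack, providing a consistent developer experience that extends to Qualcomm edge devices
Links: What's New | AWS Blog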

HPC Related Updates!

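Amazon EC2 Capacity Blocks for ML
Status: Generally available (* generally available in Ohio)
A new mechanism for reserving the GPU capacity needed to run machine learning workloads for just a few days
Currently available only for EC2 P5 instances (NVIDIA H100 Tensor Core) in the Ohio Region
Information required to make a reservation
• Start date (up to 8 weeks ahead)
• Duration (1 to 14 days)
• Number of instances (1 to 64 instances)
Links: What's New | AWS Blog | YouTube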
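The three reservation inputs listed above can be checked before any API call is made. The Python sketch below is one way to validate them against the limits quoted on the slide and to assemble a request. The helper `build_offering_query` is hypothetical, the instance size `p5.48xlarge` is an assumed default, and the dictionary keys follow the EC2 `DescribeCapacityBlockOfferings` parameters as I understand them, so verify them against the current API reference before relying on this.

```python
from datetime import date, datetime, time, timedelta, timezone

# Limits quoted on the slide.
MAX_LEAD_WEEKS = 8
MIN_DAYS, MAX_DAYS = 1, 14
MIN_INSTANCES, MAX_INSTANCES = 1, 64


def build_offering_query(start: date, days: int, instance_count: int,
                         today: date, instance_type: str = "p5.48xlarge") -> dict:
    """Validate a Capacity Block request and return query parameters (hypothetical helper)."""
    if not today <= start <= today + timedelta(weeks=MAX_LEAD_WEEKS):
        raise ValueError("start date must be within the next 8 weeks")
    if not MIN_DAYS <= days <= MAX_DAYS:
        raise ValueError("duration must be between 1 and 14 days")
    if not MIN_INSTANCES <= instance_count <= MAX_INSTANCES:
        raise ValueError("instance count must be between 1 and 64")
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "StartDateRange": datetime.combine(start, time.min, tzinfo=timezone.utc),
        "CapacityDurationHours": days * 24,
    }


query = build_offering_query(date(2024, 2, 5), days=7, instance_count=4,
                             today=date(2024, 1, 26))
print(query["CapacityDurationHours"])  # 168

# With boto3 installed and AWS credentials configured, the query could be sent as
# follows (assumption, not executed here; us-east-2 is the Ohio Region):
#   boto3.client("ec2", region_name="us-east-2").describe_capacity_block_offerings(**query)
```

Keeping validation separate from the API call makes the limits easy to update if the supported durations or instance counts change.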

Capacity Blocks vs. ODCR + Savings Plans: choosing how to secure capacity
Optimize by combining Capacity Blocks and ODCR
• Workloads that run 3-4 weeks per month: ODCR + SPs
• Workloads that run 3 weeks or less per month: Capacity Blocks
Notes on usage
• SPs and RI discounts cannot be applied to Capacity Blocks
Chart: cost of using Capacity Blocks for 2, 3, or 4 weeks per month compared with ODCR + 1-year No Upfront Savings Plans (savings labels shown: 12%, 42%)
Links: YouTube | AWS Doc

AWS ParallelCluster updates
3.1.x (Launched)
• Multi-user support
• Cluster access without Internet
3.2 Release (Launched)
• Support for multiple file systems
• Memory-aware scheduling
• OpenZFS, ONTAP support
• Dynamic queue updates
• Instance fast failover
3.3 Release (2022-Nov)
• Flexible instance types for capacity management
• Slurm job accounting
• Dynamic file system mounting
• Native ODCR support
• Slurm upgrade (22.05)
• Placement groups per compute resource
3.4 Release (2022-Dec)
• Support for multiple AZs
• Mount encrypted EFS file systems via EFS Utils
• Permission boundary policy extension
• ParallelCluster Lambda restricted to VPC
3.5 Release (2023-Mar)
• ParallelCluster UI
• pcluster functions callable as a Python library
• Compute node logs output to CloudWatch (CW)
3.6 (2023-May)
• Slurm upgrade to 23.02.2
• Custom Slurm settings
• RHEL8 support
• Number of queues: 10 -> 50
• GPU health check
• Multiple custom bootstrap scripts
3.7 (2023-Aug)
• Support for FSx File Cache
• Memory-aware scheduling with multiple instance types
• Job-level scaling (aka all-or-nothing)
• Support for login nodes
• Ubuntu 22 support
3.7.2 (2023-Oct)
• Slurm upgrade for a security vulnerability
3.8.0 (2023-Dec) NEW!!!
• Rocky Linux 8 support
• EC2 Capacity Blocks for ML support
https://github.com/aws/aws-parallelcluster/releases

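Amazon EC2 Instance Topology API
Status: Generally available
An API for understanding the network topology of existing EC2 instances
• For HPC / ML workloads where latency and throughput matter, it can inform job placement optimization
• Detects which network nodes each EC2 instance is connected to (and therefore how many hops separate instances)
• Output is in JSON format
Notes on usage
• The Instance Topology API only detects the location of existing instances
• To launch new instances physically close together, a capacity reservation using a cluster placement group is required
Links: What’s New | AWS Doc (Instance Topology) | AWS Doc (cluster placement groups)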
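To illustrate how the JSON output could inform job placement, the Python sketch below ranks instance pairs by how many network-node layers they share. It assumes the response shape of `DescribeInstanceTopology` as I understand it (an `Instances` list whose entries carry `InstanceId` and a `NetworkNodes` list ordered from the top layer down to the layer nearest the instance). The instance and node IDs are made up for illustration, and the helper names are hypothetical.

```python
from itertools import combinations


def shared_layers(nodes_a, nodes_b) -> int:
    """Count leading network-node layers two instances have in common."""
    count = 0
    for a, b in zip(nodes_a, nodes_b):
        if a != b:
            break
        count += 1
    return count


def rank_pairs_by_proximity(response: dict) -> list:
    """Return (id_a, id_b, shared_layers) tuples, closest pairs first."""
    pairs = [
        (x["InstanceId"], y["InstanceId"],
         shared_layers(x["NetworkNodes"], y["NetworkNodes"]))
        for x, y in combinations(response["Instances"], 2)
    ]
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical response; real IDs come from the API, for example via
# boto3.client("ec2").describe_instance_topology(InstanceIds=[...]) (not executed here).
sample = {
    "Instances": [
        {"InstanceId": "i-aaa", "NetworkNodes": ["nn-1", "nn-2", "nn-3"]},
        {"InstanceId": "i-bbb", "NetworkNodes": ["nn-1", "nn-2", "nn-3"]},
        {"InstanceId": "i-ccc", "NetworkNodes": ["nn-1", "nn-4", "nn-5"]},
    ]
}

print(rank_pairs_by_proximity(sample)[0])  # ('i-aaa', 'i-bbb', 3)
```

A scheduler could use such a ranking to keep tightly coupled ranks on instances that share the most layers, which is one plausible reading of "fewer hops".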

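Faster training data transfer from Amazon S3
Status: Generally available
• Data transfer from Amazon S3 to Amazon EC2 Trn1, P4d, and P5 instances is accelerated
  § Downloading the data needed to run machine learning training jobs is up to 3x faster. Uploading model checkpoint data is up to 5x faster
• Enabled automatically in the latest versions of the AWS CLI and Python SDK
  § Best practices such as request parallelization, automatic retries, and DNS load balancing are applied through the AWS Common Runtime (CRT)
  § Any application that uses the AWS CLI or Python SDK gets the benefits of CRT
  § https://aws.amazon.com/jp/about-aws/whats-new/2023/11/updates-accelerate-amazon-s3-data-transfer-ml-training/
Diagram: Amazon Simple Storage Service (Amazon S3) bucket with objects → Trn1, P4d, and P5 instances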
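Request parallelization, one of the practices mentioned above, generally means fetching a large object as several byte-range requests that run concurrently. The Python sketch below only illustrates that idea by computing HTTP `Range` header values for an object. It is not the CRT implementation, and the function name and part size are hypothetical.

```python
from typing import List


def split_byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Split an object of object_size bytes into HTTP Range header values."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


# A 20-byte object fetched in 8-byte parts needs three ranged requests.
print(split_byte_ranges(20, 8))  # ['bytes=0-7', 'bytes=8-15', 'bytes=16-19']
```

Each range can then be requested on its own connection, which is how parallel transfers can use more of the available network bandwidth than a single sequential stream.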

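Amazon FSx for NetApp ONTAP now supports scale-out configurations
Status: Generally available
• FSx for NetApp ONTAP supports scale-out configurations, meeting higher performance requirements
  § 36 GB/s of throughput, 1.2M IOPS
  § High performance is achieved by distributing across up to 6 high-availability pairs (file servers). A single pair provides 6 GB/s and 200K IOPS
• Available in the N. Virginia, Oregon, Ohio, Ireland, and Sydney Regions
Diagram: Amazon FSx for NetApp ONTAP file system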
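The per-pair and aggregate figures above are consistent with linear scaling (6 pairs × 6 GB/s = 36 GB/s, 6 pairs × 200K IOPS = 1.2M IOPS). Assuming that linearity holds, which the slide implies but does not state outright, a sizing estimate could look like the Python sketch below. The function name is hypothetical.

```python
import math

# Figures quoted on the slide.
PER_PAIR_GBPS = 6
PER_PAIR_IOPS = 200_000
MAX_PAIRS = 6


def ha_pairs_needed(target_gbps: float, target_iops: int) -> int:
    """Estimate HA pairs needed for a target, assuming linear scaling per pair."""
    pairs = max(
        1,
        math.ceil(target_gbps / PER_PAIR_GBPS),
        math.ceil(target_iops / PER_PAIR_IOPS),
    )
    if pairs > MAX_PAIRS:
        raise ValueError("target exceeds the 6-pair scale-out maximum")
    return pairs


print(ha_pairs_needed(36, 1_200_000))  # 6
print(ha_pairs_needed(10, 300_000))    # 2
```

Real workloads rarely scale perfectly across pairs, so the result is best read as a lower bound on the number of pairs to provision.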

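Research and Engineering Studio on AWS (RES)
Status: Generally available (* generally available in 12 Regions; not yet offered in Tokyo)
An open-source web portal for managing and creating environments in which R&D teams can run workloads without needing cloud expertise
Key Features
1. Create desktop environments from a dedicated web portal
2. Collaborate using shared data stores
3. Integrate with existing identity management infrastructure (AWS Managed AD)
4. Manage cost and access per project
Diagram: Admin manages the environment; end users log in and use it
Links: GitHub | AWS Doc | AWS Blog | YouTube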

Research and Engineering Studio on AWS (RES)
Status: Generally available
Simplifies access to cloud-based workstations
• Define preconfigured toolsets and software to meet the team's needs
• Right-size instance types for performance and cost management
• Share sessions for training and collaboration
• Schedule sessions based on business hours and enable automatic shutdown

Thank you!
Hiroshi Kobayashi [email protected]

One More Thing …
The latest 7th-generation instances, powered by Intel Sapphire Rapids and AMD Genoa, are reaching GA in the Tokyo Region one after another!
• Instances: C7i, M7i, M7i-Flex, R7i, M7a, R7a, R7iz
• GA dates shown: Jan/23, Dec/20, Jan/9, Jan/10, Jan/11
