P2
• Up to 16 NVIDIA GPUs (8 K80 cards) in a single instance
• Including peer-to-peer PCIe GPU interconnect
• Supporting a wide variety of use cases, including deep learning, HPC simulations, and batch rendering

Instance Size   GPUs   GPU Peer-to-Peer   vCPUs   Memory (GiB)   Network Bandwidth*
p2.xlarge       1      -                  4       61             1.25 Gbps
p2.8xlarge      8      Y                  32      488            10 Gbps
p2.16xlarge     16     Y                  64      732            20 Gbps
*In a placement group
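To make the placement-group footnote concrete, here is a minimal boto3 sketch (not from the original deck) that launches a p2.16xlarge into a cluster placement group, which is what the 20 Gbps figure assumes. The AMI ID, group name, and region are placeholders.

    import boto3

    ec2 = boto2 = boto3.client("ec2", region_name="us-east-1")

    # Cluster placement groups give the full advertised network bandwidth.
    ec2.create_placement_group(GroupName="p2-cluster", Strategy="cluster")

    # ImageId is a placeholder; substitute a Deep Learning AMI ID for your region.
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="p2.16xlarge",
        MinCount=1,
        MaxCount=1,
        Placement={"GroupName": "p2-cluster"},
    )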
DL training and inference

MXNet training on EC2 P2 instances: We trained a popular image analysis algorithm, Inception v3, using MXNet running on P2 instances. MXNet had the fastest throughput of any library we evaluated (measured in images trained per second), and throughput rose at almost the same rate as the number of GPUs used for training, with a scaling efficiency of 85%.
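The deck does not include the training script; the following is a minimal data-parallel sketch in MXNet Gluon, assuming the 8 GPUs of a p2.8xlarge. The model, batch size, and hyperparameters are illustrative, not the benchmark configuration.

    import mxnet as mx
    from mxnet import gluon, autograd
    from mxnet.gluon.model_zoo import vision

    # One context per GPU; on a p2.16xlarge this would be range(16).
    ctx = [mx.gpu(i) for i in range(8)]

    net = vision.inception_v3(classes=1000)
    net.initialize(mx.init.Xavier(), ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

    def train_batch(data, label):
        # Data-parallel step: split the batch across GPUs, then aggregate gradients.
        data_parts = gluon.utils.split_and_load(data, ctx)
        label_parts = gluon.utils.split_and_load(label, ctx)
        with autograd.record():
            losses = [loss_fn(net(X), y) for X, y in zip(data_parts, label_parts)]
        for l in losses:
            l.backward()
        trainer.step(data.shape[0])

    # Example step with a random batch (Inception v3 expects 299x299 inputs).
    train_batch(mx.nd.random.uniform(shape=(256, 3, 299, 299)),
                mx.nd.random.randint(0, 1000, shape=(256,)))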
P2: GPU-accelerated computing
§ Enabling a high degree of parallelism – each GPU has thousands of cores
§ Consistent, well-documented set of APIs (CUDA, OpenACC, OpenCL)
§ Supported by a wide variety of ISVs and open source frameworks

F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
§ Massively parallel – each FPGA includes millions of parallel system logic cells
§ Flexible – no fixed instruction set, can implement wide or narrow datapaths
§ Programmable using available, cloud-based FPGA development tools
Parallel Processing in GPUs and FPGAs

A GPU is effective at processing the same operation in parallel across many data elements – single instruction, multiple data (SIMD). A GPU has a well-defined instruction set and fixed word sizes – for example single-, double-, or half-precision floating point values and integers. An FPGA is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD). An FPGA has no predefined instruction set and no fixed data width.

[Diagram: a CPU core (Control, ALUs, Cache, DRAM) beside a GPU with thousands of cores (each GPU in P2 has 2880 of these cores) and an FPGA with Block RAM and DRAM (each FPGA in F1 has more than 2M logic cells).]
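To make the SIMD side of the contrast concrete, here is a minimal GPU kernel sketch using Numba's CUDA JIT (not from the original deck): every thread executes the same instruction stream on its own fixed-width 32-bit floating-point element.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        # Every thread runs this same instruction stream on its own element (SIMD).
        i = cuda.grid(1)
        if i < x.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)   # fixed word size: 32-bit floats
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](np.float32(2.0), x, y, out)  # Numba copies arrays to/from the GPU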
F1
• Up to 8 Xilinx UltraScale Plus VU9P FPGAs in a single instance, with four high-speed DDR-4 channels per FPGA
• Largest size includes high-performance FPGA interconnects via PCIe Gen3 (FPGA Direct) and a bidirectional ring (FPGA Link)
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing

Instance Size   FPGAs   FPGA Link   FPGA Direct   vCPUs   Memory (GiB)   NVMe Instance Storage (GB)   Network Bandwidth*
f1.2xlarge      1       -           -             8       122            1 x 480                      5 Gbps
f1.16xlarge     8       Y           Y             64      976            4 x 960                      20 Gbps
*In a placement group
Abstracting FPGA I/O

The AWS FPGA Shell provides pre-built and secure I/O components, allowing FPGA developers to focus on their differentiating value. The FPGA Shell allows for faster coding of core acceleration functions by removing the need to develop I/O-related FPGA hardware.

[Diagram: the FPGA Shell wrapping the custom logic – Block RAM, the four DDR-4 interfaces, FPGA Link, and PCIe are handled by the Shell.]
FPGA Acceleration Using F1

Launch an instance and load an AFI: an F1 instance can have any number of AFIs, and an AFI can be loaded into the FPGA in less than 1 second.

[Diagram: the application on the instance CPU talks to the FPGAs over PCIe; each FPGA's DDR controllers drive banks of DDR-4 attached memory, with FPGA Link and FPGA Direct connecting the FPGAs to one another.]
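On a running F1 instance, loading an AFI into a slot is done with the management tools from the AWS FPGA SDK (fpga-load-local-image and fpga-describe-local-image, which must be installed on the instance). A minimal sketch driving them from Python, with a placeholder AGFI ID:

    import subprocess

    # Placeholder; use the global AFI ID (agfi-...) returned when your AFI was created.
    AGFI = "agfi-0123456789abcdef0"

    # Load the AFI into FPGA slot 0.
    subprocess.run(["sudo", "fpga-load-local-image", "-S", "0", "-I", AGFI], check=True)

    # Confirm what is loaded in slot 0.
    subprocess.run(["sudo", "fpga-describe-local-image", "-S", "0"], check=True)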
Developing Applications for F1

1. Start from the FPGA Developer AMI, which includes all needed FPGA design and programming software as well as the AWS FPGA Hardware Development Kit (HDK)
2. Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic
3. After successful simulation, use Vivado or SDAccel to synthesize and place/route the FPGA logic to create an FPGA Design Check Point (DCP), encrypt it, and generate an Amazon FPGA Image (AFI)
4. Launch an F1 instance and load the AFI into the FPGA, using the AFI management tools provided by AWS
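The output of step 3 is registered as an AFI through the EC2 API. A minimal boto3 sketch of that call, with placeholder bucket and key names pointing at the DCP tarball produced by the Vivado/SDAccel build:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Bucket and key names are placeholders for your own S3 locations.
    resp = ec2.create_fpga_image(
        Name="my-accelerator",
        Description="Custom logic built with the AWS FPGA HDK",
        InputStorageLocation={"Bucket": "my-fpga-bucket", "Key": "dcp/my-design.tar"},
        LogsStorageLocation={"Bucket": "my-fpga-bucket", "Key": "logs/"},
    )

    # The global ID (agfi-...) is what fpga-load-local-image consumes on the F1 instance.
    print(resp["FpgaImageId"], resp["FpgaImageGlobalId"])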