Slide 1

Accelerated Computing on AWS
Jamie Kinney, Principal Product Manager, HPC
March 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Slide 2

EC2 Compute Instance Types

General purpose: M4, M3, T2
Compute optimized: C5 (announced), C4, C3, CC2
Storage and I/O optimized: I3, I2, HS1, D2
Memory optimized: X1, R4, R3
GPU and FPGA accelerated: CG1 (2010), G2 (2013), P2 (2016), F1 (2016, preview)

Slide 3

GPU Acceleration with P2 Instances

Slide 4

P2 GPU Instances

• Up to 16 GPUs (8 NVIDIA K80 cards) in a single instance
• Including peer-to-peer PCIe GPU interconnect
• Supporting a wide variety of use cases including deep learning, HPC simulations, and batch rendering

Instance Size    GPUs    GPU Peer-to-Peer    vCPUs    Memory (GiB)    Network Bandwidth*
p2.xlarge        1       -                   4        61              1.25 Gbps
p2.8xlarge       8       Y                   32       488             10 Gbps
p2.16xlarge      16      Y                   64       732             20 Gbps

*In a placement group
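
The table's network bandwidth figures assume a cluster placement group, so a launch script should create one first. Below is a minimal boto3 sketch (not from the deck); the AMI ID, key pair name, and group name are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The 10/20 Gbps figures above assume instances are launched into a
# placement group; the group name here is illustrative.
ec2.create_placement_group(GroupName="p2-cluster", Strategy="cluster")

# Launch a p2.16xlarge into the placement group. The AMI ID is a
# placeholder; use a CUDA-enabled AMI such as the Deep Learning AMI.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder
    InstanceType="p2.16xlarge",
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "p2-cluster"},
    KeyName="my-key-pair",             # placeholder
)
print(response["Instances"][0]["InstanceId"])
```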

Slide 5

Deep Learning on GPUs

P2 GPU instances for high-performance DL training and inference.

MXNet training on EC2 P2 instances: we trained a popular image analysis network, Inception v3, using MXNet running on P2 instances. MXNet had the fastest throughput of any library we evaluated (measured in images trained per second), and throughput rose at almost the same rate as the number of GPUs used for training, with a scaling efficiency of 85%.
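
As a rough illustration of the data-parallel training behind that 85% scaling figure, here is a minimal MXNet Module-API sketch (not from the deck); the tiny synthetic network and data stand in for Inception v3 and the real image set.

```python
import numpy as np
import mxnet as mx

# Synthetic stand-in data; the benchmark above used Inception v3 on
# real images, omitted here for brevity.
x = np.random.rand(1000, 128).astype("float32")
y = np.random.randint(0, 10, (1000,))
train_iter = mx.io.NDArrayIter(x, y, batch_size=256)

# A small example network; swap in an Inception-v3 symbol for the real thing.
data = mx.sym.Variable("data")
net = mx.sym.FullyConnected(data, num_hidden=64)
net = mx.sym.Activation(net, act_type="relu")
net = mx.sym.FullyConnected(net, num_hidden=10)
sym = mx.sym.SoftmaxOutput(net, name="softmax")

# Listing several GPU contexts makes MXNet split each batch across them --
# the data-parallel scaling the 85% efficiency figure refers to.
ctx = [mx.gpu(i) for i in range(16)]  # 16 GPUs on a p2.16xlarge
mod = mx.mod.Module(symbol=sym, context=ctx)
mod.fit(train_iter, optimizer="sgd",
        optimizer_params={"learning_rate": 0.1}, num_epoch=5)
```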

Slide 6

Deep Learning Frameworks

Slide 7

Medical Image Rendering on P2

Slide 8

New: Elastic GPU

Slide 9

EC2 + Elastic GPU = Graphics Flexibility

Attach an Elastic GPU to an EC2 instance (t2, c4, m4, r4, and more) at launch, similar to attaching an EBS volume. Elastic GPUs range in size from small to large.
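
Concretely, the attach-at-launch model surfaces as an extra parameter on the regular launch call. A hedged boto3 sketch, not from the deck: the AMI ID is a placeholder, and eg1.medium is one of the Elastic GPU sizes.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request an Elastic GPU attachment in the same RunInstances call used
# to launch the instance -- analogous to specifying an EBS volume.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Windows AMI
    InstanceType="m4.xlarge",
    MinCount=1,
    MaxCount=1,
    ElasticGpuSpecification=[{"Type": "eg1.medium"}],  # smallest size
)
print(response["Instances"][0]["InstanceId"])
```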

Slide 10

Elastic GPU Architecture

[Diagram: the application on the instance issues compute and graphics instructions; the graphics instructions are forwarded over the network to the Elastic GPU graphics attachment, which renders them and streams the resulting images back to the instance.]

Slide 11

Attaching Elastic GPU – Console

Slide 12

Attaching Elastic GPU – Console

Slide 13

Amazon AppStream 2.0: a fully managed application streaming service that provides users with instant access to their desktop applications.

Slide 14

Desktop Application Streaming

Run desktop apps in a web browser: stream desktop applications securely to any web browser.

• Pay-as-you-go
• Scale globally
• Secure apps & data

Slide 15

Simple User Experience

• Use multiple apps at the same time
• Clipboard, file upload/download, printing
• Audio and bandwidth controls
• Multiple storage options
• HTML5 browsers with no plug-ins

Slide 16

Simple user experience

Slide 17

Simple user experience

Slide 18

FPGA Acceleration with F1 Instances

Slide 19

GPU and FPGA for Accelerated Computing

P2: GPU-accelerated computing (NVIDIA Tesla GPU card)
§ Enabling a high degree of parallelism – each GPU has thousands of cores
§ Consistent, well-documented set of APIs (CUDA, OpenACC, OpenCL)
§ Supported by a wide variety of ISVs and open source frameworks

F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
§ Massively parallel – each FPGA includes millions of parallel system logic cells
§ Flexible – no fixed instruction set, can implement wide or narrow datapaths
§ Programmable using available, cloud-based FPGA development tools

Slide 20

Accelerated Computing Concepts

Parallelism increases throughput…

CPU: high speed, low efficiency
GPU/FPGA: high throughput, high efficiency

GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for certain categories of applications.

Slide 21

Parallel Processing in GPUs and FPGAs

A GPU is effective at processing the same set of operations in parallel – single instruction, multiple data (SIMD). A GPU has a well-defined instruction set and fixed word sizes – for example single-, double-, or half-precision integer and floating-point values. Each GPU in P2 has 2,880 of these cores.

An FPGA is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD). An FPGA does not have a predefined instruction set or a fixed data width. Each FPGA in F1 has more than 2M of these cells.

[Diagram: a CPU core (control logic, ALUs, cache, DRAM) contrasted with a GPU (thousands of cores with DRAM) and an FPGA (a fabric of logic cells with block RAM and DRAM).]
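
As a software analogy only (GPUs and FPGAs implement these models in hardware), vectorized NumPy mimics the SIMD pattern, while concurrently running dissimilar operations stands in for MIMD; the two "circuits" below are invented for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# SIMD (GPU-style): one sequence of instructions applied uniformly
# across many data elements at once.
a = np.arange(1_000_000, dtype=np.float32)
b = np.sqrt(a * 2.0 + 1.0)   # same ops run on every element

# MIMD (FPGA-style): different operations proceed concurrently on the
# data; threads only approximate the truly parallel circuits an FPGA
# implements in hardware.
def fir_filter(x):           # one "circuit": a 3-tap moving average
    return np.convolve(x, np.ones(3) / 3, mode="valid")

def threshold(x):            # a different "circuit" running alongside it
    return (x > 0.5).astype(np.uint8)

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(fir_filter, b)
    f2 = pool.submit(threshold, b)
    filtered, flags = f1.result(), f2.result()
```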

Slide 22

F1 FPGA Instances

• Up to 8 Xilinx Virtex UltraScale+ VU9P FPGAs in a single instance, with four high-speed DDR4 memory channels per FPGA
• Largest size includes high-performance FPGA interconnects via PCIe Gen3 (FPGA Direct) and a bidirectional ring (FPGA Link)
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing

Instance Size    FPGAs    FPGA Link    FPGA Direct    vCPUs    Memory (GiB)    NVMe Instance Storage (GB)    Network Bandwidth*
f1.2xlarge       1        -            -              8        122             1 x 480                       5 Gbps
f1.16xlarge      8        Y            Y              64       976             4 x 960                       20 Gbps

*In a placement group

Slide 23

AWS FPGA Shell – Abstracting FPGA I/O

FPGA I/O is provided using pre-configured, pre-tested, and secure I/O components, allowing FPGA developers to focus on their differentiating value. The FPGA Shell allows for faster coding of core acceleration functions by removing the need to develop I/O-related FPGA hardware.

[Diagram: Shell-provided interfaces – PCIe, FPGA Link, and four DDR-4 channels – surrounding the developer's custom logic and block RAM.]

Slide 24

FPGA Acceleration Using F1

An F1 instance pairs an Amazon Machine Image (AMI) for the CPU application with an Amazon FPGA Image (AFI) for the FPGA logic: launch the instance, then load the AFI.

• An F1 instance can have any number of AFIs
• An AFI can be loaded into the FPGA in less than 1 second

[Diagram: the CPU application on the F1 instance talks to the FPGA over PCIe; each FPGA has DDR controllers with four channels of DDR-4 attached memory, plus the FPGA Link and FPGA Direct interconnects.]
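
On the instance itself, the AFI management tools are command-line utilities from AWS's FPGA SDK. A minimal sketch of driving them from Python, assuming the SDK tools are installed; the AGFI ID is a placeholder.

```python
import subprocess

# Global AFI (AGFI) ID to load; this value is a placeholder.
agfi_id = "agfi-0123456789abcdef0"

# Load the AFI into FPGA slot 0 using the SDK's management tool, then
# read back the slot's status to confirm the image is loaded.
subprocess.run(["sudo", "fpga-load-local-image", "-S", "0", "-I", agfi_id],
               check=True)
subprocess.run(["sudo", "fpga-describe-local-image", "-S", "0"], check=True)
```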

Slide 25

Developing Applications for F1

Development steps:

1. Launch the AWS-provided FPGA Developer AMI, which includes all needed FPGA design and programming software, as well as the AWS FPGA Hardware Development Kit (HDK)
2. Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic
3. After successful simulation, use Vivado or SDAccel to synthesize and place/route the FPGA logic to create an FPGA Design Check Point (DCP), encrypt it, and generate an Amazon FPGA Image (AFI) – see the sketch after this list
4. Launch an F1 instance and load the AFI to the FPGA, using AFI management tools provided by AWS
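
For step 3's final action, registering the encrypted DCP as an AFI is an EC2 API call. A minimal boto3 sketch; the bucket, key, and names are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Register the encrypted Design Check Point (DCP) tarball, previously
# uploaded to S3, as an Amazon FPGA Image.
resp = ec2.create_fpga_image(
    InputStorageLocation={"Bucket": "my-fpga-bucket",
                          "Key": "dcp/my_design.Developer_CL.tar"},
    LogsStorageLocation={"Bucket": "my-fpga-bucket", "Key": "logs/"},
    Name="my-accelerator",
    Description="Custom acceleration logic for F1",
)
print(resp["FpgaImageId"], resp["FpgaImageGlobalId"])  # afi-... and agfi-...
```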

Slide 26

Genomics Processing

FPGA speeds analysis of whole human genomes from hours to minutes, at unprecedented low cost for compute and compressed storage.

Highly efficient:
• Algorithms implemented in hardware
• Gate-level circuit design
• No instruction-set overhead

Massively parallel:
• Massively parallel circuits
• Multiple compute engines
• Rapid FPGA reconfigurability

Slide 27

Financial Computing

Modeling counterparty risk (CVA) and regulatory capital requirements.

Slide 28

F1 for Video Processing

Next-generation video compression for broadcast-quality 4K content. Successfully ported to F1 in just 3 weeks.

Slide 29

F1 for Accelerated Analytics

Heterogeneous compute acceleration for faster data discovery.

Slide 30

Delivering FPGA Partner Solutions on AWS via AWS Marketplace

[Diagram: a partner's Amazon Machine Image (AMI) and Amazon FPGA Image (AFI) are published to AWS Marketplace, from which customers deploy them to Amazon EC2 F1 instances.]

Slide 31

Thank You [email protected]