GPU_FPGA_Briefing+March+2017

GPU, FPGA, and HPC instance update for March 2017, presented by Jamie Kinney at JAWS AI x HPC on 2017/03/31

porcaro33

March 31, 2017

Transcript

  1. Accelerated Computing on AWS
    Jamie Kinney, Principal Product Manager, HPC. March 2017.
    © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. EC2 Compute Instance Types
    General purpose: M4, M3, T2
    Compute optimized: C5 (Announced), C4, C3, CC2
    Storage and IO optimized: I3, I2, D2, HS1
    GPU and FPGA accelerated: F1 (Preview), P2, G2, CG1
    Memory optimized: X1, R4, R3
    Timeline markers on the slide: 2010, 2013, 2016, 2016
  3. P2 GPU Instances
    • Up to 16 GPUs (8 NVIDIA K80 cards) in a single instance
    • Including peer-to-peer PCIe GPU interconnect
    • Supporting a wide variety of use cases including deep learning, HPC simulations, and batch rendering

    Instance Size   GPUs   GPU Peer to Peer   vCPUs   Memory (GiB)   Network Bandwidth*
    p2.xlarge       1      -                  4       61             1.25 Gbps
    p2.8xlarge      8      Y                  32      488            10 Gbps
    p2.16xlarge     16     Y                  64      732            20 Gbps
    *In a placement group
  4. Deep Learning on GPUs
    P2 GPU instances for high performance DL training and inference
    MXNet training on EC2 P2 instances: We trained a popular image analysis algorithm, Inception v3, using MXNet running on P2 instances. MXNet had the fastest throughput of any library we evaluated (as measured by the number of images trained per second), and the throughput rose at almost the same rate as the number of GPUs used for training, with a scaling efficiency of 85%.
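The scaling-efficiency figure on the slide above can be read as measured multi-GPU throughput divided by ideal linear scaling. The sketch below shows that arithmetic only. The throughput numbers are hypothetical, chosen so the result matches the 85% quoted on the slide; they are not measurements from the talk.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Return measured throughput as a fraction of ideal linear scaling."""
    if num_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("need at least one GPU and a positive baseline")
    ideal_throughput = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal_throughput


# Hypothetical images/sec values (assumptions, not data from the slide).
single_gpu = 100.0
sixteen_gpus = 1360.0

efficiency = scaling_efficiency(single_gpu, sixteen_gpus, 16)
print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency of 1.0 would mean throughput grows exactly in step with the GPU count; anything lower reflects communication and synchronization overhead.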
  5. EC2 + Elastic GPU = Graphics Flexibility
    Instance types: t2, c4, m4, r4, ...
    Elastic GPU sizes: Small GPU ... Large GPU
    Attach Elastic GPU to an instance at launch, similar to attaching an EBS volume
  6. Run Desktop Apps in a Web Browser
    Desktop Application Streaming: stream desktop applications securely to any web browser
    • Pay-as-you-go
    • Scale globally
    • Secure apps & data
  7. Simple User Experience
    • Use multiple apps at the same time
    • Clipboard, file upload/download, printing
    • Audio and bandwidth controls
    • Multiple storage options
    • HTML5 browsers with no plug-ins
  8. GPU and FPGA for Accelerated Computing
    P2: GPU-accelerated computing (NVIDIA Tesla GPU Card)
    • Enabling a high degree of parallelism: each GPU has thousands of cores
    • Consistent, well documented set of APIs (CUDA, OpenACC, OpenCL)
    • Supported by a wide variety of ISVs and open source frameworks
    F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
    • Massively parallel: each FPGA includes millions of parallel system logic cells
    • Flexible: no fixed instruction set, can implement wide or narrow datapaths
    • Programmable using available, cloud-based FPGA development tools
  9. Accelerated Computing Concepts
    Parallelism increases throughput...
    CPU: High speed, low efficiency
    GPU/FPGA: High throughput, high efficiency
    GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for certain categories of applications
  10. Parallel Processing in GPUs and FPGAs
    A GPU is effective at processing the same set of operations in parallel: single instruction, multiple data (SIMD). A GPU has a well-defined instruction set and fixed word sizes, for example single, double, or half-precision integer and floating point values.
    An FPGA is effective at processing the same or different operations in parallel: multiple instructions, multiple data (MIMD). An FPGA does not have a predefined instruction set or a fixed data width.
    [Diagram labels: CPU (one core) with Control, ALUs, Cache, DRAM; GPU with DRAM, "Each GPU in P2 has 2880 of these cores"; FPGA with Block RAM and DRAM, "Each FPGA in F1 has more than 2M of these cells"]
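The SIMD versus MIMD distinction above can be sketched in a few lines. This is only a model of the data flow: Python runs the lanes one after another, whereas real GPU or FPGA hardware would process them concurrently. The function names `simd` and `mimd` are illustrative and do not come from the talk.

```python
import operator


def simd(op, lanes_a, lanes_b):
    """GPU-style: one instruction applied to every data lane."""
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]


def mimd(ops, lanes_a, lanes_b):
    """FPGA-style: each lane can run a different operation."""
    return [op(a, b) for op, a, b in zip(ops, lanes_a, lanes_b)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Same operation (add) on all four lanes.
simd_out = simd(operator.add, a, b)

# A different operation on each lane: add, subtract, multiply, max.
mimd_out = mimd([operator.add, operator.sub, operator.mul, max], a, b)

print(simd_out)  # [11, 22, 33, 44]
print(mimd_out)  # [11, -18, 90, 40]
```

In the SIMD case a single operation is broadcast across the data; in the MIMD case each lane carries its own operation, which is closer to how independent circuits on an FPGA can each implement different logic.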
  11. F1 FPGA Instances
    • Up to 8 Xilinx Virtex UltraScale+ VU9P FPGAs in a single instance, with four high-speed DDR-4 interfaces per FPGA
    • Largest size includes high performance FPGA interconnects via PCIe Gen3 (FPGA Direct), and bidirectional ring (FPGA Link)
    • Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing

    Instance Size   FPGAs   FPGA Link   FPGA Direct   vCPUs   Memory (GiB)   NVMe Instance Storage   Network Bandwidth*
    f1.2xlarge      1       -           -             8       122            1 x 480                 5 Gbps
    f1.16xlarge     8       Y           Y             64      976            4 x 960                 20 Gbps
    *In a placement group
  12. AWS FPGA Shell: Abstracting FPGA I/O
    FPGA I/O is provided using pre-configured, pre-tested, and secure I/O components, allowing FPGA developers to focus on their differentiating value.
    The FPGA Shell allows for faster coding of core acceleration functions by removing the need to develop I/O related FPGA hardware.
    [Diagram labels: Block RAM, DDR-4 (x4), FPGA Link, PCIe]
  13. FPGA Acceleration Using F1
    Launch Instance and Load AFI
    • An F1 instance can have any number of AFIs
    • An AFI can be loaded into the FPGA in less than 1 second
    [Diagram labels: Amazon Machine Image (AMI), Amazon FPGA Image (AFI), EC2 F1 Instance, CPU, Application on F1, DDR Controllers, DDR-4 Attached Memory (x8), FPGA Link, FPGA Direct, PCIe]
  14. Developing Applications for F1
    Development steps
    Step 1: Launch the AWS-provided FPGA Developer AMI, which includes all needed FPGA design and programming software, as well as the AWS FPGA Hardware Development Kit (HDK).
    Step 2: Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic.
    Step 3: After successful simulation, use Vivado or SDAccel to synthesize and place/route the FPGA logic to create an FPGA Design Check Point (DCP), encrypt it, and generate an Amazon FPGA Image (AFI).
    Step 4: Launch an F1 instance and load the AFI to the FPGA, using AFI management tools provided by AWS.
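Steps 3 and 4 above can be sketched as two small helpers. The talk does not show the tooling, so this rests on assumptions: it uses the shape of the EC2 `CreateFpgaImage` request (exposed in boto3 as `create_fpga_image`) and the `fpga-load-local-image` command from the AWS FPGA management tools, both as found in current AWS tooling, which may differ from what was available during the March 2017 preview. The bucket, key, name, and AFI identifier are hypothetical placeholders. Nothing is sent to AWS; the code only builds the request and the command.

```python
def build_create_fpga_image_request(bucket, dcp_key, logs_key, name, description=""):
    """Build keyword arguments for an EC2 CreateFpgaImage request (step 3).

    Assumes the encrypted DCP tarball was already uploaded to S3.
    """
    return {
        "InputStorageLocation": {"Bucket": bucket, "Key": dcp_key},
        "LogsStorageLocation": {"Bucket": bucket, "Key": logs_key},
        "Name": name,
        "Description": description,
    }


def build_load_command(slot, global_afi_id):
    """Build the command line that loads an AFI into an FPGA slot (step 4)."""
    return ["fpga-load-local-image", "-S", str(slot), "-I", global_afi_id]


# Hypothetical values for illustration only.
request = build_create_fpga_image_request(
    bucket="my-fpga-bucket",
    dcp_key="dcp/my_design.tar",
    logs_key="logs/",
    name="my-design",
    description="example AFI request",
)
load_cmd = build_load_command(0, "agfi-EXAMPLE")

print(request)
print(" ".join(load_cmd))
```

With credentials configured, the request dictionary could be passed as `boto3.client("ec2").create_fpga_image(**request)`, and the command list could be run on the F1 instance with `subprocess.run`. Both are left out so the sketch runs anywhere.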
  15. Genomics Processing
    FPGA Speeds Analysis of Whole Human Genomes from Hours to Minutes
    Unprecedented Low Cost for Compute and Compressed Storage
    Highly Efficient
    • Algorithms Implemented in Hardware
    • Gate-Level Circuit Design
    • No Instruction Set Overhead
    Massively Parallel
    • Massively Parallel Circuits
    • Multiple Compute Engines
    • Rapid FPGA Reconfigurability
  16. F1 for Video Processing
    Next Generation Video Compression for Broadcast Quality 4K content
    Successfully ported to F1 in just 3 weeks
  17. Delivering FPGA Partner Solutions on AWS via AWS Marketplace
    [Diagram labels: Amazon EC2 FPGA Deployment, Amazon Machine Image (AMI), Amazon FPGA Image (AFI), Customers, AWS Marketplace]