P2
• Up to 16 NVIDIA GPUs (8 K80 cards) in a single instance
• Including peer-to-peer PCIe GPU interconnect
• Supporting a wide variety of use cases, including deep learning, HPC simulations, and batch rendering

Instance Size   GPUs   GPU Peer-to-Peer   vCPUs   Memory (GiB)   Network Bandwidth*
p2.xlarge       1      -                  4       61             1.25 Gbps
p2.8xlarge      8      Y                  32      488            10 Gbps
p2.16xlarge     16     Y                  64      732            20 Gbps
*In a placement group
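To make the placement-group footnote concrete, here is a minimal boto3 sketch (not from the original deck) that launches a p2.16xlarge into a cluster placement group, which is what the 20 Gbps figure assumes. The AMI ID, group name, and region are placeholders.

    import boto3

    ec2 = boto2 = boto3.client("ec2", region_name="us-east-1")

    # Cluster placement groups give the full advertised network bandwidth.
    ec2.create_placement_group(GroupName="p2-cluster", Strategy="cluster")

    # ImageId is a placeholder; substitute a Deep Learning AMI ID for your region.
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="p2.16xlarge",
        MinCount=1,
        MaxCount=1,
        Placement={"GroupName": "p2-cluster"},
    )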
DL training and inference

MXNet training on EC2 P2 instances: We trained a popular image analysis algorithm, Inception v3, using MXNet running on P2 instances. MXNet had the fastest throughput of any library we evaluated (measured in images trained per second), and throughput rose at almost the same rate as the number of GPUs used for training, with a scaling efficiency of 85%.
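The deck does not include the training script; the following is a minimal data-parallel sketch in MXNet Gluon, assuming the 8 GPUs of a p2.8xlarge. The model, batch size, and hyperparameters are illustrative, not the benchmark configuration.

    import mxnet as mx
    from mxnet import gluon, autograd
    from mxnet.gluon.model_zoo import vision

    # One context per GPU; on a p2.16xlarge this would be range(16).
    ctx = [mx.gpu(i) for i in range(8)]

    net = vision.inception_v3(classes=1000)
    net.initialize(mx.init.Xavier(), ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

    def train_batch(data, label):
        # Data-parallel step: split the batch across GPUs, then aggregate gradients.
        data_parts = gluon.utils.split_and_load(data, ctx)
        label_parts = gluon.utils.split_and_load(label, ctx)
        with autograd.record():
            losses = [loss_fn(net(X), y) for X, y in zip(data_parts, label_parts)]
        for l in losses:
            l.backward()
        trainer.step(data.shape[0])

    # Example step with a random batch (Inception v3 expects 299x299 inputs).
    train_batch(mx.nd.random.uniform(shape=(256, 3, 299, 299)),
                mx.nd.random.randint(0, 1000, shape=(256,)))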
P2: GPU-accelerated computing
§ Enabling a high degree of parallelism – each GPU has thousands of cores
§ Consistent, well-documented set of APIs (CUDA, OpenACC, OpenCL)
§ Supported by a wide variety of ISVs and open source frameworks

F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
§ Massively parallel – each FPGA includes millions of parallel system logic cells
§ Flexible – no fixed instruction set, can implement wide or narrow datapaths
§ Programmable using available, cloud-based FPGA development tools
Parallel Processing in GPUs and FPGAs

A GPU is effective at processing the same operation in parallel across many data elements – single instruction, multiple data (SIMD). A GPU has a well-defined instruction set and fixed word sizes – for example single-, double-, or half-precision floating point values and integers. An FPGA is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD). An FPGA has no predefined instruction set and no fixed data width.

[Diagram: a CPU core (Control, ALUs, Cache, DRAM) beside a GPU with thousands of cores (each GPU in P2 has 2880 of these cores) and an FPGA with Block RAM and DRAM (each FPGA in F1 has more than 2M logic cells).]
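To make the SIMD side of the contrast concrete, here is a minimal GPU kernel sketch using Numba's CUDA JIT (not from the original deck): every thread executes the same instruction stream on its own fixed-width 32-bit floating-point element.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        # Every thread runs this same instruction stream on its own element (SIMD).
        i = cuda.grid(1)
        if i < x.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)   # fixed word size: 32-bit floats
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](np.float32(2.0), x, y, out)  # Numba copies arrays to/from the GPU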
F1
• Up to 8 Xilinx UltraScale Plus VU9P FPGAs in a single instance, with four high-speed DDR-4 channels per FPGA
• Largest size includes high-performance FPGA interconnects via PCIe Gen3 (FPGA Direct) and a bidirectional ring (FPGA Link)
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing

Instance Size   FPGAs   FPGA Link   FPGA Direct   vCPUs   Memory (GiB)   NVMe Instance Storage (GB)   Network Bandwidth*
f1.2xlarge      1       -           -             8       122            1 x 480                      5 Gbps
f1.16xlarge     8       Y           Y             64      976            4 x 960                      20 Gbps
*In a placement group
Abstracting FPGA I/O

The AWS FPGA Shell provides pre-built and secure I/O components, allowing FPGA developers to focus on their differentiating value. The FPGA Shell allows for faster coding of core acceleration functions by removing the need to develop I/O-related FPGA hardware.

[Diagram: the FPGA Shell wrapping the custom logic – Block RAM, the four DDR-4 interfaces, FPGA Link, and PCIe are handled by the Shell.]
FPGA Acceleration Using F1

Launch an instance and load an AFI: an F1 instance can have any number of AFIs, and an AFI can be loaded into the FPGA in less than 1 second.

[Diagram: the application on the instance CPU talks to the FPGAs over PCIe; each FPGA's DDR controllers drive banks of DDR-4 attached memory, with FPGA Link and FPGA Direct connecting the FPGAs to one another.]
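On a running F1 instance, loading an AFI into a slot is done with the management tools from the AWS FPGA SDK (fpga-load-local-image and fpga-describe-local-image, which must be installed on the instance). A minimal sketch driving them from Python, with a placeholder AGFI ID:

    import subprocess

    # Placeholder; use the global AFI ID (agfi-...) returned when your AFI was created.
    AGFI = "agfi-0123456789abcdef0"

    # Load the AFI into FPGA slot 0.
    subprocess.run(["sudo", "fpga-load-local-image", "-S", "0", "-I", AGFI], check=True)

    # Confirm what is loaded in slot 0.
    subprocess.run(["sudo", "fpga-describe-local-image", "-S", "0"], check=True)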
Developing Applications for F1

1. Start from the FPGA Developer AMI, which includes all needed FPGA design and programming software as well as the AWS FPGA Hardware Development Kit (HDK)
2. Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic
3. After successful simulation, use Vivado or SDAccel to synthesize and place/route the FPGA logic to create an FPGA Design Check Point (DCP), encrypt it, and generate an Amazon FPGA Image (AFI)
4. Launch an F1 instance and load the AFI into the FPGA, using the AFI management tools provided by AWS
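The output of step 3 is registered as an AFI through the EC2 API. A minimal boto3 sketch of that call, with placeholder bucket and key names pointing at the DCP tarball produced by the Vivado/SDAccel build:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Bucket and key names are placeholders for your own S3 locations.
    resp = ec2.create_fpga_image(
        Name="my-accelerator",
        Description="Custom logic built with the AWS FPGA HDK",
        InputStorageLocation={"Bucket": "my-fpga-bucket", "Key": "dcp/my-design.tar"},
        LogsStorageLocation={"Bucket": "my-fpga-bucket", "Key": "logs/"},
    )

    # The global ID (agfi-...) is what fpga-load-local-image consumes on the F1 instance.
    print(resp["FpgaImageId"], resp["FpgaImageGlobalId"])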