Slide 1

Slide 1 text

RPC Considered Harmful
Using the Write-Only Architecture (WOA) [1] to Reduce Communication Latencies in Modern DHPC Systems

Simon Spacey B.Sc., M.Sc., D.E.A., M.B.A., D.CSC., D.I.U., Ph.D.

SaSe™ Business Solutions
© Simon Spacey. 18/02/2013

[1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617–1627.

Slide 2

Slide 2 text

Modern Systems: Heterogeneous

[Figure: QP cluster node, Showerman et al., QP (2009) [3]. Labels: Dual-core CPU 0, Dual-core CPU 1, four GPUs, FPGA, DRAM, SRAM, PCIe bus, PCI-X bus.]

[Figure: Axel cluster node, Tsoi et al., Axel (2010) [2]. Labels: System I/O; Multi-Bank Memory 1G Bytes; Xilinx V5 LX330T customisable logic; System Memory 4G Bytes DDR2; Video Memory 4G Bytes GDDR3; AMD Phenom X4 9650 Quad-Core; nVidia Tesla C1060 240 streaming cores; Infiniband; Gigabit Ethernet; Fast Serial Link; PCIe Bus: 8~16 lanes, 4~8GBps.]

Slide 3

Slide 3 text

Example: Accelerations [1]

Benchmark [4]   Amdahl's Limit   Tightly-Coupled         Loosely-Coupled
                                 (Custom-Instruction)    (Client-Server)
dijkstra        3.343x           1.000x                  1.000x
fft             1.450x           1.000x                  1.000x
ispell          4.548x           1.000x                  1.000x
jpeg            3.264x           1.001x                  1.000x
sha             2.200x           1.000x                  1.000x
susan-e         2.708x           1.031x                  1.030x

Optimal Assignment Accelerations over CPU only Execution for the six MiBench Tasks [4] running on the Tightly and Loosely-Coupled Heterogeneous Axel Architectures [2] with OpenPAT measurements using RPC based Communication Libraries.

[1] S. Spacey et al., Improving communication latency with the WOA (2012).
[4] M. Guthaus et al., MiBench (2001).
[5] S. Spacey, 3S Quick Start Guide, Imperial Technical Manual (2009).
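The "Amdahl's Limit" column can be read against the standard form of Amdahl's law. The sketch below assumes the column is the usual bound 1 / (1 - p), where p is the fraction of CPU-only run time that can be moved to an accelerator; the slides do not spell out the definition, so that reading and the function names are assumptions.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of run time is accelerated s times."""
    return 1.0 / ((1.0 - p) + p / s)


def accelerable_fraction(amdahl_limit):
    """Invert the bound 1 / (1 - p) to recover p from a quoted limit."""
    return 1.0 - 1.0 / amdahl_limit


# Under the stated assumption, dijkstra's 3.343x limit implies that
# roughly 70% of its CPU-only run time is accelerable.
p_dijkstra = accelerable_fraction(3.343)
print(f"dijkstra accelerable fraction: {p_dijkstra:.3f}")
```

Read this way, the RPC columns of about 1.000x mean that almost none of the available headroom is realised once communication costs are included.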

Slide 4

Slide 4 text

The Problem: RPC Central Hub

    // Decrypt SSL
    img = decrypt(packet);
    // Expand to Bitmap
    bit = expand(img);
    // Resize Bitmap
    sized = resize(bit);

[Figure: Central Hub. Transfers between CPU and FPGA: img, bit, bit, sized.]

✘ RPC Based Libraries Are Only Efficient if Caller Uses Result [6-8]

[6] O. Mencer et al., A Stream Compiler (2003).
[7] J. Duato et al., rCUDA: reducing the number of GPU accels. (2010).
[8] Sun Microsystems Inc., RPC: Remote Procedure Call spec. (1988).
[9] J. Bacon et al., Distributed computing with RPC (1987).

Slide 5

Slide 5 text

SaSe™ Business Solutions ¡ Distributed Network Services [10, 11]: ¡  User dependent (not data dependent) control flows ¡  Few hops with no loops ¡  Homogeneous ¡ But in High-Performance Computing need: ¡  Data dependent control flows ¡  Support for high-volume computational loops ¡  Across Heterogeneous components Related Work © Simon Spacey. 18/02/2013 <5> [10] Y. Song et al., RPC Chains (2009). [11] K. Sivaramakrishnan et al., Efficient session type guided interaction (2010).

Slide 6

Slide 6 text

Write-Only Architecture [1]

    // Decrypt SSL
    img = decrypt(packet);
    // Expand to Bitmap
    bit = expand(img);
    // Resize Bitmap
    sized = resize(bit);

[Figure: Central Hub versus Natural CFG. Central Hub transfers between CPU and FPGA: img, bit, bit, sized. Natural CFG transfers: img, sized, with bit no longer crossing the link. Latency Saving: the two bit transfers.]

✔ Write Along Natural Task CFG
✔ Heterogeneous WOA Controllers
✔ Hardware Agnostic Activation Packets

[1] S. Spacey et al., Improving communication latency with the WOA (2012).
[12] D. Patterson, Latency Lags Bandwidth (2004).
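The latency saving in the diagram can be counted with a small model. This is a minimal sketch, not the paper's cost model: it assumes decrypt runs on the CPU while expand and resize run on the FPGA (a placement read off the figure labels), and it counts only the number of times data crosses the CPU-FPGA link. The names PIPELINE, rpc_transfers and woa_transfers are hypothetical.

```python
# Assumed placement, read off the slide's diagram: decrypt on the CPU,
# expand and resize on the FPGA.
PIPELINE = [("decrypt", "CPU"), ("expand", "FPGA"), ("resize", "FPGA")]
HUB = "CPU"  # the RPC caller that issues every call and collects every result


def rpc_transfers(pipeline, hub=HUB):
    """Link crossings when every call is issued by, and returns to, the hub."""
    crossings = 0
    for _task, device in pipeline:
        if device != hub:
            crossings += 2  # argument sent out, result read back
    return crossings


def woa_transfers(pipeline, source=HUB, sink=HUB):
    """Link crossings when each task writes its result on to the next task."""
    crossings = 0
    location = source  # where the current data lives
    for _task, device in pipeline:
        if device != location:
            crossings += 1  # one write along the control-flow edge
            location = device
    if location != sink:
        crossings += 1  # final result written to its consumer
    return crossings


print("RPC hub crossings:", rpc_transfers(PIPELINE))
print("WOA crossings:", woa_transfers(PIPELINE))
```

Under these assumptions the hub pattern crosses the link four times (img, bit, bit, sized) and the write-only pattern twice (img, sized), which matches the two bit transfers marked as the latency saving.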

Slide 7

Slide 7 text

Experimental Configuration

Benchmark [4]   Category [4]
dijkstra        Network        117    295    7    9864826   26750704
fft             Telecom        130    438    18   3670310   11470084
ispell          Office         1033   3242   9    5856315   35981124
jpeg            Consumer       1776   7657   20   4406038   12414004
sha             Security       68     816    6    6609432   82246556
susan-e         Industrial     249    2491   72   4719519   28121940

OpenPAT [5] Measurements for the six MiBench Tasks [4] Considered in the Paper.

[Figure: Axel [2] node diagram, repeated from Slide 2.]
Left: Tightly-coupled single node Axel [2] with RPC Custom Instruction communications over a PCIe Bus
Right: Loosely-coupled two node Axel [2] with RPC Client-Server communications over UDP/IP

Slide 8

Slide 8 text

SaSe™ Business Solutions ¡ General Formal DHPC Optimisation Model [1] Formal Optimal Assignmnets 8 © Simon Spacey. 18/02/2013 <8> (Instantiate Once [13]) [13] S. Spacey et al., Robust software partitioning with multiple instantiation, INFORMS Journal on Computing 24 (3) (2012) 500–515. (Respect Limits [1, 2]) (Support System Calls [1])

Slide 9

Slide 9 text

Results

                Amdahl's   Tightly-Coupled       Loosely-Coupled
Benchmark [4]   Limit      WOA       CI RPC      WOA       CS RPC
dijkstra        3.343x     3.129x    1.000x      1.000x    1.000x
fft             1.450x     1.081x    1.000x      1.058x    1.000x
ispell          4.548x     2.793x    1.000x      1.011x    1.000x
jpeg            3.264x     3.222x    1.001x      2.702x    1.000x
sha             2.200x     2.198x    1.000x      1.774x    1.000x
susan-e         2.708x     2.504x    1.031x      2.170x    1.030x

Optimal Assignment Accelerations over CPU only Execution for the six MiBench Tasks [4] running on the Tightly and Loosely-Coupled Heterogeneous Axel Architectures [2] using WOA and RPC based Alternatives.
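The headline figure on the Summary slide can be checked directly against this table. The sketch below copies the WOA and RPC accelerations from the table and divides one by the other per benchmark; the variable and function names are illustrative only.

```python
# (WOA, RPC) accelerations over CPU-only execution, copied from the table.
TIGHT = {
    "dijkstra": (3.129, 1.000),
    "fft": (1.081, 1.000),
    "ispell": (2.793, 1.000),
    "jpeg": (3.222, 1.001),
    "sha": (2.198, 1.000),
    "susan-e": (2.504, 1.031),
}
LOOSE = {
    "dijkstra": (1.000, 1.000),
    "fft": (1.058, 1.000),
    "ispell": (1.011, 1.000),
    "jpeg": (2.702, 1.000),
    "sha": (1.774, 1.000),
    "susan-e": (2.170, 1.030),
}


def woa_over_rpc(table):
    """Ratio of WOA acceleration to RPC acceleration for each benchmark."""
    return {name: woa / rpc for name, (woa, rpc) in table.items()}


def best(table):
    """Benchmark with the largest WOA-to-RPC ratio, and that ratio."""
    ratios = woa_over_rpc(table)
    name = max(ratios, key=ratios.get)
    return name, ratios[name]


for label, table in (("tightly-coupled", TIGHT), ("loosely-coupled", LOOSE)):
    name, ratio = best(table)
    print(f"{label}: best WOA/RPC ratio is {ratio:.2f}x on {name}")
```

The largest ratio is jpeg on the tightly-coupled system, 3.222 / 1.001, which rounds to 3.22; the loosely-coupled maximum is also jpeg, at 2.702.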

Slide 10

Slide 10 text

SaSe™ Business Solutions ¡ The WOA Differs from RPC by: ¡  Writing Service Responses Instead of Requiring Result Reads ¡ Doing this: ¡  Delivered up to 3.22 times better Performance than RPC here ¡ Main Future Work: ¡  Implement WOA Libraries for Different Components Summary and Future Work [1] © Simon Spacey. 18/02/2013 <10> [1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617-1627.

Slide 11

Slide 11 text

References

[1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617–1627.
[2] K. Tsoi, W. Luk, Axel: a heterogeneous cluster with FPGAs and GPUs, in: Proc. of the 18th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010, pp. 115–124.
[3] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, W. Hwu, QP: a heterogeneous multi-accelerator cluster, in: Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, 2009.
[4] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown, MiBench: a free, commercially representative embedded benchmark suite, in: Proc. of the 2001 IEEE International Workshop on Workload Characterization, 2001, pp. 3–14.
[5] S. Spacey, 3S Quick Start Guide, SaSe Tech. Manual, 2009.
[6] O. Mencer, D. Pearce, L. Howes, W. Luk, Design space exploration with A Stream Compiler, in: Proc. of the 2nd IEEE International Conference on Field-Programmable Technology, 2003, pp. 270–277.
[7] J. Duato, A. Peña, F. Silla, R. Mayo, E. Quintana-Orti, rCUDA: reducing the number of GPU-based accelerators in high performance clusters, in: Proc. of the IEEE International Conference on High Performance Computing and Simulation, 2010, pp. 224–231.
[8] Sun Microsystems Inc., RPC: Remote Procedure Call protocol specification, 1988.
[9] J. Bacon, K.G. Hamilton, Distributed computing with RPC: the Cambridge approach, Tech. Rep., Cambridge University, 1987.
[10] Y. Song, M. Aguilera, R. Kotla, D. Malkhi, RPC Chains: efficient client-server communication in geodistributed systems, in: Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, 2009, pp. 277–290.

Slide 12

Slide 12 text

References

[11] K. Sivaramakrishnan, K. Nagaraj, L. Ziarek, P. Eugster, Efficient session type guided distributed interaction, in: Coordination Models and Languages, Vol. 6116 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010, pp. 152–167.
[12] D. Patterson, Latency Lags Bandwidth, Communications of the ACM, 47 (10) (2004) 71–75.
[13] S. Spacey, W. Wiesemann, D. Kuhn, W. Luk, Robust software partitioning with multiple instantiation, INFORMS Journal on Computing 24 (3) (2012) 500–515. http://dx.doi.org/10.1287/ijoc.1110.0467.
[14] S. Spacey, W. Luk, D. Kuhn, P.H.J. Kelly, Parallel Partitioning for Distributed Systems using Sequential Assignment, Journal of Parallel and Distributed Computing, 73 (2) (2013) 207–219.

Slide 13

Slide 13 text

RPC Considered Harmful

Dr Simon Spacey
[email protected]
http://cs.waikato.ac.nz/~sspacey