RPC Considered Harmful

In the 1980s, architectures consisting of two computational components communicating through a loosely-coupled network were considered advanced. For these architectures the Remote Procedure Call (RPC) communication paradigm is efficient, and many authors concerned with the complexities of computational partitioning on distributed architectures quickly adopted RPC as their standard program communication paradigm. This led to RPC's firm embedding in the Client-Server, Custom Instruction and Shared Memory implementation libraries that we take for granted today.

Unfortunately, RPC is not efficient for modern Distributed and High-Performance Computing (DHPC) architectures, which invariably include more than two computational components or tightly-coupled buses. In this presentation I explain the critical problems that make RPC inefficient for modern DHPC systems, introduce a possible solution to these problems called the Write-Only Architecture (WOA), and provide results and formal bounds showing the performance improvements deliverable for both homogeneous and heterogeneous DHPC systems over current RPC-based implementations.

Multicore World 2013

February 19, 2013

Transcript

  1. RPC Considered Harmful: Using the Write-Only Architecture (WOA) [1] to Reduce Communication Latencies in Modern DHPC Systems

    Simon Spacey B.Sc., M.Sc., D.E.A., M.B.A., D.CSC., D.I.U., Ph.D.
    SaSe™ Business Solutions © Simon Spacey. 18/02/2013

    [1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617–1627.
  2. Modern Systems: Heterogeneous

    [Figure: QP node [3]. Labels: DRAM, GPU (×4), Dual-core CPU 0, Dual-core CPU 1, FPGA with DRAM and SRAM, PCIe bus (×2), PCI-X bus.]

    [Figure: Axel node [2]. Labels: AMD Phenom X4 9650 Quad-Core; System Memory 4G Bytes DDR2; nVidia Tesla C1060, 240 streaming cores; Video Memory 4G Bytes GDDR3; Xilinx V5 LX330T customisable logic; Multi-Bank Memory 1G Bytes; System I/O; Infiniband; Gigabit Ethernet; Fast Serial Link; PCIe Bus: 8~16 lanes, 4~8GBps.]

    [2] Tsoi et al., Axel (2010).
    [3] Showerman et al., QP (2009).
  3. Example: Accelerations [1]

                                     Tightly-Coupled      Loosely-Coupled
    Benchmark [4]   Amdahl's Limit   Custom-Instruction   Client-Server
    dijkstra        3.343x           1.000x               1.000x
    fft             1.450x           1.000x               1.000x
    ispell          4.548x           1.000x               1.000x
    jpeg            3.264x           1.001x               1.000x
    sha             2.200x           1.000x               1.000x
    susan-e         2.708x           1.031x               1.030x

    Optimal Assignment Accelerations over CPU-only Execution for the six MiBench Tasks [4] running on the Tightly and Loosely-Coupled Heterogeneous Axel Architectures [2] with OpenPAT measurements using RPC-based Communication Libraries.

    [1] S. Spacey et al., Improving communication latency with the WOA (2012).
    [4] M. Guthaus et al., MiBench (2001).
    [5] S. Spacey, 3S Quick Start Guide, Imperial Technical Manual (2009).
  4. The Problem: RPC Central Hub

    [Figure: Central Hub. Labels: FPGA, CPU, img, bit, bit, sized.]

        // Decrypt SSL
        img = decrypt(packet);
        // Expand to Bitmap
        bit = expand(img);
        // Resize Bitmap
        sized = resize(bit);

    ✘ RPC Based Libraries Are Only Efficient if Caller Uses Result [6-8]

    [6] O. Mencer et al., A Stream Compiler (2003).
    [7] J. Duato et al., rCUDA: reducing the number of GPU accels. (2010).
    [8] Sun Microsystems Inc., RPC: Remote Procedure Call spec. (1988).
    [9] J. Bacon et al., Distributed computing with RPC (1987).
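The central-hub problem on this slide can be illustrated with a small counting sketch. It is a toy model, not the paper's formal model, and it rests on one reading of the figure labels (img, bit, bit, sized between CPU and FPGA): the assumption that `decrypt` runs on the CPU while `expand` and `resize` run on the FPGA. The names `PIPELINE`, `HUB` and `rpc_transfers` are hypothetical.

```python
# Toy model (an assumption, not the paper's formal model): list the data
# transfers a CPU-resident caller makes when every remote task is an RPC.

HUB = "CPU"

# Hypothetical placement inferred from the figure labels: decrypt runs on the
# CPU, expand and resize run on the FPGA.
# Each entry is (task, host component, input value, output value).
PIPELINE = [
    ("decrypt", "CPU", "packet", "img"),
    ("expand", "FPGA", "img", "bit"),
    ("resize", "FPGA", "bit", "sized"),
]


def rpc_transfers(pipeline, hub=HUB):
    """Return the (source, destination, value) transfers for hub-driven RPC."""
    transfers = []
    for _task, host, arg, result in pipeline:
        if host != hub:
            transfers.append((hub, host, arg))     # caller ships the argument
            transfers.append((host, hub, result))  # callee returns the result
    return transfers


if __name__ == "__main__":
    for src, dst, value in rpc_transfers(PIPELINE):
        print(f"{src} -> {dst}: {value}")
```

Under these assumptions the hub makes four transfers (img, bit, bit, sized), even though `bit` is produced and consumed on the same component and the caller never uses it.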
  5. Related Work

    • Distributed Network Services [10, 11]:
        • User dependent (not data dependent) control flows
        • Few hops with no loops
        • Homogeneous
    • But in High-Performance Computing need:
        • Data dependent control flows
        • Support for high-volume computational loops
        • Across Heterogeneous components

    [10] Y. Song et al., RPC Chains (2009).
    [11] K. Sivaramakrishnan et al., Efficient session type guided interaction (2010).
  6. Write-Only Architecture [1]

    [Figure: Natural CFG. Labels: FPGA, CPU, img, bit, sized, Latency Saving.]

        // Decrypt SSL
        img = decrypt(packet);
        // Expand to Bitmap
        bit = expand(img);
        // Resize Bitmap
        sized = resize(bit);

    ✔ Write Along Natural Task CFG
    ✔ Heterogeneous WOA Controllers
    ✔ Hardware Agnostic Activation Packets

    [1] S. Spacey et al., Improving communication latency with the WOA (2012).
    [12] D. Patterson, Latency Lags Bandwidth (2004).
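Writing along the natural task CFG can be sketched as a toy model in which each task writes its result straight to the component that hosts its consumer. This is an illustration under assumptions, not the paper's controller design: the placement (`decrypt` on the CPU, `expand` and `resize` on the FPGA) is one reading of the figure labels, and `PIPELINE` and `woa_transfers` are hypothetical names.

```python
# Toy model (an assumption, not the paper's controller design): each task
# writes its result straight to the component hosting the next consumer.

# Hypothetical placement: decrypt on the CPU, expand and resize on the FPGA.
# Each entry is (task, host component, output value).
PIPELINE = [
    ("decrypt", "CPU", "img"),
    ("expand", "FPGA", "bit"),
    ("resize", "FPGA", "sized"),
]


def woa_transfers(pipeline, final_consumer="CPU"):
    """Return the (source, destination, value) writes along the task chain."""
    transfers = []
    for index, (_task, host, result) in enumerate(pipeline):
        # The consumer is the next task's host, or the final consumer.
        if index + 1 < len(pipeline):
            consumer = pipeline[index + 1][1]
        else:
            consumer = final_consumer
        if consumer != host:  # a write is only needed across components
            transfers.append((host, consumer, result))
    return transfers


if __name__ == "__main__":
    for src, dst, value in woa_transfers(PIPELINE):
        print(f"{src} -> {dst}: {value}")
```

Under these assumptions only img and sized cross the bus. The two transfers of `bit` that a hub-driven RPC schedule would need for the same placement are avoided, which is consistent with the "Latency Saving" label in the figure.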
  7. Experimental Configuration

    Left: Tightly-coupled single node Axel [2] with RPC Custom Instruction communications over a PCIe Bus
    Right: Loosely-coupled two node Axel [2] with RPC Client-Server communications over UDP/IP

    [Figure: Axel node [2], as on slide 2.]

    Benchmark [4]   Category [4]
    dijkstra        Network         117    295    7   9864826   26750704
    fft             Telecom         130    438   18   3670310   11470084
    ispell          Office         1033   3242    9   5856315   35981124
    jpeg            Consumer       1776   7657   20   4406038   12414004
    sha             Security         68    816    6   6609432   82246556
    susan-e         Industrial      249   2491   72   4719519   28121940

    OpenPAT [5] Measurements for the six MiBench Tasks [4] Considered in the Paper.
  8. Formal Optimal Assignments

    • General Formal DHPC Optimisation Model [1]

    [Equation annotations: (Instantiate Once [13]), (Respect Limits [1, 2]), (Support System Calls [1])]

    [13] S. Spacey et al., Robust software partitioning with multiple instantiation, INFORMS Journal on Computing 24 (3) (2012) 500–515.
  9. Results

                                     Tightly-Coupled     Loosely-Coupled
    Benchmark [4]   Amdahl's Limit   WOA      CI RPC     WOA      CS RPC
    dijkstra        3.343x           3.129x   1.000x     1.000x   1.000x
    fft             1.450x           1.081x   1.000x     1.058x   1.000x
    ispell          4.548x           2.793x   1.000x     1.011x   1.000x
    jpeg            3.264x           3.222x   1.001x     2.702x   1.000x
    sha             2.200x           2.198x   1.000x     1.774x   1.000x
    susan-e         2.708x           2.504x   1.031x     2.170x   1.030x

    Optimal Assignment Accelerations over CPU-only Execution for the six MiBench Tasks [4] running on the Tightly and Loosely-Coupled Heterogeneous Axel Architectures [2] using WOA and RPC-based Alternatives.
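The relative gain of WOA over RPC can be read off this table by dividing each WOA acceleration by the RPC acceleration in the same row and architecture. The short sketch below does that arithmetic. The values are copied from the table, while the dictionary layout and the names `RESULTS` and `woa_over_rpc` are illustrative assumptions.

```python
# Accelerations over CPU-only execution, copied from the results table.
# Each pair is (WOA, RPC); "tight" is the tightly-coupled Custom-Instruction
# setup and "loose" is the loosely-coupled Client-Server setup.
RESULTS = {
    "dijkstra": {"tight": (3.129, 1.000), "loose": (1.000, 1.000)},
    "fft":      {"tight": (1.081, 1.000), "loose": (1.058, 1.000)},
    "ispell":   {"tight": (2.793, 1.000), "loose": (1.011, 1.000)},
    "jpeg":     {"tight": (3.222, 1.001), "loose": (2.702, 1.000)},
    "sha":      {"tight": (2.198, 1.000), "loose": (1.774, 1.000)},
    "susan-e":  {"tight": (2.504, 1.031), "loose": (2.170, 1.030)},
}


def woa_over_rpc(results):
    """Return {(benchmark, architecture): WOA acceleration / RPC acceleration}."""
    return {
        (benchmark, arch): woa / rpc
        for benchmark, archs in results.items()
        for arch, (woa, rpc) in archs.items()
    }


if __name__ == "__main__":
    ratios = woa_over_rpc(RESULTS)
    best = max(ratios, key=ratios.get)
    print(best, round(ratios[best], 2))
```

The largest ratio in this table is jpeg on the tightly-coupled architecture, 3.222 / 1.001 ≈ 3.22, which appears to be the source of the "up to 3.22 times" figure quoted in the summary.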
  10. Summary and Future Work [1]

    • The WOA Differs from RPC by:
        • Writing Service Responses Instead of Requiring Result Reads
    • Doing this:
        • Delivered up to 3.22 times better Performance than RPC here
    • Main Future Work:
        • Implement WOA Libraries for Different Components

    [1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617–1627.
  11. References

    [1] S. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn, Improving communication latency with the Write-Only Architecture, Journal of Parallel and Distributed Computing, 72 (12) (2012) 1617–1627.
    [2] K. Tsoi, W. Luk, Axel: a heterogeneous cluster with FPGAs and GPUs, in: Proc. of the 18th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010, pp. 115–124.
    [3] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, W. Hwu, QP: a heterogeneous multi-accelerator cluster, in: Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, 2009.
    [4] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown, MiBench: a free, commercially representative embedded benchmark suite, in: Proc. of the 2001 IEEE International Workshop on Workload Characterization, 2001, pp. 3–14.
    [5] S. Spacey, 3S Quick Start Guide, SaSe Tech. Manual, 2009.
    [6] O. Mencer, D. Pearce, L. Howes, W. Luk, Design space exploration with A Stream Compiler, in: Proc. of the 2nd IEEE International Conference on Field-Programmable Technology, 2003, pp. 270–277.
    [7] J. Duato, A. Peña, F. Silla, R. Mayo, E. Quintana-Ortí, rCUDA: reducing the number of GPU-based accelerators in high performance clusters, in: Proc. of the IEEE International Conference on High Performance Computing and Simulation, 2010, pp. 224–231.
    [8] Sun Microsystems Inc., RPC: Remote Procedure Call protocol specification, 1988.
    [9] J. Bacon, K.G. Hamilton, Distributed computing with RPC: the Cambridge approach, Tech. Rep., Cambridge University, 1987.
    [10] Y. Song, M. Aguilera, R. Kotla, D. Malkhi, RPC Chains: efficient client-server communication in geodistributed systems, in: Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, 2009, pp. 277–290.
  12. References

    [11] K. Sivaramakrishnan, K. Nagaraj, L. Ziarek, P. Eugster, Efficient session type guided distributed interaction, in: Coordination Models and Languages, Vol. 6116 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010, pp. 152–167.
    [12] D. Patterson, Latency Lags Bandwidth, Communications of the ACM, 47 (10) (2004) 71–75.
    [13] S. Spacey, W. Wiesemann, D. Kuhn, W. Luk, Robust software partitioning with multiple instantiation, INFORMS Journal on Computing 24 (3) (2012) 500–515. http://dx.doi.org/10.1287/ijoc.1110.0467.
    [14] S. Spacey, W. Luk, D. Kuhn, P.H.J. Kelly, Parallel Partitioning for Distributed Systems using Sequential Assignment, Journal of Parallel and Distributed Computing, 73 (2) (2013) 207–219.
  13. RPC Considered Harmful

    Dr Simon Spacey
    sspacey@waikato.ac.nz
    http://cs.waikato.ac.nz/~sspacey