
High-Level Synthesis of Memory Systems for Decoupled Data Orchestration

Masayuki Usui
November 08, 2023


These slides correspond to the following paper:
Usui, M., Takamaeda-Yamazaki, S. (2023). High-Level Synthesis of Memory Systems for Decoupled Data Orchestration. In: Palumbo, F., Keramidas, G., Voros, N., Diniz, P.C. (eds) Applied Reconfigurable Computing. Architectures, Tools, and Applications. ARC 2023. Lecture Notes in Computer Science, vol 14251. Springer, Cham. https://doi.org/10.1007/978-3-031-42921-7_1



Transcript

  1. High-Level Synthesis of Memory Systems for Decoupled Data Orchestration Masayuki

    Usui, Shinya Takamaeda-Yamazaki The University of Tokyo
  2. Why our HLS method improves performance

    (Figure, left: hardware generated by naive HLS, where a single FSM controls the off-chip memory, on-chip memory, and ALU or PEs. Right: hardware generated by the proposed HLS method, where separate FSMs provide control.)
  3. Why our HLS method improves performance

    (Figure: hardware generated by naive HLS, with a single FSM controlling the off-chip memory, on-chip memory, and ALU or PEs.)
    Communication = the data movement between on- and off-chip memories
    Computation = the data movement between on-chip memory and compute units
    Computation and communication cannot be done in parallel because a single FSM controls both comp. and comm.
  4. Why our HLS method improves performance

    (Figure: hardware generated by the proposed HLS method, with separate FSMs for communication and computation.)
    Communication = the data movement between on- and off-chip memories
    Computation = the data movement between on-chip memory and compute units
    Computation and communication can be done in parallel because separate FSMs control comp. and comm.
  5. How our HLS method works

    (Figure: the original source code is split into a communication part and a computation part, each synthesized into its own FSM driving the off-chip memory, on-chip memory, and ALU or PEs.)
  6. How our HLS method works

    (Figure: the same split-and-synthesize flow as the previous slide.)
    Synchronization between these FSMs is necessary to guarantee correct operation! (explained later)
  7. Accelerators

    (Figure: an accelerator with an array of PEs, on-chip memory, and off-chip memory.)
    • Numerous processing elements (PEs) operate in parallel, achieving high computational performance.
    • Memory systems are also important to supply data to these PEs.
    • We focus on the memory systems.
  8. High-Level Synthesis

    High-level descriptions (Python, C++, etc.) → high-level synthesis → RTL descriptions (Verilog HDL or VHDL)
    • High-level synthesis (HLS) converts high-level descriptions into register-transfer level (RTL) descriptions
    • However, naive HLS tools cannot generate optimal memory systems
    • We propose HLS techniques to generate high-performance memory systems
  9. Memory Systems in Accelerators

    (Figure, left: implicit data orchestration (often in a CPU): a core issues local requests/responses to a cache (tag + data), which issues global requests/responses to DRAM. Right: explicit data orchestration (often in an accelerator): a datapath exchanges local requests/responses with a buffer, which exchanges global requests/responses with DRAM.)
  10. Memory Systems in Accelerators

    (Figure: implicit data orchestration (often in a CPU), with a cache between the core and DRAM.)
    Easy to program: the programmer does not have to control the data movement between on- and off-chip memories.
    Large overhead: cache mechanisms often have large overhead.
  11. Memory Systems in Accelerators

    (Figure: explicit data orchestration (often in an accelerator), with a buffer between the datapath and DRAM.)
    Hard to program: the programmer (or architect) has to explicitly control the data movement between on- and off-chip memories.
    Domain knowledge: in domain-specific accelerators, explicit control over data movement allows leveraging domain knowledge to improve performance.
  12. Memory Systems in Accelerators

    (Figure: the same explicit data orchestration as the previous slide.)
    Hard to program: the programmer (or architect) has to explicitly control the data movement between on- and off-chip memories.
    Domain knowledge: in domain-specific accelerators, explicit control over data movement allows leveraging domain knowledge to improve performance.
    We focus on explicit data orchestration.
  13. Overlapping Comp. and Comm.

    (Figure: two PE arrays with on-chip and off-chip memory, one without and one with overlap.)
    Without decoupling, computation and communication cannot be done in parallel: the same module sequentially executes comp. and comm.
    With decoupling, computation and communication can be done in parallel: different modules simultaneously compute and communicate.
    This slide is meaningless because Speaker Deck cannot deal with animation.
  14. Memory Systems in Accelerators

    (Figure, left: explicit coupled data orchestration, with a unified FSM for comp. & comm. between on-chip and off-chip memory. Right: explicit decoupled data orchestration (EDDO), with a datapath for comp. and a DMA engine for comm.)
    This work aims to convert from the left to the right automatically.
  15. Design Challenges

    • EDDO complicates accelerator design because it requires the architect to decouple the hardware module for communication from the one for computation
    • Synchronization between decoupled modules is particularly cumbersome
  16. Proposed Method

    We propose a novel method that automatically decouples data orchestration mechanisms in high-level synthesis to facilitate accelerator design.
  17. Proposed Method (Overview)

    Input (computation + communication on one FSM):

    a_addr = a_offset
    c_addr = c_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        b_addr = b_offset
        for j in range(matrix_size):
            ram_b.dma_read(axi_b, 0, b_addr, matrix_size)
            inner_product(ram_a, ram_b, ram_c, j, matrix_size)
            b_addr += matrix_size * (datawidth // 8)
        ram_c.dma_write(axi_c, 0, c_addr, matrix_size)
        a_addr += matrix_size * (datawidth // 8)
        c_addr += matrix_size * (datawidth // 8)

    Output of our method (four parts, each on its own FSM):

    # Data movement for Buffer A
    a_addr = a_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        ram_a.push()
        a_addr += matrix_size * (datawidth // 8)
        ram_a.wait_not_full()

    # Data movement for Buffer B
    for i in range(matrix_size):
        b_addr = b_offset
        for j in range(matrix_size):
            ram_b.dma_read(axi_b, 0, b_addr, matrix_size)
            ram_b.push()
            b_addr += matrix_size * (datawidth // 8)
            ram_b.wait_not_full()

    # Data movement for Buffer C
    c_addr = c_offset
    for i in range(matrix_size):
        ram_c.wait_not_empty()
        ram_c.dma_write(axi_c, 0, c_addr, matrix_size)
        c_addr += matrix_size * (datawidth // 8)
        ram_c.pop()

    # Computation
    for i in range(matrix_size):
        ram_a.wait_not_empty()
        for j in range(matrix_size):
            ram_b.wait_not_empty()
            inner_product(ram_a, ram_b, ram_c, j, matrix_size)
            ram_b.pop()
        ram_c.push()
        ram_a.pop()
        ram_c.wait_not_full()

    (Figure, left: one FSM driving the ALU and buffers A/B/C from DRAM. Right: our method, with one computation FSM and three communication FSMs.)
  18. Proposed Method (Overview)

    (Same code and figure as the previous slide.)

  19. Proposed Method (Overview)

    (Same code and figure as the previous slide.)
    Input source code sequentially performs computation and communication.
    Output source code simultaneously performs computation and communication.

  20. Proposed Method (Overview)

    (Same code and figure as the previous slide.)
  21. Example Code: Matrix Multiplication

    Matrix multiplication: AB = C
    RAM A: buffer for matrix A / RAM B: buffer for matrix B / RAM C: buffer for matrix C

    a_addr = a_offset
    c_addr = c_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        b_addr = b_offset
        for j in range(matrix_size):
            ram_b.dma_read(axi_b, 0, b_addr, matrix_size)
            inner_product(ram_a, ram_b, ram_c, j, matrix_size)
            b_addr += matrix_size * (datawidth // 8)
        ram_c.dma_write(axi_c, 0, c_addr, matrix_size)
        a_addr += matrix_size * (datawidth // 8)
        c_addr += matrix_size * (datawidth // 8)
  22. Example Code: Matrix Multiplication

    (Same code as the previous slide.)
    Communication: the DMA transfers regarding each RAM
    Computation: calculating the inner product of two vectors
  23. Decomposing the Example Code into 4 Parts

    Extract the code segment necessary for the DMA transfers of RAM A.

    # Code for comm. of RAM A
    a_addr = a_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        a_addr += matrix_size * (datawidth // 8)

    # Code for comm. of RAM B
    for i in range(matrix_size):
        b_addr = b_offset
        for j in range(matrix_size):
            ram_b.dma_read(axi_b, 0, b_addr, matrix_size)
            b_addr += matrix_size * (datawidth // 8)

    # Code for comm. of RAM C
    c_addr = c_offset
    for i in range(matrix_size):
        ram_c.dma_write(axi_c, 0, c_addr, matrix_size)
        c_addr += matrix_size * (datawidth // 8)

    # Code for computation
    for i in range(matrix_size):
        for j in range(matrix_size):
            inner_product(ram_a, ram_b, ram_c, j, matrix_size)
  24. Decomposing the Example Code into 4 Parts

    (Same decomposition as the previous slide.) Extract the code segment necessary for the DMA transfers of RAM B.

  25. Decomposing the Example Code into 4 Parts

    (Same decomposition as the previous slide.) Extract the code segment necessary for the DMA transfers of RAM C.

  26. Decomposing the Example Code into 4 Parts

    (Same decomposition as the previous slide.) Extract the rest of the code as the code segment for computation.
  27. Illustrating the Extraction Process for RAM A

    (The original matrix-multiplication code shown again.)
    Clarification:
    • We extract the code segment necessary for the DMA transfers of RAM A
    • This code segment is the minimum code that reproduces the behavior of the original DMA
  28. Illustrating the Extraction Process for RAM A

    (The original code again.) First of all, we focus on the function call for the DMA transfers of RAM A.

  29. Illustrating the Extraction Process for RAM A

    (The original code again.) Because the value of a_addr is necessary to call the DMA API regarding RAM A, we extract the statements that modify a_addr.
  30. Illustrating the Extraction Process for RAM A

    a_addr = a_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        a_addr += matrix_size * (datawidth // 8)

    Eventually, the above code is obtained.
  31. Synthesizing Decomposed Parts

    (Figure: the original code is synthesized to one FSM with an ALU and buffers A/B/C; the four decomposed parts are each synthesized to their own FSM.)

  32. Synthesizing Decomposed Parts

    (Same figure as the previous slide.) Synchronization is necessary!
  33.-37. Data Structure and APIs for Synchronization

    (Figures: a producer FSM and a consumer FSM sharing a double buffer.)
    Producer-side APIs:
    • write(): write data
    • push(): switch buffers
    • wait_not_full(): wait for a buffer to be unoccupied
    Consumer-side APIs:
    • read(): read data
    • pop(): switch buffers
    • wait_not_empty(): wait for a buffer to be filled
    Although double buffering is often used, we propose an appropriate structure and operations to implement double buffering in the automatic decoupling of data orchestration mechanisms.
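To make the semantics of these APIs concrete, here is a behavioral model of the double buffer as an editorial sketch in Python threads. The class and field names are ours and the actual primitive is synthesized hardware, so treat this as an illustration of the intended protocol only.

```python
import threading

class DoubleBuffer:
    """Behavioral model of a double-buffer synchronization primitive
    (a sketch, not the generated RTL). The producer fills the back
    buffer and push()es it; the consumer reads the front buffer and
    pop()s it when done."""

    def __init__(self, size):
        self.bufs = [[0] * size, [0] * size]
        self.front = 0      # buffer currently visible to the consumer
        self.occupied = 0   # number of buffers holding valid data (0..2)
        self.cond = threading.Condition()

    # --- producer side ---
    def wait_not_full(self):
        with self.cond:
            self.cond.wait_for(lambda: self.occupied < 2)

    def write(self, addr, value):
        with self.cond:
            back = (self.front + self.occupied) % 2  # the free buffer
        self.bufs[back][addr] = value

    def push(self):
        with self.cond:
            self.occupied += 1
            self.cond.notify_all()

    # --- consumer side ---
    def wait_not_empty(self):
        with self.cond:
            self.cond.wait_for(lambda: self.occupied > 0)

    def read(self, addr):
        return self.bufs[self.front][addr]

    def pop(self):
        with self.cond:
            self.front = (self.front + 1) % 2
            self.occupied -= 1
            self.cond.notify_all()
```

With two buffers, the producer can fill one while the consumer drains the other, which is exactly what lets the communication FSM run ahead of the computation FSM.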
  38. Adding APIs for RAM A

    a_addr = a_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        ram_a.push()           # added for synchronization
        a_addr += matrix_size * (datawidth // 8)
        ram_a.wait_not_full()  # added for synchronization
  39. Final Results of Example Code

    (The decoupled code with synchronization API calls, as shown in the Proposed Method overview: one computation part and three communication parts, one per RAM, each on its own FSM.)
  40. Evaluation Methodology

    • We implemented the proposed method on top of the HLS tool Veriloggen (https://github.com/PyHDI/veriloggen)
    • FPGA board: Ultra96-V2 (board photo cited from Avnet)
    • Workloads:
      • General matrix multiplication (GeMM)
      • Sparse-matrix dense-matrix multiplication (SpMM)
    • Frequency: 300 MHz
  41. Evaluation Results (Execution Time)

    • We measured execution time on an FPGA
    • The proposed method reduced execution time by almost half
  42. Evaluation Results (Execution Time)

    • The execution time is the larger of the latencies for computation and communication.
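A back-of-the-envelope model explains the "almost half" result: sequential execution pays comp + comm per tile, while the decoupled design pays max(comp, comm) in steady state. The numbers below are illustrative, not the measured ones.

```python
# Hypothetical per-tile latencies (cycles); not measured values.
comp_cycles = 1000
comm_cycles = 900
tiles = 64

# Naive HLS: one FSM runs communication, then computation.
naive = tiles * (comp_cycles + comm_cycles)

# Decoupled: comp and comm overlap; steady-state cost is the larger one
# (pipeline fill/drain ignored for simplicity).
decoupled = tiles * max(comp_cycles, comm_cycles)

speedup = naive / decoupled  # close to 2x when comp and comm are balanced
```

The closer the two latencies are to each other, the closer the speedup gets to 2x, which matches the roughly halved execution time reported above.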
  43. Evaluation Results (Resource Utilization)

    • We measured memory resource utilization (the ratio of used BRAMs) using the EDA tool
    • The proposed method doubled utilization because it is based on double buffering
  44. Conclusion

    Summary
    • Decoupling data orchestration mechanisms enables overlapping computation and communication, improving performance
    • The proposed method automates the decoupling to facilitate design
    Future work
    • The proposed method will be evaluated using more complicated accelerators, including neural network accelerators
  45. Difference between Explanation and Implementation

    • In the actual implementation, API calls are inserted before decoupling
    • This is because information about the interaction between modules is lost after decoupling, so API calls could no longer be inserted automatically
    (Figure: the explanation presents decoupling first and then API call insertion; the implementation performs API call insertion first and then decoupling.)
  46. Actual Implementation

    First, API calls are inserted into the original code:

    a_addr = a_offset
    c_addr = c_offset
    for i in range(matrix_size):
        ram_a.dma_read(axi_a, 0, a_addr, matrix_size)
        b_addr = b_offset
        ram_a.push()
        ram_a.wait_not_empty()
        for j in range(matrix_size):
            ram_b.dma_read(axi_b, 0, b_addr, matrix_size)
            ram_b.push()
            ram_b.wait_not_empty()
            inner_product(ram_a, ram_b, ram_c, j, matrix_size)
            b_addr += matrix_size * (datawidth // 8)
            ram_b.pop()
            ram_b.wait_not_full()
        ram_c.push()
        ram_c.wait_not_empty()
        ram_c.dma_write(axi_c, 0, c_addr, matrix_size)
        a_addr += matrix_size * (datawidth // 8)
        c_addr += matrix_size * (datawidth // 8)
        ram_a.pop()
        ram_a.wait_not_full()
        ram_c.pop()
        ram_c.wait_not_full()

    Then the code is decomposed into the computation part and the communication parts (one per RAM), yielding the decoupled code shown earlier.
  47. Automatic API Call Insertion

    • Split source code into producer and consumer parts
    • Insert appropriate API calls at the boundary of these parts
      • producer to consumer: insert push() and wait_not_empty()
      • consumer to producer: insert pop() and wait_not_full()

    Before:
    acc = 0
    for i in range(block_num):
        ram.dma_read(axi, 0, i * block_size, block_size)
        for j in range(block_size):
            acc += ram.read(j)

    After:
    acc = 0
    for i in range(block_num):
        ram.dma_read(axi, 0, i * block_size, block_size)
        ram.push()            # inserted: producer-to-consumer boundary
        ram.wait_not_empty()
        for j in range(block_size):
            acc += ram.read(j)
        ram.pop()             # inserted: consumer-to-producer boundary
        ram.wait_not_full()
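The insertion rule above can be sketched as a small source-to-source pass. The toy version below is ours: it assumes the producer/consumer roles of each statement are given explicitly, whereas the real tool works on the AST and infers the roles itself.

```python
def insert_api_calls(loop_body, ram):
    """Toy version of the API-insertion rule on a loop body given as
    (role, line) pairs, with role being 'producer' or 'consumer'."""
    out, prev = [], None
    for role, line in loop_body:
        # producer-to-consumer boundary: publish the filled buffer,
        # then wait until a filled buffer is visible
        if prev == 'producer' and role == 'consumer':
            out += [ram + '.push()', ram + '.wait_not_empty()']
        # consumer-to-producer boundary within the body
        elif prev == 'consumer' and role == 'producer':
            out += [ram + '.pop()', ram + '.wait_not_full()']
        out.append(line)
        prev = role
    # loop-back edge (last statement back to the first): release the
    # drained buffer, then wait until a free buffer is available
    if prev == 'consumer' and loop_body and loop_body[0][0] == 'producer':
        out += [ram + '.pop()', ram + '.wait_not_full()']
    return out
```

Applied to the slide's example body (one producer DMA read followed by one consumer read loop), this reproduces the four inserted calls.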
  48. Implementation of the Extraction Process

    row_begin = ram_a_row.read(0)
    a_col_addr = a_col_offset
    a_val_addr = a_val_offset
    for i in range(1, a_height + 1):
        row_end = ram_a_row.read(i)
        row_range = row_end - row_begin
        row_begin = row_end
        if row_range > 0:
            ram_a_col.dma_read(axi_a_col, 0, a_col_addr, row_range)
            ram_a_val.dma_read(axi_a_val, 0, a_val_addr, row_range)
            a_col_addr += row_range << log_word
            a_val_addr += row_range << log_word

    (Highlighted variables: a_col_addr, a_val_addr, row_range, row_begin, row_end)

  49. Implementation of the Extraction Process

    (Same code and highlights as the previous slide.)
  50. Fork/Join for Partial Decoupling

    (Figure: the main FSM forks a communication FSM before a common region, both run in parallel, and the main FSM joins by waiting until the forked FSM is idle again.)
    • In some applications, including the SpMM used in the experiment, data orchestration cannot be completely decoupled, mainly due to dependencies
    • To tackle these cases, we partially decouple data orchestration based on fork/join
  51. Extra Example Code

    def main():
        acc = 0
        for i in range(size):
            a_addr = a_offset + i*size*word_size
            ram_a.dma_read(axi_a, 0, a_addr, size)
            for j in range(size):
                b_addr = (b_offset + ram_a.read(j)*size*word_size)
                ram_b.dma_read(axi_b, 0, b_addr, size)
                for k in range(size):
                    acc += ram_a.read(k)*ram_b.read(k)
        return acc

    (AST, in preorder: FunctionDef, Assign, For, Assign, Expr, For, Assign, Expr, For, AugAssign, Return)

    When we traverse an abstract syntax tree (AST), if we encounter a node of specific types such as For and While, we try to fork before and join after the node.
  59. Extra Example Code 64

    (Same code as above, now fully annotated: the pass checks decouplability at the outer For, finds RAM A decouplable, and marks where RAM A forks and joins; it then checks the middle For, finds RAM B decouplable, and marks where RAM B forks and joins.)
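The traversal described above can be sketched with Python's built-in ast module. This is a simplified illustration: is_decouplable is a stub standing in for the real dependence analysis, and the pass only records candidate fork/join points rather than rewriting the tree.

```python
import ast

SRC = """
acc = 0
for i in range(size):
    ram_a.dma_read(axi_a, 0, a_addr, size)
    for k in range(size):
        acc += ram_a.read(k)
"""

def is_decouplable(node):
    # Stub: the real pass performs dependence checking here.
    return True

fork_join_points = []

class ForkJoinMarker(ast.NodeVisitor):
    # At each For/While node, record a fork point before it and a
    # join point after it, if the node is decouplable.
    def visit_For(self, node):
        if is_decouplable(node):
            fork_join_points.append(("fork_before", node.lineno))
            fork_join_points.append(("join_after", node.end_lineno))
        self.generic_visit(node)

    visit_While = visit_For

ForkJoinMarker().visit(ast.parse(SRC))
print(fork_join_points)  # fork/join marks for the outer and inner loops
```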
  60. Extra Example Code 65

    (Left: the original main() from the previous slides. Right: the same code after synchronization API calls are inserted.)

    def main():
        acc = 0
        for i in range(size):
            a_addr = a_offset + i*size*word_size
            ram_a.dma_read(axi_a, 0, a_addr, size)
            ram_a.push()
            ram_a.wait_not_empty()
            for j in range(size):
                b_addr = (b_offset + ram_a.read(j)*size*word_size)
                ram_b.dma_read(axi_b, 0, b_addr, size)
                ram_b.push()
                ram_b.wait_not_empty()
                for k in range(size):
                    acc += ram_a.read(k)*ram_b.read(k)
                ram_b.pop()
                ram_b.wait_not_full()
            ram_a.pop()
            ram_a.wait_not_full()
        return acc
  61. Extra Example Code 66

    def main():
        acc = 0
        thd_a = Thread(target=comm_a)
        thd_a.start()
        for i in range(size):
            ram_a.wait_not_empty()
            thd_b = Thread(target=comm_b)
            thd_b.start()
            for j in range(size):
                ram_b.wait_not_empty()
                for k in range(size):
                    acc += ram_a.read(k)*ram_b.read(k)
                ram_b.pop()
            thd_b.join()
            ram_a.pop()
        thd_a.join()
        return acc

    def comm_a():
        for i in range(size):
            a_addr = a_offset + i*size*word_size
            ram_a.dma_read(axi_a, 0, a_addr, size)
            ram_a.push()
            ram_a.wait_not_full()

    def comm_b():
        for j in range(size):
            b_addr = (b_offset + ram_a.read(j)*size*word_size)
            ram_b.dma_read(axi_b, 0, b_addr, size)
            ram_b.push()
            ram_b.wait_not_full()
  65. Extra Example Code 70

    (Same decoupled code as above, fully annotated: RAM A forks where thd_a starts comm_a and RAM B forks where thd_b starts comm_b; RAM B joins at thd_b.join() and RAM A joins at thd_a.join().)
  66. Extra Example Code 71

    (Diagram over the same code: three FSMs in total, the main FSM plus the forked FSMs for comm_a and comm_b, each driven from Idle through an Activate handshake and re-synchronized at the joins.)
  67. Data Structure for Synchronization 73

    (Diagram: a ring of RAM slots with head and tail pointers)

    class DS4Sync:
        def __init__(self, size, length):
            self.rams = [RAM(size) for i in range(length)]
            self.length = length
            self.head = 0
            self.tail = 0
            self.occupancy = 0

        def wait_not_full(self):
            while self.occupancy == self.length:
                pass

        def wait_not_empty(self):
            while self.occupancy == 0:
                pass

        def push(self):
            self.tail = (self.tail + 1) % self.length
            self.occupancy += 1

        def pop(self):
            self.head = (self.head + 1) % self.length
            self.occupancy -= 1

        def read_producer(self, addr):
            return self.rams[self.tail][addr]

        def write_producer(self, addr, data):
            self.rams[self.tail][addr] = data

        def read_consumer(self, addr):
            return self.rams[self.head][addr]

        def write_consumer(self, addr, data):
            self.rams[self.head][addr] = data
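For intuition, the class above can be exercised as a plain Python model with one producer (communication) thread and one consumer (computation) thread. Two adaptations for the software demo: RAM(size) is replaced by a Python list, and a lock guards the occupancy update, which is atomic in hardware but not in threaded Python.

```python
import threading

class DS4SyncModel:
    """Software model of DS4Sync: a ring of RAM slots with head/tail
    pointers shared by a producer and a consumer thread."""
    def __init__(self, size, length):
        self.rams = [[0] * size for _ in range(length)]
        self.length = length
        self.head = 0
        self.tail = 0
        self.occupancy = 0
        self._lock = threading.Lock()  # occupancy update is atomic in HW

    def wait_not_full(self):
        while self.occupancy == self.length:
            pass  # busy-wait, like the hardware FSM

    def wait_not_empty(self):
        while self.occupancy == 0:
            pass

    def push(self):
        self.tail = (self.tail + 1) % self.length
        with self._lock:
            self.occupancy += 1

    def pop(self):
        self.head = (self.head + 1) % self.length
        with self._lock:
            self.occupancy -= 1

    def write_producer(self, addr, data):
        self.rams[self.tail][addr] = data

    def read_consumer(self, addr):
        return self.rams[self.head][addr]

SIZE, LENGTH, TILES = 4, 2, 8
buf = DS4SyncModel(SIZE, LENGTH)
out = []

def producer():  # models the communication FSM filling tiles
    for t in range(TILES):
        buf.wait_not_full()
        for a in range(SIZE):
            buf.write_producer(a, t * SIZE + a)
        buf.push()

def consumer():  # models the computation FSM draining tiles
    for t in range(TILES):
        buf.wait_not_empty()
        out.append([buf.read_consumer(a) for a in range(SIZE)])
        buf.pop()

p = threading.Thread(target=producer)
p.start()
consumer()
p.join()
print(out[0], out[-1])  # [0, 1, 2, 3] [28, 29, 30, 31]
```

With length = 2 this behaves as a double buffer: the producer fills one slot while the consumer drains the other, which is exactly the overlap of communication and computation the decoupled FSMs exploit.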