Automatically Generating Robotics Accelerators

2024 ISCA AI4FACD Workshop Presentation

Yuhao Zhu

July 10, 2024

Transcript

  1. Automatically Generating Robotics Accelerators
     Yuhao Zhu, Computer Science & Brain and Cognitive Sciences, University of Rochester ([email protected])
  2. Localization Pipeline [HPCA 2021]
     [Figure: the localization pipeline. Frontend (visual feature matching): feature extraction, temporal matching, spatial matching, and filtering over camera samples, producing key-point correspondences (spatial and temporal). Backend (localization optimization): fusion of the correspondences with IMU and GPS samples, tracking (in map), and mapping, with an optional persisted map. Output: a 6-DoF pose trajectory (rotation + translation). The same block diagram covers VIO, Registration, and SLAM.]
  3. Localization Pipeline [HPCA 2021]
     [Same pipeline figure, annotated with the taxonomy of localization modes: No Map + No GPS = SLAM (indoor, unknown environment); With Map + No GPS = Registration (indoor, known environment); No Map + With GPS = VIO (outdoor, unknown environment); With Map + With GPS = outdoor, known environment.]
  4. Localization Pipeline [HPCA 2021]
     [Same annotated figure as slide 3.]
  5. Archytas [MICRO 2021]
     [Figure: Archytas overview. The SLAM algorithm description is turned into an M-DFG (§3) with compute and data-layout optimizations; the hardware synthesizer (§5) maps the M-DFG onto pre-designed and optimized hardware templates (§4) for blocks such as the linear system solver, marginalization, and Jacobian matrix, subject to the latency requirement and FPGA resource constraints; the result runs on the on-vehicle system (§6) of sensors, a host runtime, and the FPGA.]
  6. A SLAM Formulation (Maximum a Posteriori)
     ... represents the 3D coordinates of the j-th observed point. The crux of bundle adjustment (BA) is to solve a nonlinear least-squares (NLS) optimization problem to estimate p [20]:
     $$\min_{p}\ \Big\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p\, p \rVert^2 \Big\}$$
  7. A SLAM Formulation (Maximum a Posteriori)
     Same formulation, annotated: p is the scene points and machine pose (to be estimated).
  8. A SLAM Formulation (Maximum a Posteriori)
     Further annotation: P_i(·) is the scene-to-measurement transformation.
  9. A SLAM Formulation (Maximum a Posteriori)
     Further annotation: o_i are the observations (sensor measurements).
  10. A SLAM Formulation (Maximum a Posteriori)
      Further annotation: the second term is regularization based on the priors H_p and r_p.
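To make the objective concrete, here is a minimal NumPy sketch that evaluates the MAP cost on toy data. Everything in it is an illustrative assumption rather than the paper's implementation: the projection P_i is taken to be a linear map, the weighted norm ||x||^2_C is taken as x^T C x, and the prior is trivial.

```python
import numpy as np

def weighted_sq_norm(x, C):
    # ||x||^2_C, taken here as x^T C x with C a weight (information) matrix.
    return float(x @ C @ x)

def map_cost(p, observations, project, Cs, H_p, r_p):
    """MAP objective from the slide:
       sum_i ||o_i - P_i(p)||^2_{C_i} + ||r_p - H_p p||^2."""
    data_term = sum(weighted_sq_norm(o - project(i, p), Cs[i])
                    for i, o in enumerate(observations))
    prior = r_p - H_p @ p
    return data_term + float(prior @ prior)

# Toy data: a 3-vector "pose" observed through hypothetical linear projections A_i.
rng = np.random.default_rng(0)
A = [rng.standard_normal((2, 3)) for _ in range(4)]
project = lambda i, p: A[i] @ p
p_true = np.array([1.0, -0.5, 2.0])
obs = [A_i @ p_true + 0.01 * rng.standard_normal(2) for A_i in A]
Cs = [np.eye(2)] * 4
print(map_cost(p_true, obs, project, Cs, H_p=np.eye(3), r_p=p_true))  # near zero
```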
  11. High-Level Algorithm
      [Flowchart: the nonlinear least-squares (NLS) solver and marginalization.]
      NLS solver (input: initial estimate p and priors H_p, r_p; output: optimal estimate p). Repeat until the exit condition is met:
      - Calculate the Jacobian matrix for each sensor i using p: J_i = dP_i(p)/dp |_p, and stack J = [J_1^T, J_2^T, ..., J_N^T]^T.
      - Calculate the information matrix H = J^T J ⊕ H_p and the information vector b = J^T e ⊕ r_p, where e is the optimization residual.
      - Prepare A and b (matrix arithmetic) and solve A Δp = b.
      - Update p += Δp.
      Marginalization (input: the new estimate p+ and priors H_p, r_p). Calculate the Jacobian matrix for each sensor i using p+: J_i = dP_i(p)/dp |_{p+}. Block H := [[M, Λ]^T, [Λ^T, A]^T], block b := [b_m, b_r]^T; compute the new H_p = A - Λ M^{-1} Λ^T and the new r_p = b_r - Λ M^{-1} b_m, to be used in the next sliding window.
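The loop on this slide can be sketched in a few lines of dense NumPy. This is a Gauss-Newton-style stand-in for the Levenberg-Marquardt solver, with the slide's "⊕" folded in as plain addition and np.linalg.solve standing in for the Cholesky decomposition plus forward/backward substitution; the blocking of H inside marginalize() is an assumption chosen so the result matches the slide's Hp/rp formulas.

```python
import numpy as np

def solve_nls(p, residual, jacobian, H_p, r_p, iters=10, tol=1e-9):
    """Sketch of the slide's NLS loop (Gauss-Newton flavor).
       residual(p) -> e (stacked over sensors), jacobian(p) -> J = [J_1^T, ..., J_N^T]^T."""
    for _ in range(iters):
        e = residual(p)
        J = jacobian(p)
        H = J.T @ J + H_p            # information matrix; "J^T J ⊕ Hp" as plain addition
        b = J.T @ e + r_p            # information vector; "J^T e ⊕ rp" as plain addition
        dp = np.linalg.solve(H, b)   # solve A*Δp = b (stand-in for CD + FBSub)
        p = p + dp
        if np.linalg.norm(dp) < tol: # exit condition (placeholder)
            break
    return p

def marginalize(H, b, m):
    """Schur-complement marginalization of the first m states, matching the slide:
       new Hp = A - Λ M^{-1} Λ^T, new rp = br - Λ M^{-1} bm.
       Blocking assumed: M = H[:m,:m], A = H[m:,m:], Λ = H[m:,:m], b = [bm, br]."""
    M, A, Lam = H[:m, :m], H[m:, m:], H[m:, :m]
    bm, br = b[:m], b[m:]
    Minv = np.linalg.inv(M)
    return A - Lam @ Minv @ Lam.T, br - Lam @ Minv @ bm

# Toy usage: recover p in a linear model o ≈ A p, with a weak (near-zero) prior.
A_mat = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
o = np.array([1.0, -2.0, 0.1])
p_hat = solve_nls(np.zeros(2),
                  residual=lambda p: o - A_mat @ p,
                  jacobian=lambda p: A_mat,
                  H_p=1e-3 * np.eye(2), r_p=np.zeros(2))
```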
  12. Algorithm Primitives
      Archytas represents the localization algorithm as an M-DFG: a coarse-grained dataflow graph where each node, instead of being a single operation, is a relatively complex function (e.g., dense matrix multiplication) that executes on a well-optimized hardware block (Sec. 4). The underlying solver is the Levenberg-Marquardt method, a class of gradient-based algorithms used commercially and in 3D reconstruction.
      Table 1: Primitive M-DFG nodes.
      - DMatInv: diagonal matrix inversion
      - MatMul: matrix multiplication
      - DMatMul: diagonal matrix multiplication
      - MatSub: matrix subtraction (addition)
      - MatTp: matrix transpose
      - CD: Cholesky decomposition
      - FBSub: forward and backward substitution to solve linear systems of equations with triangular matrices
      - VJac: calculate visual Jacobian matrix
      - IJac: calculate IMU Jacobian matrix
      Archytas supports the set of primitive M-DFG nodes listed in Tbl. 1, chosen because they are low-level enough to build complex algorithms but high-level enough ...
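As a rough illustration of what a coarse-grained M-DFG over these primitives could look like in software, here is a small interpreter. The node names follow Tbl. 1, but the data structure, the dense-NumPy stand-ins for the hardware blocks, and the example graph are assumptions for exposition, not Archytas code.

```python
import numpy as np
from dataclasses import dataclass, field

# Dense software stand-ins for a few of the Table 1 primitives (illustrative only).
PRIMITIVES = {
    "MatMul":  lambda a, b: a @ b,
    "MatTp":   lambda a: a.T,
    "MatSub":  lambda a, b: a - b,
    "DMatInv": lambda a: np.diag(1.0 / np.diag(a)),   # diagonal matrix inversion
    "CD":      lambda a: np.linalg.cholesky(a),       # Cholesky decomposition
}

@dataclass
class Node:
    op: str                                      # one of the primitive node types
    inputs: list = field(default_factory=list)   # names of producer nodes / graph inputs

def run_mdfg(nodes: dict, inputs: dict) -> dict:
    """Evaluate a topologically ordered M-DFG: each node is a coarse-grained
       function that, in hardware, would run on a dedicated block."""
    values = dict(inputs)
    for name, node in nodes.items():
        args = [values[i] for i in node.inputs]
        values[name] = PRIMITIVES[node.op](*args)
    return values

# Toy graph: compute S = A - B^T B and its Cholesky factor.
graph = {
    "Bt":  Node("MatTp",  ["B"]),
    "BtB": Node("MatMul", ["Bt", "B"]),
    "S":   Node("MatSub", ["A", "BtB"]),
    "L":   Node("CD",     ["S"]),
}
A = 10 * np.eye(3)
B = np.arange(9.0).reshape(3, 3) / 10
out = run_mdfg(graph, {"A": A, "B": B})
```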
  13. From High-Level Algorithm to M-DAG
      [Figure: an example M-DFG built from the primitives (MatMul, MatTp, DMatInv, DMatMul, MatSub, CD, FBSub) operating on blocks such as A[7x7], A[3x7], A[3x3], b[7x1], and b[3x1].]
      D-type Schur. Consider the block linear system
      $$U\,p_x + X\,p_y = b_x, \qquad W\,p_x + V\,p_y = b_y \qquad (3)$$
      where p_x and b_x are p-dimensional column vectors, and p_y and b_y are q-dimensional column vectors. The idea of Schur elimination is to multiply the first equation by W U^{-1} and subtract it from the second equation, which gives a new linear system:
      $$U\,p_x + X\,p_y = b_x \qquad (4a)$$
      $$(V - W U^{-1} X)\,p_y = b_y - W U^{-1} b_x \qquad (4b)$$
      Equ. 4b is a q-by-q system involving only p_y; solving it allows us to solve Equ. 4a, a system that involves only p_x. Thus, solving Equ. 4 is computationally much simpler than solving Equ. 3. Schur elimination, however, comes with overhead: comparing Equ. 3 and Equ. 4, it requires computing W U^{-1} X (forming the Schur complement) and transforming the prior calculation (Λ M^{-1} b_m), which cannot be simplified the same way because M is not a diagonal matrix; we call this case M-type Schur. Without loss of generality, let M have the block form [[M11, M12], [M21, M22]]. Then
      $$M^{-1} = \begin{bmatrix} M_{11}^{-1} + M_{11}^{-1} M_{12} S_0^{-1} M_{21} M_{11}^{-1} & -M_{11}^{-1} M_{12} S_0^{-1} \\ -S_0^{-1} M_{21} M_{11}^{-1} & S_0^{-1} \end{bmatrix}$$
      where S_0 is M22 - M21 M11^{-1} M12. Archytas builds a blocking strategy such that M11 is a diagonal matrix, reducing the computation to a D-type Schur plus inverting M11. (The corresponding M-DFG is omitted on the slide.)
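A numerical sketch of the D-type Schur elimination described above (Equ. 3 to Equ. 4), with U assumed diagonal so that inverting it is trivially cheap; the block sizes and random data are purely illustrative.

```python
import numpy as np

def dtype_schur_solve(U, X, W, V, bx, by):
    """Solve the block system of Equ. 3,
         U px + X py = bx
         W px + V py = by,
       via Schur elimination (Equ. 4): first the reduced q x q system
         (V - W U^{-1} X) py = by - W U^{-1} bx,
       then back-substitution for px. U is assumed diagonal ('D-type'),
       so its inverse is an element-wise reciprocal."""
    Uinv = np.diag(1.0 / np.diag(U))
    rhs = by - W @ Uinv @ bx
    py = np.linalg.solve(V - W @ Uinv @ X, rhs)   # Equ. 4b
    px = Uinv @ (bx - X @ py)                     # Equ. 4a
    return px, py

# Toy instance: p = 4 (diagonal block U), q = 2.
rng = np.random.default_rng(1)
U = np.diag(rng.uniform(1.0, 2.0, size=4))
X = rng.standard_normal((4, 2)); W = rng.standard_normal((2, 4))
V = 5.0 * np.eye(2) + rng.standard_normal((2, 2))
bx, by = rng.standard_normal(4), rng.standard_normal(2)
px, py = dtype_schur_solve(U, X, W, V, bx, by)
# Agreement with solving the full (p+q) x (p+q) system directly:
ref = np.linalg.solve(np.block([[U, X], [W, V]]), np.concatenate([bx, by]))
print(np.max(np.abs(np.concatenate([px, py]) - ref)))   # ~1e-15
```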
  14. Hardware Template
      [Figure: the hardware template. Blocks: Input Buffer; Visual Jacobian Unit; IMU Jacobian Unit; Logic to Form the Information Matrix (H) and Vector (b); Logic to Prepare A, b; D-Type Schur Complement; Cholesky Decomposition; Back Substitution; Linear System Parameter Buffer; M-Type Schur Complement; Marginalization Parameter Buffer; Output Buffer. Inputs/outputs include the priors Hp, rp and the estimate p. Legend: NLS data flow, marginalization data flow, customizable blocks.]
  15. Synthesizer: From HW Template to Concrete Design
      The design knobs are s, n_d, and n_m, where n_d and n_m denote the number of MAC units in the D-type Schur and M-type Schur blocks, respectively (s parameterizes the Cholesky decomposition block; cf. L_Cholesky(s) in Equ. 14).
      Problem formulation: the task of hardware generation is expressed as a constrained optimization, solved as a mixed-integer convex program:
      $$\min_{n_d, n_m, s}\ \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.}\ \ \mathrm{Lat}(n_d, n_m, s) \le L^*,\ \mathrm{Res}(n_d, n_m, s) \le R^* \qquad (11)$$
      where Power(·), Lat(·), and Res(·) denote the total power, latency, and resource utilization, respectively; they are functions of n_d, n_m, and s. L* is the latency constraint specified by the designer, and R* is the resource constraint imposed by a particular FPGA system.
  16. Synthesizer: From HW Template to Concrete Design
      Other optimization formulations are possible. For instance, the following could be used when performance, rather than power, is the main design objective:
      $$\min_{n_d, n_m, s}\ \mathrm{Lat}(n_d, n_m, s) \quad \text{s.t.}\ \ \mathrm{Res}(n_d, n_m, s) \le R^* \qquad (12)$$
      Latency model: Archytas derives the latency model by calculating the critical-path latency of the M-DFG given analytical latency models of each primitive node:
      $$\mathrm{Lat}(n_d, n_m, s) = \mathrm{Iter} \times L_{NLS}(n_d, s) + L_{Marg}(n_d, n_m, s) \qquad (13)$$
      where L_NLS is the latency of one iteration of the (iterative) NLS solver, Iter is the total number of NLS iterations (a parameter set by the application), and L_Marg is the marginalization latency.
  17. Synthesizer: From HW Template to Concrete Design
      The critical-path latency of an NLS iteration (the blocks along the solid arrows in Fig. 4) is:
      $$L_{NLS}(n_d, s) = \sum_{i=1}^{a} \max\{L_{Jac},\ L_{DSchur}(n_d)\} + L_{Cholesky}(s) + L_{Sub} \qquad (14)$$
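To show the shape of Equ. 11-14, here is a tiny exhaustive-search sketch over (n_d, n_m, s). The power, resource, and per-block latency models are made-up monotone placeholders (the paper uses analytical models and a mixed-integer convex solver); only the structure of the objective and constraints follows the slide.

```python
import itertools

# Placeholder analytical models (illustrative only; monotone in the knobs).
ITER, A_STAGES = 5, 3                      # NLS iterations; 'a' in Equ. 14
L_JAC, L_SUB = 40.0, 30.0
def L_dschur(nd): return 600.0 / nd
def L_chol(s):    return 900.0 / s
def L_marg(nd, nm): return 500.0 / nd + 2000.0 / nm

def latency(nd, nm, s):                    # Equ. 13 / Equ. 14
    l_nls = A_STAGES * max(L_JAC, L_dschur(nd)) + L_chol(s) + L_SUB
    return ITER * l_nls + L_marg(nd, nm)

def power(nd, nm, s):     return 0.05 * (nd + nm) + 0.1 * s + 1.0
def resources(nd, nm, s): return 2 * (nd + nm) + 8 * s

L_STAR, R_STAR = 1500.0, 400.0             # designer latency bound, FPGA resource bound

# Equ. 11: minimize power subject to the latency and resource constraints.
best = min(
    (cfg for cfg in itertools.product(range(1, 33), range(1, 33), range(1, 17))
     if latency(*cfg) <= L_STAR and resources(*cfg) <= R_STAR),
    key=lambda cfg: power(*cfg),
    default=None,
)
print("chosen (nd, nm, s):", best)
```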
  18. The Need for Dynamic Reconfiguration
      [Plots: relative error vs. sliding window ID, and the number of feature points vs. sliding window ID (IDs roughly 400 to 900).]
  19. The Need for Dynamic Reconfiguration
      [Same plots, plus: RMSE vs. the average number of NLS iterations.]
  20. Dynamic Reconfiguration
      [Figure 13: the influence of n_d, n_m, and s on the hardware (panel (a): impact of n_d).]
      At runtime, Archytas solves a modified problem to generate a new hardware configuration. Initial formulation: Equ. 11 (minimize power subject to the latency and resource constraints). New formulation:
      $$\min_{n_d, n_m, s}\ \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.}\ \ \mathrm{Lat}(n_d, n_m, s) \le L^*,\ n_d < n_d^*,\ n_m < n_m^*,\ s < s^* \qquad (18)$$
      where n_d^*, n_m^*, and s^* denote the initial resource allocations generated by the static synthesizer.
  21. Dynamic Reconfiguration
      (Same content as slide 20.)
  22. Dynamic Reconfiguration
      Examining the new formulation: the new configuration never exceeds the initial allocation, so reconfiguration is simple throttling without real-time recompilation of the bitstream!
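A sketch of the runtime re-solve in Equ. 18: the search is restricted to configurations strictly within the static allocation (n_d*, n_m*, s*), so the chosen point can be applied by throttling existing units instead of regenerating a bitstream. The latency and power lambdas are the same kind of made-up placeholders as in the earlier synthesizer sketch.

```python
import itertools

def throttle(nd_star, nm_star, s_star, latency, power, L_star):
    """Equ. 18 (sketch): pick the lowest-power configuration that still meets the
       latency target L_star, restricted to nd < nd*, nm < nm*, s < s*. Because no
       knob grows beyond the static allocation, the change is pure throttling
       (e.g., gating unused units), with no bitstream rebuild."""
    feasible = (cfg for cfg in itertools.product(range(1, nd_star),
                                                 range(1, nm_star),
                                                 range(1, s_star))
                if latency(*cfg) <= L_star)
    return min(feasible, key=lambda cfg: power(*cfg),
               default=(nd_star, nm_star, s_star))   # fall back to the static design

# Example with trivial placeholder models:
lat = lambda nd, nm, s: 600.0 / nd + 2000.0 / nm + 900.0 / s
pwr = lambda nd, nm, s: 0.05 * (nd + nm) + 0.1 * s
print(throttle(16, 16, 8, lat, pwr, L_star=500.0))
```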
  23. Headline Results
      [Plots: power (W) vs. time (ms) for the generated designs, and energy reduction (x) vs. speedup (x), labeled "vs. Intel" and "vs. Arm".]
  24. Localization Pipeline [HPCA 2021]
      [Same localization pipeline figure as slide 2.]
  25. Communication b/t Image Processing Accelerators
      [Figure: the frontend of the pipeline (feature extraction, temporal matching, and spatial matching over camera samples, producing key-point correspondences), highlighting the communication between image processing accelerators.]
  26. ImaGen [ISCA 2023]
      [Figure: the ImaGen compiler framework. Front End: algorithm description -> DAG. Optimizer: line coalescing (rewritten DAG), constraint formulation using the on-chip memory specification (constraints), ILP solver (pipeline schedule and line buffer configuration), and design space exploration. RTL Code Gen: produces the RTL.]
  27. Line Buffers
      [Figure: a line buffer feeding a shift-register array; pixels stream from the input buffer (Ibuff), through the line buffer and the shift-register array, to the output buffer (Obuff).]
  28. Line Buffers
      [Animation build of the line buffer figure from slide 27.]
  29. Line Buffers
      [Animation build of the line buffer figure from slide 27.]
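A behavioral software model of the structure in the slide-27 figure: pixels stream in one per cycle, a line buffer keeps the last k-1 image rows on chip, and a shift-register array exposes the current k x k stencil window. The naming follows the figure (Ibuff/Obuff); the model itself is an illustrative assumption, not generated RTL.

```python
from collections import deque

def stream_stencil(pixels, width, k=3):
    """Line buffer + shift-register array model.
       `pixels`: the image in row-major streaming order (the Ibuff side).
       Yields (row, col, window) whenever a full k x k window is available
       (the Obuff side). On-chip state: (k-1) rows plus a k x k register window."""
    line_buffer = deque(maxlen=(k - 1) * width)   # the last k-1 rows of pixels
    window = [[0] * k for _ in range(k)]          # shift-register array
    for idx, pix in enumerate(pixels):
        row, col = divmod(idx, width)
        if row >= k - 1:                          # line buffer holds k-1 full rows
            # Column of k pixels entering the window: k-1 buffered + the new one.
            column = [line_buffer[i * width] for i in range(k - 1)] + [pix]
            for r in range(k):                    # shift each window row left by one
                window[r] = window[r][1:] + [column[r]]
            if col >= k - 1:                      # all k columns refreshed this row
                yield row, col, [w[:] for w in window]
        line_buffer.append(pix)

# Toy usage: 3x3 box-filter sums over a 5x5 image streamed pixel by pixel.
img = list(range(25))
sums = {(r, c): sum(map(sum, win)) for r, c, win in stream_stencil(img, width=5)}
```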
  30. Delaying Certain Consumers
      [Figure: the same line buffer example, illustrating the effect of delaying certain consumer stages.]
  31. Delaying Certain Consumers
      [Same figure as slide 30 (animation build).]
  32. Optimization Formulation
      Formally, the job of the hardware generator can be described as a constrained optimization:
      $$\min_{\phi}\ LB(\phi) = \sum_{i=0}^{N-1} LB_i(\phi), \quad \text{where}\ \phi = \{S_i\},\ i \in [0, 1, \cdots, N-1] \qquad (1a)$$
      subject to constraints (1b) and (1c). Equ. 1a states the optimization objective: the schedule φ is the collection of the start cycles {S_i} of all the pipeline stages (each S_i is an integer and N is the number of pipeline stages), and LB(φ) denotes the total line buffer size. Constraint (1b) relates the start cycles of every producer-consumer stage pair; constraint (1c) bounds, at any given cycle, the number of accesses to each line buffer by its number of ports P.
  33. Optimization Formulation
      Annotation on (1a): minimize the total line buffer sizes.
  34. Optimization Formulation
      Annotation on (1b): no intermediate off-chip accesses.
  35. Optimization Formulation
      Annotation on (1c): the number of accesses to each buffer must not exceed the number of ports.
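A toy version of (1a)-(1c) as an ILP, using PuLP (an assumed, generic ILP library; ImaGen's actual constraint generation and buffer-size model are richer than this). Stage start cycles are integer variables; the objective charges each producer-consumer edge a buffer cost proportional to the schedule skew; the dependency constraints stand in for (1b); and a linearized "no two starts in the same cycle" constraint between two stages stands in for the port limit (1c).

```python
import pulp

# Toy pipeline DAG: stage 0 feeds 1 and 2; both feed 3.
edges = {(0, 1): 3, (0, 2): 5, (1, 3): 3, (2, 3): 3}   # (producer, consumer): min skew
N, M = 4, 1000                                          # stage count, big-M constant

prob = pulp.LpProblem("line_buffer_schedule", pulp.LpMinimize)
S = [pulp.LpVariable(f"S{i}", lowBound=0, upBound=M, cat="Integer") for i in range(N)]

# (1a) stand-in: total buffering is proportional to producer->consumer skew.
prob += pulp.lpSum(S[c] - S[p] for (p, c) in edges)

# (1b) stand-in: each consumer starts only after its producer's data is on chip.
for (p, c), skew in edges.items():
    prob += S[c] - S[p] >= skew

# (1c) stand-in: stages 1 and 2 share a buffer port, so they may not start on
# the same cycle (linearized with a binary ordering variable).
y = pulp.LpVariable("order_1_2", cat="Binary")
prob += S[1] - S[2] >= 1 - M * y
prob += S[2] - S[1] >= 1 - M * (1 - y)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(pulp.value(s)) for s in S])   # e.g. a schedule like [0, 3, 5, 8]
```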
  36. Line Coalescing to Increase Memory Utilization
      [Figure: line coalescing example over the buffered lines.]
  37. Line Coalescing to Increase Memory Utilization
      [Same figure, shown alongside the compiler framework from slide 26: line coalescing is the Optimizer pass that rewrites the DAG before constraint formulation and the ILP solver.]
  38. Other Things We Work On
      Computational Image Sensors; Human Vision-Driven AR/VR; Computational Art & Art History
  39. Computer Science & Brain and Cognitive Sciences, University of Rochester
      https://horizon-lab.org/
      Ethan Chen, Yu Feng, Nisarg Ujjainkar, Yiming Gan, Abhishek Tyagi, Weikai Lin, Louise He, Yawo Siatitse