Slide 1

Slide 1 text

Automatically Generating Robotics Accelerators
Yuhao Zhu
Computer Science & Brain and Cognitive Sciences, University of Rochester

Slide 2

Slide 2 text

Localization Pipeline [HPCA 2021]

[Figure: localization pipeline block diagram.]
Input: camera samples (left and right, L/R), IMU samples, GPS samples.
Frontend (visual feature matching): feature extraction, spatial matching, temporal matching. Output: key point correspondences (spatial and temporal).
Backend (localization optimization): filtering, fusion, tracking (in map), mapping; the map can optionally be persisted.
Output: 6 DoF pose trajectory (rotation + translation).
Legend: blocks are tagged with the modes that use them: VIO (V), SLAM (S), Registration (R).

Slide 3

Slide 3 text

Localization Pipeline [HPCA 2021]

[Figure: the localization pipeline block diagram, overlaid with the operating scenarios below.]

            No Map                               With Map
No GPS      SLAM (indoor unknown environment)    Registration (indoor known environment)
With GPS    VIO (outdoor unknown environment)    (outdoor known environment)

Slide 5

Slide 5 text

Archytas [MICRO 2021]

[Figure: Archytas framework overview.]
Input: SLAM algorithm description (Jacobian matrix, linear system solver, marginalization, ……).
Building the M-DFG from the algorithm description (§3): compute optimizations and data layout optimizations.
Pre-designed and optimized hardware templates (§4).
Hardware synthesizer (§5): takes the latency requirement and the FPGA resource constraints.
On-vehicle system (§6): sensors, runtime (host), FPGA.

Slide 6

Slide 6 text

A SLAM Formulation (Maximum a Posteriori)

… represents the 3D coordinates of the j-th observed point. The crux of BA is to solve a nonlinear least squares (NLS) optimization problem to estimate p [20]:

$$\min_{p} \left\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p p \rVert^2 \right\}$$

Slide 7

Slide 7 text

A SLAM Formulation (Maximum a Posteriori)

$$\min_{p} \left\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p p \rVert^2 \right\}$$

p: scene points and machine pose (to be estimated)

Slide 8

Slide 8 text

A SLAM Formulation (Maximum a Posteriori)

$$\min_{p} \left\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p p \rVert^2 \right\}$$

p: scene points and machine pose (to be estimated)
P_i: scene to measurement transformation

Slide 9

Slide 9 text

A SLAM Formulation (Maximum a Posteriori)

$$\min_{p} \left\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p p \rVert^2 \right\}$$

p: scene points and machine pose (to be estimated)
o_i: observations (sensor measurements)
P_i: scene to measurement transformation

Slide 10

Slide 10 text

A SLAM Formulation (Maximum a Posteriori)

$$\min_{p} \left\{ \sum_{i=1}^{N} \lVert o_i - P_i(p) \rVert^2_{C_i} + \lVert r_p - H_p p \rVert^2 \right\}$$

p: scene points and machine pose (to be estimated)
o_i: observations (sensor measurements)
P_i: scene to measurement transformation
Second term: regularization based on priors
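To make the objective concrete, here is a minimal sketch that evaluates it for scalar observations. It assumes the C_i-weighted norm means dividing the squared residual by the covariance C_i; the function name `map_objective` and the toy numbers are hypothetical and not taken from the talk.

```python
def map_objective(p, observations, projections, covariances, H_p, r_p):
    """Evaluate sum_i ||o_i - P_i(p)||^2_{C_i} + ||r_p - H_p p||^2 (scalar o_i)."""
    # Data term: squared residual of each sensor, weighted by its scalar covariance C_i
    # (assumed interpretation of the C_i-weighted norm).
    data = sum((o - P(p)) ** 2 / c
               for o, P, c in zip(observations, projections, covariances))
    # Prior term: squared Euclidean norm of r_p - H_p p, with H_p given as a list of rows.
    prior = sum((r - sum(h * x for h, x in zip(row, p))) ** 2
                for row, r in zip(H_p, r_p))
    return data + prior


# Toy example (all values hypothetical): two states, two sensors.
p = [1.0, 2.0]
projections = [lambda q: q[0] + q[1], lambda q: q[0] * q[1]]
observations = [3.5, 2.0]
covariances = [0.25, 1.0]
H_p = [[1.0, 0.0], [0.0, 1.0]]
r_p = [1.0, 1.0]

cost = map_objective(p, observations, projections, covariances, H_p, r_p)
```

Here the first sensor contributes 0.5² / 0.25 = 1, the second contributes 0, and the prior contributes (1 − 2)² = 1.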

Slide 11

Slide 11 text

High-Level Algorithm

Nonlinear Least Squares Solver
Inputs: initial estimation p; prior H_p, r_p.
1. Calculate the Jacobian matrix for each sensor i using p: J_i = dP_i(p)/dp |_p.
2. Prepare A and b (matrix arithmetic).
3. Solve A Δp = b.
4. Update p += Δp.
5. Exit condition met? If not, repeat from step 1.
Output: optimal estimation p.

Marginalization
Inputs: initial estimation p+; optimization residual e; prior H_p, r_p.
1. Calculate the Jacobian matrix for each sensor i using p+: J_i = dP_i(p)/dp |_{p+}, and J = [J_1^T, J_2^T, …, J_N^T]^T.
2. Calculate the information matrix H = J^T J ⊕ H_p and the information vector b = J^T e ⊕ r_p.
3. Block H := [ [M, Λ]^T, [Λ^T, A]^T ] and b := [b_m, b_r]^T.
4. Compute the new H_p = A − Λ M^{-1} Λ^T and the new r_p = b_r − Λ M^{-1} b_m.
Output: new H_p and r_p, to be used in the next sliding window.
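The NLS loop above can be sketched on a one-parameter toy problem. This is only an illustration under stated assumptions: the measurement model P_i(p) = exp(p · t_i) is made up, the prior terms H_p and r_p are left out, and A and b are formed as J^T J and J^T e (plain Gauss-Newton), which may differ from the solver used in the actual system.

```python
import math


def gauss_newton(p, ts, obs, max_iters=20, tol=1e-10):
    """Scalar Gauss-Newton: Jacobian -> prepare A, b -> solve -> update -> exit test."""
    for _ in range(max_iters):
        preds = [math.exp(p * t) for t in ts]       # P_i(p), hypothetical sensor model
        e = [o - y for o, y in zip(obs, preds)]     # residuals o_i - P_i(p)
        J = [t * y for t, y in zip(ts, preds)]      # J_i = dP_i(p)/dp at the current p
        A = sum(j * j for j in J)                   # A = J^T J (a scalar here)
        b = sum(j * r for j, r in zip(J, e))        # b = J^T e
        dp = b / A                                  # solve A * dp = b
        p += dp                                     # update p += dp
        if abs(dp) < tol:                           # exit condition
            break
    return p


ts = [0.0, 1.0, 2.0]
obs = [math.exp(0.5 * t) for t in ts]               # noise-free data generated with p = 0.5
p_hat = gauss_newton(0.0, ts, obs)
```

Because the data are noise-free, the iteration recovers the generating parameter within a few steps from the starting point 0.0.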

Slide 12

Slide 12 text

Algorithm Primitives

Archytas represents the localization algorithm using an M-DFG, which is a coarse-grained DFG where each node, instead of being one single operation, is a relatively complex function (e.g., dense matrix multiplication) that executes on a well-optimized hardware block (Sec. 4).

Table 1: Primitive M-DFG nodes.

Node Type   Description
DMatInv     Diagonal matrix inversion
MatMul      Matrix multiplication
DMatMul     Diagonal matrix multiplication
MatSub      Matrix subtraction (addition)
MatTp       Matrix transpose
CD          Cholesky decomposition
FBSub       Forward and backward substitution to solve a linear system of equations with triangular matrices
VJac        Calculate visual Jacobian matrix
IJac        Calculate IMU Jacobian matrix

Archytas supports a set of primitive M-DFG nodes listed in Tbl. 1. We choose these primitive nodes because they are low-level enough to build complex algorithms but high-…
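One way to picture a coarse-grained M-DFG is as a dictionary of typed nodes with dependencies, scheduled in topological order. The sketch below is illustrative only: the graph, the node names, and the dictionary layout are hypothetical and are not Archytas's internal representation; only the node types come from Tbl. 1. The right-hand side of the linear system is omitted for brevity.

```python
from graphlib import TopologicalSorter

# Primitive node types from Tbl. 1.
PRIMITIVES = {"DMatInv", "MatMul", "DMatMul", "MatSub", "MatTp",
              "CD", "FBSub", "VJac", "IJac"}

# Hypothetical graph: node name -> (primitive type, nodes it depends on).
# It forms J^T J from the visual and IMU Jacobians, then factors and solves.
mdfg = {
    "J_v":   ("VJac",   []),
    "J_i":   ("IJac",   []),
    "J_v_T": ("MatTp",  ["J_v"]),
    "J_i_T": ("MatTp",  ["J_i"]),
    "H_v":   ("MatMul", ["J_v_T", "J_v"]),
    "H_i":   ("MatMul", ["J_i_T", "J_i"]),
    "A":     ("MatSub", ["H_v", "H_i"]),   # MatSub also covers addition (Tbl. 1)
    "L":     ("CD",     ["A"]),
    "dp":    ("FBSub",  ["L"]),
}

# A valid execution order: every node appears after all of its inputs.
dependencies = {name: deps for name, (_, deps) in mdfg.items()}
schedule = list(TopologicalSorter(dependencies).static_order())
```

Independent nodes (for example the two Jacobian units) have no ordering between them in this graph, which is the kind of parallelism a hardware mapping could exploit.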

Slide 13

Slide 13 text

From High-Level Algorithm to M-DFG

[Figure: M-DFG of the D-type Schur elimination, built from DMatInv, MatMul, MatTp, DMatMul, MatSub, CD, and FBSub nodes operating on A[7x7], A[3x7], A[3x3], b[7x1], and b[3x1].]

D-type Schur. Consider the blocked linear system

$$U \Delta p_x + X \Delta p_y = b_x, \qquad W \Delta p_x + V \Delta p_y = b_y \tag{3}$$

where Δp_x and b_x are p-dimensional column vectors, and Δp_y and b_y are q-dimensional column vectors. The idea of Schur elimination is to multiply the first equation with W U^{-1} and subtract the first equation from the second equation, which gives a new linear system:

$$U \Delta p_x + X \Delta p_y = b_x \tag{4a}$$
$$(V - W U^{-1} X) \Delta p_y = b_y - W U^{-1} b_x \tag{4b}$$

Critically, Equ. 4b is a q×q system and requires solving only for Δp_y, solving which would allow us to solve Equ. 4a, a system that involves only Δp_x. Thus, solving Equ. 4 is computationally much simpler than solving Equ. 3. Schur elimination, however, comes with its overhead. Comparing Equ. 3 and Equ. 4, Schur elimination requires computing W U^{-1} X (also known as the Schur complement of matrix …

M-type Schur. … transforming the prior calculation (… Λ M^{-1} b_m), which have the form of … can not be simply transformed to … since M is not a diagonal matrix … which we call M-type Schur … Without losing generality, let M be in the blocked matrix form [ M_11 M_12 ; M_21 M_22 ]. Then:

$$M^{-1} = \begin{bmatrix} M_{11}^{-1} + M_{11}^{-1} M_{12} S'^{-1} M_{21} M_{11}^{-1} & \cdots \\ -S'^{-1} M_{21} M_{11}^{-1} & \cdots \end{bmatrix}$$

where S' is M_22 − M_21 M_11^{-1} M_12. Archytas, again, builds a cost … blocking strategy. We find that the … such that M_11 is a diagonal matrix … D-type Schur and inverting M_11 … The M-DFG is omitted here … A desirable side effect of S' …
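As a worked instance of Equ. 4, the sketch below solves a small blocked system by composing plain-Python stand-ins for the Tbl. 1 primitives. It assumes U is diagonal (stored as its diagonal) and that the reduced matrix V − W U^{-1} X is symmetric positive definite so Cholesky applies; the function names and the test matrices are hypothetical, not the hardware's interfaces.

```python
import math


def dmat_inv(d):        # DMatInv: invert a diagonal matrix stored as its diagonal
    return [1.0 / x for x in d]


def dmat_mul(m, d):     # DMatMul: m @ diag(d), i.e. scale column j of m by d[j]
    return [[v * d[j] for j, v in enumerate(row)] for row in m]


def mat_mul(a, b):      # MatMul
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mat_sub(a, b):      # MatSub
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cholesky(s):        # CD: factor S = L L^T with L lower triangular
    n = len(s)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def fb_sub(L, rhs):     # FBSub: solve L y = rhs (forward), then L^T x = y (backward)
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def d_type_schur_solve(u_diag, X, W, V, bx, by):
    col = lambda v: [[x] for x in v]                     # vector -> column matrix
    u_inv = dmat_inv(u_diag)
    w_uinv = dmat_mul(W, u_inv)                          # W U^{-1}
    S = mat_sub(V, mat_mul(w_uinv, X))                   # V - W U^{-1} X
    rhs = mat_sub(col(by), mat_mul(w_uinv, col(bx)))     # b_y - W U^{-1} b_x
    dpy = fb_sub(cholesky(S), [r[0] for r in rhs])       # Equ. 4b
    x_dpy = mat_mul(X, col(dpy))
    dpx = [u_inv[i] * (bx[i] - x_dpy[i][0]) for i in range(len(bx))]  # Equ. 4a
    return dpx, dpy


# Toy system built so that the solution is dpx = [1, 1], dpy = [1, 2].
dpx, dpy = d_type_schur_solve(
    u_diag=[2.0, 4.0],
    X=[[1.0, 0.0], [2.0, 1.0]],
    W=[[1.0, 2.0], [0.0, 1.0]],
    V=[[5.0, 1.0], [1.0, 3.0]],
    bx=[3.0, 8.0],
    by=[10.0, 8.0],
)
```

Because U is diagonal, inverting it and multiplying by it cost only element-wise operations, so the only dense factorization left is on the smaller q×q system.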

Slide 14

Slide 14 text

Hardware Template

[Figure: hardware template block diagram.]
Blocks: Input Buffer; Visual Jacobian Unit; IMU Jacobian Unit; Logics to Prepare A, b; D-Type Schur Complement; Cholesky Decomposition; Back Substitution; Logics to Form Information Matrix (H) and Vector (b); M-Type Schur Complement; Linear System Parameter Buffer; Marginalization Parameter Buffer; Output Buffer.
Signals: p; prior H_p, r_p; H_p, r_p.
Legend: NLS data flow; marginalization data flow; customizable blocks.

Slide 15

Slide 15 text

Synthesizer: From HW Template to Concrete Design

… block and the number of MAC units in the D-type Schur and the M-type Schur blocks, denoted n_d and n_m, respectively.

Problem Formulation. The task of hardware generation is expressed in the form of a constrained optimization:

$$\min_{n_d, n_m, s} \; \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Lat}(n_d, n_m, s) \le L^*, \;\; \mathrm{Res}(n_d, n_m, s) \le R^* \tag{11}$$

where Power(·), Lat(·), and Res(·) denote the total power, latency, and resource utilization, respectively; they are functions of n_d, n_m, and s. L* is the latency constraint specified by the designer, and R* is the resource constraint imposed by a particular FPGA system.

It is worth noting that other optimization formulations are possible. For instance, the following formulation could be used for scenarios where performance, rather than power, is the main design objective: …

Solved as: Mixed-Integer Convex Programming

Slide 16

Slide 16 text

Synthesizer: From HW Template to Concrete Design

It is worth noting that other optimization formulations are possible. For instance, the following formulation could be used for scenarios where performance, rather than power, is the main design objective:

$$\min_{n_d, n_m, s} \; \mathrm{Lat}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Res}(n_d, n_m, s) \le R^* \tag{12}$$

Latency Model. Archytas derives the latency model by calculating the critical path latency of the M-DFG given the analytical latency models of each of the primitive nodes:

$$\mathrm{Lat}(n_d, n_m, s) = \mathrm{Iter} \times L_{NLS}(n_d, s) + L_{Marg}(n_d, n_m, s) \tag{13}$$

where L_NLS denotes the latency of an iteration of the (iterative) NLS solver, Iter denotes the total number of iterations in the NLS solver (a parameter set by the application), and L_Marg denotes the marginalization latency. The critical-path latency of an NLS iteration (the blocks along the solid arrows in Fig. 4) is expressed as follows: …

Solved as: Mixed-Integer Convex Programming

Slide 17

Slide 17 text

Synthesizer: From HW Template to Concrete Design

Latency Model. Archytas derives the latency model by calculating the critical path latency of the M-DFG given the analytical latency models of each of the primitive nodes:

$$\mathrm{Lat}(n_d, n_m, s) = \mathrm{Iter} \times L_{NLS}(n_d, s) + L_{Marg}(n_d, n_m, s) \tag{13}$$

where L_NLS denotes the latency of an iteration of the (iterative) NLS solver, Iter denotes the total number of iterations in the NLS solver (a parameter set by the application), and L_Marg denotes the marginalization latency. The critical-path latency of an NLS iteration (the blocks along the solid arrows in Fig. 4) is expressed as follows:

$$L_{NLS}(n_d, s) = \sum_{i=1}^{a} \max\{L_{Jac}, L_{DSchur}(n_d)\} + L_{Cholesky}(s) + L_{Sub} \tag{14}$$

Solved as: Mixed-Integer Convex Programming
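The shape of the power-minimization problem can be illustrated with a brute-force search. This sketch deliberately swaps the mixed-integer convex programming used in the talk for exhaustive enumeration, and every model below (`l_nls`, `l_marg`, `res`, `power`, and all constants) is a made-up stand-in with the same argument structure as Equ. 13, not the analytical models from the paper.

```python
from itertools import product

ITER = 3       # NLS iterations per window (set by the application); hypothetical value
L_JAC = 4.0    # Jacobian latency, assumed constant in this toy model
L_SUB = 1.0    # substitution latency, assumed constant


def l_nls(nd, s):
    # One NLS iteration: Jacobian overlapped with D-type Schur, then Cholesky, then FBSub.
    return max(L_JAC, 48.0 / nd) + 24.0 / s + L_SUB


def l_marg(nd, nm, s):
    return 36.0 / nm + 48.0 / nd + 24.0 / s


def lat(nd, nm, s):
    # Same structure as Equ. 13: Iter x L_NLS + L_Marg.
    return ITER * l_nls(nd, s) + l_marg(nd, nm, s)


def res(nd, nm, s):
    return 2 * nd + nm + 3 * s


def power(nd, nm, s):
    return 1.0 + 0.2 * nd + 0.1 * nm + 0.3 * s


def synthesize(lat_max, res_max, limit=16):
    """Return the (nd, nm, s) with minimum power that meets both constraints, or None."""
    feasible = [c for c in product(range(1, limit + 1), repeat=3)
                if lat(*c) <= lat_max and res(*c) <= res_max]
    return min(feasible, key=lambda c: power(*c), default=None)


best = synthesize(lat_max=40.0, res_max=60)
```

Enumeration is only viable here because the toy design space has 16³ points; a solver-based formulation scales to larger spaces and gives optimality guarantees when the models are convex.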

Slide 18

Slide 18 text

The Need for Dynamic Reconfiguration

[Figure: plot with axes Relative Error, # of Feature Points, and Sliding Window ID.]

Slide 19

Slide 19 text

The Need for Dynamic Reconfiguration

[Figure: plot with axes Relative Error, # of Feature Points, and Sliding Window ID.]
[Figure: plot with axes RMSE and Avg. # of NLS Iterations.]

Slide 20

Slide 20 text

Dynamic Reconfiguration

[Figure 13: The influences of n_d, n_m, s on the hardware …; panel (a): impact of n_d, with axes Res, FF, and Time.]

Initial formulation (static synthesizer):

$$\min_{n_d, n_m, s} \; \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Lat}(n_d, n_m, s) \le L^*, \;\; \mathrm{Res}(n_d, n_m, s) \le R^* \tag{11}$$

New formulation, solved to generate a new hardware configuration:

$$\min_{n_d, n_m, s} \; \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Lat}(n_d, n_m, s) \le L^*, \;\; n_d < n_d^*, \;\; n_m < n_m^*, \;\; s < s^* \tag{18}$$

where n_d^*, n_m^*, and s^* denote the initial resource allocations generated by the static synthesizer.

Slide 22

Slide 22 text

Dynamic Reconfiguration

Initial formulation (static synthesizer):

$$\min_{n_d, n_m, s} \; \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Lat}(n_d, n_m, s) \le L^*, \;\; \mathrm{Res}(n_d, n_m, s) \le R^* \tag{11}$$

New formulation, solved to generate a new hardware configuration:

$$\min_{n_d, n_m, s} \; \mathrm{Power}(n_d, n_m, s) \quad \text{s.t.} \quad \mathrm{Lat}(n_d, n_m, s) \le L^*, \;\; n_d < n_d^*, \;\; n_m < n_m^*, \;\; s < s^* \tag{18}$$

where n_d^*, n_m^*, and s^* denote the initial resource allocations generated by the static synthesizer.

Simple throttling without real-time recompilation of the bitstream!
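The throttling idea behind Equ. 18 can be sketched as a search restricted to configurations strictly below the static allocation. Everything here is an assumption made for illustration: the toy `lat` and `power` models, the constants, and in particular the fallback to the static configuration when no smaller one meets the latency bound, which the slide does not specify.

```python
from itertools import product


def lat(nd, nm, s, iters):
    # Toy latency model: work per NLS iteration shrinks with more units.
    return iters * (48.0 / nd + 24.0 / s) + 36.0 / nm


def power(nd, nm, s):
    # Toy power model: strictly increasing in every parameter.
    return 0.2 * nd + 0.1 * nm + 0.3 * s


def throttle(static, iters, lat_max):
    """Pick the lowest-power configuration with nd < nd*, nm < nm*, s < s* (Equ. 18)."""
    nd_max, nm_max, s_max = static
    candidates = product(range(1, nd_max), range(1, nm_max), range(1, s_max))
    feasible = [c for c in candidates if lat(*c, iters) <= lat_max]
    # Assumed fallback: keep the static allocation if nothing smaller is fast enough.
    return min(feasible, key=lambda c: power(*c), default=static)


static = (12, 12, 6)                                 # hypothetical static allocation
light = throttle(static, iters=2, lat_max=60.0)      # easy window: few NLS iterations
heavy = throttle(static, iters=6, lat_max=60.0)      # harder window: more iterations
```

No resource constraint appears in the search because every candidate already fits inside the statically synthesized design, so switching configurations only means using fewer of the existing units.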

Slide 23

Slide 23 text

Headline Results

[Figure: plot with axes Power (W) and Time (ms).]
[Figure: plot with axes Energy Reduction (X) and Speedup (X); series: vs. Intel, vs. Arm.]

Slide 24

Slide 24 text

Localization Pipeline [HPCA 2021]

[Figure: the localization pipeline block diagram from Slide 2, shown again: camera, IMU, and GPS samples feed the frontend (visual feature matching) and the backend (localization optimization), which outputs the 6 DoF pose trajectory.]

Slide 25

Slide 25 text

Communication Between Image Processing Accelerators

[Figure: the frontend (visual feature matching) of the localization pipeline: camera samples (L, R), feature extraction, spatial matching, temporal matching, and key point correspondences passed to the backend.]

Slide 26

Slide 26 text

ImaGen [ISCA 2023]

[Figure: compiler framework.]
Inputs: algorithm description; on-chip memory specification.
Stages: Front End; Optimizer (Line Coalescing, Constraint Formulation, ILP Solver); RTL Code Gen.
Intermediate results: DAG; rewritten DAG; constraints; pipeline schedule; line buffer configuration.
Output: RTL.
Design space exploration iterates over the framework.

Slide 27

Slide 27 text

Line Buffers

[Figure: pipeline stages (K) connected through line buffers (LB). Labels: line buffer, shift register array, Ibuff, Obuff.]

Slide 28

Slide 28 text

Line Buffers

[Figure: two stages (K) connected through a line buffer.]

Slide 29

Slide 29 text

Line Buffers

[Figure: two stages (K) connected through a line buffer, next animation step.]

Slide 30

Slide 30 text

Line Buffers

[Figure: two stages (K) connected through a line buffer, final animation step.]
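The role of a line buffer can be illustrated in software: a consumer with a 3x3 stencil reads a row-major pixel stream and only needs the most recent two full lines plus three pixels. The sketch below is a behavioral model under that assumption, not the generated RTL; the function name `stencil3x3_stream` and the box-sum kernel are hypothetical.

```python
from collections import deque


def stencil3x3_stream(pixels, width, kernel=sum):
    """Apply a 3x3 stencil to a row-major pixel stream using a bounded buffer."""
    # Two full lines plus three pixels are enough to cover any 3x3 window.
    buf = deque(maxlen=2 * width + 3)
    out = []
    for idx, px in enumerate(pixels):
        buf.append(px)
        row, col = divmod(idx, width)
        if row >= 2 and col >= 2:
            # buf[0] is pixel (row - 2, col - 2); stepping by `width` moves down one line.
            window = [buf[r * width + c] for r in range(3) for c in range(3)]
            out.append(kernel(window))
    return out


# 4x4 image with pixel values 0..15, streamed row by row.
sums = stencil3x3_stream(range(16), width=4)
```

The producer never has to write the whole image to memory: once a pixel is older than two lines plus three pixels it falls out of the buffer, which is what keeps intermediate data on chip.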

Slide 31

Slide 31 text

Multiple Consumers

[Figure: one producer stage (K) feeding two consumer stages (K) through a shared line buffer.]

Slide 32

Slide 32 text

Supporting Multiple Consumers using FIFOs

[Figure: three stages (K) connected through FIFOs.]

Slide 33

Slide 33 text

Delaying Certain Consumers

[Figure: three stages (K) sharing a line buffer, with some consumers delayed.]

Slide 35

Slide 35 text

Optimization Formulation

Formally, the job of our hardware generator can be described in a constrained optimization formulation:

$$\min_{q} \; LB(q) = \sum_{i=0}^{N-1} LB_i(q), \quad \text{where } q = \{S_i\}, \; i \in [0, 1, \cdots, N-1] \tag{1a}$$
$$\text{s.t.} \quad \forall (p, c): \; S_c - S_p \ge (SH_c - 1) \times W + 1, \tag{1b}$$
$$\forall l, \forall t: \; B_{l,t}(q) \le P. \tag{1c}$$

Optimization Objective. Equ. 1a states the optimization objective. The schedule, q, denotes the collection of the start cycles of all the stages {S_i} (i is an integer between 0 and N − 1, where N is the number of pipeline stages). LB(q) denotes the total line buffer size, …

Slide 36

Slide 36 text

Optimization Formulation

Annotations on the formulation (Equ. 1a to 1c):
Equ. 1a: minimize total line buffer sizes

Slide 37

Slide 37 text

Optimization Formulation

Annotations on the formulation (Equ. 1a to 1c):
Equ. 1a: minimize total line buffer sizes
Equ. 1b: no intermediate off-chip accesses

Slide 38

Slide 38 text

Optimization Formulation

Annotations on the formulation (Equ. 1a to 1c):
Equ. 1a: minimize total line buffer sizes
Equ. 1b: no intermediate off-chip accesses
Equ. 1c: # of accesses to each buffer <= # of ports
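The dependency constraint (Equ. 1b) can be explored with a simple as-soon-as-possible schedule. This sketch is not the ILP from the talk: it ignores the port constraint (Equ. 1c), reads SH_c as the stencil height of consumer c and W as the image width (both are assumptions), and uses a made-up size proxy in which a producer must buffer as many pixels as the largest start-cycle gap to any of its consumers, at one pixel per cycle.

```python
from graphlib import TopologicalSorter


def asap_schedule(consumers, stencil_height, width):
    """Earliest start cycles S_i with S_c - S_p >= (SH_c - 1) * W + 1 on every edge."""
    preds = {}
    for p, cs in consumers.items():
        preds.setdefault(p, set())
        for c in cs:
            preds.setdefault(c, set()).add(p)
    start = {}
    for node in TopologicalSorter(preds).static_order():
        # Source stages start at cycle 0; others wait for their slowest producer.
        start[node] = max(
            (start[p] + (stencil_height[node] - 1) * width + 1 for p in preds[node]),
            default=0,
        )
    return start


def buffered_pixels(consumers, start):
    """Assumed size proxy: largest start-cycle gap between a producer and its consumers."""
    return {p: max(start[c] - start[p] for c in cs)
            for p, cs in consumers.items() if cs}


# Hypothetical pipeline: K0 feeds K1 and K2, and K1 also feeds K2.
consumers = {"K0": ["K1", "K2"], "K1": ["K2"], "K2": []}
stencil_height = {"K0": 1, "K1": 3, "K2": 3}
start = asap_schedule(consumers, stencil_height, width=8)
spans = buffered_pixels(consumers, start)
```

In this example K2 has to wait for K1, so K0's data must stay buffered for 34 cycles instead of 17. Trading such delays against buffer size and port limits is the kind of decision the ILP formulation makes jointly.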

Slide 39

Slide 39 text

Line Coalescing to Increase Memory Utilization

[Figure: line coalescing example showing stages (K) and line-buffer rows before and after coalescing.]

Slide 40

Slide 40 text

Line Coalescing to Increase Memory Utilization

[Figure: line coalescing example, shown alongside the ImaGen compiler framework diagram (Front End; Optimizer with Line Coalescing, Constraint Formulation, and ILP Solver; RTL Code Gen; Design Space Exploration).]

Slide 41

Slide 41 text

Headline Results

Slide 42

Slide 42 text

Headline Results

Slide 43

Slide 43 text

Other Things We Work On

Computational Image Sensors
Human Vision-Driven AR/VR
Computational Art & Art History

Slide 44

Slide 44 text

Computer Science & Brain and Cognitive Sciences, University of Rochester
https://horizon-lab.org/

Ethan Chen, Yu Feng, Nisarg Ujjainkar, Yiming Gan, Abhishek Tyagi, Weikai Lin, Louise He, Yawo Siatitse