Automatically Generating Robotics Accelerators

2024 ISCA AI4FACD Workshop Presentation

Yuhao Zhu

July 10, 2024

Transcript

  1. Automatically Generating Robotics Accelerators. Yuhao Zhu, Computer Science & Brain and Cognitive Sciences, University of Rochester. [email protected]
  2. Localization Pipeline [HPCA 2021]. Block diagram: the frontend (visual feature matching) runs feature extraction, filtering, spatial matching, and temporal matching over camera samples to produce key point correspondences; the backend (localization optimization) fuses them with IMU and GPS samples, tracks against a map, and performs mapping (optionally persisting the map), producing a 6-DoF pose trajectory (rotation + translation). The same frontend and backend blocks compose into the VIO, Registration, and SLAM variants.
  3.–4. The same pipeline, organized by operating condition: with no map and no GPS it runs SLAM (indoor, unknown environment); with a map it runs Registration (indoor, known environment); with no map but outdoors it runs VIO (outdoor, unknown environment); with GPS it covers the outdoor, known environment case.
  5. Archytas [MICRO 2021]. System overview: an M-DFG is built from the SLAM algorithm description (§3) with compute and data-layout optimizations (linear system solver, marginalization, Jacobian matrix, ...); the hardware synthesizer (§5) combines the M-DFG with pre-designed and optimized hardware templates (§4), a latency requirement, and FPGA resource constraints; the result is the on-vehicle system (§6): an FPGA accelerator fed by the sensors and coordinated by a runtime on the host.
  6.–10. A SLAM Formulation (Maximum a Posteriori). The state p stacks the scene points and the machine pose (to be estimated). The crux of bundle adjustment (BA) is to solve a nonlinear least squares (NLS) minimization problem to estimate p [20]:

         min_p { Σ_{i=1}^{N} ||o_i − P_i(p)||²_{C_i} + ||r_p − H_p p||² }

     Term by term: the o_i are the observations (sensor measurements), P_i(·) is the scene-to-measurement transformation, each data term is measured in the C_i-weighted norm, and the second term is a regularization based on the priors H_p and r_p.
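To make the objective concrete, here is a minimal NumPy sketch of evaluating that cost for a toy scalar state; the measurement model P_i, the per-sensor parameters, and all numbers are invented for illustration and are not from the paper.

```python
# Evaluate min_p { sum_i ||o_i - P_i(p)||^2_{C_i} + ||r_p - H_p p||^2 } for a toy 1-D state p.
import numpy as np

def P_i(p, a_i):
    # Hypothetical scene-to-measurement transformation for sensor i.
    return a_i * np.sin(p)

def ba_cost(p, obs, a, C_inv, H_p, r_p):
    data = sum(ci * (o - P_i(p, ai)) ** 2 for o, ai, ci in zip(obs, a, C_inv))  # data terms
    prior = (r_p - H_p * p) ** 2                                                # regularizer
    return data + prior

a = np.array([1.0, 2.0, 0.5])            # per-sensor model parameters (illustrative)
obs = a * np.sin(0.7)                    # noiseless measurements generated at the true p = 0.7
C_inv = np.ones(3)                       # identity measurement covariances
print(ba_cost(0.7, obs, a, C_inv, H_p=0.1, r_p=0.07))  # ~0 at the true state
print(ba_cost(0.3, obs, a, C_inv, H_p=0.1, r_p=0.07))  # larger away from it
```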
  11. High-Level Algorithm. Inputs: an initial estimation p and the priors H_p, r_p.
     Nonlinear least squares solver (repeat until the exit condition is met): calculate the Jacobian matrix for each sensor i using p, J_i = dP_i(p)/dp |_p, and stack J = [J_1^T, J_2^T, ..., J_N^T]^T; calculate the information matrix H = J^T J ⊕ H_p and the information vector b = J^T e ⊕ r_p, where e is the optimization residual; prepare A and b (matrix arithmetic); solve A Δp = b; update p += Δp. Output: the optimal estimation p.
     Marginalization: calculate the Jacobian matrix for each sensor i using the optimized p+, J_i = dP_i(p)/dp |_{p+}; block H := [[M, Λ]^T, [Λ^T, A]^T] and b := [b_m, b_r]^T; compute the new H_p = A − Λ M⁻¹ Λ^T and the new r_p = b_r − Λ M⁻¹ b_m, to be used in the next sliding window.
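The following NumPy sketch walks one sliding-window round in the same order as the flowchart: form H = JᵀJ ⊕ H_p and b = Jᵀe ⊕ r_p, solve, update, then marginalize the oldest states via the Schur complement. The partition sizes, the random stand-ins for J and e, and the reading of "⊕" as simple addition are assumptions for illustration only.

```python
import numpy as np
rng = np.random.default_rng(0)

m, n = 3, 4                              # states to marginalize vs. states to keep (toy sizes)
J = rng.standard_normal((10, m + n))     # stand-in for the stacked sensor Jacobians [J_1; ...; J_N]
e = rng.standard_normal(10)              # stand-in for the optimization residual
H_p = 0.1 * np.eye(m + n)                # prior information matrix from the previous window
r_p = np.zeros(m + n)                    # prior information vector
p = np.zeros(m + n)

# --- One NLS iteration: H = J^T J (+) H_p, b = J^T e (+) r_p, solve A*dp = b, update p ---
H = J.T @ J + H_p
b = J.T @ e + r_p
dp = np.linalg.solve(H, b)               # in hardware: Cholesky decomposition + fwd/back substitution
p += dp

# --- Marginalization: block H = [[M, Lam^T], [Lam, A]] over (marginalized, kept) states ---
M, A, Lam = H[:m, :m], H[m:, m:], H[m:, :m]
b_m, b_r = b[:m], b[m:]
H_p_new = A - Lam @ np.linalg.solve(M, Lam.T)    # new H_p = A - Lam M^{-1} Lam^T
r_p_new = b_r - Lam @ np.linalg.solve(M, b_m)    # new r_p = b_r - Lam M^{-1} b_m
print(H_p_new.shape, r_p_new.shape)              # priors carried into the next sliding window
```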
  12. Algorithm Primitives. Archytas represents the localization algorithm as an M-DFG, a coarse-grained DFG in which each node, instead of being a single operation, is a relatively complex function (e.g., dense matrix multiplication) that executes on a well-optimized hardware block (Sec. 4). Table 1 lists the primitive M-DFG nodes:
         DMatInv: diagonal matrix inversion
         MatMul: matrix multiplication
         DMatMul: diagonal matrix multiplication
         MatSub: matrix subtraction (addition)
         MatTp: matrix transpose
         CD: Cholesky decomposition
         FBSub: forward and backward substitution to solve a linear system of equations with triangular matrices
         VJac: calculate the visual Jacobian matrix
         IJac: calculate the IMU Jacobian matrix
     These primitives are chosen because they are low-level enough to build complex algorithms but high-level enough to map onto well-optimized hardware blocks.
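As one possible (invented) encoding, an M-DFG can be held as a dictionary of nodes, each tagged with a primitive from Table 1 and its operand edges; the evaluator below is only meant to show the granularity of the nodes, not Archytas's actual intermediate representation.

```python
# A tiny M-DFG: nodes are Table 1 primitives, edges carry whole matrices. Illustrative only.
import numpy as np

PRIMS = {
    "MatMul":  lambda x, y: x @ y,
    "MatSub":  lambda x, y: x - y,
    "DMatInv": lambda x: np.diag(1.0 / np.diag(x)),   # inversion of a diagonal matrix
}

# Node name -> (primitive, operand names). "U", "W", "X", "V" are external inputs.
mdfg = {
    "Uinv":   ("DMatInv", ["U"]),
    "WUinv":  ("MatMul",  ["W", "Uinv"]),
    "WUinvX": ("MatMul",  ["WUinv", "X"]),
    "Schur":  ("MatSub",  ["V", "WUinvX"]),   # V - W U^{-1} X: the D-type Schur complement
}

def evaluate(graph, inputs):
    vals = dict(inputs)
    for name, (prim, args) in graph.items():            # nodes are listed in topological order
        vals[name] = PRIMS[prim](*(vals[a] for a in args))
    return vals

U, W, X, V = np.diag([2.0, 4.0]), np.ones((2, 2)), np.eye(2), 3 * np.eye(2)
print(evaluate(mdfg, {"U": U, "W": W, "X": X, "V": V})["Schur"])
```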
  13. From High-Level Algorithm to M-DAG. Example M-DFG fragment for the D-type Schur path: MatTp, DMatInv, DMatMul, MatMul, MatSub, CD, and FBSub nodes operating on blocks such as A[7x7], A[3x7], A[3x3], b[7x1], and b[3x1].
     Schur elimination: for the block linear system
         U p_x + X p_y = b_x
         W p_x + V p_y = b_y                                    (3)
     where p_x and b_x are p-dimensional column vectors and p_y and b_y are q-dimensional, the idea is to multiply the first equation by W U⁻¹ and subtract it from the second, which gives a new linear system:
         U p_x + X p_y = b_x                                    (4a)
         (V − W U⁻¹ X) p_y = b_y − W U⁻¹ b_x                    (4b)
     Equ. 4b is a q×q system involving only p_y; solving it allows us to solve Equ. 4a, a system that involves only p_x. Thus solving Equ. 4 is computationally much simpler than solving Equ. 3. Schur elimination, however, comes with overhead: comparing Equ. 3 and Equ. 4, it requires computing W U⁻¹ X (the Schur complement). This is cheap when U is diagonal (D-type Schur), but the prior calculation (Λ M⁻¹ b_m) cannot be transformed the same way, since M is not a diagonal matrix; Archytas calls this case M-type Schur. Without losing generality, let M have the block form [[M11, M12], [M21, M22]]. Then
         M⁻¹ = [[ M11⁻¹ + M11⁻¹ M12 S0⁻¹ M21 M11⁻¹ ,  −M11⁻¹ M12 S0⁻¹ ],
                [ −S0⁻¹ M21 M11⁻¹                  ,   S0⁻¹           ]]
     where S0 = M22 − M21 M11⁻¹ M12. Archytas builds a blocking strategy that chooses the blocks such that M11 is a diagonal matrix, reducing the work to a D-type Schur plus a cheap inversion of M11. (The corresponding M-DFG is omitted on the slide.)
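A quick NumPy check of the Schur-elimination identity (Equ. 3 versus Equ. 4) on a random block system with a diagonal U, i.e., the D-type case; block sizes and values are arbitrary.

```python
# Solve [U X; W V][p_x; p_y] = [b_x; b_y] directly and via Schur elimination, then compare.
import numpy as np
rng = np.random.default_rng(1)

p, q = 5, 3
U = np.diag(rng.uniform(1.0, 2.0, p))        # U is diagonal (D-type Schur)
X = rng.standard_normal((p, q))
W = rng.standard_normal((q, p))
V = rng.standard_normal((q, q)) + 5 * np.eye(q)
b_x, b_y = rng.standard_normal(p), rng.standard_normal(q)

# Direct solve of the full (p+q) x (p+q) system (Equ. 3).
ref = np.linalg.solve(np.block([[U, X], [W, V]]), np.concatenate([b_x, b_y]))

# Schur elimination (Equ. 4): a q x q solve for p_y, then a cheap solve for p_x.
U_inv = np.diag(1.0 / np.diag(U))                    # trivial because U is diagonal
S = V - W @ U_inv @ X                                # Schur complement
p_y = np.linalg.solve(S, b_y - W @ U_inv @ b_x)      # Equ. 4b
p_x = U_inv @ (b_x - X @ p_y)                        # Equ. 4a
print(np.allclose(np.concatenate([p_x, p_y]), ref))  # True
```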
  14. Hardware Template. Blocks: Visual Jacobian Unit, IMU Jacobian Unit, logic to prepare A and b, D-Type Schur Complement, Cholesky Decomposition, M-Type Schur Complement, Back Substitution, Linear System Parameter Buffer, logic to form the information matrix (H) and vector (b), Marginalization Parameter Buffer, Input Buffer, and Output Buffer. The legend distinguishes the NLS data flow from the marginalization data flow and marks the customizable blocks; the priors H_p, r_p and the estimate p move between the two flows.
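A convenient way to think of the customizable blocks is as a small configuration record of the knobs the synthesizer (next slides) will tune: n_d, n_m, and a block-size parameter s. The record and its placeholder resource estimate below are illustrative, not the template's real interface.

```python
# Illustrative configuration record for the customizable blocks of the hardware template.
from dataclasses import dataclass

@dataclass(frozen=True)
class TemplateConfig:
    n_d: int   # MAC units in the D-type Schur complement block
    n_m: int   # MAC units in the M-type Schur complement block
    s: int     # block-size parameter that the latency model feeds into L_Cholesky(s)

    def resource_estimate(self) -> int:
        # Hypothetical placeholder: pretend resources scale linearly with the three knobs.
        return 4 * self.n_d + 4 * self.n_m + 2 * self.s

print(TemplateConfig(n_d=16, n_m=8, s=4).resource_estimate())
```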
  15.–17. Synthesizer: From HW Template to Concrete Design. The template exposes a block-size parameter s and the number of MAC units in the D-type Schur and the M-type Schur blocks, denoted n_d and n_m, respectively.
     Problem formulation: the task of hardware generation is expressed as a constrained optimization:
         min_{n_d, n_m, s}  Power(n_d, n_m, s)
         s.t.  Lat(n_d, n_m, s) ≤ L*,  Res(n_d, n_m, s) ≤ R*                            (11)
     where Power(·), Lat(·), and Res(·) denote the total power, latency, and resource utilization, respectively; they are functions of n_d, n_m, and s. L* is the latency constraint specified by the designer, and R* is the resource constraint imposed by a particular FPGA system. Other optimization formulations are possible; for instance, the following could be used when performance, rather than power, is the main design objective:
         min_{n_d, n_m, s}  Lat(n_d, n_m, s)   s.t.  Res(n_d, n_m, s) ≤ R*              (12)
     Latency model: Archytas derives the latency model by calculating the critical-path latency of the M-DFG given the analytical latency models of each of the primitive nodes:
         Lat(n_d, n_m, s) = Iter × L_NLS(n_d, s) + L_Marg(n_d, n_m, s)                  (13)
     where L_NLS denotes the latency of one iteration of the (iterative) NLS solver, Iter denotes the total number of iterations in the NLS solver (a parameter set by the application), and L_Marg denotes the marginalization latency. The critical-path latency of an NLS iteration (the blocks along the solid arrows in Fig. 4) is:
         L_NLS(n_d, s) = Σ_{i=1}^{a} max{L_Jac, L_DSchur(n_d)} + L_Cholesky(s) + L_Sub  (14)
     The resulting problem is solved with mixed-integer convex programming.
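A minimal sketch of how Equ. 11–14 fit together as a search; the paper solves a mixed-integer convex program, whereas this brute-force loop uses made-up analytical models (all constants below are stand-ins) purely to show the structure of the problem.

```python
# Exhaustive search over (n_d, n_m, s): minimize power s.t. latency <= L* and resources <= R*.
# Every model and constant here is an illustrative placeholder, not Archytas's calibrated model.
import itertools

ITER, A_BLOCKS, L_JAC = 3, 4, 120.0   # NLS iterations, Jac/Schur pairs per iteration, Jacobian latency

def lat(n_d, n_m, s):
    l_dschur = 900.0 / n_d                                        # more MACs -> lower Schur latency
    l_nls = A_BLOCKS * max(L_JAC, l_dschur) + 400.0 / s + 50.0    # shape of Equ. 14
    return ITER * l_nls + 600.0 / n_m                             # Equ. 13 (second term ~ L_Marg)

def power(n_d, n_m, s):
    return 1.0 + 0.05 * n_d + 0.05 * n_m + 0.1 * s

def res(n_d, n_m, s):
    return 4 * n_d + 4 * n_m + 8 * s

L_STAR, R_STAR = 3200.0, 220
feasible = [(n_d, n_m, s)
            for n_d, n_m, s in itertools.product(range(1, 33), range(1, 33), range(1, 9))
            if lat(n_d, n_m, s) <= L_STAR and res(n_d, n_m, s) <= R_STAR]
best = min(feasible, key=lambda cfg: power(*cfg))
print("chosen (n_d, n_m, s):", best, "power:", round(power(*best), 2))
```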
  18.–19. The Need for Dynamic Reconfiguration. Plots: relative error and the number of feature points per sliding window (sliding window ID on the x-axis), and RMSE versus the average number of NLS iterations.
  20.–22. Dynamic Reconfiguration. (Figure 13 shows the influence of n_d, n_m, and s on the hardware resources and execution time.) At runtime Archytas solves a new optimization problem to generate a new hardware configuration. Initial formulation (static synthesis): Equ. 11 above, i.e., minimize Power subject to Lat ≤ L* and Res ≤ R*. New formulation:
         min_{n_d, n_m, s}  Power(n_d, n_m, s)
         s.t.  Lat(n_d, n_m, s) ≤ L*,  n_d < n_d*,  n_m < n_m*,  s < s*                 (18)
     where n_d*, n_m*, and s* denote the initial resource allocations generated by the static synthesizer. Because the new configuration never uses more resources than the static one, reconfiguration amounts to simple throttling without real-time recompilation of the bitstream.
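A matching sketch of the runtime side (Equ. 18): re-run the same kind of search, but only over configurations strictly below the static allocation (n_d*, n_m*, s*), so the result is a throttled version of the already-synthesized hardware. The lambda models repeat the toy stand-ins from the previous sketch and are not the real cost models.

```python
# Runtime reconfiguration per Equ. 18: re-optimize power with n_d < n_d*, n_m < n_m*, s < s*.
import itertools

lat   = lambda n_d, n_m, s: 3 * (4 * max(120.0, 900.0 / n_d) + 400.0 / s + 50.0) + 600.0 / n_m
power = lambda n_d, n_m, s: 1.0 + 0.05 * n_d + 0.05 * n_m + 0.1 * s

def reconfigure(nd_star, nm_star, s_star, L_star):
    cands = [(n_d, n_m, s)
             for n_d, n_m, s in itertools.product(range(1, nd_star), range(1, nm_star), range(1, s_star))
             if lat(n_d, n_m, s) <= L_star]          # never exceed the static allocation: throttle only
    return min(cands, key=lambda c: power(*c), default=(nd_star, nm_star, s_star))

# A sparse sliding window (few feature points) tolerates a looser latency target,
# so a smaller, lower-power configuration suffices until the workload picks up again.
print(reconfigure(nd_star=16, nm_star=4, s_star=4, L_star=4000.0))
```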
  23. Headline Results. Plots: power (W) versus latency (ms), and energy reduction (×) versus speedup (×), compared against Intel and Arm baselines.
  24. Localization Pipeline [HPCA 2021]. The same pipeline block diagram as slide 2: frontend (visual feature matching) and backend (localization optimization) producing the 6-DoF pose trajectory.
  25. Communication between Image Processing Accelerators. Zoom-in on the frontend: feature extraction, temporal matching, and spatial matching over camera samples (left/right, across time t) produce the key point correspondences that feed the backend.
  26. ImaGen [ISCA 2023]. Compiler framework: the Front End takes the algorithm description and an on-chip memory specification and produces a DAG; the Optimizer runs line coalescing (producing a rewritten DAG), constraint formulation (producing the constraints), and an ILP solver (producing the pipeline schedule and line buffer configuration); RTL Code Gen emits the RTL design, with Design Space Exploration around the flow.
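Purely as a structural sketch of the stages named on the slide (every type, pass body, and string below is invented), the flow can be read as a sequence of passes threading one compilation context from algorithm description to RTL:

```python
# Structural sketch of an ImaGen-style flow; all pass bodies are placeholders.
from dataclasses import dataclass, field

@dataclass
class Ctx:
    algorithm: str                                   # algorithm description (input)
    mem_spec: dict                                   # on-chip memory specification (input)
    dag: list = field(default_factory=list)          # stencil pipeline DAG
    constraints: list = field(default_factory=list)  # ILP constraints
    schedule: dict = field(default_factory=dict)     # start cycle per stage
    rtl: str = ""

def front_end(ctx):              ctx.dag = [("blur", "input"), ("sharpen", "blur")]; return ctx
def line_coalescing(ctx):        return ctx                           # rewrites the DAG (omitted)
def constraint_formulation(ctx): ctx.constraints = ["accesses_per_cycle <= ports"]; return ctx
def ilp_solver(ctx):             ctx.schedule = {"blur": 0, "sharpen": 17}; return ctx
def rtl_codegen(ctx):            ctx.rtl = f"// RTL for schedule {ctx.schedule}"; return ctx

ctx = Ctx(algorithm="blur |> sharpen", mem_spec={"ports": 2})
for compiler_pass in (front_end, line_coalescing, constraint_formulation, ilp_solver, rtl_codegen):
    ctx = compiler_pass(ctx)
print(ctx.rtl)
```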
  27. Line Buffers. Figure: pixels stream from an input buffer (Ibuff) through line-buffer rows (labeled b) and a shift register array (cells labeled c) to an output buffer (Obuff).
  28.–29. Line Buffers (continued). The same line-buffer and shift-register figure, stepped forward as new pixels shift through the buffered rows (b) and the window cells (c).
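A behavioral sketch of what the figure shows: a two-line buffer plus a 3×3 shift-register window reproducing every 3×3 neighborhood of a row-streamed image. The 3×3 size and the NumPy modeling are assumptions for illustration; ImaGen generates RTL, not Python.

```python
# Line buffer + shift-register window: keep only 2 full lines on chip and emit one 3x3 window
# per cycle once enough pixels have streamed in. Behavioral model for illustration only.
import numpy as np

def stream_3x3_windows(image):
    H, W = image.shape
    lb = np.zeros((2, W), dtype=image.dtype)    # two line buffers (rows y-2 and y-1)
    win = np.zeros((3, 3), dtype=image.dtype)   # 3x3 shift-register array
    for y in range(H):
        for x in range(W):
            px = image[y, x]
            col = np.array([lb[0, x], lb[1, x], px])   # the newest column of the window
            win = np.roll(win, -1, axis=1)             # shift the window left by one pixel
            win[:, -1] = col
            lb[0, x], lb[1, x] = lb[1, x], px          # rotate this column of the line buffers
            if y >= 2 and x >= 2:
                yield y - 1, x - 1, win.copy()         # window centered at (y-1, x-1)

img = np.arange(36).reshape(6, 6)
for cy, cx, w in stream_3x3_windows(img):
    assert np.array_equal(w, img[cy - 1:cy + 2, cx - 1:cx + 2])
print("all 3x3 windows reproduced from two buffered lines")
```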
  30.–31. Delaying Certain Consumers. The same figure, now with one consumer's reads delayed relative to its producer.
  32.–35. Optimization Formulation. The job of the hardware generator can be described as a constrained optimization over the pipeline schedule φ, i.e., the collection of start cycles {S_i}, i ∈ [0, 1, ..., N−1], where N is the number of pipeline stages:
         min_φ  LB(φ) = Σ_{i=0}^{N−1} LB_i(φ)                                           (1a)
         s.t.   ∀(p, c):  the gap S_c − S_p between a consumer c and its producer p is bounded as a function of the consumer's stencil extent and the image width W       (1b)
                ∀ i, t:   B_{i,t}(φ) ≤ P                                                (1c)
     Equ. 1a is the objective: minimize LB(φ), the total line buffer size summed over all stages. Equ. 1b, imposed for every producer-consumer pair (p, c), guarantees no intermediate off-chip accesses. Equ. 1c bounds B_{i,t}, the number of accesses to each line buffer at any given cycle t, by the number of ports P of the SRAM.
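For flavor, here is a brute-force version of the schedule search on a toy three-stage chain: choose start cycles S_i that minimize the lines kept on chip, subject to simplified stand-ins for constraints 1b and 1c. The cost model, the constraint forms, and all constants are invented; ImaGen formulates this exactly and hands it to an ILP solver.

```python
# Toy schedule search: pick start cycles for a 3-stage chain to minimize buffered lines.
import itertools, math

W, K = 8, 3                        # image width and stencil height (toy values)
edges = [(0, 1), (1, 2)]           # producer -> consumer pairs

def buffered_lines(S):
    # Lines of each producer's output that must stay on chip until its consumer reads them.
    return sum(K - 1 + math.ceil((S[c] - S[p] - ((K - 1) * W + 1)) / W) for p, c in edges)

def feasible(S):
    # (1b) stand-in: a consumer starts only after its producer has streamed one full stencil window.
    window_ready = all(S[c] - S[p] >= (K - 1) * W + 1 for p, c in edges)
    # (1c) stand-in: with single-ported buffers, don't let two stages kick off on the same cycle.
    return window_ready and len(set(S)) == len(S)

best = min(
    (S for S in itertools.product(range(60), repeat=3) if S[0] == 0 and feasible(S)),
    key=buffered_lines,
)
print("start cycles:", best, "buffered lines:", buffered_lines(best))
```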
  36.–37. Line Coalescing to Increase Memory Utilization. Figure: multiple buffered lines (b) are coalesced so they occupy on-chip memory more densely. The second slide recaps the compiler framework to locate the Line Coalescing pass inside the Optimizer (Front End → Optimizer → RTL Code Gen).
  38. Other Things We Work On: computational image sensors, human vision-driven AR/VR, and computational art & art history.
  39. Computer Science & Brain and Cognitive Sciences, University of Rochester. Ethan Chen, Yu Feng, Nisarg Ujjainkar, Yiming Gan, Abhishek Tyagi, Weikai Lin, Louise He, Yawo Siatitse. https://horizon-lab.org/