
Ophélie RENAUD

François Orieux

January 20, 2025

Transcript

  1. SimSDP: Proof of Concept for Radio Astronomy Imaging on High-Performance Architectures

    O. Renaud*, M. Quinson, N. Gac. *ENS Rennes, IRISA, CNRS, SATIE Paris-Saclay, France. GT SKA
  2. Introduction: Square Kilometre Array (SKA)

    Figure labels: CSP, visibilities, SDP, output image. CSP: Central Signal Processor. SDP: Science Data Processor.
  3. Introduction: Square Kilometre Array (SKA)

    Figure labels: CSP, visibilities, SDP, output image. Huge collecting surface; storage constraints; process as fast as possible. Data rates shown: 2Pb/s, 20Tb/s, 8.9Tb/s, 7.8Tb/s.
  4. Introduction: How to simulate SDP imaging pipelines on HPC systems?

    Not parallel programming experts; algorithm in development; HPC system not yet built; SKA objectives.
  5. Previous work: Graph-based Algorithm-Architecture Adequation (AAA)

    Model of Computation (MoC), Model of Architecture (MoA), adequation, code generation.
    [r] C. Erbas et al., « Multi objective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design »
  6. Previous work: Dataflow MoC

    SDF: Synchronous Dataflow. PiSDF: Parameterized and Interfaced SDF.
    ✓ Expresses several types of parallelism: task, data and pipeline parallelism.
    ✓ Ensures consistency and prevents manual mistakes.
    ✓ High predictability.
    ✓ Allows automatic resource allocation.
    Figure: actors A to E connected by edges annotated with token rates, scheduled on Core1, Core2 and Core3 over iterations i and i+1.
    [r] E. Lee and D. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing”.
    [r] K. Desnos et al., "PiMM: Parameterized and interfaced dataflow meta-model for MPSoCs runtime reconfiguration".
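The consistency property claimed for SDF on this slide is usually checked by solving the balance equations of the graph: for every edge, the number of tokens produced per iteration must equal the number consumed. The following is a minimal sketch of that check, not code from PREESM or SimSDP. The function name `repetition_vector`, the edge tuple layout and the three-actor graph are all hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Return the smallest integer firing counts that balance every edge.

    edges is a list of (src, production_rate, dst, consumption_rate).
    Raises ValueError if the graph is inconsistent or not connected.
    """
    adjacency = {a: [] for a in actors}
    for src, prod, dst, cons in edges:
        # Balance equation: q[src] * prod == q[dst] * cons
        adjacency[src].append((dst, Fraction(prod, cons)))
        adjacency[dst].append((src, Fraction(cons, prod)))

    ratio = {actors[0]: Fraction(1)}  # firing counts relative to the first actor
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, factor in adjacency[a]:
            expected = ratio[a] * factor
            if b not in ratio:
                ratio[b] = expected
                stack.append(b)
            elif ratio[b] != expected:
                raise ValueError("inconsistent SDF graph")
    if len(ratio) != len(actors):
        raise ValueError("graph is not connected")

    # Scale the rational ratios to the smallest integer solution.
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {a: int(r * scale) for a, r in ratio.items()}


# Hypothetical chain: A produces 2 tokens, B consumes 1; B produces 1, C consumes 2.
edges = [("A", 2, "B", 1), ("B", 1, "C", 2)]
firings = repetition_vector(["A", "B", "C"], edges)
print(firings)  # {'A': 1, 'B': 2, 'C': 1}
```

In this example B fires twice per graph iteration, which is the kind of information a scheduler can use to expose data parallelism automatically.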
  7. Previous work: Resource allocation on standard system

    Figure labels: inputs, SCAPE clustering, translation, flattening, DAG, timing extraction, scheduling, mapping. The example graph (actors A, B, C, with B clustered as 2(B)) is mapped onto Core0 and Core1 with a shared memory.
    [1] O. Renaud, D. Gageot, K. Desnos, J.-F. Nezan. SCAPE: HW-Aware Clustering of Dataflow Actors for Tunable Scheduling Complexity, DASIP, 2023
    [2] O. Renaud, N. Haggui, K. Desnos, J.-F. Nezan. Automated Clustering and Pipelining of Dataflow Actors for Controlled Scheduling Complexity, EUSIPCO, 2023
    [3] O. Renaud, H. Miomandre, K. Desnos, J.-F. Nezan. Automated Level-Based Clustering of Dataflow Actors for Controlled Scheduling Complexity, JSA, 2024
  8. Previous work: SimSDP - Resource allocation on HPC system

    SimSDP: thread-level partitioning (the previous slide), node-level partitioning, simulation.
    [4] O. Renaud, A. Gougeon, K. Desnos, C. Phillips, J. Tuthill, M. Quinson, J.-F. Nezan. SimSDP: Automatic Workload-Balancing on Multi-Node & Multi-Core HPC Architectures based on dataflow models, TPDS.
  9. Previous work: Co-designing HPC HW/SW with SimSDP

    for (archi α ∊ 〈nNode, nCore, Topology〉): build architecture model; store latency, memory, energy, cost.
    Stop → ∃i ≥ δα : Lfinal(α, i) ≤ Lfinal(Smax)
    This proves the reliability of SimSDP and its usability for HPC DSE.
    [5] Ophélie Renaud, Karol Desnos, Erwan Raffin, Jean-François, “Multicore and Network Topology Codesign for Pareto-Optimal Multinode Architecture”, EUSIPCO, 2024
  10. Existing work: Radio astronomy imaging principle

    Pipeline blocks: CSP visibilities; set up; Δ (major loop), degridding-gridding; Ψ (minor loop), deconvolution; G; output image.
    Annotation: correlation point of a pair of antennas.
  11. Existing work: Radio astronomy imaging principle

    Pipeline blocks: CSP visibilities; set up; Δ (major loop), degridding-gridding; Ψ (minor loop), deconvolution; G.
    Annotations: (u, v) plane; allows FFT⁻¹; allows visibility comparison and model adjustment; dirty image.
  12. Existing work: Radio astronomy imaging principle

    Pipeline blocks: CSP visibilities; set up; Δ (major loop), degridding-gridding; Ψ (minor loop), deconvolution; G.
    Annotations: dirty image; ∗ PSF; sky image.
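The labels on this slide (dirty image, ∗ PSF, sky image) match the usual imaging relation in which the dirty image is the sky image convolved with the point spread function (PSF), and deconvolution tries to undo that convolution. The sketch below only illustrates this relation with NumPy. The function name `dirty_image`, the 8×8 grid and the toy PSF are assumptions made for the example, and a circular FFT convolution is used for brevity where a real pipeline would handle padding and image edges.

```python
import numpy as np


def dirty_image(sky, psf):
    """Circularly convolve sky with psf using FFTs.

    The PSF peak is assumed to sit at index (n // 2, n // 2) of the psf array.
    """
    psf_origin = np.fft.ifftshift(psf)  # move the PSF centre to index (0, 0)
    spectrum = np.fft.fft2(sky) * np.fft.fft2(psf_origin)
    return np.real(np.fft.ifft2(spectrum))


n = 8
sky = np.zeros((n, n))
sky[3, 5] = 2.0  # hypothetical point source with flux 2

psf = np.zeros((n, n))
psf[n // 2, n // 2] = 1.0      # main lobe
psf[n // 2, n // 2 + 1] = 0.5  # toy sidelobe one pixel to the right

dirty = dirty_image(sky, psf)
print(dirty[3, 5], dirty[3, 6])
```

The point source reappears at its position scaled by the PSF peak, and the sidelobe leaks half of its flux into the neighbouring pixel. A CLEAN-style minor loop removes exactly this kind of PSF pattern.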
  13. Existing work: Generic imaging pipeline

    Selection of algorithms specifying major and minor loops:
    Δ (major loop), degridding-gridding: Direct Fourier Transform (DFT) (simple but slow); Fast Fourier Transform (FFT) (faster on big grids); Grid to Grid (G2G) (faster, same O), N. Monnier.
    Ψ (minor loop), deconvolution: Högbom CLEAN (simple but slow).
    This pipeline is a very good entry point for comparing the performance of algorithms on a single-node CPU architecture.
    [r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
  14. Existing work: Polynomial fit function for timing simulation

    actor_timing(target, param1, param2) = polynomial(param1, param2), where param ∊ {NUM_VIS, GRID_SIZE, NUM_MINOR_CYCLE}

    # Building the scripted benchmark
    FOR each param1 ∊ PARAM1 DO:
        FOR each param2 ∊ PARAM2 DO:
            EXECUTE Instrumented_code(param1, param2)
            SAVE {timing, config} → actor_timing.csv

    # Calculating the polynomial and RMSE
    FOR each actor_timing.csv ∊ CSV DO:
        FOR each dof ∊ DOF DO:
            COMPUTE polynomial(dof)
            COMPUTE RMSE(measure, polynomial)

    This method is currently manual, which limits it:
    • in sampling (up to 8 samples per parameter)
    • in the degree of polynomial evaluated (limited to 2)
    The goal is to facilitate pipeline comparison with varying parameters.
    [r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
  15. Ongoing work: Target architectures

    Software pipeline representation: f0 - pipeline, f1 - pipeline, …, f20 - pipeline.
    Multinode - multicore: node0 … node20 connected through a router; each node has a shared memory and cores C0 … Cn.
    Multinode - monoGPU: node0 … node20 connected through a router; each node has a shared memory, a core C0 and a GPU.
    Multinode - multiGPU: node0 … node20 connected through a router; each node has a shared memory, a core C0 and several GPUs (each labeled G0).
  16. Ongoing work: Simulating generic imaging pipelines - 21 freq - CPU

    Frequency-based node partitioning. This corresponds to the Sunrise benchmark [DASIP] applied to HPC (the comparison with measurements is still missing).
  17. Existing work: Polynomial fit function for timing simulation

    actor_timing(target, param1, param2) = polynomial(param1, param2), where param ∊ {NUM_VIS, GRID_SIZE, NUM_MINOR_CYCLE}

    # Building the scripted benchmark
    FOR each target ∊ [CPU∨GPU] DO:
        FOR each param1 ∊ PARAM1 DO:
            FOR each param2 ∊ PARAM2 DO:
                EXECUTE Instrumented_code(param1, param2)
                SAVE {timing, config} → actor_timing.csv

    # Calculating the polynomial with the degree offering the best RMSE
    FOR each actor_timing.csv ∊ CSV DO:
        FOR each dof ∊ DOF DO:
            COMPUTE polynomial(dof)
            COMPUTE RMSE(measure, polynomial)
        SAVE best_RMSE_config → parameterized_actor_timing.csv

    This will reduce the gap between estimated timing and measured values.
    The goal is to facilitate pipeline comparison with varying parameters.
    [r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
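The second half of the pseudocode on this slide (fit one polynomial per degree, then keep the one with the best RMSE) can be sketched with a plain least-squares fit. This is an illustrative sketch under stated assumptions, not the benchmark's actual script. The helper names `design_matrix` and `fit_best_polynomial`, the candidate degrees and the synthetic timings are all hypothetical, and the samples are generated in memory instead of being read from actor_timing.csv.

```python
import numpy as np


def design_matrix(p1, p2, degree):
    """Columns are the monomials p1**i * p2**j with i + j <= degree."""
    cols = [p1**i * p2**j
            for i in range(degree + 1)
            for j in range(degree + 1 - i)]
    return np.column_stack(cols)


def fit_best_polynomial(p1, p2, timing, degrees=(1, 2, 3)):
    """Fit one two-parameter polynomial per degree and keep the lowest RMSE."""
    best = None
    for degree in degrees:
        A = design_matrix(p1, p2, degree)
        coeffs, *_ = np.linalg.lstsq(A, timing, rcond=None)
        rmse = float(np.sqrt(np.mean((A @ coeffs - timing) ** 2)))
        if best is None or rmse < best["rmse"]:
            best = {"degree": degree, "coeffs": coeffs, "rmse": rmse}
    return best


# Synthetic stand-in for one actor_timing.csv: 8 samples per parameter.
g1, g2 = np.meshgrid(np.arange(1.0, 9.0), np.arange(1.0, 9.0))
p1, p2 = g1.ravel(), g2.ravel()
timing = 3.0 + 2.0 * p1 + 0.5 * p1 * p2 + 0.1 * p2**2  # hypothetical actor cost

best = fit_best_polynomial(p1, p2, timing)
print(best["degree"], best["rmse"])
```

One design caveat: when the RMSE is computed on the same samples used for the fit, it cannot increase with the degree except for rounding effects, so this selection tends to favour the highest candidate degree. Evaluating the RMSE on held-out measurements is a common way to avoid that bias.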
  18. Summary conclusion

    • Ongoing work:
      🔜 Automated radio astronomy imaging benchmarking on HPC systems.
      🔜 Dataflow methodology available on GitHub.
      🔜 Comparison with manual implementation [N. Monnier g2g].
      🔜 Validation on Ruche and Grid5000.
      🌐 SimSDP tutorial available on the PREESM website: SimSDP: Multinode Design Space Exploration - Preesm
      🌐 Radio astronomy imaging benchmark available on the CentraleSupélec GitLab: SIMSDP - Generic Imaging Pipeline
    • Future work:
      🚀 Enhancing SimSDP reliability by automating the fine-grained description