Slide 1

Slide 1 text

SimSDP: Proof of Concept for Radio Astronomy Imaging on High-Performance Architectures
O. Renaud*, M. Quinson, N. Gac
*ENS Rennes, IRISA, CNRS, SATIE Paris-Saclay, France
[email protected]
GT SKA

Slide 2

Slide 2 text

Introduction: Square Kilometre Array (SKA)
Data flow: CSP → visibilities → SDP → output image
CSP: Central Signal Processor
SDP: Science Data Processor

Slide 3

Slide 3 text

Introduction: Square Kilometre Array (SKA)
Data flow: CSP → visibilities → SDP → output image
● Huge collecting surface
● Storage constraints
● Data rates shown on the diagram: 2 Pb/s, 20 Tb/s, 8.9 Tb/s, 7.8 Tb/s
● Process as fast as possible

Slide 4

Slide 4 text

Introduction
How to simulate SDP imaging pipelines on HPC systems?
● Not parallel programming experts
● Algorithms still in development
● HPC system not yet built
● SKA objectives

Slide 5

Slide 5 text

Introduction: Objectives
● Optimize
● Allocate
● Analyse

Slide 6

Slide 6 text

Resource Allocation process on HPC systems

Slide 7

Slide 7 text

Previous work: Graph-based Algorithm-Architecture Adequation (AAA)
Model of Computation (MoC) + Model of Architecture (MoA) → Adequation → Code generation
[r] C. Erbas et al., « Multi objective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design »

Slide 8

Slide 8 text

Previous work: Dataflow MoC
[Figure: dataflow graph of actors A, B, C, D and E annotated with production and consumption rates; actor D refined into D_1 and D_2 with parameter P; schedules of iterations i and i+1 on Core1, Core2 and Core3 illustrating task, data and pipeline parallelism]
✓ Expression of several types of parallelism: task, data and pipeline.
✓ Ensures consistency and prevents manual mistakes.
✓ High predictability.
✓ Allows automatic resource allocation.
SDF: Synchronous Dataflow
PiSDF: Parameterized and Interfaced SDF
[r] E. Lee and D. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing”.
[r] K. Desnos et al., "PiMM: Parameterized and interfaced dataflow meta-model for MPSoCs runtime reconfiguration"
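The consistency property mentioned on this slide can be illustrated by solving the SDF balance equations, which give the number of firings of each actor per graph iteration. The sketch below is a minimal illustration and not the PREESM implementation; the three-actor graph and its rates are hypothetical and the graph is assumed to be connected.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Solve the SDF balance equations prod * q[src] == cons * q[dst].

    edges: list of (src, dst, production_rate, consumption_rate).
    Returns the smallest positive integer firing counts, or raises
    ValueError if the rates are inconsistent. Assumes a connected graph.
    """
    ratio = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, dst, prod, cons in edges:
            if src == a:
                other, value = dst, ratio[a] * prod / cons
            elif dst == a:
                other, value = src, ratio[a] * cons / prod
            else:
                continue
            if other not in ratio:
                ratio[other] = value
                pending.append(other)
            elif ratio[other] != value:
                raise ValueError(f"inconsistent rates on edge {src}->{dst}")
    # Scale the rational solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {a: int(r * scale) for a, r in ratio.items()}


# Hypothetical graph: A produces 2 tokens per firing, B consumes 1;
# B produces 1 token per firing, C consumes 2.
actors = ["A", "B", "C"]
edges = [("A", "B", 2, 1), ("B", "C", 1, 2)]
print(repetition_vector(actors, edges))  # {'A': 1, 'B': 2, 'C': 1}
```

An inconsistent graph (for example, adding an edge from A to C with rates 1 and 2) makes the function raise, which is the kind of manual mistake the model is meant to catch early.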

Slide 9

Slide 9 text

Previous work: Resource allocation on standard system
[Figure: resource allocation flow with stages labelled Inputs, Extraction, SCAPE clustering, Translation, flattening, DAG, Mapping, Scheduling and Timing. Example: a graph of actors A, B and C is clustered so that B becomes 2(B), then expanded into firings such as A0, 2(B)0, 2(B)1, C0 and C1, which are mapped onto Core0 and Core1 sharing one memory.]
[1] O. Renaud, D. Gageot, K. Desnos, J.-F. Nezan. SCAPE: HW-Aware Clustering of Dataflow Actors for Tunable Scheduling Complexity, DASIP, 2023
[2] O. Renaud, N. Haggui, K. Desnos, J.-F. Nezan. Automated Clustering and Pipelining of Dataflow Actors for Controlled Scheduling Complexity, EUSIPCO, 2023
[3] O. Renaud, H. Miomandre, K. Desnos, J.-F. Nezan. Automated Level-Based Clustering of Dataflow Actors for Controlled Scheduling Complexity, JSA, 2024
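The mapping and scheduling stages of this flow can be sketched with a greedy list scheduler. This is an illustrative stand-in, not the scheduler used by SCAPE or PREESM; the firing names follow the slide, while the execution times and dependencies are hypothetical and communication costs are ignored (shared-memory assumption).

```python
def list_schedule(tasks, deps, n_cores):
    """Greedy list scheduling of a DAG on identical cores.

    tasks: dict name -> execution time, listed in a topological order.
    deps: dict name -> list of predecessor names.
    Returns (schedule, makespan) with schedule[name] = (core, start, end).
    """
    core_free = [0] * n_cores
    schedule = {}
    for name, duration in tasks.items():
        # A firing is ready once all its predecessors have finished.
        ready = max((schedule[p][2] for p in deps.get(name, [])), default=0)
        # Pick the core on which this firing can start the earliest.
        core = min(range(n_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        schedule[name] = (core, start, start + duration)
        core_free[core] = start + duration
    makespan = max(end for _, _, end in schedule.values())
    return schedule, makespan


# Hypothetical timings and dependencies for the firings named on the slide.
tasks = {"A0": 2, "2(B)0": 4, "2(B)1": 4, "C0": 1, "C1": 1}
deps = {"2(B)0": ["A0"], "2(B)1": ["A0"], "C0": ["2(B)0"], "C1": ["2(B)1"]}

schedule, makespan = list_schedule(tasks, deps, n_cores=2)
print(makespan)  # 7
```

With a single core the same firings run back to back (makespan 12), so the two clustered B firings are what the second core exploits in this toy case.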

Slide 10

Slide 10 text

Previous work: SimSDP - Resource allocation on HPC system
SimSDP:
● Thread-Level partitioning (the previous slide)
● Node-Level partitioning
● Simulation
[4] O. Renaud, A. Gougeon, K. Desnos, C. Phillips, J. Tuthill, M. Quinson, J.-F. Nezan. SimSDP: Automatic Workload-Balancing on Multi-Node & Multi-Core HPC Architectures based on dataflow models, TPDS.

Slide 11

Slide 11 text

Previous work: Co-designing HPC HW/SW with SimSDP
for (archi α ∊ 〈nNode, nCore, Topology〉):
    Build architecture model
    SimSDP
    Store latency, memory, energy, cost
    Stop → ∃i ≥ δα : Lfinal(α, i) ≤ Lfinal(Smax)
This proves the reliability of SimSDP and its usability for HPC design space exploration (DSE).
[5] Ophélie Renaud, Karol Desnos, Erwan Raffin, Jean-François, “Multicore and Network Topology Codesign for Pareto-Optimal Multinode Architecture”, EUSIPCO, 2024
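Once latency, memory, energy and cost are stored for each explored architecture, the Pareto-optimal candidates named in the title of [5] can be extracted with a simple dominance filter. The sketch below is an assumption about how such a filter could look, not the method of [5]; the architecture names and metric values are invented for illustration only.

```python
def pareto_front(points):
    """Return the candidates that no other candidate dominates.

    points: dict name -> tuple of metrics, all to be minimized.
    """
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and better somewhere.
        no_worse = all(x <= y for x, y in zip(a, b))
        better = any(x < y for x, y in zip(a, b))
        return no_worse and better

    return {
        name: metrics
        for name, metrics in points.items()
        if not any(dominates(other, metrics) for other in points.values())
    }


# Hypothetical (latency, memory, energy, cost) values, arbitrary units.
candidates = {
    "1 node x 4 cores": (10.0, 4.0, 5.0, 1.0),
    "2 nodes x 4 cores": (6.0, 4.0, 8.0, 2.0),
    "4 nodes x 2 cores": (7.0, 5.0, 9.0, 2.5),
    "4 nodes x 4 cores": (4.0, 6.0, 14.0, 4.0),
}

front = pareto_front(candidates)
print(sorted(front))
```

In this invented data set, "4 nodes x 2 cores" is worse than "2 nodes x 4 cores" on every metric and is therefore filtered out.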

Slide 12

Slide 12 text

Radio astronomy imaging algorithms

Slide 13

Slide 13 text

Existing work: Radio astronomy imaging principle
Pipeline blocks: CSP visibilities, Set up, Δ (major loop) degridding-gridding, Ψ (minor loop) deconvolution, G, Output image
Visibility: correlation point of a pair of antennas

Slide 14

Slide 14 text

Existing work: Radio astronomy imaging principle
Pipeline blocks: CSP visibilities, Set up, Δ (major loop) degridding-gridding, Ψ (minor loop) deconvolution, G
● Gridding onto the (u, v) plane: allows FFT⁻¹, which yields the dirty image
● Degridding: allows visibility comparison and model adjustment

Slide 15

Slide 15 text

Existing work: Radio astronomy imaging principle
Pipeline blocks: CSP visibilities, Set up, Δ (major loop) degridding-gridding, Ψ (minor loop) deconvolution, G
Dirty image = Sky image ∗ PSF (convolution with the point spread function)

Slide 16

Slide 16 text

Existing work
[Figure: dirty image and deconvolved image at successive cycles]

Slide 17

Slide 17 text

Existing work: Generic imaging pipeline
Pipeline blocks: Set up, Δ (major loop) degridding-gridding, Ψ (minor loop) deconvolution, G
Selection of algorithms specifying major and minor loops
Δ (major loop), degridding-gridding:
● Direct Fourier Transform (DFT): simple but slow
● Fast Fourier Transform (FFT): faster on big grids
● Grid to Grid (G2G), N. Monnier: faster, same complexity order
Ψ (minor loop), deconvolution:
● Högbom CLEAN: simple but slow
This pipeline is a very good entry point for comparing the performance of algorithms on a single-node CPU architecture.
[r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
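To make the minor loop concrete, here is a one-dimensional sketch of Högbom CLEAN: it repeatedly finds the residual peak and subtracts a scaled, shifted copy of the PSF. It is a textbook-style illustration, not the pipeline's actual actor; the `gain`, `threshold` and `max_iter` parameters, the toy PSF and the single-source sky are assumptions.

```python
def convolve_same(sky, psf):
    """Convolve sky with psf, keeping the sky's length (psf has odd length)."""
    half = len(psf) // 2
    out = [0.0] * len(sky)
    for i, s in enumerate(sky):
        for k, p in enumerate(psf):
            j = i + k - half
            if 0 <= j < len(sky):
                out[j] += s * p
    return out


def hogbom_clean_1d(dirty, psf, gain=0.1, threshold=1e-3, max_iter=1000):
    """1-D Högbom CLEAN. psf must have odd length and a central peak of 1.

    Returns (model, residual).
    """
    residual = list(dirty)
    model = [0.0] * len(dirty)
    half = len(psf) // 2
    for _ in range(max_iter):
        peak = max(range(len(residual)), key=lambda i: abs(residual[i]))
        if abs(residual[peak]) < threshold:
            break
        # Move a fraction of the peak into the model...
        flux = gain * residual[peak]
        model[peak] += flux
        # ...and remove its PSF response from the residual.
        for k, p in enumerate(psf):
            j = peak + k - half
            if 0 <= j < len(residual):
                residual[j] -= flux * p
    return model, residual


# Toy example: one point source of flux 2.0 at index 5.
sky = [0.0] * 11
sky[5] = 2.0
psf = [0.25, 0.5, 1.0, 0.5, 0.25]
dirty = convolve_same(sky, psf)
model, residual = hogbom_clean_1d(dirty, psf)
```

The loop removes only a fraction `gain` of the peak per iteration, which is why the slide describes the method as simple but slow.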

Slide 18

Slide 18 text

Existing work: Polynomial fit function for timing simulation
actor_timing(target, param1, param2) = polynomial(param1, param2),
where param ∊ {NUM_VIS, GRID_SIZE, NUM_MINOR_CYCLE}

# Building the scripted benchmark
FOR each param1 ∊ PARAM1 DO:
    FOR each param2 ∊ PARAM2 DO:
        EXECUTE Instrumented_code(param1, param2)
        SAVE {timing, config} → actor_timing.csv

# Calculating the polynomial and RMSE
FOR each actor_timing.csv ∊ CSV DO:
    FOR each dof ∊ DOF DO:
        COMPUTE polynomial(dof)
        COMPUTE RMSE(measure, polynomial)

This method is currently manual, which limits:
● the sampling (up to 8 samples per parameter)
● the degree of the polynomial evaluated (limited to 2)
The goal is to facilitate pipeline comparison when varying parameters.
[r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
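The second half of the pseudocode (fit a polynomial per degree, then compute its RMSE) can be sketched in pure Python. This is a minimal least-squares illustration under stated assumptions, not the authors' script: the helper names are hypothetical, and the samples are synthetic values generated from a known formula, not measured timings.

```python
from itertools import product
from math import sqrt


def monomials(degree):
    """Exponent pairs (i, j) with i + j <= degree."""
    return [(i, j) for i, j in product(range(degree + 1), repeat=2) if i + j <= degree]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_polynomial(samples, degree):
    """Least-squares fit via normal equations; samples are (p1, p2, timing)."""
    exps = monomials(degree)
    rows = [[p1 ** i * p2 ** j for i, j in exps] for p1, p2, _ in samples]
    t = [s[2] for s in samples]
    n = len(exps)
    ata = [[sum(r[u] * r[v] for r in rows) for v in range(n)] for u in range(n)]
    atb = [sum(r[u] * y for r, y in zip(rows, t)) for u in range(n)]
    return exps, solve(ata, atb)


def predict(model, p1, p2):
    exps, coeffs = model
    return sum(c * p1 ** i * p2 ** j for (i, j), c in zip(exps, coeffs))


def rmse(model, samples):
    errors = [(predict(model, p1, p2) - t) ** 2 for p1, p2, t in samples]
    return sqrt(sum(errors) / len(errors))


# Synthetic samples (not measurements): timing = 1 + 2*p1 + 0.5*p1*p2.
samples = [(p1, p2, 1 + 2 * p1 + 0.5 * p1 * p2)
           for p1 in range(1, 5) for p2 in range(1, 5)]
rmse_by_degree = {d: rmse(fit_polynomial(samples, d), samples) for d in (1, 2)}
print(rmse_by_degree)
```

Because the synthetic timing contains a `p1*p2` term, the degree-1 fit leaves a visible error while the degree-2 fit is essentially exact, which is the kind of comparison the RMSE step supports.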

Slide 19

Slide 19 text

Radio astronomy imaging algorithms on HPC system

Slide 20

Slide 20 text

Ongoing work: Target architectures
Software pipeline representation: f0 - pipeline, f1 - pipeline, …, f20 - pipeline
● Multinode - multicore: node0 … node20 connected through a router; each node has a shared memory and cores C0 … Cn
● Multinode - monoGPU: node0 … node20 connected through a router; each node has a shared memory, a core C0 and one GPU
● Multinode - multiGPU: node0 … node20 connected through a router; each node has a shared memory, a core C0 and several GPUs (G0, …)

Slide 21

Slide 21 text

Ongoing work: Simulating generic imaging pipelines - 21 freq - CPU
Frequency-based node partitioning
This corresponds to the Sunrise benchmark [DASIP] applied to HPC (the comparison with measurements is still missing).

Slide 22

Slide 22 text

Existing work: Polynomial fit function for timing simulation
actor_timing(target, param1, param2) = polynomial(param1, param2),
where param ∊ {NUM_VIS, GRID_SIZE, NUM_MINOR_CYCLE}

# Building the scripted benchmark
FOR each target ∊ [CPU∨GPU] DO:
    FOR each param1 ∊ PARAM1 DO:
        FOR each param2 ∊ PARAM2 DO:
            EXECUTE Instrumented_code(param1, param2)
            SAVE {timing, config} → actor_timing.csv

# Calculating the polynomial with the degree offering the best RMSE
FOR each actor_timing.csv ∊ CSV DO:
    FOR each dof ∊ DOF DO:
        COMPUTE polynomial(dof)
        COMPUTE RMSE(measure, polynomial)
    SAVE best_RMSE_config → parameterized_actor_timing.csv

This will reduce the gap between estimated and measured timings.
The goal is to facilitate pipeline comparison when varying parameters.
[r] S. Wang, N. Gac, H. Miomandre, J.-F. Nezan, K. Desnos, F. Orieux « An Initial Framework for Prototyping Radio-Interferometric Imaging Pipelines»
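The scripted benchmark with its extra loop over targets could look like the sketch below. It is only an assumption about the script's shape: `run_benchmark`, the CSV column names and the `fake_gridding` workload are hypothetical, and a real run would call the instrumented actor code on each target instead.

```python
import csv
import io
import time
from itertools import product


def run_benchmark(targets, param1_values, param2_values, out):
    """Time every (target, param1, param2) configuration and write CSV rows.

    targets: dict name -> callable(param1, param2), the instrumented code.
    out: text file-like object receiving the CSV.
    """
    writer = csv.writer(out)
    writer.writerow(["target", "param1", "param2", "timing_s"])
    configs = product(targets.items(), param1_values, param2_values)
    for (name, kernel), p1, p2 in configs:
        start = time.perf_counter()
        kernel(p1, p2)
        writer.writerow([name, p1, p2, time.perf_counter() - start])


def fake_gridding(num_vis, grid_size):
    """Stand-in workload (hypothetical), used only to have something to time."""
    return sum(v % grid_size for v in range(num_vis))


# In-memory buffer standing in for actor_timing.csv.
buffer = io.StringIO()
run_benchmark({"CPU": fake_gridding}, [1000, 2000], [64, 128], buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(rows))  # 4
```

Keeping the target as a CSV column, rather than one file per target, is a design choice of this sketch; it lets a later fitting step group rows by target before fitting one polynomial per group.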

Slide 23

Slide 23 text

Summary conclusion
● Ongoing work:
🔜 Automated radio astronomy imaging benchmarking on HPC systems.
🔜 Dataflow methodology available on GitHub.
🔜 Comparison with manual implementation [N. Monnier g2g]
🔜 Validation on Ruche and Grid5000
🌐 SimSDP tutorial available on the PREESM website: SimSDP: Multinode Design Space Exploration - Preesm
🌐 Radio astronomy imaging benchmark available on the CentraleSupélec GitLab: SIMSDP - Generic Imaging Pipeline
● Future work:
🚀 Enhancing SimSDP reliability by automating the fine-grained description